Sample records for UKIDSS LAS datasets

  1. The T dwarf population in the UKIDSS LAS.

    NASA Astrophysics Data System (ADS)

    Cardoso, C. V.; Burningham, B.; Smith, L.; Smart, R.; Pinfield, D.; Magazzù, A.; Ghinassi, F.; Lattanzi, M.

    We present the most recent results from the UKIDSS Large Area Survey (LAS) census and follow-up of new T brown dwarfs in the local field. The new brown dwarf candidates are identified using optical and infrared survey photometry (UKIDSS and SDSS) and followed up with narrow-band methane photometry (TNG) and spectroscopy (Gemini and Subaru) to confirm their brown dwarf nature. Employing this procedure we have discovered several dozen new T brown dwarfs in the field. Using methane differential photometry as a proxy for spectral type for T brown dwarfs has proved to be a very efficient technique. This method can be useful in the future to reliably identify brown dwarfs in deep surveys that produce large samples of faint targets where spectroscopy is not feasible for all candidates. With this statistically robust sample of the mid- and late-T brown dwarf field population we were also able to address the discrepancies between the observed field space density and the values expected given the most accepted forms of the IMF of young clusters.
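    Methane differential photometry works because CH4 absorption deepens monotonically toward later T subtypes, so a single colour index can stand in for a spectrum. A minimal sketch of applying such a calibration, with illustrative placeholder coefficients (a real calibration would be fitted to spectroscopically typed T dwarfs; the function name and numbers here are hypothetical):

```python
def spectral_type_from_methane(ch4_colour, slope=-4.0, intercept=4.5):
    """Estimate a T subtype from a differential methane colour index.

    `slope` and `intercept` are illustrative placeholders, NOT a published
    calibration: they simply encode that a bluer (more negative) methane
    colour maps to a later T subtype.
    """
    return intercept + slope * ch4_colour
```

    In practice such a proxy is applied only after the candidate's T-dwarf nature is confirmed, since earlier-type contaminants do not follow the methane colour sequence.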

  2. VizieR Online Data Catalog: L and T dwarfs from UKIDSS LAS (Marocco+, 2015)

    NASA Astrophysics Data System (ADS)

    Marocco, F.; Jones, H. R. A.; Day-Jones, A. C.; Pinfield, D. J.; Lucas, P. W.; Burningham, B.; Zhang, Z. H.; Smart, R. L.; Gomes, J. I.; Smith, L.

    2015-11-01

    The objects presented here have been selected from the UKIDSS LAS 7th Data Release. The details of the selection criteria can be found in Day-Jones et al. (2013, Cat. J/MNRAS/430/1171, hereafter ADJ13). (5 data files).

  3. VizieR Online Data Catalog: UKIDSS-DR8 LAS, GCS and DXS Surveys (Lawrence+ 2012)

    NASA Astrophysics Data System (ADS)

    Lawrence, A.; Warren, S. J.; Almaini, O.; Edge, A. C.; Hambly, N. C.; Jameson, R. F.; Lucas, P.; Casali, M.; Adamson, A.; Dye, S.; Emerson, J. P.; Foucaud, S.; Hewett, P.; Hirst, P.; Hodgkin, S. T.; Irwin, M. J.; Lodieu, N.; McMahon, R. G.; Simpson, C.; Smail, I.; Mortlock, D.; Folger, M.

    2012-03-01

    The UKIRT Infrared Deep Sky Survey (UKIDSS) is a large-scale near-IR survey whose aim is to cover 7500 square degrees of the Northern sky. The survey is carried out using the Wide Field Camera (WFCAM), with a field of view of 0.21 square degrees, mounted on the 3.8m United Kingdom Infra-red Telescope (UKIRT) in Hawaii. The project comprises five surveys (LAS, GCS, DXS, GPS and UDS). The Large Area Survey (LAS) covers an area of 4000 square degrees at high Galactic latitudes (extragalactic) in the four bands Y(1.0um), J(1.2um), H(1.6um) and K(2.2um) to a depth of K=18.4. The Galactic Clusters Survey (GCS) aims to survey ten large open star clusters and star-formation associations, covering a total of 1067 square degrees in the five bands Z(0.9um), Y(1.0um), J(1.2um), H(1.6um) and K(2.2um), plus a second pass in K for proper motions, to a depth of Z=20.4, Y=20.3, J=19.5, H=18.6, K=18.6. The Deep Extragalactic Survey (DXS) aims to map 35 square degrees of sky to a 5-σ point-source sensitivity of J=22.3 and K=20.8 in four carefully selected, multi-wavelength survey areas. The central regions of each field will also be mapped to H=21.8. The primary aim of the survey is to produce a photometric galaxy sample at a redshift of 1-2, within a volume comparable to that of the SDSS, selected in the same passband (rest-frame optical). Details of the surveys can be found in the paper by Lawrence et al. (2007MNRAS.379.1599L) and at the UKIDSS Surveys site (http://www.ukidss.org/surveys/surveys.html). The data described here represent a subset of the UKIDSS data, limited to the public data and most representative columns. In the "Byte-by-byte Description" below the original names of the columns are given as bracketed names. (3 data files).

  4. VizieR Online Data Catalog: Quasars from SDSS and UKIDSS (Wu+, 2010)

    NASA Astrophysics Data System (ADS)

    Wu, X.-B.; Jia, Z.

    2011-02-01

    We cross-identify all quasars in SDSS Data Release 7 (DR7) with the UKIDSS DR3, by finding the closest counterparts within 3 arcsec between the positions in the two surveys and requiring detections in all of the SDSS ugriz and UKIDSS YJHK bands for each quasar. To do the cross-identifications, we use the CrossID form available at the UKIDSS WFCAM Science Archive web site (http://surveys.roe.ac.uk:8080/wsa/crossID_form.jsp/) and use only the data in the UKIDSS Large Area Survey (LAS) in order to avoid misidentifications in the crowded fields at lower Galactic latitudes. This results in a sample of 8498 quasars with both SDSS and UKIDSS data. (1 data file).
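    The cross-identification described above is a nearest-neighbour match on the sky with a 3-arcsec tolerance. A self-contained sketch in pure Python (toy catalogues; the paper used the WSA CrossID service, and a production match would use a spatial index rather than this brute-force loop):

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation between two sky positions (degrees in, arcsec out),
    using the haversine formula for numerical stability at small separations."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec**2 + math.cos(dec1) * math.cos(dec2) * sin_dra**2
    return math.degrees(2.0 * math.asin(math.sqrt(a))) * 3600.0

def crossmatch(cat1, cat2, radius_arcsec=3.0):
    """Return (i, j) index pairs: for each (ra, dec) source in cat1, the
    closest cat2 source within the matching radius. Brute force O(N*M)."""
    pairs = []
    for i, (ra1, dec1) in enumerate(cat1):
        best = None
        for j, (ra2, dec2) in enumerate(cat2):
            d = angular_sep_arcsec(ra1, dec1, ra2, dec2)
            if d <= radius_arcsec and (best is None or d < best[1]):
                best = (j, d)
        if best is not None:
            pairs.append((i, best[0]))
    return pairs
```

    Keeping only the closest counterpart, rather than every source inside the radius, is what suppresses spurious pairings; restricting the match to the high-latitude LAS further reduces chance alignments in crowded fields.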

  5. The UKIDSS-2MASS proper motion survey - I. Ultracool dwarfs from UKIDSS DR4

    NASA Astrophysics Data System (ADS)

    Deacon, N. R.; Hambly, N. C.; King, R. R.; McCaughrean, M. J.

    2009-04-01

    The UK Infrared Telescope Infrared Deep Sky Survey (UKIDSS) is the first of a new generation of infrared surveys. Here, we combine the data from two UKIDSS components, the Large Area Survey (LAS) and the Galactic Clusters Survey (GCS), with Two-Micron All-Sky Survey (2MASS) data to produce an infrared proper motion survey for low-mass stars and brown dwarfs. In total, we detect 267 low-mass stars and brown dwarfs with significant proper motions. We recover all 10 known single L dwarfs and the one known T dwarf above the 2MASS detection limit in our LAS survey area, and identify eight additional new candidate L dwarfs. We also find one new candidate L dwarf in our GCS sample. Our sample also contains objects from 11 potential common proper motion binaries. Finally, we test our proper motions and find that while the LAS objects have proper motions consistent with absolute proper motions, the GCS stars may have proper motions which are significantly underestimated. This is possibly due to the bulk motion of some of the local astrometric reference stars used in the proper motion determination.

  6. Detection, Size, Measurement, and Structural Analysis Limits for the 2MASS, UKIDSS-LAS, and VISTA VIKING Surveys

    NASA Astrophysics Data System (ADS)

    Andrews, Stephen K.; Kelvin, Lee S.; Driver, Simon P.; Robotham, Aaron S. G.

    2014-01-01

    The 2MASS, UKIDSS-LAS, and VISTA VIKING surveys have all now observed the GAMA 9hr region in the Ks band. Here we compare the detection rates, photometry, basic size measurements, and single-component GALFIT structural measurements for a sample of 37 591 galaxies. We explore the sensitivity limits where the data agree for a variety of issues including: detection, star-galaxy separation, photometric measurements, size and ellipticity measurements, and Sérsic measurements. We find that 2MASS fails to detect at least 20% of the galaxy population within all magnitude bins; however, for those galaxies that are detected, we find the photometry is robust (±0.2 mag) to 14.7 AB mag and star-galaxy separation to 14.8 AB mag. For UKIDSS-LAS we find incompleteness starts to enter at a flux limit of 18.9 AB mag, star-galaxy separation is robust to 16.3 AB mag, and structural measurements are robust to 17.7 AB mag. VISTA VIKING data are complete to approximately 20.0 AB mag and structural measurements appear robust to 18.8 AB mag.

  7. Ultracool Dwarfs in the UKIRT Infrared Deep Sky Survey (UKIDSS)

    NASA Astrophysics Data System (ADS)

    Burningham, Ben; Pinfield, D.; Leggett, S. K.; Lodieu, N.; Warren, S. J.; Lucas, P. W.; Tamura, M.; Mortlock, D.; Kendall, T. R.; Jones, H. R.; Jameson, R. F.; Richard, M.; Martin, E. L.; UKIDSS Cool Dwarf Science Working Group

    2007-05-01

    The UKIRT Infrared Deep Sky Survey (UKIDSS) Large Area Survey (LAS) presents an unparalleled resource for the study of field brown dwarfs. The UKIDSS Cool Dwarf Science Working Group (CDSWG) is carrying out a search for the lowest temperature brown dwarfs ever discovered, with the possibility of identifying a new spectral class of ultracool dwarf: the Y dwarf. CDSWG members identified 10 new T dwarfs in the early and first data releases of the LAS, including 2 objects with spectral types later than T7.5. One of these is thought to be the coolest T dwarf ever found, with a spectral type of T8.5 and an estimated temperature of 650 K. Data release 2 (DR2) took place on 1st March 2007, and already the most promising objects have been selected and followed up photometrically and spectroscopically. In this contribution I will discuss the capabilities of UKIDSS for identifying ultracool dwarfs and summarise our latest results.

  8. Mapping the Milky Way's Halo out to 500 kpc: New M Giants selected from UKIDSS

    NASA Astrophysics Data System (ADS)

    Bochanski, John J.; Willman, B.; West, A. A.

    2013-01-01

    We present an analysis of photometrically identified halo M giants in the UKIDSS Large Area Survey (LAS). The UKIDSS LAS Data Release 8 covers 2700 square degrees with YJHK photometry, down to faint limits about 3 magnitudes deeper than 2MASS. UKIDSS LAS DR8 is thus >4 times larger in effective volume than the 2MASS halo map. Combined with ugriz photometry from SDSS, our M giant sample extends to 500 kpc, the first to extend beyond 100 kpc and the first to utilize SDSS photometry to discriminate against quasars. We use this sample to search for new tidal debris structures in the distant halo and to constrain the recent merger history of the Milky Way. Spectroscopic follow-up will facilitate the study of Milky Way halo kinematics. We acknowledge the financial support of NSF AST-1151462.

  9. 76 T dwarfs from the UKIDSS LAS: benchmarks, kinematics and an updated space density

    NASA Astrophysics Data System (ADS)

    Burningham, Ben; Cardoso, C. V.; Smith, L.; Leggett, S. K.; Smart, R. L.; Mann, A. W.; Dhital, S.; Lucas, P. W.; Tinney, C. G.; Pinfield, D. J.; Zhang, Z.; Morley, C.; Saumon, D.; Aller, K.; Littlefair, S. P.; Homeier, D.; Lodieu, N.; Deacon, N.; Marley, M. S.; van Spaandonk, L.; Baker, D.; Allard, F.; Andrei, A. H.; Canty, J.; Clarke, J.; Day-Jones, A. C.; Dupuy, T.; Fortney, J. J.; Gomes, J.; Ishii, M.; Jones, H. R. A.; Liu, M.; Magazzú, A.; Marocco, F.; Murray, D. N.; Rojas-Ayala, B.; Tamura, M.

    2013-07-01

    We report the discovery of 76 new T dwarfs from the UKIRT Infrared Deep Sky Survey (UKIDSS) Large Area Survey (LAS). Near-infrared broad- and narrow-band photometry and spectroscopy are presented for the new objects, along with Wide-field Infrared Survey Explorer (WISE) and warm-Spitzer photometry. Proper motions for 128 UKIDSS T dwarfs are presented from a new two epoch LAS proper motion catalogue. We use these motions to identify two new benchmark systems: LHS 6176AB, a T8p+M4 pair and HD 118865AB, a T5.5+F8 pair. Using age constraints from the primaries and evolutionary models to constrain the radii, we have estimated their physical properties from their bolometric luminosity. We compare the colours and properties of known benchmark T dwarfs to the latest model atmospheres and draw two principal conclusions. First, it appears that the H - [4.5] and J - W2 colours are more sensitive to metallicity than has previously been recognized, such that differences in metallicity may dominate over differences in Teff when considering relative properties of cool objects using these colours. Secondly, the previously noted apparent dominance of young objects in the late-T dwarf sample is no longer apparent when using the new model grids and the expanded sample of late-T dwarfs and benchmarks. This is supported by the apparently similar distribution of late-T dwarfs and earlier type T dwarfs on reduced proper motion diagrams that we present. Finally, we present updated space densities for the late-T dwarfs, and compare our values to simulation predictions and those from WISE.

  10. Optical+NIR Quasar Selection with the SDSS and UKIDSS

    NASA Astrophysics Data System (ADS)

    Mehta, Sajjan S.; McMahon, R. G.; Richards, G. T.; Hewett, P. C.

    2010-01-01

    We present the details of an optical+near-IR quasar selection technique, which utilizes near-IR data from the UKIDSS Large Area Survey and optical data from the Sloan Digital Sky Survey in the SDSS's deep "Stripe 82" region, which covers over 200 deg2. Our selection methods primarily consist of isolating potential candidates in giK and gJK color space, in which there exists a significant separation of the stellar locus from the quasar locus. Additionally, we discuss secondary techniques such as comparison of catalog magnitudes with aperture photometry, analysis of SDSS and UKIDSS morphological type classifications, and flag cuts. Our primary color-cut selections include most quasars with redshifts below 3.4, significantly increasing the completeness both to dust-reddened quasars and to quasars with redshifts z ≳ 2.7 in the SDSS footprint. A simple color cut in the UKIDSS LAS Stripe 82 regions reveals 4200 quasar candidates down to K=18. These NIR selections have been used to contribute to the Baryon Oscillation Spectroscopic Survey (BOSS), which is one of the four surveys of the SDSS-III collaboration. We additionally intend to use our NIR techniques to perform an 8-dimensional optical+NIR Bayesian selection of quasars for the AAOmega UKIDSS SDSS (AUS) survey.

  11. Identifying nearby field T dwarfs in the UKIDSS Galactic Clusters Survey

    NASA Astrophysics Data System (ADS)

    Lodieu, N.; Burningham, B.; Hambly, N. C.; Pinfield, D. J.

    2009-07-01

    We present the discovery of two new late-T dwarfs identified in the UKIRT Infrared Deep Sky Survey (UKIDSS) Galactic Clusters Survey (GCS) Data Release 2 (DR2). These T dwarfs are nearby old T dwarfs along the line of sight to star-forming regions and open clusters targeted by the UKIDSS GCS. They are found towards the α Per cluster and the Orion complex, respectively, from a search of 54 deg2 surveyed in five filters. Photometric candidates were picked up in two-colour diagrams, in a very similar manner to candidates extracted from the UKIDSS Large Area Survey (LAS), but taking advantage of the Z filter employed by the GCS. Both candidates exhibit near-infrared J-band spectra with strong methane and water absorption bands characteristic of late-T dwarfs. We derive spectral types of T6.5 +/- 0.5 and T7 +/- 1 and estimate photometric distances less than 50 pc for UGCS J030013.86+490142.5 and UGCS J053022.52-052447.4, respectively. The space density of T dwarfs found in the GCS seems consistent with discoveries in the larger areal coverage of the UKIDSS LAS, indicating one T dwarf per 6-11 deg2. The final area surveyed by the GCS, 1000 deg2 in five passbands, will allow expansion of the LAS search area by 25 per cent, increase the probability of finding ultracool brown dwarfs, and provide optimal estimates of contamination by old field brown dwarfs in deep surveys to identify such objects in open clusters and star-forming regions. Based on observations made with the United Kingdom Infrared Telescope, operated by the Joint Astronomy Centre on behalf of the UK Science and Technology Facilities Council. E-mail: nlodieu@iac.es

  12. Two T dwarfs from the UKIDSS early data release

    NASA Astrophysics Data System (ADS)

    Kendall, T. R.; Tamura, M.; Tinney, C. G.; Martín, E. L.; Ishii, M.; Pinfield, D. J.; Lucas, P. W.; Jones, H. R. A.; Leggett, S. K.; Dye, S.; Hewett, P. C.; Allard, F.; Baraffe, I.; Barrado Y Navascués, D.; Carraro, G.; Casewell, S. L.; Chabrier, G.; Chappelle, R. J.; Clarke, F.; Day-Jones, A.; Deacon, N.; Dobbie, P. D.; Folkes, S.; Hambly, N. C.; Hodgkin, S. T.; Nakajima, T.; Jameson, R. F.; Lodieu, N.; Magazzù, A.; McCaughrean, M. J.; Pavlenko, Y. V.; Tadashi, N.; Zapatero Osorio, M. R.

    2007-05-01

    Context: We report on the first ultracool dwarf discoveries from the UKIRT Infrared Deep Sky Survey (UKIDSS) Large Area Survey Early Data Release (LAS EDR), in particular the discovery of T dwarfs which are fainter and more distant than those found using the 2MASS and SDSS surveys. Aims: We aim to show that our methodologies for searching the ~27 deg2 of the LAS EDR are successful for finding both L and T dwarfs via cross-correlation with the Sloan Digital Sky Survey (SDSS) DR4 release. While the area searched so far is small, the numbers of objects found shows great promise for near-future releases of the LAS and great potential for finding large numbers of such dwarfs. Methods: Ultracool dwarfs are selected by combinations of their YJH(K) UKIDSS colours and SDSS DR4 z-J and i-z colours, or, lower limits on these red optical/infrared colours in the case of DR4 dropouts. After passing visual inspection tests, candidates have been followed up by methane imaging and spectroscopy at 4 m and 8 m-class facilities. Results: Our main result is the discovery following CH4 imaging and spectroscopy of a T4.5 dwarf, ULAS J1452+0655, lying ~80 pc distant. A further T dwarf candidate, ULAS J1301+0023, has very similar CH4 colours but has not yet been confirmed spectroscopically. We also report on the identification of a brighter L0 dwarf, and on the selection of a list of LAS objects designed to probe for T-like dwarfs to the survey J-band limit. Conclusions: Our findings indicate that the combination of the UKIDSS LAS and SDSS surveys provides an excellent tool for identifying L and T dwarfs down to much fainter limits than previously possible. Our discovery of one confirmed and one probable T dwarf in the EDR is consistent with expectations from the previously measured T dwarf density on the sky.

  13. A 2 epoch proper motion catalogue from the UKIDSS Large Area Survey

    NASA Astrophysics Data System (ADS)

    Smith, Leigh; Lucas, Phil; Burningham, Ben; Jones, Hugh; Pinfield, David; Smart, Ricky; Andrei, Alexandre

    2013-04-01

    The UKIDSS Large Area Survey (LAS) began in 2005, with the start of the UKIDSS programme, as a 7 year effort to survey roughly 4000 square degrees at high Galactic latitudes in the Y, J, H and K bands. The survey also included a significant quantity of two-epoch J band observations, with epoch baselines ranging from 2 to 7 years. We present a proper motion catalogue for the 1500 square degrees of the two-epoch LAS data, which includes some 800,000 sources with motions detected above the 5σ level. We developed a bespoke proper motion pipeline which applies a source-unique second-order polynomial transformation to the UKIDSS array coordinates of each source to counter potential local non-uniformity in the focal plane. Our catalogue agrees well with the proper motion data supplied in the current WFCAM Science Archive (WSA) DR9 catalogue where there is overlap, and with various optical catalogues, but it benefits from some improvements. One improvement is that we provide absolute proper motions, using LAS galaxies for the relative-to-absolute correction. Also, by using unique, local, second-order polynomial transformations, as opposed to the linear transformations in the WSA, we correct better for any local distortions in the focal plane, not including the radial distortion that is removed by their pipeline.
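    The core of such a pipeline is an epoch-to-epoch coordinate transformation fitted to local reference stars; residual offsets after the transformation become the proper motion signal. A sketch of fitting a full six-term second-order polynomial per axis by linear least squares (function names and the synthetic reference-star setup are illustrative, not the published pipeline):

```python
import numpy as np

def fit_quadratic_transform(xy1, xy2):
    """Fit x2 = f(x1, y1) and y2 = g(x1, y1), where f and g are full
    second-order polynomials (6 terms each), via linear least squares.
    xy1, xy2: (N, 2) arrays of reference-star positions at the two epochs.
    Returns a (6, 2) coefficient matrix, one column per output axis."""
    x, y = xy1[:, 0], xy1[:, 1]
    # Design matrix columns: 1, x, y, x^2, x*y, y^2
    A = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
    coeffs, *_ = np.linalg.lstsq(A, xy2, rcond=None)
    return coeffs

def apply_transform(coeffs, xy):
    """Map epoch-1 array coordinates into the epoch-2 frame."""
    x, y = xy[:, 0], xy[:, 1]
    A = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
    return A @ coeffs
```

    A linear (affine) transform has only 3 terms per axis and cannot absorb quadratic focal-plane distortion; the extra x^2, xy and y^2 terms are what let a local fit soak up that non-uniformity before motions are measured.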

  14. Finding Hidden Quasars with UKIDSS and AAOmega

    NASA Astrophysics Data System (ADS)

    Maddox, Natasha; Hewett, P. C.; Warren, S. J.; Croom, S. M.

    2007-05-01

    The number of luminous quasars that have thus far eluded optical surveys is a subject of ongoing debate. Dust reddening and significant host galaxy light tend to exclude candidates from traditional UV-excess selection. UKIDSS, the near-infrared counterpart to SDSS, has started to provide the large area NIR data required to quantify the number of quasars missing from optical surveys. The quasar candidate list was chosen from the Early Data Release of the UKIDSS Large Area Survey (LAS), which aims to cover 2000 square degrees in two years. Requiring each object to have K<17, J<19.5 (the detection limit of the LAS) and a detection in SDSS were the only restrictions imposed on the candidates. A simple cut in gJK colour space, exploiting the K-band excess of quasars compared to stars, then separates the quasar candidates from the stellar locus. Optical-NIR colour selection with relaxed restrictions on morphology is less sensitive to dust reddening, so provides a more complete candidate list, suitable for follow-up observation with the new AAOmega spectrograph on the Anglo-Australian Telescope. With spectroscopic observations covering nearly 20 square degrees taken at the AAT, this is by far the largest K-band selected quasar sample to date. Many new quasars have been identified, in addition to known quasars being recovered. Several of the newly discovered quasars lie in regions of colour space typically excluded by UV selection. This study highlights the effectiveness of the K-excess technique in selecting quasars that do not necessarily exhibit the classic UV excess, either due to intrinsic SED shape or dust reddening. Combining upcoming UKIDSS data releases with scheduled AAT observations will increase the area surveyed by several times, thus moving closer to fully quantifying the number of luminous, reddened quasars.
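    The K-excess selection described above amounts to a linear boundary in (g-J, J-K) colour space: at fixed g-J, quasars sit redder in J-K than the stellar locus. A sketch with an illustrative boundary (the slope and intercept below are placeholders, not the published cut):

```python
def kx_quasar_candidate(g, j, k, slope=0.5, intercept=0.2):
    """K-band-excess candidate test in (g-J, J-K) colour space.

    Returns True when the source lies above an assumed linear boundary
    tracing the top of the stellar locus. The (slope, intercept) values
    are illustrative placeholders, not the survey's actual selection cut.
    """
    return (j - k) > slope * (g - j) + intercept
```

    Because the cut relies on an intrinsic K-band excess rather than UV colours, moderately dust-reddened quasars and those with strong host-galaxy light can still pass it.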

  15. The sub-stellar birth rate from UKIDSS

    NASA Astrophysics Data System (ADS)

    Day-Jones, A. C.; Marocco, F.; Pinfield, D. J.; Zhang, Z. H.; Burningham, B.; Deacon, N.; Ruiz, M. T.; Gallardo, J.; Jones, H. R. A.; Lucas, P. W. L.; Jenkins, J. S.; Gomes, J.; Folkes, S. L.; Clarke, J. R. A.

    2013-04-01

    We present a new sample of mid-L to mid-T dwarfs with effective temperatures of 1100-1700 K selected from the UKIDSS Large Area Survey (LAS) and confirmed with infrared spectra from X-shooter/Very Large Telescope. This effective temperature range is especially sensitive to the formation history of Galactic brown dwarfs and allows us to constrain the form of the sub-stellar birth rate, with sensitivity to differentiate between a flat (stellar like) birth rate and an exponentially declining form. We present the discovery of 63 new L and T dwarfs from the UKIDSS LAS DR7, including the identification of 12 likely unresolved binaries, which form the first complete sub-set from our programme, covering 495 square degrees of sky, complete to J = 18.1. We compare our results for this sub-sample with simulations of differing birth rates for objects of masses 0.10-0.03 M⊙ and ages 1-10 Gyr. We find that the more extreme birth rates (e.g. a halo type form) can likely be excluded as the true form of the birth rate. In addition, we find that although there is substantial scatter we find a preference for a mass function, with a power-law index α in the range -1 < α < 0 that is consistent (within the errors) with the studies of late T dwarfs.

  16. Dust-reddened Quasars In First And Ukidss

    NASA Astrophysics Data System (ADS)

    Glikman, Eilat; Lacy, M.; Urrutia, T.

    2012-05-01

    We recently identified a large population of dust-reddened quasars by matching radio sources detected in the FIRST survey to the 2MASS near-infrared catalog (F2M) and selecting sources with red optical-to-near-infrared colors. We find that dust-reddened quasars are intrinsically the most luminous quasars in the Universe. Further analysis suggests that red quasars represent an emergent phase in a merger-driven quasar/galaxy co-evolution model, in which the obscured quasar is shedding its dusty shroud prior to becoming a "normal" quasar. Here we use the UKIDSS Large Area Survey (LAS) First Data Release (DR1; 190 deg2) to reach fainter K-band magnitudes and expand beyond the results of the F2M survey. The deeper K-band limit provided by UKIDSS enables the discovery of more heavily reddened quasars at higher redshifts. We selected 95 candidates in the UKIDSS DR1 that had matches in the FIRST catalog with K<17.0 and obeyed color criteria similar to the F2M survey (R-K>5, J-K>1.5). We have obtained 54 near-infrared spectra as well as 12 optical spectra from SDSS. Preliminary analysis confirms 12 new obscured quasars, including at least two with z>2, reaching lower intrinsic luminosities than were found by the F2M survey. We find that, despite being a luminous quasar phenomenon, the space density of red quasars continues to rise to fainter magnitudes, representing 20% of the overall quasar population.

  17. Near-infrared Photometric Properties of 130,000 Quasars: An SDSS-UKIDSS-matched Catalog

    NASA Astrophysics Data System (ADS)

    Peth, Michael A.; Ross, Nicholas P.; Schneider, Donald P.

    2011-04-01

    We present a catalog of over 130,000 quasar candidates with near-infrared (NIR) photometric properties, with an areal coverage of approximately 1200 deg2. This is achieved by matching the Sloan Digital Sky Survey (SDSS) in the optical ugriz bands to the UKIRT Infrared Digital Sky Survey (UKIDSS) Large Area Survey (LAS) in the NIR YJHK bands. We match the ≈1 million SDSS DR6 Photometric Quasar catalog to Data Release 3 of the UKIDSS LAS (ULAS) and produce a catalog with 130,827 objects with detections in one or more NIR bands, of which 74,351 objects have optical and K-band detections and 42,133 objects have the full nine-band photometry. The majority (~85%) of the SDSS objects were not matched simply because these were not covered by the ULAS. The positional standard deviation of the SDSS Quasar to ULAS matches is δR.A. = 0.″1370 and δdecl. = 0.″1314. We find an absolute systematic astrometric offset between the SDSS Quasar catalog and the UKIDSS LAS, of |R.A.offset| = 0.″025 and |decl.offset| = 0.″040; we suggest the nature of this offset to be due to the matching of catalog, rather than image, level data. Our matched catalog has a surface density of ≈53 deg-2 for K ≤ 18.27 objects; tests using our matched catalog, along with data from the UKIDSS Deep Extragalactic Survey, imply that our limiting magnitude is i ≈ 20.6. Color-redshift diagrams, for the optical and NIR, show a close agreement between our matched catalog and recent quasar color models at redshift z ≲ 2.0, while at higher redshifts, the models generally appear to be bluer than the mean observed quasar colors. The gJK and giK color spaces are used to examine methods of differentiating between stars and (mid-redshift) quasars, the key to currently ongoing quasar surveys. Finally, we report on the NIR photometric properties of high, z > 4.6, and very high, z > 5.7, redshift previously discovered quasars.

  18. A new UKIDSS proper motion survey and key early results, including new benchmark systems

    NASA Astrophysics Data System (ADS)

    Smith, L.; Lucas, P.; Burningham, B.; Jones, H.; Pinfield, D.; Smart, R.; Andrei, A.

    We present a proper motion catalogue for the 1500 deg2 of 2 epoch J-band UKIDSS Large Area Survey (LAS) data, which includes 120,000 stellar sources with motions detected above the 5σ level. Our upper limit on proper motion detection is 3.″3 yr-1 and typical uncertainties are of order 10 mas yr-1 for bright sources, from data with a modest 1.8-7.0 year epoch baseline. We developed a bespoke proper motion pipeline which applies a source-unique second-order polynomial transformation to UKIDSS array coordinates to counter potential local non-uniformity in the focal plane. Our catalogue agrees well with the proper motion data supplied in the current WFCAM Science Archive (WSA) tenth data release (DR10) catalogue where there is overlap, and in various optical catalogues, but it benefits from some improvements, such as a larger matching radius and relative-to-absolute proper motion correction. We present proper motion results for 128 T dwarfs in the UKIDSS LAS and key early results of projects utilising our catalogue, in particular searches for brown dwarf benchmark systems through cross matches with existing proper motion catalogues. We report the discovery of two new T dwarf benchmark systems.

  19. High Redshift QSOs in the UKIDSS Large Area Survey

    NASA Astrophysics Data System (ADS)

    Venemans, B. P.

    2007-12-01

    In this proceeding, I will present the first results on our ongoing search for z⪆6 quasars in the UKIDSS Large Area Survey (LAS). The unique infrared sky coverage of the LAS combined with SDSS i and z observations allows us to efficiently search for high redshift quasars with minimal contamination from foreground objects, e.g. galactic cool stars. Analysis of 106 deg^2 of sky from UKIDSS Data Release 1 (DR1) has resulted in the discovery of ULAS J020332.38+001229.2, a luminous (J_{AB}=20.0, M_{1450}=-26.2) quasar at z=5.86. The quasar is not present in the SDSS DR5 catalogue and the continuum spectral index of α=-1.4 (F_{ν}∝ν^{α}) is redder than a composite of SDSS quasars at similar redshifts (α=-0.5). Although it is difficult to draw any strong conclusions regarding the space density of quasars from one object, the discovery of this quasar in ~100 deg^2 in a complete sample within our selection criteria down to a median depth of Y_{AB}=20.4 (7σ) is consistent with existing SDSS results. Finally, I will present the expected number density of high redshift z>6.5 quasars using future infrared surveys with VISTA.

  1. A 1500 deg2 near infrared proper motion catalogue from the UKIDSS Large Area Survey

    NASA Astrophysics Data System (ADS)

    Smith, Leigh; Lucas, P. W.; Burningham, B.; Jones, H. R. A.; Smart, R. L.; Andrei, A. H.; Catalán, S.; Pinfield, D. J.

    2014-02-01

    The United Kingdom Infrared Deep Sky Survey (UKIDSS) Large Area Survey (LAS) began in 2005, with the start of the UKIDSS programme, as a 7 year effort to survey roughly 4000 deg2 at high Galactic latitudes in Y, J, H and K bands. The survey also included a significant quantity of two epoch J band observations, with an epoch baseline greater than 2 years to calculate proper motions. We present a near-infrared proper motion catalogue for the 1500 deg2 of the two epoch LAS data, which includes 135 625 stellar sources and a further 88 324 with ambiguous morphological classifications, all with motions detected above the 5σ level. We developed a custom proper motion pipeline which we describe here. Our catalogue agrees well with the proper motion data supplied for a 300 deg2 subset in the current WFCAM Science Archive (WSA) 10th data release (DR10) catalogue, and in various optical catalogues, but it benefits from a larger matching radius and hence a larger upper proper motion detection limit. We provide absolute proper motions, using LAS galaxies for the relative-to-absolute correction. By using local second-order polynomial transformations, as opposed to the linear transformations in the WSA, we correct better for any local distortions in the focal plane, not including the radial distortion that is removed by the UKIDSS pipeline. We present the results of proper motion searches for new brown dwarfs and white dwarfs. We discuss 41 sources in the WSA DR10 overlap with our catalogue with proper motions >300 mas yr-1, several of which are new detections. We present 15 new candidate ultracool dwarf binary systems.

  2. Four faint T dwarfs from the UKIRT Infrared Deep Sky Survey (UKIDSS) Southern Stripe

    NASA Astrophysics Data System (ADS)

    Chiu, Kuenley; Liu, Michael C.; Jiang, Linhua; Allers, Katelyn N.; Stark, Daniel P.; Bunker, Andrew; Fan, Xiaohui; Glazebrook, Karl; Dupuy, Trent J.

    2008-03-01

We present the optical and near-infrared photometry and spectroscopy of four faint T dwarfs newly discovered from the UKIDSS first data release. The sample, drawn from an imaged area of ~136 deg2 to a depth of Y = 19.9 (5σ, Vega), is located in the Sloan Digital Sky Survey (SDSS) Southern Equatorial Stripe, a region of significant future deep imaging potential. We detail the selection and follow-up of these objects, three of which are spectroscopically confirmed brown dwarfs ranging from type T2.5 to T7.5, and one of which is photometrically identified as early T. Their magnitudes range from Y = 19.01 to 19.88, with derived distances from 34 to 98 pc, making these among the coldest and faintest brown dwarfs known. The T7.5 dwarf appears to be single based on 0.05-arcsec images from Keck laser guide star adaptive optics. The sample brings the total number of T dwarfs found or confirmed by UKIDSS data in this region to nine, and we discuss the projected numbers of dwarfs in the future survey data. We estimate that ~240 early and late T dwarfs are discoverable in the UKIDSS Large Area Survey (LAS) data, falling significantly short of published model projections and suggesting that initial mass functions and/or birth rates may be at the low end of possible models. Thus, deeper optical data have good potential to exploit the UKIDSS survey depth more fully, but may still find the potential Y dwarf sample to be extremely rare.
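Distance estimates like the 34-98 pc range quoted above typically follow from the distance modulus, m - M = 5 log10(d / 10 pc), given an apparent magnitude and an absolute magnitude inferred from the spectral type. A minimal sketch (the magnitudes in the test are illustrative, not the paper's values):

```python
def distance_pc(m_app, M_abs):
    """Distance in parsecs from the distance modulus m - M = 5 log10(d/10 pc)."""
    return 10 ** ((m_app - M_abs + 5.0) / 5.0)
```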

  3. The UKIRT Infrared Deep Sky Survey (UKIDSS)

    NASA Astrophysics Data System (ADS)

    Lawrence, A.; Warren, S. J.; Almaini, O.; Edge, A. C.; Hambly, N. C.; Jameson, R. F.; Lucas, P.; Casali, M.; Adamson, A.; Dye, S.; Emerson, J. P.; Foucaud, S.; Hewett, P.; Hirst, P.; Hodgkin, S. T.; Irwin, M. J.; Lodieu, N.; McMahon, R. G.; Simpson, C.; Smail, I.; Mortlock, D.; Folger, M.

    2007-08-01

We describe the goals, design, implementation, and initial progress of the UKIRT Infrared Deep Sky Survey (UKIDSS), a seven-year sky survey which began in 2005 May. UKIDSS is being carried out using the UKIRT Wide Field Camera (WFCAM), which has the largest étendue of any infrared astronomical instrument to date. It is a portfolio of five survey components covering various combinations of the filter set ZYJHK and H2. The Large Area Survey, the Galactic Clusters Survey, and the Galactic Plane Survey cover approximately 7000 deg2 to a depth of K ~ 18; the Deep Extragalactic Survey covers 35 deg2 to K ~ 21; and the Ultra Deep Survey covers 0.77 deg2 to K ~ 23. Summed together, UKIDSS is 12 times larger in effective volume than the 2MASS survey. The prime aim of UKIDSS is to provide a long-term astronomical legacy database; the design is, however, driven by a series of specific goals - for example, to find the nearest and faintest substellar objects, to discover Population II brown dwarfs (if they exist), to determine the substellar mass function, to break the z = 7 quasar barrier, to determine the epoch of re-ionization, to measure the growth of structure from z = 3 to the present day, to determine the epoch of spheroid formation, and to map the Milky Way through the dust to several kpc. The survey data are being uniformly processed. Images and catalogues are being made available through a fully queryable user interface - the WFCAM Science Archive (http://surveys.roe.ac.uk/wsa). The data are being released in stages: they are immediately public to astronomers in all ESO member states, and available to the world after 18 months. Before the formal survey began, UKIRT and the UKIDSS consortia collaborated in obtaining and analysing a series of small science verification (SV) projects to complete the commissioning of the camera. We show some results from these SV projects in order to demonstrate the likely power of the eventual complete survey. Finally, using the data

  4. New Evidence for a Large Local Void From the UKIDSS LAS + SDSS

    NASA Astrophysics Data System (ADS)

    Keenan, Ryan; Barger, A. J.

    2013-01-01

Recent cosmological modeling efforts have shown that a local under-density on scales of a few hundred Mpc (out to z ~ 0.1) could produce the apparent acceleration of the expansion of the universe observed via type Ia supernovae. Several studies of galaxy counts in the near-infrared (NIR) have found that the local universe appears underdense by ~25-50% compared with regions a few hundred Mpc distant (e.g. Keenan et al., 2010). An accurate characterization of any such under-density will be important for studies seeking to understand the nature of dark energy. If the space density of galaxies is rising as a function of redshift, then the luminosity density, as measured via the NIR galaxy luminosity function (LF), should be rising as well. In Keenan et al. (2012), we presented a study of the NIR LF at z ~ 0.2 and found that the product φ*L* (the peak of the luminosity density distribution) at z ~ 0.2 is roughly 30% higher than that measured at z ~ 0.05. Here we present the results from a study of the NIR LF derived from galaxies selected from the UKIRT Infrared Deep Sky Survey (UKIDSS) Large Area Survey (LAS) combined with spectroscopy from the Sloan Digital Sky Survey (SDSS). We confirm the apparent rise in luminosity density found in Keenan et al. (2012) from z = 0.05 to z = 0.1 and provide the first self-consistent measurements of the NIR luminosity density out to z ~ 0.15.

  5. Hunting For Wild Brown Dwarf Companions To White Dwarfs In UKIDSS And SDSS

    NASA Astrophysics Data System (ADS)

    Day-Jones, Avril; Pinfield, D. J.; Jones, H. R. A.; Napiwotzki, R.; Burningham, B.; Jenkins, J. S.; UKIDSS Cool Dwarf Science Working Group

    2008-03-01

We present findings from our search of the latest releases of SDSS and UKIDSS LAS for very widely separated white dwarf - ultracool dwarf binaries. Ultracool dwarfs found in such binary systems can be used as benchmark objects, whose properties, such as age and distance, can be inferred indirectly from the white dwarf primary (with no need to refer to atmospheric models), and can provide a test bed for theoretical models; they can therefore be used to observationally pin down how physical properties affect ultracool dwarf spectra.

  6. Investigating broad absorption line quasars with SDSS and UKIDSS.

    NASA Astrophysics Data System (ADS)

    Maddox, Natasha; Hewett, P. C.

The SDSS contains the largest set of spectroscopically confirmed broad line quasars ever compiled. Upon its completion, the UKIDSS LAS will provide a near-infrared counterpart to the SDSS, reaching 3 magnitudes deeper than 2MASS over a 4000 square degree area within the SDSS footprint. Combining the SDSS optical and UKIDSS near-infrared data allows new insight into the photometric and spectroscopic properties of broad absorption line quasars (BALQSOs) relative to the quasar population as a whole. An accurate estimate of the intrinsic BALQSO fraction is essential for determining the BAL cloud covering fraction and the implications for the co-evolution of accreting supermassive black holes and their host galaxies. Defining a K-band limited sample of quasars makes clear the significantly redder distribution of i-K colours of the BALQSOs. The BALQSO i-K colour distribution enables us to estimate a lower limit to the intrinsic BALQSO fraction, computed to be ˜30 per cent, significantly larger than the optical fraction of 15-20 per cent found by several authors. We combined the high-quality SDSS spectra of the quasar sample to make several composite spectra based on i-K colour, and the properties of these composites are compared to a composite spectrum of unreddened quasars. If the origin of the wavelength-dependent differences between the red and unreddened objects is ascribed to attenuation by dust, we find that the extinction curve of the material is intermediate in form between the steep SMC-like extinction curve and the recent, empirically determined, extinction curve presented by Gaskell & Benker (2007).

  7. Dust Reddened Quasars in FIRST and UKIDSS: Beyond the Tip of the Iceberg

    NASA Astrophysics Data System (ADS)

    Glikman, Eilat; Urrutia, Tanya; Lacy, Mark; Djorgovski, S. G.; Urry, Meg; Croom, Scott; Schneider, Donald P.; Mahabal, Ashish; Graham, Matthew; Ge, Jian

    2013-12-01

    We present the results of a pilot survey to find dust-reddened quasars by matching the Faint Images of the Radio Sky at Twenty-Centimeters (FIRST) radio catalog to the UKIDSS near-infrared survey and using optical data from Sloan Digital Sky Survey to select objects with very red colors. The deep K-band limit provided by UKIDSS allows for finding more heavily reddened quasars at higher redshifts as compared with previous work using FIRST and Two Micron All Sky Survey (2MASS). We selected 87 candidates with K <= 17.0 from the UKIDSS Large Area Survey (LAS) First Data Release (DR1), which covers 190 deg2. These candidates reach up to ~1.5 mag below the 2MASS limit and obey the color criteria developed to identify dust-reddened quasars. We have obtained 61 spectroscopic observations in the optical and/or near-infrared, as well as classifications in the literature, and have identified 14 reddened quasars with E(B - V) > 0.1, including 3 at z > 2. We study the infrared properties of the sample using photometry from the Wide-Field Infrared Survey Explorer and find that infrared colors improve the efficiency of red quasar selection, removing many contaminants in an infrared-to-optical color-selected sample alone. The highest-redshift quasars (z >~ 2) are only moderately reddened, with E(B - V) ~ 0.2-0.3. We find that the surface density of red quasars rises sharply with faintness, comprising up to 17% of blue quasars at the same apparent K-band flux limit. We estimate that to reach more heavily reddened quasars (i.e., E(B - V) >~ 0.5) at z > 2 and a depth of K = 17, we would need to survey at least ~2.5 times more area.

  8. The K-Band Quasar Luminosity Function from an SDSS and UKIDSS Matched Catalog

    NASA Astrophysics Data System (ADS)

    Peth, Michael; Ross, N. P.; Schneider, D. P.

    2010-01-01

We match the 1,015,082 quasars from the Sloan Digital Sky Survey (SDSS) DR6 Photometric Quasar catalog to the UKIRT Infrared Deep Sky Survey (UKIDSS) Large Area Survey (LAS) DR3 to produce a catalog of 130,827 objects with optical (ugriz) and infrared (YJHK) measurements over an area of 1,200 sq. deg. A matching radius of 1'' is used; the positional standard deviations between SDSS DR6 quasars and UKIDSS LAS are δRA = 0.137'' and δDec = 0.131''. The catalog contains 74,351 K-band detections, and 42,133 objects have coverage in all four NIR bands. In addition to the catalog, we present optical and NIR color-redshift and color-color plots. The photometric vs. spectroscopic redshift plots demonstrate how unreliable high reported photometric redshifts can be. This forces us to focus on z < 4.6; previously discovered z > 4.6 quasars are compared to our highest-redshift objects. The giK color-color plot demonstrates that stellar contamination only affects a small sample of the objects. Distributions for the Y, J, H, K and i bands reveal insights into the flux limits in each band. We investigate the distribution of redshifts from different data sets and investigate the legitimacy of certain measured photometric redshift regions. For in-depth analysis, we focus on the 300 sq. deg equatorial SDSS region designated as Stripe 82. We measure the observed K-band quasar luminosity function (QLF) for a subset of 9,872 z < 2.2 objects. We find the shape of the K-band QLF is very similar to that of the optical QLF over the considered redshift ranges. Our calculated K-band QLFs broadly match previous optical QLFs calculated from the SDSS and 2SLAQ QSO surveys and should provide important constraints linking unobscured optical quasars to mid-infrared-detected, dusty and obscured AGNs at high redshift.
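The catalog match described above is a fixed-radius nearest-neighbour crossmatch. A minimal brute-force sketch follows (real pipelines would use spatial indexing, e.g. astropy's `match_to_catalog_sky`; the function name and toy coordinates here are assumptions):

```python
import numpy as np

ARCSEC = 1.0 / 3600.0  # one arcsecond in degrees

def crossmatch(ra1, dec1, ra2, dec2, radius_arcsec=1.0):
    """Match each source in catalogue 1 to the nearest source in catalogue 2
    within radius_arcsec, using the small-angle (tangent-plane) approximation.
    Returns a list of (i, j) index pairs. O(N*M) brute force - fine for a
    sketch, not for million-row catalogues.
    """
    pairs = []
    for i in range(len(ra1)):
        # Scale RA offsets by cos(dec) so separations are true angles
        dra = (ra2 - ra1[i]) * np.cos(np.radians(dec1[i]))
        ddec = dec2 - dec1[i]
        sep = np.hypot(dra, ddec)  # degrees
        j = int(np.argmin(sep))
        if sep[j] <= radius_arcsec * ARCSEC:
            pairs.append((i, j))
    return pairs
```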

  9. Luminosity and surface brightness distribution of K-band galaxies from the UKIDSS Large Area Survey

    NASA Astrophysics Data System (ADS)

    Smith, Anthony J.; Loveday, Jon; Cross, Nicholas J. G.

    2009-08-01

We present luminosity and surface-brightness distributions of 40111 galaxies with K-band photometry from the United Kingdom Infrared Telescope (UKIRT) Infrared Deep Sky Survey (UKIDSS) Large Area Survey (LAS), Data Release 3 and optical photometry from Data Release 5 of the Sloan Digital Sky Survey (SDSS). Various features and limitations of the new UKIDSS data are examined, such as a problem affecting Petrosian magnitudes of extended sources. Selection limits in K- and r-band magnitude, K-band surface brightness and K-band radius are included explicitly in the 1/Vmax estimate of the space density and luminosity function. The bivariate brightness distribution in K-band absolute magnitude and surface brightness is presented and found to display a clear luminosity-surface brightness correlation that flattens at high luminosity and broadens at low luminosity, consistent with similar analyses at optical wavelengths. Best-fitting Schechter function parameters for the K-band luminosity function are found to be M* - 5 log h = -23.19 +/- 0.04, α = -0.81 +/- 0.04 and φ* = (0.0166 +/- 0.0008) h3 Mpc-3, although the Schechter function provides a poor fit to the data at high and low luminosity, while the luminosity density in the K band is found to be j = (6.305 +/- 0.067) × 108 Lsolar h Mpc-3. However, we caution that there are various known sources of incompleteness and uncertainty in our results. Using mass-to-light ratios determined from the optical colours, we estimate the stellar mass function, finding good agreement with previous results. Possible improvements are discussed that could be implemented when extending this analysis to the full LAS.
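For a Schechter function the total luminosity density has the closed form j = φ* L* Γ(α + 2), so the quoted fit parameters can be sanity-checked directly. Converting M* to L* requires the Sun's absolute K-band magnitude; the value M_sun,K ≈ 3.28 (Vega) used below is an assumed conversion, not from the abstract.

```python
import math

# Best-fitting K-band Schechter parameters quoted in the abstract (h = 1 units)
M_star = -23.19          # M* - 5 log h
alpha = -0.81
phi_star = 0.0166        # h^3 Mpc^-3

M_SUN_K = 3.28           # assumed solar absolute K (Vega) magnitude

# L* in solar luminosities from the absolute-magnitude difference
L_star = 10 ** (-0.4 * (M_star - M_SUN_K))

# Analytic luminosity density of a Schechter function: j = phi* L* Gamma(alpha + 2)
j = phi_star * L_star * math.gamma(alpha + 2.0)
```

This analytic integral comes out at roughly 6 × 10^8 Lsolar h Mpc-3, the same order as the directly measured j quoted above; exact agreement is not expected since the abstract notes the Schechter form fits poorly at the luminosity extremes.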

  10. New ultracool subdwarfs identified in large-scale surveys using Virtual Observatory tools. I. UKIDSS LAS DR5 vs. SDSS DR7

    NASA Astrophysics Data System (ADS)

    Lodieu, N.; Espinoza Contreras, M.; Zapatero Osorio, M. R.; Solano, E.; Aberasturi, M.; Martín, E. L.

    2012-06-01

Aims: The aim of the project is to improve our knowledge of the low-mass and low-metallicity population to investigate the influence of metallicity on the stellar (and substellar) mass function. Methods: We present the results of a photometric and proper motion search aimed at discovering ultracool subdwarfs in large-scale surveys. We employed and combined the Fifth Data Release (DR5) of the UKIRT Infrared Deep Sky Survey (UKIDSS) Large Area Survey (LAS) and the Sloan Digital Sky Survey (SDSS) Data Release 7, complemented with ancillary data from the Two Micron All-Sky Survey (2MASS), the DEep Near-Infrared Survey (DENIS) and the SuperCOSMOS Sky Surveys (SSS). Results: The SDSS DR7 vs. UKIDSS LAS DR5 search returned a total of 32 ultracool subdwarf candidates, only two of which are recognised as subdwarfs in the literature. Twenty-seven candidates, including the two known ones, were followed up spectroscopically in the optical between 600 and 1000 nm, thus covering strong spectral features indicative of low metallicity (e.g., CaH): 21 with the Very Large Telescope, one with the Nordic Optical Telescope, and five extracted from the Sloan spectroscopic database, to assess (or refute) their low metal content. We confirm 20 candidates as subdwarfs, extreme subdwarfs, or ultra-subdwarfs with spectral types later than M5; this represents a success rate of ≥ 60%. Among those 20 new subdwarfs, we identify two early-L subdwarfs that are very likely located within 100 pc, which we propose as templates for future searches because they are the first examples of their subclass. Another seven sources are solar-metallicity M dwarfs with spectral types between M4 and M7 without Hα emission, suggesting that they are old M dwarfs. The remaining five candidates do not have spectroscopic follow-up yet; only one remains as a bona fide ultracool subdwarf after revision of their proper motions. We assigned spectral types based on the current classification schemes and, when

  11. VizieR Online Data Catalog: UKIDSS-DR7 Large Area Survey (Lawrence+ 2011)

    NASA Astrophysics Data System (ADS)

    UKIDSS Consortium

    2012-03-01

The UKIRT Infrared Deep Sky Survey (UKIDSS) is a large-scale near-IR survey whose aim is to cover 7500 square degrees of the Northern sky. The survey is carried out using the Wide Field Camera (WFCAM), with a field of view of 0.21 square degrees, mounted on the 3.8m United Kingdom Infrared Telescope (UKIRT) in Hawaii. The Large Area Survey (LAS) covers an area of 4000 square degrees at high Galactic latitudes (extragalactic) in the four bands Y(1.0um), J(1.2um), H(1.6um) and K(2.2um) to a depth of K = 18.4. Details of the survey can be found in the paper by Lawrence et al. (2007MNRAS.379.1599L) (1 data file).

  12. Heavily reddened quasars at z ˜ 2 in the UKIDSS Large Area Survey: a transitional phase in AGN evolution

    NASA Astrophysics Data System (ADS)

    Banerji, Manda; McMahon, Richard G.; Hewett, Paul C.; Alaghband-Zadeh, Susannah; Gonzalez-Solares, Eduardo; Venemans, Bram P.; Hawthorn, Melanie J.

    2012-12-01

We present a new sample of purely near-infrared-selected KVega < 16.5 [KAB < 18.4] extremely red [(J - K)Vega > 2.5] quasar candidates at z ˜ 2 from ≃900 deg2 of data in the UKIDSS Large Area Survey (LAS). Five of these are spectroscopically confirmed to be heavily reddened type 1 active galactic nuclei (AGN) with broad emission lines, bringing our total sample of reddened quasars from the UKIDSS-LAS to 12 at z = 1.4-2.7. At these redshifts, Hα (6563 Å) is in the K band. However, the mean Hα equivalent width of the reddened quasars is only 10 per cent larger than that of the optically selected population and cannot explain the extreme colours. Instead, dust extinction of AV ˜ 2-6 mag is required to reproduce the continuum colours of our sources. This is comparable to the dust extinctions seen in submillimetre galaxies at similar redshifts. We argue that the AGN are likely being observed in a relatively short-lived breakout phase when they are expelling gas and dust following a massive starburst, subsequently turning into UV-luminous quasars. Some of our quasars show direct evidence for strong outflows (v ˜ 800-1000 km s-1) affecting the Hα line, consistent with this scenario. We predict that a larger fraction of reddened quasar hosts are likely to be submillimetre bright compared to the UV-luminous quasar population. We use our sample to place new constraints on the fraction of obscured type 1 AGN likely to be missed in optical surveys. Taken at face value, our findings suggest that the obscured fraction depends on quasar luminosity. The space density of obscured quasars is approximately five times that inferred for UV-bright quasars from the Sloan Digital Sky Survey (SDSS) luminosity function at Mi < -30 but seems to drop at lower luminosities even accounting for various sources of incompleteness in our sample. We find that at Mi ˜ -28, for example, this fraction is unlikely to be larger than ˜20 per cent, although these fractions are highly uncertain at

  13. A Large-Scale Super-Structure at z=0.65 in the UKIDSS Ultra-Deep Survey Field

    NASA Astrophysics Data System (ADS)

    Galametz, Audrey; Candels Clustering Working Group

    2017-07-01

In hierarchical structure formation scenarios, galaxies accrete along high-density filaments. Superclusters represent the largest density enhancements in the cosmic web, with scales of 100 to 200 Mpc. As the largest components of large-scale structure (LSS), they are very powerful tools to constrain cosmological models. Since they also offer a wide range of densities, from infalling groups to high-density cluster cores, they are also the perfect laboratory to study the influence of environment on galaxy evolution. I will present a newly discovered large-scale structure at z=0.65 in the UKIDSS UDS field. Although statistically predicted, the presence of such a structure in UKIDSS, one of the most extensively covered and studied extragalactic fields, remains serendipitous. Our follow-up confirmed more than 15 group members, including at least three galaxy clusters with M200 > 10^14 Msol. Deep spectroscopy of the quiescent core galaxies reveals that the most massive structure knots are at very different formation stages, with a range of red sequence properties. Statistics allow us to map formation age across the structure's denser knots and identify where quenching is most probably occurring across the LSS. Spectral diagnostics analysis also reveals an interesting population of transition galaxies that we suspect are transforming from star-forming to quiescent galaxies.

  14. Eight new T4.5-T7.5 dwarfs discovered in the UKIDSS Large Area Survey Data Release 1

    NASA Astrophysics Data System (ADS)

    Lodieu, N.; Pinfield, D. J.; Leggett, S. K.; Jameson, R. F.; Mortlock, D. J.; Warren, S. J.; Burningham, B.; Lucas, P. W.; Chiu, K.; Liu, M. C.; Venemans, B. P.; McMahon, R. G.; Allard, F.; Baraffe, I.; Barrado y Navascués, D.; Carraro, G.; Casewell, S. L.; Chabrier, G.; Chappelle, R. J.; Clarke, F.; Day-Jones, A. C.; Deacon, N. R.; Dobbie, P. D.; Folkes, S. L.; Hambly, N. C.; Hewett, P. C.; Hodgkin, S. T.; Jones, H. R. A.; Kendall, T. R.; Magazzù, A.; Martín, E. L.; McCaughrean, M. J.; Nakajima, T.; Pavlenko, Y.; Tamura, M.; Tinney, C. G.; Zapatero Osorio, M. R.

    2007-08-01

We present eight new T4.5-T7.5 dwarfs identified in the UKIRT (United Kingdom Infrared Telescope) Infrared Deep Sky Survey (UKIDSS) Large Area Survey (LAS) Data Release 1 (DR1). In addition we have recovered the T4.5 dwarf SDSSJ020742.91+000056.2 and the T8.5 dwarf ULASJ003402.77-005206.7. Photometric candidates were picked up in two-colour diagrams over 190deg2 (DR1) and selected in at least two filters. All candidates exhibit near-infrared spectra with strong methane and water absorption bands characteristic of T dwarfs, and the derived spectral types follow the unified scheme of Burgasser et al. We have found six new T4.5-T5.5 dwarfs, one T7 dwarf, one T7.5 dwarf and recovered a T4.5 dwarf and a T8.5 dwarf. We provide distance estimates which lie in the 15-85pc range; the T7.5 and T8.5 dwarfs are probably within 25pc of the Sun. We conclude with a discussion of the number of T dwarfs expected after completion of the LAS, comparing these initial results to theoretical simulations. Based on observations made with the United Kingdom Infrared Telescope, operated by the Joint Astronomy Centre on behalf of the UK Particle Physics and Astronomy Research Council. E-mail: nlodieu@iac.es ‡ Alfred P. Sloan Research Fellow.

  15. A new benchmark T8-9 brown dwarf and a couple of new mid-T dwarfs from the UKIDSS DR5+ LAS

    NASA Astrophysics Data System (ADS)

    Goldman, B.; Marsat, S.; Henning, T.; Clemens, C.; Greiner, J.

    2010-06-01

Benchmark brown dwarfs are those objects for which fiducial constraints are available, including effective temperature, parallax, age and metallicity. We searched for new cool brown dwarfs in 186deg2 of the new area covered by the data release DR5+ of the UKIRT Infrared Deep Sky Survey (UKIDSS) Large Area Survey. Follow-up optical and near-infrared broad-band photometry, and methane imaging of four promising candidates, revealed three objects with distinct methane absorption, typical of mid- to late-T dwarfs, and one possible T4 dwarf. The latest-type object, classified as T8-9, shares its large proper motion with Ross 458 (BD+13o2618), an active M0.5 binary which is 102arcsec away, forming a hierarchical low-mass star+brown dwarf system. Ross 458C has an absolute J-band magnitude of 16.4, and seems overluminous, particularly in the K band, compared to similar field brown dwarfs. We estimate the age of the system to be less than 1Gyr, and its mass to be as low as 14 Jupiter masses for an age of 1Gyr. At 11.4pc, this new late-T benchmark dwarf is a promising target to constrain the evolutionary and atmospheric models of very low-mass brown dwarfs. We present proper motion measurements for our targets and for 13 known brown dwarfs. Two brown dwarfs have velocities typical of the thick disc and may be old brown dwarfs. Based on observations collected at the German-Spanish Astronomical Center, Calar Alto, jointly operated by the Max-Planck Institut für Astronomie Heidelberg and the Instituto de Astrofísica de Andalucía (CSIC), and on observations made with ESO/MPG Telescope at the La Silla Observatory under programme ID 081.A-9012 and 081.A-9014. E-mail: goldman@mpia.de

  16. A wide deep infrared look at the Pleiades with UKIDSS: new constraints on the substellar binary fraction and the low-mass initial mass function

    NASA Astrophysics Data System (ADS)

    Lodieu, N.; Dobbie, P. D.; Deacon, N. R.; Hodgkin, S. T.; Hambly, N. C.; Jameson, R. F.

    2007-09-01

    We present the results of a deep wide-field near-infrared survey of 12 deg2 of the Pleiades conducted as part of the United Kingdom Infrared Telescope (UKIRT) Infrared Deep Sky Survey (UKIDSS) Galactic Cluster Survey (GCS). We have extracted over 340 high-probability proper motion (PM) members down to 0.03 Msolar using a combination of UKIDSS photometry and PM measurements obtained by cross-correlating the GCS with data from the Two Micron All Sky Survey, the Isaac Newton Telescope and the Canada-France-Hawaii Telescope. Additionally, we have unearthed 73 new candidate brown dwarf (BD) members on the basis of five-band UKIDSS photometry alone. We have identified 23 substellar multiple system candidates out of 63 candidate BDs from the (Y - K, Y) and (J - K, J) colour-magnitude diagrams, yielding a binary frequency of 28-44 per cent in the 0.075-0.030 Msolar mass range. Our estimate is three times larger than the binary fractions reported from high-resolution imaging surveys of field ultracool dwarfs and Pleiades BDs. However, it is marginally consistent with our earlier `peculiar' photometric binary fraction of 50 +/- 10 per cent presented by Pinfield et al., in good agreement with the 32-45 per cent binary fraction derived from the recent Monte Carlo simulations of Maxted & Jeffries and compatible with the 26 +/- 10 per cent frequency recently estimated by Basri & Reiners. A tentative estimate of the mass ratios from photometry alone seems to support the hypothesis that binary BDs tend to reside in near equal-mass ratio systems. In addition, the recovery of four Pleiades members targeted by high-resolution imaging surveys for multiplicity studies suggests that half of the binary candidates may have separations below the resolution limit of the Hubble Space Telescope or current adaptive optics facilities at the distance of the Pleiades (a ~7 au). Finally, we have derived luminosity and mass functions from the sample of photometric candidates with membership

  17. Searching for Ultra-cool Objects at the Limits of Large-scale Surveys

    NASA Astrophysics Data System (ADS)

    Pinfield, D. J.; Patel, K.; Zhang, Z.; Gomes, J.; Burningham, B.; Day-Jones, A. C.; Jenkins, J.

    2011-12-01

We have made a search (to Y=19.6) of the UKIDSS Large Area Survey (LAS DR7) for objects detected only in the Y-band. We have identified and removed contamination due to solar system objects, dust specks in the WFCAM optical path, persistence in the WFCAM detectors, and other sources of spurious single-source Y-detections in the UKIDSS LAS database. In addition to our automated selection procedure, we have visually inspected the ˜600 automatically selected candidates to provide an additional level of quality filtering. This has resulted in 55 good candidates that await follow-up observations to confirm their nature. Ultra-cool LAS Y-only objects would have blue Y-J colours combined with very red optical-NIR SEDs - characteristics shared by Jupiter, and suggested by an extrapolation of the Y-J colour trend seen for the latest T dwarfs currently known.

  18. Probabilistic selection of high-redshift quasars

    NASA Astrophysics Data System (ADS)

    Mortlock, Daniel J.; Patel, Mitesh; Warren, Stephen J.; Hewett, Paul C.; Venemans, Bram P.; McMahon, Richard G.; Simpson, Chris

    2012-01-01

High-redshift quasars (HZQs) with redshifts of z ≳ 6 are so rare that any photometrically selected sample of sources with HZQ-like colours is likely to be dominated by Galactic stars and brown dwarfs scattered from the stellar locus. It is impractical to re-observe all such candidates, so an alternative approach was developed in which Bayesian model comparison techniques are used to calculate the probability that a candidate is a HZQ, Pq, by combining models of the quasar and star populations with the photometric measurements of the object. This method was motivated specifically by the large number of HZQ candidates identified by cross-matching the UKIRT (United Kingdom Infrared Telescope) Infrared Deep Sky Survey (UKIDSS) Large Area Survey (LAS) to the Sloan Digital Sky Survey (SDSS): in the area covered by the LAS in the UKIDSS Eighth Data Release (DR8) there are ˜9 × 103 real astronomical point sources with the measured colours of the target quasars, of which only ˜10 are expected to be HZQs. Applying Bayesian model comparison to the sample reveals that most sources with HZQ-like colours have Pq ≲ 0.1 and can be confidently rejected without the need for any further observations. In the case of the UKIDSS DR8 LAS, there were just 107 candidates with Pq ≥ 0.1; these objects were prioritized for re-observation by ranking according to Pq (and their likely redshift, which was also inferred from the photometric data). Most candidates were rejected after one or two (moderate-depth) photometric measurements by recalculating Pq using the new data. That left 12 confirmed HZQs, six of which were previously identified in the SDSS and six of which were new UKIDSS discoveries. The high efficiency of this Bayesian selection method suggests that it could usefully be extended to other HZQ surveys (e.g. searches by the Panoramic Survey Telescope And Rapid Response System, Pan-STARRS, or the Visible and Infrared Survey Telescope for Astronomy, VISTA) as well as to other
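The probability Pq above is a two-model Bayesian posterior: the prior weight of each population (its surface density) multiplies the photometric likelihood of the candidate under that population's colour model. A toy sketch with independent Gaussian colour likelihoods follows; the specific population means, widths and densities are illustrative assumptions, not the paper's calibrated models.

```python
import math

def gaussian_like(colors, mu, sigma):
    """Product of independent Gaussian likelihoods over the measured colours."""
    like = 1.0
    for c, m, s in zip(colors, mu, sigma):
        like *= math.exp(-0.5 * ((c - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return like

def p_quasar(colors, rho_q, mu_q, sig_q, rho_s, mu_s, sig_s):
    """Posterior probability that a candidate is a quasar:
    Pq = rho_q * L_q / (rho_q * L_q + rho_s * L_s),
    where the rho are population surface densities (the priors) and the
    L are photometric likelihoods under each population's colour model.
    """
    lq = rho_q * gaussian_like(colors, mu_q, sig_q)
    ls = rho_s * gaussian_like(colors, mu_s, sig_s)
    return lq / (lq + ls)
```

Note how the ~10:9000 prior ratio penalizes quasar-like candidates: the likelihood ratio must overcome roughly three orders of magnitude before Pq becomes appreciable, which is exactly why most candidates fall below Pq ≈ 0.1.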

  19. The SCUBA-2 Cosmology Legacy Survey: the clustering of submillimetre galaxies in the UKIDSS UDS field

    NASA Astrophysics Data System (ADS)

    Wilkinson, Aaron; Almaini, Omar; Chen, Chian-Chou; Smail, Ian; Arumugam, Vinodiran; Blain, Andrew; Chapin, Edward L.; Chapman, Scott C.; Conselice, Christopher J.; Cowley, William I.; Dunlop, James S.; Farrah, Duncan; Geach, James; Hartley, William G.; Ivison, Rob J.; Maltby, David T.; Michałowski, Michał J.; Mortlock, Alice; Scott, Douglas; Simpson, Chris; Simpson, James M.; van der Werf, Paul; Wild, Vivienne

    2017-01-01

    Submillimetre galaxies (SMGs) are among the most luminous dusty galaxies in the Universe, but their true nature remains unclear; are SMGs the progenitors of the massive elliptical galaxies we see in the local Universe, or are they just a short-lived phase among more typical star-forming galaxies? To explore this problem further, we investigate the clustering of SMGs identified in the SCUBA-2 Cosmology Legacy Survey. We use a catalogue of submillimetre (850 μm) source identifications derived using a combination of radio counterparts and colour/infrared selection to analyse a sample of 610 SMG counterparts in the United Kingdom Infrared Telescope (UKIRT) Infrared Deep Survey (UKIDSS) Ultra Deep Survey (UDS), making this the largest high-redshift sample of these galaxies to date. Using angular cross-correlation techniques, we estimate the halo masses for this large sample of SMGs and compare them with passive and star-forming galaxies selected in the same field. We find that SMGs, on average, occupy high-mass dark matter haloes (Mhalo > 1013 M⊙) at redshifts z > 2.5, consistent with being the progenitors of massive quiescent galaxies in present-day galaxy clusters. We also find evidence of downsizing, in which SMG activity shifts to lower mass haloes at lower redshifts. In terms of their clustering and halo masses, SMGs appear to be consistent with other star-forming galaxies at a given redshift.

  20. X-UDS: The Chandra Legacy Survey of the UKIDSS Ultra Deep Survey Field

    NASA Astrophysics Data System (ADS)

    Kocevski, Dale D.; Hasinger, Guenther; Brightman, Murray; Nandra, Kirpal; Georgakakis, Antonis; Cappelluti, Nico; Civano, Francesca; Li, Yuxuan; Li, Yanxia; Aird, James; Alexander, David M.; Almaini, Omar; Brusa, Marcella; Buchner, Johannes; Comastri, Andrea; Conselice, Christopher J.; Dickinson, Mark A.; Finoguenov, Alexis; Gilli, Roberto; Koekemoer, Anton M.; Miyaji, Takamitsu; Mullaney, James R.; Papovich, Casey; Rosario, David; Salvato, Mara; Silverman, John D.; Somerville, Rachel S.; Ueda, Yoshihiro

    2018-06-01

    We present the X-UDS survey, a set of wide and deep Chandra observations of the Subaru-XMM Deep/UKIDSS Ultra Deep Survey (SXDS/UDS) field. The survey consists of 25 observations that cover a total area of 0.33 deg^2. The observations are combined to provide a nominal depth of ∼600 ks in the central 100 arcmin^2 region of the field that has been imaged with Hubble/WFC3 by the CANDELS survey and ∼200 ks in the remainder of the field. In this paper, we outline the survey's scientific goals, describe our observing strategy, and detail our data reduction and point source detection algorithms. Our analysis has resulted in a total of 868 band-merged point sources detected with a false-positive Poisson probability of <1 × 10^-4. In addition, we present the results of an X-ray spectral analysis and provide best-fitting neutral hydrogen column densities, N_H, as well as a sample of 51 Compton-thick active galactic nucleus candidates. Using this sample, we find the intrinsic Compton-thick fraction to be 30%–35% over a wide range in redshift (z = 0.1–3), suggesting the obscured fraction does not evolve very strongly with epoch. However, if we assume that the Compton-thick fraction is dependent on luminosity, as is seen for Compton-thin sources, then our results are consistent with a rise in the obscured fraction out to z ∼ 3. Finally, an examination of the host morphologies of our Compton-thick candidates shows a high fraction of morphological disturbances, in agreement with our previous results. All data products described in this paper are made available via a public website.

  1. T dwarfs all the way to 550 K?

    NASA Astrophysics Data System (ADS)

    Burningham, Ben; Pinfield, D. J.; Leggett, S. K.; Tamura, M.; Lucas, P. W.; Homeier, D.

    2009-02-01

    We highlight recent results from the UKIDSS Large Area Survey (LAS) including a T dwarf with an estimated Teff = 550-600 K and new constraints on the substellar mass function in the field. We also define the T9 subtype as an extension to the T spectral sequence defined by Burgasser et al. (2006).

  2. Search for Wide Planetary-Mass Companions in Young Star-Forming Regions with UKIDSS and Pan-STARRS

    NASA Astrophysics Data System (ADS)

    Aller, Kimberly M.; Kraus, A. L.; Liu, M. C.; Bowler, B. P.

    2013-01-01

    Over the past decade, planetary-mass (<15 MJup) companions have been discovered in very wide orbits (>100 AU) around young stars. It is unclear whether these objects formed like planets or like stars. If they are planets, then modifications to core accretion or disk instability models are needed to allow formation at such wide orbits, or planet scattering must be an important mechanism. On the other hand, if these objects formed like stars, we need to understand the frequency of such extremely low mass ratio binary companions, which challenge brown dwarf formation models. Regardless of their origins, these wide companions are easier to observe than close-in planets and can be used as benchmarks to understand the properties of young planets. We have combined optical and NIR photometry from UKIDSS and Pan-STARRS-1 to search the young star-forming regions of Upper Scorpius and Taurus for new planetary-mass objects, going ≈3 mag deeper than previous work with 2MASS. We identified several candidates at very wide separations (≈400-4000 AU) from known members using a combination of color selection and spectral energy distribution (SED) fitting to templates of known low-mass stars and brown dwarfs. Furthermore, we have obtained follow-up NIR spectra of several Upper Scorpius candidates, spectroscopically identifying three new wide very low-mass companions (≈15-25 MJup, spectral types M8-L0).

  3. Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades.

    PubMed

    Orchard, Garrick; Jayawant, Ajinkya; Cohen, Gregory K; Thakor, Nitish

    2015-01-01

    Creating datasets for Neuromorphic Vision is a challenging task. A lack of available recordings from Neuromorphic Vision sensors means that data must typically be recorded specifically for dataset creation rather than collecting and labeling existing data. The task is further complicated by a desire to simultaneously provide traditional frame-based recordings to allow for direct comparison with traditional Computer Vision algorithms. Here we propose a method for converting existing Computer Vision static image datasets into Neuromorphic Vision datasets using an actuated pan-tilt camera platform. Moving the sensor rather than the scene or image is a more biologically realistic approach to sensing and eliminates timing artifacts introduced by monitor updates when simulating motion on a computer monitor. We present conversion of two popular image datasets (MNIST and Caltech101) which have played important roles in the development of Computer Vision, and we provide performance metrics on these datasets using spike-based recognition algorithms. This work contributes datasets for future use in the field, as well as results from spike-based algorithms against which future works can compare. Furthermore, by converting datasets already popular in Computer Vision, we enable more direct comparison with frame-based approaches.
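    As a rough illustration of the conversion principle described in this record, the sketch below simulates small saccade-like shifts of a static image and emits ON/OFF events wherever the log-intensity change since a pixel's last event crosses a contrast threshold, which is how an event camera responds to motion. This is a hypothetical simplification of the actuated pan-tilt approach; the function and parameter names here are illustrative, not the authors'.

```python
import numpy as np

def image_to_events(img, shifts, threshold=0.2, eps=1e-3):
    """Emit DVS-style ON/OFF events from a static image by simulating
    small sensor shifts (saccades). Hypothetical simplification: each
    shift produces an event at any pixel whose log-intensity change
    since its last event exceeds the contrast threshold."""
    log_ref = np.log(img.astype(float) + eps)   # per-pixel reference level
    events = []                                 # (t, x, y, polarity)
    for t, (dy, dx) in enumerate(shifts, start=1):
        moved = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
        log_now = np.log(moved.astype(float) + eps)
        diff = log_now - log_ref
        ys, xs = np.nonzero(np.abs(diff) >= threshold)
        for y, x in zip(ys, xs):
            events.append((t, x, y, 1 if diff[y, x] > 0 else -1))
            log_ref[y, x] = log_now[y, x]       # reset reference after an event
    return events

# Toy example: a bright square on a dark background, shifted twice.
img = np.zeros((8, 8))
img[2:5, 2:5] = 1.0
evts = image_to_events(img, shifts=[(0, 1), (0, 2)])
```

    A real recording would also carry microsecond timestamps and sensor noise; the point here is only that motion of the sensor, not the scene, generates the events.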

  4. Multiband Study of Radio Sources of the Rcr Catalogue with Virtual Observatory Tools

    NASA Astrophysics Data System (ADS)

    Zhelenkova, O. P.; Soboleva, N. S.; Majorova, E. K.; Temirova, A. V.

    We present early results of our multiband study of the RATAN Cold Revised (RCR) catalogue, obtained from seven cycles of the "Cold" survey carried out with the RATAN-600 radio telescope at 7.6 cm in 1980-1999 at the declination of the SS 433 source. We used the 2MASS and UKIDSS LAS infrared surveys, the DSS-II and SDSS DR7 optical surveys, the USNO-B1 and GSC-II catalogues, and the VLSS, TXS, NVSS, FIRST and GB6 radio surveys to accumulate information about the sources. For radio sources with no detectable candidate in the optical or infrared catalogues, we additionally examined images in several bands from the SDSS, UKIDSS LAS, DPOSS and 2MASS surveys, and also used co-added frames in different bands. We reliably identified 76% of the radio sources of the RCR catalogue. We used the ALADIN and SAOImage DS9 scripting capabilities, the interoperability services of ALADIN and TOPCAT, and other Virtual Observatory (VO) tools and resources, such as CasJobs, NED, VizieR and WSA, for effective data access, visualization and analysis. Without VO tools it would have been problematic to perform our study.

  5. Optical+Near-IR Bayesian Classification of Quasars

    NASA Astrophysics Data System (ADS)

    Mehta, Sajjan S.; Richards, G. T.; Myers, A. D.

    2011-05-01

    We describe the details of an optimal Bayesian classification of quasars with combined optical+near-IR photometry from the SDSS and UKIDSS LAS surveys. Using only deep co-added SDSS photometry from the "Stripe 82" region and requiring full four-band UKIDSS detections, we reliably identify 2665 quasar candidates with a computed efficiency in excess of 99%. Relaxing the data constraints to combinations of two-band detections yields up to 6424 candidates with minimal trade-off in completeness and efficiency. The completeness and efficiency of the sample are investigated with existing spectra from the SDSS, 2SLAQ, and AUS surveys in addition to recent single-slit observations from Palomar Observatory, which revealed 22 quasars from a subsample of 29 high-z candidates. SDSS-III/BOSS observations will allow further exploration of the completeness/efficiency of the sample over 2.2
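    As a schematic illustration of this kind of optical+near-IR Bayesian selection, the sketch below classifies objects by comparing class-conditional colour densities under Bayes' rule. The colour loci, Gaussian densities and prior below are invented for illustration; the actual method is built on empirical SDSS+UKIDSS photometric distributions, not these toy Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented colour loci: quasars and stars occupy different regions of a
# two-colour space. Real selection uses empirical SDSS+UKIDSS densities.
quasars = rng.normal([0.2, 1.5], 0.3, size=(500, 2))
stars = rng.normal([1.2, 0.5], 0.3, size=(2000, 2))

def log_gauss(x, mu, sigma):
    # Log density of an axis-aligned Gaussian (up to a shared constant).
    return -0.5 * np.sum(((x - mu) / sigma) ** 2, axis=-1) - np.sum(np.log(sigma))

def posterior_quasar(x, prior_q=0.1):
    # P(quasar | colours) by Bayes' rule with Gaussian class densities.
    lq = log_gauss(x, quasars.mean(0), quasars.std(0)) + np.log(prior_q)
    ls = log_gauss(x, stars.mean(0), stars.std(0)) + np.log(1 - prior_q)
    m = np.maximum(lq, ls)                      # log-sum-exp stabilisation
    return np.exp(lq - m) / (np.exp(lq - m) + np.exp(ls - m))

# One quasar-like and one star-like set of colours.
p = posterior_quasar(np.array([[0.2, 1.5], [1.2, 0.5]]))
```

    Thresholding the posterior trades completeness against efficiency: a higher cut yields a purer but smaller candidate sample, mirroring the trade-off discussed in the abstract.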

  6. Statistical Reference Datasets

    National Institute of Standards and Technology Data Gateway

    Statistical Reference Datasets (Web, free access). The Statistical Reference Datasets project is also supported by the Standard Reference Data Program. Its purpose is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software.

  7. Dataset Lifecycle Policy

    NASA Technical Reports Server (NTRS)

    Armstrong, Edward; Tauer, Eric

    2013-01-01

    The presentation focused on describing a new dataset lifecycle policy that the NASA Physical Oceanography DAAC (PO.DAAC) has implemented for its new and current datasets to foster improved stewardship and consistency across its archive. The overarching goal is to implement this dataset lifecycle policy for all new GHRSST GDS2 datasets and bridge the mission statements from the GHRSST Project Office and PO.DAAC to provide the best quality SST data in a cost-effective, efficient manner, preserving its integrity so that it will be available and usable to a wide audience.

  8. Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses

    PubMed Central

    Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M.; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V.; Ma’ayan, Avi

    2018-01-01

    Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated ‘canned’ analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also indexes 4,901 published bioinformatics software tools and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools. PMID:29485625

  9. Datasets2Tools, repository and search engine for bioinformatics datasets, tools and canned analyses.

    PubMed

    Torre, Denis; Krawczuk, Patrycja; Jagodnik, Kathleen M; Lachmann, Alexander; Wang, Zichen; Wang, Lily; Kuleshov, Maxim V; Ma'ayan, Avi

    2018-02-27

    Biomedical data repositories such as the Gene Expression Omnibus (GEO) enable the search and discovery of relevant biomedical digital data objects. Similarly, resources such as OMICtools index bioinformatics tools that can extract knowledge from these digital data objects. However, systematic access to pre-generated 'canned' analyses applied by bioinformatics tools to biomedical digital data objects is currently not available. Datasets2Tools is a repository indexing 31,473 canned bioinformatics analyses applied to 6,431 datasets. The Datasets2Tools repository also indexes 4,901 published bioinformatics software tools and all the analyzed datasets. Datasets2Tools enables users to rapidly find datasets, tools, and canned analyses through an intuitive web interface, a Google Chrome extension, and an API. Furthermore, Datasets2Tools provides a platform for contributing canned analyses, datasets, and tools, as well as evaluating these digital objects according to their compliance with the findable, accessible, interoperable, and reusable (FAIR) principles. By incorporating community engagement, Datasets2Tools promotes sharing of digital resources to stimulate the extraction of knowledge from biomedical research data. Datasets2Tools is freely available from: http://amp.pharm.mssm.edu/datasets2tools.

  10. Properties of galaxies around the most massive SMBHs

    NASA Astrophysics Data System (ADS)

    Shirasaki, Yuji; Komiya, Yutaka; Ohishi, Masatoshi; Mizumoto, Yoshihiko

    2015-08-01

    We present results of a clustering analysis performed between AGNs and galaxies. AGN samples with redshifts 0.1-1.0 were extracted from AGN property catalogs which contain virial mass estimates of the SMBHs. Galaxy samples were extracted from the SDSS DR8 catalog and the UKIDSS DR9 LAS catalog. The SDSS and UKIDSS catalogs were merged and used to estimate the IR-opt color and IR magnitude in the rest frame by SED fitting. As we had no redshift information for the galaxy samples, a stacking method was applied. We investigated the BH mass dependence of the cross-correlation length, the red galaxy fraction in the AGN environment, and the luminosity function of the surrounding galaxies. We found that the cross-correlation length increases above M_BH >= 10^{8.2} M⊙, and that red galaxies dominate the environment of AGNs with M_BH >= 10^{9} M⊙. This result indicates that the most massive SMBHs are mainly fueled by accretion of hot halo gas.
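    The cross-correlation measurements behind results like this (and the SMG clustering in record 19 above) can be illustrated with a toy pair-count estimator. The sketch below uses a simplified Davis-Peebles-style estimator, w = (DD/DR)(Nr/Ng) - 1, on a flat patch with invented positions; a real analysis would use proper angular separations on the sphere, survey masks and error estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

def pair_counts(a, b, bins):
    # Separations between every point in `a` and every point in `b`
    # (flat-sky approximation), histogrammed in angular bins.
    d = np.hypot(a[:, None, 0] - b[None, :, 0], a[:, None, 1] - b[None, :, 1])
    return np.histogram(d.ravel(), bins=bins)[0].astype(float)

def cross_correlation(agn, gal, randoms, bins):
    """Davis-Peebles-style estimator w = (DD/DR) * (Nr/Ng) - 1."""
    dd = pair_counts(agn, gal, bins)
    dr = pair_counts(agn, randoms, bins)
    return dd / np.maximum(dr, 1.0) * (len(randoms) / len(gal)) - 1.0

# Toy patch: galaxies clustered around the AGN positions, plus a random field.
agn = rng.uniform(0, 1, size=(20, 2))
gal = np.vstack([agn + rng.normal(0, 0.02, size=(20, 2)) for _ in range(5)])
randoms = rng.uniform(0, 1, size=(2000, 2))
bins = np.array([0.0, 0.05, 0.2, 0.5])
w = cross_correlation(agn, gal, randoms, bins)
```

    The excess of data-data over data-random pairs at small separations is what the correlation length parametrises; comparing it across M_BH bins gives the mass dependence discussed above.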

  12. Learning to recognize rat social behavior: Novel dataset and cross-dataset application.

    PubMed

    Lorbach, Malte; Kyriakou, Elisavet I; Poppe, Ronald; van Dam, Elsbeth A; Noldus, Lucas P J J; Veltkamp, Remco C

    2018-04-15

    Social behavior is an important aspect of rodent models. Automated measuring tools that make use of video analysis and machine learning are an increasingly attractive alternative to manual annotation. Because machine learning-based methods need to be trained, it is important that they are validated using data from different experiment settings. To develop and validate automated measuring tools, there is a need for annotated rodent interaction datasets. Currently, the availability of such datasets is limited to two mouse datasets. We introduce the first, publicly available rat social interaction dataset, RatSI. We demonstrate the practical value of the novel dataset by using it as the training set for a rat interaction recognition method. We show that behavior variations induced by the experiment setting can lead to reduced performance, which illustrates the importance of cross-dataset validation. Consequently, we add a simple adaptation step to our method and improve the recognition performance. Most existing methods are trained and evaluated in one experimental setting, which limits the predictive power of the evaluation to that particular setting. We demonstrate that cross-dataset experiments provide more insight in the performance of classifiers. With our novel, public dataset we encourage the development and validation of automated recognition methods. We are convinced that cross-dataset validation enhances our understanding of rodent interactions and facilitates the development of more sophisticated recognition methods. Combining them with adaptation techniques may enable us to apply automated recognition methods to a variety of animals and experiment settings. Copyright © 2017 Elsevier B.V. All rights reserved.

  13. Extreme infrared variables from UKIDSS - II. An end-of-survey catalogue of eruptive YSOs and unusual stars

    NASA Astrophysics Data System (ADS)

    Lucas, P. W.; Smith, L. C.; Contreras Peña, C.; Froebrich, D.; Drew, J. E.; Kumar, M. S. N.; Borissova, J.; Minniti, D.; Kurtev, R.; Monguió, M.

    2017-12-01

    We present a catalogue of 618 high-amplitude infrared variable stars (1 < ΔK < 5 mag) detected between the two widely separated epochs of 2.2 μm data in the UKIDSS Galactic plane survey, from searches covering ∼1470 deg^2. Most were discovered by a search of all fields at 30 < l < 230°. Sources include new dusty Mira variables, three new cataclysmic variable candidates, a blazar and a peculiar source that may be an interacting binary system. However, ∼60 per cent are young stellar objects (YSOs), based on spatial association with star-forming regions at distances ranging from 300 pc to over 10 kpc. This confirms our initial result in Contreras Peña et al. (Paper I) that YSOs dominate the high-amplitude infrared variable sky in the Galactic disc. It is also supported by recently published VISTA Variables in the Via Lactea (VVV) results at 295 < l < 350°. The spectral energy distributions of the YSOs indicate class I or flat-spectrum systems in most cases, as in the VVV sample. A large number of variable YSOs are associated with the Cygnus X complex, and other groups are associated with the North America/Pelican nebula, the Gemini OB1 molecular cloud, the Rosette complex, the Cone nebula, the W51 star-forming region and the S86 and S236 H II regions. Most of the YSO variability is likely due to variable/episodic accretion on time-scales of years, albeit usually less extreme than classical FUors and EXors. Luminosities at the 2010 Wide-field Infrared Survey Explorer epoch range from ∼0.1 to 10^3 L⊙ but only rarely exceed 10^{2.5} L⊙.

  14. The discovery of a very cool binary system

    NASA Astrophysics Data System (ADS)

    Burningham, Ben; Leggett, S. K.; Lucas, P. W.; Pinfield, D. J.; Smart, R. L.; Day-Jones, A. C.; Jones, H. R. A.; Murray, D.; Nickson, E.; Tamura, M.; Zhang, Z.; Lodieu, N.; Tinney, C. G.; Zapatero Osorio, M. R.

    2010-06-01

    We report the discovery of a very cool d/sdL7+T7.5p common proper motion binary system, SDSS J1416+13AB, found by cross-matching the United Kingdom Infrared Telescope (UKIRT) Infrared Deep Sky Survey (UKIDSS) Large Area Survey Data Release 5 (UKIDSS LAS DR4) against the Sloan Digital Sky Survey Data Release 7. The d/sdL7 is blue in J - H and H - K and has other features suggestive of low metallicity and/or high gravity. The T7.5p displays spectral peculiarity seen before in earlier type dwarfs discovered in UKIDSS LAS DR4, and referred to as CH4-J-early peculiarity, where the CH4-J index, based on the absorption to the red side of the J-band peak, suggests an earlier spectral type than the H2O-J index, based on the blue side of the J-band peak, by ~2 subtypes. We suggest that CH4-J-early peculiarity arises from low metallicity and/or high gravity, and speculate as to its use for classifying T dwarfs. UKIDSS and follow-up United Kingdom Infrared Telescope/Wide Field CAMera (UKIRT/WFCAM) photometry shows the T dwarf to have the bluest near-infrared colours yet seen for such an object, with H - K = -1.31 +/- 0.17. Warm Spitzer IRAC photometry shows the T dwarf to have an extremely red H - [4.5] = 4.86 +/- 0.04, the reddest yet seen for a substellar object. The lack of a parallax measurement for the pair limits our ability to estimate parameters for the system. However, applying a conservative distance estimate of 5-15 pc suggests a projected separation in the range 45-135 au. By comparing the H - K : H - [4.5] colours of the T dwarf to spectral models, we estimate that Teff = 500 K and [M/H] ~ -0.30, with log g ~ 5.0. This suggests a mass of ~30 MJupiter for the T dwarf and an age of ~10 Gyr for the system. The primary would then be a 75 MJupiter object with log g ~ 5.5 and a relatively dust-free Teff ~ 1500 K atmosphere. Given the unusual properties of the system we caution that these estimates are uncertain. We eagerly await parallax measurements and high-resolution imaging.

  15. Segmentation of Unstructured Datasets

    NASA Technical Reports Server (NTRS)

    Bhat, Smitha

    1996-01-01

    Datasets generated by computer simulations and experiments in Computational Fluid Dynamics tend to be extremely large and complex. It is difficult to visualize these datasets using standard techniques like Volume Rendering and Ray Casting. Object Segmentation provides a technique to extract and quantify regions of interest within these massive datasets. This thesis explores basic algorithms to extract coherent amorphous regions from two-dimensional and three-dimensional scalar unstructured grids. The techniques are applied to datasets from Computational Fluid Dynamics and from Finite Element Analysis.

  16. Fifteen new T dwarfs discovered in the UKIDSS Large Area Survey

    NASA Astrophysics Data System (ADS)

    Pinfield, D. J.; Burningham, B.; Tamura, M.; Leggett, S. K.; Lodieu, N.; Lucas, P. W.; Mortlock, D. J.; Warren, S. J.; Homeier, D.; Ishii, M.; Deacon, N. R.; McMahon, R. G.; Hewett, P. C.; Zapatero Osorio, M. R.; Martin, E. L.; Jones, H. R. A.; Venemans, B. P.; Day-Jones, A. C.; Dobbie, P. D.; Folkes, S. L.; Dye, S.; Allard, F.; Baraffe, I.; Barrado y Navascués, D.; Casewell, S. L.; Chiu, K.; Chabrier, G.; Clarke, F.; Hodgkin, S. T.; Magazzù, A.; McCaughrean, M. J.; Nakajima, T.; Pavlenko, Y.; Tinney, C. G.

    2008-10-01

    We present the discovery of 15 new T2.5-T7.5 dwarfs (with estimated distances ~24-93 pc), identified in the first three main data releases of the United Kingdom Infrared Telescope (UKIRT) Infrared Deep Sky Survey. This brings the total number of T dwarfs discovered in the Large Area Survey (LAS) to date to 28. These discoveries are confirmed by near-infrared spectroscopy, from which we derive spectral types on the unified scheme of Burgasser et al. Seven of the new T dwarfs have spectral types of T2.5-T4.5, five have spectral types of T5-T5.5, one is a T6.5p and two are T7-T7.5. We assess spectral morphology and colours to identify T dwarfs in our sample that may have non-typical physical properties (by comparison to solar neighbourhood populations), and find that three of these new T dwarfs may have unusual metallicity, two may have low surface gravity, and one may have high surface gravity. The colours of the full sample of LAS T dwarfs show a possible trend to bluer Y - J with decreasing effective temperature, and some interesting colour changes in J - H and z - J (deserving further investigation) beyond T8. The LAS T dwarf sample from the first and second main data releases shows evidence for a good level of completeness to J = 19. By accounting for the main sources of incompleteness (selection, follow-up and spatial), as well as the effects of unresolved binarity and Malmquist and Eddington bias, we estimate that there are 17 +/- 4 dwarfs of spectral type >=T4 in the J <= 19 volume of the LAS second data release. This value is most consistent with theoretical predictions if the substellar mass function exponent α (dN/dm ~ m^-α) lies between -1.0 and 0. This is consistent with the latest 2-Micron All Sky Survey (2MASS)/Sloan Digital Sky Survey (SDSS) constraint (which is based on lower number statistics) and is significantly lower than the α ~ 1.0 suggested by L dwarf field populations, possibly a result of the lower mass range probed by the T dwarf class.

  17. Fixing Dataset Search

    NASA Technical Reports Server (NTRS)

    Lynnes, Chris

    2014-01-01

    Three current search engines are queried for ozone data at the GES DISC. The results range from sub-optimal to counter-intuitive. We propose a method to fix dataset search by implementing a robust relevancy ranking scheme. The relevancy ranking scheme is based on several heuristics culled from more than 20 years of helping users select datasets.

  18. FLUXNET2015 Dataset: Batteries included

    NASA Astrophysics Data System (ADS)

    Pastorello, G.; Papale, D.; Agarwal, D.; Trotta, C.; Chu, H.; Canfora, E.; Torn, M. S.; Baldocchi, D. D.

    2016-12-01

    The synthesis datasets have become one of the signature products of the FLUXNET global network. They are composed from contributions of individual site teams to regional networks and then compiled into uniform data products, now used in a wide variety of research efforts: from plant-scale microbiology to global-scale climate change. The FLUXNET Marconi Dataset in 2000 was the first in the series, followed by the FLUXNET LaThuile Dataset in 2007, with significant additions of data products and coverage, solidifying the adoption of the datasets as a research tool. The FLUXNET2015 Dataset brings another round of substantial improvements, including extended quality control processes and checks, use of downscaled reanalysis data for filling long gaps in micrometeorological variables, multiple methods for USTAR threshold estimation and flux partitioning, and uncertainty estimates, all accompanied by auxiliary flags. This "batteries included" approach provides a great deal of information for anyone who wants to explore the data (and the processing methods) in detail, and it inevitably leads to a large number of data variables. Although dealing with all these variables might seem overwhelming at first, especially to someone looking at eddy covariance data for the first time, there is method to our madness. In this work we describe the data products and variables that are part of the FLUXNET2015 Dataset and the rationale behind the organization of the dataset, covering the simplified version (labeled SUBSET), the complete version (labeled FULLSET), and the auxiliary products in the dataset.

  19. Isfahan MISP Dataset

    PubMed Central

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website was entitled “biosigdata.com.” It was a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users could download the datasets and could also share their own supplementary materials while retaining their rights (citation and fee). Commenting was also available for all datasets, and an automatic sitemap and semi-automatic SEO indexing have been set up for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf). PMID:28487832

  20. Isfahan MISP Dataset.

    PubMed

    Kashefpur, Masoud; Kafieh, Rahele; Jorjandi, Sahar; Golmohammadi, Hadis; Khodabande, Zahra; Abbasi, Mohammadreza; Teifuri, Nilufar; Fakharzadeh, Ali Akbar; Kashefpoor, Maryam; Rabbani, Hossein

    2017-01-01

    An online depository was introduced to share clinical ground truth with the public and provide open access for researchers to evaluate their computer-aided algorithms. PHP was used for web programming and MySQL for database management. The website was entitled "biosigdata.com." It was a fast, secure, and easy-to-use online database for medical signals and images. Freely registered users could download the datasets and could also share their own supplementary materials while retaining their rights (citation and fee). Commenting was also available for all datasets, and an automatic sitemap and semi-automatic SEO indexing have been set up for the site. A comprehensive list of available websites for medical datasets is also presented as a Supplementary (http://journalonweb.com/tempaccess/4800.584.JMSS_55_16I3253.pdf).

  1. Preprocessed Consortium for Neuropsychiatric Phenomics dataset.

    PubMed

    Gorgolewski, Krzysztof J; Durnez, Joke; Poldrack, Russell A

    2017-01-01

    Here we present preprocessed MRI data of 265 participants from the Consortium for Neuropsychiatric Phenomics (CNP) dataset. The preprocessed dataset includes minimally preprocessed data in the native, MNI and surface spaces accompanied with potential confound regressors, tissue probability masks, brain masks and transformations. In addition the preprocessed dataset includes unthresholded group level and single subject statistical maps from all tasks included in the original dataset. We hope that availability of this dataset will greatly accelerate research.

  2. Preliminary AirMSPI Datasets

    Atmospheric Science Data Center

    2018-02-26

    The data files available through this web page and ftp links are preliminary AirMSPI datasets from recent campaigns ... and geometric corrections. Caution should be used for science analysis. At a later date, more qualified versions will be made public.

  3. Open University Learning Analytics dataset.

    PubMed

    Kuzilek, Jakub; Hlosta, Martin; Zdrahal, Zdenek

    2017-11-28

    Learning Analytics focuses on the collection and analysis of learners' data to improve their learning experience by providing informed guidance and to optimise learning materials. To support the research in this area we have developed a dataset, containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students' interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behaviour, represented by their actions. The dataset contains the information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license.
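    The daily click summaries mentioned above are simple aggregations of raw VLE interaction logs. The sketch below shows the kind of roll-up involved, using pandas on a tiny invented log; the column names are modelled on the dataset's studentVle table (id_student, date, sum_click) but the values here are illustrative only and should be checked against the actual files.

```python
import pandas as pd

# Tiny invented VLE click log; column names follow the OULAD studentVle
# table but the rows are made up for illustration.
logs = pd.DataFrame({
    "id_student": [1, 1, 1, 2, 2],
    "date":       [0, 0, 1, 0, 1],   # day relative to course start
    "sum_click":  [3, 2, 5, 1, 4],
})

# Roll raw interactions up to one row per student per day.
daily = logs.groupby(["id_student", "date"], as_index=False)["sum_click"].sum()
```

    Joining such summaries with the demographic and assessment tables is what enables the behavioural analyses the dataset was designed for.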

  4. Open University Learning Analytics dataset

    PubMed Central

    Kuzilek, Jakub; Hlosta, Martin; Zdrahal, Zdenek

    2017-01-01

    Learning Analytics focuses on the collection and analysis of learners’ data to improve their learning experience by providing informed guidance and to optimise learning materials. To support the research in this area we have developed a dataset, containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students’ interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behaviour, represented by their actions. The dataset contains the information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license. PMID:29182599

  5. Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets.

    PubMed

    Marston, Louise; Peacock, Janet L; Yu, Keming; Brocklehurst, Peter; Calvert, Sandra A; Greenough, Anne; Marlow, Neil

    2009-07-01

Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. For the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except generalised least squares multilevel modelling (ML GH 'xtlogit' in Stata), gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that this non-independence is accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children in which there are few multiples), there appears to be less need to adjust for clustering.
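One of the simpler remedies the abstract recommends, regression with standard errors adjusted for clustering, can be sketched with the usual cluster-robust sandwich estimator. The code below is a minimal numpy illustration of that idea under simulated twin-pair data, not the Stata analyses used in the paper.

```python
import numpy as np

def cluster_robust_se(x, y, clusters):
    """OLS with cluster-robust (sandwich) standard errors.

    A minimal sketch: the 'meat' sums, over clusters, the outer products
    of the per-cluster score contributions X_g' u_g, so within-cluster
    correlation of residuals (e.g. multiple births) inflates the SEs.
    """
    X = np.column_stack([np.ones(len(y)), x])          # add intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        idx = clusters == g
        s = X[idx].T @ resid[idx]
        meat += np.outer(s, s)
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Simulated data: 100 twin pairs sharing a cluster-level random effect.
rng = np.random.default_rng(0)
clusters = np.repeat(np.arange(100), 2)
u = rng.normal(0.0, 1.0, 100)[clusters]                # shared within pair
x = rng.normal(0.0, 1.0, 200)
y = 1.0 + 2.0 * x + u + rng.normal(0.0, 0.5, 200)
beta, se = cluster_robust_se(x, y, clusters)
```

The true slope (2.0 here) is recovered by OLS; only the standard errors change when clustering is accounted for.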

  6. Background qualitative analysis of the European Reference Life Cycle Database (ELCD) energy datasets - part I: fuel datasets.

    PubMed

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

The aim of this study is to identify areas of potential improvement in the European Reference Life Cycle Database (ELCD) fuel datasets. The revision is based on the data quality indicators described by the ILCD Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD fuel datasets are of very good quality in general terms; nevertheless, some findings and recommendations for improving the quality of Life-Cycle Inventories have been derived. Moreover, these results confirm the quality of the fuel-related datasets for any LCA practitioner, and provide insights into the limitations and assumptions underlying the dataset modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD fuel datasets is appropriate to the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers seeking to improve the overall DQR of databases.

  7. Design of an audio advertisement dataset

    NASA Astrophysics Data System (ADS)

    Fu, Yutao; Liu, Jihong; Zhang, Qi; Geng, Yuting

    2015-12-01

As more and more advertisements crowd radio broadcasts, it is necessary to establish an audio advertising dataset that can be used to analyse and classify advertisements. This paper presents a method for building a complete audio advertising dataset. The dataset is divided into four kinds of advertisements. Each advertisement sample is given in *.wav format and annotated with a txt file containing its file name, sampling frequency, channel number, broadcasting time and class. The soundness of the classification scheme is demonstrated by clustering the different advertisements based on Principal Component Analysis (PCA). The experimental results show that this audio advertisement dataset offers a reliable set of samples for related audio advertisement studies.
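The PCA-based check described in the abstract can be sketched in a few lines. The feature vectors below are synthetic stand-ins for audio features (the paper does not specify which features were used), so the separation shown is illustrative only.

```python
import numpy as np

def pca_project(features, k=2):
    """Project feature vectors onto their top-k principal components via SVD."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Two synthetic "advertisement classes" with different feature means;
# a genuine class structure should separate in the leading components.
rng = np.random.default_rng(0)
class_a = rng.normal(0.0, 1.0, (50, 20))
class_b = rng.normal(4.0, 1.0, (50, 20))
proj = pca_project(np.vstack([class_a, class_b]))
```

If the class labels are rational, clusters in the projected space should coincide with them, which is the consistency check the abstract describes.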

  8. Background qualitative analysis of the European reference life cycle database (ELCD) energy datasets - part II: electricity datasets.

    PubMed

    Garraín, Daniel; Fazio, Simone; de la Rúa, Cristina; Recchioni, Marco; Lechón, Yolanda; Mathieux, Fabrice

    2015-01-01

The aim of this paper is to identify areas of potential improvement in the European Reference Life Cycle Database (ELCD) electricity datasets. The revision is based on the data quality indicators described by the International Life Cycle Data system (ILCD) Handbook, applied on a sectorial basis. These indicators evaluate the technological, geographical and time-related representativeness of the dataset and its appropriateness in terms of completeness, precision and methodology. Results show that the ELCD electricity datasets are of very good quality in general terms; nevertheless, some findings and recommendations for improving the quality of Life-Cycle Inventories have been derived. Moreover, these results confirm the quality of the electricity-related datasets for any LCA practitioner, and provide insights into the limitations and assumptions underlying the dataset modelling. Given this information, the LCA practitioner will be able to decide whether the use of the ELCD electricity datasets is appropriate to the goal and scope of the analysis to be conducted. The methodological approach would also be useful for dataset developers and reviewers seeking to improve the overall Data Quality Requirements of databases.

  9. 47 new T dwarfs from the UKIDSS Large Area Survey

    NASA Astrophysics Data System (ADS)

    Burningham, Ben; Pinfield, D. J.; Lucas, P. W.; Leggett, S. K.; Deacon, N. R.; Tamura, M.; Tinney, C. G.; Lodieu, N.; Zhang, Z. H.; Huelamo, N.; Jones, H. R. A.; Murray, D. N.; Mortlock, D. J.; Patel, M.; Barrado Y Navascués, D.; Zapatero Osorio, M. R.; Ishii, M.; Kuzuhara, M.; Smart, R. L.

    2010-08-01

We report the discovery of 47 new T dwarfs in the Fourth Data Release (DR4) from the Large Area Survey (LAS) of the United Kingdom Infrared Telescope (UKIRT) Infrared Deep Sky Survey, with spectral types ranging from T0 to T8.5. These bring the total sample of LAS T dwarfs to 80 as of DR4. In assigning spectral types to our objects we have identified eight new spectrally peculiar objects, and divide seven of them into two classes: H2O-H-early objects have an H2O-H index that differs from the H2O-J index by at least two subtypes, and CH4-J-early objects have a CH4-J index that disagrees with the H2O-J index by at least two subtypes. We have ruled out binarity as a sole explanation for both types of peculiarity, and suggest that they may represent hitherto unrecognized tracers of composition and/or gravity. Clear trends in z'(AB) - J and Y - J are apparent for our sample, consistent with weakening absorption in the red wing of the K I line at 0.77 μm with decreasing effective temperature. We have used our sample to estimate space densities for T6-T9 dwarfs. By comparing our sample to Monte Carlo simulations of field T dwarfs for various mass functions of the form ψ(M) ∝ M^-α (in units of pc^-3 M⊙^-1), we have placed weak constraints on the form of the field mass function. Our analysis suggests that the substellar mass function is declining at lower masses, with negative values of α preferred. This is at odds with results for young clusters, which have generally been found to have α > 0.

  10. Benchmark Dataset for Whole Genome Sequence Compression.

    PubMed

    C L, Biji; S Nair, Achuthsankar

    2017-01-01

Research in DNA data compression lacks a standard dataset for testing compression tools specific to DNA. This paper argues that the current state of achievement in DNA compression cannot be benchmarked in the absence of a scientifically compiled whole-genome sequence dataset, and proposes a benchmark dataset built using a multistage sampling procedure. Taking the genome sequences of organisms available at the National Center for Biotechnology Information (NCBI) as the universe, the proposed dataset selects 1,105 prokaryotes, 200 plasmids, 164 viruses, and 65 eukaryotes. This paper reports the results of using three established tools on the newly compiled dataset and shows that their strengths and weaknesses become evident only through a comparison based on the scientifically compiled benchmark dataset. The sample dataset and the respective links are available @ https://sourceforge.net/projects/benchmarkdnacompressiondataset/.
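As a baseline for the kind of benchmarking the abstract describes, a crude compression-ratio metric can be computed with a general-purpose compressor. gzip here merely stands in for the DNA-specific tools evaluated in the paper, and the test sequences are synthetic.

```python
import gzip
import random

def compression_ratio(seq: bytes) -> float:
    """Uncompressed size over compressed size; higher means more compressible."""
    return len(seq) / len(gzip.compress(seq))

# A highly repetitive sequence compresses far better than a mixed one,
# which is why benchmark composition matters when comparing tools.
repetitive = b"ACGT" * 1000
random.seed(0)
mixed = bytes(random.choice(b"ACGT") for _ in range(4000))
```

A benchmark skewed toward repetitive genomes would flatter every tool, which motivates the multistage sampling across taxa described above.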

  11. Subsampling for dataset optimisation

    NASA Astrophysics Data System (ADS)

    Ließ, Mareike

    2017-04-01

Soil-landscapes have formed by the interaction of soil-forming factors and pedogenic processes. In modelling these landscapes in their pedodiversity and the underlying processes, a representative, unbiased dataset is required. This concerns model input as well as output data. However, very often big datasets are available which are highly heterogeneous and were gathered for various purposes, but not to model a particular process or data space. As a first step, the overall data space and/or landscape section to be modelled needs to be identified, including considerations regarding scale and resolution. Then the available dataset needs to be optimised via subsampling to represent this n-dimensional data space well. A couple of well-known sampling designs may be adapted to suit this purpose. The overall approach follows three main strategies: (1) The data space may be condensed and de-correlated by a factor analysis to facilitate the subsampling process. (2) Different methods of pattern recognition serve to structure the n-dimensional data space to be modelled into units which then form the basis for the optimisation of an existing dataset through a sensible selection of samples. Along the way, data units for which there is currently insufficient soil data available may be identified. (3) Random samples from the n-dimensional data space may be replaced by similar samples from the available dataset. While a representative dataset is a precondition for developing data-driven statistical models, this approach may also help to develop universal process models and to identify limitations in existing models.
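Strategy (2), structuring the data space into units and selecting representatives from each, can be sketched by gridding a 2-D data space and keeping one sample per occupied cell. The uniform grid and the keep-first rule are illustrative choices, not the pattern-recognition methods the abstract refers to.

```python
import numpy as np

def subsample_by_bins(data, n_bins=10):
    """Keep one representative sample per occupied cell of a gridded data space.

    data: array of shape (n_samples, n_dims). Returns sorted sample indices.
    """
    lo, hi = data.min(axis=0), data.max(axis=0)
    # Map each sample to an integer grid cell in [0, n_bins) per dimension.
    cells = np.floor((data - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
    chosen = {}
    for i, cell in enumerate(map(tuple, cells)):
        chosen.setdefault(cell, i)        # first sample seen in each cell
    return np.array(sorted(chosen.values()))

# A heterogeneous 2-D dataset reduced to at most n_bins**2 representatives.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, (500, 2))
idx = subsample_by_bins(data, n_bins=5)
```

Empty cells in the grid correspond to the under-sampled data units the abstract says can be identified along the way.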

  12. Decibel: The Relational Dataset Branching System

    PubMed Central

    Maddox, Michael; Goehring, David; Elmore, Aaron J.; Madden, Samuel; Parameswaran, Aditya; Deshpande, Amol

    2017-01-01

    As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in wasted storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs. PMID:28149668

  13. A global distributed basin morphometric dataset

    NASA Astrophysics Data System (ADS)

    Shen, Xinyi; Anagnostou, Emmanouil N.; Mei, Yiwen; Hong, Yang

    2017-01-01

Basin morphometry is vital information for relating storms to hydrologic hazards such as landslides and floods. In this paper we present the first comprehensive global dataset of distributed basin morphometry at 30 arc-second resolution. The dataset includes nine prime morphometric variables; in addition, we present formulas for generating twenty-one further morphometric variables from combinations of the prime variables. The dataset can aid applications including studies of land-atmosphere interaction and the modelling of floods and droughts for sustainable water management. The validity of the dataset has been corroborated by successfully reproducing Hack's law.
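Hack's law, mentioned above as the validation check, relates mainstem length to drainage area as L ≈ c·A^h with h ≈ 0.6. The sketch below recovers the exponent from synthetic basins by log-log regression; the coefficient 1.4 and the scatter level are illustrative assumptions, not values from the dataset.

```python
import numpy as np

def hack_exponent(lengths, areas):
    """Estimate the Hack's law exponent h in L = c * A**h by log-log least squares."""
    h, _ = np.polyfit(np.log(areas), np.log(lengths), 1)
    return h

# Synthetic basins following L = 1.4 * A**0.6 with lognormal scatter.
rng = np.random.default_rng(1)
areas = rng.uniform(1.0, 1e4, 200)
lengths = 1.4 * areas**0.6 * rng.lognormal(0.0, 0.05, 200)
h = hack_exponent(lengths, areas)
```

Applying the same regression to lengths and areas drawn from a real morphometric dataset is one way to "reproduce Hack's law" as a consistency check.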

  14. Growing up in a megalopolis: environmental effects on galaxy evolution in a supercluster at z ˜ 0.65 in UKIDSS UDS

    NASA Astrophysics Data System (ADS)

    Galametz, Audrey; Pentericci, Laura; Castellano, Marco; Mendel, Trevor; Hartley, Will G.; Fossati, Matteo; Finoguenov, Alexis; Almaini, Omar; Beifiori, Alessandra; Fontana, Adriano; Grazian, Andrea; Scodeggio, Marco; Kocevski, Dale D.

    2018-04-01

    We present a large-scale galaxy structure Cl J021734-0513 at z ˜ 0.65 discovered in the UKIDSS UDS field, made of ˜20 galaxy groups and clusters, spreading over 10 Mpc. We report on a VLT/VIMOS spectroscopic follow-up program that, combined with past spectroscopy, allowed us to confirm four galaxy clusters (M200 ˜ 1014 M⊙) and a dozen associated groups and star-forming galaxy overdensities. Two additional filamentary structures at z ˜ 0.62 and 0.69 and foreground and background clusters at 0.6 < z < 0.7 were also confirmed along the line of sight. The structure subcomponents are at different formation stages. The clusters have a core dominated by passive galaxies and an established red sequence. The remaining structures are a mix of star-forming galaxy overdensities and forming groups. The presence of quiescent galaxies in the core of the latter shows that `pre-processing' has already happened before the groups fall into their more massive neighbours. Our spectroscopy allows us to derive spectral index measurements e.g. emission/absorption line equivalent widths, strength of the 4000 Å break, valuable to investigate the star formation history of structure members. Based on these line measurements, we select a population of `post-starburst' galaxies. These galaxies are preferentially found within the virial radius of clusters, supporting a scenario in which their recent quenching could be prompted by gas stripping by the dense intracluster medium. We derive stellar age estimates using Markov Chain Monte Carlo-based spectral fitting for quiescent galaxies and find a correlation between ages and colours/stellar masses which favours a top-down formation scenario of the red sequence. A catalogue of ˜650 redshifts in UDS is released alongside the paper (via MNRAS online data).

  15. Bayesian correlated clustering to integrate multiple datasets

    PubMed Central

    Kirk, Paul; Griffin, Jim E.; Savage, Richard S.; Ghahramani, Zoubin; Wild, David L.

    2012-01-01

    Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct—but often complementary—information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI’s performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation–chip and protein–protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques—as well as to non-integrative approaches—demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Availability: A Matlab implementation of MDI is available from http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/. Contact: D.L.Wild@warwick.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID

  16. The stellar masses of ˜ 40 000 UV selected Galaxies from the WiggleZ survey at 0.3

    NASA Astrophysics Data System (ADS)

    Banerji, Manda; Glazebrook, Karl; Blake, Chris; Brough, Sarah; Colless, Matthew; Contreras, Carlos; Couch, Warrick; Croton, Darren J.; Croom, Scott; Davis, Tamara M.; Drinkwater, Michael J.; Forster, Karl; Gilbank, David; Gladders, Mike; Jelliffe, Ben; Jurek, Russell J.; Li, I.-hui; Madore, Barry; Martin, D. Christopher; Pimbblet, Kevin; Poole, Gregory B.; Pracy, Michael; Sharp, Rob; Wisnioski, Emily; Woods, David; Wyder, Ted K.; Yee, H. K. C.

    2013-05-01

    We characterize the stellar masses and star formation rates in a sample of ˜40 000 spectroscopically confirmed UV-luminous galaxies at 0.3 < z < 1.0 selected from within the WiggleZ Dark Energy Survey. In particular, we match this UV bright population to wide-field infrared surveys such as the near-infrared (NIR) UKIDSS Large Area Survey (LAS) and the mid-infrared Wide-Field Infrared Survey Explorer (WISE) All-Sky Survey. We find that ˜30 per cent of the UV-luminous WiggleZ galaxies, corresponding to the brightest and reddest subset, are detected at >5σ in the UKIDSS-LAS at all redshifts. An even more luminous subset of 15 per cent are also detected in the WISE 3.4 and 4.6 μm bands. In addition, 22 of the WiggleZ galaxies are extremely luminous at 12 and 22 μm and have colours consistent with being star formation dominated. We compute stellar masses for this very large sample of extremely blue galaxies and quantify the sensitivity of the stellar mass estimates to various assumptions made during the spectral energy distribution (SED) fitting. The median stellar masses are log10(M*/M⊙) = 9.6 ± 0.7, 10.2 ± 0.5 and 10.4 ± 0.4 for the IR undetected, UKIDSS detected and UKIDSS+WISE detected galaxies, respectively. We demonstrate that the inclusion of NIR photometry can lead to tighter constraints on the stellar masses by bringing down the upper bound on the stellar mass estimate. The mass estimates are found to be most sensitive to the inclusion of secondary bursts of star formation as well as changes in the stellar population synthesis models, both of which can lead to median discrepancies of the order of 0.3 dex in the stellar masses. We conclude that even for these extremely blue galaxies, different SED fitting codes therefore produce extremely robust stellar mass estimates. We find, however, that the best-fitting M/LK is significantly lower than that predicted by simple optical colour-based estimators for many of the WiggleZ galaxies. The simple colour

  17. Handwritten mathematical symbols dataset.

    PubMed

    Chajri, Yassine; Bouikhalene, Belaid

    2016-06-01

Due to the technological advances of recent years, paper-based scientific documents are used less and less, and the trend in the scientific community towards digital documents has increased considerably. Among these documents are scientific documents and, more specifically, mathematics documents. In this context, we present our own dataset of handwritten mathematical symbols composed of 10,379 images. This dataset gathers Arabic characters, Latin characters, Arabic numerals, Latin numerals, arithmetic operators, set symbols, comparison symbols, delimiters, etc.

  18. 77 FR 15052 - Dataset Workshop-U.S. Billion Dollar Disasters Dataset (1980-2011): Assessing Dataset Strengths...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-03-14

... and related methodology. Emphasis will be placed on dataset accuracy and time-dependent biases. Pathways to overcome accuracy and bias issues will be an important focus. Participants will consider...] • Guidance for improving these methods. • Recommendations for rectifying any known time-dependent biases...

  19. Cool White Dwarfs Found in the UKIRT Infrared Deep Sky Survey

    NASA Astrophysics Data System (ADS)

    Leggett, S. K.; Lodieu, N.; Tremblay, P.-E.; Bergeron, P.; Nitta, A.

    2011-07-01

We present the results of a search for cool white dwarfs in the United Kingdom InfraRed Telescope (UKIRT) Infrared Deep Sky Survey (UKIDSS) Large Area Survey (LAS). The UKIDSS LAS photometry was paired with the Sloan Digital Sky Survey to identify cool hydrogen-rich white dwarf candidates by their neutral optical colors and blue near-infrared colors, as well as faint reduced proper motion magnitudes. Optical spectroscopy was obtained at Gemini Observatory and showed the majority of the candidates to be newly identified cool degenerates, with a small number of G- to K-type (sub)dwarf contaminants. Our initial search of 280 deg² of sky resulted in seven new white dwarfs with effective temperature Teff ≈ 6000 K. The current follow-up of 1400 deg² of sky has produced 13 new white dwarfs. Model fits to the photometry show that seven of the newly identified white dwarfs have 4120 K ≤ Teff ≤ 4480 K, and cooling ages between 7.3 Gyr and 8.7 Gyr; they have 40 km s-1 ≤ vtan ≤ 85 km s-1 and are likely to be thick disk 10-11 Gyr-old objects. The other half of the sample has 4610 K ≤ Teff ≤ 5260 K, cooling ages between 4.3 Gyr and 6.9 Gyr, and 60 km s-1 ≤ vtan ≤ 100 km s-1. These are either thin disk remnants with unusually high velocities, or lower-mass remnants of thick disk or halo late-F or G stars.

  20. NP-PAH Interaction Dataset

    EPA Pesticide Factsheets

Dataset presents concentrations of organic pollutants, such as polyaromatic hydrocarbon compounds, in water samples. Water samples of known volume and concentration were allowed to equilibrate with a known mass of nanoparticles. The mixture was then ultracentrifuged and sampled for analysis. This dataset is associated with the following publication: Sahle-Demessie, E., A. Zhao, C. Han, B. Hann, and H. Grecsek. Interaction of engineered nanomaterials with hydrophobic organic pollutants. Journal of Nanotechnology. Hindawi Publishing Corporation, New York, NY, USA, 27(28): 284003, (2016).

  1. Handwritten mathematical symbols dataset

    PubMed Central

    Chajri, Yassine; Bouikhalene, Belaid

    2016-01-01

Due to the technological advances of recent years, paper-based scientific documents are used less and less, and the trend in the scientific community towards digital documents has increased considerably. Among these documents are scientific documents and, more specifically, mathematics documents. In this context, we present our own dataset of handwritten mathematical symbols composed of 10,379 images. This dataset gathers Arabic characters, Latin characters, Arabic numerals, Latin numerals, arithmetic operators, set symbols, comparison symbols, delimiters, etc. PMID:27006975

  2. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets

    PubMed Central

    Wernisch, Lorenz

    2017-01-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm. PMID:29036190

  3. Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.

    PubMed

    Gabasova, Evelina; Reid, John; Wernisch, Lorenz

    2017-10-01

    Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.

  4. TRI Preliminary Dataset

    EPA Pesticide Factsheets

The TRI preliminary dataset includes the most current TRI data available and reflects toxic chemical releases and pollution prevention activities that occurred at TRI facilities during each calendar year.

  5. Comparison of recent SnIa datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanchez, J.C. Bueno; Perivolaropoulos, L.; Nesseris, S., E-mail: jbueno@cc.uoi.gr, E-mail: nesseris@nbi.ku.dk, E-mail: leandros@uoi.gr

    2009-11-01

We rank the six latest Type Ia supernova (SnIa) datasets (Constitution (C), Union (U), ESSENCE (Davis) (E), Gold06 (G), SNLS 1yr (S) and SDSS-II (D)) in the context of the Chevallier-Polarski-Linder (CPL) parametrization w(a) = w0 + w1(1−a), according to their Figure of Merit (FoM), their consistency with the cosmological constant (ΛCDM), their consistency with standard rulers (Cosmic Microwave Background (CMB) and Baryon Acoustic Oscillations (BAO)) and their mutual consistency. We find a significant improvement of the FoM (defined as the inverse area of the 95.4% parameter contour) with the number of SnIa in these datasets ((C) highest FoM, then (U), (G), (D), (E), (S) lowest FoM). Standard rulers (CMB+BAO) have a better FoM by about a factor of 3 compared to the highest-FoM SnIa dataset (C). We also find that the ranking sequence based on consistency with ΛCDM is identical to the corresponding ranking based on consistency with standard rulers ((S) most consistent, (D), (C), (E), (U), (G) least consistent). The ranking sequence of the datasets changes, however, when we consider consistency with an expansion history corresponding to evolving dark energy with (w0, w1) = (−1.4, 2), crossing the phantom divide line w = −1 (it is practically reversed to (G), (U), (E), (S), (D), (C)). The SALT2 and MLCS2k2 fitters are also compared, and some peculiar features of the SDSS-II dataset when standardized with the MLCS2k2 fitter are pointed out. Finally, we construct a statistic to estimate the internal consistency of a collection of SnIa datasets. We find that even though there is good consistency among most samples taken from the above datasets, this consistency decreases significantly when the Gold06 (G) dataset is included in the sample.
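The CPL parametrization used in the abstract, w(a) = w0 + w1(1−a), and its standard dark-energy density evolution are straightforward to sketch. The phantom-crossing example (w0, w1) = (−1.4, 2) below is the one quoted in the abstract; the density formula is the standard analytic solution for CPL, stated here for illustration.

```python
import numpy as np

def w_cpl(a, w0, w1):
    """CPL equation of state w(a) = w0 + w1 * (1 - a), a = scale factor."""
    return w0 + w1 * (1.0 - a)

def rho_de(a, w0, w1):
    """Dark-energy density relative to today for CPL:
    rho(a)/rho_0 = a**(-3*(1 + w0 + w1)) * exp(-3 * w1 * (1 - a))."""
    return a ** (-3.0 * (1.0 + w0 + w1)) * np.exp(-3.0 * w1 * (1.0 - a))
```

For (w0, w1) = (−1.4, 2), w(a) crosses the phantom divide w = −1 at a = 0.8; for (w0, w1) = (−1, 0) the density reduces to the constant ΛCDM value.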

  6. A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge

    PubMed Central

    Gururaj, Anupama E.; Chen, Xiaoling; Pournejati, Saeid; Alter, George; Hersh, William R.; Demner-Fushman, Dina; Ohno-Machado, Lucila

    2017-01-01

The rapid proliferation of publicly available biomedical datasets has provided abundant resources that are potentially of value as a means to reproduce prior experiments, and to generate and explore novel hypotheses. However, there are a number of barriers to the re-use of such datasets, which are distributed across a broad array of dataset repositories, focusing on different data types and indexed using different terminologies. New methods are needed to enable biomedical researchers to locate datasets of interest within this rapidly expanding information ecosystem, and new resources are needed for the formal evaluation of these methods as they emerge. In this paper, we describe the design and generation of a benchmark for information retrieval of biomedical datasets, which was developed and used for the 2016 bioCADDIE Dataset Retrieval Challenge. In the tradition of the seminal Cranfield experiments, and as exemplified by the Text Retrieval Conference (TREC), this benchmark includes a corpus (biomedical datasets), a set of queries, and relevance judgments relating these queries to elements of the corpus. This paper describes the process through which each of these elements was derived, with a focus on those aspects that distinguish this benchmark from typical information retrieval reference sets. Specifically, we discuss the origin of our queries in the context of a larger collaborative effort, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium, and the distinguishing features of biomedical dataset retrieval as a task. The resulting benchmark set has been made publicly available to advance research in the area of biomedical dataset retrieval. Database URL: https://biocaddie.org/benchmark-data PMID:29220453

  7. [Spatial domain display for interference image dataset].

    PubMed

    Wang, Cai-Ling; Li, Yu-Shan; Liu, Xue-Bin; Hu, Bing-Liang; Jing, Juan-Juan; Wen, Jia

    2011-11-01

    The need for imaging-interferometer visualization is pressing for users engaged in image interpretation and information extraction. However, conventional research on visualization focuses only on spectral image datasets in the spectral domain, so the quick display of an interference spectral image dataset remains an open problem in interference image processing. The conventional approach to visualizing an interference dataset is to apply a classical spectral-image display method after a Fourier transformation. In the present paper, the problem of quickly viewing interferometer imagery in the image domain is addressed, and an algorithm is proposed that simplifies the matter. The Fourier transformation is an obstacle because its computation time is large, and the situation deteriorates further as the dataset grows. The proposed algorithm, named interference weighted envelopes, frees the display from that transformation. The authors construct three interference weighted envelopes based, respectively, on the Fourier transformation, the features of the interference data, and the human visual system. Comparing the proposed method with the conventional ones shows a large difference in display time.

  8. Comparison of CORA and EN4 in-situ datasets validation methods, toward a better quality merged dataset.

    NASA Astrophysics Data System (ADS)

    Szekely, Tanguy; Killick, Rachel; Gourrion, Jerome; Reverdin, Gilles

    2017-04-01

    CORA and EN4 are both global, delayed-mode, validated in-situ ocean temperature and salinity datasets distributed by the Met Office (http://www.metoffice.gov.uk/) and Copernicus (www.marine.copernicus.eu). A large part of the profiles distributed by CORA and EN4 in recent years are Argo profiles from the Argo DAC, but profiles are also extracted from the World Ocean Database, along with TESAC profiles from GTSPP. In the case of CORA, data coming from the EuroGOOS Regional Operational Observing Systems (ROOS) operated by European institutes not managed by national data centres, as well as other profile datasets provided by scientific sources, can also be found (sea-mammal profiles from MEOP, XBT datasets from cruises, ...). EN4 also takes data from the ASBO dataset to supplement observations in the Arctic. The first advantage of this new merged product is enhanced space and time coverage at global and European scales for the period from 1950 up to the year before the current year. The product is updated once a year, and T&S gridded fields are also generated for the period from 1990 to year n-1. The enhancement compared to the previous CORA product will be presented. Although the profiles distributed by both datasets are mostly the same, the quality-control procedures developed by the Met Office and Copernicus teams differ, sometimes leading to different quality-control flags for the same profile. In 2016 a new study was started that aims to compare both validation procedures, to move towards a Copernicus Marine Service dataset with the best features of the CORA and EN4 validation. A reference dataset composed of the full set of in-situ temperature and salinity measurements collected by Coriolis during 2015 is used. These measurements were made with a wide range of instruments (XBTs, CTDs, Argo floats, instrumented sea mammals, ...), covering the global ocean. The reference dataset has been validated simultaneously by both teams. An exhaustive comparison of the

  9. Secondary analysis of national survey datasets.

    PubMed

    Boo, Sunjoo; Froelicher, Erika Sivarajan

    2013-06-01

    This paper describes the methodological issues associated with secondary analysis of large national survey datasets. Issues about survey sampling, data collection, and non-response and missing data in terms of methodological validity and reliability are discussed. Although reanalyzing large national survey datasets is an expedient and cost-efficient way of producing nursing knowledge, successful investigations require a methodological consideration of the intrinsic limitations of secondary survey analysis. Nursing researchers using existing national survey datasets should understand potential sources of error associated with survey sampling, data collection, and non-response and missing data. Although it is impossible to eliminate all potential errors, researchers using existing national survey datasets must be aware of the possible influence of errors on the results of the analyses. © 2012 The Authors. Japan Journal of Nursing Science © 2012 Japan Academy of Nursing Science.

  10. U.S. Datasets

    Cancer.gov

    Datasets for U.S. mortality, U.S. populations, standard populations, county attributes, and expected survival. Plus SEER-linked databases (SEER-Medicare, SEER-Medicare Health Outcomes Survey [SEER-MHOS], SEER-Consumer Assessment of Healthcare Providers and Systems [SEER-CAHPS]).

  11. Dataset of Scientific Inquiry Learning Environment

    ERIC Educational Resources Information Center

    Ting, Choo-Yee; Ho, Chiung Ching

    2015-01-01

    This paper presents the dataset collected from student interactions with INQPRO, a computer-based scientific inquiry learning environment. The dataset contains records of 100 students and is divided into two portions. The first portion comprises (1) "raw log data", capturing the student's name, interfaces visited, the interface…

  12. Simulation of Smart Home Activity Datasets

    PubMed Central

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-01-01

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendation for future work in intelligent environment simulation. PMID:26087371

  13. Simulation of Smart Home Activity Datasets.

    PubMed

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-06-16

    A globally ageing population is resulting in an increased prevalence of chronic conditions which affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches. Access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendation for future work in intelligent environment simulation.

  14. Providing Geographic Datasets as Linked Data in Sdi

    NASA Astrophysics Data System (ADS)

    Hietanen, E.; Lehto, L.; Latvala, P.

    2016-06-01

    In this study, a prototype service to provide data from a Web Feature Service (WFS) as linked data is implemented. First, persistent and unique Uniform Resource Identifiers (URIs) are created for all spatial objects in the dataset. The objects are available from those URIs in the Resource Description Framework (RDF) data format. Next, a Web Ontology Language (OWL) ontology is created to describe the dataset's information content using the Open Geospatial Consortium's (OGC) GeoSPARQL vocabulary. The existing data model is modified in order to take the linked data principles into account. The implemented service produces an HTTP response dynamically. The data for the response is first fetched from the existing WFS; the Geography Markup Language (GML) output of the WFS is then transformed on-the-fly to the RDF format. Content negotiation is used to serve the data in different RDF serialization formats. This solution facilitates the use of a dataset in different applications without replicating the whole dataset. In addition, individual spatial objects in the dataset can be referred to with URIs. Furthermore, the needed information content of the objects can be easily extracted from the RDF serializations available from those URIs. A solution for linking data objects to the dataset URI is also introduced, using the Vocabulary of Interlinked Datasets (VoID). The dataset is divided into subsets, and each subset is given its own persistent and unique URI. This enables the whole dataset to be explored with a web browser and all individual objects to be indexed by search engines.
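
    The content-negotiation step described in the record above reduces to mapping the HTTP Accept header onto a supported RDF serialization. The MIME-type table and the default below are illustrative assumptions, not the prototype's actual configuration.

```python
# Supported Accept MIME types -> RDF serialization names (illustrative).
RDF_FORMATS = {
    "text/turtle": "turtle",
    "application/rdf+xml": "rdf/xml",
    "application/ld+json": "json-ld",
}

def pick_serialization(accept_header, default="turtle"):
    """Return the first supported RDF serialization named in an Accept header."""
    for part in accept_header.split(","):
        mime = part.split(";")[0].strip().lower()  # drop any q-value suffix
        if mime in RDF_FORMATS:
            return RDF_FORMATS[mime]
    return default

print(pick_serialization("application/ld+json, text/turtle;q=0.8"))  # json-ld
```

    A production service would also honour q-values and return HTTP 406 when nothing matches; the sketch keeps only the dispatch idea.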

  15. VizieR Online Data Catalog: 76 T dwarfs from the UKIDSS LAS (Burningham+, 2013)

    NASA Astrophysics Data System (ADS)

    Burningham, B.; Cardoso, C. V.; Smith, L.; Leggett, S. K.; Smart, R. L.; Mann, A. W.; Dhital, S.; Lucas, P. W.; Tinney, C. G.; Pinfield, D. J.; Zhang, Z.; Morley, C.; Saumon, D.; Aller, K.; Littlefair, S. P.; Homeier, D.; Lodieu, N.; Deacon, N.; Marley, M. S.; van Spaandonk, L.; Baker, D.; Allard, F.; Andrei, A. H.; Canty, J.; Clarke, J.; Day-Jones, A. C.; Dupuy, T.; Fortney, J. J.; Gomes, J.; Ishii, M.; Jones, H. R. A.; Liu, M.; Magazzu, A.; Marocco, F.; Murray, D. N.; Rojas-Ayala, B.; Tamura, M.

    2014-07-01

    Our broad-band NIR photometry was obtained using the UKIRT Fast Track Imager (UFTI) and WFCAM, both mounted on UKIRT across a number of observing runs spanning 2009 to the end of 2010. Differential methane photometry was obtained using the Near Infrared Camera Spectrometer (NICS) mounted on the TNG under programme AOT22 TAC 96 spanning from 2010 to 2012. (5 data files).

  16. VizieR Online Data Catalog: NIR proper motion catalogue from UKIDSS-LAS (Smith+, 2014)

    NASA Astrophysics Data System (ADS)

    Smith, L.; Lucas, P. W.; Burningham, B.; Jones, H. R. A.; Smart, R. L.; Andrei, A. H.; Catalan, S.; Pinfield, D. J.

    2015-07-01

    We constructed two epoch catalogues for each pointing by matching sources within the pairs of multiframes using the Starlink Tables Infrastructure Library Tool Set (STILTS; Taylor 2006, ASP conf. Ser. 351, 666). We required pairs of sources to be uniquely paired to their closest match within 6-arcsec, and we required the J band magnitudes for the two epochs to agree within 0.5mag, to minimize mismatches. (1 data file).
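
    The pairing criteria quoted in the record above (closest match within 6 arcsec, unique pairing, J-band magnitudes agreeing within 0.5 mag) can be sketched on a toy flat-sky catalogue. This is an illustrative simplification, not the STILTS implementation: it ignores the cos(dec) factor and spherical geometry.

```python
import math

def match_epochs(epoch1, epoch2, radius_arcsec=6.0, max_dj=0.5):
    """Pair each epoch-1 source (ra_deg, dec_deg, jmag) with its closest
    epoch-2 source, keeping pairs within the radius and J-mag tolerance
    whose epoch-2 source is not claimed by any other epoch-1 source."""
    pairs = []
    for i, (ra1, dec1, j1) in enumerate(epoch1):
        best, best_d = None, radius_arcsec
        for k, (ra2, dec2, j2) in enumerate(epoch2):
            d = math.hypot((ra1 - ra2) * 3600.0, (dec1 - dec2) * 3600.0)
            if d < best_d:
                best, best_d = k, d
        if best is not None and abs(j1 - epoch2[best][2]) <= max_dj:
            pairs.append((i, best))
    # enforce uniqueness: drop epoch-2 sources claimed more than once
    claimed = {}
    for _, k in pairs:
        claimed[k] = claimed.get(k, 0) + 1
    return [(i, k) for i, k in pairs if claimed[k] == 1]

epoch1 = [(150.0000, 2.0000, 16.2), (150.0100, 2.0100, 17.5)]
epoch2 = [(150.0003, 2.0001, 16.3), (150.0101, 2.0099, 18.9)]
print(match_epochs(epoch1, epoch2))  # the second pair fails the 0.5 mag cut
```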

  17. NATIONAL HYDROGRAPHY DATASET

    EPA Science Inventory

    Resource Purpose:The National Hydrography Dataset (NHD) is a comprehensive set of digital spatial data that contains information about surface water features such as lakes, ponds, streams, rivers, springs and wells. Within the NHD, surface water features are combined to fo...

  18. The Optimum Dataset method - examples of the application

    NASA Astrophysics Data System (ADS)

    Błaszczak-Bąk, Wioleta; Sobieraj-Żłobińska, Anna; Wieczorek, Beata

    2018-01-01

    Data reduction is a procedure to decrease a dataset's size in order to make its analysis more effective and easier. Reduction of a dataset requires proper planning, so that after reduction the result meets all the user's expectations. Ideally, the result is an optimal solution in terms of the adopted criteria. Among reduction methods that provide an optimal solution is the Optimum Dataset method (OptD) proposed by Błaszczak-Bąk (2016). The paper presents the application of this method to different LiDAR datasets and the possibility of using the method for various purposes of study. The following reduced datasets are presented: (a) a measurement of Sielska street in Olsztyn (Airborne Laser Scanning data - ALS data), (b) a measurement of the bas-relief on a building in Gdańsk (Terrestrial Laser Scanning data - TLS data), (c) a dataset from a measurement of the Biebrza river (TLS data).

  19. Relevancy Ranking of Satellite Dataset Search Results

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher; Quinn, Patrick; Norton, James

    2017-01-01

    As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web-page search engines typically use text matching supplemented with reverse link counts, semantic annotations, and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as web pages. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time ranges and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.
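
    One of the heuristics listed in the record above, the overlap of time ranges, reduces to a small scoring function. The formula below (fraction of the queried interval covered by the dataset) is an illustrative assumption, not the actual Common Metadata Repository code.

```python
from datetime import date

def temporal_overlap_score(ds_start, ds_end, q_start, q_end):
    """Fraction of the queried interval covered by the dataset, in [0, 1]."""
    overlap_days = (min(ds_end, q_end) - max(ds_start, q_start)).days
    query_days = (q_end - q_start).days
    return max(0.0, overlap_days / query_days) if query_days > 0 else 0.0

# a dataset covering 2000-2010 scored against a 2008-2012 query
score = temporal_overlap_score(date(2000, 1, 1), date(2010, 1, 1),
                               date(2008, 1, 1), date(2012, 1, 1))
print(f"{score:.2f}")
```

    A spatial-overlap heuristic works the same way with bounding-box areas in place of day counts, and the individual scores can then be combined into a weighted relevance ranking.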

  20. Development of a global historic monthly mean precipitation dataset

    NASA Astrophysics Data System (ADS)

    Yang, Su; Xu, Wenhui; Xu, Yan; Li, Qingxiang

    2016-04-01

    A global historic precipitation dataset is the basis for climate and water-cycle research. Several global historic land-surface precipitation datasets have been developed by international data centers such as the US National Climatic Data Center (NCDC), the European Climate Assessment & Dataset project team, and the Met Office, but so far no such dataset has been developed by a research institute in China. In addition, each dataset has its own regional focus, and the existing global precipitation datasets contain only sparse observational stations over China, which may introduce uncertainties into East Asian precipitation studies. To take comprehensive historic information into account, users might need to employ two or more datasets; however, the non-uniform data formats, data units, station IDs, and so on add extra difficulties for users exploiting these datasets. For this reason, a complete historic precipitation dataset that takes advantage of the various datasets has been developed and produced at the National Meteorological Information Center of China. Precipitation observations from 12 sources are aggregated, and the data formats, data units, and station IDs are unified. Duplicated stations with the same ID are identified, and duplicated observations are removed. A consistency test, a correlation-coefficient test, a significance t-test at the 95% confidence level, and a significance F-test at the 95% confidence level are conducted first to ensure data reliability. Only those datasets that satisfy all four criteria are integrated to produce the China Meteorological Administration global precipitation (CGP) historic precipitation dataset version 1.0. It contains observations at 31 thousand stations with 1.87 × 10^7 data records, among which 4152 precipitation time series are longer than 100 yr. This dataset plays a critical role in climate research due to its advantages in large data volume and high density of station network, compared to

  1. A dataset on tail risk of commodities markets.

    PubMed

    Powell, Robert J; Vo, Duc H; Pham, Thach N; Singh, Abhay K

    2017-12-01

    This article contains the datasets related to the research article "The long and short of commodity tails and their relationship to Asian equity markets" (Powell et al., 2017) [1]. The datasets contain the daily prices (and price movements) of 24 different commodities decomposed from the S&P GSCI index and the daily prices (and price movements) of three share market indices covering World, Asia, and South East Asia for the period 2004-2015. The dataset is then divided into annual periods, showing the worst 5% of price movements for each year. The datasets make it convenient to examine the tail risk of different commodities, as measured by Conditional Value at Risk (CVaR), as well as its changes over time. The datasets can also be used to investigate the association between commodity markets and share markets.
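
    The tail-risk measure used in the record above can be sketched in a few lines: CVaR at the 5% level is the mean of the worst 5% of daily price movements. The returns series below is synthetic, not the commodity data from the article.

```python
import numpy as np

def cvar(returns, alpha=0.05):
    """Expected shortfall: mean of the worst alpha-fraction of returns."""
    cutoff = np.quantile(returns, alpha)       # the 5% VaR threshold
    return returns[returns <= cutoff].mean()   # average of the tail beyond it

rng = np.random.default_rng(1)
daily_returns = rng.normal(0.0, 0.01, size=252)  # one synthetic trading year
print(f"5% CVaR = {cvar(daily_returns):.4f}")
```

    By construction the CVaR is always at or below the 5% quantile itself, which is why it is preferred over plain VaR as a tail-risk measure.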

  2. Wind Integration National Dataset Toolkit | Grid Modernization | NREL

    Science.gov Websites

    The Wind Integration National Dataset (WIND) Toolkit is an update and expansion of the Eastern Wind Integration Data Set. The WIND Toolkit includes meteorological conditions and turbine power data.

  3. Control Measure Dataset

    EPA Pesticide Factsheets

    The EPA Control Measure Dataset is a collection of documents describing air pollution control available to regulated facilities for the control and abatement of air pollution emissions from a range of regulated source types, whether directly through the use of technical measures, or indirectly through economic or other measures.

  4. Comparison of Shallow Survey 2012 Multibeam Datasets

    NASA Astrophysics Data System (ADS)

    Ramirez, T. M.

    2012-12-01

    The purpose of the Shallow Survey common dataset is a comparison of the different technologies utilized for data acquisition in the shallow-survey marine environment. The common dataset consists of a series of surveys conducted over a common area of seabed using a variety of systems. It gives equipment manufacturers the opportunity to showcase their latest systems while giving hydrographic researchers and scientists a chance to test their latest algorithms on the dataset, so that rigorous comparisons can be made. Five companies collected data for the common dataset in the Wellington Harbor area in New Zealand between May 2010 and May 2011: Kongsberg, Reson, R2Sonic, GeoAcoustics, and Applied Acoustics. The Wellington Harbor and surrounding coastal area was selected since it has a number of well-defined features, including the HMNZS South Seas and HMNZS Wellington wrecks, an armored seawall constructed of Tetrapods and Akmons, aquifers, wharves, and marinas. The seabed inside the harbor basin is largely fine-grained sediment, with gravel and reefs around the coast. The area outside the harbor on the southern coast is an active environment, with moving sand and exposed reefs; a marine reserve is also in this area. For consistency between datasets, the coastal research vessel R/V Ikatere and crew were used for all surveys conducted for the common dataset. The multibeam datasets collected for the Shallow Survey were processed for detailed analysis using Triton's Perspective processing software. Datasets from each sonar manufacturer were processed using the CUBE algorithm developed by the Center for Coastal and Ocean Mapping/Joint Hydrographic Center (CCOM/JHC). Each dataset was gridded at 0.5 and 1.0 meter resolutions for cross comparison and compliance with International Hydrographic Organization (IHO) requirements. Detailed comparisons were made of equipment specifications (transmit frequency, number of beams, beam width), data density, total uncertainty, and

  5. Two ultraviolet radiation datasets that cover China

    NASA Astrophysics Data System (ADS)

    Liu, Hui; Hu, Bo; Wang, Yuesi; Liu, Guangren; Tang, Liqin; Ji, Dongsheng; Bai, Yongfei; Bao, Weikai; Chen, Xin; Chen, Yunming; Ding, Weixin; Han, Xiaozeng; He, Fei; Huang, Hui; Huang, Zhenying; Li, Xinrong; Li, Yan; Liu, Wenzhao; Lin, Luxiang; Ouyang, Zhu; Qin, Boqiang; Shen, Weijun; Shen, Yanjun; Su, Hongxin; Song, Changchun; Sun, Bo; Sun, Song; Wang, Anzhi; Wang, Genxu; Wang, Huimin; Wang, Silong; Wang, Youshao; Wei, Wenxue; Xie, Ping; Xie, Zongqiang; Yan, Xiaoyuan; Zeng, Fanjiang; Zhang, Fawei; Zhang, Yangjian; Zhang, Yiping; Zhao, Chengyi; Zhao, Wenzhi; Zhao, Xueyong; Zhou, Guoyi; Zhu, Bo

    2017-07-01

    Ultraviolet (UV) radiation has significant effects on ecosystems, environments, and human health, as well as atmospheric processes and climate change. Two ultraviolet radiation datasets are described in this paper. One contains hourly observations of UV radiation measured at 40 Chinese Ecosystem Research Network stations from 2005 to 2015. CUV3 broadband radiometers were used to observe the UV radiation, with an accuracy of 5%, which meets the World Meteorological Organization's measurement standards. The extremum method was used to control the quality of the measured datasets. The other dataset contains daily cumulative UV radiation estimates that were calculated using an all-sky estimation model combined with a hybrid model. The reconstructed daily UV radiation data span from 1961 to 2014. The mean absolute bias error and root-mean-square error are smaller than 30% at most stations, and most of the mean bias error values are negative, which indicates underestimation of the UV radiation intensity. These datasets can improve our basic knowledge of the spatial and temporal variations in UV radiation. Additionally, these datasets can be used in studies of potential ozone formation and atmospheric oxidation, as well as in simulations of ecological processes.

  6. The Harvard organic photovoltaic dataset

    DOE PAGES

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; ...

    2016-09-27

    Presented in this work is the Harvard Organic Photovoltaic Dataset (HOPV15), a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  7. The Harvard organic photovoltaic dataset.

    PubMed

    Lopez, Steven A; Pyzer-Knapp, Edward O; Simm, Gregor N; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-09-27

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

  8. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang Yanxia; Ma He; Peng Nanbo

    We apply one of the lazy learning methods, the k-nearest neighbor (kNN) algorithm, to estimate the photometric redshifts of quasars based on various data sets from the Sloan Digital Sky Survey (SDSS), the UKIRT Infrared Deep Sky Survey (UKIDSS), and the Wide-field Infrared Survey Explorer (WISE): the SDSS sample, the SDSS-UKIDSS sample, the SDSS-WISE sample, and the SDSS-UKIDSS-WISE sample. The influence of the k value and of different input patterns on the performance of kNN is discussed. kNN performs best when k is tuned separately for each input pattern and data set. The best result is obtained for the SDSS-UKIDSS-WISE sample. The experimental results generally show that the more information from more bands, the better the performance of photometric redshift estimation with kNN. The results also demonstrate that kNN using multiband data can effectively avoid the catastrophic failures of photometric redshift estimation met by many machine learning methods. Compared with the performance of various other methods of estimating the photometric redshifts of quasars, kNN based on a KD-tree shows superiority, exhibiting the best accuracy.
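
    The kNN regression described in the record above can be illustrated with a minimal sketch. The toy colour indices and the linear colour-redshift relation are invented for illustration; the authors' actual SDSS/UKIDSS/WISE samples and KD-tree implementation are not reproduced here.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k=10):
    """Plain kNN regression: average the targets of the k nearest neighbours."""
    # pairwise Euclidean distances, shape (n_query, n_train)
    d = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return train_y[nearest].mean(axis=1)

rng = np.random.default_rng(0)
colors = rng.uniform(-1.0, 3.0, size=(1000, 4))   # toy colour indices
z_true = 0.5 + 0.4 * colors[:, 0] + 0.1 * rng.normal(size=1000)

# hold out the last 200 quasars as the "photometric" sample
z_phot = knn_predict(colors[:800], z_true[:800], colors[800:], k=10)
print(f"median |dz| = {np.median(np.abs(z_phot - z_true[800:])):.3f}")
```

    In practice a KD-tree (as in the record above) replaces the brute-force distance matrix, speeding up the neighbour lookup without changing the estimates.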

  9. Genomics dataset of unidentified disclosed isolates.

    PubMed

    Rekadwad, Bhagwan N

    2016-09-01

    Analysis of DNA sequences is necessary for the higher hierarchical classification of organisms; it gives clues about the characteristics of organisms and their taxonomic position. This dataset was chosen to find complexities in the unidentified DNA disclosed in patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. Quick response (QR) codes were generated, and the AT/GC content of the DNA sequences was analyzed. The QR codes are helpful for quick identification of isolates, while the AT/GC content is helpful for studying their stability at different temperatures. Additionally, a dataset of cleavage codes and enzyme codes from a restriction-digestion study, which is helpful for performing studies with short DNA sequences, is reported. The dataset disclosed here provides new revelatory data for the exploration of unique DNA sequences for evaluation, identification, comparison, and analysis.
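
    The AT/GC-content computation mentioned in the record above is simple enough to sketch directly; the example sequence is made up, not one of the 17 patent sequences.

```python
def gc_content(seq):
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    return 100.0 * gc / len(seq)

print(f"GC content: {gc_content('ATGCGCATTA'):.1f}%")  # 4 of 10 bases -> 40.0%
```

    A higher GC fraction generally implies a higher melting temperature, which is why the record links AT/GC content to sequence stability at different temperatures.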

  10. Application of Huang-Hilbert Transforms to Geophysical Datasets

    NASA Technical Reports Server (NTRS)

    Duffy, Dean G.

    2003-01-01

    The Huang-Hilbert transform is a promising new method for analyzing nonstationary and nonlinear datasets. In this talk I will apply this technique to several important geophysical datasets. To understand the strengths and weaknesses of the method, multi-year, hourly datasets of sea level heights and solar radiation will be analyzed. Then we will apply the transform to the analysis of gravity waves observed in a mesoscale observational net.

  11. The Harvard organic photovoltaic dataset

    PubMed Central

    Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R.; Hachmann, Johannes; Aspuru-Guzik, Alán

    2016-01-01

    The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications. PMID:27676312

  12. VizieR Online Data Catalog: Spectra of low-mass stars in Upper Sco (Lodieu+, 2011)

    NASA Astrophysics Data System (ADS)

    Lodieu, N.; Dobbie, P. D.; Hambly, N. C.

    2010-11-01

    Coordinates (J2000), ZYJHK photometry from the UKIDSS Galactic Clusters Survey, and proper motions derived from the UKIDSS/2MASS cross-match (in arcsec/yr) of stars in the AAOmega field-of-view ordered by increasing Z magnitude. The last column provides a tentative estimate of the spectral type. Data obtained with the AAOmega spectrograph on the Anglo-Australian telescope in May 2007. (4 data files).

  13. Interpolation of diffusion weighted imaging datasets.

    PubMed

    Dyrby, Tim B; Lundell, Henrik; Burke, Mark W; Reislev, Nina L; Paulson, Olaf B; Ptito, Maurice; Siebner, Hartwig R

    2014-12-01

    Diffusion weighted imaging (DWI) is used to study white-matter fibre organisation, orientation and structural connectivity by means of fibre reconstruction algorithms and tractography. In clinical settings, limited scan time compromises the possibility of achieving high image resolution for finer anatomical details and the signal-to-noise ratio needed for reliable fibre reconstruction. We assessed the potential benefits of interpolating DWI datasets to a higher image resolution before fibre reconstruction using a diffusion tensor model. Simulations of straight and curved crossing tracts smaller than or equal to the voxel size showed that conventional higher-order interpolation methods improved the geometrical representation of white-matter tracts with reduced partial-volume effect (PVE), except at tract boundaries. Simulations and interpolation of ex-vivo monkey brain DWI datasets revealed that conventional interpolation methods fail to disentangle fine anatomical details if PVE is too pronounced in the original data. For validation we used ex-vivo DWI datasets acquired at various image resolutions as well as Nissl-stained sections. Increasing the image resolution by a factor of eight yielded finer geometrical resolution and more anatomical details in complex regions such as tract boundaries and cortical layers, which are normally only visualized at higher image resolutions. Similar results were found with a typical clinical human DWI dataset. However, a possible bias in quantitative values imposed by the interpolation method used should be considered. The results indicate that conventional interpolation methods can be successfully applied to DWI datasets for mining anatomical details that are normally seen only at higher resolutions, which will aid tractography and microstructural mapping of tissue compartments. Copyright © 2014. Published by Elsevier Inc.

  14. [German national consensus on wound documentation of leg ulcer : Part 1: Routine care - standard dataset and minimum dataset].

    PubMed

    Heyer, K; Herberger, K; Protz, K; Mayer, A; Dissemond, J; Debus, S; Augustin, M

    2017-09-01

    Standards for basic documentation and the course of treatment increase quality assurance and efficiency in health care. To date, no standards for the treatment of patients with leg ulcers are available in Germany. The aim of the study was to develop standards under routine conditions in the documentation of patients with leg ulcers. This article shows the recommended variables of a "standard dataset" and a "minimum dataset". Consensus building among experts from 38 scientific societies, professional associations, insurance and supply networks (n = 68 experts) took place. After conducting a systematic international literature research, available standards were reviewed and supplemented with our own considerations of the expert group. From 2012-2015 standards for documentation were defined in multistage online visits and personal meetings. A consensus was achieved for 18 variables for the minimum dataset and 48 variables for the standard dataset in a total of seven meetings and nine online Delphi visits. The datasets involve patient baseline data, data on the general health status, wound characteristics, diagnostic and therapeutic interventions, patient reported outcomes, nutrition, and education status. Based on a multistage continuous decision-making process, a standard in the measurement of events in routine care in patients with a leg ulcer was developed.

  15. EEG datasets for motor imagery brain-computer interface.

    PubMed

    Cho, Hohyun; Ahn, Minkyu; Ahn, Sangtae; Kwon, Moonyoung; Jun, Sung Chan

    2017-07-01

    Most investigators of brain-computer interface (BCI) research believe that BCI can be achieved through induced neuronal activity from the cortex, but not through evoked neuronal activity. Motor imagery (MI)-based BCI is one of the standard concepts of BCI, in which the user generates induced activity by imagining motor movements. However, variations in performance over sessions and subjects are too severe to overcome easily; a basic understanding and investigation of this variation is therefore necessary. Here we present not only EEG datasets for MI BCI from 52 subjects, but also the results of a psychological and physiological questionnaire, EMG datasets, the locations of 3D EEG electrodes, and EEGs for non-task-related states. We validated our EEG datasets by using the percentage of bad trials, event-related desynchronization/synchronization (ERD/ERS) analysis, and classification analysis. After conventional rejection of bad trials, we observed contralateral ERD and ipsilateral ERS in the somatosensory area, which are well-known patterns of MI. Finally, we showed that 73.08% of the datasets (38 subjects) included reasonably discriminative information. Our EEG datasets include the information necessary to determine statistical significance; they consist of well-discriminated datasets (38 subjects) and less-discriminative datasets. These may provide researchers with opportunities to investigate human factors related to MI BCI performance variation, and may also enable subject-to-subject transfer by using the metadata, including the questionnaire, EEG coordinates, and EEGs for non-task-related states. © The Authors 2017. Published by Oxford University Press.
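
    The ERD/ERS validation mentioned above can be illustrated with a minimal band-power sketch. The synthetic signals, the 8-12 Hz band, and the helper names are illustrative assumptions, not the paper's exact analysis pipeline.

    ```python
    import numpy as np

    def band_power(sig, fs, f_lo, f_hi):
        """Mean squared spectral amplitude of `sig` inside [f_lo, f_hi] Hz."""
        spec = np.abs(np.fft.rfft(sig)) ** 2
        freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
        mask = (freqs >= f_lo) & (freqs <= f_hi)
        return spec[mask].mean()

    def erd_percent(baseline, task, fs, band=(8, 12)):
        """ERD/ERS as percent power change vs. baseline (negative = ERD)."""
        p_base = band_power(baseline, fs, *band)
        p_task = band_power(task, fs, *band)
        return 100.0 * (p_task - p_base) / p_base

    # Synthetic mu rhythm: imagery halves the 10 Hz amplitude, so band
    # power drops to a quarter, i.e. an ERD of -75%
    fs = 100
    t = np.arange(0, 1, 1 / fs)
    baseline = np.sin(2 * np.pi * 10 * t)
    task = 0.5 * np.sin(2 * np.pi * 10 * t)
    ```

    A contralateral channel showing a negative value during imagery (ERD) and an ipsilateral channel showing a positive one (ERS) is the pattern the authors used to validate the datasets.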

  16. A high-resolution European dataset for hydrologic modeling

    NASA Astrophysics Data System (ADS)

    Ntegeka, Victor; Salamon, Peter; Gomes, Goncalo; Sint, Hadewij; Lorini, Valerio; Thielen, Jutta

    2013-04-01

    There is an increasing demand for large-scale hydrological models, not only for modeling the impact of climate change on water resources but also for disaster risk assessments and flood or drought early warning systems. These large-scale models need to be calibrated and verified against large amounts of observations in order to assess their predictive capabilities. However, the creation of large-scale datasets is challenging, as it requires collection, harmonization, and quality checking of large amounts of observations. For this reason, only a limited number of such datasets exist. In this work, we present a pan-European, high-resolution gridded dataset of meteorological observations (EFAS-Meteo), designed to drive a large-scale hydrological model. Similar European and global gridded datasets already exist, such as HadGHCND (Caesar et al., 2006), the JRC MARS-STAT database (van der Goot and Orlandi, 2003) and the E-OBS gridded dataset (Haylock et al., 2008). However, none of these provides a similarly high spatial resolution and/or a complete set of variables to force a hydrologic model. EFAS-Meteo contains daily maps of precipitation, surface temperature (mean, minimum and maximum), wind speed and vapour pressure at a spatial grid resolution of 5 x 5 km for the period 1 January 1990 - 31 December 2011. It furthermore contains radiation, calculated with a staggered approach depending on the availability of sunshine duration, cloud cover and minimum and maximum temperature, as well as evapotranspiration (potential, bare-soil and open-water evapotranspiration). The potential evapotranspiration was calculated using the Penman-Monteith equation with the above-mentioned meteorological variables. The dataset was created as part of the development of the European Flood Awareness System (EFAS) and has been continuously updated in recent years. 
The dataset variables are used as
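
    The Penman-Monteith calculation mentioned above can be sketched as follows. This uses the widely cited FAO-56 daily reference form with illustrative inputs; the exact EFAS-Meteo parameterization is not given here and may differ.

    ```python
    import math

    def et0_fao56(t_mean, rn, u2, rh_mean, pressure_kpa=101.3, g=0.0):
        """Daily reference evapotranspiration (mm/day), FAO-56 Penman-Monteith.

        t_mean  : mean air temperature [degC]
        rn      : net radiation [MJ m-2 day-1]
        u2      : wind speed at 2 m [m s-1]
        rh_mean : mean relative humidity [%]
        g       : soil heat flux [MJ m-2 day-1], ~0 for daily time steps
        """
        # Saturation and actual vapour pressure [kPa]
        es = 0.6108 * math.exp(17.27 * t_mean / (t_mean + 237.3))
        ea = es * rh_mean / 100.0
        # Slope of the saturation curve and psychrometric constant [kPa/degC]
        delta = 4098.0 * es / (t_mean + 237.3) ** 2
        gamma = 0.000665 * pressure_kpa
        num = (0.408 * delta * (rn - g)
               + gamma * (900.0 / (t_mean + 273.0)) * u2 * (es - ea))
        return num / (delta + gamma * (1.0 + 0.34 * u2))
    ```

    For a mild summer day (20 degC, 13 MJ m-2 day-1 net radiation, 2 m/s wind, 60% humidity) this yields roughly 4-5 mm/day, a typical mid-latitude value.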

  17. ASSISTments Dataset from Multiple Randomized Controlled Experiments

    ERIC Educational Resources Information Center

    Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil

    2016-01-01

    In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…

  18. Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets.

    PubMed

    McKinney, Bill; Meyer, Peter A; Crosas, Mercè; Sliz, Piotr

    2017-01-01

    Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension: functionality supporting preservation of file system structure within Dataverse, which is essential both for in-place computation and for non-HTTP data transfers. © 2016 New York Academy of Sciences.

  19. Extension of research data repository system to support direct compute access to biomedical datasets: enhancing Dataverse to support large datasets

    PubMed Central

    McKinney, Bill; Meyer, Peter A.; Crosas, Mercè; Sliz, Piotr

    2016-01-01

    Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension: functionality supporting preservation of filesystem structure within Dataverse, which is essential both for in-place computation and for non-HTTP data transfers. PMID:27862010

  20. Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    PubMed

    Brown, Adrian P; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Boyd, James H

    2017-07-10

    Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20% error. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher
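
    A minimal sketch of the privacy-preserving field comparison that underlies this kind of linkage, assuming a simple bigram Bloom-filter encoding and a Dice similarity between filters. The filter size `m`, hash count `k`, and hashing scheme are illustrative assumptions, not the parameters used in the paper.

    ```python
    import hashlib

    def bigrams(field):
        """Split a field value into padded character bigrams."""
        s = f"_{field.lower()}_"
        return {s[i:i + 2] for i in range(len(s) - 1)}

    def bloom_bits(field, m=128, k=3):
        """Bit positions a field's bigrams switch on in an m-bit Bloom filter."""
        bits = set()
        for gram in bigrams(field):
            for seed in range(k):
                digest = hashlib.sha1(f"{seed}:{gram}".encode()).hexdigest()
                bits.add(int(digest, 16) % m)
        return bits

    def dice(bits_a, bits_b):
        """Dice coefficient between two Bloom filters (1.0 = identical)."""
        return 2.0 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))
    ```

    Field-level Dice scores like these feed the EM-based estimation of match probabilities described in the abstract; the encryption means agreement can only ever be measured approximately, which is why the parameter estimation must also work on the encoded data.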

  1. Viking Seismometer PDS Archive Dataset

    NASA Astrophysics Data System (ADS)

    Lorenz, R. D.

    2016-12-01

    The Viking Lander 2 seismometer operated successfully for over 500 Sols on the Martian surface, recording at least one likely candidate Marsquake. The Viking mission, in an era when data handling hardware (both on board and on the ground) was limited in capability, predated modern planetary data archiving, and the ad-hoc repositories of the data, and the very low-level record at NSSDC, were neither convenient to process nor well-known. In an effort supported by the NASA Mars Data Analysis Program, we have converted the bulk of the Viking dataset (namely the 49,000 and 270,000 records made in High- and Event-modes at 20 and 1 Hz respectively) into a simple ASCII table format. Additionally, since wind-generated lander motion is a major component of the signal, contemporaneous meteorological data are included in summary records to facilitate correlation. These datasets are being archived at the PDS Geosciences Node. In addition to brief instrument and dataset descriptions, the archive includes code snippets in the freely-available language 'R' to demonstrate plotting and analysis. Further, we present examples of lander-generated noise associated with the sampler arm, instrument dumps and other mechanical operations.

  2. National Elevation Dataset

    USGS Publications Warehouse

    ,

    1999-01-01

    The National Elevation Dataset (NED) is a new raster product assembled by the U.S. Geological Survey (USGS). The NED is designed to provide national elevation data in a seamless form with a consistent datum, elevation unit, and projection. Data corrections were made in the NED assembly process to minimize artifacts, permit edge matching, and fill sliver areas of missing data.

  3. A Benchmark Dataset for SSVEP-Based Brain-Computer Interfaces.

    PubMed

    Wang, Yijun; Chen, Xiaogang; Gao, Xiaorong; Gao, Shangkai

    2017-10-01

    This paper presents a benchmark steady-state visual evoked potential (SSVEP) dataset acquired with a 40-target brain-computer interface (BCI) speller. The dataset consists of 64-channel electroencephalogram (EEG) data from 35 healthy subjects (8 experienced and 27 naïve) while they performed a cue-guided target selecting task. The virtual keyboard of the speller was composed of 40 visual flickers, which were coded using a joint frequency and phase modulation (JFPM) approach. The stimulation frequencies ranged from 8 Hz to 15.8 Hz with an interval of 0.2 Hz. The phase difference between two adjacent frequencies was 0.5π. For each subject, the data included six blocks of 40 trials corresponding to all 40 flickers indicated by a visual cue in a random order. The stimulation duration in each trial was five seconds. The dataset can be used as a benchmark to compare methods for stimulus coding and target identification in SSVEP-based BCIs. Through offline simulation, the dataset can be used to design new system diagrams and evaluate their BCI performance without collecting any new data. The dataset also provides high-quality data for computational modeling of SSVEPs. The dataset is freely available from http://bci.med.tsinghua.edu.cn/download.html.
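
    The JFPM stimulus layout described above can be sketched as a frequency/phase table, assuming a 0.5π phase step between adjacent frequencies; the helper function is our own illustration, not code from the dataset.

    ```python
    import math

    def jfpm_targets(n=40, f0=8.0, df=0.2, dphi=0.5 * math.pi):
        """Frequency/phase pairs under joint frequency-phase modulation (JFPM):
        the k-th flicker gets f0 + k*df Hz and phase (k*dphi) mod 2*pi."""
        return [(round(f0 + k * df, 1), (k * dphi) % (2 * math.pi))
                for k in range(n)]

    targets = jfpm_targets()  # 40 (frequency, phase) pairs, 8.0 .. 15.8 Hz
    ```

    Jointly varying frequency and phase is what lets 40 targets be distinguished even though the frequency spacing is only 0.2 Hz.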

  4. Dataset-Driven Research to Support Learning and Knowledge Analytics

    ERIC Educational Resources Information Center

    Verbert, Katrien; Manouselis, Nikos; Drachsler, Hendrik; Duval, Erik

    2012-01-01

    In various research areas, the availability of open datasets is considered as key for research and application purposes. These datasets are used as benchmarks to develop new algorithms and to compare them to other algorithms in given settings. Finding such available datasets for experimentation can be a challenging task in technology enhanced…

  5. Method of generating features optimal to a dataset and classifier

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bruillard, Paul J.; Gosink, Luke J.; Jarman, Kenneth D.

    A method of generating features optimal to a particular dataset and classifier is disclosed. A dataset of messages is inputted and a classifier is selected. An algebra of features is encoded. Computable features that are capable of describing the dataset from the algebra of features are selected. Irredundant features that are optimal for the classifier and the dataset are selected.

  6. Querying Patterns in High-Dimensional Heterogenous Datasets

    ERIC Educational Resources Information Center

    Singh, Vishwakarma

    2012-01-01

    The recent technological advancements have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a…

  7. A hybrid organic-inorganic perovskite dataset

    NASA Astrophysics Data System (ADS)

    Kim, Chiho; Huan, Tran Doan; Krishnan, Sridevi; Ramprasad, Rampi

    2017-05-01

    Hybrid organic-inorganic perovskites (HOIPs) have been attracting a great deal of attention due to their versatility of electronic properties and fabrication methods. We prepare a dataset of 1,346 HOIPs, which features 16 organic cations, 3 group-IV cations and 4 halide anions. Using a combination of an atomic structure search method and density functional theory calculations, the optimized structures, the bandgap, the dielectric constant, and the relative energies of the HOIPs are uniformly prepared and validated by comparing with relevant experimental and/or theoretical data. We make the dataset available at Dryad Digital Repository, NoMaD Repository, and Khazana Repository (http://khazana.uconn.edu/), hoping that it could be useful for future data-mining efforts that can explore possible structure-property relationships and phenomenological models. Progressive extension of the dataset is expected as new organic cations become appropriate within the HOIP framework, and as additional properties are calculated for the new compounds found.

  8. Dynamically Close Pairs of Galaxies Selected in the NIR

    NASA Astrophysics Data System (ADS)

    Keenan, Ryan C.; Foucaud, Sebastien; De Propris, Roberto; Lin, Jing-Hua

    2013-07-01

    Studies of dynamically close pairs of galaxies can serve as a powerful probe of the galaxy merger rate and its evolution. Here we present a large sample of dynamically close pairs of galaxies selected in the K-band from the UKIDSS LAS. These data span ~175 deg² on the sky in the 2dFGRS equatorial region (10h < RA < 14h). Combining the 2dFGRS redshifts with those from the SDSS, our K-band selected catalog is >90% spectroscopically complete at K_AB < 16.4. In this study, we focus on quantifying the relative contributions of wet, dry, and mixed mergers to the stellar mass buildup of galaxies over the past 1-2 Gyr.

  9. The LANDFIRE Refresh strategy: updating the national dataset

    USGS Publications Warehouse

    Nelson, Kurtis J.; Connot, Joel A.; Peterson, Birgit E.; Martin, Charley

    2013-01-01

    The LANDFIRE Program provides comprehensive vegetation and fuel datasets for the entire United States. As with many large-scale ecological datasets, vegetation and landscape conditions must be updated periodically to account for disturbances, growth, and natural succession. The LANDFIRE Refresh effort was the first attempt to consistently update these products nationwide. It incorporated a combination of specific systematic improvements to the original LANDFIRE National data, remote sensing based disturbance detection methods, field collected disturbance information, vegetation growth and succession modeling, and vegetation transition processes. This resulted in the creation of two complete datasets for all 50 states: LANDFIRE Refresh 2001, which includes the systematic improvements, and LANDFIRE Refresh 2008, which includes the disturbance and succession updates to the vegetation and fuel data. The new datasets are comparable for studying landscape changes in vegetation type and structure over a decadal period, and provide the most recent characterization of fuel conditions across the country. The applicability of the new layers is discussed and the effects of using the new fuel datasets are demonstrated through a fire behavior modeling exercise using the 2011 Wallow Fire in eastern Arizona as an example.

  10. Omicseq: a web-based search engine for exploring omics datasets

    PubMed Central

    Sun, Xiaobo; Pittard, William S.; Xu, Tianlei; Chen, Li; Zwick, Michael E.; Jiang, Xiaoqian; Wang, Fusheng

    2017-01-01

    The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long-standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve ‘findability’ of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable, elastic NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. PMID:28402462

  11. Usefulness of DARPA dataset for intrusion detection system evaluation

    NASA Astrophysics Data System (ADS)

    Thomas, Ciza; Sharma, Vishwas; Balakrishnan, N.

    2008-03-01

    The MIT Lincoln Laboratory IDS evaluation methodology is a practical solution for evaluating the performance of intrusion detection systems, and it has contributed tremendously to research progress in that field. The DARPA IDS evaluation dataset has been criticized and is considered by many to be a very outdated dataset, unable to accommodate the latest trends in attacks. The question then naturally arises whether detection systems have improved beyond detecting these old levels of attack; if not, is it worth regarding this dataset as obsolete? The paper presented here provides supporting facts for the continued use of the DARPA IDS evaluation dataset. Two commonly used signature-based IDSs, Snort and Cisco IDS, and two anomaly detectors, PHAD and ALAD, are used for this evaluation, and the results support the usefulness of the DARPA dataset for IDS evaluation.

  12. Large-scale Labeled Datasets to Fuel Earth Science Deep Learning Applications

    NASA Astrophysics Data System (ADS)

    Maskey, M.; Ramachandran, R.; Miller, J.

    2017-12-01

    Deep learning has revolutionized computer vision and natural language processing with various algorithms scaled using high-performance computing. However, generic large-scale labeled datasets such as the ImageNet are the fuel that drives the impressive accuracy of deep learning results. Large-scale labeled datasets already exist in domains such as medical science, but creating them in the Earth science domain is a challenge. While there are ways to apply deep learning using limited labeled datasets, there is a need in the Earth sciences for creating large-scale labeled datasets for benchmarking and scaling deep learning applications. At the NASA Marshall Space Flight Center, we are using deep learning for a variety of Earth science applications where we have encountered the need for large-scale labeled datasets. We will discuss our approaches for creating such datasets and why these datasets are just as valuable as deep learning algorithms. We will also describe successful usage of these large-scale labeled datasets with our deep learning based applications.

  13. Challenges in Extracting Information From Large Hydrogeophysical-monitoring Datasets

    NASA Astrophysics Data System (ADS)

    Day-Lewis, F. D.; Slater, L. D.; Johnson, T.

    2012-12-01

    Over the last decade, new automated geophysical data-acquisition systems have enabled collection of increasingly large and information-rich geophysical datasets. Concurrent advances in field instrumentation, web services, and high-performance computing have made real-time processing, inversion, and visualization of large three-dimensional tomographic datasets practical. Geophysical-monitoring datasets have provided high-resolution insights into diverse hydrologic processes including groundwater/surface-water exchange, infiltration, solute transport, and bioremediation. Despite the high information content of such datasets, extraction of quantitative or diagnostic hydrologic information is challenging. Visual inspection and interpretation for specific hydrologic processes is difficult for datasets that are large, complex, and (or) affected by forcings (e.g., seasonal variations) unrelated to the target hydrologic process. New strategies are needed to identify salient features in spatially distributed time-series data and to relate temporal changes in geophysical properties to hydrologic processes of interest while effectively filtering unrelated changes. Here, we review recent work using time-series and digital-signal-processing approaches in hydrogeophysics. Examples include applications of cross-correlation, spectral, and time-frequency (e.g., wavelet and Stockwell transforms) approaches to (1) identify salient features in large geophysical time series; (2) examine correlation or coherence between geophysical and hydrologic signals, even in the presence of non-stationarity; and (3) condense large datasets while preserving information of interest. Examples demonstrate analysis of large time-lapse electrical tomography and fiber-optic temperature datasets to extract information about groundwater/surface-water exchange and contaminant transport.
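
    As a minimal illustration of the cross-correlation approach mentioned above, the following sketch recovers the delay between a hydrologic forcing and a geophysical response; the synthetic series and the lag-recovery helper are our own, not the authors' code.

    ```python
    import numpy as np

    def lag_of_max_xcorr(x, y):
        """Lag (in samples) at which y best aligns with x; positive
        means y lags x."""
        c = np.correlate(y, x, mode="full")
        return int(np.argmax(c)) - (len(x) - 1)

    # Synthetic example: a "pulse" (e.g. a stage rise) appears 5 samples
    # later in the geophysical series (e.g. a resistivity response)
    t = np.arange(200, dtype=float)
    stage = np.exp(-((t - 80.0) ** 2) / 50.0)
    resistivity = np.exp(-((t - 85.0) ** 2) / 50.0)
    lag = lag_of_max_xcorr(stage, resistivity)  # -> 5
    ```

    Estimated lags like this one are a compact way to condense large time-lapse datasets into a single diagnostic of exchange timing, one of the information-extraction strategies the abstract reviews.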

  14. Using Multiple Big Datasets and Machine Learning to Produce a New Global Particulate Dataset: A Technology Challenge Case Study

    NASA Astrophysics Data System (ADS)

    Lary, D. J.

    2013-12-01

    A BigData case study is described in which multiple datasets from several satellites, high-resolution global meteorological data, social media and in-situ observations are combined using machine learning on a distributed cluster with an automated workflow. The resulting global particulate dataset is relevant to global public health studies and would not be possible to produce without the use of multiple big datasets, in-situ data and machine learning. To greatly reduce development time and enhance functionality, a high-level language capable of parallel processing (Matlab) was used. Key considerations for the system are high-speed access due to the large data volume, persistence of the large data volumes, and a precise process-time scheduling capability.

  15. BigNeuron dataset V.0.0

    DOE Data Explorer

    Ramanathan, Arvind

    2016-01-01

    The cleaned bench-testing reconstructions for the gold166 datasets have been put online at GitHub: https://github.com/BigNeuron/Events-and-News/wiki/BigNeuron-Events-and-News and https://github.com/BigNeuron/Data/releases/tag/gold166_bt_v1.0. The respective image datasets were released a while ago from other sites (the main pointer is available at GitHub as well: https://github.com/BigNeuron/Data/releases/tag/Gold166_v1), but since the files were big, the actual downloading was distributed across three continents.

  16. Validating Variational Bayes Linear Regression Method With Multi-Central Datasets.

    PubMed

    Murata, Hiroshi; Zangwill, Linda M; Fujino, Yuri; Matsuura, Masato; Miki, Atsuya; Hirasawa, Kazunori; Tanito, Masaki; Mizoue, Shiro; Mori, Kazuhiko; Suzuki, Katsuyoshi; Yamashita, Takehiro; Kashiwagi, Kenji; Shoji, Nobuyuki; Asaoka, Ryo

    2018-04-01

    To validate the prediction accuracy of variational Bayes linear regression (VBLR) with two datasets external to the training dataset. The training dataset consisted of 7268 eyes of 4278 subjects from the University of Tokyo Hospital. The Japanese Archive of Multicentral Databases in Glaucoma (JAMDIG) dataset consisted of 271 eyes of 177 patients, and the Diagnostic Innovations in Glaucoma Study (DIGS) dataset included 248 eyes of 173 patients; both were used for validation. Prediction accuracy was compared between VBLR and ordinary least squares linear regression (OLSLR). First, OLSLR and VBLR were carried out using total deviation (TD) values at each of the 52 test points from the second to fourth visual fields (VF2-4) up to the second to tenth visual fields (VF2-10) of each patient in the JAMDIG and DIGS datasets, and the TD values of the 11th VF test were predicted each time. The predictive accuracy of each method was compared through the root mean squared error (RMSE) statistic. OLSLR RMSEs with the JAMDIG and DIGS datasets were between 31 and 4.3 dB, and between 19.5 and 3.9 dB, respectively. On the other hand, VBLR RMSEs with the JAMDIG and DIGS datasets were between 5.0 and 3.7 dB, and between 4.6 and 3.6 dB. There was a statistically significant difference between VBLR and OLSLR for both datasets at every series (VF2-4 to VF2-10) (P < 0.01 for all tests). However, there was no statistically significant difference in VBLR RMSEs between the JAMDIG and DIGS datasets at any series of VFs (VF2-4 to VF2-10) (P > 0.05). VBLR outperformed OLSLR in predicting future VF progression, and has the potential to be a helpful tool in clinical settings.
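
    The RMSE-based comparison can be sketched for a single test point. OLSLR is approximated here with a plain least-squares line fit via `np.polyfit`; VBLR itself is not reproduced, and the TD values are hypothetical.

    ```python
    import numpy as np

    def rmse(pred, obs):
        """Root mean squared error between predicted and observed values."""
        pred, obs = np.asarray(pred, float), np.asarray(obs, float)
        return float(np.sqrt(np.mean((pred - obs) ** 2)))

    def ols_forecast(visits, td, future_visit):
        """Ordinary least squares trend through past TD values, extrapolated
        to a future visit."""
        slope, intercept = np.polyfit(visits, td, deg=1)
        return slope * future_visit + intercept

    # Hypothetical series: TD at one test point declining 0.5 dB per visit
    visits = np.arange(2, 11)            # visual fields 2 through 10
    td = -0.5 * visits + 1.0             # noiseless linear decline
    pred = ols_forecast(visits, td, 11)  # predict the 11th VF
    ```

    With real, noisy TD series and few visits, OLS extrapolations become unstable, which is consistent with the large early-series OLSLR RMSEs the abstract reports and with the motivation for a Bayesian regression.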

  17. Squish: Near-Optimal Compression for Archival of Relational Datasets

    PubMed Central

    Gao, Yihan; Parameswaran, Aditya

    2017-01-01

    Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes. We study compression algorithms that leverage the relational structure to compress datasets to a much greater extent. We develop Squish, a system that uses a combination of Bayesian Networks and Arithmetic Coding to capture multiple kinds of dependencies among attributes and achieve near-entropy compression rate. Squish also supports user-defined attributes: users can instantiate new data types by simply implementing five functions for a new class interface. We prove the asymptotic optimality of our compression algorithm and conduct experiments to show the effectiveness of our system: Squish achieves a reduction of over 50% in storage size relative to systems developed in prior work on a variety of real datasets. PMID:28180028
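
    Squish's Bayesian-network model is beyond a short sketch, but the core intuition, that modeling structure among values lets a compressor do far better than compressing raw serialized rows, can be shown with a toy delta-encoding example using only the standard library; the column contents are invented for illustration.

    ```python
    import zlib

    # A sorted integer key column, serialized naively as text
    ids = list(range(10_000))
    raw = ",".join(map(str, ids)).encode()

    # Model the structure first: consecutive differences are all 1, a
    # highly redundant stream that a generic compressor captures far better
    deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
    delta_bytes = ",".join(map(str, deltas)).encode()

    raw_size = len(zlib.compress(raw, 9))
    delta_size = len(zlib.compress(delta_bytes, 9))
    ```

    The delta stream compresses to a small fraction of the raw column's size, which is the same principle Squish generalizes by learning inter-attribute dependencies and coding against them.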

  18. Omicseq: a web-based search engine for exploring omics datasets.

    PubMed

    Sun, Xiaobo; Pittard, William S; Xu, Tianlei; Chen, Li; Zwick, Michael E; Jiang, Xiaoqian; Wang, Fusheng; Qin, Zhaohui S

    2017-07-03

    The development and application of high-throughput genomics technologies has resulted in massive quantities of diverse omics data that continue to accumulate rapidly. These rich datasets offer unprecedented and exciting opportunities to address long-standing questions in biomedical research. However, our ability to explore and query the content of diverse omics data is very limited. Existing dataset search tools rely almost exclusively on the metadata. A text-based query for gene name(s) does not work well on datasets wherein the vast majority of their content is numeric. To overcome this barrier, we have developed Omicseq, a novel web-based platform that facilitates the easy interrogation of omics datasets holistically to improve 'findability' of relevant data. The core component of Omicseq is trackRank, a novel algorithm for ranking omics datasets that fully uses the numerical content of the dataset to determine relevance to the query entity. The Omicseq system is supported by a scalable, elastic NoSQL database that hosts a large collection of processed omics datasets. In the front end, a simple, web-based interface allows users to enter queries and instantly receive search results as a list of ranked datasets deemed to be the most relevant. Omicseq is freely available at http://www.omicseq.org. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  19. ISRUC-Sleep: A comprehensive public dataset for sleep researchers.

    PubMed

    Khalighi, Sirvan; Sousa, Teresa; Santos, José Moutinho; Nunes, Urbano

    2016-02-01

    To facilitate performance comparison of new methods for sleep pattern analysis, publicly available datasets with quality content are very important and useful. We introduce an open-access comprehensive sleep dataset, called ISRUC-Sleep. The data were obtained from human adults, including healthy subjects, subjects with sleep disorders, and subjects under the effect of sleep medication. Each recording was randomly selected among PSG recordings acquired by the Sleep Medicine Centre of the Hospital of Coimbra University (CHUC). The dataset comprises three groups of data: (1) data concerning 100 subjects, with one recording session per subject; (2) data gathered from 8 subjects, with two recording sessions per subject; and (3) data from one recording session for each of 10 healthy subjects. The polysomnography (PSG) recordings associated with each subject were visually scored by two human experts. Compared with existing sleep-related public datasets, ISRUC-Sleep provides data for a reasonable number of subjects with different characteristics, such as data useful for studies involving changes in the PSG signals over time, and data of healthy subjects useful for studies comparing healthy subjects with patients suffering from sleep disorders. This dataset was created to complement existing datasets by providing easy-to-apply data with some characteristics not yet covered. ISRUC-Sleep can be useful for new contributions (i) in biomedical signal processing; (ii) in the development of ASSC methods; and (iii) in sleep physiology studies. To evaluate and compare new contributions that use this dataset as a benchmark, results of applying a subject-independent automatic sleep stage classification (ASSC) method to the ISRUC-Sleep dataset are presented. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  20. Quantifying uncertainty in observational rainfall datasets

    NASA Astrophysics Data System (ADS)

    Lennard, Chris; Dosio, Alessandro; Nikulin, Grigory; Pinto, Izidine; Seid, Hussen

    2015-04-01

    The Coordinated Regional Downscaling Experiment (CORDEX) has to date seen the publication of at least ten journal papers that examine the African domain during 2012 and 2013. Five of these papers consider Africa generally (Nikulin et al. 2012, Kim et al. 2013, Hernandes-Dias et al. 2013, Laprise et al. 2013, Panitz et al. 2013) and five have regional foci: Tramblay et al. (2013) on Northern Africa, Mariotti et al. (2014) and Gbobaniyi et al. (2013) on West Africa, Endris et al. (2013) on East Africa, and Kalagnoumou et al. (2013) on southern Africa. A further three papers known to the authors are under review. These papers all use observed rainfall and/or temperature data to evaluate or validate the regional model output, and often proceed to assess projected changes in these variables due to climate change in the context of these observations. The most popular reference rainfall datasets are CRU, GPCP, GPCC, TRMM and UDEL. However, as Kalagnoumou et al. (2013) point out, many other rainfall datasets are available for consideration, for example CMORPH, FEWS, TAMSAT & RIANNAA, TAMORA and the WATCH & WATCH-DEI data. They, along with others (Nikulin et al. 2012, Sylla et al. 2012), show that the observed datasets can have a very wide spread at a particular space-time coordinate. As more ground-, space- and reanalysis-based rainfall products become available, all of which use different methods to produce precipitation data, the selection of reference data is becoming an important factor in model evaluation. A number of factors can contribute to uncertainty in the reliability and validity of these datasets, such as radiance conversion algorithms, the quantity and quality of available station data, interpolation techniques, and the blending methods used to combine satellite- and gauge-based products. However, to date no comprehensive study has been performed to evaluate the uncertainty in these observational datasets. We assess 18 gridded

  1. Internal Consistency of the NVAP Water Vapor Dataset

    NASA Technical Reports Server (NTRS)

    Suggs, Ronnie J.; Jedlovec, Gary J.; Arnold, James E. (Technical Monitor)

    2001-01-01

    The NVAP (NASA Water Vapor Project) dataset is a global dataset at 1 x 1 degree spatial resolution consisting of daily, pentad, and monthly atmospheric precipitable water (PW) products. The analysis blends measurements from the Television and Infrared Operational Satellite (TIROS) Operational Vertical Sounder (TOVS), the Special Sensor Microwave/Imager (SSM/I), and radiosonde observations into a daily collage of PW. The original dataset consisted of five years of data from 1988 to 1992. Recent updates have added three additional years (1993-1995) and incorporated procedural and algorithm changes from the original methodology. Since none of the PW sources (TOVS, SSM/I, and radiosonde) provides global coverage, the sources complement one another by providing spatial coverage over regions and at times where the others are not available. For this type of spatial and temporal blending to be successful, each of the source components should have similar or compatible accuracies. If this is not the case, regional and time-varying biases may be manifested in the NVAP dataset. This study examines the consistency of the NVAP source data by comparing daily collocated TOVS and SSM/I PW retrievals with collocated radiosonde PW observations. The daily PW intercomparisons are performed over the time period of the dataset and for various regions.

  2. Topic modeling for cluster analysis of large biological and medical datasets

    PubMed Central

    2014-01-01

    Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than

  3. Topic modeling for cluster analysis of large biological and medical datasets.

    PubMed

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting
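    Of the three methods named, "highest probable topic assignment" is the most direct to sketch: fit a topic model to a sample-by-feature count matrix and assign each sample to the cluster of its most probable topic. The following is an illustrative sketch using scikit-learn's LDA on synthetic binary data; the matrix size, number of topics, and feature encoding are assumptions, not the paper's settings.

```python
# Sketch of "highest probable topic assignment" clustering on a toy
# sample-by-feature count matrix (e.g., PFGE band presence/absence).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 50))      # 60 samples x 50 binary features

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(X)           # per-sample topic distribution

# Each sample is assigned to the cluster of its most probable topic.
clusters = doc_topic.argmax(axis=1)
print(clusters.shape)                      # (60,)
```

The same fitted `doc_topic` matrix can also feed the other two variants: use the topic distributions as extracted features for a conventional clusterer, or keep only the most informative topics before clustering.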

  4. Food Recognition: A New Dataset, Experiments, and Results.

    PubMed

    Ciocca, Gianluigi; Napoletano, Paolo; Schettini, Raimondo

    2017-05-01

    We propose a new dataset for the evaluation of food recognition algorithms that can be used in dietary monitoring applications. Each image depicts a real canteen tray with dishes and foods arranged in different ways. Each tray contains multiple instances of food classes. The dataset contains 1027 canteen trays for a total of 3616 food instances belonging to 73 food classes. The food on the tray images has been manually segmented using carefully drawn polygonal boundaries. We have benchmarked the dataset by designing an automatic tray analysis pipeline that takes a tray image as input, finds the regions of interest, and predicts for each region the corresponding food class. We have experimented with three different classification strategies, also using several visual descriptors. We achieve about 79% food and tray recognition accuracy using convolutional-neural-network-based features. The dataset, as well as the benchmark framework, is available to the research community.

  5. A reanalysis dataset of the South China Sea.

    PubMed

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992-2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability.

  6. Dataset definition for CMS operations and physics analyses

    NASA Astrophysics Data System (ADS)

    Franzoni, Giovanni; Compact Muon Solenoid Collaboration

    2016-04-01

    Data recorded at the CMS experiment are funnelled into streams, integrated in the HLT menu, and further organised in a hierarchical structure of primary datasets and secondary datasets/dedicated skims. Datasets are defined according to the final-state particles reconstructed by the high level trigger, the data format and the use case (physics analysis, alignment and calibration, performance studies). During the first LHC run, new workflows were added to this canonical scheme to best exploit the flexibility of the CMS trigger and data acquisition systems. The concepts of data parking and data scouting were introduced to extend the physics reach of CMS, offering the opportunity to define physics triggers with extremely loose selections (e.g. a dijet resonance trigger collecting data at a rate of 1 kHz). In this presentation, we review the evolution of the dataset definition during LHC run I, and we discuss the plans for run II.

  7. Network Intrusion Dataset Assessment

    DTIC Science & Technology

    2013-03-01

    Security, 6(1):173–180, October 2009. abs/0911.0787. 70 • Jungsuk Song, Hiroki Takakura, Yasuo Okabe, and Koji Nakao. “Toward a more practical...Inoue, and Koji Nakao. “Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation”. BADGERS ’11: Proceedings of

  8. Medical Image Data and Datasets in the Era of Machine Learning-Whitepaper from the 2016 C-MIMI Meeting Dataset Session.

    PubMed

    Kohli, Marc D; Summers, Ronald M; Geis, J Raymond

    2017-08-01

    At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, and a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities.

  9. National Hydrography Dataset Plus (NHDPlus)

    EPA Pesticide Factsheets

    The NHDPlus Version 1.0 is an integrated suite of application-ready geospatial datasets that incorporate many of the best features of the National Hydrography Dataset (NHD) and the National Elevation Dataset (NED). The NHDPlus includes a stream network (based on the 1:100,000-scale NHD) with improved networking, naming, and value-added attributes (VAAs). NHDPlus also includes elevation-derived catchments (drainage areas) produced using a drainage-enforcement technique first broadly applied in New England, and thus dubbed the New England Method. This technique involves burning in the 1:100,000-scale NHD and, when available, building walls using the national Watershed Boundary Dataset (WBD). The resulting modified digital elevation model (HydroDEM) is used to produce hydrologic derivatives that agree with the NHD and WBD. An interdisciplinary team from the U.S. Geological Survey (USGS), U.S. Environmental Protection Agency (USEPA), and contractors has, over the last two years, found this method to produce the best-quality NHD catchments using an automated process. The VAAs include greatly enhanced capabilities for upstream and downstream navigation, analysis and modeling. Examples include: retrieve all flowlines (predominantly confluence-to-confluence stream segments) and catchments upstream of a given flowline using queries rather than by slower flowline-by-flowline navigation; retrieve flowlines by stream order; subset a stream level path sorted in hydrologic order for st

  10. Visualization of conserved structures by fusing highly variable datasets.

    PubMed

    Silverstein, Jonathan C; Chhadia, Ankur; Dech, Fred

    2002-01-01

    Skill, effort, and time are required to identify and visualize anatomic structures in three-dimensions from radiological data. Fundamentally, automating these processes requires a technique that uses symbolic information not in the dynamic range of the voxel data. We were developing such a technique based on mutual information for automatic multi-modality image fusion (MIAMI Fuse, University of Michigan). This system previously demonstrated facility at fusing one voxel dataset with integrated symbolic structure information to a CT dataset (different scale and resolution) from the same person. The next step of development of our technique was aimed at accommodating the variability of anatomy from patient to patient by using warping to fuse our standard dataset to arbitrary patient CT datasets. A standard symbolic information dataset was created from the full color Visible Human Female by segmenting the liver parenchyma, portal veins, and hepatic veins and overwriting each set of voxels with a fixed color. Two arbitrarily selected patient CT scans of the abdomen were used for reference datasets. We used the warping functions in MIAMI Fuse to align the standard structure data to each patient scan. The key to successful fusion was the focused use of multiple warping control points that place themselves around the structure of interest automatically. The user assigns only a few initial control points to align the scans. Fusion 1 and 2 transformed the atlas with 27 points around the liver to CT1 and CT2 respectively. Fusion 3 transformed the atlas with 45 control points around the liver to CT1 and Fusion 4 transformed the atlas with 5 control points around the portal vein. The CT dataset is augmented with the transformed standard structure dataset, such that the warped structure masks are visualized in combination with the original patient dataset. This combined volume visualization is then rendered interactively in stereo on the ImmersaDesk in an immersive Virtual

  11. Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset

    DTIC Science & Technology

    2014-12-23

    publications for benchmarking prognostics algorithms. The turbofan degradation datasets have received over seven thousand unique downloads in the last five...approaches that researchers have taken to implement prognostics using these turbofan datasets. Some unique characteristics of these datasets are also...Description of the five turbofan degradation datasets available from NASA repository. Datasets #Fault Modes #Conditions #Train Units #Test Units

  12. SisFall: A Fall and Movement Dataset

    PubMed Central

    Sucerquia, Angela; López, José David; Vargas-Bonilla, Jesús Francisco

    2017-01-01

    Research on fall and movement detection with wearable devices has witnessed promising growth. However, there are few publicly available datasets, all recorded with smartphones, and they are insufficient for testing new proposals because of their unrepresentative populations, limited range of performed activities, and limited accompanying information. Here, we present a dataset of falls and activities of daily living (ADLs) acquired with a self-developed device composed of two types of accelerometer and one gyroscope. It consists of 19 ADLs and 15 fall types performed by 23 young adults, 15 ADL types performed by 14 healthy and independent participants over 62 years old, and data from one 60-year-old participant who performed all ADLs and falls. These activities were selected based on a survey and a literature analysis. We test the dataset with widely used feature extraction and a simple-to-implement threshold-based classification, achieving up to 96% accuracy in fall detection. An individual activity analysis demonstrates that most errors are concentrated in a small number of activities, on which new approaches could be focused. Finally, validation tests with elderly people significantly reduced the fall-detection performance of the tested features. This validates the findings of other authors and encourages the development of new strategies, with this new dataset as the benchmark. PMID:28117691

  13. SisFall: A Fall and Movement Dataset.

    PubMed

    Sucerquia, Angela; López, José David; Vargas-Bonilla, Jesús Francisco

    2017-01-20

    Research on fall and movement detection with wearable devices has witnessed promising growth. However, there are few publicly available datasets, all recorded with smartphones, and they are insufficient for testing new proposals because of their unrepresentative populations, limited range of performed activities, and limited accompanying information. Here, we present a dataset of falls and activities of daily living (ADLs) acquired with a self-developed device composed of two types of accelerometer and one gyroscope. It consists of 19 ADLs and 15 fall types performed by 23 young adults, 15 ADL types performed by 14 healthy and independent participants over 62 years old, and data from one 60-year-old participant who performed all ADLs and falls. These activities were selected based on a survey and a literature analysis. We test the dataset with widely used feature extraction and a simple-to-implement threshold-based classification, achieving up to 96% accuracy in fall detection. An individual activity analysis demonstrates that most errors are concentrated in a small number of activities, on which new approaches could be focused. Finally, validation tests with elderly people significantly reduced the fall-detection performance of the tested features. This validates the findings of other authors and encourages the development of new strategies, with this new dataset as the benchmark.
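    The kind of threshold-based classification this record describes can be illustrated with a minimal sketch: compute the acceleration-magnitude swing in a sliding window and flag a fall when it exceeds a threshold. The window length, step, and 2 g threshold below are illustrative assumptions, not SisFall's tuned parameters.

```python
# Minimal threshold-based fall detector on triaxial accelerometer data.
import numpy as np

def detect_fall(acc, window=128, threshold=2.0):
    """acc: (N, 3) accelerometer samples in g; returns True if any
    half-overlapping window shows a magnitude swing above `threshold` g."""
    mag = np.linalg.norm(acc, axis=1)
    for start in range(0, len(mag) - window + 1, window // 2):
        w = mag[start:start + window]
        if w.max() - w.min() > threshold:
            return True
    return False

# Synthetic example: quiet standing (~1 g) with one brief impact spike.
signal = np.tile([0.0, 0.0, 1.0], (256, 1))
signal[120] = [0.5, 0.5, 3.5]          # impact sample
print(detect_fall(signal))             # True
```

Real pipelines would add the feature extraction step the abstract mentions (e.g. per-window statistics) before thresholding, but the peak-to-peak magnitude test captures the core idea.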

  14. Generation and evaluation of typical meteorological year datasets for greenhouse and external conditions on the Mediterranean coast.

    PubMed

    Fernández, M D; López, J C; Baeza, E; Céspedes, A; Meca, D E; Bailey, B

    2015-08-01

    A typical meteorological year (TMY) represents the typical meteorological conditions over many years but still contains the short term fluctuations which are absent from long-term averaged data. Meteorological data were measured at the Experimental Station of Cajamar 'Las Palmerillas' (Cajamar Foundation) in Almeria, Spain, over 19 years at the meteorological station and in a reference greenhouse which is typical of those used in the region. The two sets of measurements were subjected to quality control analysis and then used to create TMY datasets using three different methodologies proposed in the literature. Three TMY datasets were generated for the external conditions and two for the greenhouse. They were assessed by using each as input to seven horticultural models and comparing the model results with those obtained by experiment in practical trials. In addition, the models were used with the meteorological data recorded during the trials. A scoring system was used to identify the best performing TMY in each application and then rank them in overall performance. The best methodology was that of Argiriou for both greenhouse and external conditions. The average relative errors between the seasonal values estimated using the 19-year dataset and those using the Argiriou greenhouse TMY were 2.2 % (reference evapotranspiration), -0.45 % (pepper crop transpiration), 3.4 % (pepper crop nitrogen uptake) and 0.8 % (green bean yield). The values obtained using the Argiriou external TMY were 1.8 % (greenhouse reference evapotranspiration), 0.6 % (external reference evapotranspiration), 4.7 % (greenhouse heat requirement) and 0.9 % (loquat harvest date). Using the models with the 19 individual years in the historical dataset showed that the year to year weather variability gave results which differed from the average values by ± 15 %. By comparison with results from other greenhouses it was shown that the greenhouse TMY is applicable to greenhouses which have a solar

  15. Generation and evaluation of typical meteorological year datasets for greenhouse and external conditions on the Mediterranean coast

    NASA Astrophysics Data System (ADS)

    Fernández, M. D.; López, J. C.; Baeza, E.; Céspedes, A.; Meca, D. E.; Bailey, B.

    2015-08-01

    A typical meteorological year (TMY) represents the typical meteorological conditions over many years but still contains the short term fluctuations which are absent from long-term averaged data. Meteorological data were measured at the Experimental Station of Cajamar `Las Palmerillas' (Cajamar Foundation) in Almeria, Spain, over 19 years at the meteorological station and in a reference greenhouse which is typical of those used in the region. The two sets of measurements were subjected to quality control analysis and then used to create TMY datasets using three different methodologies proposed in the literature. Three TMY datasets were generated for the external conditions and two for the greenhouse. They were assessed by using each as input to seven horticultural models and comparing the model results with those obtained by experiment in practical trials. In addition, the models were used with the meteorological data recorded during the trials. A scoring system was used to identify the best performing TMY in each application and then rank them in overall performance. The best methodology was that of Argiriou for both greenhouse and external conditions. The average relative errors between the seasonal values estimated using the 19-year dataset and those using the Argiriou greenhouse TMY were 2.2 % (reference evapotranspiration), -0.45 % (pepper crop transpiration), 3.4 % (pepper crop nitrogen uptake) and 0.8 % (green bean yield). The values obtained using the Argiriou external TMY were 1.8 % (greenhouse reference evapotranspiration), 0.6 % (external reference evapotranspiration), 4.7 % (greenhouse heat requirement) and 0.9 % (loquat harvest date). Using the models with the 19 individual years in the historical dataset showed that the year to year weather variability gave results which differed from the average values by ± 15 %. By comparison with results from other greenhouses it was shown that the greenhouse TMY is applicable to greenhouses which have a solar
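    TMY methodologies of the kind compared here, including Argiriou's, typically select for each calendar month the candidate month whose empirical cumulative distribution function (CDF) is closest to the long-term CDF, as measured by the Finkelstein-Schafer statistic. The following is an illustrative sketch on synthetic daily temperatures; the data sizes and single-variable selection are assumptions for the example, not the papers' full weighting schemes.

```python
# Finkelstein-Schafer (FS) selection step: pick the candidate month whose
# empirical CDF best matches the long-term CDF for that calendar month.
import numpy as np

def fs_statistic(candidate, long_term):
    """Mean absolute difference between the candidate's empirical CDF and
    the long-term empirical CDF, evaluated at the candidate's values."""
    x = np.sort(candidate)
    cdf_cand = np.arange(1, len(x) + 1) / len(x)
    cdf_long = np.searchsorted(np.sort(long_term), x, side="right") / len(long_term)
    return float(np.mean(np.abs(cdf_cand - cdf_long)))

rng = np.random.default_rng(1)
long_term = rng.normal(20.0, 3.0, size=19 * 31)   # 19 years of daily temps
# Three candidate months: cold-biased, typical, and warm-biased.
years = [rng.normal(20.0 + shift, 3.0, size=31) for shift in (-2.0, 0.0, 2.0)]

scores = [fs_statistic(y, long_term) for y in years]
best = int(np.argmin(scores))   # lowest FS score = most typical month
```

A full TMY build repeats this per variable (temperature, radiation, etc.) with weights, then concatenates the twelve selected months, which preserves the short-term fluctuations that long-term averaging removes.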

  16. Viability of Controlling Prosthetic Hand Utilizing Electroencephalograph (EEG) Dataset Signal

    NASA Astrophysics Data System (ADS)

    Miskon, Azizi; A/L Thanakodi, Suresh; Raihan Mazlan, Mohd; Mohd Haziq Azhar, Satria; Nooraya Mohd Tawil, Siti

    2016-11-01

    This project presents the development of an artificial hand controlled by electroencephalograph (EEG) signal datasets for prosthetic applications. The EEG datasets were used to improve control of the prosthetic hand relative to electromyograph (EMG) control. EMG is disadvantageous for a person who has not used the relevant muscles for a long time, and for people with age-related degenerative conditions; the EEG datasets were therefore investigated as an alternative to EMG. The datasets used in this work were taken from a Brain Computer Interface (BCI) project and were already classified for open, close and combined movement operations. They served as the input for controlling the prosthetic hand through an interface between Microsoft Visual Studio and an Arduino. The results show that the prosthetic hand responds more efficiently and faster to the EEG datasets when an additional LiPo (lithium polymer) battery is attached to the prosthetic. Some limitations were also identified in terms of the hand movements and the weight of the prosthetic, and suggestions for improvement are given in this paper. Overall, the objective of this paper was achieved: the prosthetic hand proved feasible to operate using the EEG datasets.

  17. CIFAR10-DVS: An Event-Stream Dataset for Object Classification

    PubMed Central

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is time consuming, since the streams must be recorded with neuromorphic cameras, and only a limited number of event-stream datasets are currently available. In this work, utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named “CIFAR10-DVS.” The conversion was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversions performed by moving the camera, this image movement is more realistic with respect to practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification. PMID:28611582

  18. CIFAR10-DVS: An Event-Stream Dataset for Object Classification.

    PubMed

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is time consuming, since the streams must be recorded with neuromorphic cameras, and only a limited number of event-stream datasets are currently available. In this work, utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named "CIFAR10-DVS." The conversion was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversions performed by moving the camera, this image movement is more realistic with respect to practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification.
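    The per-pixel quantization of intensity change into events that this record relies on follows the standard DVS model: a pixel emits an ON or OFF event whenever its log intensity drifts by more than a contrast threshold since the last event. This is an illustrative one-pixel sketch; the threshold value and the synthetic intensity trace are assumptions.

```python
# One-pixel DVS model: log-intensity change quantized into ON/OFF events.
import numpy as np

def events_from_trace(intensity, threshold=0.15):
    """intensity: 1-D positive trace for one pixel; returns (t, polarity)."""
    log_i = np.log(intensity)
    ref = log_i[0]                      # reference level at the last event
    events = []
    for t, v in enumerate(log_i):
        while v - ref > threshold:      # brightness increased: ON event
            ref += threshold
            events.append((t, +1))
        while ref - v > threshold:      # brightness decreased: OFF event
            ref -= threshold
            events.append((t, -1))
    return events

# Constant brightness, then a smooth ramp from 1.0 to 2.0.
trace = np.concatenate([np.full(5, 1.0), np.linspace(1.0, 2.0, 5)])
print(events_from_trace(trace))         # → [(6, 1), (7, 1), (8, 1), (9, 1)]
```

The constant segment produces no events at all, while the ramp produces a burst of ON events, which is why the RCLS image movement is needed to generate events from otherwise static frames.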

  19. Finding Spatio-Temporal Patterns in Large Sensor Datasets

    ERIC Educational Resources Information Center

    McGuire, Michael Patrick

    2010-01-01

    Spatial or temporal data mining tasks are performed in the context of the relevant space, defined by a spatial neighborhood, and the relevant time period, defined by a specific time interval. Furthermore, when mining large spatio-temporal datasets, interesting patterns typically emerge where the dataset is most dynamic. This dissertation is…

  20. Heuristics for Relevancy Ranking of Earth Dataset Search Results

    NASA Astrophysics Data System (ADS)

    Lynnes, C.; Quinn, P.; Norton, J.

    2016-12-01

    As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web-page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user-intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as web pages. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time ranges and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.

  1. Heuristics for Relevancy Ranking of Earth Dataset Search Results

    NASA Technical Reports Server (NTRS)

    Lynnes, Christopher; Quinn, Patrick; Norton, James

    2016-01-01

    As the variety of Earth science datasets increases, science researchers find it more challenging to discover and select the datasets that best fit their needs. The most common way for search providers to address this problem is to rank the datasets returned for a query by their likely relevance to the user. Large web page search engines typically use text matching supplemented with reverse link counts, semantic annotations and user intent modeling. However, this produces uneven results when applied to dataset metadata records simply externalized as a web page. Fortunately, data and search providers have decades of experience in serving data user communities, allowing them to form heuristics that leverage the structure in the metadata together with knowledge about the user community. Some of these heuristics include specific ways of matching the user input to the essential measurements in the dataset and determining overlaps of time range and spatial areas. Heuristics based on the novelty of the datasets can prioritize later, better versions of data over similar predecessors. And knowledge of how different user types and communities use data can be brought to bear in cases where characteristics of the user (discipline, expertise) or their intent (applications, research) can be divined. The Earth Observing System Data and Information System has begun implementing some of these heuristics in the relevancy algorithm of its Common Metadata Repository search engine.

  2. The National Hydrography Dataset

    USGS Publications Warehouse

    ,

    1999-01-01

    The National Hydrography Dataset (NHD) is a newly combined dataset that provides hydrographic data for the United States. The NHD is the culmination of recent cooperative efforts of the U.S. Environmental Protection Agency (USEPA) and the U.S. Geological Survey (USGS). It combines elements of USGS digital line graph (DLG) hydrography files and the USEPA Reach File (RF3). The NHD supersedes RF3 and DLG files by incorporating them, not by replacing them. Users of RF3 or DLG files will find the same data in a new, more flexible format. They will find that the NHD is familiar but greatly expanded and refined. The DLG files contribute a national coverage of millions of features, including water bodies such as lakes and ponds, linear water features such as streams and rivers, and also point features such as springs and wells. These files provide standardized feature types, delineation, and spatial accuracy. From RF3, the NHD acquires hydrographic sequencing, upstream and downstream navigation for modeling applications, and reach codes. The reach codes provide a way to integrate data from organizations at all levels by linking the data to this nationally consistent hydrographic network. The feature names are from the Geographic Names Information System (GNIS). The NHD provides comprehensive coverage of hydrographic data for the United States. Some of the anticipated end-user applications of the NHD are multiuse hydrographic modeling and water-quality studies of fish habitats. Although based on 1:100,000-scale data, the NHD is planned so that it can incorporate and encourage the development of the higher resolution data that many users require. The NHD can be used to promote the exchange of data between users at the national, State, and local levels. Many users will benefit from the NHD and will want to contribute to the dataset as well.

  3. Towards interoperable and reproducible QSAR analyses: Exchange of datasets.

    PubMed

    Spjuth, Ola; Willighagen, Egon L; Guha, Rajarshi; Eklund, Martin; Wikberg, Jarl Es

    2010-06-30

    QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, but
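    The exchange format described above is XML-based; as a rough illustration (not the real QSAR-ML schema — the element and attribute names below are invented for the sketch), a dataset entry pairing chemical structures with ontology-referenced, versioned descriptor implementations might be assembled like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of a QSAR-ML-style dataset description.
# Element and attribute names are illustrative, not the real schema.
def build_dataset(structures, descriptors):
    root = ET.Element("qsarDataset")
    s_el = ET.SubElement(root, "structures")
    for sid, smiles in structures:
        ET.SubElement(s_el, "structure", id=sid, smiles=smiles)
    d_el = ET.SubElement(root, "descriptors")
    for ont_id, impl, version in descriptors:
        # Each descriptor points at an ontology entry plus a
        # specific, versioned software implementation.
        ET.SubElement(d_el, "descriptor",
                      ontologyRef=ont_id,
                      implementation=impl,
                      version=version)
    return ET.tostring(root, encoding="unicode")

xml_doc = build_dataset(
    [("s1", "CCO"), ("s2", "c1ccccc1")],
    [("descriptor-ontology:xlogp", "CDK", "1.4.19")],
)
```

    Recording the implementation and version alongside the ontology reference is what makes the setup reproducible: two parties computing the descriptor can verify they ran the same code.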

  4. Towards interoperable and reproducible QSAR analyses: Exchange of datasets

    PubMed Central

    2010-01-01

    Background QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. Results We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Conclusions Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets

  5. Genomic Datasets for Cancer Research

    Cancer.gov

    A variety of datasets from genome-wide association studies of cancer and other genotype-phenotype studies, including sequencing and molecular diagnostic assays, are available to approved investigators through the Extramural National Cancer Institute Data Access Committee.

  6. NHDPlus (National Hydrography Dataset Plus)

    EPA Pesticide Factsheets

    NHDPlus is a geospatial, hydrologic framework dataset that is intended for use by geospatial analysts and modelers to support water resources related applications. NHDPlus was developed by the USEPA in partnership with the U.S. Geological Survey.

  7. VideoWeb Dataset for Multi-camera Activities and Non-verbal Communication

    NASA Astrophysics Data System (ADS)

    Denina, Giovanni; Bhanu, Bir; Nguyen, Hoang Thanh; Ding, Chong; Kamal, Ahmed; Ravishankar, Chinya; Roy-Chowdhury, Amit; Ivers, Allen; Varda, Brenda

    Human-activity recognition is one of the most challenging problems in computer vision. Researchers from around the world have tried to solve this problem and have come a long way in recognizing simple motions and atomic activities. As the computer vision community heads toward fully recognizing human activities, a challenging and labeled dataset is needed. To respond to that need, we collected a dataset of realistic scenarios in a multi-camera network environment (VideoWeb) involving multiple persons performing dozens of different repetitive and non-repetitive activities. This chapter describes the details of the dataset. We believe that this VideoWeb Activities dataset is unique and it is one of the most challenging datasets available today. The dataset is publicly available online at http://vwdata.ee.ucr.edu/ along with the data annotation.

  8. Toward Computational Cumulative Biology by Combining Models of Biological Datasets

    PubMed Central

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database. PMID:25427176
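    The combination model itself is not reproduced here, but the core idea of decomposing a new dataset into weighted contributions from earlier models can be sketched with ordinary least squares as a stand-in (toy data and a deliberately simplified solver, not the paper's method):

```python
import numpy as np

# Toy sketch: express a new dataset's summary vector as a weighted
# combination of components learned from earlier datasets, then rank
# the earlier datasets by the weight they receive. Ordinary least
# squares stands in for the paper's actual combination model.
rng = np.random.default_rng(0)
components = rng.normal(size=(5, 50))   # 5 earlier models, 50 features
true_w = np.array([0.0, 0.7, 0.0, 0.3, 0.0])
new_dataset = true_w @ components       # built from models 1 and 3

w, *_ = np.linalg.lstsq(components.T, new_dataset, rcond=None)
ranking = np.argsort(-np.abs(w))        # most relevant earlier datasets first
```

    In this toy case the recovered weights identify models 1 and 3 as the relevant prior work; the data-driven "links" in the abstract are, conceptually, the non-negligible weights in such a decomposition.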

  9. Toward computational cumulative biology by combining models of biological datasets.

    PubMed

    Faisal, Ali; Peltonen, Jaakko; Georgii, Elisabeth; Rung, Johan; Kaski, Samuel

    2014-01-01

    A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations; for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.

  10. Improving the discoverability, accessibility, and citability of omics datasets: a case report.

    PubMed

    Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J

    2017-03-01

    Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities.

  11. A dataset of forest biomass structure for Eurasia.

    PubMed

    Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael

    2017-05-16

    The most comprehensive dataset of in situ destructive sampling measurements of forest biomass in Eurasia has been compiled from a combination of experiments undertaken by the authors and from scientific publications. Biomass is reported as four components: live trees (stem, bark, branches, foliage, roots); understory (above- and below ground); green forest floor (above- and below ground); and coarse woody debris (snags, logs, dead branches of living trees and dead roots), consisting of 10,351 unique records of sample plots and 9,613 sample trees from ca 1,200 experiments for the period 1930-2014 where there is overlap between these two datasets. The dataset also contains other forest stand parameters such as tree species composition, average age, tree height, growing stock volume, etc., when available. Such a dataset can be used for the development of models of biomass structure, biomass extension factors, change detection in biomass structure, investigations into biodiversity and species distribution and the biodiversity-productivity relationship, as well as the assessment of the carbon pool and its dynamics, among many others.

  12. A reanalysis dataset of the South China Sea

    PubMed Central

    Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu

    2014-01-01

    Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992–2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability. PMID:25977803

  13. A dataset of forest biomass structure for Eurasia

    NASA Astrophysics Data System (ADS)

    Schepaschenko, Dmitry; Shvidenko, Anatoly; Usoltsev, Vladimir; Lakyda, Petro; Luo, Yunjian; Vasylyshyn, Roman; Lakyda, Ivan; Myklush, Yuriy; See, Linda; McCallum, Ian; Fritz, Steffen; Kraxner, Florian; Obersteiner, Michael

    2017-05-01

    The most comprehensive dataset of in situ destructive sampling measurements of forest biomass in Eurasia has been compiled from a combination of experiments undertaken by the authors and from scientific publications. Biomass is reported as four components: live trees (stem, bark, branches, foliage, roots); understory (above- and below ground); green forest floor (above- and below ground); and coarse woody debris (snags, logs, dead branches of living trees and dead roots), consisting of 10,351 unique records of sample plots and 9,613 sample trees from ca 1,200 experiments for the period 1930-2014 where there is overlap between these two datasets. The dataset also contains other forest stand parameters such as tree species composition, average age, tree height, growing stock volume, etc., when available. Such a dataset can be used for the development of models of biomass structure, biomass extension factors, change detection in biomass structure, investigations into biodiversity and species distribution and the biodiversity-productivity relationship, as well as the assessment of the carbon pool and its dynamics, among many others.

  14. Optimizing tertiary storage organization and access for spatio-temporal datasets

    NASA Technical Reports Server (NTRS)

    Chen, Ling Tony; Rotem, Doron; Shoshani, Arie; Drach, Bob; Louis, Steve; Keating, Meridith

    1994-01-01

    We address in this paper data management techniques for efficiently retrieving requested subsets of large datasets stored on mass storage devices. This problem represents a major bottleneck that can negate the benefits of fast networks, because the time to access a subset from a large dataset stored on a mass storage system is much greater than the time to transmit that subset over a network. This paper focuses on very large spatial and temporal datasets generated by simulation programs in the area of climate modeling, but the techniques developed can be applied to other applications that deal with large multidimensional datasets. The main requirement we have addressed is the efficient access of subsets of information contained within much larger datasets, for the purpose of analysis and interactive visualization. We have developed data partitioning techniques that partition datasets into 'clusters' based on analysis of data access patterns and storage device characteristics. The goal is to minimize the number of clusters read from mass storage systems when subsets are requested. We emphasize in this paper proposed enhancements to current storage server protocols to permit control over physical placement of data on storage devices. We also discuss in some detail the aspects of the interface between the application programs and the mass storage system, as well as a workbench to help scientists to design the best reorganization of a dataset for anticipated access patterns.
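    As a toy illustration of the partitioning idea (not the authors' actual algorithm), chunks that historical requests frequently touch together can be packed greedily into fixed-size clusters, so that a typical subset request reads fewer clusters from tape:

```python
from collections import Counter
from itertools import combinations

# Toy sketch of access-pattern-driven partitioning: chunks that are
# frequently requested together are packed into the same fixed-size
# cluster, minimizing the clusters a typical request touches.
requests = [{0, 1, 2}, {0, 1}, {2, 3}, {0, 1, 2}, {4, 5}]
co_access = Counter()
for req in requests:
    for a, b in combinations(sorted(req), 2):
        co_access[(a, b)] += 1

clusters, placed = [], set()
for (a, b), _ in co_access.most_common():   # strongest pairs first
    if a not in placed and b not in placed:
        clusters.append({a, b})             # pair fills a 2-chunk cluster
        placed.update((a, b))
for chunk in {c for r in requests for c in r} - placed:
    clusters.append({chunk})                # leftovers get their own cluster

def clusters_read(request):
    # Number of clusters that must be fetched to serve this subset.
    return sum(1 for cl in clusters if cl & request)
```

    With this layout the common request {0, 1} touches a single cluster instead of two; real systems also weigh storage device characteristics (seek and mount costs) when sizing clusters.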

  15. Sparse Group Penalized Integrative Analysis of Multiple Cancer Prognosis Datasets

    PubMed Central

    Liu, Jin; Huang, Jian; Xie, Yang; Ma, Shuangge

    2014-01-01

    SUMMARY In cancer research, high-throughput profiling studies have been extensively conducted, searching for markers associated with prognosis. Because of the “large d, small n” characteristic, results generated from the analysis of a single dataset can be unsatisfactory. Recent studies have shown that integrative analysis, which simultaneously analyzes multiple datasets, can be more effective than single-dataset analysis and classic meta-analysis. In most existing integrative analyses, the homogeneity model has been assumed, which postulates that different datasets share the same set of markers. Several approaches have been designed to reinforce this assumption. In practice, different datasets may differ in terms of patient selection criteria, profiling techniques, and many other aspects. Such differences may make the homogeneity model too restricted. In this study, we assume the heterogeneity model, under which different datasets are allowed to have different sets of markers. With multiple cancer prognosis datasets, we adopt the AFT (accelerated failure time) model to describe survival. This model may have the lowest computational cost among popular semiparametric survival models. For marker selection, we adopt a sparse group MCP (minimax concave penalty) approach. This approach has an intuitive formulation and can be computed using an effective group coordinate descent algorithm. Simulation study shows that it outperforms the existing approaches under both the homogeneity and heterogeneity models. Data analysis further demonstrates the merit of the heterogeneity model and the proposed approach. PMID:23938111
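    The MCP itself has a simple closed form: it grows like the lasso penalty near zero but flattens out at |t| = γλ, so large coefficients escape extra shrinkage. A minimal sketch of the scalar penalty (the sparse group variant applies it at both the group and within-group levels, which is not shown here):

```python
def mcp_penalty(t, lam, gamma):
    # Minimax concave penalty (MCP): lasso-like near zero, then
    # constant at gamma * lam**2 / 2 beyond |t| = gamma * lam,
    # so large coefficients are not over-shrunk.
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam
```

    At the knot |t| = γλ the two branches agree (both equal γλ²/2), so the penalty is continuous.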

  16. Assessment of the NASA-USGS Global Land Survey (GLS) Datasets

    USGS Publications Warehouse

    Gutman, Garik; Huang, Chengquan; Chander, Gyanesh; Noojipady, Praveen; Masek, Jeffery G.

    2013-01-01

    The Global Land Survey (GLS) datasets are a collection of orthorectified, cloud-minimized Landsat-type satellite images, providing near complete coverage of the global land area decadally since the early 1970s. The global mosaics are centered on 1975, 1990, 2000, 2005, and 2010, and consist of data acquired from four sensors: Enhanced Thematic Mapper Plus, Thematic Mapper, Multispectral Scanner, and Advanced Land Imager. The GLS datasets have been widely used in land-cover and land-use change studies at local, regional, and global scales. This study evaluates the GLS datasets with respect to their spatial coverage, temporal consistency, geodetic accuracy, radiometric calibration consistency, image completeness, extent of cloud contamination, and residual gaps. In general, the three latest GLS datasets are of a better quality than the GLS-1990 and GLS-1975 datasets, with most of the imagery (85%) having cloud cover of less than 10%, the acquisition years clustered much more tightly around their target years, better co-registration relative to GLS-2000, and better radiometric absolute calibration. Probably the most significant impediment to scientific use of the datasets is the variability of image phenology (i.e., acquisition day of year). This paper provides end-users with an assessment of the quality of the GLS datasets for specific applications, and where possible, suggestions for mitigating their deficiencies.

  17. Brown CA et al 2016 Dataset

    EPA Pesticide Factsheets

    This dataset contains the research described in the following publication: Brown, C.A., D. Sharp, and T. Mochon Collura. 2016. Effect of Climate Change on Water Temperature and Attainment of Water Temperature Criteria in the Yaquina Estuary, Oregon (USA). Estuarine, Coastal and Shelf Science 169:136-146, doi: 10.1016/j.ecss.2015.11.006.

  18. Conducting high-value secondary dataset analysis: an introductory guide and resources.

    PubMed

    Smith, Alexander K; Ayanian, John Z; Covinsky, Kenneth E; Landon, Bruce E; McCarthy, Ellen P; Wee, Christina C; Steinman, Michael A

    2011-08-01

    Secondary analyses of large datasets provide a mechanism for researchers to address high impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium (www.sgim.org/go/datasets). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.

  19. Generation of openEHR Test Datasets for Benchmarking.

    PubMed

    El Helou, Samar; Karvonen, Tuukka; Yamamoto, Goshiro; Kume, Naoto; Kobayashi, Shinji; Kondo, Eiji; Hiragi, Shusuke; Okamoto, Kazuya; Tamura, Hiroshi; Kuroda, Tomohiro

    2017-01-01

    openEHR is a widely used EHR specification. Given its technology-independent nature, different approaches for implementing openEHR data repositories exist. Public openEHR datasets are needed to conduct benchmark analyses over different implementations. To address their current unavailability, we propose a method for generating openEHR test datasets that can be publicly shared and used.

  20. Parallel task processing of very large datasets

    NASA Astrophysics Data System (ADS)

    Romig, Phillip Richardson, III

    This research concerns the use of distributed computer technologies for the analysis and management of very large datasets. Improvements in sensor technology, an emphasis on global change research, and greater access to data warehouses all increase the number of non-traditional users of remotely sensed data. We present a framework for distributed solutions to the challenges of datasets which exceed the online storage capacity of individual workstations. This framework, called parallel task processing (PTP), incorporates both the task- and data-level parallelism exemplified by many image processing operations. An implementation based on the principles of PTP, called Tricky, is also presented. Additionally, we describe the challenges and practical issues in modeling the performance of parallel task processing with large datasets. We present a mechanism for estimating the running time of each unit of work within a system and an algorithm that uses these estimates to simulate the execution environment and produce estimated runtimes. Finally, we describe and discuss experimental results which validate the design. Specifically, the system (a) is able to perform computation on datasets which exceed the capacity of any one disk, (b) provides reduction of overall computation time as a result of the task distribution even with the additional cost of data transfer and management, and (c) in the simulation mode accurately predicts the performance of the real execution environment.
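    The runtime-prediction idea can be sketched with a toy scheduler: given per-task estimates, assign each task to the least-loaded worker and report the simulated finish time. This is a plain longest-processing-time greedy, not the dissertation's simulator, which also models data transfer and storage access:

```python
import heapq

# Toy runtime simulation for parallel task processing: each task's
# estimated running time is known; tasks go to the least-loaded
# worker, longest first, and the makespan is the predicted runtime.
def simulate(task_estimates, n_workers):
    heap = [(0.0, w) for w in range(n_workers)]  # (load, worker id)
    heapq.heapify(heap)
    for est in sorted(task_estimates, reverse=True):
        load, w = heapq.heappop(heap)            # least-loaded worker
        heapq.heappush(heap, (load + est, w))
    return max(load for load, _ in heap)         # simulated finish time
```

    Comparing such simulated runtimes against measured ones is one way to validate the per-task estimates, as point (c) of the abstract describes.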

  1. Wind and wave dataset for Matara, Sri Lanka

    NASA Astrophysics Data System (ADS)

    Luo, Yao; Wang, Dongxiao; Priyadarshana Gamage, Tilak; Zhou, Fenghua; Madusanka Widanage, Charith; Liu, Taiwei

    2018-01-01

    We present a continuous in situ hydro-meteorology observational dataset from a set of instruments first deployed in December 2012 in the south of Sri Lanka, facing toward the north Indian Ocean. In these waters, simultaneous records of wind and wave data are sparse due to difficulties in deploying measurement instruments, although the area hosts one of the busiest shipping lanes in the world. This study describes the survey, deployment, and measurements of wind and waves, with the aim of offering future users of the dataset the most comprehensive information possible. This dataset advances our understanding of the nearshore hydrodynamic processes and wave climate, including sea waves and swells, in the north Indian Ocean. Moreover, it is a valuable resource for ocean model parameterization and validation. The archived dataset (Table 1) is examined in detail, including wave data at two locations with water depths of 20 and 10 m comprising synchronous time series of wind, ocean astronomical tide, air pressure, etc. In addition, we use these wave observations to evaluate the ERA-Interim reanalysis product. Based on Buoy 2 data, the swells are the main component of waves year-round, although monsoons can markedly alter the proportion between swell and wind sea. The dataset (Luo et al., 2017) is publicly available from Science Data Bank (https://doi.org/10.11922/sciencedb.447).

  2. A Spitzer View of Star Formation in the Cygnus X North Complex

    DTIC Science & Technology

    2009-11-10

    A combination of IRAC, MIPS, UKIRT Infrared Deep Sky Survey (UKIDSS, Lawrence et al. 2007; Lucas et al. 2008), and Two Micron All Sky Survey (2MASS, Skrutskie et al. 2006) data are used to identify and classify young stellar objects. Of the 8,231 sources detected exhibiting infrared excess in Cygnus X...

  3. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance.

    PubMed

    Timme, Ruth E; Rand, Hugh; Shumway, Martin; Trees, Eija K; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E; Defibaugh-Chavez, Stephanie; Carleton, Heather A; Klimke, William A; Katz, Lee S

    2017-01-01

    As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens ( Listeria monocytogenes , Salmonella enterica , Escherichia coli , and Campylobacter jejuni ) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross

  4. Exploring Galaxy Formation and Evolution via Structural Decomposition

    NASA Astrophysics Data System (ADS)

    Kelvin, Lee; Driver, Simon; Robotham, Aaron; Hill, David; Cameron, Ewan

    2010-06-01

    The Galaxy And Mass Assembly (GAMA) structural decomposition pipeline (GAMA-SIGMA, Structural Investigation of Galaxies via Model Analysis) will provide multi-component information for a sample of ~12,000 galaxies across 9 bands ranging from near-UV to near-IR. This will allow the relationship between structural properties and broadband, optical-to-near-IR, spectral energy distributions of bulge, bar, and disk components to be explored, revealing clues as to the history of baryonic mass assembly within a hierarchical clustering framework. Data are initially taken from the SDSS & UKIDSS-LAS surveys to test the robustness of our automated decomposition pipeline. These will eventually be replaced with data from the forthcoming higher-resolution VST & VISTA surveys, expanding the sample to ~30,000 galaxies.

  5. Using Graph Indices for the Analysis and Comparison of Chemical Datasets.

    PubMed

    Fourches, Denis; Tropsha, Alexander

    2013-10-01

    In cheminformatics, compounds are represented as points in a multidimensional space of chemical descriptors. When all pairs of points found within a certain distance threshold in the original high dimensional chemistry space are connected by distance-labeled edges, the resulting data structure can be defined as a Dataset Graph (DG). We show that, similarly to the conventional description of organic molecules, many graph indices can be computed for DGs as well. We demonstrate that chemical datasets can be effectively characterized and compared by computing simple graph indices such as the average vertex degree or the Randić connectivity index. This approach is used to characterize and quantify the similarity between different datasets or subsets of the same dataset (e.g., training, test, and external validation sets used in QSAR modeling). The freely available ADDAGRA program has been implemented to build and visualize DGs. The approach proposed and discussed in this report could be further explored and utilized for different cheminformatics applications such as dataset diversification by acquiring external compounds, dataset processing prior to QSAR modeling, or (dis)similarity modeling of multiple datasets studied in chemical genomics applications. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
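    The two indices named above are straightforward to compute once the Dataset Graph is built. A minimal pure-Python sketch (this is not the ADDAGRA implementation; the distance threshold and toy 2-D "descriptor vectors" are illustrative only):

```python
import math
from itertools import combinations

def dataset_graph(points, threshold):
    """Connect every pair of descriptor vectors within `threshold`
    (Euclidean distance); adjacency is returned as a dict of sets."""
    adj = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def randic_index(adj):
    """Randic connectivity index: sum over edges of 1/sqrt(deg(u)*deg(v))."""
    total = 0.0
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:  # count each undirected edge once
                total += 1.0 / math.sqrt(len(adj[u]) * len(adj[v]))
    return total

# toy "dataset" in a 2-D descriptor space: a tight cluster plus an outlier
g = dataset_graph([(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0)], 0.5)
print(average_degree(g), randic_index(g))  # → 1.5 1.5
```

    Comparing such index values between, e.g., a training set and an external set gives a single-number summary of how differently the two sets populate descriptor space.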

  6. Interactive visualization and analysis of multimodal datasets for surgical applications.

    PubMed

    Kirmizibayrak, Can; Yim, Yeny; Wakid, Mike; Hahn, James

    2012-12-01

    Surgeons use information from multiple sources when making surgical decisions. These include volumetric datasets (such as CT, PET, MRI, and their variants), 2D datasets (such as endoscopic videos), and vector-valued datasets (such as computer simulations). Presenting all the information to the user in an effective manner is a challenging problem. In this paper, we present a visualization approach that displays the information from various sources in a single coherent view. The system allows the user to explore and manipulate volumetric datasets, display analysis of dataset values in local regions, combine 2D and 3D imaging modalities and display results of vector-based computer simulations. Several interaction methods are discussed: in addition to traditional interfaces including mouse and trackers, gesture-based natural interaction methods are shown to control these visualizations with real-time performance. An example of a medical application (medialization laryngoplasty) is presented to demonstrate how the combination of different modalities can be used in a surgical setting with our approach.

  7. Five year global dataset: NMC operational analyses (1978 to 1982)

    NASA Technical Reports Server (NTRS)

    Straus, David; Ardizzone, Joseph

    1987-01-01

    This document describes procedures used in assembling a five year dataset (1978 to 1982) using NMC Operational Analysis data. These procedures entailed replacing missing and unacceptable data in order to arrive at a complete dataset that is continuous in time. In addition, a subjective assessment on the integrity of all data (both preliminary and final) is presented. Documentation on tapes comprising the Five Year Global Dataset is also included.

  8. Exploring patterns enriched in a dataset with contrastive principal component analysis.

    PubMed

    Abid, Abubakar; Zhang, Martin J; Bagaria, Vivek K; Zou, James

    2018-05-30

    Visualization and exploration of high-dimensional data is a ubiquitous challenge across disciplines. Widely used techniques such as principal component analysis (PCA) aim to identify dominant trends in one dataset. However, in many settings we have datasets collected under different conditions, e.g., a treatment and a control experiment, and we are interested in visualizing and exploring patterns that are specific to one dataset. This paper proposes a method, contrastive principal component analysis (cPCA), which identifies low-dimensional structures that are enriched in a dataset relative to comparison data. In a wide variety of experiments, we demonstrate that cPCA with a background dataset enables us to visualize dataset-specific patterns missed by PCA and other standard methods. We further provide a geometric interpretation of cPCA and strong mathematical guarantees. An implementation of cPCA is publicly available, and can be used for exploratory data analysis in many applications where PCA is currently used.
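    The core of cPCA is finding directions v that maximize the contrastive variance v'(C_target − α·C_background)v. A minimal 2-D sketch using a grid search over unit vectors (the released cPCA package solves this with an eigendecomposition instead; the toy data below are invented for illustration):

```python
import math

def covariance(data):
    """2x2 covariance (cxx, cyy, cxy) of mean-centered 2-D samples."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    cxx = sum((x - mx) ** 2 for x, _ in data) / n
    cyy = sum((y - my) ** 2 for _, y in data) / n
    cxy = sum((x - mx) * (y - my) for x, y in data) / n
    return cxx, cyy, cxy

def cpca_direction(target, background, alpha, steps=3600):
    """Unit vector maximizing v'(C_target - alpha*C_background)v."""
    txx, tyy, txy = covariance(target)
    bxx, byy, bxy = covariance(background)
    dxx, dyy, dxy = txx - alpha * bxx, tyy - alpha * byy, txy - alpha * bxy
    best, best_v = -float("inf"), (1.0, 0.0)
    for k in range(steps):
        t = math.pi * k / steps  # directions are sign-symmetric
        vx, vy = math.cos(t), math.sin(t)
        score = dxx * vx * vx + 2 * dxy * vx * vy + dyy * vy * vy
        if score > best:
            best, best_v = score, (vx, vy)
    return best_v

# background varies only along x; target additionally varies along y
background = [(-2.0, 0.0), (-1.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
target = [(-2.0, 0.5), (-1.0, -0.5), (0.0, 0.5), (1.0, -0.5), (2.0, 0.5)]
```

    With α = 0 this reduces to ordinary PCA and picks the dominant (shared) x direction; with α = 1 the shared variation cancels and the target-specific y direction emerges, which is exactly the contrastive behavior the paper describes.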

  9. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare.

    PubMed

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-07-02

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of the data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a "data modeler" tool. The proposed tool implements a user-centric, priority-based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources used to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To demonstrate the significance of the unified dataset, we adopted a well-known rough set theory based rule-creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool reduces, on average, 94.1% of the time and effort of the experts and knowledge engineer while creating unified datasets.

  10. GUDM: Automatic Generation of Unified Datasets for Learning and Reasoning in Healthcare

    PubMed Central

    Ali, Rahman; Siddiqi, Muhammad Hameed; Idris, Muhammad; Ali, Taqdir; Hussain, Shujaat; Huh, Eui-Nam; Kang, Byeong Ho; Lee, Sungyoung

    2015-01-01

    A wide array of biomedical data are generated and made available to healthcare experts. However, due to the diverse nature of the data, it is difficult to predict outcomes from them. It is therefore necessary to combine these diverse data sources into a single unified dataset. This paper proposes a global unified data model (GUDM) to provide a global unified data structure for all data sources and generate a unified dataset by a “data modeler” tool. The proposed tool implements a user-centric, priority-based approach which can easily resolve the problems of unified data modeling and overlapping attributes across multiple datasets. The tool is illustrated using sample diabetes mellitus data. The diverse data sources used to generate the unified dataset for diabetes mellitus include clinical trial information, a social media interaction dataset and physical activity data collected using different sensors. To demonstrate the significance of the unified dataset, we adopted a well-known rough set theory based rule-creation process to create rules from the unified dataset. The evaluation of the tool on six different sets of locally created diverse datasets shows that the tool reduces, on average, 94.1% of the time and effort of the experts and knowledge engineer while creating unified datasets. PMID:26147731

  11. Geoseq: a tool for dissecting deep-sequencing datasets.

    PubMed

    Gurtowski, James; Cancio, Anthony; Shah, Hardik; Levovitz, Chaya; George, Ajish; Homann, Robert; Sachidanandam, Ravi

    2010-10-12

    Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), the Sequence Read Archive (SRA) hosted by the NCBI, and the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Geoseq http://geoseq.mssm.edu provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries, and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
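    The tile-and-coverage idea can be illustrated with a naive substring count; Geoseq itself queries suffix arrays built over pre-computed libraries for speed, so this is only a conceptual sketch with invented sequences:

```python
def tile_coverage(reference, reads, k=8):
    """Split `reference` into overlapping k-mer tiles and count how many
    reads contain each tile. Naive lookup for illustration only; a suffix
    array over the read library makes this sublinear per tile."""
    tiles = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    return [(tile, sum(tile in read for read in reads)) for tile in tiles]

# toy example: a short reference queried against two "reads"
coverage = tile_coverage("ACGTACGT", ["ACGTA", "TTTT"], k=4)
```

    Plotting the per-tile counts along the reference is what produces the coverage profiles Geoseq returns, e.g. exon-level dips that reveal differential isoform expression.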

  12. A Research Graph dataset for connecting research data repositories using RD-Switchboard.

    PubMed

    Aryani, Amir; Poblet, Marta; Unsworth, Kathryn; Wang, Jingbo; Evans, Ben; Devaraju, Anusuriya; Hausstein, Brigitte; Klas, Claus-Peter; Zapilko, Benjamin; Kaplun, Samuele

    2018-05-29

    This paper describes the open access graph dataset that shows the connections between Dryad, CERN, ANDS and other international data repositories to publications and grants across multiple research data infrastructures. The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) with the aim to discover and connect the related research datasets based on publication co-authorship or jointly funded grants. The graph dataset allows researchers to trace and follow the paths to understanding a body of work. By mapping the links between research datasets and related resources, the graph dataset improves both their discovery and visibility, while avoiding duplicate efforts in data creation. Ultimately, the linked datasets may spur novel ideas, facilitate reproducibility and re-use in new applications, stimulate combinatorial creativity, and foster collaborations across institutions.

  13. Dataset for forensic analysis of B-tree file system.

    PubMed

    Wani, Mohamad Ahtisham; Bhat, Wasim Ahmad

    2018-06-01

    Since the B-tree file system (Btrfs) is set to become the de facto standard file system on Linux (and Linux based) operating systems, a Btrfs dataset for forensic analysis is of great interest and immense value to the forensic community. This article presents a novel dataset for forensic analysis of Btrfs that was collected using a proposed data-recovery procedure. The dataset identifies various generalized and common file system layouts and operations, specific node-balancing mechanisms triggered, logical addresses of various data structures, on-disk records, recovered data as directory entries and extent data from leaf and internal nodes, and percentage of data recovered.

  14. Process mining in oncology using the MIMIC-III dataset

    NASA Astrophysics Data System (ADS)

    Prima Kurniati, Angelina; Hall, Geoff; Hogg, David; Johnson, Owen

    2018-03-01

    Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology, with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option, and this paper describes the potential to use MIMIC-III for process mining in oncology. MIMIC-III is a large open access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them have used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining, and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.
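    The first step in most process-discovery algorithms applied to such event tables is counting directly-follows relations between activities in each patient trace. A minimal sketch (the pathway activity names are invented, and this is a generic discovery step, not the L* lifecycle method itself):

```python
from collections import Counter

def directly_follows(event_log):
    """Count how often activity b directly follows activity a across all
    traces; this frequency table is the basis of simple discovered models."""
    dfg = Counter()
    for trace in event_log:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

# hypothetical cancer-pathway traces (one list of activities per patient)
log = [
    ["referral", "diagnosis", "surgery", "follow-up"],
    ["referral", "diagnosis", "chemotherapy", "follow-up"],
    ["referral", "diagnosis", "surgery", "follow-up"],
]
dfg = directly_follows(log)
```

    Edges with high counts (here, referral→diagnosis in all three traces) form the backbone of the discovered pathway model, while rare edges often point to data quality issues of the kind the paper discusses.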

  15. Microarray Analysis Dataset

    EPA Pesticide Factsheets

    This file contains a link for Gene Expression Omnibus and the GSE designations for the publicly available gene expression data used in the study and reflected in Figures 6 and 7 of the Das et al. paper. This dataset is associated with the following publication: Das, K., C. Wood, M. Lin, A.A. Starkov, C. Lau, K.B. Wallace, C. Corton, and B. Abbott. Perfluoroalkyl acids-induced liver steatosis: Effects on genes controlling lipid homeostasis. TOXICOLOGY. Elsevier Science Ltd, New York, NY, USA, 378: 32-52, (2017).

  16. A comparison of public datasets for acceleration-based fall detection.

    PubMed

    Igual, Raul; Medrano, Carlos; Plaza, Inmaculada

    2015-09-01

    Falls are one of the leading causes of mortality among the older population, and rapid detection of a fall is a key factor in mitigating its main adverse health consequences. In this context, several authors have conducted studies on acceleration-based fall detection using external accelerometers or smartphones. The published detection rates are diverse, sometimes close to a perfect detector. This divergence may be explained by the difficulty of comparing different fall detection studies fairly, since each study uses its own dataset obtained under different conditions. In this regard, several datasets have recently been made publicly available. This paper presents a comparison, to the best of our knowledge for the first time, of these public fall detection datasets in order to determine whether they have an influence on the declared performances. Using two different detection algorithms, the study shows that the performances of the fall detection techniques are affected, to a greater or lesser extent, by the specific datasets used to validate them. We have also found large differences in the generalization capability of a fall detector depending on the dataset used for training. In fact, the performance decreases dramatically when the algorithms are tested on a dataset different from the one used for training. Other characteristics of the datasets, such as the number of training samples, also influence performance, while the algorithms seem less sensitive to the sampling frequency or the acceleration range. Copyright © 2015 IPEM. Published by Elsevier Ltd. All rights reserved.
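    A common baseline among the acceleration-based detectors evaluated on such datasets thresholds the acceleration magnitude and then checks for post-impact inactivity. A minimal sketch with illustrative, unvalidated threshold values (not either of the two algorithms the paper actually compares):

```python
import math

def detect_falls(samples, g=9.81, impact=3.0, still=0.3, window=10):
    """Flag sample index i as a fall when the acceleration magnitude
    exceeds `impact`*g (impact spike) and the following `window` samples
    stay within `still`*g of 1 g (post-fall lying still).
    Thresholds are illustrative only."""
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    falls = []
    for i, m in enumerate(mags):
        if m > impact * g:
            after = mags[i + 1:i + 1 + window]
            if after and all(abs(a - g) < still * g for a in after):
                falls.append(i)
    return falls

# synthetic trace: resting at 1 g, one impact spike, then lying still
samples = [(0.0, 0.0, 9.81)] * 10 + [(0.0, 0.0, 40.0)] + [(0.0, 0.0, 9.81)] * 10
```

    The paper's central point is visible even in this toy: the chosen `impact` and `still` thresholds that work on one dataset's acceleration range and sampling frequency may transfer poorly to another.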

  17. SAR image classification based on CNN in real and simulation datasets

    NASA Astrophysics Data System (ADS)

    Peng, Lijiang; Liu, Ming; Liu, Xiaohua; Dong, Liquan; Hui, Mei; Zhao, Yuejin

    2018-04-01

    Convolutional neural networks (CNNs) have achieved great success in image classification tasks. Even in the field of synthetic aperture radar automatic target recognition (SAR-ATR), state-of-the-art results have been obtained by learning deep feature representations on the MSTAR benchmark. However, the raw MSTAR data are problematic for training a SAR-ATR model because of the high similarity in background among the SAR images of each class, which means the CNN would learn the feature hierarchies of the backgrounds as well as of the targets. To assess the influence of the background, additional SAR image datasets were created containing simulated SAR images of 10 manufactured targets, such as tanks and fighter aircraft, with backgrounds sampled from the original MSTAR data. The simulated datasets include one in which the backgrounds of each image class correspond to one class of MSTAR backgrounds or clutter, and one in which each image is given a random background drawn from all MSTAR targets or clutter. In addition, mixed datasets of MSTAR and simulated data were created for use in the experiments. The CNN architecture proposed in this paper is trained on all datasets mentioned above. The experimental results show that the architecture achieves high performance on all datasets even when the image backgrounds are miscellaneous, which indicates that it learns a good representation of the targets despite drastic changes in background.

  18. On sample size and different interpretations of snow stability datasets

    NASA Astrophysics Data System (ADS)

    Schirmer, M.; Mitterer, C.; Schweizer, J.

    2009-04-01

    Interpretations of snow stability variations need an assessment of the stability itself, independent of the scale investigated in the study. Studies on stability variations at a regional scale have often chosen stability tests such as the Rutschblock test or combinations of various tests in order to detect differences in aspect and elevation. The question arose: 'How capable are such stability interpretations in drawing conclusions?' There are at least three possible error sources: (i) the variance of the stability test itself; (ii) the stability variance at an underlying slope scale; and (iii) that the stability interpretation might not be directly related to the probability of skier triggering. Various stability interpretations have been proposed in the past that provide partly different results. We compared a subjective one based on expert knowledge with a more objective one based on a measure derived from comparing skier-triggered slopes vs. slopes that have been skied but not triggered. In this study, the uncertainties are discussed and their effects on regional scale stability variations are quantified in a pragmatic way. An existing dataset with very large sample sizes was revisited. This dataset contained the variance of stability at a regional scale for several situations. The stability in this dataset was determined using the subjective interpretation scheme based on expert knowledge. The question to be answered was how many measurements were needed to obtain similar results (mainly stability differences in aspect or elevation) as with the complete dataset. The optimal sample size was obtained in several ways: (i) assuming a nominal data scale, the sample size was determined with a given test, significance level and power, and by calculating the mean and standard deviation of the complete dataset. With this method it can also be determined whether the complete dataset consists of an appropriate sample size. (ii) Smaller subsets were created with similar
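    For approach (i), the per-group sample size needed to detect a given mean difference at a chosen significance level and power follows from the standard two-sample z-approximation, n = 2((z_{1-α/2} + z_{1-β})σ/Δ)². A minimal sketch of that textbook formula (not the authors' exact procedure):

```python
import math
from statistics import NormalDist

def sample_size(delta, sigma, alpha=0.05, power=0.8):
    """Per-group n to detect a mean difference `delta` between two groups
    with common standard deviation `sigma`, two-sided z-test approximation:
    n = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)**2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# detecting a one-standard-deviation difference needs ~16 samples per group;
# halving the detectable difference roughly quadruples the requirement
```

    Running the formula against the mean and standard deviation of the complete stability dataset, as the abstract describes, tells you whether the available measurements already meet this minimum.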

  19. Really big data: Processing and analysis of large datasets

    USDA-ARS?s Scientific Manuscript database

    Modern animal breeding datasets are large and getting larger, due in part to the recent availability of DNA data for many animals. Computational methods for efficiently storing and analyzing those data are under development. The amount of storage space required for such datasets is increasing rapidl...

  20. A polymer dataset for accelerated property prediction and design.

    PubMed

    Huan, Tran Doan; Mannodi-Kanakkithodi, Arun; Kim, Chiho; Sharma, Vinit; Pilania, Ghanshyam; Ramprasad, Rampi

    2016-03-01

    Emerging computation- and data-driven approaches are particularly useful for rationally designing materials with targeted properties. Generally, these approaches rely on identifying structure-property relationships by learning from a dataset of sufficiently large number of relevant materials. The learned information can then be used to predict the properties of materials not already in the dataset, thus accelerating the materials design. Herein, we develop a dataset of 1,073 polymers and related materials and make it available at http://khazana.uconn.edu/. This dataset is uniformly prepared using first-principles calculations with structures obtained either from other sources or by using structure search methods. Because the immediate target of this work is to assist the design of high dielectric constant polymers, it is initially designed to include the optimized structures, atomization energies, band gaps, and dielectric constants. It will be progressively expanded by accumulating new materials and including additional properties calculated for the optimized structures provided.

  1. A polymer dataset for accelerated property prediction and design

    DOE PAGES

    Huan, Tran Doan; Mannodi-Kanakkithodi, Arun; Kim, Chiho; ...

    2016-03-01

    Emerging computation- and data-driven approaches are particularly useful for rationally designing materials with targeted properties. Generally, these approaches rely on identifying structure-property relationships by learning from a dataset of sufficiently large number of relevant materials. The learned information can then be used to predict the properties of materials not already in the dataset, thus accelerating the materials design. Herein, we develop a dataset of 1,073 polymers and related materials and make it available at http://khazana.uconn.edu/. This dataset is uniformly prepared using first-principles calculations with structures obtained either from other sources or by using structure search methods. Because the immediate target of this work is to assist the design of high dielectric constant polymers, it is initially designed to include the optimized structures, atomization energies, band gaps, and dielectric constants. As a result, it will be progressively expanded by accumulating new materials and including additional properties calculated for the optimized structures provided.

  2. A robust dataset-agnostic heart disease classifier from Phonocardiogram.

    PubMed

    Banerjee, Rohan; Dutta Choudhury, Anirban; Deshpande, Parijat; Bhattacharya, Sakyajit; Pal, Arpan; Mandana, K M

    2017-07-01

    Automatic classification of normal and abnormal heart sounds is a popular area of research. However, building a robust algorithm unaffected by signal quality and patient demography is a challenge. In this paper we have analysed a wide list of Phonocardiogram (PCG) features in the time and frequency domains, along with morphological and statistical features, to construct a robust and discriminative feature set for dataset-agnostic classification of normal and cardiac patients. The large and open access database made available in the Physionet 2016 challenge was used for feature selection, internal validation and creation of training models. A second dataset of 41 PCG segments, collected using our in-house smartphone-based digital stethoscope at an Indian hospital, was used for performance evaluation. Our proposed methodology yielded sensitivity and specificity scores of 0.76 and 0.75 respectively on the test dataset in classifying cardiovascular diseases. The methodology also outperformed three popular prior art approaches when applied on the same dataset.

  3. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance

    PubMed Central

    Rand, Hugh; Shumway, Martin; Trees, Eija K.; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E.; Defibaugh-Chavez, Stephanie; Carleton, Heather A.; Klimke, William A.; Katz, Lee S.

    2017-01-01

    Background As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. Methods We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Results Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the “known tree” can be accurately called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. Discussion These five benchmark datasets will help standardize comparison of current and future phylogenomic

  4. Determining Scale-dependent Patterns in Spatial and Temporal Datasets

    NASA Astrophysics Data System (ADS)

    Roy, A.; Perfect, E.; Mukerji, T.; Sylvester, L.

    2016-12-01

    Spatial and temporal datasets of interest to Earth scientists often contain plots of one variable against another, e.g., rainfall magnitude vs. time or fracture aperture vs. spacing. Such data, comprised of distributions of events along a transect / timeline along with their magnitudes, can display persistent or antipersistent trends, as well as random behavior, that may contain signatures of underlying physical processes. Lacunarity is a technique that was originally developed for multiscale analysis of data. In a recent study we showed that lacunarity can be used for revealing changes in scale-dependent patterns in fracture spacing data. Here we present a further improvement in our technique, with lacunarity applied to various non-binary datasets comprised of event spacings and magnitudes. We test our technique on a set of four synthetic datasets, three of which are based on an autoregressive model and have magnitudes at every point along the "timeline" thus representing antipersistent, persistent, and random trends. The fourth dataset is made up of five clusters of events, each containing a set of random magnitudes. The concept of lacunarity ratio, LR, is introduced; this is the lacunarity of a given dataset normalized to the lacunarity of its random counterpart. It is demonstrated that LR can successfully delineate scale-dependent changes in terms of antipersistence and persistence in the synthetic datasets. This technique is then applied to three different types of data: a hundred-year rainfall record from Knoxville, TN, USA, a set of varved sediments from Marca Shale, and a set of fracture aperture and spacing data from NE Mexico. While the rainfall data and varved sediments both appear to be persistent at small scales, at larger scales they both become random. On the other hand, the fracture data shows antipersistence at small scale (within cluster) and random behavior at large scales. 
Such differences in behavior with respect to scale-dependent changes in
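    Gliding-box lacunarity and the lacunarity ratio LR introduced above can be sketched in a few lines; the box width and the synthetic clustered signal below are illustrative only, not the datasets from the study:

```python
import random

def lacunarity(signal, box):
    """Gliding-box lacunarity: second moment of the box masses divided by
    the squared first moment. Equals 1 for a perfectly uniform signal;
    larger values indicate gappier, more clustered mass distributions."""
    masses = [sum(signal[i:i + box]) for i in range(len(signal) - box + 1)]
    mean = sum(masses) / len(masses)
    second = sum(m * m for m in masses) / len(masses)
    return second / (mean * mean)

def lacunarity_ratio(signal, box, shuffles=200, seed=0):
    """LR: lacunarity of the data normalized by the mean lacunarity of
    shuffled copies of the same values (its 'random counterpart')."""
    rng = random.Random(seed)
    rand_total = 0.0
    for _ in range(shuffles):
        s = signal[:]
        rng.shuffle(s)
        rand_total += lacunarity(s, box)
    return lacunarity(signal, box) / (rand_total / shuffles)

# synthetic clustered events: bursts of magnitude separated by quiet gaps
clustered = ([1.0] * 4 + [0.0] * 8) * 3
```

    LR > 1 at a given box width signals clustering (persistence) relative to a random arrangement at that scale, LR near 1 signals random behavior, and LR < 1 antipersistence, matching the scale-dependent classification applied to the rainfall, varve, and fracture datasets.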

  5. An assessment of differences in gridded precipitation datasets in complex terrain

    NASA Astrophysics Data System (ADS)

    Henn, Brian; Newman, Andrew J.; Livneh, Ben; Daly, Christopher; Lundquist, Jessica D.

    2018-01-01

    Hydrologic modeling and other geophysical applications are sensitive to precipitation forcing data quality, and there are known challenges in spatially distributing gauge-based precipitation over complex terrain. We conduct a comparison of six high-resolution, daily and monthly gridded precipitation datasets over the Western United States. We compare the long-term average spatial patterns, and interannual variability of water-year total precipitation, as well as multi-year trends in precipitation across the datasets. We find that the greatest absolute differences among datasets occur in high-elevation areas and in the maritime mountain ranges of the Western United States, while the greatest percent differences among datasets relative to annual total precipitation occur in arid and rain-shadowed areas. Differences between datasets in some high-elevation areas exceed 200 mm yr-1 on average, and relative differences range from 5 to 60% across the Western United States. In areas of high topographic relief, true uncertainties and biases are likely higher than the differences among the datasets; we present evidence of this based on streamflow observations. Precipitation trends in the datasets differ in magnitude and sign at smaller scales, and are sensitive to how temporal inhomogeneities in the underlying precipitation gauge data are handled.

  6. Accuracy assessment of the U.S. Geological Survey National Elevation Dataset, and comparison with other large-area elevation datasets: SRTM and ASTER

    USGS Publications Warehouse

    Gesch, Dean B.; Oimoen, Michael J.; Evans, Gayla A.

    2014-01-01

    The National Elevation Dataset (NED) is the primary elevation data product produced and distributed by the U.S. Geological Survey. The NED provides seamless raster elevation data of the conterminous United States, Alaska, Hawaii, U.S. island territories, Mexico, and Canada. The NED is derived from diverse source datasets that are processed to a specification with consistent resolutions, coordinate system, elevation units, and horizontal and vertical datums. The NED serves as the elevation layer of The National Map, and it provides basic elevation information for earth science studies and mapping applications in the United States and most of North America. An important part of supporting scientific and operational use of the NED is provision of thorough dataset documentation including data quality and accuracy metrics. The focus of this report is on the vertical accuracy of the NED and on comparison of the NED with other similar large-area elevation datasets, namely data from the Shuttle Radar Topography Mission (SRTM) and the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER).

  7. Benchmarking of Typical Meteorological Year datasets dedicated to Concentrated-PV systems

    NASA Astrophysics Data System (ADS)

    Realpe, Ana Maria; Vernay, Christophe; Pitaval, Sébastien; Blanc, Philippe; Wald, Lucien; Lenoir, Camille

    2016-04-01

    Accurate analysis of meteorological and pyranometric data over the long term is the basis of decision-making for banks and investors regarding solar energy conversion systems. This has led to the development of methodologies for the generation of Typical Meteorological Year (TMY) datasets. The most widely used method for solar energy conversion systems was proposed in 1978 by the Sandia Laboratory (Hall et al., 1978); it considers a specific weighted combination of different meteorological variables, notably global horizontal, diffuse horizontal and direct normal irradiances, air temperature, wind speed and relative humidity. In 2012, a new approach was proposed in the framework of the European FP7 project ENDORSE. It introduced the concept of a "driver", defined by the user as an explicit function of the relevant pyranometric and meteorological variables, to improve the representativeness of TMY datasets with respect to the specific solar energy conversion system of interest. The present study aims at comparing and benchmarking different TMY datasets for a specific Concentrated-PV (CPV) system as the solar energy conversion system of interest. Using long-term (15+ years) time series of high-quality meteorological and pyranometric ground measurements, three types of TMY datasets were generated by the following methods: the Sandia method, a simplified driver with DNI as the only representative variable, and a more sophisticated driver. The latter takes into account the sensitivities of the CPV system to the spectral distribution of the solar irradiance and to wind speed. Different TMY datasets from the three methods have been generated considering different numbers of years in the historical dataset, ranging from 5 to 15 years. The comparisons and benchmarking of these TMY datasets are conducted against the long-term time series of simulated CPV electric production as a reference. The results of this benchmarking clearly show that the Sandia method is not
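    At the core of Sandia-style TMY month selection is the Finkelstein-Schafer (FS) statistic, which scores how closely one candidate month's empirical CDF of a variable tracks the long-term CDF; the "drivers" described above generalize which variables feed that score. A rough sketch under the assumption of daily DNI as the single variable, using synthetic data rather than the study's measurements:

```python
import numpy as np

def fs_statistic(candidate, long_term):
    """Finkelstein-Schafer statistic: mean absolute difference between the
    candidate month's empirical CDF and the long-term CDF, evaluated at the
    candidate's daily values. Smaller = more 'typical' month."""
    lt = np.sort(long_term)
    cand = np.sort(candidate)
    cdf_cand = np.arange(1, len(cand) + 1) / len(cand)          # candidate ECDF
    cdf_lt = np.searchsorted(lt, cand, side="right") / len(lt)  # long-term CDF at same points
    return np.mean(np.abs(cdf_cand - cdf_lt))

# Synthetic daily DNI (kWh/m2): 15 pooled Januaries vs. two candidate months.
rng = np.random.default_rng(0)
long_term = rng.normal(5.0, 1.0, size=15 * 31)
typical = rng.normal(5.0, 1.0, size=31)    # drawn from the long-term distribution
anomalous = rng.normal(6.5, 1.0, size=31)  # unusually sunny month

print(fs_statistic(typical, long_term))    # small: good TMY candidate
print(fs_statistic(anomalous, long_term))  # large: rejected as atypical
```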

  8. NCAR's Research Data Archive: OPeNDAP Access for Complex Datasets

    NASA Astrophysics Data System (ADS)

    Dattore, R.; Worley, S. J.

    2014-12-01

    Many datasets have complex structures, including hundreds of parameters and numerous vertical levels, grid resolutions, and temporal products. Making these data accessible is a challenge for a data provider. OPeNDAP is a powerful protocol for delivering, in real time, multi-file datasets that can be ingested by many analysis and visualization tools, but for these datasets there are too many choices about how to aggregate. Simple aggregation schemes can fail to support many potential studies based on complex datasets, or at least make them very challenging. We address this issue by using a rich file-content metadata collection to create a real-time, customized OPeNDAP service that matches the full suite of access possibilities for complex datasets. The Climate Forecast System Reanalysis (CFSR) and its extension, the Climate Forecast System Version 2 (CFSv2), produced by the National Centers for Environmental Prediction (NCEP) and hosted by the Research Data Archive (RDA) at the Computational and Information Systems Laboratory (CISL) at NCAR, are examples of complex datasets that are difficult to aggregate with existing data server software. CFSR and CFSv2 contain 141 distinct parameters on 152 vertical levels, six grid resolutions, and 36 products (analyses, n-hour forecasts, multi-hour averages, etc.), where not all parameter/level combinations are available at all grid resolution/product combinations. These data are archived in the RDA with the data structure provided by the producer; no additional re-organization or aggregation has been applied. Since 2011, users have been able to request customized subsets (e.g., temporal, parameter, spatial) of the CFSR/CFSv2, which are processed in delayed mode and then downloaded to a user's system. Until now, this complexity has made it difficult to provide real-time OPeNDAP access to the data. We have developed a service that leverages the already-existing subsetting interface and allows users to create a virtual dataset
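    The aggregation difficulty described above stems from sparse availability: not every (parameter, level) pair exists at every (grid, product) combination. A hypothetical sketch, with invented names rather than the RDA's actual metadata schema, of the kind of availability index such a customized OPeNDAP service might consult before building a virtual dataset:

```python
# Hypothetical availability index for a CFSR-like archive: which
# (parameter, level) pairs exist at which (grid, product) combinations.
availability = {
    ("TMP", "500mb"): {("0.5deg", "analysis"), ("0.5deg", "6h_fcst"),
                       ("2.5deg", "analysis")},
    ("TMP", "2m"):    {("0.5deg", "analysis")},
    ("UGRD", "500mb"): {("0.5deg", "analysis"), ("2.5deg", "analysis")},
}

def resolvable(parameter, level, grid, product):
    """Check whether a requested virtual-dataset slice can be aggregated."""
    return (grid, product) in availability.get((parameter, level), set())

def grids_for(parameter, level):
    """List the grid/product combinations offered for one parameter/level."""
    return sorted(availability.get((parameter, level), set()))

print(resolvable("TMP", "500mb", "0.5deg", "analysis"))  # available
print(resolvable("TMP", "2m", "2.5deg", "analysis"))     # this combination does not exist
```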

  9. Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset

    NASA Technical Reports Server (NTRS)

    Ramasso, Emmanuel; Saxena, Abhinav

    2014-01-01

    Benchmarking of prognostic algorithms has been challenging due to the limited availability of common datasets suitable for prognostics. In an attempt to alleviate this problem, several benchmarking datasets have been collected by NASA's Prognostics Center of Excellence and made available to the Prognostics and Health Management (PHM) community to allow evaluation and comparison of prognostic algorithms. Among those datasets are five C-MAPSS datasets that have been extremely popular due to their unique characteristics, which make them suitable for prognostics. The C-MAPSS datasets pose several challenges that have been tackled by different methods in the PHM literature. In particular, management of high variability due to sensor noise, effects of operating conditions, and the presence of multiple simultaneous fault modes are some factors that have a great impact on the generalization capabilities of prognostic algorithms. More than 70 publications have used the C-MAPSS datasets for developing data-driven prognostic algorithms. The C-MAPSS datasets are also shown to be well suited for the development of new machine learning and pattern recognition tools for several key preprocessing steps such as feature extraction and selection, failure mode assessment, operating conditions assessment, health status estimation, uncertainty management, and prognostics performance evaluation. This paper summarizes a comprehensive literature review of publications using the C-MAPSS datasets and provides guidelines and references for further usage of these datasets in a manner that allows clear and consistent comparison between different approaches.
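    A preprocessing step common to many of the surveyed data-driven approaches is deriving a remaining-useful-life (RUL) label for each cycle, since C-MAPSS training units run to failure at their last recorded cycle. A minimal sketch with toy records, not the actual C-MAPSS file format:

```python
from collections import defaultdict

# Toy (unit_id, cycle) records, standing in for rows of a C-MAPSS training
# file where each unit runs to failure at its last recorded cycle.
records = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]

# Failure time of each unit = its maximum observed cycle.
max_cycle = defaultdict(int)
for unit, cycle in records:
    max_cycle[unit] = max(max_cycle[unit], cycle)

# RUL label for every record: cycles remaining until that unit's failure.
rul = [max_cycle[unit] - cycle for unit, cycle in records]
print(rul)  # [2, 1, 0, 1, 0]
```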

  10. Las ideologias, las ciencias naturales y sus implicaciones en la educacion cientifica (Ideologies, the natural sciences, and their implications for science education)

    NASA Astrophysics Data System (ADS)

    Lozada Roldan, Sandra

    This study probed the epistemological conceptions of secondary-level science teachers with respect to ideologies and the natural sciences. It also examined the teachers' positions on public issues related to science. For the purposes of this study, the questionnaire with which the results were obtained was designed and validated. The research is quantitative, with a survey design. The questionnaire was administered at several professional development activities for science teachers. A total of 78 secondary-level teachers answered the questionnaire. Descriptive statistics, such as frequency distribution and percentage, were used to analyze the data obtained. In addition, codes and categories were established to describe the teachers' positions on public issues related to science. The analyses showed that certain adequate epistemological conceptions about the natural sciences prevail among the teachers participating in this study, in light of the literature consulted. Among these conceptions the following stand out: a) the materialist philosophy of the natural sciences, b) the tentative and constructivist nature of scientific knowledge, c) the use of a methodology that guarantees a certain degree of objectivity and with which scientific statements are justified and validated, and d) the instrumental function of scientific knowledge. However, certain erroneous epistemological conceptions about the natural sciences also prevail among the teachers participating in this study, in light of the literature consulted. Among these the following stand out: a) an inductivist tendency in which scientific theories begin with observations that establish generalizations, and b) a hierarchical sequence of the scientific method. 
    In addition, adequate epistemological conceptions prevail among the teachers participating in this study

  11. Toxics Release Inventory Chemical Hazard Information Profiles (TRI-CHIP) Dataset

    EPA Pesticide Factsheets

    The Toxics Release Inventory (TRI) Chemical Hazard Information Profiles (TRI-CHIP) dataset contains hazard information about the chemicals reported in TRI. Users can use this XML-format dataset to create their own databases and hazard analyses of TRI chemicals. The hazard information is compiled from a series of authoritative sources including the Integrated Risk Information System (IRIS). The dataset is provided as a downloadable .zip file that when extracted provides XML files and schemas for the hazard information tables.
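    Since the TRI-CHIP data ship as XML plus schemas, building a local hazard database starts with parsing. A hedged sketch using Python's standard library; the element and attribute names below are invented placeholders, and the real ones must be taken from the schemas in the distributed .zip file:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking a TRI-CHIP-style layout (invented tags).
xml_text = """
<chemicals>
  <chemical casrn="50-00-0" name="Formaldehyde">
    <hazard source="IRIS" endpoint="carcinogenicity"/>
  </chemical>
  <chemical casrn="71-43-2" name="Benzene">
    <hazard source="IRIS" endpoint="hematotoxicity"/>
  </chemical>
</chemicals>
"""

root = ET.fromstring(xml_text)
# Build a CAS-number -> hazard-endpoint index for a local hazard analysis.
index = {c.get("casrn"): [h.get("endpoint") for h in c.findall("hazard")]
         for c in root.findall("chemical")}
print(index["71-43-2"])  # ['hematotoxicity']
```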

  12. Using mixture-tuned match filtering to measure changes in subpixel vegetation area in Las Vegas, Nevada

    NASA Astrophysics Data System (ADS)

    Brelsford, Christa; Shepherd, Doug

    2014-01-01

    In desert cities, accurate measurements of vegetation area within residential lots are necessary to understand drivers of change in water consumption. Most residential lots are smaller than an individual 30-m pixel from Landsat satellite images and have a mixture of vegetation and other land covers. Quantifying vegetation change in this environment requires estimating subpixel vegetation area. Mixture-tuned match filtering (MTMF) has been successfully used for subpixel target detection. There have been few successful applications of MTMF to subpixel abundance estimation because the relationship observed between MTMF estimates and ground measurements of abundance is noisy. We use a ground truth dataset over 10 times larger than that available for any previous MTMF application to estimate the bias between ground data and MTMF results. We find that MTMF underestimates the fractional area of vegetation by 5% to 10% and show that averaging over multiple pixels is necessary to reduce noise in the dataset. We conclude that MTMF is a viable technique for fractional area estimation when a large dataset is available for calibration. When this method is applied to estimating vegetation area in Las Vegas, Nevada, spatial and temporal trends are consistent with expectations from known population growth and policy changes.
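    The two corrections the study applies, averaging MTMF scores over a lot's pixels to suppress noise and offsetting the underestimate using ground-truth calibration, can be sketched as follows. The scores are invented, and the additive bias term is an assumption for illustration (the study reports a 5% to 10% underestimate without prescribing this exact correction form):

```python
import numpy as np

# Hypothetical MTMF matched-filter scores (fractional vegetation estimates)
# for the Landsat pixels covering one residential lot.
mtmf_scores = np.array([0.18, 0.31, 0.22, 0.27, 0.25, 0.21])

# Averaging over multiple pixels suppresses per-pixel noise...
lot_mean = mtmf_scores.mean()

# ...and a calibration offset estimated from ground truth removes the bias.
BIAS = 0.07  # assumed additive underestimate (7%), from a calibration dataset
vegetation_fraction = lot_mean + BIAS
print(vegetation_fraction)
```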

  13. ESSG-based global spatial reference frame for datasets interrelation

    NASA Astrophysics Data System (ADS)

    Yu, J. Q.; Wu, L. X.; Jia, Y. J.

    2013-10-01

    To understand the highly complex Earth system, a large volume, and a large variety, of datasets on the planet Earth are being obtained, distributed, and shared worldwide every day. However, few existing systems concentrate on the distribution and interrelation of different datasets in a common Global Spatial Reference Frame (GSRF), which poses an invisible obstacle to data sharing and scientific collaboration. The Group on Earth Observations (GEO) has recently established a new GSRF, named the Earth System Spatial Grid (ESSG), for global dataset distribution, sharing and interrelation in its 2012-2015 Work Plan. The ESSG may bridge the gap among different spatial datasets and hence overcome these obstacles. This paper presents the implementation of the ESSG-based GSRF. A reference spheroid, a grid subdivision scheme, and a suitable encoding system are required to implement it. The radius of the ESSG reference spheroid was set to twice the approximate Earth radius so that datasets from the different domains of Earth system science are covered. The same positioning and orientation parameters as the Earth-Centred Earth-Fixed (ECEF) frame were adopted for the ESSG reference spheroid, so that any other GSRF can be freely transformed into the ESSG-based GSRF. The Spheroid Degenerated Octree Grid with Radius refinement (SDOG-R) and its encoding method were taken as the grid subdivision and encoding scheme for their good performance in many respects. A triple (C, T, A) model is introduced to represent and link different datasets based on the ESSG-based GSRF. Finally, methods of coordinate transformation between the ESSG-based GSRF and other GSRFs are presented to make the ESSG-based GSRF operational and easy to adopt.

  14. Atlas-Guided Cluster Analysis of Large Tractography Datasets

    PubMed Central

    Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer

    2013-01-01

    Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses an hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment. PMID:24386292

  15. Atlas-guided cluster analysis of large tractography datasets.

    PubMed

    Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer

    2013-01-01

    Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses an hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.

  16. Management and assimilation of diverse, distributed watershed datasets

    NASA Astrophysics Data System (ADS)

    Varadharajan, C.; Faybishenko, B.; Versteeg, R.; Agarwal, D.; Hubbard, S. S.; Hendrix, V.

    2016-12-01

    The U.S. Department of Energy's (DOE) Watershed Function Scientific Focus Area (SFA) seeks to determine how perturbations to mountainous watersheds (e.g., floods, drought, early snowmelt) impact the downstream delivery of water, nutrients, carbon, and metals over seasonal to decadal timescales. We are building a software platform that enables integration of diverse and disparate field, laboratory, and simulation datasets, including hydrological, geological, meteorological, geophysical, geochemical, ecological and genomic datasets, across a range of spatial and temporal scales within the Rifle floodplain and the East River watershed, Colorado. We are using agile data management and assimilation approaches to enable web-based integration of heterogeneous, multi-scale data. Sensor-based observations of water level, vadose zone and groundwater temperature, water quality, and meteorology, as well as biogeochemical analyses of soil and groundwater samples, have been curated and archived in federated databases. Quality Assurance and Quality Control (QA/QC) are performed on priority datasets needed for ongoing scientific analyses and for hydrological and geochemical modeling. Automated QA/QC methods are used to identify and flag issues in the datasets. Data integration is achieved via a brokering service that dynamically integrates data from distributed databases via web services, based on user queries. The integrated results are presented to users in a portal that enables intuitive search, interactive visualization and download of integrated datasets. 
The concepts, approaches and codes being used are shared across various data science components of various large DOE-funded projects such as the Watershed Function SFA, Next Generation Ecosystem Experiment (NGEE) Tropics, Ameriflux/FLUXNET, and Advanced Simulation Capability for Environmental Management (ASCEM), and together contribute towards DOE's cyberinfrastructure for data management and model-data integration.

  17. NP_PAH_interaction dataset

    EPA Pesticide Factsheets

    Concentrations of different polyaromatic hydrocarbons in water before and after interaction with nanomaterials. The results show the capacity of engineer nanomaterials for adsorbing different organic pollutants. This dataset is associated with the following publication:Sahle-Demessie, E., A. Zhao, C. Han, B. Hann, and H. Grecsek. Interaction of engineered nanomaterials with hydrophobic organic pollutants.. Journal of Nanotechnology. Hindawi Publishing Corporation, New York, NY, USA, 27(28): 284003, (2016).

  18. BanglaLekha-Isolated: A multi-purpose comprehensive dataset of Handwritten Bangla Isolated characters.

    PubMed

    Biswas, Mithun; Islam, Rafiqul; Shom, Gautam Kumar; Shopon, Md; Mohammed, Nabeel; Momen, Sifat; Abedin, Anowarul

    2017-06-01

    BanglaLekha-Isolated, a Bangla handwritten isolated-character dataset, is presented in this article. This dataset contains 84 different characters, comprising 50 Bangla basic characters, 10 Bangla numerals and 24 selected compound characters. 2000 handwriting samples for each of the 84 characters were collected, digitized and pre-processed. After discarding mistakes and scribbles, 166,105 handwritten character images were included in the final dataset. The dataset also includes labels indicating the age and the gender of the subjects from whom the samples were collected. This dataset could be used not only for optical handwriting recognition research but also to explore the influence of gender and age on handwriting. The dataset is publicly available at https://data.mendeley.com/datasets/hf6sf8zrkc/2.

  19. Scalable persistent identifier systems for dynamic datasets

    NASA Astrophysics Data System (ADS)

    Golodoniuc, P.; Cox, S. J. D.; Klump, J. F.

    2016-12-01

    Reliable and persistent identification of objects, whether tangible or not, is essential in information management. Many Internet-based systems have been developed to identify digital data objects, e.g., PURL, LSID, Handle, ARK. These were largely designed for identification of static digital objects. The amount of data made available online has grown exponentially over the last two decades and fine-grained identification of dynamically generated data objects within large datasets using conventional systems (e.g., PURL) has become impractical. We have compared capabilities of various technological solutions to enable resolvability of data objects in dynamic datasets, and developed a dataset-centric approach to resolution of identifiers. This is particularly important in Semantic Linked Data environments where dynamic frequently changing data is delivered live via web services, so registration of individual data objects to obtain identifiers is impractical. We use identifier patterns and pattern hierarchies for identification of data objects, which allows relationships between identifiers to be expressed, and also provides means for resolving a single identifier into multiple forms (i.e. views or representations of an object). The latter can be implemented through (a) HTTP content negotiation, or (b) use of URI querystring parameters. The pattern and hierarchy approach has been implemented in the Linked Data API supporting the United Nations Spatial Data Infrastructure (UNSDI) initiative and later in the implementation of geoscientific data delivery for the Capricorn Distal Footprints project using International Geo Sample Numbers (IGSN). This enables flexible resolution of multi-view persistent identifiers and provides a scalable solution for large heterogeneous datasets.
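    The pattern-and-hierarchy idea can be illustrated with a toy resolver: identifier patterns are tried in order of specificity, and a suffix (or query-string parameter) selects among multiple views of the same object. The patterns and URLs below are invented placeholders, not the UNSDI or IGSN production rules:

```python
import re

# Hypothetical identifier patterns, ordered from most to least specific.
# One registered pattern resolves a whole family of dynamic data objects.
PATTERNS = [
    (re.compile(r"^igsn/(?P<sample>[A-Z0-9]+)\.(?P<view>json|html)$"),
     "https://example.org/igsn/{sample}?format={view}"),
    (re.compile(r"^igsn/(?P<sample>[A-Z0-9]+)$"),
     "https://example.org/igsn/{sample}"),
]

def resolve(identifier):
    """Resolve an identifier against the pattern hierarchy."""
    for pattern, template in PATTERNS:
        m = pattern.match(identifier)
        if m:
            return template.format(**m.groupdict())
    raise KeyError(identifier)

print(resolve("igsn/AU1234.json"))  # multi-view: suffix selects the representation
print(resolve("igsn/AU1234"))       # default representation
```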

  20. An Intercomparison of Large-Extent Tree Canopy Cover Geospatial Datasets

    NASA Astrophysics Data System (ADS)

    Bender, S.; Liknes, G.; Ruefenacht, B.; Reynolds, J.; Miller, W. P.

    2017-12-01

    As a member of the Multi-Resolution Land Characteristics Consortium (MRLC), the U.S. Forest Service (USFS) is responsible for producing and maintaining the tree canopy cover (TCC) component of the National Land Cover Database (NLCD). The NLCD-TCC data are available for the conterminous United States (CONUS), coastal Alaska, Hawai'i, Puerto Rico, and the U.S. Virgin Islands. The most recent official version of the NLCD-TCC data is based primarily on reference data from 2010-2011 and is part of the multi-component 2011 version of the NLCD. NLCD data are updated on a five-year cycle. The USFS is currently producing the next official version (2016) of the NLCD-TCC data for the United States, and it will be made publicly-available in early 2018. In this presentation, we describe the model inputs, modeling methods, and tools used to produce the 30-m NLCD-TCC data. Several tree cover datasets at 30-m, as well as datasets at finer resolution, have become available in recent years due to advancements in earth observation data and their availability, computing, and sensors. We compare multiple tree cover datasets that have similar resolution to the NLCD-TCC data. We also aggregate the tree class from fine-resolution land cover datasets to a percent canopy value on a 30-m pixel, in order to compare the fine-resolution datasets to the datasets created directly from 30-m Landsat data. The extent of the tree canopy cover datasets included in the study ranges from global and national to the state level. Preliminary investigation of multiple tree cover datasets over the CONUS indicates a high amount of spatial variability. For example, in a comparison of the NLCD-TCC and the Global Land Cover Facility's Landsat Tree Cover Continuous Fields (2010) data by MRLC mapping zones, the zone-level root mean-square deviation ranges from 2% to 39% (mean=17%, median=15%). 
The analysis outcomes are expected to inform USFS decisions with regard to the next cycle (2021) of NLCD-TCC production.
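    The aggregation step described above, turning a fine-resolution categorical tree class into a percent-canopy value on a 30-m pixel, is a block mean over the binary tree mask. A minimal sketch with an invented block of 5-m cells (a 6×6 block of 5-m cells covers one 30-m pixel):

```python
import numpy as np

# Hypothetical 6x6 block of 5-m land-cover cells (1 = tree class, 0 = other),
# covering a single 30-m Landsat pixel.
fine = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
])

# Percent canopy for the 30-m pixel = share of fine cells in the tree class.
percent_canopy = 100.0 * fine.mean()
print(percent_canopy)
```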

  1. Las Vegas Basin Seismic Response Project: Measured Shallow Soil Velocities

    NASA Astrophysics Data System (ADS)

    Luke, B. A.; Louie, J.; Beeston, H. E.; Skidmore, V.; Concha, A.

    2002-12-01

    The Las Vegas valley in Nevada is a deep (up to 5 km) alluvial basin filled with interlayered gravels, sands, and clays. The climate is arid. The water table ranges from a few meters to many tens of meters deep. Laterally extensive thin carbonate-cemented lenses are commonly found across parts of the valley. Lenses range beyond 2 m in thickness, and occur at depths exceeding 200 m. Shallow seismic datasets have been collected at approximately ten sites around the Las Vegas valley, to characterize shear and compression wave velocities in the near surface. Purposes for the surveys include modeling of ground response to dynamic loads, both natural and manmade, quantification of soil stiffness to aid structural foundation design, and non-intrusive materials identification. Borehole-based measurement techniques used include downhole and crosshole, to depths exceeding 100 m. Surface-based techniques used include refraction and three different methods involving inversion of surface-wave dispersion datasets. This latter group includes two active-source techniques, the Spectral Analysis of Surface Waves (SASW) method and the Multi-Channel Analysis of Surface Waves (MASW) method; and a new passive-source technique, the Refraction Microtremor (ReMi) method. Depths to halfspace for the active-source measurements ranged beyond 50 m. The passive-source method constrains shear wave velocities to 100 m depths. As expected, the stiff cemented layers profoundly affect local velocity gradients. Scale effects are evident in comparisons of (1) very local measurements typified by borehole methods, to (2) the broader coverage of the SASW and MASW measurements, to (3) the still broader and deeper resolution made possible by the ReMi measurements. The cemented layers appear as sharp spikes in the downhole datasets and are problematic in crosshole measurements due to refraction. The refraction method is useful only to locate the depth to the uppermost cemented layer. The surface

  2. Scalable Visual Analytics of Massive Textual Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Krishnan, Manoj Kumar; Bohn, Shawn J.; Cowley, Wendy E.

    2007-04-01

    This paper describes the first scalable implementation of a text processing engine used in visual analytics tools. These tools aid information analysts in interacting with and understanding large textual information content through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte datasets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.
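    The general pattern, though not PNNL's actual engine, is data-parallel map-reduce over document chunks: process each document independently, then merge the partial results. A toy sketch with a thread pool standing in for cluster workers:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy corpus standing in for a multi-gigabyte document collection.
documents = [
    "visual analytics of large text corpora",
    "parallel text processing scales with cluster size",
    "large corpora require parallel processing",
]

def term_counts(doc):
    """Per-document term frequencies -- the embarrassingly parallel step."""
    return Counter(doc.split())

# Map each document to a worker, then reduce the partial counts.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(term_counts, documents))

totals = sum(partials, Counter())
print(totals["parallel"])  # 2
```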

  3. Harvard Aging Brain Study: Dataset and accessibility.

    PubMed

    Dagley, Alexander; LaPoint, Molly; Huijbers, Willem; Hedden, Trey; McLaren, Donald G; Chatwal, Jasmeer P; Papp, Kathryn V; Amariglio, Rebecca E; Blacker, Deborah; Rentz, Dorene M; Johnson, Keith A; Sperling, Reisa A; Schultz, Aaron P

    2017-01-01

    The Harvard Aging Brain Study is sharing its data with the global research community. The longitudinal dataset consists of a 284-subject cohort with the following modalities acquired: demographics, clinical assessment, comprehensive neuropsychological testing, clinical biomarkers, and neuroimaging. To promote more extensive analyses, imaging data was designed to be compatible with other publicly available datasets. A cloud-based system enables access to interested researchers with blinded data available contingent upon completion of a data usage agreement and administrative approval. Data collection is ongoing and currently in its fifth year. Copyright © 2015 Elsevier Inc. All rights reserved.

  4. Sensitivity of a numerical wave model on wind re-analysis datasets

    NASA Astrophysics Data System (ADS)

    Lavidas, George; Venugopal, Vengatesan; Friedrich, Daniel

    2017-03-01

    Wind is the dominant process for wave generation. Detailed evaluation of metocean conditions strengthens our understanding of issues concerning potential offshore applications. However, the scarcity of buoys and the high cost of monitoring systems pose a barrier to properly defining offshore conditions. Through the use of numerical wave models, metocean conditions can be hindcasted and forecasted, providing reliable characterisations. This study reports the sensitivity of a numerical wave model to wind inputs for the Scottish region. Two re-analysis wind datasets with different spatio-temporal characteristics are used, the ERA-Interim re-analysis and the CFSR-NCEP re-analysis. Different wind products alter the results, affecting the accuracy obtained. The scope of this study is to assess the different available wind databases and to identify the most appropriate wind dataset for the specific region, in temporal, spatial and geographic terms, for wave modelling and offshore applications. Both wind input datasets delivered wave-model results with good correlation. Wave results from the 1-h dataset have higher peaks and lower biases, at the expense of a higher scatter index; the 6-h dataset has lower scatter but higher biases. The study shows how the wind dataset affects numerical wave modelling performance, and that, depending on location and study needs, different wind inputs should be considered.
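    The bias and scatter-index comparisons reported here follow standard wave-model validation metrics: bias is the mean model-minus-observation difference, and the scatter index is the RMSE normalised by the mean observation. A sketch with invented wave heights, not the study's data:

```python
import math

# Hypothetical significant wave heights (m): buoy observations vs. model
# output forced by one of the two re-analysis wind datasets.
observed = [1.2, 2.5, 3.1, 1.8, 2.0]
modelled = [1.4, 2.3, 3.5, 1.7, 2.2]

n = len(observed)
bias = sum(m - o for m, o in zip(modelled, observed)) / n
rmse = math.sqrt(sum((m - o) ** 2 for m, o in zip(modelled, observed)) / n)
scatter_index = rmse / (sum(observed) / n)  # RMSE normalised by mean observation
print(bias, scatter_index)
```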

  5. Querying Large Biological Network Datasets

    ERIC Educational Resources Information Center

    Gulsoy, Gunhan

    2013-01-01

    New experimental methods have resulted in increasing amounts of genetic interaction data being generated every day. Biological networks are used to store the gathered genetic interaction data. The increasing amount of available data requires fast, large-scale analysis methods. Therefore, we address the problem of querying large biological network datasets.…

  6. Dataset used to improve liquid water absorption models in the microwave

    DOE Data Explorer

    Turner, David

    2015-12-14

    Two datasets, one a compilation of laboratory data and one a compilation from three field sites, are provided here. These datasets provide measurements of the real and imaginary refractive indices and absorption as a function of cloud temperature. These datasets were used in the development of the new liquid water absorption model that was published in Turner et al. 2015.

  7. Primary Datasets for Case Studies of River-Water Quality

    ERIC Educational Resources Information Center

    Goulder, Raymond

    2008-01-01

    Level 6 (final-year BSc) students undertook case studies on between-site and temporal variation in river-water quality. They used professionally-collected datasets supplied by the Environment Agency. The exercise gave students the experience of working with large, real-world datasets and led to their understanding how the quality of river water is…

  8. A dataset of human decision-making in teamwork management.

    PubMed

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-17

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  9. A dataset of human decision-making in teamwork management

    PubMed Central

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members’ capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches. PMID:28094787

  10. A global experimental dataset for assessing grain legume production

    PubMed Central

    Cernay, Charles; Pelzer, Elise; Makowski, David

    2016-01-01

    Grain legume crops are a significant component of the human diet and animal feed and have an important role in the environment, but the global diversity of agricultural legume species is currently underexploited. Experimental assessments of grain legume performances are required, to identify potential species with high yields. Here, we introduce a dataset including results of field experiments published in 173 articles. The selected experiments were carried out over five continents on 39 grain legume species. The dataset includes measurements of grain yield, aerial biomass, crop nitrogen content, residual soil nitrogen content and water use. When available, yields for cereals and oilseeds grown after grain legumes in the crop sequence are also included. The dataset is arranged into a relational database with nine structured tables and 198 standardized attributes. Tillage, fertilization, pest and irrigation management are systematically recorded for each of the 8,581 crop × field site × growing season × treatment combinations. The dataset is freely reusable and easy to update. We anticipate that it will provide valuable information for assessing grain legume production worldwide. PMID:27676125

  11. A dataset of human decision-making in teamwork management

    NASA Astrophysics Data System (ADS)

    Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Chen, Yiqiang; Fauvel, Simon; Lin, Jun; Cui, Lizhen; Pan, Zhengxiang; Yang, Qiang

    2017-01-01

    Today, most endeavours require teamwork by people with diverse skills and characteristics. In managing teamwork, decisions are often made under uncertainty and resource constraints. The strategies and the effectiveness of the strategies different people adopt to manage teamwork under different situations have not yet been fully explored, partially due to a lack of detailed large-scale data. In this paper, we describe a multi-faceted large-scale dataset to bridge this gap. It is derived from a game simulating complex project management processes. It presents the participants with different conditions in terms of team members' capabilities and task characteristics for them to exhibit their decision-making strategies. The dataset contains detailed data reflecting the decision situations, decision strategies, decision outcomes, and the emotional responses of 1,144 participants from diverse backgrounds. To our knowledge, this is the first dataset simultaneously covering these four facets of decision-making. With repeated measurements, the dataset may help establish baseline variability of decision-making in teamwork management, leading to more realistic decision theoretic models and more effective decision support approaches.

  12. Reference datasets for bioequivalence trials in a two-group parallel design.

    PubMed

    Fuglsang, Anders; Schütz, Helmut; Labes, Detlew

    2015-03-01

    In order to help companies qualify and validate the software used to evaluate bioequivalence trials with two parallel treatment groups, this work aims to define datasets with known results. This paper places a total of 11 datasets into the public domain, along with a proposed consensus obtained via evaluations from six different software packages (R, SAS, WinNonlin, OpenOffice Calc, Kinetica, EquivTest). Insofar as possible, datasets were evaluated with and without the assumption of equal variances for the construction of a 90% confidence interval. Not all software packages provide functionality for the assumption of unequal variances (EquivTest, Kinetica), and not all packages can handle datasets with more than 1000 subjects per group (WinNonlin). Where results could be obtained across all packages, one showed questionable results when datasets contained unequal group sizes (Kinetica). A proposal is made for the results that should be used as validation targets.
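
    The computation being validated here is the 90% confidence interval for the ratio of geometric means in a two-group parallel design, with either pooled (equal-variance) or Welch (unequal-variance) standard errors. The sketch below is an illustrative implementation, not the reference code of the paper; the function name and example values are invented, and it assumes SciPy is available for the t-distribution quantile.

```python
import math
from statistics import mean, variance
from scipy import stats

def be_ci_parallel(test, ref, alpha=0.10, equal_var=False):
    """90% CI for the ratio of geometric means (Test/Reference),
    two-group parallel design, computed on log-transformed data."""
    lt = [math.log(x) for x in test]
    lr = [math.log(x) for x in ref]
    n1, n2 = len(lt), len(lr)
    d = mean(lt) - mean(lr)                 # log of the GMR point estimate
    v1, v2 = variance(lt), variance(lr)     # sample variances (n-1)
    if equal_var:
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:  # Welch-Satterthwaite degrees of freedom
        se = math.sqrt(v1 / n1 + v2 / n2)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return math.exp(d - tcrit * se), math.exp(d + tcrit * se)

# Invented example data (e.g., AUC values for six subjects per arm)
lo, hi = be_ci_parallel([100, 110, 95, 105, 98, 102],
                        [90, 105, 100, 95, 97, 101])
```

Validating a package then amounts to comparing its interval against consensus values such as these, under both variance assumptions.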

  13. Development of a SPARK Training Dataset

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sayre, Amanda M.; Olson, Jarrod R.

    2015-03-01

    In its first five years, the National Nuclear Security Administration’s (NNSA) Next Generation Safeguards Initiative (NGSI) sponsored more than 400 undergraduate, graduate, and post-doctoral students in internships and research positions (Wyse 2012). In the past seven years, the NGSI program has produced, and continues to produce, a large body of scientific, technical, and policy work in targeted core safeguards capabilities and human capital development activities. Not only does the NGSI program carry out activities across multiple disciplines, but also across all U.S. Department of Energy (DOE)/NNSA locations in the United States. However, products are not readily shared among disciplines and across locations, nor are they archived in a comprehensive library. Rather, knowledge of NGSI-produced literature is localized to the researchers, clients, and internal laboratory/facility publication systems such as the Electronic Records and Information Capture Architecture (ERICA) at the Pacific Northwest National Laboratory (PNNL). There is also no integrated way of analyzing existing NGSI literature to determine whether the larger NGSI program is achieving its core safeguards capabilities and activities. A complete library of NGSI literature could prove beneficial to a cohesive, sustainable, and more economical NGSI program. The Safeguards Platform for Automated Retrieval of Knowledge (SPARK) has been developed as a knowledge storage, retrieval, and analysis capability to capture safeguards knowledge beyond the lifespan of NGSI. During the development process, it was necessary to build a SPARK training dataset (a corpus of documents) for initial entry into the system and for demonstration purposes. We manipulated these data to gain new information about the breadth of NGSI publications, and evaluated the science-policy interface at PNNL as a practical demonstration of SPARK’s intended analysis capability. The analysis demonstration sought to answer

  14. Validation of the Hospital Episode Statistics Outpatient Dataset in England.

    PubMed

    Thorn, Joanna C; Turner, Emma; Hounsome, Luke; Walsh, Eleanor; Donovan, Jenny L; Verne, Julia; Neal, David E; Hamdy, Freddie C; Martin, Richard M; Noble, Sian M

    2016-02-01

    The Hospital Episode Statistics (HES) dataset is a source of administrative 'big data' with potential for costing purposes in economic evaluations alongside clinical trials. This study assesses the validity of coverage in the HES outpatient dataset. Men who died of, or with, prostate cancer were selected from a prostate-cancer screening trial (CAP, Cluster randomised triAl of PSA testing for Prostate cancer). Details of visits that took place after 1/4/2003 to hospital outpatient departments for conditions related to prostate cancer were extracted from medical records (MR); these appointments were sought in the HES outpatient dataset based on date. The matching procedure was repeated for periods before and after 1/4/2008, when the HES outpatient dataset was accredited as a national statistic. 4922 outpatient appointments were extracted from MR for 370 men. 4088 appointments recorded in MR were identified in the HES outpatient dataset (83.1%; 95% confidence interval [CI] 82.0-84.1). For appointments occurring prior to 1/4/2008, 2195/2755 (79.7%; 95% CI 78.2-81.2) matches were observed, while 1893/2167 (87.4%; 95% CI 86.0-88.9) appointments occurring after 1/4/2008 were identified (p for difference <0.001). 215/370 men (58.1%) had at least one appointment in the MR review that was unmatched in HES, 155 men (41.9%) had all their appointments identified, and 20 men (5.4%) had no appointments identified in HES. The HES outpatient dataset appears reasonably valid for research, particularly following accreditation. The dataset may be a suitable alternative to collecting MR data from hospital notes within a trial, although caution should be exercised with data collected prior to accreditation.
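
    The coverage figures quoted above are simple binomial proportions with normal-approximation confidence intervals, and can be reproduced directly from the counts in the abstract:

```python
import math

def prop_ci(k, n, z=1.96):
    """Matching proportion k/n with a normal-approximation 95% CI."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p, p - half, p + half

# 4088 of 4922 medical-record appointments were found in HES
p, lo, hi = prop_ci(4088, 4922)
print(f"{100*p:.1f}% (95% CI {100*lo:.1f}-{100*hi:.1f})")
# → 83.1% (95% CI 82.0-84.1)
```

The same call with (2195, 2755) and (1893, 2167) reproduces the pre- and post-accreditation figures.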

  15. Identifying Differentially Abundant Metabolic Pathways in Metagenomic Datasets

    NASA Astrophysics Data System (ADS)

    Liu, Bo; Pop, Mihai

    Enabled by rapid advances in sequencing technology, metagenomic studies aim to characterize entire communities of microbes, bypassing the need for culturing individual bacterial members. One major goal of such studies is to identify specific functional adaptations of microbial communities to their habitats. Here we describe a powerful analytical method (MetaPath) that can identify differentially abundant pathways in metagenomic datasets, relying on a combination of metagenomic sequence data and prior metabolic pathway knowledge. We show that MetaPath outperforms other common approaches when evaluated on simulated datasets. We also demonstrate the power of our methods in analyzing two publicly available metagenomic datasets: a comparison of the gut microbiome of obese and lean twins; and a comparison of the gut microbiome of infant and adult subjects. We demonstrate that the subpathways identified by our method provide valuable insights into the biological activities of the microbiome.

  16. ClimateNet: A Machine Learning dataset for Climate Science Research

    NASA Astrophysics Data System (ADS)

    Prabhat, M.; Biard, J.; Ganguly, S.; Ames, S.; Kashinath, K.; Kim, S. K.; Kahou, S.; Maharaj, T.; Beckham, C.; O'Brien, T. A.; Wehner, M. F.; Williams, D. N.; Kunkel, K.; Collins, W. D.

    2017-12-01

    Deep Learning techniques have revolutionized commercial applications in computer vision, speech recognition and control systems. The key to all of these developments was the creation of a curated, labeled dataset, ImageNet, which enabled multiple research groups around the world to develop methods, benchmark performance and compete with each other. The success of Deep Learning can be largely attributed to the broad availability of this dataset. Our empirical investigations have revealed that Deep Learning is similarly poised to benefit the task of pattern detection in climate science. Unfortunately, labeled datasets, a key prerequisite for training, are hard to find. Individual research groups are typically interested in specialized weather patterns, making it hard to unify and share datasets across groups and institutions. In this work, we are proposing ClimateNet: a labeled dataset that provides labeled instances of extreme weather patterns, as well as associated raw fields in model and observational output. We develop a schema in NetCDF to enumerate weather pattern classes/types, store bounding boxes, and pixel masks. We are also working on a TensorFlow implementation to natively import such NetCDF datasets, and are providing a reference convolutional architecture for binary classification tasks. Our hope is that researchers in Climate Science, as well as ML/DL, will be able to use (and extend) ClimateNet to make rapid progress in the application of Deep Learning for Climate Science research.

  17. Regional climate change study requires new temperature datasets

    NASA Astrophysics Data System (ADS)

    Wang, K.; Zhou, C.

    2016-12-01

    Analyses of global mean air temperature (Ta), i.e., NCDC GHCN, GISS, and CRUTEM4, are the fundamental datasets for climate change study and provide key evidence for global warming. All of the global temperature analyses over land are primarily based on meteorological observations of the daily maximum and minimum temperatures (Tmax and Tmin) and their averages (T2), because in most weather stations the measurements of Tmax and Tmin may be the only choice for a homogeneous century-long analysis of mean temperature. Our studies show that these datasets are suitable for long-term global warming studies. However, they may introduce substantial bias in quantifying local and regional warming rates, i.e., with a root mean square error of more than 25% at 5° × 5° grids. From 1973 to 1997, the current datasets tend to significantly underestimate the warming rate over the central U.S. and overestimate the warming rate over the northern high latitudes. Similar results during the period 1998-2013, the warming hiatus period, indicate that the use of T2 enlarges the spatial contrast of temperature trends. This is because T2 over land samples air temperature only twice daily and cannot accurately reflect land-atmosphere and incoming radiation variations in the temperature diurnal cycle. For better regional climate change detection and attribution, we suggest creating new global mean air temperature datasets based on the recently available high spatiotemporal resolution meteorological observations, i.e., four daily observations per weather station since the 1960s. These datasets will not only help investigate dynamical processes on temperature variances but also help better evaluate reanalyzed and modeled simulations of temperature, and make substantial improvements for other related climate variables in models, especially in regional and seasonal aspects.

  18. Regional climate change study requires new temperature datasets

    NASA Astrophysics Data System (ADS)

    Wang, Kaicun; Zhou, Chunlüe

    2017-04-01

    Analyses of global mean air temperature (Ta), i.e., NCDC GHCN, GISS, and CRUTEM4, are the fundamental datasets for climate change study and provide key evidence for global warming. All of the global temperature analyses over land are primarily based on meteorological observations of the daily maximum and minimum temperatures (Tmax and Tmin) and their averages (T2), because in most weather stations the measurements of Tmax and Tmin may be the only choice for a homogeneous century-long analysis of mean temperature. Our studies show that these datasets are suitable for long-term global warming studies. However, they may have substantial biases in quantifying local and regional warming rates, i.e., with a root mean square error of more than 25% at 5 degree grids. From 1973 to 1997, the current datasets tend to significantly underestimate the warming rate over the central U.S. and overestimate the warming rate over the northern high latitudes. Similar results during the period 1998-2013, the warming hiatus period, indicate that the use of T2 enlarges the spatial contrast of temperature trends. This is because T2 over land samples air temperature only twice daily and cannot accurately reflect land-atmosphere and incoming radiation variations in the temperature diurnal cycle. For better regional climate change detection and attribution, we suggest creating new global mean air temperature datasets based on the recently available high spatiotemporal resolution meteorological observations, i.e., four daily observations per weather station since the 1960s. These datasets will not only help investigate dynamical processes on temperature variances but also help better evaluate reanalyzed and modeled simulations of temperature, and make substantial improvements for other related climate variables in models, especially in regional and seasonal aspects.
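
    The T2 bias described in this record is easy to see with synthetic numbers: for an asymmetric diurnal cycle, (Tmax + Tmin)/2 can differ noticeably from the true daily mean, while four evenly spaced observations track it more closely. The hourly values below are invented for illustration, not taken from any station record.

```python
# Invented hourly temperatures (°C) for one day with a sharp afternoon peak
temps = [4, 3, 3, 2, 2, 2, 3, 5, 8, 11, 14, 16,
         18, 19, 20, 20, 19, 17, 14, 12, 10, 8, 6, 5]

t2 = (max(temps) + min(temps)) / 2               # (Tmax + Tmin)/2
t4 = sum(temps[h] for h in (0, 6, 12, 18)) / 4   # four synoptic observations
t24 = sum(temps) / len(temps)                    # "true" daily mean

print(t2, t4, round(t24, 2))  # → 11.0 9.75 10.04
```

Here (Tmax + Tmin)/2 overestimates the daily mean by about 1 °C, while the four-observation average is within 0.3 °C, mirroring the argument for building datasets from four daily observations.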

  19. Las Vegas

    NASA Image and Video Library

    2001-10-22

    This image of Las Vegas, NV was acquired in August 2000 and covers an area 42 km (25 miles) wide and 30 km (18 miles) long. The image displays three bands of the reflected visible and infrared wavelength region, with a spatial resolution of 15 m. McCarran International Airport to the south and Nellis Air Force Base to the NE are the two major airports visible. Golf courses appear as bright red, worm-like areas. The first settlement in Las Vegas (which is Spanish for The Meadows) was recorded back in the early 1850s when the Mormon church, headed by Brigham Young, sent a mission of 30 men to construct a fort and teach agriculture to the Indians. Las Vegas became a city in 1905 when the railroad announced this city was to be a major division point. Prior to legalized gambling in 1931, Las Vegas was developing as an agricultural area. Las Vegas' fame as a resort area became prominent after World War II. The image is located at 36.1 degrees north latitude and 115.1 degrees west longitude. http://photojournal.jpl.nasa.gov/catalog/PIA11096

  20. Image segmentation evaluation for very-large datasets

    NASA Astrophysics Data System (ADS)

    Reeves, Anthony P.; Liu, Shuang; Xie, Yiting

    2016-03-01

    With the advent of modern machine learning methods and fully automated image analysis, there is a need for very large image datasets with documented segmentations, for both computer algorithm training and evaluation. Current approaches of visual inspection and manual markings do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes is achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to fully automatically facilitate the measurement of a number of very important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.

  1. The health care and life sciences community profile for dataset descriptions

    PubMed Central

    Alexiev, Vladimir; Ansell, Peter; Bader, Gary; Baran, Joachim; Bolleman, Jerven T.; Callahan, Alison; Cruz-Toledo, José; Gaudet, Pascale; Gombocz, Erich A.; Gonzalez-Beltran, Alejandra N.; Groth, Paul; Haendel, Melissa; Ito, Maori; Jupp, Simon; Juty, Nick; Katayama, Toshiaki; Kobayashi, Norio; Krishnaswami, Kalpana; Laibe, Camille; Le Novère, Nicolas; Lin, Simon; Malone, James; Miller, Michael; Mungall, Christopher J.; Rietveld, Laurens; Wimalaratne, Sarala M.; Yamaguchi, Atsuko

    2016-01-01

    Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting guideline covers elements of description, identification, attribution, versioning, provenance, and content summarization. This guideline reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. PMID:27602295

  2. The CMS dataset bookkeeping service

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Afaq, Anzar; Dolgert, Andrew

    2007-10-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via a Python API, a command line, and a Discovery web page interface. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  3. The CMS dataset bookkeeping service

    NASA Astrophysics Data System (ADS)

    Afaq, A.; Dolgert, A.; Guo, Y.; Jones, C.; Kosyakov, S.; Kuznetsov, V.; Lueking, L.; Riley, D.; Sekhri, V.

    2008-07-01

    The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via a Python API, a command line, and a Discovery web page interface. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.

  4. A Merged Dataset for Solar Probe Plus FIELDS Magnetometers

    NASA Astrophysics Data System (ADS)

    Bowen, T. A.; Dudok de Wit, T.; Bale, S. D.; Revillet, C.; MacDowall, R. J.; Sheppard, D.

    2016-12-01

    The Solar Probe Plus FIELDS experiment will observe turbulent magnetic fluctuations deep in the inner heliosphere. The FIELDS magnetometer suite implements a set of three magnetometers: two vector DC fluxgate magnetometers (MAGs), sensitive from DC to 100 Hz, as well as a vector search coil magnetometer (SCM), sensitive from 10 Hz to 50 kHz. Single-axis measurements are additionally made up to 1 MHz. To study the full range of observations, we propose merging data from the individual magnetometers into a single dataset. A merged dataset will improve the quality of observations in the range of frequencies observed by both magnetometers (~10-100 Hz). Here we present updates on the individual MAG and SCM calibrations as well as our results on generating a cross-calibrated and merged dataset.

  5. A cross-country Exchange Market Pressure (EMP) dataset.

    PubMed

    Desai, Mohit; Patnaik, Ila; Felman, Joshua; Shah, Ajay

    2017-06-01

    The data presented in this article are related to the research article titled "An exchange market pressure measure for cross country analysis" (Patnaik et al. [1]). In this article, we present the dataset of Exchange Market Pressure (EMP) values for 139 countries, along with their conversion factors, ρ (rho). Exchange Market Pressure, expressed as the percentage change in the exchange rate, measures the change in the exchange rate that would have taken place had the central bank not intervened. The conversion factor ρ can be interpreted as the change in the exchange rate associated with $1 billion of intervention. Estimates of the conversion factor ρ allow us to calculate a monthly time series of EMP for 139 countries. Additionally, the dataset contains the 68% confidence intervals (high and low values) for the point estimates of the ρ's. Using the standard errors of the estimates of the ρ's, we obtain one-sigma intervals around the mean estimates of the EMP values. These values are also reported in the dataset.
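
    The EMP construction described above can be sketched arithmetically: the observed exchange-rate change is combined with the intervention scaled by ρ. The formula, sign conventions and numbers below are a simplified illustration, not necessarily the exact definition used in the source dataset.

```python
def emp(pct_exchange_rate_change, intervention_bn, rho):
    """Exchange market pressure (%): the exchange-rate change that would
    have occurred had the central bank not intervened. rho converts
    $1 billion of intervention into an exchange-rate change (%).
    Signs and units here are illustrative assumptions."""
    return pct_exchange_rate_change + rho * intervention_bn

# Hypothetical month: the currency moved 1.25% while the central bank
# intervened with $2bn; with rho = 0.5 %/$bn the counterfactual move is:
print(emp(1.25, 2.0, 0.5))  # → 2.25
```

The 68% confidence intervals on ρ reported in the dataset propagate directly into one-sigma bands on EMP through this linear relationship.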

  6. The NASA Subsonic Jet Particle Image Velocimetry (PIV) Dataset

    NASA Technical Reports Server (NTRS)

    Bridges, James; Wernet, Mark P.

    2011-01-01

    Many tasks in fluids engineering require prediction of the turbulence of jet flows. This document presents the single-point statistics of velocity, mean and variance, of cold and hot jet flows. The jet velocities ranged from 0.5 to 1.4 times the ambient speed of sound, and temperatures ranged from unheated to a static temperature ratio of 2.7. Further, the report assesses the accuracy of the data, i.e., establishes uncertainties for the data. This paper covers the following five tasks: (1) Document acquisition and processing procedures used to create the particle image velocimetry (PIV) datasets. (2) Compare PIV data with hotwire and laser Doppler velocimetry (LDV) data published in the open literature. (3) Compare different datasets acquired at the same flow conditions in multiple tests to establish uncertainties. (4) Create a consensus dataset for a range of hot jet flows, including uncertainty bands. (5) Analyze this consensus dataset for self-consistency and compare jet characteristics to those of the open literature. The final objective was fulfilled by using the potential core length and the spread rate of the half-velocity radius to collapse the mean and turbulent velocity fields over the first 20 jet diameters.

  7. A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video

    DTIC Science & Technology

    2011-06-01

    ...orders of magnitude larger than existing datasets such as CAVIAR [7]. The TRECVID 2008 airport dataset [16] contains 100 hours of video, but it provides only... entire human figure (e.g., above shoulder), amounting to 500% human-to-video... Some statistics are approximate, obtained from the CAVIAR 1st scene and... and diversity in both collection sites and viewpoints. In comparison to surveillance datasets such as CAVIAR [7] and TRECVID [16] shown in Fig. 3

  8. Animal Viruses Probe dataset (AVPDS) for microarray-based diagnosis and identification of viruses.

    PubMed

    Yadav, Brijesh S; Pokhriyal, Mayank; Vasishtha, Dinesh P; Sharma, Bhaskar

    2014-03-01

    AVPDS (Animal Viruses Probe dataset) is a dataset of virus-specific and conserved oligonucleotides for the identification and diagnosis of viruses infecting animals. The current dataset contains 20,619 virus-specific probes for 833 viruses and their subtypes, and 3,988 conserved probes for 146 viral genera. The virus-specific probe dataset has two fields, virus name and probe sequence. Similarly, the table of conserved probes for virus genera has the genus, the subgroup within the genus, and the probe sequence. The subgroups within a genus are artificial divisions with no taxonomic significance; each contains probes that identify viruses in that specific subgroup of the genus. Using this dataset we have successfully diagnosed the first case of Newcastle disease virus in sheep and reported a mixed infection of Bovine viral diarrhea and Bovine herpesvirus in cattle. The dataset also contains probes that cross-react across species experimentally, even though computationally they meet specifications; these probes have been marked. We hope that this dataset will be useful in microarray-based detection of viruses. The dataset can be accessed through the link https://dl.dropboxusercontent.com/u/94060831/avpds/HOME.html.

  9. Dataset from chemical gas sensor array in turbulent wind tunnel.

    PubMed

    Fonollosa, Jordi; Rodríguez-Luján, Irene; Trincavelli, Marco; Huerta, Ramón

    2015-06-01

    The dataset includes the acquired time series of a chemical detection platform exposed to different gas conditions in a turbulent wind tunnel. The chemo-sensory elements sampled the environment directly. In contrast to traditional approaches that include measurement chambers, open sampling systems are sensitive to the dispersion mechanisms of gaseous chemical analytes, namely diffusion, turbulence, and advection, making the identification and monitoring of chemical substances more challenging. The sensing platform included 72 metal-oxide gas sensors that were positioned at 6 different locations of the wind tunnel. At each location, 10 distinct chemical gases were released in the wind tunnel, the sensors were evaluated at 5 different operating temperatures, and 3 different wind speeds were generated in the wind tunnel to induce different levels of turbulence. Moreover, each configuration was repeated 20 times, yielding a dataset of 18,000 measurements. The dataset was collected over a period of 16 months. The data is related to "On the performance of gas sensor arrays in open sampling systems using Inhibitory Support Vector Machines", by Vergara et al. [1]. The dataset can be accessed publicly at the UCI repository upon citation of [1]: http://archive.ics.uci.edu/ml/datasets/Gas+sensor+arrays+in+open+sampling+settings.
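
    The 18,000-measurement figure quoted above follows directly from the full-factorial experimental design described in the abstract:

```python
# Full-factorial design: every combination of the experimental factors
locations, gases, temperatures, wind_speeds, repeats = 6, 10, 5, 3, 20
measurements = locations * gases * temperatures * wind_speeds * repeats
print(measurements)  # → 18000

# Each measurement additionally contains one time series per sensor
sensors_per_measurement = 72
```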

  10. Las Vegas

    NASA Technical Reports Server (NTRS)

    2001-01-01

    This image of Las Vegas, NV was acquired in August 2000 and covers an area 42 km (25 miles) wide and 30 km (18 miles) long. The image displays three bands of the reflected visible and infrared wavelength region, with a spatial resolution of 15 m. McCarran International Airport to the south and Nellis Air Force Base to the NE are the two major airports visible. Golf courses appear as bright red, worm-like areas. The first settlement in Las Vegas (which is Spanish for The Meadows) was recorded in the early 1850s, when the Mormon church, headed by Brigham Young, sent a mission of 30 men to construct a fort and teach agriculture to the Indians. Las Vegas became a city in 1905 when the railroad announced this city was to be a major division point. Prior to legalized gambling in 1931, Las Vegas was developing as an agricultural area. Las Vegas' fame as a resort area became prominent after World War II. The image is located at 36.1 degrees north latitude and 115.1 degrees west longitude.

    The U.S. science team is located at NASA's Jet Propulsion Laboratory, Pasadena, Calif. The Terra mission is part of NASA's Science Mission Directorate.

  11. Knowledge mining from clinical datasets using rough sets and backpropagation neural network.

    PubMed

    Nahato, Kindie Biredagn; Harichandran, Khanna Nehemiah; Arputharaj, Kannan

    2015-01-01

    The availability of clinical datasets and knowledge mining methodologies encourages researchers to pursue the extraction of knowledge from clinical datasets. Different data mining techniques have been used for mining rules, and mathematical models have been developed to assist the clinician in decision making. The objective of this research is to build a classifier that will predict the presence or absence of a disease by learning from a minimal set of attributes extracted from the clinical dataset. In this work, a rough set indiscernibility relation method with a backpropagation neural network (RS-BPNN) is used. This work has two stages. The first stage is the handling of missing values to obtain a smooth dataset and the selection of appropriate attributes from the clinical dataset by the indiscernibility relation method. The second stage is classification using a backpropagation neural network on the selected reducts of the dataset. The classifier has been tested with the hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI) machine learning repository. The accuracy obtained from the proposed method is 97.3%, 98.6%, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. The proposed system provides an effective classification model for clinical datasets.
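    The indiscernibility relation at the heart of the first (attribute-selection) stage can be sketched in a few lines: objects with identical values on a chosen attribute subset fall into the same equivalence class. This is a minimal illustration on a hypothetical toy decision table, not the clinical data or the full RS-BPNN pipeline, which also handles missing values and trains a backpropagation network on the reducts:

```python
from collections import defaultdict

def indiscernibility_classes(table, attributes):
    """Group object indices that are indistinguishable on `attributes`.

    `table` is a list of dicts (one per object); objects with identical
    values on the chosen attributes land in the same equivalence class.
    """
    classes = defaultdict(list)
    for i, row in enumerate(table):
        key = tuple(row[a] for a in attributes)
        classes[key].append(i)
    return list(classes.values())

# Hypothetical toy decision table (illustrative field names only)
patients = [
    {"fever": "yes", "fatigue": "high", "disease": "present"},
    {"fever": "yes", "fatigue": "high", "disease": "present"},
    {"fever": "no",  "fatigue": "low",  "disease": "absent"},
]

print(indiscernibility_classes(patients, ["fever", "fatigue"]))
# [[0, 1], [2]] -- patients 0 and 1 are indiscernible on these attributes
```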

  12. A photogrammetric technique for generation of an accurate multispectral optical flow dataset

    NASA Astrophysics Data System (ADS)

    Kniaz, V. V.

    2017-06-01

    The presence of an accurate dataset is the key requirement for the successful development of an optical flow estimation algorithm. A large number of freely available optical flow datasets were developed in recent years and gave rise to many powerful algorithms. However, most of the datasets include only images captured in the visible spectrum. This paper is focused on the creation of a multispectral optical flow dataset with accurate ground truth. The generation of accurate ground truth optical flow is a rather complex problem, as no device for error-free optical flow measurement has been developed to date. Existing methods for ground truth optical flow estimation are based on hidden textures, 3D modelling or laser scanning. Such techniques either work only with synthetic optical flow or provide only a sparse ground truth. In this paper, a new photogrammetric method for generation of an accurate ground truth optical flow is proposed. The method combines the accuracy and density of synthetic optical flow datasets with the flexibility of laser scanning based techniques. A multispectral dataset including various image sequences was generated using the developed method. The dataset is freely available on the accompanying web site.

  13. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Gray, A.

    2014-04-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors. This is likely of particular interest to the radio astronomy community given, for example, that survey projects contain groups dedicated to this topic. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.

  14. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Astronomy Data Centre, Canadian

    2014-01-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors, and the local outlier factor. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.
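    Skytree Server itself is proprietary, but the nearest-neighbor outlier detection mentioned in both entries above can be sketched simply: score each object by its distance to its k-th nearest neighbor, so isolated objects score high. This brute-force O(n²) illustration stands in for the tree-based neighbor search such systems use to scale to hundreds of millions of objects:

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, ascending; k-th smallest is the score
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Toy 2-D "catalog": a tight cluster plus one stray object
catalog = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(catalog, k=2)
print(max(range(len(scores)), key=scores.__getitem__))  # 4: the stray object
```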

  15. Wide-Open: Accelerating public data release by automating detection of overdue datasets

    PubMed Central

    Poon, Hoifung; Howe, Bill

    2017-01-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819

  16. Wide-Open: Accelerating public data release by automating detection of overdue datasets.

    PubMed

    Grechkin, Maxim; Poon, Hoifung; Howe, Bill

    2017-06-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
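    The first step described above, text mining articles for dataset references, can be sketched with accession-pattern matching. The patterns below cover common GEO series (GSE) and SRA study (SRP) identifiers for the two repositories named in the abstract; the exact patterns and helper names used by Wide-Open are assumptions here:

```python
import re

# Candidate accession patterns for GEO series and SRA studies (illustrative;
# Wide-Open's actual matching rules may be broader).
ACCESSION_RE = re.compile(r"\b(GSE\d+|SRP\d+)\b")

def find_dataset_references(article_text):
    """Return the unique dataset accessions mentioned in an article."""
    return sorted(set(ACCESSION_RE.findall(article_text)))

text = ("Raw reads were deposited in SRA under SRP000001; "
        "expression matrices are available as GSE12345.")
print(find_dataset_references(text))  # ['GSE12345', 'SRP000001']
```

    The second step would then query each repository to check whether the accession is still private, flagging those past their promised release date.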

  17. Evaluation of bulk heat fluxes from atmospheric datasets

    NASA Astrophysics Data System (ADS)

    Farmer, Benton

    Heat fluxes at the air-sea interface are an important component of the Earth's heat budget. In addition, they are an integral factor in determining the sea surface temperature (SST) evolution of the oceans. Different representations of these fluxes are used in both the atmospheric and oceanic communities for the purpose of heat budget studies and, in particular, for forcing oceanic models. It is currently difficult to quantify the potential impact varying heat flux representations have on the ocean response. In this study, a diagnostic tool is presented that allows for a straightforward comparison of surface heat flux formulations and atmospheric data sets. Two variables, relaxation time (RT) and the apparent temperature (T*), are derived from the linearization of the bulk formulas. They are then calculated to compare three bulk formulae and five atmospheric datasets. Additionally, the linearization is expanded to the second order to compare the amount of residual flux present. It is found that the use of a bulk formula employing a constant heat transfer coefficient produces longer relaxation times and contains a greater amount of residual flux in the higher order terms of the linearization. Depending on the temperature difference, the residual flux remaining in the second order and above terms can reach as much as 40-50% of the total residual on a monthly time scale. This is certainly a non-negligible residual flux. In contrast, a bulk formula using a stability- and wind-dependent transfer coefficient retains much of the total flux in the first order term, as only a few percent remain in the residual flux. Most of the difference displayed among the bulk formulas stems from the sensitivity to wind speed and the choice of a constant or spatially varying transfer coefficient. Comparing the representation of RT and T* provides insight into the differences among various atmospheric datasets.
In particular, the representations of the western boundary current, upwelling

  18. Antibody-protein interactions: benchmark datasets and prediction tools evaluation

    PubMed Central

    Ponomarenko, Julia V; Bourne, Philip E

    2007-01-01

    Background The ability to predict antibody binding sites (also known as antigenic determinants or B-cell epitopes) for a given protein is a precursor to new vaccine design and diagnostics. Among the various methods of B-cell epitope identification, X-ray crystallography is one of the most reliable. Using these experimental data, computational methods have been developed for B-cell epitope prediction. As the number of structures of antibody-protein complexes grows, further interest in prediction methods using 3D structure is anticipated. This work aims to establish a benchmark for 3D structure-based epitope prediction methods. Results Two B-cell epitope benchmark datasets inferred from the 3D structures of antibody-protein complexes were defined. The first is a dataset of 62 representative 3D structures of protein antigens with inferred structural epitopes. The second is a dataset of 82 structures of antibody-protein complexes containing different structural epitopes. Using these datasets, eight web-servers developed for antibody and protein binding sites prediction have been evaluated. In no method did performance exceed 40% precision and 46% recall. The values of the area under the receiver operating characteristic curve for the evaluated methods were about 0.6 for ConSurf, DiscoTope, and PPI-PRED methods and above 0.65 but not exceeding 0.70 for protein-protein docking methods when the best of the top ten models for the bound docking were considered; the remaining methods performed close to random. The benchmark datasets are included as a supplement to this paper. Conclusion It may be possible to improve epitope prediction methods through training on datasets which include only immune epitopes and through utilizing more features characterizing epitopes, for example, the evolutionary conservation score. 
Notwithstanding, overall poor performance may reflect the generality of antigenicity and hence the inability to decipher B-cell epitopes as an intrinsic feature of the protein. It

  19. A daily global mesoscale ocean eddy dataset from satellite altimetry.

    PubMed

    Faghmous, James H; Frenger, Ivy; Yao, Yuanshun; Warmka, Robert; Lindell, Aron; Kumar, Vipin

    2015-01-01

    Mesoscale ocean eddies are ubiquitous coherent rotating structures of water with radial scales on the order of 100 kilometers. Eddies play a key role in the transport and mixing of momentum and tracers across the World Ocean. We present a global daily mesoscale ocean eddy dataset that contains ~45 million mesoscale features and 3.3 million eddy trajectories that persist at least two days as identified in the AVISO dataset over the period 1993-2014. This dataset, along with the open-source eddy identification software, allows researchers to extract eddies with any choice of parameters (minimum size, lifetime, etc.), to study global eddy properties and dynamics, and to empirically estimate the impact eddies have on mass or heat transport. Furthermore, our open-source software may be used to identify mesoscale features in model simulations and compare them to observed features. Finally, this dataset can be used to study the interaction between mesoscale ocean eddies and other components of the Earth System.

  20. A daily global mesoscale ocean eddy dataset from satellite altimetry

    PubMed Central

    Faghmous, James H.; Frenger, Ivy; Yao, Yuanshun; Warmka, Robert; Lindell, Aron; Kumar, Vipin

    2015-01-01

    Mesoscale ocean eddies are ubiquitous coherent rotating structures of water with radial scales on the order of 100 kilometers. Eddies play a key role in the transport and mixing of momentum and tracers across the World Ocean. We present a global daily mesoscale ocean eddy dataset that contains ~45 million mesoscale features and 3.3 million eddy trajectories that persist at least two days as identified in the AVISO dataset over the period 1993–2014. This dataset, along with the open-source eddy identification software, allows researchers to extract eddies with any choice of parameters (minimum size, lifetime, etc.), to study global eddy properties and dynamics, and to empirically estimate the impact eddies have on mass or heat transport. Furthermore, our open-source software may be used to identify mesoscale features in model simulations and compare them to observed features. Finally, this dataset can be used to study the interaction between mesoscale ocean eddies and other components of the Earth System. PMID:26097744
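    Extracting eddies "with any choice of parameters" amounts to filtering trajectories by user-chosen thresholds. A minimal sketch on hypothetical trajectory records (the field names are illustrative, not the actual schema of the AVISO-derived dataset):

```python
# Hypothetical trajectories: eddy id -> list of daily observations.
trajectories = {
    "eddy_a": [{"day": d, "radius_km": 80} for d in range(30)],   # 30 days
    "eddy_b": [{"day": d, "radius_km": 120} for d in range(3)],   # 3 days
    "eddy_c": [{"day": d, "radius_km": 95} for d in range(400)],  # 400 days
}

def select_eddies(tracks, min_lifetime_days=28, min_radius_km=50):
    """Keep trajectories that satisfy lifetime and size thresholds."""
    return [
        eddy_id for eddy_id, obs in tracks.items()
        if len(obs) >= min_lifetime_days
        and all(o["radius_km"] >= min_radius_km for o in obs)
    ]

print(select_eddies(trajectories))  # ['eddy_a', 'eddy_c']
```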

  1. Spatially-explicit estimation of geographical representation in large-scale species distribution datasets.

    PubMed

    Kalwij, Jesse M; Robertson, Mark P; Ronk, Argo; Zobel, Martin; Pärtel, Meelis

    2014-01-01

    Much ecological research relies on existing multispecies distribution datasets. Such datasets, however, can vary considerably in quality, extent, resolution or taxonomic coverage. We provide a framework for a spatially-explicit evaluation of geographical representation within large-scale species distribution datasets, using the comparison of an occurrence atlas with a range atlas dataset as a working example. Specifically, we compared occurrence maps for 3773 taxa from the widely-used Atlas Florae Europaeae (AFE) with digitised range maps for 2049 taxa of the lesser-known Atlas of North European Vascular Plants. We calculated the level of agreement at a 50-km spatial resolution using average latitudinal and longitudinal species range, and area of occupancy. Agreement in species distribution was calculated and mapped using Jaccard similarity index and a reduced major axis (RMA) regression analysis of species richness between the entire atlases (5221 taxa in total) and between co-occurring species (601 taxa). We found no difference in distribution ranges or in the area of occupancy frequency distribution, indicating that atlases were sufficiently overlapping for a valid comparison. The similarity index map showed high levels of agreement for central, western, and northern Europe. The RMA regression confirmed that geographical representation of AFE was low in areas with a sparse data recording history (e.g., Russia, Belarus and the Ukraine). For co-occurring species in south-eastern Europe, however, the Atlas of North European Vascular Plants showed remarkably higher richness estimations. Geographical representation of atlas data can be much more heterogeneous than often assumed. Level of agreement between datasets can be used to evaluate geographical representation within datasets. Merging atlases into a single dataset is worthwhile in spite of methodological differences, and helps to fill gaps in our knowledge of species distribution ranges. Species distribution
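    The Jaccard similarity index used above to quantify agreement between the two atlases is simply the ratio of shared to total occupied grid cells. A minimal sketch with hypothetical 50-km grid-cell identifiers (the cell IDs are illustrative):

```python
def jaccard(cells_a, cells_b):
    """Jaccard similarity between two sets of occupied grid cells."""
    a, b = set(cells_a), set(cells_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical grid-cell IDs where one species is recorded in each atlas
afe_cells = {"E4N10", "E4N11", "E5N10", "E5N11"}
anevp_cells = {"E4N10", "E5N10", "E5N11", "E6N12"}

print(jaccard(afe_cells, anevp_cells))  # 0.6  (3 shared cells / 5 in total)
```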

  2. Comparison and validation of gridded precipitation datasets for Spain

    NASA Astrophysics Data System (ADS)

    Quintana-Seguí, Pere; Turco, Marco; Míguez-Macho, Gonzalo

    2016-04-01

    In this study, two gridded precipitation datasets are compared and validated in Spain: the recently developed SAFRAN dataset and the Spain02 dataset. These are validated using rain gauges and they are also compared to the low resolution ERA-Interim reanalysis. The SAFRAN precipitation dataset has been recently produced, using the SAFRAN meteorological analysis, which is extensively used in France (Durand et al. 1993, 1999; Quintana-Seguí et al. 2008; Vidal et al., 2010) and which has recently been applied to Spain (Quintana-Seguí et al., 2015). SAFRAN uses an optimal interpolation (OI) algorithm and uses all available rain gauges from the Spanish State Meteorological Agency (Agencia Estatal de Meteorología, AEMET). The product has a spatial resolution of 5 km and it spans from September 1979 to August 2014. This dataset has been produced mainly to be used in large scale hydrological applications. Spain02 (Herrera et al. 2012, 2015) is another high quality precipitation dataset for Spain based on a dense network of quality-controlled stations and it has different versions at different resolutions. In this study we used the version with a resolution of 0.11°. The product spans from 1971 to 2010. Spain02 is well tested and widely used, mainly, but not exclusively, for RCM model validation and statistical downscaling. ERA-Interim is a well known global reanalysis with a spatial resolution of ~79 km. It has been included in the comparison because it is a widely used product for continental and global scale studies and also in smaller scale studies in data-poor countries. Thus, its comparison with higher resolution products of a data-rich country, such as Spain, allows us to quantify the errors made when using such datasets for national scale studies, in line with some of the objectives of the EU-FP7 eartH2Observe project. The comparison shows that SAFRAN and Spain02 perform similarly, even though their underlying principles are different. Both products are largely

  3. The Global Precipitation Climatology Project (GPCP) Combined Precipitation Dataset

    NASA Technical Reports Server (NTRS)

    Huffman, George J.; Adler, Robert F.; Arkin, Philip; Chang, Alfred; Ferraro, Ralph; Gruber, Arnold; Janowiak, John; McNab, Alan; Rudolf, Bruno; Schneider, Udo

    1997-01-01

    The Global Precipitation Climatology Project (GPCP) has released the GPCP Version 1 Combined Precipitation Data Set, a global, monthly precipitation dataset covering the period July 1987 through December 1995. The primary product in the dataset is a merged analysis incorporating precipitation estimates from low-orbit-satellite microwave data, geosynchronous-orbit-satellite infrared data, and rain gauge observations. The dataset also contains the individual input fields, a combination of the microwave and infrared satellite estimates, and error estimates for each field. The data are provided on 2.5 deg x 2.5 deg latitude-longitude global grids. Preliminary analyses show general agreement with prior studies of global precipitation and extend prior studies of El Nino-Southern Oscillation precipitation patterns. At the regional scale there are systematic differences with standard climatologies.
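    One standard way to merge estimates that carry their own error estimates, sketched below, is to weight each input by its inverse error variance; this is a simplified illustration of the kind of combination described above, not the actual GPCP algorithm, which also involves bias adjustment and the intermediate microwave-infrared combination. The values are hypothetical:

```python
def weighted_merge(estimates):
    """Combine (value, error) estimates, weighting each by 1/error**2.

    Inputs with smaller error estimates dominate the merged analysis.
    """
    weights = [1.0 / (err ** 2) for _, err in estimates]
    total = sum(w * value for w, (value, _) in zip(weights, estimates))
    return total / sum(weights)

# Hypothetical (estimate, error) pairs in mm/day for one grid cell
microwave, infrared, gauge = (5.0, 1.0), (6.0, 2.0), (5.2, 0.5)
print(round(weighted_merge([microwave, infrared, gauge]), 2))  # 5.2
```

    The gauge value dominates here because its stated error is smallest, which is the intended behavior of inverse-error-variance weighting.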

  4. Geospatial datasets for watershed delineation and characterization used in the Hawaii StreamStats web application

    USGS Publications Warehouse

    Rea, Alan; Skinner, Kenneth D.

    2012-01-01

    The U.S. Geological Survey Hawaii StreamStats application uses an integrated suite of raster and vector geospatial datasets to delineate and characterize watersheds. The geospatial datasets used to delineate and characterize watersheds on the StreamStats website, and the methods used to develop the datasets are described in this report. The datasets for Hawaii were derived primarily from 10 meter resolution National Elevation Dataset (NED) elevation models, and the National Hydrography Dataset (NHD), using a set of procedures designed to enforce the drainage pattern from the NHD into the NED, resulting in an integrated suite of elevation-derived datasets. Additional sources of data used for computing basin characteristics include precipitation, land cover, soil permeability, and elevation-derivative datasets. The report also includes links for metadata and downloads of the geospatial datasets.

  5. The Wind Integration National Dataset (WIND) toolkit (Presentation)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Caroline Draxl: NREL

    2014-01-01

    Regional wind integration studies require detailed wind power output data at many locations to perform simulations of how the power system will operate under high penetration scenarios. The wind datasets that serve as inputs into the study must realistically reflect the ramping characteristics, spatial and temporal correlations, and capacity factors of the simulated wind plants, as well as being time synchronized with available load profiles. As described in this presentation, the WIND Toolkit fulfills these requirements by providing a state-of-the-art national (US) wind resource, power production and forecast dataset.

  6. FieldSAFE: Dataset for Obstacle Detection in Agriculture.

    PubMed

    Kragh, Mikkel Fly; Christiansen, Peter; Laursen, Morten Stigaard; Larsen, Morten; Steen, Kim Arild; Green, Ole; Karstoft, Henrik; Jørgensen, Rasmus Nyholm

    2017-11-09

    In this paper, we present a multi-modal dataset for obstacle detection in agriculture. The dataset comprises approximately 2 h of raw sensor data from a tractor-mounted sensor system in a grass mowing scenario in Denmark, October 2016. Sensing modalities include stereo camera, thermal camera, web camera, 360° camera, LiDAR and radar, while precise localization is available from fused IMU and GNSS. Both static and moving obstacles are present, including humans, mannequin dolls, rocks, barrels, buildings, vehicles and vegetation. All obstacles have ground truth object labels and geographic coordinates.

  7. FieldSAFE: Dataset for Obstacle Detection in Agriculture

    PubMed Central

    Christiansen, Peter; Larsen, Morten; Steen, Kim Arild; Green, Ole; Karstoft, Henrik

    2017-01-01

    In this paper, we present a multi-modal dataset for obstacle detection in agriculture. The dataset comprises approximately 2 h of raw sensor data from a tractor-mounted sensor system in a grass mowing scenario in Denmark, October 2016. Sensing modalities include stereo camera, thermal camera, web camera, 360° camera, LiDAR and radar, while precise localization is available from fused IMU and GNSS. Both static and moving obstacles are present, including humans, mannequin dolls, rocks, barrels, buildings, vehicles and vegetation. All obstacles have ground truth object labels and geographic coordinates. PMID:29120383

  8. Fast randomization of large genomic datasets while preserving alteration counts.

    PubMed

    Gobbi, Andrea; Iorio, Francesco; Dawson, Kevin J; Wedge, David C; Tamborero, David; Alexandrov, Ludmil B; Lopez-Bigas, Nuria; Garnett, Mathew J; Jurman, Giuseppe; Saez-Rodriguez, Julio

    2014-09-01

    Studying combinatorial patterns in cancer genomic datasets has recently emerged as a tool for identifying novel cancer driver networks. Approaches have been devised to quantify, for example, the tendency of a set of genes to be mutated in a 'mutually exclusive' manner. The significance of the proposed metrics is usually evaluated by computing P-values under appropriate null models. To this end, a Monte Carlo method (the switching-algorithm) is used to sample simulated datasets under a null model that preserves patient- and gene-wise mutation rates. In this method, a genomic dataset is represented as a bipartite network, to which Markov chain updates (switching-steps) are applied. These steps modify the network topology, and a minimal number of them must be executed to draw simulated datasets independently under the null model. This number has previously been deduced empirically to be a linear function of the total number of variants, making this process computationally expensive. We present a novel approximate lower bound for the number of switching-steps, derived analytically. Additionally, we have developed the R package BiRewire, including new efficient implementations of the switching-algorithm. We illustrate the performance of BiRewire by applying it to large real cancer genomics datasets. We report vast reductions in the time required, with respect to existing implementations/bounds, for equivalent P-value computations. Thus, we propose BiRewire to study statistical properties in genomic datasets, and other data that can be modeled as bipartite networks. BiRewire is available on BioConductor at http://www.bioconductor.org/packages/2.13/bioc/html/BiRewire.html. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
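    A single switching-step on the bipartite (patient, gene) network can be sketched as a degree-preserving edge swap: pick two edges (a, b) and (c, d) and, if the crossed edges are absent, replace them with (a, d) and (c, b). Because every node keeps its degree, patient- and gene-wise mutation counts are preserved, which is exactly the null model described above. A minimal pure-Python illustration on a toy mutation matrix (BiRewire itself is an R package with far more efficient implementations):

```python
import random

def switching_step(edges, rng):
    """One Markov-chain update: a degree-preserving swap of two edges.

    `edges` is a set of (patient, gene) pairs; the step is a no-op when
    the crossed edges already exist, keeping the sampling valid.
    """
    (a, b), (c, d) = rng.sample(sorted(edges), 2)
    if (a, d) not in edges and (c, b) not in edges:
        edges -= {(a, b), (c, d)}
        edges |= {(a, d), (c, b)}

# Toy mutation data as a bipartite edge set
edges = {("p1", "g1"), ("p2", "g2"), ("p3", "g3"), ("p1", "g2")}
rng = random.Random(0)
before = {p: sum(1 for e in edges if e[0] == p) for p in ("p1", "p2", "p3")}
for _ in range(100):
    switching_step(edges, rng)
after = {p: sum(1 for e in edges if e[0] == p) for p in ("p1", "p2", "p3")}
print(before == after)  # True: patient-wise mutation counts are unchanged
```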

  9. Inter-comparison of multiple statistically downscaled climate datasets for the Pacific Northwest, USA

    PubMed Central

    Jiang, Yueyang; Kim, John B.; Still, Christopher J.; Kerns, Becky K.; Kline, Jeffrey D.; Cunningham, Patrick G.

    2018-01-01

    Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets for climate change impact studies within the Pacific Northwest. PMID:29461513

  10. Inter-comparison of multiple statistically downscaled climate datasets for the Pacific Northwest, USA.

    PubMed

    Jiang, Yueyang; Kim, John B; Still, Christopher J; Kerns, Becky K; Kline, Jeffrey D; Cunningham, Patrick G

    2018-02-20

    Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets for climate change impact studies within the Pacific Northwest.

  11. Passive Containment DataSet

    EPA Pesticide Factsheets

    This data is for Figures 6 and 7 in the journal article. The data also includes the two EPANET input files used for the analysis described in the paper, one for the looped system and one for the block system. This dataset is associated with the following publication: Grayman, W., R. Murray, and D. Savic. Redesign of Water Distribution Systems for Passive Containment of Contamination. JOURNAL OF THE AMERICAN WATER WORKS ASSOCIATION. American Water Works Association, Denver, CO, USA, 108(7): 381-391, (2016).

  12. The Lunar Source Disk: Old Lunar Datasets on a New CD-ROM

    NASA Astrophysics Data System (ADS)

    Hiesinger, H.

    1998-01-01

    A compilation of previously published datasets on CD-ROM is presented. This Lunar Source Disk is intended to be a first step in the improvement/expansion of the Lunar Consortium Disk, in order to create an "image-cube"-like data pool that can be easily accessed and might be useful for a variety of future lunar investigations. All datasets were transformed to a standard map projection that allows direct comparison of different types of information on a pixel-by-pixel basis. Lunar observations have a long history and have been important to mankind for centuries, notably since the work of Plutarch and Galileo. As a consequence of centuries of lunar investigations, knowledge of the characteristics and properties of the Moon has accumulated over time. However, a side effect of this accumulation is that it has become more and more complicated for scientists to review all the datasets obtained through different techniques, to interpret them properly, to recognize their weaknesses and strengths in detail, and to combine them synoptically in geologic interpretations. Such synoptic geologic interpretations are crucial for the study of planetary bodies through remote-sensing data in order to avoid misinterpretation. In addition, many of the modern datasets, derived from Earth-based telescopes as well as from spacecraft missions, are acquired at different geometric and radiometric conditions. These differences make it challenging to compare or combine datasets directly or to extract information from different datasets on a pixel-by-pixel basis. Also, as there is no convention for the presentation of lunar datasets, different authors choose different map projections, depending on the location of the investigated areas and their personal interests. Insufficient or incomplete information on the map parameters used by different authors further complicates the reprojection of these datasets to a standard geometry. The goal of our efforts was to transfer previously published lunar

  13. Lessons learned in the generation of biomedical research datasets using Semantic Open Data technologies.

    PubMed

    Legaz-García, María del Carmen; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2015-01-01

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources. Such heterogeneity complicates not only the generation of research-oriented datasets but also their exploitation. In recent years, the Open Data paradigm has proposed new ways of making data available so that sharing and integration are facilitated. Open Data approaches may pursue the generation of content readable by humans only, or by both humans and machines; the latter is the one of interest in our work. The Semantic Web provides a natural technological space for data integration and exploitation and offers a range of technologies for generating not only Open Datasets but also Linked Datasets, that is, open datasets linked to other open datasets. According to Berners-Lee's classification, each open dataset can be given a rating between one and five stars depending on its format and its links to other datasets. In recent years, we have developed and applied our SWIT tool, which automates the generation of semantic datasets from heterogeneous data sources. SWIT produces four-star datasets; the fifth star can be obtained once the dataset is linked from external ones. In this paper, we describe how we have applied the tool in two projects related to health care records and orthology data, as well as the major lessons learned from such efforts.

  14. Global Precipitation Measurement: Methods, Datasets and Applications

    NASA Technical Reports Server (NTRS)

    Tapiador, Francisco; Turk, Francis J.; Petersen, Walt; Hou, Arthur Y.; Garcia-Ortega, Eduardo; Machado, Luiz, A. T.; Angelis, Carlos F.; Salio, Paola; Kidd, Chris; Huffman, George J.

    2011-01-01

    This paper reviews the many aspects of precipitation measurement that are relevant to providing an accurate global assessment of this important environmental parameter. Methods discussed include ground data, satellite estimates and numerical models. First, the methods for measuring, estimating, and modeling precipitation are discussed. Then, the most relevant datasets gathering precipitation information from those three sources are presented. The third part of the paper illustrates a number of the many applications of those measurements and databases. The aim of the paper is to organize the many links and feedbacks between precipitation measurement, estimation and modeling, indicating the uncertainties and limitations of each technique in order to identify areas requiring further attention, and to show the limits within which datasets can be used.

  15. CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.

    PubMed

    Planey, Catherine R; Gevaert, Olivier

    2016-03-09

    Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.

  16. Annotating spatio-temporal datasets for meaningful analysis in the Web

    NASA Astrophysics Data System (ADS)

    Stasch, Christoph; Pebesma, Edzer; Scheider, Simon

    2014-05-01

    More and more environmental datasets that vary in space and time are available in the Web. This brings the advantage that the data can be used for purposes other than those originally foreseen, but also the danger that users may apply inappropriate analysis procedures because they lack important assumptions made during the data collection process. In order to guide towards a meaningful (statistical) analysis of spatio-temporal datasets available in the Web, we have developed a Higher-Order-Logic formalism that captures some relevant assumptions in our previous work [1]. It allows meaningful spatial prediction and aggregation to be proved in a semi-automated fashion. In this poster presentation, we will present a concept for annotating spatio-temporal datasets available in the Web with concepts defined in our formalism. To this end, we have defined a subset of the formalism as a Web Ontology Language (OWL) pattern. It allows capturing the distinction between the different spatio-temporal variable types, i.e. point patterns, fields, lattices and trajectories, that in turn determine whether a particular dataset can be interpolated or aggregated in a meaningful way using a certain procedure. The actual annotations that link spatio-temporal datasets with the concepts in the ontology pattern are provided as Linked Data. In order to allow data producers to add the annotations to their datasets, we have implemented a Web portal that uses a triple store at the backend to store the annotations and to make them available in the Linked Data cloud. Furthermore, we have implemented functions in the statistical environment R to retrieve the RDF annotations and, based on these annotations, to support a stronger typing of spatio-temporal datatypes guiding towards a meaningful analysis in R. [1] Stasch, C., Scheider, S., Pebesma, E., Kuhn, W. (2014): "Meaningful spatial prediction and aggregation", Environmental Modelling & Software, 51, 149-165.

  17. Land cover trends dataset, 1973-2000

    USGS Publications Warehouse

    Soulard, Christopher E.; Acevedo, William; Auch, Roger F.; Sohl, Terry L.; Drummond, Mark A.; Sleeter, Benjamin M.; Sorenson, Daniel G.; Kambly, Steven; Wilson, Tamara S.; Taylor, Janis L.; Sayler, Kristi L.; Stier, Michael P.; Barnes, Christopher A.; Methven, Steven C.; Loveland, Thomas R.; Headley, Rachel; Brooks, Mark S.

    2014-01-01

    The U.S. Geological Survey Land Cover Trends Project is releasing a 1973–2000 time-series land-use/land-cover dataset for the conterminous United States. The dataset contains 5 dates of land-use/land-cover data for 2,688 sample blocks randomly selected within 84 ecological regions. The nominal dates of the land-use/land-cover maps are 1973, 1980, 1986, 1992, and 2000. The land-use/land-cover maps were classified manually from Landsat Multispectral Scanner, Thematic Mapper, and Enhanced Thematic Mapper Plus imagery using a modified Anderson Level I classification scheme. The resulting land-use/land-cover data has a 60-meter resolution and the projection is set to Albers Equal-Area Conic, North American Datum of 1983. The files are labeled using a standard file naming convention that contains the number of the ecoregion, sample block, and Landsat year. The downloadable files are organized by ecoregion, and are available in the ERDAS IMAGINETM (.img) raster file format.

  18. Evolving hard problems: Generating human genetics datasets with a complex etiology.

    PubMed

    Himmelstein, Daniel S; Greene, Casey S; Moore, Jason H

    2011-07-07

    A goal of human genetics is to discover genetic factors that influence individuals' susceptibility to common diseases. Most common diseases are thought to result from the joint failure of two or more interacting components instead of single component failures. This greatly complicates both the task of selecting informative genetic variants and the task of modeling interactions between them. We and others have previously developed algorithms to detect and model the relationships between these genetic factors and disease. Previously these methods have been evaluated with datasets simulated according to pre-defined genetic models. Here we develop and evaluate a model-free evolution strategy to generate datasets which display a complex relationship between individual genotype and disease susceptibility. We show that this model-free approach is capable of generating a diverse array of datasets with distinct gene-disease relationships for an arbitrary interaction order and sample size. We specifically generate eight hundred Pareto fronts, one for each independent run of our algorithm. In each run the predictiveness of single genetic variants and pairs of genetic variants has been minimized, while the predictiveness of third-, fourth-, or fifth-order combinations is maximized. Two hundred runs of the algorithm are further dedicated to creating datasets with predictive fourth- or fifth-order interactions and minimized lower-level effects. This method and the resulting datasets will allow the capabilities of novel methods to be tested without pre-specified genetic models. This allows researchers to evaluate which methods will succeed on human genetics problems where the model is not known in advance. We further make freely available to the community the entire Pareto-optimal front of datasets from each run so that novel methods may be rigorously evaluated. These 76,600 datasets are available from http://discovery.dartmouth.edu/model_free_data/.
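
The target property — no single variant is predictive, but a higher-order combination is — can be illustrated with the classic parity (XOR-like) epistasis model. The paper evolves such relationships rather than hard-coding them; the sketch below simply constructs one by hand to show what the evolved datasets aim for.

```python
import numpy as np

# Toy dataset: 5 binary SNPs for 1000 individuals.
rng = np.random.default_rng(42)
n = 1000
genotypes = rng.integers(0, 2, size=(n, 5))

# Case/control status depends only on the parity of SNPs 0, 1 and 2,
# a third-order interaction with no lower-order main effects.
status = genotypes[:, :3].sum(axis=1) % 2

# Any single SNP is uninformative: among carriers of SNP 0,
# the case rate stays near 50 per cent.
single_rate = status[genotypes[:, 0] == 1].mean()
print(round(single_rate, 2))
```

Methods that screen variants one at a time fail on such data, which is exactly why benchmark datasets of this shape are useful.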

  19. DATS, the data tag suite to enable discoverability of datasets.

    PubMed

    Sansone, Susanna-Assunta; Gonzalez-Beltran, Alejandra; Rocca-Serra, Philippe; Alter, George; Grethe, Jeffrey S; Xu, Hua; Fore, Ian M; Lyle, Jared; Gururaj, Anupama E; Chen, Xiaoling; Kim, Hyeon-Eui; Zong, Nansu; Li, Yueling; Liu, Ruiling; Ozyurt, I Burak; Ohno-Machado, Lucila

    2017-06-06

    Today's science increasingly requires effective ways to find and access existing datasets that are distributed across a range of repositories. For researchers in the life sciences, discoverability of datasets may soon become as essential as identifying the latest publications via PubMed. Through an international collaborative effort funded by the National Institutes of Health (NIH)'s Big Data to Knowledge (BD2K) initiative, we have designed and implemented the DAta Tag Suite (DATS) model to support the DataMed data discovery index. DataMed's goal is to be for data what PubMed has been for the scientific literature. Akin to the Journal Article Tag Suite (JATS) used in PubMed, the DATS model enables submission of metadata on datasets to DataMed. DATS has a core set of elements, which are generic and applicable to any type of dataset, and an extended set that can accommodate more specialized data types. DATS is a platform-independent model also available as an annotated serialization in schema.org, which in turn is widely used by major search engines like Google, Microsoft, Yahoo and Yandex.
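
Since the abstract notes that DATS is available as an annotated serialization in schema.org, a minimal schema.org `Dataset` record gives a feel for what such metadata looks like. The field values below are illustrative placeholders, not a real DataMed entry.

```python
import json

# Minimal schema.org Dataset annotation; all values are made up
# for illustration only.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example gene expression dataset",
    "description": "Illustrative metadata record for data discovery.",
    "identifier": "doi:10.0000/example",
    "keywords": ["gene expression", "example"],
}
print(json.dumps(record, indent=2))
```

Embedding such JSON-LD in a landing page is what lets general-purpose search engines, as well as indexes like DataMed, discover the dataset.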

  20. A Dataset from TIMSS to Examine the Relationship between Computer Use and Mathematics Achievement

    ERIC Educational Resources Information Center

    Kadijevich, Djordje M.

    2015-01-01

    Because the relationship between computer use and achievement is still puzzling, there is a need to prepare and analyze good quality datasets on computer use and achievement. Such a dataset can be derived from TIMSS data. This paper describes how this dataset can be prepared. It also gives an example of how the dataset may be analyzed. The…

  1. The Development of a Noncontact Letter Input Interface “Fingual” Using Magnetic Dataset

    NASA Astrophysics Data System (ADS)

    Fukushima, Taishi; Miyazaki, Fumio; Nishikawa, Atsushi

    We have newly developed a noncontact letter input interface called “Fingual”. Fingual uses a glove mounted with inexpensive and small magnetic sensors. Using the glove, users can input letters by forming the finger alphabets, a kind of sign language. The proposed method uses a dataset consisting of magnetic field values and the corresponding letter information. In this paper, we show two recognition methods using the dataset. The first method uses the Euclidean norm; the second additionally uses a Gaussian function as a weighting function. We then conducted verification experiments for the recognition rate of each method in two situations: in one, subjects used their own dataset; in the other, they used another person's dataset. As a result, the proposed method could recognize letters with a high rate in both situations, even though it is better to use one's own dataset than another person's. Though Fingual needs to collect a magnetic dataset for each letter in advance, its feature is the ability to recognize letters without complicated calculations such as inverse problems. This paper shows the results of the recognition experiments and the utility of the proposed system “Fingual”.
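
The two recognition methods reduce to nearest-neighbor matching against the stored dataset. Below is a minimal sketch with made-up three-component feature vectors (real Fingual readings would come from multiple magnetic sensors); the Gaussian variant simply converts distances into similarity scores before picking the best letter.

```python
import numpy as np

# Toy stand-in for the magnetic dataset: one stored vector per letter.
dataset = {
    "A": np.array([1.0, 0.2, 0.1]),
    "B": np.array([0.1, 1.0, 0.3]),
    "C": np.array([0.2, 0.1, 1.0]),
}

def recognize_euclidean(x):
    # First method: the letter whose stored vector is nearest in Euclidean norm.
    return min(dataset, key=lambda letter: np.linalg.norm(x - dataset[letter]))

def recognize_gaussian(x, sigma=0.5):
    # Second method: a Gaussian weighting turns distances into similarity
    # scores; the letter with the highest score wins.
    scores = {letter: np.exp(-np.linalg.norm(x - v) ** 2 / (2 * sigma ** 2))
              for letter, v in dataset.items()}
    return max(scores, key=scores.get)

sample = np.array([0.9, 0.25, 0.15])
print(recognize_euclidean(sample), recognize_gaussian(sample))  # both: A
```

With a single query vector the two methods agree; the weighting matters when several stored readings per letter are combined.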

  2. A new dataset validation system for the Planetary Science Archive

    NASA Astrophysics Data System (ADS)

    Manaud, N.; Zender, J.; Heather, D.; Martinez, S.

    2007-08-01

    The Planetary Science Archive is the official archive for the Mars Express mission. It received its first data by the end of 2004. These data are delivered by the PI teams to the PSA team as datasets formatted in conformance with the Planetary Data System (PDS). The PI teams are responsible for analyzing and calibrating the instrument data as well as for the production of reduced and calibrated data. They are also responsible for the scientific validation of these data. ESA is responsible for the long-term data archiving and distribution to the scientific community and must ensure, in this regard, that all archived products meet quality standards. To do so, an archive peer review is used to control the quality of the Mars Express science data archiving process. However, a full validation of its content is missing. An independent review board recently recommended that the completeness of the archive as well as the consistency of the delivered data be validated following well-defined procedures. A new validation software tool is being developed to complete the overall data quality control system functionality. This new tool aims to improve the quality of data and services provided to the scientific community through the PSA, and shall allow anomalies to be tracked and the completeness of datasets to be controlled. It shall ensure that the PSA end-users: (1) can rely on the results of their queries, (2) will get data products that are suitable for scientific analysis, and (3) can find all science data acquired during a mission. We define dataset validation as the verification and assessment process that checks the dataset content against pre-defined top-level criteria, which represent the general characteristics of good-quality datasets. The dataset content that is checked includes the data and all types of information that are essential in the process of deriving scientific results and those interfacing with the PSA database. The validation software tool is a multi-mission tool that

  3. Data Recommender: An Alternative Way to Discover Open Scientific Datasets

    NASA Astrophysics Data System (ADS)

    Klump, J. F.; Devaraju, A.; Williams, G.; Hogan, D.; Davy, R.; Page, J.; Singh, D.; Peterson, N.

    2017-12-01

    Over the past few years, institutions and government agencies have adopted policies to openly release their data, which has resulted in huge amounts of open data becoming available on the web. When trying to discover the data, users face two challenges: an overload of choice and the limitations of the existing data search tools. On the one hand, there are too many datasets to choose from, and therefore, users need to spend considerable effort to find the datasets most relevant to their research. On the other hand, data portals commonly offer keyword and faceted search, which depend fully on the user queries to search and rank relevant datasets. Consequently, keyword and faceted search may return loosely related or irrelevant results, even though the results contain the query terms. They may also return highly specific results that depend more on how well the metadata was authored, and they do not account well for variance in metadata due to variance in author styles and preferences. The top-ranked results may also come from the same data collection, and users are unlikely to discover new and interesting datasets. These search modes mainly suit users who can express their information needs in terms of the structure and terminology of the data portals, but may pose a challenge otherwise. The above challenges reflect that we need a solution that delivers the most relevant (i.e., similar and serendipitous) datasets to users, beyond the existing search functionalities on the portals. A recommender system is an information filtering system that presents users with relevant and interesting contents based on users' context and preferences. Delivering data recommendations to users can make data discovery easier, and as a result may enhance user engagement with the portal. We developed a hybrid data recommendation approach for the CSIRO Data Access Portal. The approach leverages existing recommendation techniques (e.g., content-based filtering and item co-occurrence) to produce

  4. Data assimilation and model evaluation experiment datasets

    NASA Technical Reports Server (NTRS)

    Lai, Chung-Cheng A.; Qian, Wen; Glenn, Scott M.

    1994-01-01

    The Institute for Naval Oceanography, in cooperation with Naval Research Laboratories and universities, executed the Data Assimilation and Model Evaluation Experiment (DAMEE) for the Gulf Stream region during fiscal years 1991-1993. Enormous effort has gone into the preparation of several high-quality and consistent datasets for model initialization and verification. This paper describes the preparation process, the temporal and spatial scopes, the contents, the structure, etc., of these datasets. The goal of DAMEE and the need of data for the four phases of experiment are briefly stated. The preparation of DAMEE datasets consisted of a series of processes: (1) collection of observational data; (2) analysis and interpretation; (3) interpolation using the Optimum Thermal Interpolation System package; (4) quality control and re-analysis; and (5) data archiving and software documentation. The data products from these processes included a time series of 3D fields of temperature and salinity, 2D fields of surface dynamic height and mixed-layer depth, analysis of the Gulf Stream and rings system, and bathythermograph profiles. To date, these are the most detailed and high-quality data for mesoscale ocean modeling, data assimilation, and forecasting research. Feedback from ocean modeling groups who tested this data was incorporated into its refinement. Suggestions for DAMEE data usages include (1) ocean modeling and data assimilation studies, (2) diagnosis and theoretical studies, and (3) comparisons with locally detailed observations.

  5. Artificial intelligence (AI) systems for interpreting complex medical datasets.

    PubMed

    Altman, R B

    2017-05-01

    Advances in machine intelligence have created powerful capabilities in algorithms that find hidden patterns in data, classify objects based on their measured characteristics, and associate similar patients/diseases/drugs based on common features. However, artificial intelligence (AI) applications in medical data have several technical challenges: complex and heterogeneous datasets, noisy medical datasets, and explaining their output to users. There are also social challenges related to intellectual property, data provenance, regulatory issues, economics, and liability. © 2017 ASCPT.

  6. Use of Electronic Health-Related Datasets in Nursing and Health-Related Research.

    PubMed

    Al-Rawajfah, Omar M; Aloush, Sami; Hewitt, Jeanne Beauchamp

    2015-07-01

    Datasets of gigabyte size are common in medical sciences. There is increasing consensus that significant untapped knowledge lies hidden in these large datasets. This review article aims to discuss Electronic Health-Related Datasets (EHRDs) in terms of types, features, advantages, limitations, and possible use in nursing and health-related research. Major scientific databases, MEDLINE, ScienceDirect, and Scopus, were searched for studies or review articles regarding the use of EHRDs in research. A total of 442 articles were located. After application of the study inclusion criteria, 113 articles were included in the final review. EHRDs were categorized into Electronic Administrative Health-Related Datasets and Electronic Clinical Health-Related Datasets, and subcategories of each major category were identified. EHRDs are invaluable assets for nursing and health-related research. Advanced research skills, such as using analytical software, applying advanced statistical procedures, and dealing with missing data and missing variables, will maximize the efficient utilization of EHRDs in research. © The Author(s) 2014.

  7. Recent Development on the NOAA's Global Surface Temperature Dataset

    NASA Astrophysics Data System (ADS)

    Zhang, H. M.; Huang, B.; Boyer, T.; Lawrimore, J. H.; Menne, M. J.; Rennie, J.

    2016-12-01

    Global Surface Temperature (GST) is one of the most widely used indicators for climate trend and extreme analyses. A widely used GST dataset is the NOAA merged land-ocean surface temperature dataset known as NOAAGlobalTemp (formerly MLOST). NOAAGlobalTemp was recently updated from version 3.5.4 to version 4. The update includes a significant improvement in the ocean surface component (Extended Reconstructed Sea Surface Temperature or ERSST, from version 3b to version 4), which resulted in increased temperature trends in recent decades. Since then, advancements in both the ocean component (ERSST) and land component (GHCN-Monthly) have been made, including the inclusion of Argo float SSTs and expanded EOT modes in ERSST, and the use of the ISTI databank in GHCN-Monthly. In this presentation, we describe the impact of those improvements on the merged global temperature dataset, in terms of global trends and other aspects.

  8. Integrative Exploratory Analysis of Two or More Genomic Datasets.

    PubMed

    Meng, Chen; Culhane, Aedin

    2016-01-01

    Exploratory analysis is an essential step in the analysis of high throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on the same set of biological samples, thereby creating multiple different types of omics data or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing easier identification of the correlated structure in and between multiple high-dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower-dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Lastly, as an example of integrative analysis of multiassay or multi-omics data, we analyzed transcriptomic, proteomic, and phosphoproteomic data from pluripotent (iPS) and embryonic stem (ES) cell lines.

  9. Status and Preliminary Evaluation for Chinese Re-Analysis Datasets

    NASA Astrophysics Data System (ADS)

    bin, zhao; chunxiang, shi; tianbao, zhao; dong, si; jingwei, liu

    2016-04-01

    Based on the operational T639L60 spectral model combined with the Hybrid_GSI assimilation system, and using meteorological observations including radiosondes, buoys and satellites, a set of Chinese Re-Analysis (CRA) datasets is being developed by the National Meteorological Information Center (NMIC) of the China Meteorological Administration (CMA). The datasets are run at 30 km (0.28° latitude/longitude) resolution, which is higher than that of most existing reanalysis datasets. The reanalysis is being carried out in an effort to enhance the accuracy of historical synoptic analysis and to aid detailed investigation of various weather and climate systems. The reanalysis is currently at the stage of preliminary experimental analysis. One year of forecast data, from June 2013 to May 2014, has been simulated and used in synoptic and climate evaluation. We first examine the model prediction ability with the new assimilation system and find significant improvement in the Northern and Southern hemispheres: owing to the addition of new satellite data, upper-level prediction is clearly improved compared with the operational T639L60 model, and overall prediction stability is enhanced. In climatological analysis, compared with the ERA-40, NCEP/NCAR and NCEP/DOE reanalyses, the results show that surface temperature is simulated slightly lower over land and higher over ocean, 850-hPa specific humidity shows a weakened anomaly, and the zonal wind anomaly is concentrated in the equatorial tropics. Meanwhile, the reanalysis dataset shows good skill for various climate indices, such as the subtropical high index and the ESMI (East-Asia subtropical Summer Monsoon Index), especially for the Indian and western North Pacific monsoon indices. We will further improve the assimilation system and dynamical simulation performance, and produce a 40-year (1979-2018) reanalysis dataset. It will provide a more comprehensive analysis for synoptic and climate diagnosis.

  10. Realistic computer network simulation for network intrusion detection dataset generation

    NASA Astrophysics Data System (ADS)

    Payer, Garrett

    2015-05-01

    The KDD-99 Cup dataset is dead. While it can continue to be used as a toy example, the age of this dataset makes it all but useless for intrusion detection research and data mining. Many of the attacks used within the dataset are obsolete and do not reflect the features important for intrusion detection in today's networks. Creating a new dataset encompassing a large cross section of the attacks found on the Internet today could be useful, but would eventually fall to the same problem as the KDD-99 Cup; its usefulness would diminish after a period of time. To continue research into intrusion detection, the generation of new datasets needs to be as dynamic and as quick as the attacker. Simply examining existing network traffic and using domain experts such as intrusion analysts to label traffic is inefficient, expensive, and not scalable. The only viable methodology is simulation using technologies including virtualization, attack-toolsets such as Metasploit and Armitage, and sophisticated emulation of threat and user behavior. Simulating actual user behavior and network intrusion events dynamically not only allows researchers to vary scenarios quickly, but enables online testing of intrusion detection mechanisms by interacting with data as it is generated. As new threat behaviors are identified, they can be added to the simulation to make quicker determinations as to the effectiveness of existing and ongoing network intrusion technology, methodology and models.

  11. The ugrizYJHK luminosity distributions and densities from the combined MGC, SDSS and UKIDSS LAS data sets

    NASA Astrophysics Data System (ADS)

    Hill, David T.; Driver, Simon P.; Cameron, Ewan; Cross, Nicholas; Liske, Jochen; Robotham, Aaron

    2010-05-01

    We combine data from the Millennium Galaxy Catalogue, Sloan Digital Sky Survey and UKIRT Infrared Deep Sky Survey Large Area Survey to produce ugrizYJHK luminosity functions and densities from within a common, low-redshift volume (z < 0.1, ~71 000 h1^-3 Mpc^3 for L* systems) with 100 per cent spectroscopic completeness. In the optical the fitted Schechter functions are comparable in shape to previously reported values but with higher normalizations (typically 0, 30, 20, 15 and 5 per cent higher φ* values in u, g, r, i and z, respectively, than those reported by the SDSS team). We attribute these differences to the redshift ranges probed, incompleteness and adopted normalization methods. In the near-IR (NIR) we find significantly different Schechter function parameters (mainly in the M* values) to those previously reported and attribute this to the improvement in the quality of the imaging data over previous studies. This is the first homogeneous measurement of the extragalactic luminosity density which fully samples both the optical and NIR regimes. Unlike previous compilations that have noted a discontinuity between the optical and NIR regimes, our homogeneous data set shows a smooth cosmic spectral energy distribution (CSED). After correcting for dust attenuation we compare our CSED to the expected values based on recent constraints on the cosmic star formation history and the initial mass function.
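
The Schechter function fitted above has a standard absolute-magnitude form, sketched below. The parameter values are order-of-magnitude illustrations only, not the fitted values from the paper.

```python
import numpy as np

def schechter_mag(M, phi_star, M_star, alpha):
    # Schechter luminosity function in absolute-magnitude form:
    #   phi(M) = 0.4 ln(10) * phi* * x^(alpha+1) * exp(-x),
    # where x = 10^(0.4 (M* - M)) is luminosity in units of L*.
    x = 10 ** (0.4 * (M_star - M))
    return 0.4 * np.log(10) * phi_star * x ** (alpha + 1) * np.exp(-x)

# Illustrative r-band-like parameters (phi* in Mpc^-3 mag^-1); these are
# placeholders, not the values fitted in the paper.
mags = np.linspace(-24, -16, 5)
print(schechter_mag(mags, phi_star=0.009, M_star=-20.8, alpha=-1.2))
```

Integrating phi(M) weighted by luminosity over all magnitudes is what yields the luminosity densities that build up the CSED.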

  12. Exploring homogeneity of correlation structures of gene expression datasets within and between etiological disease categories.

    PubMed

    Jong, Victor L; Novianti, Putri W; Roes, Kit C B; Eijkemans, Marinus J C

    2014-12-01

The literature shows that classifiers perform differently across datasets and that correlations within datasets affect the performance of classifiers. The question that arises is whether the correlation structure within datasets differs significantly across diseases. In this study, we evaluated the homogeneity of correlation structures within and between datasets of six etiological disease categories: inflammatory, immune, infectious, degenerative, hereditary and acute myeloid leukemia (AML). We also assessed the effect of two filtering methods, detection call and variance filtering, on correlation structures. We downloaded microarray datasets from ArrayExpress for experiments meeting predefined criteria and ended up with 12 datasets for non-cancerous diseases and six for AML. The datasets were preprocessed by a common procedure incorporating platform-specific recommendations and the two filtering methods mentioned above. Homogeneity of correlation matrices between and within datasets of etiological diseases was assessed using the Box's M statistic on permuted samples. We found that correlation structures significantly differ between datasets of the same and/or different etiological disease categories and that variance filtering eliminates more uncorrelated probesets than detection call filtering and thus renders the data highly correlated.
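Box's M compares group covariance (or correlation) matrices against their pooled estimate; larger values indicate less homogeneous structures. A minimal illustrative sketch for the bivariate case in pure Python (function names are ours, not from the paper, and the permutation-based significance assessment is omitted):

```python
import math
import random

def cov2(xs, ys):
    """Sample 2x2 covariance matrix of paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def box_m(groups):
    """Box's M statistic for equality of 2x2 covariance matrices.

    groups: list of (xs, ys) tuples, one per group.
    M = (N - k) ln|S_pooled| - sum_i (n_i - 1) ln|S_i|, M >= 0.
    """
    k = len(groups)
    ns = [len(g[0]) for g in groups]
    N = sum(ns)
    covs = [cov2(xs, ys) for xs, ys in groups]
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    for n, c in zip(ns, covs):          # (n_i - 1)-weighted pooling
        for i in range(2):
            for j in range(2):
                pooled[i][j] += (n - 1) * c[i][j] / (N - k)
    m = (N - k) * math.log(det2(pooled))
    m -= sum((n - 1) * math.log(det2(c)) for n, c in zip(ns, covs))
    return m

random.seed(0)
g1 = ([random.gauss(0, 1) for _ in range(50)], [random.gauss(0, 1) for _ in range(50)])
g2 = ([random.gauss(0, 1) for _ in range(50)], [random.gauss(0, 1) for _ in range(50)])
print(box_m([g1, g2]))  # small: the two groups share the same structure
```

Groups with genuinely different covariance structures yield a much larger M than groups drawn from the same distribution.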

  13. Isotherm ranking and selection using thirteen literature datasets involving hydrophobic organic compounds.

    PubMed

    Matott, L Shawn; Jiang, Zhengzheng; Rabideau, Alan J; Allen-King, Richelle M

    2015-01-01

    Numerous isotherm expressions have been developed for describing sorption of hydrophobic organic compounds (HOCs), including "dual-mode" approaches that combine nonlinear behavior with a linear partitioning component. Choosing among these alternative expressions for describing a given dataset is an important task that can significantly influence subsequent transport modeling and/or mechanistic interpretation. In this study, a series of numerical experiments were undertaken to identify "best-in-class" isotherms by refitting 10 alternative models to a suite of 13 previously published literature datasets. The corrected Akaike Information Criterion (AICc) was used for ranking these alternative fits and distinguishing between plausible and implausible isotherms for each dataset. The occurrence of multiple plausible isotherms was inversely correlated with dataset "richness", such that datasets with fewer observations and/or a narrow range of aqueous concentrations resulted in a greater number of plausible isotherms. Overall, only the Polanyi-partition dual-mode isotherm was classified as "plausible" across all 13 of the considered datasets, indicating substantial statistical support consistent with current advances in sorption theory. However, these findings are predicated on the use of the AICc measure as an unbiased ranking metric and the adoption of a subjective, but defensible, threshold for separating plausible and implausible isotherms. Copyright © 2015 Elsevier B.V. All rights reserved.
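The AICc ranking works from each isotherm fit's residual sum of squares and parameter count; Akaike weights then express relative plausibility across the candidate models. A hedged sketch (the RSS values below are made up, and the authors' specific plausibility threshold is not reproduced):

```python
import math

def aicc(rss, n, k):
    """Corrected Akaike Information Criterion for a least-squares fit.

    rss: residual sum of squares; n: number of observations;
    k: number of fitted isotherm parameters.
    """
    k = k + 1  # count the estimated error variance as a parameter
    aic = n * math.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction

def akaike_weights(scores):
    """Turn AICc scores into normalized relative plausibilities."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical fits: (rss, n_observations, n_parameters)
scores = [aicc(1.2, 20, 2), aicc(0.9, 20, 3), aicc(0.89, 20, 5)]
print(akaike_weights(scores))
```

Note how the five-parameter model is penalized despite its marginally lower RSS, which is the behavior that makes AICc useful for sparse isotherm datasets.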

  14. Global-scale evaluation of 22 precipitation datasets using gauge observations and hydrological modeling

    NASA Astrophysics Data System (ADS)

    Beck, Hylke E.; Vergopolan, Noemi; Pan, Ming; Levizzani, Vincenzo; van Dijk, Albert I. J. M.; Weedon, Graham P.; Brocca, Luca; Pappenberger, Florian; Huffman, George J.; Wood, Eric F.

    2017-12-01

    We undertook a comprehensive evaluation of 22 gridded (quasi-)global (sub-)daily precipitation (P) datasets for the period 2000-2016. Thirteen non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76 086 gauges worldwide. Another nine gauge-corrected datasets were evaluated using hydrological modeling, by calibrating the HBV conceptual model against streamflow records for each of 9053 small to medium-sized ( < 50 000 km2) catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR) and the satellite- and reanalysis-based CHIRP V2.0 dataset, the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7) or near-surface soil moisture (SM2RAIN-ASCAT), and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS). Two of the three reanalyses (ERA-Interim and JRA-55) unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified, and MSWEP V1.2 and V2.0) generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely or ungauged regions. Next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU), which in turn outperformed the one indirectly incorporating gauge data through another multi-source dataset (PERSIANN-CDR V1R1). Our results highlight large differences in estimation accuracy, and hence the importance of P

  15. Reference datasets for 2-treatment, 2-sequence, 2-period bioequivalence studies.

    PubMed

    Schütz, Helmut; Labes, Detlew; Fuglsang, Anders

    2014-11-01

    It is difficult to validate statistical software used to assess bioequivalence since very few datasets with known results are in the public domain, and the few that are published are of moderate size and balanced. The purpose of this paper is therefore to introduce reference datasets of varying complexity in terms of dataset size and characteristics (balance, range, outlier presence, residual error distribution) for 2-treatment, 2-period, 2-sequence bioequivalence studies and to report their point estimates and 90% confidence intervals which companies can use to validate their installations. The results for these datasets were calculated using the commercial packages EquivTest, Kinetica, SAS and WinNonlin, and the non-commercial package R. The results of three of these packages mostly agree, but imbalance between sequences seems to provoke questionable results with one package, which illustrates well the need for proper software validation.
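As an illustration of the quantities being validated, the sketch below computes a point estimate and 90% confidence interval for the Test/Reference geometric mean ratio from paired log-differences; for balanced, complete 2x2x2 data this coincides with the crossover ANOVA result, though real validation datasets are deliberately imbalanced. Data, function name and the tabulated t value are our own illustrative choices:

```python
import math
import statistics

def be_ci(test, ref, t_crit):
    """Point estimate and 90% CI (as % of Reference) for the
    Test/Reference geometric mean ratio from paired log-differences.

    t_crit: two-sided 90% t critical value for n-1 degrees of freedom.
    """
    d = [math.log(t) - math.log(r) for t, r in zip(test, ref)]
    n = len(d)
    mean = statistics.fmean(d)
    se = statistics.stdev(d) / math.sqrt(n)   # standard error of the mean
    pe = math.exp(mean) * 100
    lo = math.exp(mean - t_crit * se) * 100
    hi = math.exp(mean + t_crit * se) * 100
    return pe, lo, hi

# Hypothetical AUC values for 12 subjects; t_crit = 1.796 for df = 11.
test = [100, 95, 110, 102, 99, 105, 98, 101, 97, 108, 103, 96]
ref  = [98, 97, 105, 100, 101, 103, 99, 100, 98, 104, 101, 97]
pe, lo, hi = be_ci(test, ref, 1.796)
print(round(pe, 1), round(lo, 1), round(hi, 1))
```

A validation exercise would compare such output against the published reference results to the reported number of decimal places.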

  16. Atlas Toolkit: Fast registration of 3D morphological datasets in the absence of landmarks

    PubMed Central

    Grocott, Timothy; Thomas, Paul; Münsterberg, Andrea E.

    2016-01-01

    Image registration is a gateway technology for Developmental Systems Biology, enabling computational analysis of related datasets within a shared coordinate system. Many registration tools rely on landmarks to ensure that datasets are correctly aligned; yet suitable landmarks are not present in many datasets. Atlas Toolkit is a Fiji/ImageJ plugin collection offering elastic group-wise registration of 3D morphological datasets, guided by segmentation of the interesting morphology. We demonstrate the method by combinatorial mapping of cell signalling events in the developing eyes of chick embryos, and use the integrated datasets to predictively enumerate Gene Regulatory Network states. PMID:26864723

  17. Atlas Toolkit: Fast registration of 3D morphological datasets in the absence of landmarks.

    PubMed

    Grocott, Timothy; Thomas, Paul; Münsterberg, Andrea E

    2016-02-11

    Image registration is a gateway technology for Developmental Systems Biology, enabling computational analysis of related datasets within a shared coordinate system. Many registration tools rely on landmarks to ensure that datasets are correctly aligned; yet suitable landmarks are not present in many datasets. Atlas Toolkit is a Fiji/ImageJ plugin collection offering elastic group-wise registration of 3D morphological datasets, guided by segmentation of the interesting morphology. We demonstrate the method by combinatorial mapping of cell signalling events in the developing eyes of chick embryos, and use the integrated datasets to predictively enumerate Gene Regulatory Network states.

  18. Correction of elevation offsets in multiple co-located lidar datasets

    USGS Publications Warehouse

    Thompson, David M.; Dalyander, P. Soupy; Long, Joseph W.; Plant, Nathaniel G.

    2017-04-07

Topographic elevation data collected with airborne light detection and ranging (lidar) can be used to analyze short- and long-term changes to beach and dune systems. Analysis of multiple lidar datasets at Dauphin Island, Alabama, revealed systematic, island-wide elevation differences on the order of tens of centimeters that were not attributable to real-world change and, therefore, were likely to represent systematic sampling offsets. These offsets vary between the datasets, but appear spatially consistent within a given survey. This report describes a method that was developed to identify and correct offsets between lidar datasets collected over the same site at different times so that true elevation changes over time, associated with sediment accumulation or erosion, can be analyzed.
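One simple version of such an offset correction is to estimate the systematic vertical shift over surfaces presumed stable between surveys (e.g. roads) and subtract it from the later survey; the report's actual method may differ. A sketch with made-up elevations:

```python
import statistics

def offset_correction(survey_a, survey_b, stable_idx):
    """Estimate and remove a systematic vertical offset between two
    co-located lidar surveys, using points over presumed-stable
    surfaces where real elevation change should be zero.
    """
    diffs = [survey_b[i] - survey_a[i] for i in stable_idx]
    offset = statistics.median(diffs)          # robust to outliers
    corrected_b = [z - offset for z in survey_b]
    return offset, corrected_b

a = [1.0, 1.2, 0.9, 2.0, 2.1]                  # earlier survey (m)
b = [1.3, 1.5, 1.2, 2.9, 2.4]                  # later survey: +0.3 m shift
offset, b_corr = offset_correction(a, b, stable_idx=[0, 1, 2, 4])
print(offset)  # close to 0.3
```

After correction, the residual difference at index 3 reflects genuine elevation change rather than the survey offset.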

  19. DNAism: exploring genomic datasets on the web with Horizon Charts.

    PubMed

    Rio Deiros, David; Gibbs, Richard A; Rogers, Jeffrey

    2016-01-27

    Computational biologists daily face the need to explore massive amounts of genomic data. New visualization techniques can help researchers navigate and understand these big data. Horizon Charts are a relatively new visualization method that, under the right circumstances, maximizes data density without losing graphical perception. Horizon Charts have been successfully applied to understand multi-metric time series data. We have adapted an existing JavaScript library (Cubism) that implements Horizon Charts for the time series domain so that it works effectively with genomic datasets. We call this new library DNAism. Horizon Charts can be an effective visual tool to explore complex and large genomic datasets. Researchers can use our library to leverage these techniques to extract additional insights from their own datasets.

  20. Synthetic ALSPAC longitudinal datasets for the Big Data VR project.

    PubMed

    Avraam, Demetris; Wilson, Rebecca C; Burton, Paul

    2017-01-01

Three synthetic datasets - of observation size 15,000, 155,000 and 1,555,000 participants, respectively - were created by simulating eleven cardiac and anthropometric variables from nine collection ages of the ALSPAC birth cohort study. The synthetic datasets retain similar data properties to the ALSPAC study data they are simulated from (co-variance matrices, as well as the mean and variance values of the variables) without including the original data itself or disclosing participant information. In this instance, the three synthetic datasets have been utilised in an academia-industry collaboration to build a prototype virtual reality data analysis software, but they could have a broader use in method and software development projects where sensitive data cannot be freely shared.

  1. Characterization and visualization of the accuracy of FIA's CONUS-wide tree species datasets

    Treesearch

    Rachel Riemann; Barry T. Wilson

    2014-01-01

    Modeled geospatial datasets have been created for 325 tree species across the contiguous United States (CONUS). Effective application of all geospatial datasets depends on their accuracy. Dataset error can be systematic (bias) or unsystematic (scatter), and their magnitude can vary by region and scale. Each of these characteristics affects the locations, scales, uses,...

  2. Evaluation of Greenland near surface air temperature datasets

    DOE PAGES

    Reeves Eyre, J. E. Jack; Zeng, Xubin

    2017-07-05

Near-surface air temperature (SAT) over Greenland has important effects on mass balance of the ice sheet, but it is unclear which SAT datasets are reliable in the region. Here extensive in situ SAT measurements (∼1400 station-years) are used to assess monthly mean SAT from seven global reanalysis datasets, five gridded SAT analyses, one satellite retrieval and three dynamically downscaled reanalyses. Strengths and weaknesses of these products are identified, and their biases are found to vary by season and glaciological regime. MERRA2 reanalysis overall performs best with mean absolute error less than 2 °C in all months. Ice sheet-average annual mean SAT from different datasets are highly correlated in recent decades, but their 1901–2000 trends differ even in sign. Compared with the MERRA2 climatology combined with gridded SAT analysis anomalies, thirty-one earth system model historical runs from the CMIP5 archive reach ∼5 °C for the 1901–2000 average bias and have opposite trends for a number of sub-periods.
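As an illustration of the kind of comparison reported (e.g. MERRA2's mean absolute error below 2 °C), a minimal monthly mean-absolute-error computation with missing-month handling might look like this (values are made up):

```python
def mae(model, obs):
    """Mean absolute error between co-located monthly means,
    skipping months with missing observations (None)."""
    pairs = [(m, o) for m, o in zip(model, obs) if o is not None]
    return sum(abs(m - o) for m, o in pairs) / len(pairs)

# Hypothetical station observations vs. reanalysis values (deg C).
obs     = [-20.1, -18.5, None, -5.2, 1.0]
reanal  = [-18.9, -17.8, -12.0, -4.0, 0.2]
print(mae(reanal, obs))
```

Computing this per month and per glaciological regime, rather than as one pooled number, is what reveals the seasonally varying biases described above.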

  3. Evaluation of Greenland near surface air temperature datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Reeves Eyre, J. E. Jack; Zeng, Xubin

Near-surface air temperature (SAT) over Greenland has important effects on mass balance of the ice sheet, but it is unclear which SAT datasets are reliable in the region. Here extensive in situ SAT measurements (∼1400 station-years) are used to assess monthly mean SAT from seven global reanalysis datasets, five gridded SAT analyses, one satellite retrieval and three dynamically downscaled reanalyses. Strengths and weaknesses of these products are identified, and their biases are found to vary by season and glaciological regime. MERRA2 reanalysis overall performs best with mean absolute error less than 2 °C in all months. Ice sheet-average annual mean SAT from different datasets are highly correlated in recent decades, but their 1901–2000 trends differ even in sign. Compared with the MERRA2 climatology combined with gridded SAT analysis anomalies, thirty-one earth system model historical runs from the CMIP5 archive reach ∼5 °C for the 1901–2000 average bias and have opposite trends for a number of sub-periods.

  4. De-identification of health records using Anonym: effectiveness and robustness across datasets.

    PubMed

    Zuccon, Guido; Kotzur, Daniel; Nguyen, Anthony; Bergheim, Anton

    2014-07-01

    Evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random fields classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive as they would require minimal intervention to guarantee high effectiveness. The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Integrating Biology and the Bedside (i2b2) dataset, used for evaluation in a de-identification challenge. The datasets used here vary in type of health records, source of data, and their quality, with one of the datasets containing optical character recognition errors. Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training. Findings show that Anonym compares to the best approach from the 2006 i2b2 shared task. It is easy to retrain Anonym with new datasets; if retrained, the system is robust to variations of training size, data type and quality in presence of sufficient training data. Crown Copyright © 2014. Published by Elsevier B.V. All rights reserved.
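The recall and precision figures quoted above can be computed at the token level from the sets of true and predicted personal-health-identifier (PHI) annotations; a sketch (the (document, token-index) span representation is our own choice, not Anonym's):

```python
def prf(true_spans, predicted_spans):
    """Token-level precision, recall and F1 for de-identification:
    precision = fraction of predicted PHI tokens that are truly PHI,
    recall = fraction of true PHI tokens that were found."""
    tp = len(true_spans & predicted_spans)
    precision = tp / len(predicted_spans) if predicted_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Hypothetical annotations as (document_id, token_index) pairs.
truth = {("doc1", 4), ("doc1", 5), ("doc2", 10)}
pred  = {("doc1", 4), ("doc1", 5), ("doc2", 11)}
print(prf(truth, pred))
```

For de-identification, recall is usually weighted more heavily than precision, since a missed identifier is a privacy breach while a false positive merely removes extra text.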

  5. Differentially Private Histogram Publication For Dynamic Datasets: An Adaptive Sampling Approach

    PubMed Central

    Li, Haoran; Jiang, Xiaoqian; Xiong, Li; Liu, Jinfei

    2016-01-01

Differential privacy has recently become a de facto standard for private statistical data release. Many algorithms have been proposed to generate differentially private histograms or synthetic data. However, most of them focus on “one-time” release of a static dataset and do not adequately address the increasing need of releasing series of dynamic datasets in real time. A straightforward application of existing histogram methods on each snapshot of such dynamic datasets will incur high accumulated error due to the composability of differential privacy and correlations or overlapping users between the snapshots. In this paper, we address the problem of releasing series of dynamic datasets in real time with differential privacy, using a novel adaptive distance-based sampling approach. Our first method, DSFT, uses a fixed distance threshold and releases a differentially private histogram only when the current snapshot is sufficiently different from the previous one, i.e., with a distance greater than a predefined threshold. Our second method, DSAT, further improves DSFT and uses a dynamic threshold adaptively adjusted by a feedback control mechanism to capture the data dynamics. Extensive experiments on real and synthetic datasets demonstrate that our approach achieves better utility than baseline methods and existing state-of-the-art methods. PMID:26973795
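The fixed-threshold idea behind DSFT can be sketched as follows: release a Laplace-noised histogram only when the current snapshot's L1 distance from the last released snapshot exceeds the threshold, and otherwise republish the previous release. This is a simplified sketch; the paper's privacy-budget accounting across releases is omitted:

```python
import math
import random

def laplace(scale):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dsft(snapshots, epsilon, threshold):
    """Fixed-threshold distance-based release (DSFT-style sketch)."""
    released = []
    last = None  # last snapshot for which a fresh release was made
    for hist in snapshots:
        if last is None or sum(abs(a - b) for a, b in zip(hist, last)) > threshold:
            noisy = [c + laplace(1.0 / epsilon) for c in hist]
            last = hist
            released.append(noisy)
        else:
            released.append(released[-1])  # reuse previous release
    return released

random.seed(1)
snaps = [[10, 20, 30], [10, 21, 30], [40, 5, 30]]
out = dsft(snaps, epsilon=0.5, threshold=5)
print(len(out))
```

Skipping near-duplicate snapshots is what keeps the accumulated error low: noise is only spent when the data has actually moved.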

  6. Satellite-Based Precipitation Datasets

    NASA Astrophysics Data System (ADS)

    Munchak, S. J.; Huffman, G. J.

    2017-12-01

Of the possible sources of precipitation data, those based on satellites provide the greatest spatial coverage. There is a wide selection of datasets, algorithms, and versions from which to choose, which can be confusing to non-specialists wishing to use the data. The International Precipitation Working Group (IPWG) maintains tables of the major publicly available, long-term, quasi-global precipitation data sets (http://www.isac.cnr.it/~ipwg/data/datasets.html), and this talk briefly reviews the various categories. As examples, NASA provides two sets of quasi-global precipitation data sets: the older Tropical Rainfall Measuring Mission (TRMM) Multi-satellite Precipitation Analysis (TMPA) and current Integrated Multi-satellitE Retrievals for Global Precipitation Measurement (GPM) mission (IMERG). Both provide near-real-time and post-real-time products that are uniformly gridded in space and time. The TMPA products are 3-hourly 0.25°x0.25° on the latitude band 50°N-S for about 16 years, while the IMERG products are half-hourly 0.1°x0.1° on 60°N-S for over 3 years (with plans to go to 16+ years in Spring 2018). In addition to the precipitation estimates, each data set provides fields of other variables, such as the satellite sensor providing estimates and estimated random error. The discussion concludes with advice about determining suitability for use, the necessity of being clear about product names and versions, and the need for continued support for satellite- and surface-based observation.
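To make the gridding concrete, here is a toy mapping from (lat, lon) to 0.1° x 0.1° cell indices on the 60°N-S band, in the spirit of the IMERG grid described above. The indexing convention (row 0 at 60°S, column 0 at 180°W) is our own illustrative choice, not an official product API:

```python
def grid_cell(lat, lon):
    """Map (lat, lon) in degrees to 0.1-degree cell indices on a
    60N-60S band: row 0 at 60S, column 0 at 180W (a sketch of the
    gridding convention, not an official API)."""
    if not -60.0 <= lat < 60.0:
        raise ValueError("outside the 60N-60S coverage band")
    row = int((lat + 60.0) / 0.1)    # 1200 rows
    col = int((lon + 180.0) / 0.1)   # 3600 columns
    return row, col

print(grid_cell(0.05, 0.05))  # cell containing the equator/prime-meridian point
```

The analogous mapping for TMPA would use 0.25° spacing on a 50°N-S band.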

  7. Se-SAD serial femtosecond crystallography datasets from selenobiotinyl-streptavidin

    PubMed Central

    Yoon, Chun Hong; DeMirci, Hasan; Sierra, Raymond G.; Dao, E. Han; Ahmadi, Radman; Aksit, Fulya; Aquila, Andrew L.; Batyuk, Alexander; Ciftci, Halilibrahim; Guillet, Serge; Hayes, Matt J.; Hayes, Brandon; Lane, Thomas J.; Liang, Meng; Lundström, Ulf; Koglin, Jason E.; Mgbam, Paul; Rao, Yashas; Rendahl, Theodore; Rodriguez, Evan; Zhang, Lindsey; Wakatsuki, Soichi; Boutet, Sébastien; Holton, James M.; Hunter, Mark S.

    2017-01-01

    We provide a detailed description of selenobiotinyl-streptavidin (Se-B SA) co-crystal datasets recorded using the Coherent X-ray Imaging (CXI) instrument at the Linac Coherent Light Source (LCLS) for selenium single-wavelength anomalous diffraction (Se-SAD) structure determination. Se-B SA was chosen as the model system for its high affinity between biotin and streptavidin where the sulfur atom in the biotin molecule (C10H16N2O3S) is substituted with selenium. The dataset was collected at three different transmissions (100, 50, and 10%) using a serial sample chamber setup which allows for two sample chambers, a front chamber and a back chamber, to operate simultaneously. Diffraction patterns from Se-B SA were recorded to a resolution of 1.9 Å. The dataset is publicly available through the Coherent X-ray Imaging Data Bank (CXIDB) and also on LCLS compute nodes as a resource for research and algorithm development. PMID:28440794

  8. Se-SAD serial femtosecond crystallography datasets from selenobiotinyl-streptavidin

    NASA Astrophysics Data System (ADS)

    Yoon, Chun Hong; Demirci, Hasan; Sierra, Raymond G.; Dao, E. Han; Ahmadi, Radman; Aksit, Fulya; Aquila, Andrew L.; Batyuk, Alexander; Ciftci, Halilibrahim; Guillet, Serge; Hayes, Matt J.; Hayes, Brandon; Lane, Thomas J.; Liang, Meng; Lundström, Ulf; Koglin, Jason E.; Mgbam, Paul; Rao, Yashas; Rendahl, Theodore; Rodriguez, Evan; Zhang, Lindsey; Wakatsuki, Soichi; Boutet, Sébastien; Holton, James M.; Hunter, Mark S.

    2017-04-01

    We provide a detailed description of selenobiotinyl-streptavidin (Se-B SA) co-crystal datasets recorded using the Coherent X-ray Imaging (CXI) instrument at the Linac Coherent Light Source (LCLS) for selenium single-wavelength anomalous diffraction (Se-SAD) structure determination. Se-B SA was chosen as the model system for its high affinity between biotin and streptavidin where the sulfur atom in the biotin molecule (C10H16N2O3S) is substituted with selenium. The dataset was collected at three different transmissions (100, 50, and 10%) using a serial sample chamber setup which allows for two sample chambers, a front chamber and a back chamber, to operate simultaneously. Diffraction patterns from Se-B SA were recorded to a resolution of 1.9 Å. The dataset is publicly available through the Coherent X-ray Imaging Data Bank (CXIDB) and also on LCLS compute nodes as a resource for research and algorithm development.

  9. Se-SAD serial femtosecond crystallography datasets from selenobiotinyl-streptavidin

    DOE PAGES

    Yoon, Chun Hong; DeMirci, Hasan; Sierra, Raymond G.; ...

    2017-04-25

We provide a detailed description of selenobiotinyl-streptavidin (Se-B SA) co-crystal datasets recorded using the Coherent X-ray Imaging (CXI) instrument at the Linac Coherent Light Source (LCLS) for selenium single-wavelength anomalous diffraction (Se-SAD) structure determination. Se-B SA was chosen as the model system for its high affinity between biotin and streptavidin where the sulfur atom in the biotin molecule (C10H16N2O3S) is substituted with selenium. The dataset was collected at three different transmissions (100, 50, and 10%) using a serial sample chamber setup which allows for two sample chambers, a front chamber and a back chamber, to operate simultaneously. Diffraction patterns from Se-B SA were recorded to a resolution of 1.9 Å. The dataset is publicly available through the Coherent X-ray Imaging Data Bank (CXIDB) and also on LCLS compute nodes as a resource for research and algorithm development.

  10. Se-SAD serial femtosecond crystallography datasets from selenobiotinyl-streptavidin

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yoon, Chun Hong; DeMirci, Hasan; Sierra, Raymond G.

We provide a detailed description of selenobiotinyl-streptavidin (Se-B SA) co-crystal datasets recorded using the Coherent X-ray Imaging (CXI) instrument at the Linac Coherent Light Source (LCLS) for selenium single-wavelength anomalous diffraction (Se-SAD) structure determination. Se-B SA was chosen as the model system for its high affinity between biotin and streptavidin where the sulfur atom in the biotin molecule (C10H16N2O3S) is substituted with selenium. The dataset was collected at three different transmissions (100, 50, and 10%) using a serial sample chamber setup which allows for two sample chambers, a front chamber and a back chamber, to operate simultaneously. Diffraction patterns from Se-B SA were recorded to a resolution of 1.9 Å. The dataset is publicly available through the Coherent X-ray Imaging Data Bank (CXIDB) and also on LCLS compute nodes as a resource for research and algorithm development.

  11. Igloo-Plot: a tool for visualization of multidimensional datasets.

    PubMed

    Kuntal, Bhusan K; Ghosh, Tarini Shankar; Mande, Sharmila S

    2014-01-01

Advances in science and technology have resulted in an exponential growth of multivariate (or multi-dimensional) datasets, generated across various research areas and especially in the biological sciences. Visualization and analysis of such data, with the objective of uncovering the hidden patterns therein, is an important and challenging task. We present a tool, called Igloo-Plot, for efficient visualization of multidimensional datasets. The tool addresses some of the key limitations of contemporary multivariate visualization and analysis tools. The visualization layout not only facilitates easy identification of clusters of data points having similar feature compositions, but also highlights the 'marker features' specific to each of these clusters. The applicability of the various functionalities implemented herein is demonstrated using several well-studied multi-dimensional datasets. Igloo-Plot is expected to be a valuable resource for researchers working in multivariate data mining studies. Igloo-Plot is available for download from: http://metagenomics.atc.tcs.com/IglooPlot/. Copyright © 2014 Elsevier Inc. All rights reserved.

  12. Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge

    PubMed Central

    Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila

    2018-01-01

The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline PMID:29688374
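The inferred NDCG metric used to score the challenge is built on discounted cumulative gain; a standard (non-inferred) NDCG computation for a single query with graded relevance judgments looks like this:

```python
import math

def ndcg(relevances, k=None):
    """Normalized Discounted Cumulative Gain for one ranked list.

    relevances: graded relevance of the results in ranked order
    (e.g. 2 = highly relevant, 1 = partially relevant, 0 = not).
    """
    k = k or len(relevances)
    def dcg(rels):
        # Gain discounted by log2 of (1-based rank + 1).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([2, 0, 1]))  # < 1: a relevant dataset is ranked too low
```

The "inferred" variant used by bioCADDIE estimates this quantity from incomplete relevance judgments, but the underlying gain-and-discount structure is the same.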

  13. UK surveillance: provision of quality assured information from combined datasets.

    PubMed

    Paiba, G A; Roberts, S R; Houston, C W; Williams, E C; Smith, L H; Gibbens, J C; Holdship, S; Lysons, R

    2007-09-14

    Surveillance information is most useful when provided within a risk framework, which is achieved by presenting results against an appropriate denominator. Often the datasets are captured separately and for different purposes, and will have inherent errors and biases that can be further confounded by the act of merging. The United Kingdom Rapid Analysis and Detection of Animal-related Risks (RADAR) system contains data from several sources and provides both data extracts for research purposes and reports for wider stakeholders. Considerable efforts are made to optimise the data in RADAR during the Extraction, Transformation and Loading (ETL) process. Despite efforts to ensure data quality, the final dataset inevitably contains some data errors and biases, most of which cannot be rectified during subsequent analysis. So, in order for users to establish the 'fitness for purpose' of data merged from more than one data source, Quality Statements are produced as defined within the overarching surveillance Quality Framework. These documents detail identified data errors and biases following ETL and report construction as well as relevant aspects of the datasets from which the data originated. This paper illustrates these issues using RADAR datasets, and describes how they can be minimised.

  14. Datasets, Technologies and Products from the NASA/NOAA Electronic Theater 2002

    NASA Technical Reports Server (NTRS)

    Hasler, A. Fritz; Starr, David (Technical Monitor)

    2001-01-01

An in-depth look at the Earth Science datasets used in the Etheater visualizations will be presented. This will include the satellite orbits, platforms, scan patterns, the size, temporal and spatial resolution, and compositing techniques used to obtain the datasets, as well as the spectral bands utilized.

  15. A biclustering algorithm for extracting bit-patterns from binary datasets.

    PubMed

    Rodriguez-Baena, Domingo S; Perez-Pulido, Antonio J; Aguilar-Ruiz, Jesus S

    2011-10-01

Binary datasets represent a compact and simple way to store data about the relationships between a group of objects and their possible properties. In the last few years, different biclustering algorithms have been specially developed to be applied to binary datasets. Several approaches based on matrix factorization, suffix trees or divide-and-conquer techniques have been proposed to extract useful biclusters from binary data, and these approaches provide information about the distribution of patterns and intrinsic correlations. A novel approach to extracting biclusters from binary datasets, BiBit, is introduced here. The results obtained from different experiments with synthetic data reveal the excellent performance and the robustness of BiBit to density and size of input data. Also, BiBit is applied to a central nervous system embryonic tumor gene expression dataset to test the quality of the results. A novel gene expression preprocessing methodology, based on expression level layers, and the selective search performed by BiBit, based on a very fast bit-pattern processing technique, provide very satisfactory results in quality and computational cost. The power of biclustering in finding genes involved simultaneously in different cancer processes is also shown. Finally, a comparison with Bimax, one of the most cited binary biclustering algorithms, shows that BiBit is faster while providing essentially the same results. The source and binary codes, the datasets used in the experiments and the results can be found at: http://www.upo.es/eps/bigs/BiBit.html. Contact: dsrodbae@upo.es. Supplementary data are available at Bioinformatics online.
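The core BiBit operation is a bitwise AND of two row bit-patterns followed by collecting every row that contains the resulting pattern; a simplified sketch (the real algorithm also deduplicates patterns and applies configurable minimum-size constraints):

```python
def bibit_seed(rows):
    """BiBit-style bicluster seeding (sketch): AND each pair of binary
    rows to get a candidate column pattern, then gather all rows
    containing that pattern. Rows are integers used as bitmasks."""
    biclusters = []
    n = len(rows)
    for i in range(n):
        for j in range(i + 1, n):
            pattern = rows[i] & rows[j]
            if bin(pattern).count("1") < 2:
                continue  # require at least two shared columns
            members = [r for r in range(n) if rows[r] & pattern == pattern]
            biclusters.append((pattern, members))
    return biclusters

# Four rows over four columns (bit 0 = first column).
rows = [0b1011, 0b1010, 0b0110, 0b1110]
for pattern, members in bibit_seed(rows):
    print(bin(pattern), members)
```

Representing each row as a machine word is what makes the pattern test a single AND-and-compare, which accounts for the speed advantage reported over Bimax.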

  16. Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models

    PubMed Central

    Stephens, Zachary D.; Hudson, Matthew E.; Mainzer, Liudmila S.; Taschuk, Morgan; Weber, Matthew R.; Iyer, Ravishankar K.

    2016-01-01

    An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads. PMID:27893777
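The ground-truth property that makes simulated reads useful for benchmarking can be seen in a toy version: plant substitution variants into a reference, then sample reads with per-base sequencing errors, keeping the planted variants as the known answer. This is illustrative only; NEAT's empirical mutation and sequencing-error models are far richer:

```python
import random

def simulate_reads(reference, n_reads, read_len, mut_rate, err_rate, seed=0):
    """Toy read simulation: returns (truth, reads) where truth maps
    reference positions to planted substitution alleles, and reads is
    a list of (start_position, sequence) with sequencing errors."""
    rng = random.Random(seed)
    bases = "ACGT"
    donor = list(reference)
    truth = {}
    for i, b in enumerate(donor):          # plant germline substitutions
        if rng.random() < mut_rate:
            donor[i] = rng.choice([x for x in bases if x != b])
            truth[i] = donor[i]
    reads = []
    for _ in range(n_reads):               # sample error-bearing reads
        start = rng.randrange(len(donor) - read_len + 1)
        seq = "".join(
            rng.choice([x for x in bases if x != b]) if rng.random() < err_rate else b
            for b in donor[start:start + read_len]
        )
        reads.append((start, seq))
    return truth, reads

ref = "ACGTACGTACGTACGTACGT"
truth, reads = simulate_reads(ref, n_reads=5, read_len=8, mut_rate=0.1, err_rate=0.01)
print(len(truth), len(reads))
```

A variant caller run on the simulated reads can then be scored exactly against `truth`, which is precisely what real genomes cannot provide.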

  17. Residential load and rooftop PV generation: an Australian distribution network dataset

    NASA Astrophysics Data System (ADS)

    Ratnam, Elizabeth L.; Weller, Steven R.; Kellett, Christopher M.; Murray, Alan T.

    2017-09-01

    Despite the rapid uptake of small-scale solar photovoltaic (PV) systems in recent years, public availability of generation and load data at the household level remains very limited. Moreover, such data are typically measured using bi-directional meters recording only PV generation in excess of residential load rather than recording generation and load separately. In this paper, we report a publicly available dataset consisting of load and rooftop PV generation for 300 de-identified residential customers in an Australian distribution network, with load centres covering metropolitan Sydney and surrounding regional areas. The dataset spans a 3-year period, with separately reported measurements of load and PV generation at 30-min intervals. Following a detailed description of the dataset, we identify several means by which anomalous records (e.g. due to inverter failure) are identified and excised. With the resulting 'clean' dataset, we identify key customer-specific and aggregated characteristics of rooftop PV generation and residential load.
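
    One anomaly-excision rule of the kind mentioned above can be sketched as follows. This is an illustrative assumption, not the authors' actual criteria: a day of 30-min PV readings is flagged as a possible inverter failure when generation is zero across every daylight interval. The 48-interval layout and the daylight window are invented for the example.

```python
def flag_inverter_failure(day_kwh, daylight=range(14, 38)):
    """day_kwh: 48 half-hourly PV generation readings for one day.
    Daylight is assumed to span intervals 14-37 (07:00-19:00)."""
    if len(day_kwh) != 48:
        raise ValueError("expected 48 half-hourly readings")
    # zero output through the whole daylight window suggests a failure
    return all(day_kwh[i] == 0.0 for i in daylight)

normal = [0.0] * 14 + [0.5] * 24 + [0.0] * 10
failed = [0.0] * 48
print(flag_inverter_failure(normal), flag_inverter_failure(failed))
# → False True
```

    Flagged days would be excised rather than corrected, since a failed inverter gives no information about true generation.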

  18. An extensive dataset of eye movements during viewing of complex images.

    PubMed

    Wilming, Niklas; Onat, Selim; Ossandón, José P; Açık, Alper; Kietzmann, Tim C; Kaspar, Kai; Gameiro, Ricardo R; Vormberg, Alexandra; König, Peter

    2017-01-31

    We present a dataset of free-viewing eye-movement recordings that contains more than 2.7 million fixation locations from 949 observers on more than 1000 images from different categories. This dataset aggregates and harmonizes data from 23 different studies conducted at the Institute of Cognitive Science at Osnabrück University and the University Medical Center Hamburg-Eppendorf. Trained personnel recorded all studies under standard conditions with homogeneous equipment and parameter settings. All studies allowed free eye movements, and differed in the age range of participants (~7-80 years), stimulus sizes, stimulus modifications (phase scrambled, spatial filtering, mirrored), and stimulus categories (natural and urban scenes, web sites, fractals, pink noise, and ambiguous artistic figures). The size and variability of viewing behavior within this dataset present a strong opportunity for evaluating and comparing computational models of overt attention, and furthermore, for thoroughly quantifying strategies of viewing behavior. This also makes the dataset a good starting point for investigating whether viewing strategies change in patient groups.

  19. Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset.

    PubMed

    Shin, Jaeyoung; von Lühmann, Alexander; Kim, Do-Won; Mehnert, Jan; Hwang, Han-Jeong; Müller, Klaus-Robert

    2018-02-13

    We provide an open access multimodal brain-imaging dataset of simultaneous electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings. Twenty-six healthy participants performed three cognitive tasks: 1) n-back (0-, 2- and 3-back), 2) discrimination/selection response task (DSR) and 3) word generation (WG) tasks. The data provided includes: 1) measured data, 2) demographic data, and 3) basic analysis results. For n-back (dataset A) and DSR tasks (dataset B), event-related potential (ERP) analysis was performed, and spatiotemporal characteristics and classification results for 'target' versus 'non-target' (dataset A) and symbol 'O' versus symbol 'X' (dataset B) are provided. Time-frequency analysis was performed to show the EEG spectral power to differentiate the task-relevant activations. Spatiotemporal characteristics of hemodynamic responses are also shown. For the WG task (dataset C), the EEG spectral power and spatiotemporal characteristics of hemodynamic responses are analyzed, and the potential merit of hybrid EEG-NIRS BCIs was validated with respect to classification accuracy. We expect that the dataset provided will facilitate performance evaluation and comparison of many neuroimaging analysis techniques.

  20. Improving average ranking precision in user searches for biomedical research datasets

    PubMed Central

    Gobeill, Julien; Gaudinat, Arnaud; Vachon, Thérèse; Ruch, Patrick

    2017-01-01

    Abstract Availability of research datasets is a keystone of health and life science study reproducibility and scientific progress. Owing to the heterogeneity and complexity of these data, a main challenge for research data management systems is to provide users with the best answers to their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, +22.3% higher than the median infAP of the participants' best submissions. Overall, it ranks in the top 2 when an aggregated metric using the best official measures per participant is considered. The query expansion method had a positive impact on the system's performance, increasing our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance under different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have a significant impact on the system's performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data-driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw
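
    Query expansion with word embeddings, as used in the pipeline above, generally works by adding each query term's nearest neighbours in embedding space. A minimal sketch under stated assumptions (the embedding table, vectors, similarity threshold and function names are all invented for illustration; real systems use trained embeddings such as word2vec):

```python
import math

# Toy embedding table; vectors are made up for the example.
EMB = {
    "cancer":   [0.9, 0.1, 0.0],
    "tumor":    [0.85, 0.15, 0.05],
    "genome":   [0.1, 0.9, 0.1],
    "sequence": [0.05, 0.85, 0.2],
    "weather":  [0.0, 0.1, 0.95],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand(query_terms, k=1, threshold=0.8):
    """Add up to k nearest neighbours per term, if similar enough."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in EMB:
            continue
        neighbours = sorted(
            ((cosine(EMB[term], vec), w) for w, vec in EMB.items() if w != term),
            reverse=True)
        expanded += [w for sim, w in neighbours[:k] if sim >= threshold]
    return expanded

print(expand(["cancer"]))   # → ['cancer', 'tumor']
print(expand(["weather"]))  # → ['weather'] (no neighbour passes the threshold)
```

    The threshold matters: without it, every term drags in its nearest neighbour regardless of how distant it is, which is one way expansion can hurt rather than help retrieval.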

  1. Content-level deduplication on mobile internet datasets

    NASA Astrophysics Data System (ADS)

    Hou, Ziyu; Chen, Xunxun; Wang, Yang

    2017-06-01

    Various systems and applications involve a large volume of duplicate items. Given the high data redundancy in real-world datasets, data deduplication can reduce storage capacity requirements and improve the utilization of network bandwidth. However, the chunks used by existing deduplication systems range in size from 4 KB to over 16 KB, so these systems are not applicable to datasets consisting of short records. In this paper, we propose a new framework called SF-Dedup which is able to deduplicate large sets of Mobile Internet records whose size can be smaller than 100 B, or even smaller than 10 B. SF-Dedup is a short-fingerprint, in-line deduplication scheme that resolves hash collisions. Experimental results illustrate that SF-Dedup is able to reduce storage capacity and shorten query time on a relational database.
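
    The short-fingerprint idea can be sketched as follows. This is an assumed reading of the general technique, not the SF-Dedup implementation: a deliberately short hash keeps the fingerprint index small, and the collisions that short hashes inevitably produce are resolved by comparing full record contents within a bucket. Fingerprint width and storage layout are illustrative choices.

```python
import hashlib

def fingerprint(record, nbytes=2):
    """A deliberately short fingerprint: the first `nbytes` of SHA-1."""
    return hashlib.sha1(record).digest()[:nbytes]

class ShortFpStore:
    def __init__(self):
        self.buckets = {}  # short fingerprint -> list of full records

    def add(self, record):
        """Return True if the record was new, False if a duplicate."""
        fp = fingerprint(record)
        bucket = self.buckets.setdefault(fp, [])
        if record in bucket:   # resolve hash collisions by full comparison
            return False
        bucket.append(record)
        return True

store = ShortFpStore()
print(store.add(b"user:42|GET /index"))  # → True  (new record)
print(store.add(b"user:42|GET /index"))  # → False (duplicate)
```

    The trade-off is index size versus bucket scan cost: a 2-byte fingerprint keeps only 65,536 keys, so for very large record sets the buckets grow and the full-content comparisons dominate, which is why the fingerprint width must be tuned to the dataset.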

  2. Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset

    NASA Astrophysics Data System (ADS)

    Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.

    2017-12-01

    Here we present the first national-scale flood risk analyses using high-resolution Facebook Connectivity Lab population data and data from a hyper-resolution flood hazard model. In recent years the field of large-scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodelled territories and from continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data-poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that, for robust flood risk analysis to be undertaken, both hazard and exposure data should sufficiently resolve local-scale features. Global flood frameworks now enable flood hazard data to be produced at 90 m resolution, resulting in a mismatch with available population datasets, which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook, providing gridded population data at 5 m resolution, a resolution increase over previous countrywide datasets of multiple orders of magnitude. Flood risk analyses undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.

  3. Comparing the accuracy of food outlet datasets in an urban environment.

    PubMed

    Wong, Michelle S; Peyton, Jennifer M; Shields, Timothy M; Curriero, Frank C; Gudzune, Kimberly A

    2017-05-11

    Studies that investigate the relationship between the retail food environment and health outcomes often use geospatial datasets. Prior studies have identified challenges in using the most common data sources. Retail food environment datasets created through academic-government partnerships present an alternative, but their validity (retail existence, type, location) has not yet been assessed. In our study, we used ground-truth data to compare the validity of two datasets on the retail food environment, a 2015 commercial dataset (InfoUSA) and data collected from 2012 to 2014 through the Maryland Food Systems Mapping Project (MFSMP), an academic-government partnership, in two low-income, inner-city neighbourhoods in Baltimore City. We compared the sensitivity and positive predictive value (PPV) of the commercial and academic-government partnership data against ground-truth data for two broad categories of unhealthy food retailers: small food retailers and quick-service restaurants. Ground-truth data were collected in 2015 and analysed in 2016. Compared to the ground-truth data, MFSMP and InfoUSA generally had similar sensitivity, greater than 85%. MFSMP had higher PPV than InfoUSA for both small food retailers (MFSMP: 56.3% vs InfoUSA: 40.7%) and quick-service restaurants (MFSMP: 58.6% vs InfoUSA: 36.4%). We conclude that data from academic-government partnerships like MFSMP might be an attractive alternative to, and improvement over, relying only on commercial data. Other research institutes or cities might consider efforts to create and maintain such an environmental dataset. Even if these datasets cannot be updated on an annual basis, they are likely more accurate than commercial data.
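
    The sensitivity and PPV figures in this comparison follow the standard definitions: sensitivity is the fraction of ground-truth outlets the dataset lists, and PPV is the fraction of listed outlets that really exist. A minimal sketch (the store names are invented examples, not from the study):

```python
def sensitivity_ppv(dataset, ground_truth):
    """Compare a set of listed outlets against ground-truth outlets."""
    tp = len(dataset & ground_truth)   # listed and actually present
    fp = len(dataset - ground_truth)   # listed but not found on the ground
    fn = len(ground_truth - dataset)   # on the ground but missing from list
    return tp / (tp + fn), tp / (tp + fp)

truth   = {"corner store A", "carryout B", "grocer C", "deli D"}
listing = {"corner store A", "carryout B", "grocer C", "closed shop X"}

sens, ppv = sensitivity_ppv(listing, truth)
print(round(sens, 2), round(ppv, 2))  # → 0.75 0.75
```

    A stale commercial listing typically loses PPV faster than sensitivity: closed outlets linger in the data (false positives) even while most real outlets remain listed, matching the pattern reported above.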

  4. Efficient segmentation of 3D fluoroscopic datasets from mobile C-arm

    NASA Astrophysics Data System (ADS)

    Styner, Martin A.; Talib, Haydar; Singh, Digvijay; Nolte, Lutz-Peter

    2004-05-01

    The emerging mobile fluoroscopic 3D technology linked with a navigation system combines the advantages of CT-based and C-arm-based navigation. The intra-operative, automatic segmentation of 3D fluoroscopy datasets enables the combined visualization of surgical instruments and anatomical structures for enhanced planning, surgical eye-navigation and landmark digitization. We performed a thorough evaluation of several segmentation algorithms using a large set of data from different anatomical regions and man-made phantom objects. The analyzed segmentation methods include automatic thresholding, morphological operations, an adapted region growing method and an implicit 3D geodesic snake method. In regard to computational efficiency, all methods performed within acceptable limits on a standard desktop PC (30 s to 5 min). In general, the best results were obtained with datasets from long bones, followed by extremities. The segmentations of spine, pelvis and shoulder datasets were generally of poorer quality. As expected, the threshold-based methods produced the worst results. The combined thresholding and morphological operations method was considered appropriate for a smaller set of clean images. The region growing method performed much better in regard to computational efficiency and segmentation correctness, especially for datasets of joints, and lumbar and cervical spine regions. The less efficient implicit snake method was additionally able to remove wrongly segmented skin tissue regions. This study presents a step towards efficient intra-operative segmentation of 3D fluoroscopy datasets, but there is room for improvement. Next, we plan to study model-based approaches for datasets from the knee and hip joint region, which would then be applied to all anatomical regions in our continuing development of an ideal segmentation procedure for 3D fluoroscopic images.

  5. Evaluation of reanalysis datasets against observational soil temperature data over China

    NASA Astrophysics Data System (ADS)

    Yang, Kai; Zhang, Jingyong

    2018-01-01

    Soil temperature is a key land surface variable and a potential predictor of seasonal climate anomalies and extremes. Using observational soil temperature data in China for 1981-2005, we evaluate four reanalysis datasets, the land surface reanalysis of the European Centre for Medium-Range Weather Forecasts (ERA-Interim/Land), the second modern-era retrospective analysis for research and applications (MERRA-2), the National Center for Environmental Prediction Climate Forecast System Reanalysis (NCEP-CFSR), and version 2 of the Global Land Data Assimilation System (GLDAS-2.0), with a focus on the 40 cm soil layer. The results show that the reanalysis data broadly reproduce the spatial distributions of soil temperature in summer and winter, especially over eastern China, but generally underestimate their magnitudes. Owing to the influence of precipitation on soil temperature, the four datasets perform better in winter than in summer. The ERA-Interim/Land and GLDAS-2.0 produce spatial characteristics of the climatological mean that are similar to observations. The interannual variability of soil temperature is well reproduced by the ERA-Interim/Land dataset in summer and by the CFSR dataset in winter. The linear trend of soil temperature in summer is well reproduced by the reanalysis datasets. We demonstrate that soil heat fluxes in April-June and in winter are highly correlated with the soil temperature in summer and winter, respectively. Different estimations of surface energy balance components contribute to the different behaviors of the reanalysis products in estimating soil temperature. In addition, the reanalysis datasets largely reproduce the northwest-southeast gradient of soil temperature memory over China.

  6. Mutual-information-based registration for ultrasound and CT datasets

    NASA Astrophysics Data System (ADS)

    Firle, Evelyn A.; Wesarg, Stefan; Dold, Christian

    2004-05-01

    In many applications for minimally invasive surgery the acquisition of intra-operative medical images is helpful if not absolutely necessary. For brachytherapy in particular, imaging is critically important to the safe delivery of the therapy. Modern computed tomography (CT) and magnetic resonance (MR) scanners allow minimally invasive procedures to be performed under direct imaging guidance. However, conventional scanners do not have real-time imaging capability and are expensive technologies requiring a special facility. Ultrasound (U/S) is a much cheaper and one of the most flexible imaging modalities. It can be moved to the application room as required, and the physician sees what is happening as it occurs. Nevertheless, these 3D intra-operative U/S images may be easier to interpret if they are used in combination with less noisy preoperative data such as CT. The purpose of our current investigation is to develop a registration tool for automatically combining pre-operative CT volumes with intra-operatively acquired 3D U/S datasets. The applied alignment procedure is based on the information-theoretic approach of maximizing the mutual information of two arbitrary datasets from different modalities. Since the CT datasets cover a much larger field of view, we introduced a bounding box to narrow down the region of interest within the CT dataset. We conducted a phantom experiment using a CIRS Model 53 U/S Prostate Training Phantom to evaluate the feasibility and accuracy of the proposed method.
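
    The mutual-information similarity at the heart of this kind of multimodal registration can be computed from a joint intensity histogram of the overlapping voxels; the optimizer then searches for the transform that maximizes it. The sketch below shows only the similarity measure on made-up 1D intensity lists (the search over transforms is omitted, and this is a generic illustration, not the authors' tool):

```python
import math
from collections import Counter

def mutual_information(img_a, img_b):
    """I(A;B) in nats from the joint histogram of paired intensities."""
    assert len(img_a) == len(img_b)
    n = len(img_a)
    joint = Counter(zip(img_a, img_b))   # joint intensity histogram
    pa, pb = Counter(img_a), Counter(img_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((pa[a] / n) * (pb[b] / n)))
    return mi

# Two modalities "see" the same structure with different intensities.
aligned    = mutual_information([0, 0, 1, 1], [5, 5, 9, 9])
misaligned = mutual_information([0, 0, 1, 1], [5, 9, 5, 9])
print(aligned > misaligned)  # → True: MI peaks when images align
```

    Note that MI never requires the two modalities to have similar intensities, only a consistent statistical relationship between them, which is why it works for CT-to-ultrasound where simple intensity differencing would fail.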

  7. Creation of the Naturalistic Engagement in Secondary Tasks (NEST) distracted driving dataset.

    PubMed

    Owens, Justin M; Angell, Linda; Hankey, Jonathan M; Foley, James; Ebe, Kazutoshi

    2015-09-01

    Distracted driving has become a topic of critical importance to driving safety research over the past several decades. Naturalistic driving data offer a unique opportunity to study how drivers engage with secondary tasks in real-world driving; however, the complexities involved with identifying and coding relevant epochs of naturalistic data have limited its accessibility to the general research community. This project was developed to help address this problem by creating an accessible dataset of driver behavior and situational factors observed during distraction-related safety-critical events and baseline driving epochs, using the Strategic Highway Research Program 2 (SHRP2) naturalistic dataset. The new NEST (Naturalistic Engagement in Secondary Tasks) dataset was created using crashes and near-crashes from the SHRP2 dataset that were identified as including secondary task engagement as a potential contributing factor. Data coding included frame-by-frame video analysis of secondary task and hands-on-wheel activity, as well as summary event information. In addition, information about each secondary task engagement within the trip prior to the crash/near-crash was coded at a higher level. Data were also coded for four baseline epochs and trips per safety-critical event. 1,180 events and baseline epochs were coded, and a dataset was constructed. The project team is currently working to determine the most useful way to allow broad public access to the dataset. We anticipate that the NEST dataset will be extraordinarily useful in allowing qualified researchers access to timely, real-world data concerning how drivers interact with secondary tasks during safety-critical events and baseline driving. The coded dataset developed for this project will allow future researchers to have access to detailed data on driver secondary task engagement in the real world. It will be useful for standalone research, as well as for integration with additional SHRP2 data to enable the

  8. A multimodal dataset for authoring and editing multimedia content: The MAMEM project.

    PubMed

    Nikolopoulos, Spiros; Petrantonakis, Panagiotis C; Georgiadis, Kostas; Kalaganis, Fotis; Liaros, Georgios; Lazarou, Ioulietta; Adam, Katerina; Papazoglou-Chalikias, Anastasios; Chatzilari, Elisavet; Oikonomou, Vangelis P; Kumar, Chandan; Menges, Raphael; Staab, Steffen; Müller, Daniel; Sengupta, Korok; Bostantjopoulou, Sevasti; Katsarou, Zoe; Zeilig, Gabi; Plotnik, Meir; Gotlieb, Amihai; Kizoni, Racheli; Fountoukidou, Sofia; Ham, Jaap; Athanasiou, Dimitrios; Mariakaki, Agnes; Comanducci, Dario; Sabatini, Edoardo; Nistico, Walter; Plank, Markus; Kompatsiaris, Ioannis

    2017-12-01

    We present a dataset that combines multimodal biosignals and eye-tracking information gathered under a human-computer interaction framework. The dataset was developed within the MAMEM project, which aims to endow people with motor disabilities with the ability to edit and author multimedia content through mental commands and gaze activity. The dataset includes EEG, eye-tracking, and physiological (GSR and heart rate) signals collected from 34 individuals (18 able-bodied and 16 motor-impaired). Data were collected during interaction with a specifically designed interface for web browsing and multimedia content manipulation, and during imaginary-movement tasks. The presented dataset will contribute towards the development and evaluation of modern human-computer interaction systems that would foster the integration of people with severe motor impairments back into society.

  9. Comparative analysis and assessment of M. tuberculosis H37Rv protein-protein interaction datasets

    PubMed Central

    2011-01-01

    Background M. tuberculosis is a formidable bacterial pathogen. There is thus an increasing demand for understanding the function and relationships of proteins in various strains of M. tuberculosis. Protein-protein interaction (PPI) data are crucial for this kind of knowledge. However, the quality of the main available M. tuberculosis PPI datasets is unclear. This hampers the effectiveness of research works that rely on these PPI datasets. Here, we analyze the two main available M. tuberculosis H37Rv PPI datasets. The first is the high-throughput B2H PPI dataset from Wang et al.'s recent paper in the Journal of Proteome Research. The second is from the STRING database, version 8.3, consisting entirely of H37Rv PPIs predicted using various methods. We find that these two datasets have a surprisingly low level of agreement. We postulate the following causes for this low level of agreement: (i) the H37Rv B2H PPI dataset is of low quality; (ii) the H37Rv STRING PPI dataset is of low quality; and/or (iii) the H37Rv STRING PPIs are predictions of other forms of functional association rather than direct physical interactions. Results To test the quality of these two datasets, we evaluate them based on correlated gene expression profiles, coherent informative GO term annotations, and conservation in other organisms. We observe a significantly greater portion of PPIs in the H37Rv STRING PPI dataset (with score ≥ 770) having correlated gene expression profiles and coherent informative GO term annotations in both interaction partners than in the H37Rv B2H PPI dataset. Predicted H37Rv interologs derived from non-M. tuberculosis experimental PPIs are much more similar to the H37Rv STRING functional associations dataset (with score ≥ 770) than to the H37Rv B2H PPI dataset. H37Rv predicted physical interologs from IntAct also show extremely low similarity with the H37Rv B2H PPI dataset; and this similarity level is much lower than that between the S. aureus MRSA252

  10. geoknife: Reproducible web-processing of large gridded datasets

    USGS Publications Warehouse

    Read, Jordan S.; Walker, Jordan I.; Appling, Alison P.; Blodgett, David L.; Read, Emily K.; Winslow, Luke A.

    2016-01-01

    Geoprocessing of large gridded data according to overlap with irregular landscape features is common to many large-scale ecological analyses. The geoknife R package was created to facilitate reproducible analyses of gridded datasets found on the U.S. Geological Survey Geo Data Portal web application or elsewhere, using a web-enabled workflow that eliminates the need to download and store large datasets that are reliably hosted on the Internet. The package provides access to several data subset and summarization algorithms that are available on remote web processing servers. Outputs from geoknife include spatial and temporal data subsets, spatially-averaged time series values filtered by user-specified areas of interest, and categorical coverage fractions for various land-use types.

  11. An innovative privacy preserving technique for incremental datasets on cloud computing.

    PubMed

    Aldeen, Yousra Abdul Alsahib S; Salleh, Mazleena; Aljeroudi, Yazan

    2016-08-01

    Cloud computing (CC) is a service-based delivery model offering enormous computing power and data storage across connected communication channels. It has given strong technological impetus to the web-mediated IT industry, where users can easily share private data for further analysis and mining, and user-friendly CC services make it economical to deploy a wide range of applications. Meanwhile, easy data sharing has also invited phishing attacks and malware-assisted security threats. Privacy-sensitive applications such as cloud-based health services, built with several economic and operational benefits, necessitate enhanced security. Thus, comprehensive cyberspace security and mitigation against phishing attacks are mandatory to protect overall data privacy. Typically, the datasets of diverse applications are anonymized to give their owners better privacy, without extending all secrecy requirements to newly added records. Some proposed techniques address this issue by re-anonymizing the datasets from scratch, but full privacy protection over incremental datasets on CC is far from being achieved. Certainly, the distribution of huge data volumes across multiple storage nodes limits privacy preservation. In this view, we propose a new anonymization technique to attain better privacy protection with high data utility over distributed and incremental datasets on CC. The proficiency of data privacy preservation and improved confidentiality requirements is demonstrated through performance evaluation. Copyright © 2016 Elsevier Inc. All rights reserved.

  12. Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

    PubMed Central

    Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene

    2018-01-01

    Abstract The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie PMID:29688379

  13. Simultaneous acquisition of EEG and NIRS during cognitive tasks for an open access dataset

    PubMed Central

    Shin, Jaeyoung; von Lühmann, Alexander; Kim, Do-Won; Mehnert, Jan; Hwang, Han-Jeong; Müller, Klaus-Robert

    2018-01-01

    We provide an open access multimodal brain-imaging dataset of simultaneous electroencephalography (EEG) and near-infrared spectroscopy (NIRS) recordings. Twenty-six healthy participants performed three cognitive tasks: 1) n-back (0-, 2- and 3-back), 2) discrimination/selection response task (DSR) and 3) word generation (WG) tasks. The data provided includes: 1) measured data, 2) demographic data, and 3) basic analysis results. For n-back (dataset A) and DSR tasks (dataset B), event-related potential (ERP) analysis was performed, and spatiotemporal characteristics and classification results for ‘target’ versus ‘non-target’ (dataset A) and symbol ‘O’ versus symbol ‘X’ (dataset B) are provided. Time-frequency analysis was performed to show the EEG spectral power to differentiate the task-relevant activations. Spatiotemporal characteristics of hemodynamic responses are also shown. For the WG task (dataset C), the EEG spectral power and spatiotemporal characteristics of hemodynamic responses are analyzed, and the potential merit of hybrid EEG-NIRS BCIs was validated with respect to classification accuracy. We expect that the dataset provided will facilitate performance evaluation and comparison of many neuroimaging analysis techniques. PMID:29437166

  14. Multiresolution persistent homology for excessively large biomolecular datasets

    NASA Astrophysics Data System (ADS)

    Xia, Kelin; Zhao, Zhixiong; Wei, Guo-Wei

    2015-10-01

    Although persistent homology has emerged as a promising tool for the topological simplification of complex data, it is computationally intractable for large datasets. We introduce multiresolution persistent homology to handle excessively large datasets. We match the resolution with the scale of interest so as to represent large-scale datasets with appropriate resolution. We utilize the flexibility-rigidity index to assess the topological connectivity of the dataset and define a rigidity density for the filtration analysis. By appropriately tuning the resolution of the rigidity density, we are able to focus the topological lens on the scale of interest. The proposed multiresolution topological analysis is validated by a hexagonal fractal image which has three distinct scales. We further demonstrate the proposed method for extracting topological fingerprints from DNA molecules. In particular, the topological persistence of a virus capsid with 273,780 atoms is successfully analyzed, which would otherwise be inaccessible to the normal point cloud method and unreliable using coarse-grained multiscale persistent homology. The proposed method has also been successfully applied to protein domain classification; to our knowledge, this is the first time that persistent homology has been used for practical protein domain analysis. The proposed multiresolution topological method has potential applications to arbitrary datasets, such as social networks, biological networks, and graphs.

  15. Decoys Selection in Benchmarking Datasets: Overview and Perspectives

    PubMed Central

    Réau, Manon; Langenfeld, Florent; Zagury, Jean-François; Lagarde, Nathalie; Montes, Matthieu

    2018-01-01

    Virtual Screening (VS) is designed to prospectively help identifying potential hits, i.e., compounds capable of interacting with a given target and potentially modulate its activity, out of large compound collections. Among the variety of methodologies, it is crucial to select the protocol that is the most adapted to the query/target system under study and that yields the most reliable output. To this aim, the performance of VS methods is commonly evaluated and compared by computing their ability to retrieve active compounds in benchmarking datasets. The benchmarking datasets contain a subset of known active compounds together with a subset of decoys, i.e., assumed non-active molecules. The composition of both the active and the decoy compounds subsets is critical to limit the biases in the evaluation of the VS methods. In this review, we focus on the selection of decoy compounds that has considerably changed over the years, from randomly selected compounds to highly customized or experimentally validated negative compounds. We first outline the evolution of decoys selection in benchmarking databases as well as current benchmarking databases that tend to minimize the introduction of biases, and secondly, we propose recommendations for the selection and the design of benchmarking datasets. PMID:29416509

  16. Parton Distributions based on a Maximally Consistent Dataset

    NASA Astrophysics Data System (ADS)

    Rojo, Juan

    2016-04-01

    The choice of data that enters a global QCD analysis can have a substantial impact on the resulting parton distributions and their predictions for collider observables. One of the main reasons for this is the possible presence of inconsistencies, either internal within an experiment or external between different experiments. In order to assess the robustness of the global fit, different definitions of a conservative PDF set, that is, a PDF set based on a maximally consistent dataset, have been introduced. However, these approaches are typically affected by theory biases in the selection of the dataset. In this contribution, after a brief overview of recent NNPDF developments, we propose a new, fully objective definition of a conservative PDF set, based on the Bayesian reweighting approach. Using the new NNPDF3.0 framework, we produce various conservative sets, which turn out to be in mutual agreement within the respective PDF uncertainties, as well as with the global fit. We explore some of their implications for LHC phenomenology, also finding good consistency with the global fit result. These results provide a non-trivial validation test of the new NNPDF3.0 fitting methodology and indicate that possible inconsistencies in the fitted dataset do not substantially affect the global fit PDFs.
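
    The Bayesian reweighting referred to above follows the Giele-Keller form used by NNPDF: given the chi-squared of each Monte Carlo replica against n new data points, replica k receives weight w_k proportional to (chi2_k)^((n-1)/2) exp(-chi2_k/2), and the "effective number of replicas" measures how much information survives. A minimal numpy sketch of those two formulas, with toy chi-squared values rather than NNPDF ones:

```python
import numpy as np

def reweight(chi2, n_data):
    """Giele-Keller replica weights, normalized so they sum to N."""
    chi2 = np.asarray(chi2, dtype=float)
    logw = 0.5 * (n_data - 1) * np.log(chi2) - 0.5 * chi2
    logw -= logw.max()              # guard against overflow in exp
    w = np.exp(logw)
    w *= len(w) / w.sum()
    return w

def n_eff(w):
    """Effective number of replicas (Shannon-entropy definition)."""
    N = len(w)
    w = np.maximum(np.asarray(w, dtype=float), 1e-300)
    return float(np.exp((w * np.log(N / w)).sum() / N))
```

    With all replicas equally consistent with the new data, every weight is 1 and no information is lost; a replica with a much lower chi-squared dominates the reweighted ensemble.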

  17. A reference human genome dataset of the BGISEQ-500 sequencer.

    PubMed

    Huang, Jie; Liang, Xinming; Xuan, Yuankai; Geng, Chunyu; Li, Yuxiang; Lu, Haorong; Qu, Shoufang; Mei, Xianglin; Chen, Hongbo; Yu, Ting; Sun, Nan; Rao, Junhua; Wang, Jiahao; Zhang, Wenwei; Chen, Ying; Liao, Sha; Jiang, Hui; Liu, Xin; Yang, Zhaopeng; Mu, Feng; Gao, Shangxian

    2017-05-01

    BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinatorial probe anchor synthesis technologies developed from Complete Genomics™ sequencing, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variants using this dataset, estimated their accuracy, and compared them to variants identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%), and better SNP detection accuracy than the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). For insertions and deletions (indels), however, we found lower accuracy for BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50, respectively; sensitivity = 88.52% and 70.93%) than for the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as a reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform. © The Authors 2017. Published by Oxford University Press.
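
    The accuracy figures quoted above reduce to two standard confusion-matrix ratios: sensitivity = TP/(TP+FN) against the truth set, and FPR = FP divided by the number of truth-negative sites considered. Conventions differ slightly between benchmarking tools; the sketch below uses one common definition with toy counts chosen to land near the quoted SNP numbers, not the study's actual tallies:

```python
def variant_call_metrics(tp, fp, fn, negative_sites):
    """Sensitivity and false-positive rate for a call set benchmarked
    against a truth set (e.g. GIAB NA12878). `negative_sites` is the
    number of truth-negative positions considered."""
    sensitivity = tp / (tp + fn)
    fpr = fp / negative_sites
    return sensitivity, fpr

sens, fpr = variant_call_metrics(tp=9_620_000, fp=2_000, fn=380_000,
                                 negative_sites=1_000_000_000)
```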

  18. Pooled assembly of marine metagenomic datasets: enriching annotation through chimerism.

    PubMed

    Magasin, Jonathan D; Gerloff, Dietlind L

    2015-02-01

    Despite advances in high-throughput sequencing, marine metagenomic samples remain largely opaque. A typical sample contains billions of microbial organisms from thousands of genomes and quadrillions of DNA base pairs. Its derived metagenomic dataset underrepresents this complexity by orders of magnitude because of the sparseness and shortness of sequencing reads. Read shortness and sequencing errors pose a major challenge to accurate species and functional annotation. This includes distinguishing known from novel species. Often the majority of reads cannot be annotated and thus cannot help our interpretation of the sample. Here, we demonstrate quantitatively how careful assembly of marine metagenomic reads within, but also across, datasets can alleviate this problem. For 10 simulated datasets, each with species complexity modeled on a real counterpart, chimerism remained within the same species for most contigs (97%). For 42 real pyrosequencing ('454') datasets, assembly increased the proportion of annotated reads, and even more so when datasets were pooled, by on average 1.6% (max 6.6%) for species, 9.0% (max 28.7%) for Pfam protein domains and 9.4% (max 22.9%) for PANTHER gene families. Our results outline exciting prospects for data sharing in the metagenomics community. While chimeric sequences should be avoided in other areas of metagenomics (e.g. biodiversity analyses), conservative pooled assembly is advantageous for annotation specificity and sensitivity. Intriguingly, our experiment also found potential prospects for (low-cost) discovery of new species in 'old' data. dgerloff@ffame.org Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  19. Classification of foods by transferring knowledge from ImageNet dataset

    NASA Astrophysics Data System (ADS)

    Heravi, Elnaz J.; Aghdam, Hamed H.; Puig, Domenec

    2017-03-01

    Automatic classification of foods is a way to control food intake and tackle obesity. However, it is a challenging problem, since foods are highly deformable and complex objects. Results on the ImageNet dataset have revealed that Convolutional Neural Networks (ConvNets) have great expressive power for modeling natural objects. Nonetheless, it is not trivial to train a ConvNet from scratch for food classification, because ConvNets require large datasets and, to our knowledge, there is no large public food dataset for this purpose. An alternative solution is to transfer knowledge from trained ConvNets to the domain of foods. In this work, we study how transferable state-of-the-art ConvNets are to the task of food classification. We also propose a method for transferring knowledge from a bigger ConvNet to a smaller ConvNet while keeping its accuracy similar to that of the bigger ConvNet. Our experiments on the UECFood256 dataset show that GoogLeNet, VGG and residual networks produce comparable results if we start transferring knowledge from the appropriate layer. In addition, we show that our method is able to effectively transfer knowledge to the smaller ConvNet using unlabeled samples.
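
    Teacher-to-student transfer of the kind described above is commonly implemented as knowledge distillation: the small network is trained to match the temperature-softened output distribution of the big network, which also works on unlabeled samples (drop the hard-label term). The abstract does not spell out the authors' exact scheme, so the sketch below is the generic Hinton-style loss, not necessarily their method:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * hard-label CE."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1).mean()
    p_hard = softmax(student_logits)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return float(alpha * T ** 2 * kl + (1 - alpha) * ce)
```

    For unlabeled samples one would set alpha to 1, leaving only the soft-target term.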

  20. National Hydropower Plant Dataset, Version 2 (FY18Q3)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Samu, Nicole; Kao, Shih-Chieh; O'Connor, Patrick

    The National Hydropower Plant Dataset, Version 2 (FY18Q3) is a geospatially comprehensive point-level dataset containing locations and key characteristics of U.S. hydropower plants that are currently either in the hydropower development pipeline (pre-operational), operational, withdrawn, or retired. These data are provided in GIS and tabular formats with corresponding metadata for each. In addition, we include access to download two versions of the National Hydropower Map, which was produced with these data (i.e., Map 1 displays the geospatial distribution and characteristics of all operational hydropower plants; Map 2 displays the geospatial distribution and characteristics of operational hydropower plants with pumped storage and mixed capabilities only). This dataset is a subset of ORNL's Existing Hydropower Assets data series, updated quarterly as part of ORNL's National Hydropower Asset Assessment Program.

  1. Generation of open biomedical datasets through ontology-driven transformation and integration processes.

    PubMed

    Carmen Legaz-García, María Del; Miñarro-Giménez, José Antonio; Menárguez-Tortosa, Marcos; Fernández-Breis, Jesualdo Tomás

    2016-06-03

    Biomedical research usually requires combining large volumes of data from multiple heterogeneous sources, which makes the integrated exploitation of such data difficult. The Semantic Web paradigm offers a natural technological space for data integration and exploitation by generating machine-readable content. Linked Open Data is a Semantic Web initiative that promotes the publication and sharing of data in machine-readable semantic formats. We present an approach for the transformation and integration of heterogeneous biomedical data with the objective of generating open biomedical datasets in Semantic Web formats. The transformation of the data is based on mappings between the entities of the data schema and the ontological infrastructure that provides meaning to the content. Our approach permits different types of mappings and includes the possibility of defining complex transformation patterns. Once the mappings are defined, they can be automatically applied to datasets to generate logically consistent content, and the mappings can be reused in further transformation processes. The results of our research are (1) a common transformation and integration process for heterogeneous biomedical data; (2) the application of Linked Open Data principles to generate interoperable, open biomedical datasets; and (3) a software tool, called SWIT, that implements the approach. In this paper we also describe how we have applied SWIT in different biomedical scenarios and some lessons learned. We have presented an approach that is able to generate open biomedical repositories in Semantic Web formats. SWIT applies the Linked Open Data principles in the generation of the datasets, allowing their content to be linked to external repositories and creating linked open datasets. SWIT datasets may contain data from multiple sources and schemas, thus becoming integrated datasets.

  2. Securely Measuring the Overlap between Private Datasets with Cryptosets

    PubMed Central

    Swamidass, S. Joshua; Matlock, Matthew; Rozenblit, Leon

    2015-01-01

    Many scientific questions are best approached by sharing data—collected by different groups or across large collaborative networks—into a combined analysis. Unfortunately, some of the most interesting and powerful datasets—like health records, genetic data, and drug discovery data—cannot be freely shared because they contain sensitive information. In many situations, knowing if private datasets overlap determines if it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset’s contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach “information-theoretic” security, the strongest type of security possible in cryptography, which cannot be cracked even with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with stable accuracy, and secure. PMID:25714898
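
    The mechanics can be illustrated in a few lines: each party hashes its private identifiers into a short, fixed-length vector of bin counts, shares only that vector, and the overlap is estimated from the correlation of the two count vectors scaled by the set sizes. This sketch follows the spirit of the published scheme but simplifies freely; the hash choice, vector length, and estimator details below are ours, not the paper's:

```python
import hashlib
import numpy as np

def cryptoset(ids, length=512):
    """Publicly shareable summary: only bin counts leave the institution."""
    counts = np.zeros(length, dtype=float)
    for s in ids:
        h = int(hashlib.sha256(s.encode()).hexdigest(), 16)
        counts[h % length] += 1
    return counts

def estimate_overlap(ca, cb):
    """Overlap estimate from the Pearson correlation of two cryptosets,
    scaled by the geometric mean of the set sizes."""
    r = np.corrcoef(ca, cb)[0, 1]
    return float(r * np.sqrt(ca.sum() * cb.sum()))

# two private cohorts sharing exactly 500 identifiers
a = cryptoset(f"patient-{i}" for i in range(1000))
b = cryptoset(f"patient-{i}" for i in range(500, 1500))
est = estimate_overlap(a, b)   # statistical estimate, close to 500
```

    The estimate is statistical rather than exact, which is precisely what makes the summary safe to publish.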

  3. NASA Cold Land Processes Experiment (CLPX 2002/03): Atmospheric analyses datasets

    Treesearch

    Glen E. Liston; Daniel L. Birkenheuer; Christopher A. Hiemstra; Donald W. Cline; Kelly Elder

    2008-01-01

    This paper describes the Local Analysis and Prediction System (LAPS) and the 20-km horizontal grid version of the Rapid Update Cycle (RUC20) atmospheric analyses datasets, which are available as part of the Cold Land Processes Field Experiment (CLPX) data archive. The LAPS dataset contains spatially and temporally continuous atmospheric and surface variables over...

  4. interPopula: a Python API to access the HapMap Project dataset

    PubMed Central

    2010-01-01

    Background The HapMap project is a publicly available catalogue of common genetic variants that occur in humans, currently including several million SNPs across 1115 individuals spanning 11 different populations. This important database does not provide any programmatic access to the dataset; furthermore, no standard relational database interface is provided. Results interPopula is a Python API to access the HapMap dataset. interPopula provides integration facilities with both the Python software ecosystem (e.g. Biopython and matplotlib) and other relevant human population datasets (e.g. Ensembl gene annotation and UCSC Known Genes). A set of guidelines and code examples to address possible inconsistencies across heterogeneous data sources is also provided. Conclusions interPopula is a straightforward and flexible Python API that facilitates the construction of scripts and applications that require access to the HapMap dataset. PMID:21210977
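
    The "standard relational database interface" such an API can layer over flat HapMap files is easy to picture with sqlite3 from the Python standard library. The schema and frequency values below are invented for the example; they are not interPopula's actual schema or real HapMap numbers:

```python
import sqlite3

# Hypothetical rows in the spirit of HapMap allele-frequency files:
# (rs id, population code, reference-allele frequency).
rows = [
    ("rs1042522", "YRI", 0.37),
    ("rs1042522", "CEU", 0.72),
    ("rs429358",  "YRI", 0.27),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE freq (rsid TEXT, pop TEXT, ref_freq REAL)")
con.executemany("INSERT INTO freq VALUES (?, ?, ?)", rows)

# Standard, parameterized SQL instead of ad hoc flat-file parsing:
ceu = con.execute(
    "SELECT ref_freq FROM freq WHERE rsid = ? AND pop = ?",
    ("rs1042522", "CEU"),
).fetchone()[0]
```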

  5. Efficient genotype compression and analysis of large genetic variation datasets

    PubMed Central

    Layer, Ryan M.; Kindlon, Neil; Karczewski, Konrad J.; Quinlan, Aaron R.

    2015-01-01

    Genotype Query Tools (GQT) is a new indexing strategy that expedites analyses of genome-variation datasets in VCF format based on sample genotypes, phenotypes and relationships. GQT’s compressed genotype index minimizes decompression for analysis, and its performance relative to existing methods improves with cohort size. We show substantial (up to 443-fold) performance gains over existing methods and demonstrate GQT’s utility for exploring massive datasets involving thousands to millions of genomes. PMID:26550772
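
    Gains of this kind come from querying an index of per-variant sample bitmaps instead of re-parsing VCF records. The toy below shows only the flavor of such bitmap queries; GQT's actual index adds word-aligned compression and much more:

```python
import numpy as np

# Toy genotype matrix: rows = variants, cols = samples,
# values = number of alternate alleles (0, 1, 2).
G = np.array([
    [0, 1, 2, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 2],
])

# One bitmap per variant marking samples that carry an alternate allele.
carrier = G > 0

# Query: variants where sample 0 AND sample 1 are both carriers.
both = np.where(carrier[:, 0] & carrier[:, 1])[0]

# Query: carrier count per variant, without touching raw genotypes again.
counts = carrier.sum(axis=1)
```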

  6. Pantheon 1.0, a manually verified dataset of globally famous biographies.

    PubMed

    Yu, Amy Zhao; Ronen, Shahar; Hu, Kevin; Lu, Tiffany; Hidalgo, César A

    2016-01-05

    We present the Pantheon 1.0 dataset: a manually verified dataset of individuals that have transcended linguistic, temporal, and geographic boundaries. The Pantheon 1.0 dataset includes the 11,341 biographies present in more than 25 languages in Wikipedia and is enriched with: (i) manually verified demographic information (place and date of birth, gender); (ii) a taxonomy of occupations classifying each biography at three levels of aggregation; and (iii) two measures of global popularity: the number of languages in which a biography is present in Wikipedia (L), and the Historical Popularity Index (HPI), a metric that combines information on L, time since birth, and page-views (2008-2013). We compare the Pantheon 1.0 dataset to data from the 2003 book, Human Accomplishments, and also to external measures of accomplishment in individual games and sports: Tennis, Swimming, Car Racing, and Chess. In all of these cases we find that measures of popularity (L and HPI) correlate highly with individual accomplishment, suggesting that measures of global popularity proxy the historical impact of individuals.

  7. Pantheon 1.0, a manually verified dataset of globally famous biographies

    PubMed Central

    Yu, Amy Zhao; Ronen, Shahar; Hu, Kevin; Lu, Tiffany; Hidalgo, César A.

    2016-01-01

    We present the Pantheon 1.0 dataset: a manually verified dataset of individuals that have transcended linguistic, temporal, and geographic boundaries. The Pantheon 1.0 dataset includes the 11,341 biographies present in more than 25 languages in Wikipedia and is enriched with: (i) manually verified demographic information (place and date of birth, gender); (ii) a taxonomy of occupations classifying each biography at three levels of aggregation; and (iii) two measures of global popularity: the number of languages in which a biography is present in Wikipedia (L), and the Historical Popularity Index (HPI), a metric that combines information on L, time since birth, and page-views (2008–2013). We compare the Pantheon 1.0 dataset to data from the 2003 book, Human Accomplishments, and also to external measures of accomplishment in individual games and sports: Tennis, Swimming, Car Racing, and Chess. In all of these cases we find that measures of popularity (L and HPI) correlate highly with individual accomplishment, suggesting that measures of global popularity proxy the historical impact of individuals. PMID:26731133

  8. Publishing datasets with eSciDoc and panMetaDocs

    NASA Astrophysics Data System (ADS)

    Ulbricht, D.; Klump, J.; Bertelmann, R.

    2012-04-01

    Currently, several research institutions worldwide are undertaking considerable efforts to have their scientific datasets published and to syndicate them to data portals as extensively described objects identified by a persistent identifier. This is done to foster the reuse of data, to make scientific work more transparent, and to create a citable entity that can be referenced unambiguously in written publications. GFZ Potsdam established a publishing workflow for file-based research datasets. Key software components are an eSciDoc infrastructure [1] and multiple instances of the data curation tool panMetaDocs [2]. The eSciDoc repository holds data objects and their associated metadata in container objects, called eSciDoc items. A key metadata element in this context is the publication status of the referenced dataset. PanMetaDocs, which is based on PanMetaWorks [3], is a PHP-based web application that allows data to be described with any XML-based metadata schema. The metadata fields can be filled with static or dynamic content to reduce the number of fields requiring manual entry to a minimum and to make use of contextual information in a project setting. Access rights can be applied to set the visibility of datasets to other project members, to allow collaboration on and notification about datasets (RSS), and to enable interaction with the internal messaging system inherited from panMetaWorks. When a dataset is to be published, panMetaDocs allows the publication status of the eSciDoc item to be changed from status "private" to "submitted" and the dataset to be prepared for verification by an external reviewer. After quality checks, the item publication status can be changed to "published". This makes the data and metadata available worldwide through the internet. PanMetaDocs is developed as an eSciDoc application. It is an easy-to-use graphical user interface to eSciDoc items, their data and metadata. It is also an application supporting a DOI publication agent during the process of

  9. Developing a new global network of river reaches from merged satellite-derived datasets

    NASA Astrophysics Data System (ADS)

    Lion, C.; Allen, G. H.; Beighley, E.; Pavelsky, T.

    2015-12-01

    In 2020, the Surface Water and Ocean Topography (SWOT) satellite, a joint mission of NASA/CNES/CSA/UK, will be launched. One of its major products will be measurements of continental water extent, including the width, height, and slope of rivers and the surface area and elevation of lakes. The mission will improve the monitoring of continental water and also our understanding of the interactions between different hydrologic reservoirs. For rivers, SWOT measurements of slope must be carried out over predefined river reaches. As such, an a priori dataset for rivers is needed in order to facilitate analysis of the raw SWOT data. The information required to produce this dataset includes measurements of river width, elevation, slope, planform, river network topology, and flow accumulation. To produce this product, we have linked two existing global datasets: the Global River Widths from Landsat (GRWL) database, which contains river centerline locations, widths, and a braiding index derived from Landsat imagery, and a modified version of the HydroSHEDS hydrologically corrected digital elevation product, which contains heights and flow accumulation measurements for streams at 3 arcsecond spatial resolution. Merging these two datasets requires considerable care. The difficulties, among others, lie in the difference in resolution (30 m versus 3 arcseconds) and the age of the datasets (2000 versus ~2010; some rivers have moved, and the braided sections are different). As such, we have developed custom software to merge the two datasets, taking into account the spatial proximity of river channels in the two datasets and ensuring that flow accumulation in the final dataset always increases downstream. Here, we present our preliminary results for a portion of South America and demonstrate the strengths and weaknesses of the method.
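
    The custom merge software is not described in detail, but the "flow accumulation always increases downstream" constraint along an ordered reach can be guaranteed with something as simple as a running maximum. This is a toy stand-in for that one check, not the authors' algorithm:

```python
def enforce_downstream_monotonic(flow_acc):
    """Walk a reach from upstream to downstream and force flow accumulation
    to be non-decreasing, patching small artifacts left by a dataset merge."""
    fixed = list(flow_acc)
    for i in range(1, len(fixed)):
        if fixed[i] < fixed[i - 1]:
            fixed[i] = fixed[i - 1]   # running maximum enforces the constraint
    return fixed

profile = [10.0, 12.0, 11.5, 15.0, 14.0]      # toy accumulation along a reach
mono = enforce_downstream_monotonic(profile)  # non-decreasing version
```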

  10. Evaluation and inter-comparison of modern day reanalysis datasets over Africa and the Middle East

    NASA Astrophysics Data System (ADS)

    Shukla, S.; Arsenault, K. R.; Hobbins, M.; Peters-Lidard, C. D.; Verdin, J. P.

    2015-12-01

    Reanalysis datasets are potentially very valuable for otherwise data-sparse regions such as Africa and the Middle East. They are potentially useful for long-term climate and hydrologic analyses and, given their availability in real time, they are particularly attractive for real-time hydrologic monitoring purposes (e.g. to monitor flood and drought events). Generally, in data-sparse regions, reanalysis variables such as precipitation, temperature, radiation and humidity are used in conjunction with in-situ and/or satellite-based datasets to generate long-term gridded atmospheric forcing datasets. These atmospheric forcing datasets are used to drive offline land surface models and simulate soil moisture and runoff, which are natural indicators of hydrologic conditions. Therefore, any uncertainty or bias in the reanalysis datasets contributes to uncertainties in hydrologic monitoring estimates. In this presentation, we report on a comprehensive analysis that evaluates several modern-day reanalysis products (such as NASA's MERRA-1 and -2, ECMWF's ERA-Interim and NCEP's CFS Reanalysis) over Africa and the Middle East region. We compare the precipitation and temperature from the reanalysis products with other independent gridded datasets such as the GPCC, CRU, and USGS/UCSB CHIRPS precipitation datasets, and the CRU temperature datasets. The evaluations are conducted at a monthly time scale, since some of these independent datasets are only available at this temporal resolution. The evaluations range from comparison of the monthly mean climatology to inter-annual variability and long-term changes. Finally, we also present the results of inter-comparisons of radiation and humidity variables from the different reanalysis datasets.

  11. MVIRI/SEVIRI TOA Radiation Datasets within the Climate Monitoring SAF

    NASA Astrophysics Data System (ADS)

    Urbain, Manon; Clerbaux, Nicolas; Ipe, Alessandro; Baudrez, Edward; Velazquez Blazquez, Almudena; Moreels, Johan

    2016-04-01

    Within CM SAF, Interim Climate Data Records (ICDR) of Top-Of-Atmosphere (TOA) radiation products from the Geostationary Earth Radiation Budget (GERB) instruments on the Meteosat Second Generation (MSG) satellites were released in 2013. These datasets (referred to as CM-113 and CM-115, for shortwave (SW) and longwave (LW) radiation respectively) are based on the instantaneous TOA fluxes from the GERB Edition-1 dataset. They cover the time period 2004-2011. Extending these datasets backward in the past is not possible, as no GERB instruments were available on the Meteosat First Generation (MFG) satellites. As an alternative, it is proposed to rely on the Meteosat Visible and InfraRed Imager (MVIRI - from 1982 until 2004) and the Spinning Enhanced Visible and Infrared Imager (SEVIRI - from 2004 onward) to generate a long Thematic Climate Data Record (TCDR) from Meteosat instruments. Combining MVIRI and SEVIRI allows an unprecedented temporal (30 minutes / 15 minutes) and spatial (2.5 km / 3 km) resolution compared to the Clouds and the Earth's Radiant Energy System (CERES) products. This is a step forward, as it helps to increase our knowledge of the diurnal cycle and the small-scale spatial variations of radiation. The MVIRI/SEVIRI datasets (referred to as CM-23311 and CM-23341, for SW and LW radiation respectively) will provide daily and monthly averaged TOA Reflected Solar (TRS) and Emitted Thermal (TET) radiation in "all-sky" conditions (no clear-sky conditions for this first version of the datasets), as well as monthly averages of the hourly integrated values. The SEVIRI Solar Channels Calibration (SSCC) and the operational calibration have been used for the SW and LW channels, respectively. For MFG, it is foreseen to replace the latter with the EUMETSAT/GSICS recalibration of MVIRI using HIRS. The CERES TRMM angular dependency models have been used to compute TRS fluxes, while theoretical models have been used for TET fluxes. The CM-23311 and CM-23341 datasets will cover a 32-year period.

  12. A multimodal MRI dataset of professional chess players.

    PubMed

    Li, Kaiming; Jiang, Jing; Qiu, Lihua; Yang, Xun; Huang, Xiaoqi; Lui, Su; Gong, Qiyong

    2015-01-01

    Chess is a good model for studying high-level human brain functions such as spatial cognition, memory, planning, learning and problem solving. Recent studies have demonstrated that non-invasive MRI techniques are valuable for investigating the neural mechanisms underlying chess play. For professional chess players (e.g., chess grandmasters and masters, or GM/Ms), the structural and functional alterations due to long-term professional practice, and how these alterations relate to behavior, remain largely unknown. Here, we report a multimodal MRI dataset from 29 professional Chinese chess players (most of whom are GM/Ms) and 29 age-matched novices. We hope that this dataset will provide researchers with new material to further explore high-level human brain functions.

  13. REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations

    NASA Astrophysics Data System (ADS)

    Moulik, P.; Lekic, V.; Romanowicz, B. A.

    2017-12-01

    A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. 
A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history.

  14. JingleBells: A Repository of Immune-Related Single-Cell RNA-Sequencing Datasets.

    PubMed

    Ner-Gaon, Hadas; Melchior, Ariel; Golan, Nili; Ben-Haim, Yael; Shay, Tal

    2017-05-01

    Recent advances in single-cell RNA-sequencing (scRNA-seq) technology increase the understanding of immune differentiation and activation processes, as well as the heterogeneity of immune cell types. Although the number of available immune-related scRNA-seq datasets is increasing rapidly, their large size and varied formats make them hard for the wider immunology community to use, and read-level data are practically inaccessible to the non-computational immunologist. To facilitate dataset reuse, we created the JingleBells repository for immune-related scRNA-seq datasets ready for analysis and visualization of reads at the single-cell level (http://jinglebells.bgu.ac.il/). To this end, we collected the raw data of publicly available immune-related scRNA-seq datasets, aligned the reads to the relevant genome, and saved aligned reads in a uniform format, annotated for cell of origin. We also added scripts and a step-by-step tutorial for visualizing each dataset at the single-cell level, through the commonly used Integrative Genomics Viewer (www.broadinstitute.org/igv/). The uniform scRNA-seq format used in JingleBells can facilitate reuse of scRNA-seq data by computational biologists. It also enables immunologists who are interested in a specific gene to visualize the reads aligned to this gene to estimate cell-specific preferences for splicing, mutation load, or alleles. Thus JingleBells is a resource that will extend the usefulness of scRNA-seq datasets outside the programming aficionado realm. Copyright © 2017 by The American Association of Immunologists, Inc.

  15. Real-world datasets for portfolio selection and solutions of some stochastic dominance portfolio models.

    PubMed

    Bruni, Renato; Cesarone, Francesco; Scozzari, Andrea; Tardella, Fabio

    2016-09-01

    A large number of portfolio selection models have appeared in the literature since the pioneering work of Markowitz. However, even when computational and empirical results are described, they are often hard to replicate and compare due to the unavailability of the datasets used in the experiments. We provide here several datasets for portfolio selection generated using real-world price values from several major stock markets. The datasets contain weekly return values, adjusted for dividends and for stock splits, which are cleaned of errors as much as possible. The datasets are available in different formats, and can be used as benchmarks for testing the performance of portfolio selection models and for comparing the efficiency of the algorithms used to solve them. We also provide, for these datasets, the portfolios obtained by several selection strategies based on Stochastic Dominance models (see "On Exact and Approximate Stochastic Dominance Strategies for Portfolio Selection" (Bruni et al. [2])). We believe that testing portfolio models on publicly available datasets greatly simplifies the comparison of the different portfolio selection strategies.
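
    Benchmarking a portfolio model on datasets of this kind ultimately means turning a matrix of weekly returns into a wealth trajectory for a candidate weight vector. A minimal sketch with made-up returns (the real datasets hold dividend- and split-adjusted weekly returns per asset):

```python
import numpy as np

# Toy weekly return matrix: rows = weeks, cols = assets.
R = np.array([
    [ 0.01, -0.02,  0.005],
    [ 0.02,  0.01, -0.01 ],
    [-0.01,  0.03,  0.00 ],
])

def cumulative_wealth(returns, weights):
    """Final wealth of a weekly-rebalanced portfolio with fixed weights,
    starting from 1 unit of capital."""
    port = returns @ weights          # weekly portfolio returns
    return float(np.prod(1.0 + port))

w_eq = np.full(3, 1 / 3)             # equal-weight benchmark
wealth = cumulative_wealth(R, w_eq)
```

    Two selection strategies can then be compared simply by evaluating `cumulative_wealth` (or a risk-adjusted variant) on the same return matrix with their respective weight vectors.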

  16. Selecting AGN through Variability in SN Datasets

    NASA Astrophysics Data System (ADS)

    Boutsia, K.; Leibundgut, B.; Trevese, D.; Vagnetti, F.

    2010-07-01

    Variability is a main property of Active Galactic Nuclei (AGN), and it was adopted as a selection criterion using multi-epoch surveys conducted for the detection of supernovae (SNe). We have used two SN datasets. First, we selected the AXAF field of the STRESS project, centered on the Chandra Deep Field South where, besides the deep X-ray surveys, various optical catalogs also exist. Our method yielded 132 variable AGN candidates. We then extended our method to include the dataset of the ESSENCE project, which was active for 6 years, producing high-quality light curves in the R and I bands. We obtained a sample of ~4800 variable sources, down to R=22, in the whole 12 deg² ESSENCE field. Among them, a subsample of ~500 high-priority AGN candidates was created using the shape of the structure function as a secondary criterion. In a pilot spectroscopic run we have confirmed the AGN nature of nearly all of our candidates.
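
    The structure function used as the secondary criterion measures rms variability as a function of time lag. A common first-order estimator, sketched here on synthetic epochs (binning and normalization conventions vary between studies; this is one simple choice):

```python
import numpy as np

def structure_function(t, mag, tau, dtau):
    """First-order structure function: rms magnitude difference over epoch
    pairs whose time lag falls within [tau - dtau/2, tau + dtau/2]."""
    dt = np.abs(t[:, None] - t[None, :])
    dm = mag[:, None] - mag[None, :]
    sel = np.triu(np.abs(dt - tau) <= dtau / 2, k=1)  # each pair once
    return float(np.sqrt(np.mean(dm[sel] ** 2))) if sel.any() else np.nan
```

    A light curve that is flat at short lags but rises at long lags has the AGN-like structure-function shape such selections look for.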

  17. Large Dataset of Acute Oral Toxicity Data Created for Testing ...

    EPA Pesticide Factsheets

    Acute toxicity data is a common requirement for substance registration in the US. Currently, only data derived from animal tests are accepted by regulatory agencies, and the standard in vivo tests use lethality as the endpoint. Non-animal alternatives such as in silico models are being developed due to animal welfare and resource considerations. We compiled a large dataset of oral rat LD50 values to assess the predictive performance of currently available in silico models. Our dataset combines LD50 values from five different sources: literature data provided by The Dow Chemical Company, REACH data from eChemPortal, HSDB (Hazardous Substances Data Bank), RTECS data from Leadscope, and the training set underpinning TEST (Toxicity Estimation Software Tool). Combined, these data sources yield 33,848 chemical-LD50 pairs (data points), with 23,475 unique data points covering 16,439 compounds. The entire dataset was loaded into a chemical properties database. All of the compounds were registered in DSSTox, and 59.5% have publicly available structures. Compounds without a structure in DSSTox are currently having their structures registered. The structural data will be used to evaluate the predictive performance and applicable chemical domains of three QSAR models (TIMES, PROTOX, and TEST). Future work will combine the dataset with information from ToxCast assays and, using random forest modeling, assess whether ToxCast assays are useful in predicting acute oral toxicity. Pre

  18. Selection-Fusion Approach for Classification of Datasets with Missing Values

    PubMed Central

    Ghannad-Rezaie, Mostafa; Soltanian-Zadeh, Hamid; Ying, Hao; Dong, Ming

    2010-01-01

    This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. This approach is particularly designed for classification of datasets with a small number of samples and a high percentage of missing values, where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset. Then, it combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing derivation of a mathematical framework for it. A trade-off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier. To deal with this trade-off, a numerical criterion is proposed for the prediction of the overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (total of eight datasets). Experimental results show that classification accuracy of the proposed method is superior to those of the widely used multiple imputation method and four other methods. They also show that the level of superiority depends on the pattern and percentage of missing values. PMID:20212921
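    The core idea, grouping samples by their missing-value pattern so that each group yields a complete sub-matrix for its own classifier, can be sketched as follows. The matrix is hypothetical, and the actual method additionally clusters the patterns and fuses the classifier outputs:

    ```python
    import numpy as np
    from collections import defaultdict

    def split_by_missing_pattern(X):
        """Group sample indices by their pattern of missing (NaN) features,
        so that a separate classifier can be trained on each group's
        complete sub-matrix of available features."""
        groups = defaultdict(list)
        for idx, row in enumerate(X):
            pattern = tuple(np.isnan(row))  # True where the feature is missing
            groups[pattern].append(idx)
        return dict(groups)

    # Hypothetical 4-sample, 3-feature matrix with two missingness patterns.
    X = np.array([[1.0, np.nan, 3.0],
                  [2.0, np.nan, 1.0],
                  [0.5, 2.0, np.nan],
                  [1.5, 1.0, np.nan]])
    groups = split_by_missing_pattern(X)
    ```

    Each value of `groups` indexes a subset of samples sharing the same available features; training one classifier per subset and combining their votes is the fusion step the abstract describes.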

  19. Dataset for Testing Contamination Source Identification Methods for Water Distribution Networks

    EPA Pesticide Factsheets

    This dataset includes the results of a simulation study using the source inversion techniques available in the Water Security Toolkit. The data was created to test the different techniques for accuracy, specificity, false positive rate, and false negative rate. The tests examined different parameters including measurement error, modeling error, injection characteristics, time horizon, network size, and sensor placement. The water distribution system network models that were used in the study are also included in the dataset. This dataset is associated with the following publication: Seth, A., K. Klise, J. Siirola, T. Haxton, and C. Laird. Testing Contamination Source Identification Methods for Water Distribution Networks. Journal of Environmental Division, Proceedings of the American Society of Civil Engineers. ASCE, Reston, VA, USA (2016).

  20. Dataset variability leverages white-matter lesion segmentation performance with convolutional neural network

    NASA Astrophysics Data System (ADS)

    Ravnik, Domen; Jerman, Tim; Pernuš, Franjo; Likar, Boštjan; Špiclin, Žiga

    2018-03-01

    Performance of convolutional neural network (CNN) based white-matter lesion segmentation in magnetic resonance (MR) brain images was evaluated under various conditions, involving different levels of image preprocessing and augmentation and different compositions of the training dataset. On images of sixty multiple sclerosis patients, half acquired on one scanner and half on another from a different vendor, we first created highly accurate multi-rater consensus-based lesion segmentations, which were used in several experiments to evaluate the CNN segmentation results. First, the CNN was trained and tested without preprocessing the images and with various combinations of preprocessing techniques, namely histogram-based intensity standardization, normalization by whitening, and training dataset augmentation by flipping the images across the midsagittal plane. Then, the CNN was trained and tested on images of the same, different or interleaved scanner datasets using a cross-validation approach. The results indicate that image preprocessing has little impact on performance in a same-scanner situation, while between-scanner performance benefits most from intensity standardization and normalization, and further from incorporating heterogeneous multi-scanner datasets in the training phase. Under such conditions the between-scanner performance of the CNN approaches that of the ideal situation, in which the CNN is trained and tested on the same scanner dataset.
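    Two of the preprocessing steps mentioned above, whitening normalization and midsagittal-flip augmentation, can be sketched in a few lines. The 2x2 array stands in for an MR slice, and treating the last axis as the left-right direction is an assumption:

    ```python
    import numpy as np

    def whiten(volume):
        """Normalize an MR image to zero mean and unit variance
        ('normalization by whitening')."""
        v = volume.astype(float)
        return (v - v.mean()) / v.std()

    def augment_midsagittal_flip(volume):
        """Double the training set by adding a copy flipped across the
        midsagittal plane (assumed here to be the last, left-right axis)."""
        return [volume, np.flip(volume, axis=-1)]

    # Hypothetical 2x2 'image' standing in for an MR slice.
    img = np.array([[1.0, 2.0],
                    [3.0, 4.0]])
    w = whiten(img)
    aug = augment_midsagittal_flip(img)
    ```

    Histogram-based intensity standardization, the third technique listed, additionally maps each scanner's intensity histogram onto a common reference, which is why it matters most in the between-scanner experiments.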

  1. Nanocubes for real-time exploration of spatiotemporal datasets.

    PubMed

    Lins, Lauro; Klosowski, James T; Scheidegger, Carlos

    2013-12-01

    Consider real-time exploration of large multidimensional spatiotemporal datasets with billions of entries, each defined by a location, a time, and other attributes. Are certain attributes correlated spatially or temporally? Are there trends or outliers in the data? Answering these questions requires aggregation over arbitrary regions of the domain and attributes of the data. Many relational databases implement the well-known data cube aggregation operation, which in a sense precomputes every possible aggregate query over the database. Data cubes are sometimes assumed to take a prohibitively large amount of space, and to consequently require disk storage. In contrast, we show how to construct a data cube that fits in a modern laptop's main memory, even for billions of entries; we call this data structure a nanocube. We present algorithms to compute and query a nanocube, and show how it can be used to generate well-known visual encodings such as heatmaps, histograms, and parallel coordinate plots. When compared to exact visualizations created by scanning an entire dataset, nanocube plots have bounded screen error across a variety of scales, thanks to a hierarchical structure in space and time. We demonstrate the effectiveness of our technique on a variety of real-world datasets, and present memory, timing, and network bandwidth measurements. We find that the timings for the queries in our examples are dominated by network and user-interaction latencies.
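    The data cube idea, precomputing an aggregate for every combination of dimension values including the "all" rollup, can be illustrated with a toy in-memory version. A nanocube additionally organizes these aggregates hierarchically in space and time; the event records below are hypothetical:

    ```python
    from collections import defaultdict
    from itertools import product

    def build_cube(records, dims):
        """Precompute a count for every combination of dimension values,
        where each dimension is either fixed to the record's value or
        rolled up to 'all' (represented by None)."""
        cube = defaultdict(int)
        for rec in records:
            choices = [(None, rec[d]) for d in dims]
            for combo in product(*choices):
                cube[combo] += 1
        return cube

    # Hypothetical event records with a location and an hour-of-day attribute.
    records = [{"loc": "NYC", "hour": 9},
               {"loc": "NYC", "hour": 10},
               {"loc": "LA",  "hour": 9}]
    cube = build_cube(records, ["loc", "hour"])
    # cube[(None, None)] is the grand total; cube[("NYC", None)] counts NYC events.
    ```

    Every aggregate query then becomes a dictionary lookup, which is what makes interactive heatmaps and histograms cheap; the nanocube's contribution is keeping this table small enough for main memory by sharing structure across keys.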

  2. Increasing consistency of disease biomarker prediction across datasets.

    PubMed

    Chikina, Maria D; Sealfon, Stuart C

    2014-01-01

    Microarray studies with human subjects often have limited sample sizes, which hampers the ability to detect reliable biomarkers associated with disease and motivates the need to aggregate data across studies. However, human gene expression measurements may be influenced by many non-random factors such as genetics, sample preparation, and tissue heterogeneity. These factors can contribute to a lack of agreement among related studies, limiting the utility of their aggregation. We show that it is feasible to carry out an automatic correction of individual datasets to reduce the effect of such 'latent variables' (without prior knowledge of the variables) in such a way that datasets addressing the same condition show better agreement once each is corrected. We build our approach on the method of surrogate variable analysis (SVA), but we demonstrate that the original algorithm is unsuitable for the analysis of human tissue samples that are mixtures of different cell types. We propose a modification to SVA that is crucial to obtaining the improvement in agreement that we observe. We develop our method on a compendium of multiple sclerosis data and verify it on an independent compendium of Parkinson's disease datasets. In both cases, we show that our method is able to improve agreement across varying study designs, platforms, and tissues. This approach has the potential for wide applicability to any field where lack of inter-study agreement has been a concern.
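    A much-simplified stand-in for this kind of latent-variable correction removes the projection of the centered expression matrix onto its top principal components. This is only an illustration of the general idea, not surrogate variable analysis itself nor the authors' modification of it, and the matrix is hypothetical:

    ```python
    import numpy as np

    def remove_top_components(X, k):
        """Crude latent-variable correction: subtract the projection of the
        column-centered data onto its top-k principal components.
        A simplified stand-in for SVA-style correction, not SVA proper."""
        Xc = X - X.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc - (U[:, :k] * s[:k]) @ Vt[:k]

    # Hypothetical samples-by-genes matrix; remove one dominant latent factor.
    X = np.random.default_rng(0).normal(size=(5, 4))
    corrected = remove_top_components(X, 1)
    ```

    SVA differs in that it estimates only the variation not explained by the known biological covariates, which is exactly why naive component removal can destroy signal in mixed-cell-type tissue samples, the failure mode the abstract highlights.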

  3. Global-scale evaluation of 22 precipitation datasets using gauge observations and hydrological modeling

    NASA Astrophysics Data System (ADS)

    Beck, H.; Vergopolan, N.; Pan, M.; Levizzani, V.; van Dijk, A.; Weedon, G. P.; Brocca, L.; Huffman, G. J.; Wood, E. F.; William, L.

    2017-12-01

    We undertook a comprehensive evaluation of 22 gridded (quasi-)global (sub-)daily precipitation (P) datasets for the period 2000-2016. Twelve non-gauge-corrected P datasets were evaluated using daily P gauge observations from 76,086 gauges worldwide. Another ten gauge-corrected ones were evaluated using hydrological modeling, by calibrating the conceptual model HBV against streamflow records for each of 9053 small to medium-sized (<50,000 km2) catchments worldwide, and comparing the resulting performance. Marked differences in spatio-temporal patterns and accuracy were found among the datasets. Among the uncorrected P datasets, the satellite- and reanalysis-based MSWEP-ng V1.2 and V2.0 datasets generally showed the best temporal correlations with the gauge observations, followed by the reanalyses (ERA-Interim, JRA-55, and NCEP-CFSR), the estimates based primarily on passive microwave remote sensing of rainfall (CMORPH V1.0, GSMaP V5/6, and TMPA 3B42RT V7) or near-surface soil moisture (SM2RAIN-ASCAT), and finally, estimates based primarily on thermal infrared imagery (GridSat V1.0, PERSIANN, and PERSIANN-CCS). Two of the three reanalyses (ERA-Interim and JRA-55) unexpectedly obtained lower trend errors than the satellite datasets. Among the corrected P datasets, the ones directly incorporating daily gauge data (CPC Unified and MSWEP V1.2 and V2.0) generally provided the best calibration scores, although the good performance of the fully gauge-based CPC Unified is unlikely to translate to sparsely gauged or ungauged regions. Next best results were obtained with P estimates directly incorporating temporally coarser gauge data (CHIRPS V2.0, GPCP-1DD V1.2, TMPA 3B42 V7, and WFDEI-CRU), which in turn outperformed those indirectly incorporating gauge data through other multi-source datasets (PERSIANN-CDR V1R1 and PGF). Our results highlight large differences in estimation accuracy, and hence, the importance of P dataset selection in both research and operational applications.
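    The per-gauge evaluation of the uncorrected datasets reduces to a temporal correlation between the gridded estimate and the co-located gauge record. A minimal sketch with hypothetical five-day series:

    ```python
    import numpy as np

    def temporal_correlation(estimates, gauge):
        """Pearson correlation between a gridded daily-P estimate and a
        co-located gauge series, the per-gauge metric used for the
        non-gauge-corrected datasets."""
        e = np.asarray(estimates, float)
        g = np.asarray(gauge, float)
        return np.corrcoef(e, g)[0, 1]

    # Hypothetical 5-day series: the estimate tracks the gauge imperfectly.
    r = temporal_correlation([0.0, 2.0, 1.0, 4.0, 3.0],
                             [0.0, 1.8, 1.2, 4.5, 2.9])
    ```

    Aggregating such per-gauge scores over tens of thousands of gauges, and contrasting them with HBV calibration scores for the corrected datasets, yields the dataset ranking the abstract reports.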

  4. A longitudinal dataset of five years of public activity in the Scratch online community.

    PubMed

    Hill, Benjamin Mako; Monroy-Hernández, Andrés

    2017-01-31

    Scratch is a programming environment and an online community where young people can create, share, learn, and communicate. In collaboration with the Scratch Team at MIT, we created a longitudinal dataset of public activity in the Scratch online community during its first five years (2007-2012). The dataset comprises 32 tables with information on more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and more. To help researchers understand this dataset, and to establish the validity of the data, we also include the source code of every version of the software that operated the website, as well as the software used to generate this dataset. We believe this is the largest and most comprehensive downloadable dataset of youth programming artifacts and communication.

  5. Systematic chemical-genetic and chemical-chemical interaction datasets for prediction of compound synergism

    PubMed Central

    Wildenhain, Jan; Spitzer, Michaela; Dolma, Sonam; Jarvik, Nick; White, Rachel; Roy, Marcia; Griffiths, Emma; Bellows, David S.; Wright, Gerard D.; Tyers, Mike

    2016-01-01

    The network structure of biological systems suggests that effective therapeutic intervention may require combinations of agents that act synergistically. However, a dearth of systematic chemical combination datasets has limited the development of predictive algorithms for chemical synergism. Here, we report two large datasets of linked chemical-genetic and chemical-chemical interactions in the budding yeast Saccharomyces cerevisiae. We screened 5,518 unique compounds against 242 diverse yeast gene deletion strains to generate an extended chemical-genetic matrix (CGM) of 492,126 chemical-gene interaction measurements. This CGM dataset contained 1,434 genotype-specific inhibitors, termed cryptagens. We selected 128 structurally diverse cryptagens and tested all pairwise combinations to generate a benchmark dataset of 8,128 pairwise chemical-chemical interaction tests for synergy prediction, termed the cryptagen matrix (CM). An accompanying database resource called ChemGRID was developed to enable analysis, visualisation and downloads of all data. The CGM and CM datasets will facilitate the benchmarking of computational approaches for synergy prediction, as well as chemical structure-activity relationship models for anti-fungal drug discovery. PMID:27874849

  6. Thesaurus Dataset of Educational Technology in Chinese

    ERIC Educational Resources Information Center

    Wu, Linjing; Liu, Qingtang; Zhao, Gang; Huang, Huan; Huang, Tao

    2015-01-01

    The thesaurus dataset of educational technology is a knowledge description of educational technology in Chinese. The aims of this thesaurus were to collect the subject terms in the domain of educational technology, facilitate the standardization of terminology and promote the communication between Chinese researchers and scholars from various…

  7. Statistical tests and identifiability conditions for pooling and analyzing multisite datasets

    PubMed Central

    Zhou, Hao Henry; Singh, Vikas; Johnson, Sterling C.; Wahba, Grace

    2018-01-01

    When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and the distribution of participants or observations between datasets, especially due to the distributional shifts in some predictors, may obfuscate real effects when datasets are combined. We present a rigorous statistical treatment of this problem and identify conditions where we can correct the distributional shift. We also provide an algorithm for the situation where the correction is identifiable. We analyze various properties of the framework for testing model fit, constructing confidence intervals, and evaluating consistency characteristics. Our technical development is motivated by Alzheimer’s disease (AD) studies, and we present empirical results showing that our framework enables harmonizing of protein biomarkers, even when the assays across sites differ. Our contribution may, in part, mitigate a bottleneck that researchers face in clinical research when pooling smaller sized datasets and may offer benefits when the subjects of interest are difficult to recruit or when resources prohibit large single-site studies. PMID:29386387

  8. Statistical tests and identifiability conditions for pooling and analyzing multisite datasets.

    PubMed

    Zhou, Hao Henry; Singh, Vikas; Johnson, Sterling C; Wahba, Grace

    2018-02-13

    When sample sizes are small, the ability to identify weak (but scientifically interesting) associations between a set of predictors and a response may be enhanced by pooling existing datasets. However, variations in acquisition methods and the distribution of participants or observations between datasets, especially due to the distributional shifts in some predictors, may obfuscate real effects when datasets are combined. We present a rigorous statistical treatment of this problem and identify conditions where we can correct the distributional shift. We also provide an algorithm for the situation where the correction is identifiable. We analyze various properties of the framework for testing model fit, constructing confidence intervals, and evaluating consistency characteristics. Our technical development is motivated by Alzheimer's disease (AD) studies, and we present empirical results showing that our framework enables harmonizing of protein biomarkers, even when the assays across sites differ. Our contribution may, in part, mitigate a bottleneck that researchers face in clinical research when pooling smaller sized datasets and may offer benefits when the subjects of interest are difficult to recruit or when resources prohibit large single-site studies.

  9. Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation

    PubMed Central

    Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B.

    2016-01-01

    Today, increasing attention is being paid to research into spike-based neural computation, both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing is now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarks and using digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware
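    Rate-based Poisson spike generation, the first of the encoding techniques listed above, can be sketched as follows. The time step and the intensity-to-rate mapping in the example are assumptions:

    ```python
    import numpy as np

    def poisson_spike_train(rate_hz, duration_s, dt=0.001, rng=None):
        """Rate-based Poisson spike generation: in each time step of width
        dt, emit a spike with probability rate*dt (valid for rate*dt << 1).
        Returns the spike times in seconds."""
        if rng is None:
            rng = np.random.default_rng(0)
        n_steps = int(round(duration_s / dt))
        spikes = rng.random(n_steps) < rate_hz * dt
        return np.nonzero(spikes)[0] * dt

    # Hypothetical pixel whose intensity is mapped to a 100 Hz firing rate.
    times = poisson_spike_train(rate_hz=100.0, duration_s=1.0)
    ```

    Applied pixel-by-pixel to an MNIST digit, with brighter pixels mapped to higher rates, this turns a static image into the kind of spike-train stimulus the dataset provides.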

  10. Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation.

    PubMed

    Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B

    2016-01-01

    Today, increasing attention is being paid to research into spike-based neural computation, both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing is now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarks and using digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware

  11. A collection of Australian Drosophila datasets on climate adaptation and species distributions.

    PubMed

    Hangartner, Sandra B; Hoffmann, Ary A; Smith, Ailie; Griffin, Philippa C

    2015-11-24

    The Australian Drosophila Ecology and Evolution Resource (ADEER) collates Australian datasets on drosophilid flies aimed at investigating questions around climate adaptation, species distribution limits and population genetics. Australian drosophilid species are diverse in climatic tolerance, geographic distribution and behaviour. Many species are restricted to the tropics, a few are temperate specialists, and some have broad distributions across climatic regions. Whereas some species adapt to climate change through genetic and plastic changes, other species have limited adaptive capacity. This knowledge has been used to identify traits and genetic polymorphisms involved in climate change adaptation and to build predictive models of responses to climate change. ADEER brings together 103 datasets from 39 studies published between 1982 and 2013 in a single online resource. All datasets can be downloaded freely in full, along with maps and other visualisations. These historical datasets are preserved for future studies, which will be especially useful for assessing climate-related changes over time.

  12. Exudate-based diabetic macular edema detection in fundus images using publicly available datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul

    2011-01-01

    Diabetic macular edema (DME) is a common vision-threatening complication of diabetic retinopathy. In a large-scale screening environment, DME can be assessed by detecting exudates (a type of bright lesion) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME through the presence of exudation. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing (e.g., the classifier was trained on an independent dataset and tested on MESSIDOR). Our algorithm obtained an AUC between 0.88 and 0.94 depending on the dataset/features used. Additionally, it does not need ground truth at lesion level to reject false positives and is computationally efficient, as it generates a diagnosis in an average of 4.4 s (9.3 s including optic nerve localization) per image on a 2.6 GHz platform with an unoptimized Matlab implementation.

  13. Boosting association rule mining in large datasets via Gibbs sampling.

    PubMed

    Qian, Guoqi; Rao, Calyampudi Radhakrishna; Sun, Xiaoying; Wu, Yuehua

    2016-05-03

    Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and perform rule mining on the reduced transaction dataset generated by the sample. A general rule importance measure is also proposed to direct the stochastic search so that, because the randomly generated association rules constitute an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In a simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.
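    The quantities at the heart of association rule mining, the support and confidence of a randomly drawn rule, can be sketched as below. The uniform sampling here is only illustrative; the paper's Gibbs sampler instead biases the draws toward high-importance rules, and the transactions are hypothetical:

    ```python
    import random

    def support(transactions, itemset):
        """Fraction of transactions containing every item in itemset."""
        itemset = set(itemset)
        hits = sum(1 for t in transactions if itemset <= t)
        return hits / len(transactions)

    def sample_rule(transactions, items, rng):
        """Draw a random rule A -> b with a one-item consequent from the
        itemset space (uniform here, unlike the importance-directed
        Gibbs sampler) and compute its confidence."""
        b = rng.choice(items)
        pool = [i for i in items if i != b]
        antecedent = rng.sample(pool, k=rng.randint(1, 2))
        denom = support(transactions, antecedent)
        conf = support(transactions, antecedent + [b]) / denom if denom else 0.0
        return antecedent, b, conf

    # Hypothetical transaction database over three items.
    transactions = [{"milk", "bread"}, {"milk", "eggs"},
                    {"milk", "bread", "eggs"}, {"bread"}]
    rng = random.Random(0)
    ant, cons, conf = sample_rule(transactions, ["milk", "bread", "eggs"], rng)
    ```

    Repeating such draws and keeping the transactions that match the sampled rules produces a reduced dataset on which an exact method like Apriori becomes tractable, which is the integration the abstract describes.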

  14. A conceptual prototype for the next-generation national elevation dataset

    USGS Publications Warehouse

    Stoker, Jason M.; Heidemann, Hans Karl; Evans, Gayla A.; Greenlee, Susan K.

    2013-01-01

    In 2012 the U.S. Geological Survey's (USGS) National Geospatial Program (NGP) funded a study to develop a conceptual prototype for a new National Elevation Dataset (NED) design with expanded capabilities to generate and deliver a suite of bare earth and above ground feature information over the United States. This report details the research on identifying operational requirements based on prior research, evaluation of what is needed for the USGS to meet these requirements, and development of a possible conceptual framework that could potentially deliver the kinds of information that are needed to support NGP's partners and constituents. This report provides an initial proof-of-concept demonstration using an existing dataset, and recommendations for the future, to inform NGP's ongoing and future elevation program planning and management decisions. The demonstration shows that this type of functional process can robustly create derivatives from lidar point cloud data; however, more research needs to be done to see how well it extends to multiple datasets.

  15. Kernel-based discriminant feature extraction using a representative dataset

    NASA Astrophysics Data System (ADS)

    Li, Honglin; Sancho Gomez, Jose-Luis; Ahalt, Stanley C.

    2002-07-01

    Discriminant Feature Extraction (DFE) is widely recognized as an important pre-processing step in classification applications. Most DFE algorithms are linear and thus can only explore the linear discriminant information among the different classes. Recently, there have been several promising attempts to develop nonlinear DFE algorithms, among which is Kernel-based Feature Extraction (KFE). The efficacy of KFE has been experimentally verified on both synthetic data and real problems. However, KFE has some known limitations. First, KFE does not work well for strongly overlapped data. Second, KFE employs all of the training set samples during the feature extraction phase, which can result in significant computation when applied to very large datasets. Finally, KFE can result in overfitting. In this paper, we propose a substantial improvement to KFE that overcomes the above limitations by using a representative dataset, which consists of critical points that are generated from data-editing techniques and centroid points that are determined by using the Frequency Sensitive Competitive Learning (FSCL) algorithm. Experiments show that this new KFE algorithm performs well on significantly overlapped datasets, and it also reduces computational complexity. Further, by controlling the number of centroids, the overfitting problem can be effectively alleviated.
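    The role of the representative dataset can be illustrated by mapping a sample to its kernel similarities against a small set of representative points rather than the whole training set. The RBF kernel choice and the two centroids below are assumptions, not the paper's critical-point or FSCL machinery:

    ```python
    import numpy as np

    def rbf_kernel(x, y, gamma=1.0):
        """Gaussian (RBF) kernel between two vectors."""
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def kernel_features(x, representatives, gamma=1.0):
        """Nonlinear feature vector for x: its kernel similarity to each
        point of a small representative set, so the cost scales with the
        number of representatives instead of the full training set size."""
        return np.array([rbf_kernel(x, r, gamma) for r in representatives])

    # Hypothetical representative set of two 2-D centroids.
    reps = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
    phi = kernel_features(np.array([0.0, 0.0]), reps)
    ```

    Shrinking the representative set both cuts the feature-extraction cost and, as the abstract notes for the number of centroids, acts as a capacity control against overfitting.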

  16. A Comparative Analysis of Five Cropland Datasets in Africa

    NASA Astrophysics Data System (ADS)

    Wei, Y.; Lu, M.; Wu, W.

    2018-04-01

    Food security, particularly in Africa, is a challenge still to be resolved, and the cropland area and spatial distribution obtained from remote sensing imagery are vital information. In this paper we compare five global cropland datasets (CCI Land Cover, GlobCover, MODIS Collection 5, GlobeLand30 and Unified Cropland) for Africa circa 2010 in terms of cropland area and spatial location. The accuracy of the cropland area calculated from the five datasets was analyzed against statistical data. Based on validation samples, the spatial-location accuracy of the five cropland products was assessed with an error matrix. The results show that GlobeLand30 fits the statistics best, followed by MODIS Collection 5 and Unified Cropland; GlobCover and CCI Land Cover have lower accuracies. For the spatial location of cropland, GlobeLand30 reaches the highest accuracy, followed by Unified Cropland, MODIS Collection 5 and GlobCover; CCI Land Cover has the lowest accuracy. The spatial-location accuracy of the five datasets in the Csa climate zone, with its suitable farming conditions, is generally higher than in the Bsk zone.
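    The error-matrix assessment used here can be sketched as follows: rows hold the reference class of each validation sample, columns the mapped class, and overall accuracy is the diagonal sum over the total. The two-class labels and sample values are hypothetical:

    ```python
    import numpy as np

    def error_matrix(reference, mapped, n_classes):
        """Confusion ('error') matrix: rows = reference class from the
        validation samples, columns = class mapped by the dataset."""
        m = np.zeros((n_classes, n_classes), dtype=int)
        for r, p in zip(reference, mapped):
            m[r, p] += 1
        return m

    def overall_accuracy(m):
        """Fraction of validation samples whose mapped class agrees with
        the reference class (trace over total)."""
        return np.trace(m) / m.sum()

    # Hypothetical validation samples: 0 = cropland, 1 = non-cropland.
    ref    = [0, 0, 1, 1, 0, 1]
    mapped = [0, 1, 1, 1, 0, 0]
    m = error_matrix(ref, mapped, 2)
    acc = overall_accuracy(m)
    ```

    Computing this per product from the same validation samples is what allows the ranking of the five datasets by spatial-location accuracy reported above; producer's and user's accuracies come from the same matrix's row and column sums.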

  17. Topographical effects of climate dataset and their impacts on the estimation of regional net primary productivity

    NASA Astrophysics Data System (ADS)

    Sun, L. Qing; Feng, Feng X.

    2014-11-01

    In this study, we first built and compared two different climate datasets for the Wuling mountainous area in 2010: one that considered topographical effects during ANUSPLIN interpolation, referred to as the terrain-based climate dataset, and one that did not, called the ordinary climate dataset. Then, we quantified the topographical effects of the climatic inputs on NPP estimation by feeding the two datasets to the same ecosystem model, the Boreal Ecosystem Productivity Simulator (BEPS), to evaluate the importance of considering relief when estimating NPP. Finally, we identified the primary variables contributing to the topographical effects through a series of experiments, given an overall accuracy of the model output for NPP. The results showed that: (1) the terrain-based climate dataset presented more reliable topographic information and agreed more closely with the station dataset than the ordinary climate dataset across the continuous 365-day series of daily mean values; (2) on average, the ordinary climate dataset underestimated NPP by 12.5% compared with the terrain-based climate dataset over the whole study area; (3) the primary climate variables contributing to the topographical effects for the Wuling mountainous area were temperatures, which suggests that temperature differences must be corrected to estimate NPP accurately in such complex terrain.

  18. Assessment of NASA's Physiographic and Meteorological Datasets as Input to HSPF and SWAT Hydrological Models

    NASA Technical Reports Server (NTRS)

    Alacron, Vladimir J.; Nigro, Joseph D.; McAnally, William H.; OHara, Charles G.; Engman, Edwin Ted; Toll, David

    2011-01-01

    This paper documents the use of simulated Moderate Resolution Imaging Spectroradiometer land use/land cover (MODIS-LULC), NASA-LIS-generated precipitation and evapotranspiration (ET), and Shuttle Radar Topography Mission (SRTM) datasets (in conjunction with standard land use, topographical and meteorological datasets) as input to hydrological models routinely used by the watershed hydrology modeling community. The study focuses on coastal watersheds of the Mississippi Gulf Coast, although one of the test cases concerns an inland watershed in northeastern Mississippi, USA. The decision support tools (DSTs) into which the NASA datasets were assimilated were the Soil and Water Assessment Tool (SWAT) and the Hydrological Simulation Program-FORTRAN (HSPF). These DSTs are endorsed by several US government agencies (EPA, FEMA, USGS) for water resources management strategies, and they use physiographic and meteorological data extensively. Precipitation gages and USGS gage stations in the region were used to calibrate several HSPF and SWAT model applications. Land use and topographical datasets were swapped to assess model output sensitivities. NASA-LIS meteorological data were introduced into the calibrated model applications to simulate watershed hydrology for a period in which no weather data were available (1997-2006). The performance of the NASA datasets in the context of hydrological modeling was assessed through comparison of measured and model-simulated hydrographs. Overall, the NASA datasets were as useful as standard land use, topographical, and meteorological datasets. Moreover, the NASA datasets enabled analyses that the standard datasets could not, e.g., the introduction of land use dynamics into hydrological simulations.

  19. Toward a complete dataset of drug-drug interaction information from publicly available sources.

    PubMed

    Ayvaz, Serkan; Horn, John; Hassanzadeh, Oktie; Zhu, Qian; Stan, Johann; Tatonetti, Nicholas P; Vilar, Santiago; Brochhausen, Mathias; Samwald, Matthias; Rastegar-Mojarad, Majid; Dumontier, Michel; Boyce, Richard D

    2015-06-01

    Although potential drug-drug interactions (PDDIs) are a significant source of preventable drug-related harm, there is currently no single complete source of PDDI information. In the current study, all publicly available sources of PDDI information that could be identified using a comprehensive and broad search were combined into a single dataset. The combined dataset merged fourteen different sources, including 5 clinically-oriented information sources, 4 Natural Language Processing (NLP) corpora, and 5 Bioinformatics/Pharmacovigilance information sources. As a comprehensive PDDI source, the merged dataset might benefit the pharmacovigilance text mining community by making it possible to compare the representativeness of NLP corpora for PDDI text extraction tasks, and by specifying elements that can be useful for future PDDI extraction purposes. An analysis of the overlap between and across the data sources showed that there was little overlap: even comprehensive PDDI lists such as DrugBank, KEGG, and the NDF-RT had less than 50% overlap with each other. Moreover, all of the comprehensive lists had incomplete coverage of two data sources that focus on PDDIs of interest in most clinical settings. Based on this information, we think that systems that provide access to the comprehensive lists, such as APIs into RxNorm, should be careful to inform users that the lists may be incomplete with respect to PDDIs that drug experts suggest clinicians be aware of. In spite of the low degree of overlap, several dozen cases were identified where PDDI information provided in drug product labeling might be augmented by the merged dataset. Moreover, the combined dataset was also shown to improve the performance of an existing PDDI NLP pipeline and a recently published PDDI pharmacovigilance protocol. Future work will focus on improving the methods for mapping between PDDI information sources and on identifying methods to improve the use of the merged dataset in PDDI NLP algorithms.
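    The overlap analysis between PDDI lists can be sketched as follows. The source names and drug pairs here are hypothetical, but the key step is real: each interaction is an unordered drug pair, so (A, B) and (B, A) must be treated as the same entry before comparing sets:

```python
# Pairwise overlap between hypothetical PDDI source lists, where each PDDI
# is an unordered drug pair. Source contents are illustrative only.
def normalize(pair):
    """Treat (A, B) and (B, A) as the same interaction."""
    return tuple(sorted(pair))

def overlap(a, b):
    """Fraction of the smaller list that is also present in the other list."""
    sa, sb = {normalize(p) for p in a}, {normalize(p) for p in b}
    return len(sa & sb) / min(len(sa), len(sb))

source_x = [("warfarin", "aspirin"), ("simvastatin", "clarithromycin"),
            ("digoxin", "amiodarone")]
source_y = [("aspirin", "warfarin"), ("metformin", "cimetidine")]

print(overlap(source_x, source_y))  # 1 shared pair / min(3, 2) = 0.5
```

    In practice the hard part is the mapping step: drug identifiers from different vocabularies must first be normalized (e.g., to a common terminology) before this set arithmetic is meaningful.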

  20. SDCLIREF - A sub-daily gridded reference dataset

    NASA Astrophysics Data System (ADS)

    Wood, Raul R.; Willkofer, Florian; Schmid, Franz-Josef; Trentini, Fabian; Komischke, Holger; Ludwig, Ralf

    2017-04-01

    Climate change is expected to impact the intensity and frequency of hydrometeorological extreme events. In order to adequately capture and analyze extreme rainfall events, in particular when assessing flood and flash flood situations, data are required at high spatial and sub-daily resolution, which are often not available in sufficient density and over extended time periods. The ClimEx project (Climate Change and Hydrological Extreme Events) addresses the alteration of hydrological extreme events under climate change conditions. In order to differentiate between a clear climate change signal and the limits of natural variability, unique single-model regional climate model ensembles (CRCM5 driven by CanESM2, RCP8.5) were created for a European and a North-American domain, each comprising 50 members of 150 years (1951-2100). In combination with the CORDEX database, this newly created ClimEx ensemble is a one-of-a-kind model dataset for analyzing changes of sub-daily extreme events. For the purpose of bias-correcting the regional climate model ensembles, as well as for the baseline calibration and validation of hydrological catchment models, a new sub-daily (3h), high-resolution (500m) gridded reference dataset (SDCLIREF) was created for a domain covering the Upper Danube and Main watersheds (ca. 100,000 km²). As the sub-daily observations lack a continuous time series for the reference period 1980-2010, the need arose for a suitable method to bridge the gaps in the discontinuous time series. The Method of Fragments (Sharma and Srikanthan (2006); Westra et al. (2012)) was applied to transform daily observations into sub-daily rainfall events, extending the time series and densifying the station network. Prior to applying the Method of Fragments and creating the gridded dataset using rigorous interpolation routines, observations from networks operated by several institutions in three countries (Germany, Austria, Switzerland) were collected and quality controlled.
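    The core idea of the Method of Fragments can be sketched as follows. This is a deliberately reduced toy (the published method selects donor days by season and station similarity, often resampling among several candidates); here a single donor day is chosen by closest daily total, and its intra-day pattern is rescaled. All values are hypothetical:

```python
def fragments(subdaily):
    """Turn one day's sub-daily record into fractions that sum to 1."""
    total = sum(subdaily)
    return [v / total for v in subdaily]

def disaggregate(daily_total, donor_days):
    """Method-of-Fragments sketch: borrow the intra-day pattern of the
    donor day whose daily total is closest to the target, then rescale
    the fragments by the target daily total."""
    best = min(donor_days, key=lambda day: abs(sum(day) - daily_total))
    return [f * daily_total for f in fragments(best)]

# Hypothetical 3-hourly donor records (8 values per day, in mm).
donors = [
    [0, 0, 2, 6, 4, 0, 0, 0],   # a 12 mm day with an afternoon peak
    [1, 1, 1, 1, 1, 1, 1, 1],   # a uniform 8 mm day
]
subdaily = disaggregate(11.0, donors)
print(subdaily)  # rescaled to the 11 mm daily total, afternoon peak preserved
```

    The rescaling guarantees that the disaggregated values always reproduce the observed daily total exactly, which is what makes the method attractive for gap-filling before gridding.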

  1. The distance function effect on k-nearest neighbor classification for medical datasets.

    PubMed

    Hu, Li-Yu; Huang, Min-Wei; Ke, Shih-Wen; Tsai, Chih-Fong

    2016-01-01

    K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier that has been used as the baseline classifier in many pattern classification problems. It is based on measuring the distances between the test data and each of the training data to decide the final classification output. Although the Euclidean distance function is the most widely used distance metric in k-NN, no study has examined the classification performance of k-NN with different distance functions, especially for various medical domain problems. Therefore, the aim of this paper is to investigate whether the choice of distance function affects k-NN performance over different medical datasets. Our experiments are based on three different types of medical datasets (containing categorical, numerical, and mixed types of data), with four distance functions (Euclidean, cosine, Chi square, and Minkowsky) used individually during k-NN classification. The experimental results show that the Chi square distance function is the best choice for all three types of datasets, whereas the cosine and Euclidean (and Minkowsky) distance functions perform worst on the mixed-type datasets. In this paper, we demonstrate that the chosen distance function can affect the classification accuracy of the k-NN classifier. For medical domain datasets of categorical, numerical, and mixed data types, k-NN based on the Chi square distance function performs best.
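    The experimental setup, k-NN with a pluggable distance function, can be sketched in a few lines. The training points and labels below are hypothetical; the distance definitions (Euclidean, chi-square, cosine) are the standard ones, with chi-square assuming non-negative features:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chi_square(a, b):
    # Chi-square distance; assumes non-negative features (e.g. counts).
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

def knn_predict(train, query, k=3, dist=euclidean):
    """Majority vote among the k training points closest to the query."""
    neighbours = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature medical records with class labels.
train = [((1.0, 1.0), "healthy"), ((1.2, 0.9), "healthy"),
         ((4.0, 4.2), "sick"), ((4.1, 3.9), "sick"), ((3.8, 4.0), "sick")]

print(knn_predict(train, (1.1, 1.0), k=3, dist=euclidean))   # healthy
print(knn_predict(train, (4.0, 4.0), k=3, dist=chi_square))  # sick
```

    Swapping `dist=` is the only change needed to rerun the same experiment with a different metric, which mirrors the comparison design of the paper.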

  2. Ecologia de las lombrices de tierra [Ecology of earthworms]

    Treesearch

    Grizelle Gonzalez

    2014-01-01

    Of the soil organisms, earthworms are the best known and are often considered the most important because of their influence on the functioning of soil ecosystems (Hendrix and Bohlen, 2002). They have a significant effect on soil structure, nutrient cycling, and crop productivity. In terms of biomass, generally...

  3. A large spectroscopic sample of L and T dwarfs from UKIDSS LAS: peculiar objects, binaries, and space density

    NASA Astrophysics Data System (ADS)

    Marocco, F.; Jones, H. R. A.; Day-Jones, A. C.; Pinfield, D. J.; Lucas, P. W.; Burningham, B.; Zhang, Z. H.; Smart, R. L.; Gomes, J. I.; Smith, L.

    2015-06-01

    We present the spectroscopic analysis of a large sample of late-M, L, and T dwarfs from the UKIRT Infrared Deep Sky Survey. Using the YJHK photometry from the Large Area Survey and the red-optical photometry from the Sloan Digital Sky Survey, we selected a sample of 262 brown dwarf candidates and followed up 196 of them using the echelle spectrograph X-shooter on the Very Large Telescope. The large wavelength coverage (0.30-2.48 μm) and moderate resolution (R ˜ 5000-9000) of X-shooter allowed us to identify peculiar objects, including 22 blue L dwarfs, 2 blue T dwarfs, and 2 low-gravity M dwarfs. Using a spectral-indices-based technique, we identified 27 unresolved binary candidates, for which we determined the spectral types of the potential components via spectral deconvolution. The spectra allowed us to measure the equivalent widths of the prominent absorption features and to compare them to atmospheric models. Cross-correlating the spectra with a radial velocity standard, we measured the radial velocities of our targets and determined the radial velocity distribution of the sample, which is centred at -1.7 ± 1.2 km s⁻¹ with a dispersion of 31.5 km s⁻¹. Using our results, we estimated the space density of field brown dwarfs and compared it with the results of numerical simulations. Depending on the binary fraction, we found that there are (0.85 ± 0.55) × 10⁻³ to (1.00 ± 0.64) × 10⁻³ objects per cubic parsec in the L4-L6.5 range, (0.73 ± 0.47) × 10⁻³ to (0.85 ± 0.55) × 10⁻³ objects per cubic parsec in the L7-T0.5 range, and (0.74 ± 0.48) × 10⁻³ to (0.88 ± 0.56) × 10⁻³ objects per cubic parsec in the T1-T4.5 range. We notice that there seems to be an excess of objects in the L-T transition with respect to the late-T dwarfs, a discrepancy that could be explained by assuming a higher binary fraction than expected for the L-T transition, or that objects at the high-mass and low-mass ends of this regime form in different environments, i.e. following different initial mass functions.
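    The cross-correlation radial-velocity measurement mentioned above can be illustrated with a toy example. On a uniform log-wavelength grid, a shift of `lag` pixels of size `dlnlam` corresponds to a velocity of roughly c · lag · dlnlam; the spectra, pixel size, and lag range below are all hypothetical:

```python
def cross_correlate_lag(template, spectrum, max_lag=5):
    """Find the pixel shift that maximizes the cross-correlation of two
    spectra sampled on the same uniform log-wavelength grid."""
    def score(lag):
        pairs = [(template[i], spectrum[i + lag])
                 for i in range(len(template))
                 if 0 <= i + lag < len(spectrum)]
        return sum(t * s for t, s in pairs) / len(pairs)
    return max(range(-max_lag, max_lag + 1), key=score)

C_KM_S = 299792.458   # speed of light in km/s
dlnlam = 1e-5         # hypothetical pixel size in ln(lambda)

template = [0, 0, 1, 5, 1, 0, 0, 0, 0, 0]   # toy spectral line
spectrum = [0, 0, 0, 0, 1, 5, 1, 0, 0, 0]   # same line shifted by +2 pixels

lag = cross_correlate_lag(template, spectrum)
rv = C_KM_S * lag * dlnlam
print(lag, rv)  # lag of 2 pixels, i.e. roughly 6 km/s
```

    A real pipeline would of course fit the correlation peak at sub-pixel precision and correct to the barycentric frame, but the peak-finding step is the same.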

  4. Quality Controlling CMIP datasets at GFDL

    NASA Astrophysics Data System (ADS)

    Horowitz, L. W.; Radhakrishnan, A.; Balaji, V.; Adcroft, A.; Krasting, J. P.; Nikonov, S.; Mason, E. E.; Schweitzer, R.; Nadeau, D.

    2017-12-01

    As GFDL makes the switch from model development to production in light of the Coupled Model Intercomparison Project (CMIP), GFDL's efforts have shifted to testing and, more importantly, to establishing guidelines and protocols for quality control and semi-automated data publishing. Every CMIP cycle introduces key challenges, and the upcoming CMIP6 is no exception. The new CMIP experimental design comprises multiple MIPs facilitating research in different focus areas. This paradigm has implications not only for the groups that develop the models and conduct the runs, but also for the groups that monitor, analyze and quality control the datasets before publication, before the data make their way into reports like the IPCC (Intergovernmental Panel on Climate Change) Assessment Reports. In this talk, we discuss some of the paths taken at GFDL to quality control the CMIP-ready datasets, including Jupyter notebooks, PrePARE, and a LAMP (Linux, Apache, MySQL, PHP/Python/Perl) technology-driven tracker system to monitor the status of experiments qualitatively and quantitatively, providing additional metadata and analysis services along with some built-in controlled-vocabulary validations in the workflow. In addition, we discuss the integration of community-based model evaluation software (ESMValTool, PCMDI Metrics Package, and ILAMB) into our CMIP6 workflow.
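    A controlled-vocabulary validation of the kind built into such a QC workflow can be sketched as follows. The required attributes and allowed values here are illustrative only, not the official CMIP6 controlled vocabularies (which tools like PrePARE check in full):

```python
# Minimal controlled-vocabulary check; the attribute names and allowed
# values are illustrative, not the official CMIP6 CV.
REQUIRED = {
    "experiment_id": {"historical", "piControl", "ssp585"},
    "frequency": {"mon", "day", "3hr"},
    "grid_label": {"gn", "gr"},
}

def validate(attrs):
    """Return a list of human-readable problems; an empty list means pass."""
    problems = []
    for key, allowed in REQUIRED.items():
        if key not in attrs:
            problems.append(f"missing attribute: {key}")
        elif attrs[key] not in allowed:
            problems.append(f"{key}={attrs[key]!r} not in controlled vocabulary")
    return problems

good = {"experiment_id": "historical", "frequency": "mon", "grid_label": "gn"}
bad = {"experiment_id": "historical", "frequency": "monthly"}
print(validate(good))   # passes: []
print(validate(bad))    # two problems: bad frequency, missing grid_label
```

    Returning a list of problems rather than raising on the first failure is what lets a tracker system report dataset status qualitatively and quantitatively in one pass.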

  5. BABAR: an R package to simplify the normalisation of common reference design microarray-based transcriptomic datasets

    PubMed Central

    2010-01-01

    Background The development of DNA microarrays has facilitated the generation of hundreds of thousands of transcriptomic datasets. The use of a common reference microarray design allows existing transcriptomic data to be readily compared and re-analysed in the light of new data, and the combination of this design with large datasets is ideal for 'systems'-level analyses. One issue is that these datasets are typically collected over many years and may be heterogeneous in nature, containing different microarray file formats and gene array layouts, dye-swaps, and varying scales of log2-ratios of expression between microarrays. Excellent software exists for the normalisation and analysis of microarray data, but many data have yet to be analysed because existing methods struggle with heterogeneous datasets; options include normalising microarrays on an individual or experimental-group basis. Our solution was to develop the Batch Anti-Banana Algorithm in R (BABAR) algorithm and software package, which uses cyclic loess to normalise across the complete dataset. We have already used BABAR to analyse the function of Salmonella genes involved in the process of infection of mammalian cells. Results The only input required by BABAR is unprocessed GenePix or BlueFuse microarray data files. BABAR provides a combination of 'within' and 'between' microarray normalisation steps and diagnostic boxplots. When applied to a real heterogeneous dataset, BABAR normalised the dataset to produce a comparable scaling between the microarrays, with the microarray data in excellent agreement with RT-PCR analysis. When applied to a real non-heterogeneous dataset and a simulated dataset, BABAR's performance in identifying differentially expressed genes showed some benefits over standard techniques. Conclusions BABAR is an easy-to-use software tool, simplifying the simultaneous normalisation of heterogeneous two-colour common reference design cDNA microarray-based transcriptomic datasets. We show
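    The cyclic structure of "between-array" normalisation can be sketched with a greatly simplified stand-in: real cyclic loess subtracts an intensity-dependent loess fit of the log-ratio M for every array pair, whereas this toy splits only the median log-ratio between each pair. The arrays below are hypothetical log2 values; this is not BABAR's implementation:

```python
from itertools import combinations
from statistics import median

def cyclic_median_normalize(arrays, passes=3):
    """Simplified stand-in for cyclic loess: for every pair of arrays,
    split the median log-ratio evenly between the two, and cycle over
    all pairs several times so the arrays converge to a common centre."""
    data = [list(a) for a in arrays]
    for _ in range(passes):
        for i, j in combinations(range(len(data)), 2):
            m = median(x - y for x, y in zip(data[i], data[j]))
            data[i] = [x - m / 2 for x in data[i]]
            data[j] = [y + m / 2 for y in data[j]]
    return data

# Three hypothetical log2-expression arrays with different overall scales.
arrays = [[8.0, 9.0, 10.0], [9.0, 10.0, 11.0], [10.0, 11.0, 12.0]]
normalized = cyclic_median_normalize(arrays)
print([round(median(a), 3) for a in normalized])  # medians converge together
```

    Replacing the per-pair median with a loess fit of M against average intensity A recovers the intensity-dependent correction that cyclic loess actually performs.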

  6. MicroRNA array normalization: an evaluation using a randomized dataset as the benchmark.

    PubMed

    Qin, Li-Xuan; Zhou, Qin

    2014-01-01

    MicroRNA arrays possess a number of unique data features that challenge the assumptions key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data, but the false discovery rate remained as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, resulting in a false discovery rate of 32% to 48%, depending on the specific normalization method. We conclude with some insights into possible causes of false discoveries, to shed light on how to improve normalization for microRNA arrays.
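    The benchmark comparison reduces to simple set arithmetic once the differential-expression calls are in hand. The microRNA names below are hypothetical; the metrics are the usual definitions of true-positive rate and false discovery rate against the randomized-design calls treated as truth:

```python
def confront_with_benchmark(benchmark_hits, test_hits):
    """True-positive rate and false discovery rate of the test calls,
    treating the randomized-design dataset's calls as ground truth."""
    bench, test = set(benchmark_hits), set(test_hits)
    tp = len(bench & test)
    tpr = tp / len(bench)
    fdr = (len(test) - tp) / len(test)
    return tpr, fdr

# Hypothetical differentially expressed microRNA calls.
benchmark = ["miR-21", "miR-155", "miR-10b", "miR-34a"]
non_randomized = ["miR-21", "miR-155", "miR-99", "miR-7"]

tpr, fdr = confront_with_benchmark(benchmark, non_randomized)
print(tpr, fdr)  # 0.5 0.5
```

    Rerunning this confrontation after each candidate normalization (with or without a batch-adjustment step) is exactly how one obtains per-method TPR/FDR figures like those quoted above.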

  7. Sinfonevada: Dataset of Floristic diversity in Sierra Nevada forests (SE Spain)

    PubMed Central

    Pérez-Luque, Antonio Jesús; Bonet, Francisco Javier; Pérez-Pérez, Ramón; Rut Aspizua; Lorite, Juan; Zamora, Regino

    2014-01-01

    The Sinfonevada database is a forest inventory that contains information on the forest ecosystems of the Sierra Nevada mountains (SE Spain). The Sinfonevada dataset contains more than 7,500 occurrence records belonging to 270 taxa (24 of them threatened) from the floristic inventories of the Sinfonevada forest inventory. Expert field workers collected the information, and the whole dataset underwent quality control by botanists with broad expertise in Sierra Nevada flora. This floristic inventory was created to gather useful information for the proper management of Pinus plantations in Sierra Nevada, and it is the only dataset that gives a comprehensive view of the forest flora in Sierra Nevada. For this reason it is being used to assess the biodiversity of the very dense pine plantations on this massif; with this dataset, managers have improved their ability to decide where to apply forest treatments in order to avoid biodiversity loss. The dataset forms part of the Sierra Nevada Global Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area. PMID:24843285

  8. OpenCL based machine learning labeling of biomedical datasets

    NASA Astrophysics Data System (ADS)

    Amoros, Oscar; Escalera, Sergio; Puig, Anna

    2011-03-01

    In this paper, we propose a two-stage labeling method for large biomedical datasets through a parallel approach on a single GPU. Diagnostic methods, structure volume measurements, and visualization systems are of major importance for surgery planning, intra-operative imaging and image-guided surgery. In all these cases, providing an automatic and interactive method to label or tag the different structures contained in the input data becomes imperative. Several approaches to label or segment biomedical datasets have been proposed to discriminate different anatomical structures in an output tagged dataset. Among existing methods, supervised learning methods for segmentation have been devised so that a non-expert user can easily analyze biomedical datasets. However, they still have some problems concerning practical application, such as slow learning and testing speeds. In addition, recent technological developments have led to widespread availability of multi-core CPUs and GPUs, as well as new software languages, such as NVIDIA's CUDA and OpenCL, making it possible to apply parallel programming paradigms on conventional personal computers. The AdaBoost classifier is one of the most widely applied methods for labeling in the machine learning community. In a first stage, AdaBoost trains a binary classifier from a set of pre-labeled samples described by a set of features. This binary classifier is defined as a weighted combination of weak classifiers, each of which is a simple decision function estimated on a single feature value. Then, at the testing stage, each weak classifier is independently applied to the features of a set of unlabeled samples. In this work, we propose an alternative representation of the AdaBoost binary classifier, and we use this representation to define a new GPU-based parallelized AdaBoost testing stage using OpenCL. We provide numerical experiments based on large available data sets and we compare our results to CPU-based strategies in terms of time and
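    The AdaBoost testing stage described above (a weighted vote of single-feature decision stumps) can be sketched as follows. The ensemble weights and thresholds are hypothetical, as if produced by a prior training stage:

```python
def stump(feature_index, threshold, polarity):
    """Weak classifier: a simple decision on a single feature value."""
    def h(x):
        return polarity if x[feature_index] > threshold else -polarity
    return h

def adaboost_predict(weak_classifiers, x):
    """Testing stage: sign of the weighted sum of weak-classifier votes.
    Each weak classifier is applied independently, which is what makes
    this stage easy to parallelize across samples and classifiers."""
    score = sum(alpha * h(x) for alpha, h in weak_classifiers)
    return 1 if score >= 0 else -1

# Hypothetical trained ensemble: (weight, weak classifier) pairs.
ensemble = [
    (0.9, stump(0, 0.5, 1)),    # strong vote on feature 0
    (0.4, stump(1, 2.0, -1)),   # weaker, opposite-polarity vote on feature 1
]

print(adaboost_predict(ensemble, (0.8, 1.0)))   # 1
print(adaboost_predict(ensemble, (0.2, 3.0)))   # -1
```

    Because every `(alpha, h)` evaluation is independent, the inner sum maps naturally onto one GPU work-item per sample (or per weak classifier with a reduction), which is the essence of the OpenCL formulation.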

  9. Development of a video tampering dataset for forensic investigation.

    PubMed

    Ismael Al-Sanjary, Omar; Ahmed, Ahmed Abdullah; Sulong, Ghazali

    2016-09-01

    Forgery is an act of modifying a document, product, image or video, among other media. Video tampering detection research requires an inclusive database of video modifications. This paper discusses a comprehensive proposal to create a dataset composed of modified videos for forensic investigation, in order to standardize existing techniques for detecting video tampering. The primary purpose of developing and designing this new video library is for use in video forensics, which can be consciously associated with reliable verification using dynamic and static camera recognition. To the best of the authors' knowledge, no similar library exists among the research community. Videos were sourced from YouTube and by exploring social networking sites extensively, observing posted videos and rating their feedback. The video tampering dataset (VTD) comprises a total of 33 videos, divided among three categories of video tampering: (1) copy-move, (2) splicing, and (3) frame swapping. Compared to existing datasets, this is a higher number of tampered videos, and with longer durations. The duration of every video is 16 s, with a 1280×720 resolution and a frame rate of 30 frames per second. Moreover, all videos possess the same format and quality (720p HD .avi). Both temporal and spatial video features were considered carefully during selection of the videos, and complete information is provided on the doctored regions of every modified video in the VTD dataset. The database has been made publicly available for research on splicing, frame swapping, and copy-move tampering, and as such covers various video tampering detection issues with ground truth. The database has been utilised by many international researchers and groups of researchers. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  10. Analysis of Naïve Bayes Algorithm for Email Spam Filtering across Multiple Datasets

    NASA Astrophysics Data System (ADS)

    Fitriah Rusland, Nurul; Wahid, Norfaradilla; Kasim, Shahreen; Hafit, Hanayanti

    2017-08-01

    E-mail spam continues to be a problem on the Internet. Spammed e-mail may contain many copies of the same message, commercial advertisements or other irrelevant posts such as pornographic content. In previous research, different filtering techniques have been used to detect these e-mails, such as Random Forest, Naïve Bayes, Support Vector Machine (SVM) and Neural Network classifiers. In this research, we test the Naïve Bayes algorithm for e-mail spam filtering on two datasets, the Spam Data and SPAMBASE datasets [8], and evaluate its performance. Performance on the datasets is evaluated in terms of accuracy, recall, precision and F-measure. We use the WEKA tool for the evaluation of the Naïve Bayes algorithm on both datasets. The result shows that the type of email and the number of instances in the dataset influence the performance of Naïve Bayes.
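    The Naïve Bayes classifier being evaluated can be sketched from scratch. This is a minimal multinomial Naïve Bayes with Laplace smoothing on hypothetical toy messages, not the WEKA implementation or either of the paper's datasets:

```python
import math
from collections import Counter

def train_nb(docs):
    """Train multinomial Naive Bayes. `docs` is a list of (tokens, label)
    pairs; returns per-class word counts, class document counts, the total
    number of documents, and the vocabulary."""
    counts, class_docs, vocab = {}, Counter(), set()
    for tokens, label in docs:
        class_docs[label] += 1
        counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return counts, class_docs, sum(class_docs.values()), vocab

def predict_nb(model, tokens):
    """Pick the class with the highest log posterior, using add-one
    (Laplace) smoothing for unseen words."""
    counts, class_docs, total_docs, vocab = model
    best_label, best_score = None, -math.inf
    for label, wc in counts.items():
        score = math.log(class_docs[label] / total_docs)   # log prior
        denom = sum(wc.values()) + len(vocab)
        for t in tokens:
            score += math.log((wc[t] + 1) / denom)         # smoothed likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "agenda", "monday"], "ham"),
    (["lunch", "monday"], "ham"),
]
model = train_nb(docs)
print(predict_nb(model, ["free", "money"]))      # spam
print(predict_nb(model, ["agenda", "monday"]))   # ham
```

    Working in log space avoids numeric underflow when messages are long, and the Laplace term keeps unseen words from zeroing out a class's probability.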

  11. Can a national dataset generate a nomogram for necrotizing enterocolitis onset?

    PubMed

    Gordon, P V; Clark, R; Swanson, J R; Spitzer, A

    2014-10-01

    Mother's own milk and donor human milk use is increasing as a means of necrotizing enterocolitis (NEC) prevention. Early onset of enteral feeding has been associated with improvement of many outcomes but has not been shown to reduce the incidence of NEC. Better definition of the window of risk for NEC by gestational strata should improve resource management with respect to donor human milk and enhance our understanding of NEC timing and pathogenesis. Our objective was to establish an NEC dataset of sufficient size and quality, then build a generalizable model of NEC onset from the dataset across gestational strata. We used de-identified data from the Pediatrix national dataset and filtered out all diagnostic confounders that could be identified by either specific diagnoses or logical exclusions (e.g., dual diagnoses), with a specific focus on NEC and spontaneous intestinal perforation (SIP) as the outcomes of interest. The median day of onset was plotted against gestational age for each of these diagnoses and analyzed for similarities and differences in the day of diagnosis. Onset time of medical NEC was inversely proportional to gestation in a linear relationship across all gestational ages. We found that the medical NEC dataset displayed characteristics most consistent with a homogeneous disease entity, whereas there was a skew towards early presentation in the youngest gestation groups for surgical NEC (suggesting probable SIP contamination). Our national dataset demonstrates that NEC onset occurs in an inverse, stereotypic, linear relationship with gestational age at birth. Medical NEC is the most reliable sub-cohort for the purpose of determining the temporal window of NEC risk.

  12. Analysis of plant-derived miRNAs in animal small RNA datasets

    PubMed Central

    2012-01-01

    Background Plants contain significant quantities of small RNAs (sRNAs) derived from various sRNA biogenesis pathways, many of which play regulatory roles in plants. Previous analysis revealed that numerous sRNAs in corn, rice and soybean seeds have high sequence similarity to animal genes. However, exogenous RNA is considered to be unstable within the gastrointestinal tract of many animals, limiting the potential for any adverse effects from consumption of dietary RNA. A recent paper reported that putative plant miRNAs were detected in animal plasma and serum, presumably acquired through ingestion, and may have a functional impact in the consuming organisms. Results To address the question of how common this phenomenon could be, we searched for plant miRNA sequences in public sRNA datasets from various tissues of mammals, chicken and insects. Our analyses revealed that plant miRNAs were present in the animal sRNA datasets, and that miR168 in particular was extremely over-represented. Furthermore, all or nearly all (>96%) miR168 sequences were monocot derived for most datasets, including datasets for two insects reared on dicot plants in their respective experiments. To investigate whether plant-derived miRNAs, including miR168, could accumulate and move systemically in insects, we conducted feeding studies for three insects including corn rootworm, which has been shown to be responsive to plant-produced long double-stranded RNAs. Conclusions Our analyses suggest that the observed plant miRNAs in animal sRNA datasets can originate in the process of sequencing, and that accumulation of plant miRNAs via dietary exposure is not universal in animals. PMID:22873950

  13. Comparison of Radiative Energy Flows in Observational Datasets and Climate Modeling

    NASA Technical Reports Server (NTRS)

    Raschke, Ehrhard; Kinne, Stefan; Rossow, William B.; Stackhouse, Paul W. Jr.; Wild, Martin

    2016-01-01

    This study examines radiative flux distributions and local spread of values from three major observational datasets (CERES, ISCCP, and SRB) and compares them with results from climate modeling (CMIP3). Examinations of the spread and differences also differentiate among contributions from cloudy and clear-sky conditions. The spread among observational datasets is in large part caused by noncloud ancillary data. Average differences of at least 10 W m⁻² each for clear-sky downward solar, upward solar, and upward infrared fluxes at the surface demonstrate via spatial difference patterns major differences in assumptions for atmospheric aerosol, solar surface albedo and surface temperature, and/or emittance in observational datasets. At the top of the atmosphere (TOA), observational datasets are less influenced by the ancillary data errors than at the surface. Comparisons of spatial radiative flux distributions at the TOA between observations and climate modeling indicate large deficiencies in the strength and distribution of model-simulated cloud radiative effects. Differences are largest for lower-altitude clouds over low-latitude oceans. Global modeling simulates stronger cloud radiative effects (CRE) by +30 W m⁻² over trade wind cumulus regions, yet smaller CRE by about -30 W m⁻² over (smaller in area) stratocumulus regions. At the surface, climate modeling simulates on average about 15 W m⁻² smaller radiative net flux imbalances, as if climate modeling underestimates latent heat release (and precipitation). Relative to observational datasets, simulated surface net fluxes are particularly lower over oceanic trade wind regions (where global modeling tends to overestimate the radiative impact of clouds). Still, with the uncertainty in noncloud ancillary data, observational data do not establish a reliable reference.

  14. Identification of druggable cancer driver genes amplified across TCGA datasets.

    PubMed

    Chen, Ying; McGee, Jeremy; Chen, Xianming; Doman, Thompson N; Gong, Xueqian; Zhang, Youyan; Hamm, Nicole; Ma, Xiwen; Higgs, Richard E; Bhagwat, Shripad V; Buchanan, Sean; Peng, Sheng-Bin; Staschke, Kirk A; Yadav, Vipin; Yue, Yong; Kouros-Mehr, Hosein

    2014-01-01

    The Cancer Genome Atlas (TCGA) projects have advanced our understanding of the driver mutations, genetic backgrounds, and key pathways activated across cancer types. Analysis of TCGA datasets has mostly focused on somatic mutations and translocations, with less emphasis placed on gene amplifications. Here we describe a bioinformatics screening strategy to identify putative cancer driver genes amplified across TCGA datasets. We carried out GISTIC2 analysis of TCGA datasets spanning 16 cancer subtypes and identified 486 genes that were amplified in two or more datasets. The list was narrowed to 75 cancer-associated genes with potential "druggable" properties. The majority of the genes were localized to 14 amplicons spread across the genome. To identify potential cancer driver genes, we analyzed gene copy number and mRNA expression data from individual patient samples and identified 42 putative cancer driver genes linked to diverse oncogenic processes. Oncogenic activity was further validated by siRNA/shRNA knockdown and by referencing the Project Achilles datasets. The amplified genes represented a number of gene families, including epigenetic regulators, cell cycle-associated genes, DNA damage response/repair genes, metabolic regulators, and genes linked to the Wnt, Notch, Hedgehog, JAK/STAT, NF-κB and MAPK signaling pathways. Among the 42 putative driver genes were known driver genes, such as EGFR, ERBB2 and PIK3CA. Wild-type KRAS was amplified in several cancer types, and KRAS-amplified cancer cell lines were most sensitive to KRAS shRNA, suggesting that KRAS amplification was an independent oncogenic event. A number of MAP kinase adapters were co-amplified with their receptor tyrosine kinases, such as the FGFR adapter FRS2 and the EGFR family adapters GRB2 and GRB7. The ubiquitin-like ligase DCUN1D1 and the histone methyltransferase NSD3 were also identified as novel putative cancer driver genes. We discuss the patient-tailoring implications for existing cancer

  15. Identification of Druggable Cancer Driver Genes Amplified across TCGA Datasets

    PubMed Central

    Chen, Ying; McGee, Jeremy; Chen, Xianming; Doman, Thompson N.; Gong, Xueqian; Zhang, Youyan; Hamm, Nicole; Ma, Xiwen; Higgs, Richard E.; Bhagwat, Shripad V.; Buchanan, Sean; Peng, Sheng-Bin; Staschke, Kirk A.; Yadav, Vipin; Yue, Yong; Kouros-Mehr, Hosein

    2014-01-01

    The Cancer Genome Atlas (TCGA) projects have advanced our understanding of the driver mutations, genetic backgrounds, and key pathways activated across cancer types. Analysis of TCGA datasets has mostly focused on somatic mutations and translocations, with less emphasis placed on gene amplifications. Here we describe a bioinformatics screening strategy to identify putative cancer driver genes amplified across TCGA datasets. We carried out GISTIC2 analysis of TCGA datasets spanning 14 cancer subtypes and identified 461 genes that were amplified in two or more datasets. The list was narrowed to 73 cancer-associated genes with potential "druggable" properties. The majority of the genes were localized to 14 amplicons spread across the genome. To identify potential cancer driver genes, we analyzed gene copy number and mRNA expression data from individual patient samples and identified 40 putative cancer driver genes linked to diverse oncogenic processes. Oncogenic activity was further validated by siRNA/shRNA knockdown and by referencing the Project Achilles datasets. The amplified genes represented a number of gene families, including epigenetic regulators, cell cycle-associated genes, DNA damage response/repair genes, metabolic regulators, and genes linked to the Wnt, Notch, Hedgehog, JAK/STAT, NF-κB and MAPK signaling pathways. Among the 40 putative driver genes were known driver genes, such as EGFR, ERBB2 and PIK3CA. Wild-type KRAS was amplified in several cancer types, and KRAS-amplified cancer cell lines were most sensitive to KRAS shRNA, suggesting that KRAS amplification was an independent oncogenic event. A number of MAP kinase adapters were co-amplified with their receptor tyrosine kinases, such as the FGFR adapter FRS2 and the EGFR family adapter GRB7. The ubiquitin-like ligase DCUN1D1 and the histone methyltransferase NSD3 were also identified as novel putative cancer driver genes. We discuss the patient-tailoring implications for existing cancer drug

  16. Tree-based approach for exploring marine spatial patterns with raster datasets.

    PubMed

    Liao, Xiaohan; Xue, Cunjin; Su, Fenzhen

    2017-01-01

    From multiple raster datasets to spatial association patterns, the data-mining task divides into three subtasks: raster dataset pretreatment, mining algorithm design, and exploration of spatial patterns from the mining results. Compared with the first two subtasks, the third remains largely unresolved. To handle interrelated marine environmental parameters, we propose a Tree-based Approach for eXploring Marine Spatial Patterns with multiple raster datasets, called TAXMarSP, which includes two models. One is the Tree-based Cascading Organization Model (TCOM), and the other is the Spatial Neighborhood-based CAlculation Model (SNCAM). TCOM designs the "Spatial node→Pattern node" structure from top to bottom layers to store the table-formatted frequent patterns. Together with TCOM, SNCAM considers spatial neighborhood contributions to calculate the pattern-matching degree between the specified marine parameters and the table-formatted frequent patterns, and then explores the marine spatial patterns. Using the prevalent quantification Apriori algorithm and a real remote sensing dataset from January 1998 to December 2014, a successful application of TAXMarSP to marine spatial patterns in the Pacific Ocean is described; the obtained marine spatial patterns include not only well-known patterns but also new ones for Earth scientists.

  17. Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets.

    PubMed

    Huang, Min-Wei; Lin, Wei-Chao; Tsai, Chih-Fong

    2018-01-01

    Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem, which is to provide estimates for the missing values by a reasoning process based on the (complete) observed data. However, if the observed data contain some noisy information or outliers, the estimates of the missing values may not be reliable or may even be quite different from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are used in order to find the best combination. The experimental results show that performing instance selection can have a positive impact on missing value imputation over the numerical data type of medical datasets, and specific combinations of instance selection and imputation methods can improve the imputation results over the mixed data type of medical datasets. However, instance selection does not have a definitely positive impact on the imputation result for categorical medical datasets.
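As a rough illustration of combining instance selection with nearest-neighbour imputation, the sketch below is a pure-NumPy stand-in, not the paper's DROP3/GA/IB3 or KNNI implementations: it discards outlying complete rows with a simple z-score cut before imputing each missing value from its k nearest complete rows. The function name and threshold are illustrative.

```python
import numpy as np

def knn_impute(X, k=3, z_thresh=3.0):
    """Impute NaNs in X: first drop outlier rows from the fully observed
    rows via a z-score filter (a crude form of instance selection), then
    fill each missing value with the column mean over the k nearest
    complete rows, measured on the observed attributes only."""
    X = np.asarray(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]
    # instance selection: discard complete rows with any |z| > z_thresh
    mu, sd = complete.mean(axis=0), complete.std(axis=0) + 1e-12
    keep = (np.abs((complete - mu) / sd) <= z_thresh).all(axis=1)
    pool = complete[keep]
    out = X.copy()
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        # Euclidean distance on the observed attributes of row i
        d = np.sqrt(((pool[:, ~miss] - X[i, ~miss]) ** 2).sum(axis=1))
        nearest = pool[np.argsort(d)[:k]]
        out[i, miss] = nearest[:, miss].mean(axis=0)
    return out
```

A real study would substitute a proper instance selector and tune k; this only shows how the two stages compose.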

  18. Automatic Diabetic Macular Edema Detection in Fundus Images Using Publicly Available Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Giancardo, Luca; Meriaudeau, Fabrice; Karnowski, Thomas Paul

    2011-01-01

    Diabetic macular edema (DME) is a common vision threatening complication of diabetic retinopathy. In a large scale screening environment DME can be assessed by detecting exudates (a type of bright lesions) in fundus images. In this work, we introduce a new methodology for diagnosis of DME using a novel set of features based on colour, wavelet decomposition and automatic lesion segmentation. These features are employed to train a classifier able to automatically diagnose DME. We present a new publicly available dataset with ground-truth data containing 169 patients from various ethnic groups and levels of DME. This and two other publicly available datasets are employed to evaluate our algorithm. We are able to achieve diagnosis performance comparable to retina experts on the MESSIDOR (an independently labelled dataset with 1200 images) with cross-dataset testing. Our algorithm is robust to segmentation uncertainties, does not need ground truth at lesion level, and is very fast, generating a diagnosis in an average of 4.4 seconds per image on a 2.6 GHz platform with an unoptimised Matlab implementation.

  19. Dataset from Dick et al published in Sawyer et al 2016

    EPA Pesticide Factsheets

    Dataset is a time course description of lindane disappearance in blood plasma after dermal exposure in human volunteers. This dataset is associated with the following publication: Sawyer, M.E., M.V. Evans, C. Wilson, L.J. Beesley, L. Leon, C. Eklund, E. Croom, and R. Pegram. Development of a Human Physiologically Based Pharmacokinetics (PBPK) Model for Dermal Permeability for Lindane. TOXICOLOGY LETTERS. Elsevier Science Ltd, New York, NY, USA, 14(245): pp. 106-109, (2016).

  20. Acquisition of thin coronal sectional dataset of cadaveric liver.

    PubMed

    Lou, Li; Liu, Shu Wei; Zhao, Zhen Mei; Tang, Yu Chun; Lin, Xiang Tao

    2014-04-01

    To obtain a thin coronal sectional anatomic dataset of the liver using a digital freezing-milling technique. The upper abdomen of one Chinese adult cadaver was selected as the specimen. After CT and MRI examinations verified the absence of liver lesions, the specimen was embedded in gelatin in an upright position and frozen under profound hypothermia, and then serially sectioned from anterior to posterior, layer by layer, with a digital milling machine in the freezing chamber. The sequential images were captured with a digital camera and the dataset was imported to an imaging workstation. The thin serial sections of the liver totalled 699 layers, each 0.2 mm in thickness. The shape, location, structure, intrahepatic vessels and adjacent structures of the liver were displayed clearly on each layer of the coronal sectional slice. CT and MR images through the body were obtained at 1.0 and 3.0 mm intervals, respectively. The methodology reported here is an adaptation of previously described milling methods and constitutes a new data acquisition method for sectional anatomy. The thin coronal sectional anatomic dataset of the liver obtained by this technique is of high precision and good quality.

  1. Satellite-derived pan-Arctic melt onset dataset, 2000-2009

    NASA Astrophysics Data System (ADS)

    Wang, L.; Derksen, C.; Howell, S.; Wolken, G. J.; Sharp, M. J.; Markus, T.

    2009-12-01

    The SeaWinds Scatterometer on QuikSCAT (QS) has been in orbit for over a decade since its launch in June 1999. Due to its high sensitivity to the appearance of liquid water in snow and its day/night, all-weather capability, QS data have been successfully used to detect melt onset and melt duration for various elements of the cryosphere. These melt datasets are especially useful in the polar regions, where the application of imagery from optical sensors is hindered by polar nights and frequent cloud cover. In this study, we generate a pan-Arctic, pan-cryosphere melt onset dataset by combining estimates from previously published algorithms optimized for individual cryospheric elements and applied to QS and Special Sensor Microwave Imager (SSM/I) data for the northern high latitude land surface, ice caps, large lakes, and sea ice. Comparisons of melt onset along the boundaries between different components of the cryosphere show that in general the integrated dataset provides consistent and spatially coherent melt onset estimates across the pan-Arctic. We present the climatology and the anomaly patterns in melt onset during 2000-2009, and identify synoptic-scale linkages between atmospheric conditions and the observed patterns. We also investigate possible trends in melt onset in the pan-Arctic during the 10-year period.

  2. Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access

    NASA Astrophysics Data System (ADS)

    Jiang, Y.

    2015-12-01

    Oceanographic resource discovery is a critical step for developing ocean science applications. With the increasing number of resources available online, many Spatial Data Infrastructure (SDI) components (e.g. catalogues and portals) have been developed to help manage and discover oceanographic resources. However, efficient and accurate resource discovery is still a big challenge because of the lack of data relevancy information. In this article, we propose a search engine framework for mining and utilizing dataset relevancy from oceanographic dataset metadata, usage metrics, and user feedback. The objective is to improve the discovery accuracy of oceanographic data and reduce the time scientists spend discovering, downloading and reformatting data for their projects. Experiments and a search example show that the proposed engine helps both scientists and general users search for more accurate results, with enhanced performance and user experience, through a user-friendly interface.

  3. Enhancing Conservation with High Resolution Productivity Datasets for the Conterminous United States

    NASA Astrophysics Data System (ADS)

    Robinson, Nathaniel Paul

    Human driven alteration of the earth's terrestrial surface is accelerating through land use changes, intensification of human activity, climate change, and other anthropogenic pressures. These changes occur at broad spatio-temporal scales, challenging our ability to effectively monitor and assess the impacts and subsequent conservation strategies. While satellite remote sensing (SRS) products enable monitoring of the earth's terrestrial surface continuously across space and time, the practical applications for conservation and management of these products are limited. Often the processes driving ecological change occur at fine spatial resolutions and are undetectable given the resolution of available datasets. Additionally, the links between SRS data and ecologically meaningful metrics are weak. Recent advances in cloud computing technology along with the growing record of high resolution SRS data enable the development of SRS products that quantify ecologically meaningful variables at relevant scales applicable for conservation and management. The focus of my dissertation is to improve the applicability of terrestrial gross and net primary productivity (GPP/NPP) datasets for the conterminous United States (CONUS). In chapter one, I develop a framework for creating high resolution datasets of vegetation dynamics. I use the entire archive of Landsat 5, 7, and 8 surface reflectance data and a novel gap filling approach to create spatially continuous 30 m, 16-day composites of the normalized difference vegetation index (NDVI) from 1986 to 2016. In chapter two, I integrate this with other high resolution datasets and the MOD17 algorithm to create the first high resolution GPP and NPP datasets for CONUS. I demonstrate the applicability of these products for conservation and management, showing the improvements beyond currently available products. In chapter three, I utilize this dataset to evaluate the relationships between land ownership and terrestrial production
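As a toy stand-in for the gap-filling approach the dissertation describes (not its actual method, which operates on the full Landsat archive), missing composites in a single pixel's NDVI time series can be filled by linear interpolation in time:

```python
import numpy as np

def fill_gaps(ndvi):
    """Linearly interpolate masked (NaN) values in a per-pixel NDVI
    time series of equally spaced (e.g. 16-day) composites.  A real
    gap-filling scheme would also draw on climatology and neighbours."""
    ndvi = np.asarray(ndvi, dtype=float)
    t = np.arange(len(ndvi))
    good = ~np.isnan(ndvi)
    return np.interp(t, t[good], ndvi[good])
```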

  4. Nanomaterial datasets to advance tomography in scanning transmission electron microscopy

    DOE PAGES

    Levin, Barnaby D. A.; Padgett, Elliot; Chen, Chien-Chun; ...

    2016-06-07

    Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state of the art experimental test data.

  5. Nanomaterial datasets to advance tomography in scanning transmission electron microscopy.

    PubMed

    Levin, Barnaby D A; Padgett, Elliot; Chen, Chien-Chun; Scott, M C; Xu, Rui; Theis, Wolfgang; Jiang, Yi; Yang, Yongsoo; Ophus, Colin; Zhang, Haitao; Ha, Don-Hyung; Wang, Deli; Yu, Yingchao; Abruña, Hector D; Robinson, Richard D; Ercius, Peter; Kourkoutis, Lena F; Miao, Jianwei; Muller, David A; Hovden, Robert

    2016-06-07

    Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state of the art experimental test data.

  6. A dataset on human navigation strategies in foreign networked systems.

    PubMed

    Kőrösi, Attila; Csoma, Attila; Rétvári, Gábor; Heszberger, Zalán; Bíró, József; Tapolcai, János; Pelle, István; Klajbár, Dávid; Novák, Márton; Halasi, Valentina; Gulyás, András

    2018-03-13

    Humans are involved in various real-life networked systems. The most obvious examples are social and collaboration networks, but the language and the related mental lexicon they use, or the physical map of their territory, can also be interpreted as networks. How do they find paths between endpoints in these networks? How do they obtain information about a foreign networked world they find themselves in, how do they build a mental model of it, and how well do they succeed in using it? Large, open datasets allowing the exploration of such questions are hard to find. Here we report a dataset collected by a smartphone application, in which players navigate between fixed-length source and destination English words step-by-step by changing only one letter at a time. The paths reflect how the players master their navigation skills in such a foreign networked world. The dataset can be used in the study of human mental models for the world around us, or in a broader scope to investigate the navigation strategies in complex networked systems.
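The one-letter-at-a-time navigation task described above is a classic word-ladder problem: words are nodes, and an edge joins words differing in exactly one letter. The sketch below (plain Python; the word list and function names are illustrative, not the authors' app code) builds the graph with wildcard buckets and finds the shortest ladder by breadth-first search, the benchmark against which human paths could be compared.

```python
from collections import defaultdict, deque

def word_graph(words):
    """Connect equal-length words that differ in exactly one letter,
    using wildcard buckets: 'h_t' groups hat, hot, hit, ..."""
    buckets = defaultdict(list)
    for w in words:
        for i in range(len(w)):
            buckets[w[:i] + "_" + w[i + 1:]].append(w)
    graph = defaultdict(set)
    for group in buckets.values():
        for a in group:
            for b in group:
                if a != b:
                    graph[a].add(b)
    return graph

def shortest_ladder(graph, src, dst):
    """Breadth-first search from src to dst; returns the shortest
    navigation path, or None if the words are not connected."""
    prev, queue = {src: None}, deque([src])
    while queue:
        w = queue.popleft()
        if w == dst:
            path = []
            while w is not None:
                path.append(w)
                w = prev[w]
            return path[::-1]
        for nxt in graph[w]:
            if nxt not in prev:
                prev[nxt] = w
                queue.append(nxt)
    return None
```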

  7. A dataset on human navigation strategies in foreign networked systems

    PubMed Central

    Kőrösi, Attila; Csoma, Attila; Rétvári, Gábor; Heszberger, Zalán; Bíró, József; Tapolcai, János; Pelle, István; Klajbár, Dávid; Novák, Márton; Halasi, Valentina; Gulyás, András

    2018-01-01

    Humans are involved in various real-life networked systems. The most obvious examples are social and collaboration networks, but the language and the related mental lexicon they use, or the physical map of their territory, can also be interpreted as networks. How do they find paths between endpoints in these networks? How do they obtain information about a foreign networked world they find themselves in, how do they build a mental model of it, and how well do they succeed in using it? Large, open datasets allowing the exploration of such questions are hard to find. Here we report a dataset collected by a smartphone application, in which players navigate between fixed-length source and destination English words step-by-step by changing only one letter at a time. The paths reflect how the players master their navigation skills in such a foreign networked world. The dataset can be used in the study of human mental models for the world around us, or in a broader scope to investigate the navigation strategies in complex networked systems. PMID:29533391

  8. Optimal SVM parameter selection for non-separable and unbalanced datasets.

    PubMed

    Jiang, Peng; Missoum, Samy; Chen, Zhao

    2014-10-01

    This article presents a study of three validation metrics used for the selection of optimal parameters of a support vector machine (SVM) classifier in the case of non-separable and unbalanced datasets. This situation is often encountered when the data is obtained experimentally or clinically. The three metrics selected in this work are the area under the ROC curve (AUC), accuracy, and balanced accuracy. These validation metrics are tested using computational data only, which enables the creation of fully separable sets of data. This way, non-separable datasets, representative of a real-world problem, can be created by projection onto a lower-dimensional subspace. The knowledge of the separable dataset, unknown in real-world problems, provides a reference to compare the three validation metrics using a quantity referred to as the "weighted likelihood". As an application example, the study investigates a classification model for hip fracture prediction. The data is obtained from a parameterized finite element model of a femur. The performance of the various validation metrics is studied for several levels of separability, ratios of unbalance, and training set sizes.
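For reference, the two non-accuracy validation metrics discussed above can be computed directly. The NumPy sketch below (not the authors' implementation) uses the rank-sum identity for AUC, assuming untied scores, and the mean of sensitivity and specificity for balanced accuracy, which is what makes the latter robust to class imbalance.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney)
    identity; assumes no tied scores (no mid-rank correction)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = (labels == 1).sum(), (labels == 0).sum()
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def balanced_accuracy(pred, labels):
    """Mean of the true positive rate and true negative rate."""
    pred, labels = np.asarray(pred), np.asarray(labels)
    tpr = (pred[labels == 1] == 1).mean()
    tnr = (pred[labels == 0] == 0).mean()
    return (tpr + tnr) / 2
```

Either function can be plugged in as the scoring criterion when grid-searching SVM parameters on an unbalanced set.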

  9. Nanomaterial datasets to advance tomography in scanning transmission electron microscopy

    PubMed Central

    Levin, Barnaby D.A.; Padgett, Elliot; Chen, Chien-Chun; Scott, M.C.; Xu, Rui; Theis, Wolfgang; Jiang, Yi; Yang, Yongsoo; Ophus, Colin; Zhang, Haitao; Ha, Don-Hyung; Wang, Deli; Yu, Yingchao; Abruña, Hector D.; Robinson, Richard D.; Ercius, Peter; Kourkoutis, Lena F.; Miao, Jianwei; Muller, David A.; Hovden, Robert

    2016-01-01

    Electron tomography in materials science has flourished with the demand to characterize nanoscale materials in three dimensions (3D). Access to experimental data is vital for developing and validating reconstruction methods that improve resolution and reduce radiation dose requirements. This work presents five high-quality scanning transmission electron microscope (STEM) tomography datasets in order to address the critical need for open access data in this field. The datasets represent the current limits of experimental technique, are of high quality, and contain materials with structural complexity. Included are tomographic series of a hyperbranched Co2P nanocrystal, platinum nanoparticles on a carbon nanofibre imaged over the complete 180° tilt range, a platinum nanoparticle and a tungsten needle both imaged at atomic resolution by equal slope tomography, and a through-focal tilt series of PtCu nanoparticles. A volumetric reconstruction from every dataset is provided for comparison and development of post-processing and visualization techniques. Researchers interested in creating novel data processing and reconstruction algorithms will now have access to state of the art experimental test data. PMID:27272459

  10. SPICE: exploration and analysis of post-cytometric complex multivariate datasets.

    PubMed

    Roederer, Mario; Nozzi, Joshua L; Nason, Martha C

    2011-02-01

    Polychromatic flow cytometry results in complex, multivariate datasets. To date, tools for the aggregate analysis of these datasets across multiple specimens grouped by different categorical variables, such as demographic information, have not been optimized. Often, the exploration of such datasets is accomplished by visualization of patterns with pie charts or bar charts, without easy access to statistical comparisons of measurements that comprise multiple components. Here we report on algorithms and a graphical interface we developed for these purposes. In particular, we discuss thresholding necessary for accurate representation of data in pie charts, the implications for display and comparison of normalized versus unnormalized data, and the effects of averaging when samples with significant background noise are present. Finally, we define a statistic for the nonparametric comparison of complex distributions to test for difference between groups of samples based on multi-component measurements. While originally developed to support the analysis of T cell functional profiles, these techniques are amenable to a broad range of datatypes. Published 2011 Wiley-Liss, Inc.
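The thresholding the authors describe for accurate pie-chart display can be illustrated with a minimal sketch (plain Python; names and threshold are illustrative, not the SPICE implementation): components whose fraction falls below a threshold are collapsed into a residual slice before renormalisation, so near-background components do not distort the chart.

```python
def threshold_pie(fractions, threshold=0.01, other="Other"):
    """Collapse components of a {name: fraction} dict that fall below
    `threshold` into a single residual slice, then renormalise so the
    remaining slices sum to 1."""
    kept = {k: v for k, v in fractions.items() if v >= threshold}
    residual = sum(v for k, v in fractions.items() if v < threshold)
    if residual > 0:
        kept[other] = kept.get(other, 0.0) + residual
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}
```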

  11. Estimation of Missed Statin Prescription Use in an Administrative Claims Dataset.

    PubMed

    Wade, Rolin L; Patel, Jeetvan G; Hill, Jerrold W; De, Ajita P; Harrison, David J

    2017-09-01

    Nonadherence to statin medications is associated with increased risk of cardiovascular disease and poses a challenge to lipid management in patients who are at risk for atherosclerotic cardiovascular disease. Numerous studies have examined statin adherence based on administrative claims data; however, these data may underestimate statin use in patients who participate in generic drug discount programs or who have alternative coverage. To estimate the proportion of patients with missing statin claims in a claims database and determine how missing claims affect commonly used utilization metrics. This retrospective cohort study used pharmacy data from the PharMetrics Plus (P+) claims dataset linked to the IMS longitudinal pharmacy point-of-sale prescription database (LRx) from January 1, 2012, through December 31, 2014. Eligible patients were represented in the P+ and LRx datasets, had ≥1 claim for a statin (index claim) in either database, and had ≥ 24 months of continuous enrollment in P+. Patients were linked between P+ and LRx using a deterministic method. Duplicate claims between LRx and P+ were removed to produce a new dataset comprised of P+ claims augmented with LRx claims. Statin use was then compared between P+ and the augmented P+ dataset. Utilization metrics that were evaluated included percentage of patients with ≥ 1 missing statin claim over 12 months in P+; the number of patients misclassified as new users in P+; the number of patients misclassified as nonstatin users in P+; the change in 12-month medication possession ratio (MPR) and proportion of days covered (PDC) in P+; the comparison between P+ and LRx of classifications of statin treatment patterns (statin intensity and patients with treatment modifications); and the payment status for missing statin claims. Data from 965,785 patients with statin claims in P+ were analyzed (mean age 56.6 years; 57% male). In P+, 20.1% had ≥ 1 missing statin claim post-index; 13.7% were misclassified as
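The PDC metric evaluated in the study above can be sketched as follows (plain Python; the claim-tuple layout is an assumption, not the study's code). Unlike MPR, PDC does not double-count overlapping fills, which is why missing claims depress it directly.

```python
from datetime import date, timedelta

def pdc(claims, start, end):
    """Proportion of days covered: the fraction of days in [start, end]
    covered by at least one fill.  `claims` is a list of
    (fill_date, days_supply) tuples.  Overlapping supply is counted
    once, which distinguishes PDC from the medication possession
    ratio (MPR), where supplies are summed."""
    covered = set()
    for fill, supply in claims:
        for d in range(supply):
            day = fill + timedelta(days=d)
            if start <= day <= end:
                covered.add(day)
    return len(covered) / ((end - start).days + 1)
```

A claim missing from the dataset simply leaves its days uncovered, lowering PDC; augmenting the claims list (as the P+ plus LRx linkage does) can only raise it.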

  12. ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers.

    PubMed

    Teodoro, Douglas; Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio

    2018-01-01

    The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating the insert throughput and query latency of several NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.

  13. ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers

    PubMed Central

    Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio

    2018-01-01

    The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating the insert throughput and query latency of several NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms. PMID:29293556

  14. Dataset of Passerine bird communities in a Mediterranean high mountain (Sierra Nevada, Spain).

    PubMed

    Pérez-Luque, Antonio Jesús; Barea-Azcón, José Miguel; Álvarez-Ruiz, Lola; Bonet-García, Francisco Javier; Zamora, Regino

    2016-01-01

    In this data paper, a dataset of passerine bird communities is described in Sierra Nevada, a Mediterranean high mountain located in southern Spain. The dataset includes occurrence data from bird surveys conducted in four representative ecosystem types of Sierra Nevada from 2008 to 2015. For each visit, bird species numbers as well as distance to the transect line were recorded. A total of 27847 occurrence records were compiled with accompanying measurements on distance to the transect and animal counts. All records are of species in the order Passeriformes. Records of 16 different families and 44 genera were collected. Some of the taxa in the dataset are included in the European Red List. This dataset belongs to the Sierra Nevada Global-Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area.

  15. Scientific Datasets: Discovery and Aggregation for Semantic Interpretation.

    NASA Astrophysics Data System (ADS)

    Lopez, L. A.; Scott, S.; Khalsa, S. J. S.; Duerr, R.

    2015-12-01

    One of the biggest challenges that interdisciplinary researchers face is finding suitable datasets in order to advance their science; this problem remains consistent across multiple disciplines. A surprising number of scientists, when asked what tool they use for data discovery, reply "Google", which is an acceptable solution in some cases, but not even Google can find (or cares to compile) all the data that is relevant for science, and particularly the geosciences. If a dataset is not discoverable through a well-known search provider, it will remain dark data to the scientific world. For the past year, BCube, an EarthCube Building Block project, has been developing, testing and deploying a technology stack capable of data discovery at web-scale using the ultimate dataset: the Internet. This stack has two principal components, a web-scale crawling infrastructure and a semantic aggregator. The web crawler is a modified version of Apache Nutch (the originator of Hadoop and other big data technologies) that has been improved and tailored for data and data service discovery. The second component is semantic aggregation, carried out by a Python-based workflow that extracts valuable metadata and stores it in the form of triples through the use of semantic technologies. While implementing the BCube stack we have run into several challenges, such as a) scaling the project to cover big portions of the Internet at a reasonable cost, b) making sense of very diverse and non-homogeneous data, and c) extracting facts about these datasets using semantic technologies in order to make them usable for the geosciences community. Despite all these challenges we have proven that we can discover and characterize data that otherwise would have remained in the dark corners of the Internet. Having all this data indexed and 'triplelized' will enable scientists to access a trove of information relevant to their work in a more natural way. An important characteristic of the BCube stack is that all
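The "triplelized" storage the aggregator performs can be illustrated at its simplest (plain Python tuples rather than the project's actual RDF tooling; all record keys and function names are hypothetical): each metadata field becomes a (subject, predicate, object) triple, which can then be queried uniformly.

```python
def to_triples(record, subject_key="id"):
    """Flatten a metadata record (a dict) into (subject, predicate,
    object) triples, the form a triple store would hold."""
    s = record[subject_key]
    return [(s, k, v) for k, v in record.items() if k != subject_key]

def query(triples, predicate=None, obj=None):
    """Return triples matching the given predicate and/or object;
    None acts as a wildcard."""
    return [t for t in triples
            if (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]
```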

  16. Soil chemistry in lithologically diverse datasets: the quartz dilution effect

    USGS Publications Warehouse

    Bern, Carleton R.

    2009-01-01

    National- and continental-scale soil geochemical datasets are likely to move our understanding of broad soil geochemistry patterns forward significantly. Patterns of chemistry and mineralogy delineated from these datasets are strongly influenced by the composition of the soil parent material, which itself is largely a function of lithology and particle size sorting. Such controls present a challenge by obscuring subtler patterns arising from subsequent pedogenic processes. Here the effect of quartz concentration is examined in moist-climate soils from a pilot dataset of the North American Soil Geochemical Landscapes Project. Due to variable and high quartz contents (6.2–81.7 wt.%), and its residual and inert nature in soil, quartz is demonstrated to influence broad patterns in soil chemistry. A dilution effect is observed whereby concentrations of various elements are significantly and strongly negatively correlated with quartz. Quartz content drives artificial positive correlations between concentrations of some elements and obscures negative correlations between others. Unadjusted soil data show the highly mobile base cations Ca, Mg, and Na to be often strongly positively correlated with intermediately mobile Al or Fe, and generally uncorrelated with the relatively immobile high-field-strength elements (HFS) Ti and Nb. Both patterns are contrary to broad expectations for soils being weathered and leached. After transforming bulk soil chemistry to a quartz-free basis, the base cations are generally uncorrelated with Al and Fe, and negative correlations generally emerge with the HFS elements. Quartz-free element data may be a useful tool for elucidating patterns of weathering or parent-material chemistry in large soil datasets.
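The quartz-free transformation described above amounts to renormalising each element concentration to the non-quartz mass fraction, since quartz contributes essentially none of the element and only dilutes it. A minimal sketch (illustrative function name, not the author's code):

```python
def quartz_free(concentration_ppm, quartz_wt_pct):
    """Renormalise an element concentration to a quartz-free basis:
    c_qf = c / (1 - f_quartz), where f_quartz is the quartz mass
    fraction.  Assumes the element resides entirely outside quartz."""
    f = quartz_wt_pct / 100.0
    if not 0 <= f < 1:
        raise ValueError("quartz fraction must be in [0, 1)")
    return concentration_ppm / (1.0 - f)
```

For example, a soil at 50 wt.% quartz with 500 ppm of an element has a quartz-free concentration of 1000 ppm, which is why raw concentrations correlate negatively with quartz content.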

  17. Evaluation of Uncertainty in Precipitation Datasets for New Mexico, USA

    NASA Astrophysics Data System (ADS)

    Besha, A. A.; Steele, C. M.; Fernald, A.

    2014-12-01

    Climate change, population growth and other factors are endangering water availability and sustainability in semiarid/arid areas, particularly in the southwestern United States. Wide spatial and temporal coverage of precipitation measurements is key for regional water budget analysis and hydrological operations, which themselves are valuable tools for water resource planning and management. Rain gauge measurements are usually reliable and accurate at a point. They measure rainfall continuously, but spatial sampling is limited. Ground-based radar and satellite remotely sensed precipitation have wide spatial and temporal coverage. However, these measurements are indirect and subject to errors because of equipment, meteorological variability, the heterogeneity of the land surface itself and lack of regular recording. This study seeks to understand precipitation uncertainty and, in doing so, lessen uncertainty propagation into hydrological applications and operations. We reviewed, compared and evaluated the TRMM (Tropical Rainfall Measuring Mission) precipitation products, NOAA's (National Oceanic and Atmospheric Administration) Global Precipitation Climatology Centre (GPCC) monthly precipitation dataset, PRISM (Parameter-elevation Regressions on Independent Slopes Model) data and data from individual climate stations including Cooperative Observer Program (COOP), Remote Automated Weather Stations (RAWS), Soil Climate Analysis Network (SCAN) and Snowpack Telemetry (SNOTEL) stations. Though not yet finalized, this study finds that the uncertainty within precipitation datasets is influenced by regional topography, season, climate and precipitation rate. Ongoing work aims to further evaluate the precipitation datasets based on the relative influence of these phenomena so that we can identify the optimum datasets for input to statewide water budget analysis.

  18. Introducing a Web API for Dataset Submission into a NASA Earth Science Data Center

    NASA Astrophysics Data System (ADS)

    Moroni, D. F.; Quach, N.; Francis-Curley, W.

    2016-12-01

    As the landscape of data becomes increasingly more diverse in the domain of Earth Science, the challenges of managing and preserving data become more onerous and complex, particularly for data centers on fixed budgets and limited staff. Many solutions already exist to ease the cost burden for the downstream component of the data lifecycle, yet most archive centers are still racing to keep up with the influx of new data that still needs to find a quasi-permanent resting place. For instance, having well-defined metadata that is consistent across the entire data landscape provides for well-managed and preserved datasets throughout the latter end of the data lifecycle. Translators between different metadata dialects are already in operational use, and facilitate keeping older datasets relevant in today's world of rapidly evolving metadata standards. However, very little is done to address the first phase of the lifecycle, which deals with the entry of both data and the corresponding metadata into a system that is traditionally opaque and closed off to external data producers, thus resulting in a significant bottleneck to the dataset submission process. The ATRAC system was NOAA NCEI's answer to this previously obfuscated barrier for scientists wishing to find a home for their climate data records, providing a web-based entry point to submit timely and accurate metadata and information about a specific dataset. A couple of NASA's Distributed Active Archive Centers (DAACs), including the ASDC and the ORNL DAAC, have implemented their own versions of a web-based dataset and metadata submission form. The Physical Oceanography DAAC (PO.DAAC) is the most recent of the NASA-operated DAACs to offer its own web-based dataset and metadata submission services to data producers. What makes the PO.DAAC dataset and metadata submission service stand out from these pre-existing services is the option of utilizing both a web browser GUI and a RESTful API to

  19. Operational use of spaceborne lidar datasets

    NASA Astrophysics Data System (ADS)

    Marenco, Franco; Halloran, Gemma; Forsythe, Mary

    2018-04-01

    The Met Office plans to use space lidar datasets from CALIPSO, CATS, Aeolus and EarthCARE operationally in near real time (NRT), for the detection of aerosols. The first step is the development of NRT imagery for nowcasting of volcanic events, air quality, and mineral dust episodes. Model verification and possibly assimilation will be explored. Assimilation trials of Aeolus winds are also planned. Here we will present our first in-house imagery and our operational requirements.

  20. KCMP Minnesota Tall Tower Nitrous Oxide Inverse Modeling Dataset 2010-2015

    DOE Data Explorer

    Griffis, Timothy J. [University of Minnesota; Baker, John; Millet, Dylan; Chen, Zichong; Wood, Jeff; Erickson, Matt; Lee, Xuhui

    2017-01-01

    This dataset contains nitrous oxide mixing ratios and supporting information measured at a tall tower (KCMP, 244 m) site near St. Paul, Minnesota, USA. The data include nitrous oxide and carbon dioxide mixing ratios measured at the 100 m level. Turbulence and wind data were measured using a sonic anemometer at the 185 m level. Also included in this dataset are estimates of the "background" nitrous oxide mixing ratios and monthly concentration source footprints derived from WRF-STILT modeling.

  1. DataMed - an open source discovery index for finding biomedical datasets.

    PubMed

    Chen, Xiaoling; Gururaj, Anupama E; Ozyurt, Burak; Liu, Ruiling; Soysal, Ergin; Cohen, Trevor; Tiryaki, Firat; Li, Yueling; Zong, Nansu; Jiang, Min; Rogith, Deevakar; Salimi, Mandana; Kim, Hyeon-Eui; Rocca-Serra, Philippe; Gonzalez-Beltran, Alejandra; Farcas, Claudiu; Johnson, Todd; Margolis, Ron; Alter, George; Sansone, Susanna-Assunta; Fore, Ian M; Ohno-Machado, Lucila; Grethe, Jeffrey S; Xu, Hua

    2018-01-13

    Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain. DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health-funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine. Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the number of relevant results in the top 10 search results) of 0.6022, by implementing advanced natural language processing and terminology services. Currently, we have made the DataMed system publicly available as an open source package for the biomedical community.
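
    The retrieval metric quoted above, precision at 10, is straightforward to compute; a minimal sketch, assuming a binary relevance list ordered by rank (the function name is illustrative, not from DataMed):

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k ranked results that are relevant.

    relevance -- list of 0/1 relevance judgements, best-ranked first;
                 lists shorter than k are treated as padded with zeros.
    """
    return sum(relevance[:k]) / k

# Six relevant hits in the top ten results gives P@10 = 0.6,
# the same order as the 0.6022 reported for DataMed.
```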

  2. Map_plot and bgg_plot: software for integration of geoscience datasets

    NASA Astrophysics Data System (ADS)

    Gaillot, Philippe; Punongbayan, Jane T.; Rea, Brice

    2004-02-01

    Since 1985, the Ocean Drilling Program (ODP) has been supporting multidisciplinary research in exploring the structure and history of Earth beneath the oceans. After more than 200 Legs, complementary datasets covering different geological environments, periods and space scales have been obtained and distributed world-wide using the ODP-Janus and Lamont Doherty Earth Observatory-Borehole Research Group (LDEO-BRG) database servers. In Earth Sciences, more than in any other science, the ensemble of these data is characterized by heterogeneous formats and graphical representation modes. In order to fully and quickly assess this information, a set of Unix/Linux and Generic Mapping Tool-based C programs has been designed to convert and integrate datasets acquired during the present ODP and the future Integrated ODP (IODP) Legs. Using ODP Leg 199 datasets, we show examples of the capabilities of the proposed programs. The program map_plot is used to easily display datasets onto 2-D maps. The program bgg_plot (borehole geology and geophysics plot) displays data with respect to depth and/or time. The latter program includes depth shifting, filtering and plotting of core summary information, continuous and discrete-sample core measurements (e.g. physical properties, geochemistry, etc.), in situ continuous logs, magneto- and bio-stratigraphies, specific sedimentological analyses (lithology, grain size, texture, porosity, etc.), as well as core and borehole wall images. Outputs from both programs are initially produced in PostScript format that can be easily converted to Portable Document Format (PDF) or standard image formats (GIF, JPEG, etc.) using widely distributed conversion programs. Based on command line operations and customization of parameter files, these programs can be included in other shell- or database-scripts, automating plotting procedures of data requests. As an open source software, these programs can be customized and interfaced to fulfill any specific

  3. a Critical Review of Automated Photogrammetric Processing of Large Datasets

    NASA Astrophysics Data System (ADS)

    Remondino, F.; Nocerino, E.; Toschi, I.; Menna, F.

    2017-08-01

    The paper reports some comparisons between commercial software able to automatically process image datasets for 3D reconstruction purposes. The main aspects investigated in the work are the capability to correctly orient large sets of images of complex environments, the metric quality of the results, replicability and redundancy. Different datasets are employed, each one featuring a different number of images, GSDs at cm and mm resolutions, and ground truth information to perform statistical analyses of the 3D results. A summary of (photogrammetric) terms is also provided, in order to establish rigorous terms of reference for comparisons and critical analyses.

  4. Traffic sign classification with dataset augmentation and convolutional neural network

    NASA Astrophysics Data System (ADS)

    Tang, Qing; Kurnianggoro, Laksono; Jo, Kang-Hyun

    2018-04-01

    This paper presents a method for traffic sign classification using a convolutional neural network (CNN). In this method, we first convert the color image to grayscale and then normalize it to the range (-1, 1) as a preprocessing step. To increase the robustness of the classification model, we apply a dataset augmentation algorithm to create new images for training. To avoid overfitting, we utilize a dropout module before the last fully connected layer. To assess the performance of the proposed method, the German traffic sign recognition benchmark (GTSRB) dataset is utilized. Experimental results show that the method is effective in classifying traffic signs.
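
    The preprocessing step described above (grayscale conversion followed by normalization to (-1, 1)) can be sketched as below; the luminance weights are the common ITU-R BT.601 coefficients, an assumption since the paper does not specify its conversion:

```python
import numpy as np

def preprocess(rgb):
    """Convert an H x W x 3 uint8 image to a grayscale float array
    scaled to the range [-1, 1], as in the described pipeline."""
    rgb = np.asarray(rgb, dtype=np.float64)
    # Weighted luminance (ITU-R BT.601); an unweighted channel mean
    # would also satisfy the paper's description.
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    # Map [0, 255] -> [-1, 1]
    return gray / 127.5 - 1.0
```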

  5. UNCLES: method for the identification of genes differentially consistently co-expressed in a specific subset of datasets.

    PubMed

    Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K

    2015-06-04

    Collective analysis of the increasingly emerging gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is a common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth-knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods, and of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets as it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few

  6. MicroRNA Array Normalization: An Evaluation Using a Randomized Dataset as the Benchmark

    PubMed Central

    Qin, Li-Xuan; Zhou, Qin

    2014-01-01

    MicroRNA arrays possess a number of unique data features that challenge the assumptions key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We conclude the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays. PMID:24905456

  7. Potential for using regional and global datasets for national scale ecosystem service modelling

    NASA Astrophysics Data System (ADS)

    Maxwell, Deborah; Jackson, Bethanna

    2016-04-01

    Ecosystem service models are increasingly being used by planners and policy makers to inform policy development and decisions about national-level resource management. Such models allow ecosystem services to be mapped and quantified, and subsequent changes to these services to be identified and monitored. In some cases, the impact of small-scale changes can be modelled at a national scale, providing more detailed information to decision makers about where to best focus investment and management interventions, while moving toward national goals and/or targets. National-scale modelling often uses national (or local) data (for example, soils, landcover and topographical information) as input. However, there are some places where fine-resolution and/or high-quality national datasets cannot be easily obtained, or do not exist at all. In the absence of such detailed information, regional or global datasets could be used as input to such models. There are questions, however, about the usefulness of these coarser resolution datasets and the extent to which inaccuracies in this data may degrade predictions of existing and potential ecosystem service provision and subsequent decision making. Using LUCI (the Land Utilisation and Capability Indicator) as an example predictive model, we examine how the reliability of predictions changes when national datasets of soil, landcover and topography are substituted with coarser scale regional and global datasets. We specifically look at how LUCI's predictions of water services, such as flood risk, flood mitigation, erosion and water quality, change when national data inputs are replaced by regional and global datasets. Using the Conwy catchment, Wales, as a case study, the land cover products compared are the UK's Land Cover Map (2007), the European CORINE land cover map and the ESA global land cover map. Soils products include the National Soil Map of England and Wales (NatMap) and the European

  8. Empirical study of seven data mining algorithms on different characteristics of datasets for biomedical classification applications.

    PubMed

    Zhang, Yiyan; Xin, Yi; Li, Qin; Ma, Jianshe; Li, Shuai; Lv, Xiaodan; Lv, Weiqi

    2017-11-02

    Various kinds of data mining algorithms are continuously raised with the development of related disciplines. The applicable scopes and the performances of these algorithms differ. Hence, finding a suitable algorithm for a dataset is becoming an important emphasis for biomedical researchers aiming to solve practical problems promptly. In this paper, seven established algorithms, namely, C4.5, support vector machine, AdaBoost, k-nearest neighbor, naïve Bayes, random forest, and logistic regression, were selected as the research objects. The seven algorithms were applied to the 12 top-click UCI public datasets with the task of classification, and their performances were compared through induction and analysis. The sample size, number of attributes, number of missing values, sample size of each class, correlation coefficients between variables, class entropy of the task variable, and the ratio of the sample size of the largest class to that of the smallest class were calculated to characterize the 12 research datasets. The two ensemble algorithms reach high accuracy of classification on most datasets. Moreover, random forest performs better than AdaBoost on the unbalanced dataset of the multi-class task. Simple algorithms, such as the naïve Bayes and logistic regression model, are suitable for a small dataset with high correlation between the task and other non-task attribute variables. K-nearest neighbor and C4.5 decision tree algorithms perform well on binary- and multi-class task datasets. Support vector machine is more adept on the balanced small dataset of the binary-class task. No algorithm can maintain the best performance in all datasets. The applicability of the seven data mining algorithms on the datasets with different characteristics was summarized to provide a reference for biomedical researchers or beginners in different fields.
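
    Two of the dataset descriptors used above, the class entropy of the task variable and the largest-to-smallest class size ratio, are simple to compute; a minimal sketch (the function names are mine, not the paper's):

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    """Shannon entropy (in bits) of the task variable's class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def imbalance_ratio(labels):
    """Sample size of the largest class divided by that of the smallest."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)

# A perfectly balanced binary task has entropy 1 bit and ratio 1.0;
# higher ratios flag the kind of imbalance the paper discusses.
```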

  9. A global gridded dataset of daily precipitation going back to 1950, ideal for analysing precipitation extremes

    NASA Astrophysics Data System (ADS)

    Contractor, S.; Donat, M.; Alexander, L. V.

    2017-12-01

    Reliable observations of precipitation are necessary to determine past changes in precipitation and validate models, allowing for reliable future projections. Existing gauge-based gridded datasets of daily precipitation and satellite-based observations contain artefacts and have a short length of record, making them unsuitable for analysing precipitation extremes. The largest limiting factor for the gauge-based datasets is a dense and reliable station network. Currently, there are two major data archives of global in situ daily rainfall data: the first is the Global Historical Climatology Network (GHCN-Daily), hosted by the National Oceanic and Atmospheric Administration (NOAA), and the other is held by the Global Precipitation Climatology Centre (GPCC), part of the Deutscher Wetterdienst (DWD). We combine the two data archives and use automated quality control techniques to create a reliable long term network of raw station data, which we then interpolate using block kriging to create a global gridded dataset of daily precipitation going back to 1950. We compare our interpolated dataset with existing global gridded data of daily precipitation: NOAA Climate Prediction Centre (CPC) Global V1.0 and GPCC Full Data Daily Version 1.0, as well as various regional datasets. We find that our raw station density is much higher than that of other datasets. To avoid artefacts due to station network variability, we provide multiple versions of our dataset based on various completeness criteria, as well as the standard deviation, kriging error and number of stations for each grid cell and timestep, to encourage responsible use of our dataset. Despite our efforts to increase the raw data density, the in situ station network remains sparse in India after the 1960s and in Africa throughout the timespan of the dataset. 
Our dataset would allow for more reliable global analyses of rainfall including its extremes and pave the way for better global precipitation observations with lower and more transparent uncertainties.

  10. Determining similarity of scientific entities in annotation datasets

    PubMed Central

    Palma, Guillermo; Vidal, Maria-Esther; Haag, Eric; Raschid, Louiqa; Thor, Andreas

    2015-01-01

    Linked Open Data initiatives have made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms from ontologies. Annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities becomes a building block for pattern mining, e.g. identifying drug–drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called ‘AnnSim’ that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1–1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities or where it could lead to predicting new links or identifying potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures. Database URL: http://www.yeastgenome.org/ PMID:25725057
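
    The 1-1 maximum-weight bipartite match at the heart of AnnSim can be illustrated with a brute-force solver (adequate only for small annotation sets; the paper uses an efficient solver, and the Dice-style normalization shown here is my assumption, not necessarily AnnSim's exact formula):

```python
from itertools import permutations

def max_weight_match_similarity(sim):
    """Similarity of two annotated entities via a 1-1 maximum-weight
    bipartite match over their annotation-pair similarities.

    sim -- matrix where sim[i][j] in [0, 1] is the similarity between
           annotation i of entity A and annotation j of entity B.
    """
    n, m = len(sim), len(sim[0])
    if n > m:  # transpose so rows are the smaller side
        sim = [[sim[i][j] for i in range(n)] for j in range(m)]
        n, m = m, n
    # Try every injective assignment of rows to distinct columns.
    best = max(sum(sim[i][cols[i]] for i in range(n))
               for cols in permutations(range(m), n))
    # Normalize by the two annotation-set sizes (Dice-style, assumed).
    return 2.0 * best / (n + m)
```

    With a perfect pairing (e.g. an identity similarity matrix) the score is 1.0; unmatched annotations on the larger side pull the score down.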

  11. The Greenwich Photo-heliographic Results (1874 - 1976): Summary of the Observations, Applications, Datasets, Definitions and Errors

    NASA Astrophysics Data System (ADS)

    Willis, D. M.; Coffey, H. E.; Henwood, R.; Erwin, E. H.; Hoyt, D. V.; Wild, M. N.; Denig, W. F.

    2013-11-01

    The measurements of sunspot positions and areas that were published initially by the Royal Observatory, Greenwich, and subsequently by the Royal Greenwich Observatory (RGO), as the Greenwich Photo-heliographic Results ( GPR), 1874 - 1976, exist in both printed and digital forms. These printed and digital sunspot datasets have been archived in various libraries and data centres. Unfortunately, however, typographic, systematic and isolated errors can be found in the various datasets. The purpose of the present paper is to begin the task of identifying and correcting these errors. In particular, the intention is to provide in one foundational paper all the necessary background information on the original solar observations, their various applications in scientific research, the format of the different digital datasets, the necessary definitions of the quantities measured, and the initial identification of errors in both the printed publications and the digital datasets. Two companion papers address the question of specific identifiable errors; namely, typographic errors in the printed publications, and both isolated and systematic errors in the digital datasets. The existence of two independently prepared digital datasets, which both contain information on sunspot positions and areas, makes it possible to outline a preliminary strategy for the development of an even more accurate digital dataset. Further work is in progress to generate an extremely reliable sunspot digital dataset, based on the programme of solar observations supported for more than a century by the Royal Observatory, Greenwich, and the Royal Greenwich Observatory. This improved dataset should be of value in many future scientific investigations.

  12. Modified Bat Algorithm for Feature Selection with the Wisconsin Diagnosis Breast Cancer (WDBC) Dataset

    PubMed

    Jeyasingh, Suganthi; Veluchamy, Malathi

    2017-05-01

    Early diagnosis of breast cancer is essential to save the lives of patients. Usually, medical datasets include a large variety of data that can lead to confusion during diagnosis. The Knowledge Discovery in Databases (KDD) process helps to improve efficiency. It requires elimination of inappropriate and repeated data from the dataset before final diagnosis. This can be done using any of the feature selection algorithms available in data mining. Feature selection is considered a vital step to increase classification accuracy. This paper proposes a Modified Bat Algorithm (MBA) for feature selection to eliminate irrelevant features from an original dataset. The Bat algorithm was modified using simple random sampling to select random instances from the dataset; these were ranked against the global best features to recognize the predominant features in the dataset. The selected features are used to train a Random Forest (RF) classification algorithm. The MBA feature selection algorithm enhanced the classification accuracy of RF in identifying the occurrence of breast cancer. The Wisconsin Diagnosis Breast Cancer (WDBC) dataset was used for the performance analysis of the proposed MBA feature selection algorithm. The proposed algorithm achieved better performance in terms of Kappa statistic, Matthews Correlation Coefficient, Precision, F-measure, Recall, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE).

  13. Using Third Party Data to Update a Reference Dataset in a Quality Evaluation Service

    NASA Astrophysics Data System (ADS)

    Xavier, E. M. A.; Ariza-López, F. J.; Ureña-Cámara, M. A.

    2016-06-01

    Nowadays it is easy to find many data sources for various regions around the globe. In this 'data overload' scenario there is little, if any, information available about the quality of these data sources. In order to easily provide such data quality information, we presented the architecture of a web service for the automation of quality control of spatial datasets running over a Web Processing Service (WPS). For quality procedures that require an external reference dataset, like positional accuracy or completeness, the architecture permits using a reference dataset. However, this reference dataset is not ageless, since it suffers the natural time degradation inherent to geospatial features. In order to mitigate this problem we propose the Time Degradation & Updating Module, which applies assessed data as a tool to keep the reference database updated. The main idea is to utilize datasets sent to the quality evaluation service as a source of 'candidate data elements' for the updating of the reference database. After the evaluation, if some elements of a candidate dataset reach a determined quality level, they can be used as input data to improve the current reference database. In this work we present the first design of the Time Degradation & Updating Module. We believe that the outcomes can be applied in the search for a fully automatic on-line quality evaluation platform.

  14. Tensor-driven extraction of developmental features from varying paediatric EEG datasets.

    PubMed

    Kinney-Lang, Eli; Spyrou, Loukianos; Ebied, Ahmed; Chin, Richard Fm; Escudero, Javier

    2018-05-21

    Constant changes in developing children's brains can pose a challenge for EEG-dependent technologies. Advancing signal processing methods to identify developmental differences in paediatric populations could help improve the function and usability of such technologies. Taking advantage of the multi-dimensional structure of EEG data through tensor analysis may offer a framework for extracting relevant developmental features of paediatric datasets. A proof of concept is demonstrated through identifying latent developmental features in resting-state EEG. Approach. Three paediatric datasets (n = 50, 17, 44) were analyzed using a two-step constrained parallel factor (PARAFAC) tensor decomposition. Subject age was used as a proxy measure of development. Classification used support vector machines (SVM) to test whether PARAFAC-identified features could predict subject age. The results were cross-validated within each dataset. Classification analysis was complemented by visualization of the high-dimensional feature structures using t-distributed Stochastic Neighbour Embedding (t-SNE) maps. Main Results. Development-related features were successfully identified for the developmental conditions of each dataset. SVM classification showed the identified features could predict subject age at a level significantly above chance for both healthy and impaired populations. t-SNE maps revealed that suitable tensor factorization was key to extracting the developmental features. Significance. The described methods are a promising tool for identifying latent developmental features occurring throughout childhood EEG.

  15. ProDaMa: an open source Python library to generate protein structure datasets.

    PubMed

    Armano, Giuliano; Manconi, Andrea

    2009-10-02

    The huge difference between the number of known sequences and known tertiary structures has justified the use of automated methods for protein analysis. Although a general methodology to solve these problems has not yet been devised, researchers are engaged in developing more accurate techniques and algorithms whose training plays a relevant role in determining their performance. From this perspective, particular importance is given to the training data used in experiments, and researchers are often engaged in the generation of specialized datasets that meet their requirements. To facilitate the task of generating specialized datasets we devised and implemented ProDaMa, an open source Python library that provides classes for retrieving, organizing, updating, analyzing, and filtering protein data. ProDaMa has been used to generate specialized datasets useful for secondary structure prediction and to develop a collaborative web application aimed at generating and sharing protein structure datasets. The library, the related database, and the documentation are freely available at the URL http://iasc.diee.unica.it/prodama.

  16. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.

    PubMed

    Ernst, Jason; Kellis, Manolis

    2015-04-01

    With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.
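    The core idea, predicting an unobserved signal track from correlated observed tracks with an ensemble of regression trees, can be sketched with a tiny boosted-stump regressor. This is a loose, numpy-only illustration of the technique, not the ChromImpute implementation; the data, function names, and parameters are invented.

    ```python
    import numpy as np

    def fit_stump(X, residual):
        """Find the single best axis-aligned split (a depth-1 regression tree)."""
        best = (np.inf, 0, 0.0, 0.0, 0.0)
        for j in range(X.shape[1]):
            for t in np.quantile(X[:, j], (0.25, 0.5, 0.75)):
                left = X[:, j] <= t
                if left.all() or not left.any():
                    continue
                lv, rv = residual[left].mean(), residual[~left].mean()
                sse = ((residual - np.where(left, lv, rv)) ** 2).sum()
                if sse < best[0]:
                    best = (sse, j, t, lv, rv)
        return best[1:]

    def predict_stump(stump, X):
        j, t, lv, rv = stump
        return np.where(X[:, j] <= t, lv, rv)

    def fit_ensemble(X, y, n_trees=100, lr=0.3):
        """Boosted ensemble of stumps: each tree fits the current residual."""
        base = y.mean()
        pred = np.full(len(y), base)
        stumps = []
        for _ in range(n_trees):
            s = fit_stump(X, y - pred)
            stumps.append(s)
            pred += lr * predict_stump(s, X)
        return base, lr, stumps

    def predict_ensemble(model, X):
        base, lr, stumps = model
        out = np.full(len(X), base)
        for s in stumps:
            out += lr * predict_stump(s, X)
        return out
    ```

    Here the columns of X stand in for observed marks at each genomic position and y for the target mark to impute; the real system trains on far richer feature sets across marks and samples.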

  17. Dataset of Passerine bird communities in a Mediterranean high mountain (Sierra Nevada, Spain)

    PubMed Central

    Pérez-Luque, Antonio Jesús; Barea-Azcón, José Miguel; Álvarez-Ruiz, Lola; Bonet-García, Francisco Javier; Zamora, Regino

    2016-01-01

    In this data paper, a dataset of passerine bird communities in Sierra Nevada, a Mediterranean high mountain located in southern Spain, is described. The dataset includes occurrence data from bird surveys conducted in four representative ecosystem types of Sierra Nevada from 2008 to 2015. For each visit, bird species counts as well as distance to the transect line were recorded. A total of 27,847 occurrence records were compiled, with accompanying measurements of distance to the transect and animal counts. All records are of species in the order Passeriformes. Records of 16 different families and 44 genera were collected. Some of the taxa in the dataset are included in the European Red List. This dataset belongs to the Sierra Nevada Global-Change Observatory (OBSNEV), a long-term research project designed to compile socio-ecological information on the major ecosystem types in order to identify the impacts of global change in this area. PMID:26865820

  18. Identifying Martian Hydrothermal Sites: Geological Investigation Utilizing Multiple Datasets

    NASA Technical Reports Server (NTRS)

    Dohm, J. M.; Baker, V. R.; Anderson, R. C.; Scott, D. H.; Rice, J. W., Jr.; Hare, T. M.

    2000-01-01

    Comprehensive geological investigations of Martian landscapes that may have been modified by magmatic-driven hydrothermal activity, utilizing multiple datasets, will yield prime target sites for future hydrological, mineralogical, and biological investigations.

  19. Scrubchem: Building Bioactivity Datasets from Pubchem ...

    EPA Pesticide Factsheets

    The PubChem Bioassay database is a non-curated public repository with data from 64 sources, including: ChEMBL, BindingDb, DrugBank, EPA Tox21, NIH Molecular Libraries Screening Program, and various other academic, government, and industrial contributors. Methods for extracting this public data into quality datasets, usable for analytical research, present several big-data challenges for which we have designed manageable solutions. According to our preliminary work, there are approximately 549 million bioactivity values and related meta-data within PubChem that can be mapped to over 10,000 biological targets. However, this data is not ready for use in data-driven research, mainly due to lack of structured annotations. We used a pragmatic approach that provides increasing access to bioactivity values in the PubChem Bioassay database. This included restructuring of individual PubChem Bioassay files into a relational database (ScrubChem). ScrubChem contains all primary PubChem Bioassay data that was: reparsed; error-corrected (when applicable); enriched with additional data links from other NCBI databases; and improved by adding key biological and assay annotations derived from logic-based language processing rules. The utility of ScrubChem and the curation process were illustrated using an example bioactivity dataset for the androgen receptor protein. This initial work serves as a trial ground for establishing the technical framework for accessing, integrating, cu

  20. The Similarity and Appropriate Usage of Three Honey Bee (Hymenoptera: Apidae) Datasets for Longitudinal Studies.

    PubMed

    Highland, Steven; James, R R

    2016-04-01

    Honey bee (Apis mellifera L., Hymenoptera: Apidae) colonies have experienced profound fluctuations, especially declines, in the past few decades. Long-term datasets on honey bees are needed to identify the most important environmental and cultural factors associated with these changes. While a few such datasets exist, scientists have been hesitant to use some of these due to perceived shortcomings in the data. We compared data and trends for three datasets. Two come from the US Department of Agriculture's National Agricultural Statistics Service (NASS), Agricultural Statistics Board: one is the annual survey of honey-producing colonies from the Annual Bee and Honey program (ABH), and the other is colony counts from the Census of Agriculture conducted every five years. The third dataset we developed from the number of colonies registered annually by some states. We compared the long-term patterns of change in colony numbers among the datasets on a state-by-state basis. The three datasets often showed similar hive numbers and trends, though agreement varied by state, with differences between datasets being greatest for those states receiving a large number of migratory colonies. Dataset comparisons provide a method to estimate the number of colonies in a state used for pollination versus honey production. Some states also had separate data for local and migratory colonies, allowing one to determine whether the migratory colonies were typically used for pollination or honey production. The Census of Agriculture should provide the most accurate long-term data on colony numbers, but only every five years. © The Authors 2016. Published by Oxford University Press on behalf of Entomological Society of America. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  1. Identification and properties of host galaxies of RCR radio sources

    NASA Astrophysics Data System (ADS)

    Zhelenkova, O. P.; Soboleva, N. S.; Majorova, E. K.; Temirova, A. V.

    2013-01-01

    FIRST and NVSS radio maps are used to cross-identify the radio sources of the RCR catalog, which is based on observational data obtained in several runs of the "Cold" survey, with the SDSS and DPOSS digital optical sky surveys and the 2MASS, LAS UKIDSS, and WISE infrared surveys. Digital images in various filters and the coadded gri-band SDSS images, red and infrared DPOSS images, JHK-band UKIDSS images, and JHK-band 2MASS images are analyzed for the sources with no optical candidates found in the above catalogs. Our choice of optical candidates was based on the data on the structure of the radio source, its photometry, and spectroscopy (where available). We found reliable identifications for 86% of the radio sources, possible counterparts for 8% of the sources, and failed to find any optical counterparts for 6% of the sources because their host objects proved to be fainter than the limiting magnitude of the corresponding surveys. A little over half of all the identifications proved to be galaxies; about one quarter were quasars, and the types of the remaining objects were difficult to determine because of their faintness. A relation between the luminosity and the radio-loudness index was derived and used to estimate the 1.4 and 3.94 GHz luminosities for the sources with unknown redshifts. We found 3% and 60% of all the RCR radio sources to be FRI-type objects (L ≲ 10^24 W/Hz at 1.4 GHz) and powerful FRII-type galaxies (L ≳ 10^26.5 W/Hz), respectively, whereas the rest are sources including objects of the FRI, FRII, and mixed FRI-FRII types. Unlike quasars, galaxies show a trend of decreasing luminosity with decreasing flux density. Note that identification would be quite problematic without the software and resources of the virtual observatory.

  2. National Hydropower Plant Dataset, Version 1 (Update FY18Q2)

    DOE Data Explorer

    Samu, Nicole; Kao, Shih-Chieh; O'Connor, Patrick; Johnson, Megan; Uria-Martinez, Rocio; McManamay, Ryan

    2016-09-30

    The National Hydropower Plant Dataset, Version 1, Update FY18Q2, includes geospatial point-level locations and key characteristics of existing hydropower plants in the United States that are currently online. These data are a subset extracted from NHAAP’s Existing Hydropower Assets (EHA) dataset, which is a cornerstone of NHAAP’s EHA effort that has supported multiple U.S. hydropower R&D research initiatives related to market acceleration, environmental impact reduction, technology-to-market activities, and climate change impact assessment.

  3. A Comprehensive, Automatically Updated Fungal ITS Sequence Dataset for Reference-Based Chimera Control in Environmental Sequencing Efforts.

    PubMed

    Nilsson, R Henrik; Tedersoo, Leho; Ryberg, Martin; Kristiansson, Erik; Hartmann, Martin; Unterseher, Martin; Porter, Teresita M; Bengtsson-Palme, Johan; Walker, Donald M; de Sousa, Filipe; Gamper, Hannes Andres; Larsson, Ellen; Larsson, Karl-Henrik; Kõljalg, Urmas; Edgar, Robert C; Abarenkov, Kessy

    2015-01-01

    The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric (artificially joined) DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but they rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation.
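    The principle of reference-based chimera detection, flagging a query whose two halves each match different reference sequences better than the full sequence matches any single reference, can be caricatured in a few lines. This is a deliberately simplified, alignment-free sketch; real tools such as UCHIME use proper alignments and chunked database searches, and all names here are illustrative.

    ```python
    def identity(a, b):
        """Fraction of matching positions between equal-length sequences."""
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def best_hit(query, refs):
        """Best per-position identity of the query against a reference set."""
        return max(identity(query, r) for r in refs)

    def is_chimera(query, refs, margin=0.1):
        """Flag the query as chimeric if its halves match the references
        substantially better than the full-length sequence does."""
        half = len(query) // 2
        full = best_hit(query, refs)
        left = best_hit(query[:half], [r[:half] for r in refs])
        right = best_hit(query[half:], [r[half:] for r in refs])
        return min(left, right) - full >= margin
    ```

    A query stitched from the front of one reference and the back of another scores poorly full-length but near-perfectly per half, which is exactly the signature the margin test detects.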

  4. A Comprehensive, Automatically Updated Fungal ITS Sequence Dataset for Reference-Based Chimera Control in Environmental Sequencing Efforts

    PubMed Central

    Nilsson, R. Henrik; Tedersoo, Leho; Ryberg, Martin; Kristiansson, Erik; Hartmann, Martin; Unterseher, Martin; Porter, Teresita M.; Bengtsson-Palme, Johan; Walker, Donald M.; de Sousa, Filipe; Gamper, Hannes Andres; Larsson, Ellen; Larsson, Karl-Henrik; Kõljalg, Urmas; Edgar, Robert C.; Abarenkov, Kessy

    2015-01-01

    The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric—artificially joined—DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation. PMID:25786896

  5. ConTour: Data-Driven Exploration of Multi-Relational Datasets for Drug Discovery.

    PubMed

    Partl, Christian; Lex, Alexander; Streit, Marc; Strobelt, Hendrik; Wassermann, Anne-Mai; Pfister, Hanspeter; Schmalstieg, Dieter

    2014-12-01

    Large-scale data analysis is now a crucial part of drug discovery. Biologists and chemists need to quickly explore and evaluate potentially effective yet safe compounds based on many datasets that are related to each other. However, there is a lack of tools that support them in these processes. To remedy this, we developed ConTour, an interactive visual analytics technique that enables the exploration of these complex, multi-relational datasets. At its core, ConTour lists all items of each dataset in a column. Relationships between the columns are revealed through interaction: selecting one or multiple items in one column highlights and re-sorts the items in other columns. Filters based on relationships enable drilling down into the large data space. To identify interesting items in the first place, ConTour employs advanced sorting strategies, including strategies based on connectivity strength and uniqueness, as well as sorting based on item attributes. ConTour also introduces interactive nesting of columns, a powerful method to show the related items of a child column for each item in the parent column. Within the columns, ConTour shows rich attribute data about the items as well as information about the connection strengths to other datasets. Finally, ConTour provides a number of detail views, which can show items from multiple datasets and their associated data at the same time. We demonstrate the utility of our system in case studies conducted with a team of chemical biologists, who investigate the effects of chemical compounds on cells and need to understand the underlying mechanisms.

  6. Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

    PubMed Central

    Bhattacherjee, Souvik; Chavan, Amit; Huang, Silu; Deshpande, Amol; Parameswaran, Aditya

    2015-01-01

    The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the storage-recreation trade-off: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways, demonstrate that most of the problems are intractable, and propose a suite of inexpensive heuristics drawing on techniques from the delay-constrained scheduling and spanning-tree literature to solve these problems. We have built a prototype version management system that aims to serve as a foundation to our DataHub system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios. PMID:28752014
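    If only total storage mattered, version storage would reduce to a minimum spanning tree over a graph whose edges are full-storage and delta costs; the paper's six problems arise precisely because recreation time also depends on how long the resulting delta chains get. A minimal sketch of the storage-only case, with hypothetical costs:

    ```python
    def minimum_spanning_tree(n_nodes, edges):
        """Kruskal's MST with union-find. edges: (cost, u, v) tuples.
        Returns (total_cost, chosen_edges)."""
        parent = list(range(n_nodes))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        total, chosen = 0, []
        for cost, u, v in sorted(edges):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                total += cost
                chosen.append((u, v))
        return total, chosen

    # Node 0 is a virtual "empty" version; an edge (cost, 0, v) is the cost of
    # materializing version v in full, and (cost, u, v) the cost of storing a
    # delta between versions u and v. All costs are invented for illustration.
    edges = [
        (100, 0, 1), (102, 0, 2), (101, 0, 3),   # full-storage costs
        (10, 1, 2), (12, 1, 3), (4, 2, 3),       # delta costs
    ]
    total, plan = minimum_spanning_tree(4, edges)
    ```

    Here the MST materializes version 1 in full and stores the others as deltas; recreating version 3 then requires walking the chain 1 -> 2 -> 3, which is the recreation cost the paper trades against storage.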

  7. Cross-Dataset Analysis and Visualization Driven by Expressive Web Services

    NASA Astrophysics Data System (ADS)

    Alexandru Dumitru, Mircea; Catalin Merticariu, Vlad

    2015-04-01

    The deluge of data that is hitting us every day from satellite and airborne sensors is changing the workflow of environmental data analysts and modelers. Web geo-services now play a fundamental role: the data no longer needs to be downloaded and stored beforehand; instead, GIS applications interact with the services in real time. Due to the very large amount of data that is curated and made available by web services, it is crucial to deploy smart solutions for optimizing network bandwidth, reducing duplication of data and moving the processing closer to the data. In this context we have created a visualization application for analysis and cross-comparison of aerosol optical thickness datasets. The application aims to help researchers identify and visualize discrepancies between datasets coming from various sources and having different spatial and temporal resolutions. It also acts as a proof of concept for the integration of OGC Web Services under a user-friendly interface that provides beautiful visualizations of the explored data. The tool was built on top of the World Wind engine, a Java-based virtual globe built by NASA and the open source community. For data retrieval and processing we exploited the potential of the OGC Web Coverage Service, the most exciting aspect being its processing extension, the OGC Web Coverage Processing Service (WCPS) standard. A WCPS-compliant service allows a client to execute a processing query on any coverage offered by the server. By exploiting a full grammar, several different kinds of information can be retrieved from one or more datasets together: scalar condensers, cross-sectional profiles, comparison maps and plots, etc. This combination of technologies made the application versatile and portable. As the processing is done on the server side, we ensured that the minimal amount of data is transferred and that the processing is done on a fully capable server, leaving the client hardware resources to be used for rendering the visualization.

  8. Integrated Strategy Improves the Prediction Accuracy of miRNA in Large Dataset

    PubMed Central

    Lipps, David; Devineni, Sree

    2016-01-01

    MiRNAs are short non-coding RNAs of about 22 nucleotides, which play critical roles in gene expression regulation. The biogenesis of miRNAs is largely determined by the sequence and structural features of their parental RNA molecules. Based on these features, multiple computational tools have been developed to predict whether RNA transcripts contain miRNAs or not. Although very successful, these predictors have started to face multiple challenges in recent years. Many predictors were optimized using datasets of hundreds of miRNA samples. The sizes of these datasets are much smaller than the number of known miRNAs. Consequently, the prediction accuracy of these predictors on large datasets is unknown and needs to be re-tested. In addition, many predictors were optimized for either high sensitivity or high specificity. These optimization strategies may bring serious limitations in applications. Moreover, to meet continually rising expectations of these computational tools, improving the prediction accuracy becomes extremely important. In this study, a meta-predictor, mirMeta, was developed by integrating a set of non-linear transformations with a meta-strategy. More specifically, the outputs of five individual predictors were first preprocessed using non-linear transformations, and then fed into an artificial neural network to make the meta-prediction. The prediction accuracy of the meta-predictor was validated using both multi-fold cross-validation and an independent dataset. The final accuracy of the meta-predictor on a newly designed large dataset improved by 7%, to 93%. The meta-predictor also proved to be less dependent on datasets, and to have a better balance between sensitivity and specificity. This study is important in two ways: First, it shows that the combination of non-linear transformations and artificial neural networks improves the prediction accuracy of individual predictors. Second, a new miRNA predictor with significantly improved prediction accuracy

  9. DAPAGLOCO - A global daily precipitation dataset from satellite and rain-gauge measurements

    NASA Astrophysics Data System (ADS)

    Spangehl, T.; Danielczok, A.; Dietzsch, F.; Andersson, A.; Schroeder, M.; Fennig, K.; Ziese, M.; Becker, A.

    2017-12-01

    The BMBF-funded project framework MiKlip (Mittelfristige Klimaprognosen) develops a global climate forecast system on decadal time scales for operational applications. Within it, the DAPAGLOCO project (Daily Precipitation Analysis for the validation of Global medium-range Climate predictions Operationalized) provides a global precipitation dataset as a combination of microwave-based satellite measurements over ocean and rain-gauge measurements over land on a daily scale. The DAPAGLOCO dataset is created primarily for the evaluation of the MiKlip forecast system. The HOAPS dataset (Hamburg Ocean Atmosphere Parameter and Fluxes from Satellite data) is used for the derivation of precipitation rates over ocean and is extended by the use of measurements from TMI, GMI, and AMSR-E, in addition to measurements from SSM/I and SSMIS. A 1D-Var retrieval scheme is developed to retrieve rain rates from microwave imager data, which also allows for the determination of uncertainty estimates. Over land, the GPCC (Global Precipitation Climatology Center) Full Data Daily product is used. It consists of rain-gauge measurements that are interpolated on a regular grid by ordinary Kriging. The currently available dataset is based on a neural network approach, consists of 21 years of data from 1988 to 2008, and is currently being extended until 2015 using the 1D-Var scheme and with improved sampling. Three dataset versions with different spatial resolutions are available: 1° and 2.5° globally, and 0.5° for Europe. The evaluation of the MiKlip forecast system with DAPAGLOCO is based on ETCCDI indices (Expert Team on Climate Change Detection and Indices). Hindcasts are used for the index-based comparison between model and observations. These indices allow for the evaluation of precipitation extremes, their spatial and temporal distribution, as well as the duration of dry and wet spells, average precipitation amounts, and percentiles on a global scale. In addition, an ETCCDI-based climatology of the DAPAGLOCO

  10. The SAIL databank: linking multiple health and social care datasets

    PubMed Central

    2009-01-01

    Background Vast amounts of data are collected about patients and service users in the course of health and social care service delivery. Electronic data systems for patient records have the potential to revolutionise service delivery and research. But in order to achieve this, it is essential that the ability to link the data at the individual record level be retained whilst adhering to the principles of information governance. The SAIL (Secure Anonymised Information Linkage) databank has been established using disparate datasets, and over 500 million records from multiple health and social care service providers have been loaded to date, with further growth in progress. Methods Having established the infrastructure of the databank, the aim of this work was to develop and implement an accurate matching process to enable the assignment of a unique Anonymous Linking Field (ALF) to person-based records to make the databank ready for record-linkage research studies. An SQL-based matching algorithm (MACRAL, Matching Algorithm for Consistent Results in Anonymised Linkage) was developed for this purpose. Firstly the suitability of using a valid NHS number as the basis of a unique identifier was assessed using MACRAL. Secondly, MACRAL was applied in turn to match primary care, secondary care and social services datasets to the NHS Administrative Register (NHSAR), to assess the efficacy of this process, and the optimum matching technique. Results The validation of using the NHS number yielded specificity values > 99.8% and sensitivity values > 94.6% using probabilistic record linkage (PRL) at the 50% threshold, and error rates were < 0.2%. A range of techniques for matching datasets to the NHSAR were applied and the optimum technique resulted in sensitivity values of: 99.9% for a GP dataset from primary care, 99.3% for a PEDW dataset from secondary care and 95.2% for the PARIS database from social care. Conclusion With the infrastructure that has been put in place, the

  11. The SAIL databank: linking multiple health and social care datasets.

    PubMed

    Lyons, Ronan A; Jones, Kerina H; John, Gareth; Brooks, Caroline J; Verplancke, Jean-Philippe; Ford, David V; Brown, Ginevra; Leake, Ken

    2009-01-16

    Vast amounts of data are collected about patients and service users in the course of health and social care service delivery. Electronic data systems for patient records have the potential to revolutionise service delivery and research. But in order to achieve this, it is essential that the ability to link the data at the individual record level be retained whilst adhering to the principles of information governance. The SAIL (Secure Anonymised Information Linkage) databank has been established using disparate datasets, and over 500 million records from multiple health and social care service providers have been loaded to date, with further growth in progress. Having established the infrastructure of the databank, the aim of this work was to develop and implement an accurate matching process to enable the assignment of a unique Anonymous Linking Field (ALF) to person-based records to make the databank ready for record-linkage research studies. An SQL-based matching algorithm (MACRAL, Matching Algorithm for Consistent Results in Anonymised Linkage) was developed for this purpose. Firstly the suitability of using a valid NHS number as the basis of a unique identifier was assessed using MACRAL. Secondly, MACRAL was applied in turn to match primary care, secondary care and social services datasets to the NHS Administrative Register (NHSAR), to assess the efficacy of this process, and the optimum matching technique. The validation of using the NHS number yielded specificity values > 99.8% and sensitivity values > 94.6% using probabilistic record linkage (PRL) at the 50% threshold, and error rates were < 0.2%. A range of techniques for matching datasets to the NHSAR were applied and the optimum technique resulted in sensitivity values of: 99.9% for a GP dataset from primary care, 99.3% for a PEDW dataset from secondary care and 95.2% for the PARIS database from social care. With the infrastructure that has been put in place, the reliable matching process that has been
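    The probabilistic record linkage (PRL) thresholding described above follows the Fellegi-Sunter approach: each field's agreement or disagreement contributes a log-likelihood-ratio weight, and pair scores above a threshold are accepted as matches. A minimal sketch with invented m/u probabilities and field names (not MACRAL's actual parameters or schema):

    ```python
    import math

    # Hypothetical per-field probabilities: m = P(field agrees | same person),
    # u = P(field agrees | different people).
    FIELDS = {
        "surname":    (0.95, 0.01),
        "birth_date": (0.98, 0.003),
        "postcode":   (0.90, 0.02),
    }

    def match_weight(rec_a, rec_b):
        """Fellegi-Sunter log-likelihood-ratio score summed over fields."""
        score = 0.0
        for field, (m, u) in FIELDS.items():
            if rec_a[field] == rec_b[field]:
                score += math.log2(m / u)          # agreement weight
            else:
                score += math.log2((1 - m) / (1 - u))  # disagreement weight
        return score

    a = {"surname": "JONES", "birth_date": "1970-01-02", "postcode": "SA2 8PP"}
    b = {"surname": "JONES", "birth_date": "1970-01-02", "postcode": "SA2 8PP"}
    c = {"surname": "SMITH", "birth_date": "1980-05-06", "postcode": "CF10 1AA"}
    ```

    Fully agreeing pairs score strongly positive and fully disagreeing pairs strongly negative; the operating threshold (such as the 50% point reported above) is chosen to trade sensitivity against specificity.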

  12. The Role of Datasets on Scientific Influence within Conflict Research.

    PubMed

    Van Holt, Tracy; Johnson, Jeffery C; Moates, Shiloh; Carley, Kathleen M

    2016-01-01

    We inductively tested whether a coherent field of inquiry in human conflict research emerged in an analysis of published research involving "conflict" in the Web of Science (WoS) over a 66-year period (1945-2011). We created a citation network that linked the 62,504 WoS records and their cited literature. We performed a critical path analysis (CPA), a specialized social network analysis, on this citation network (~1.5 million works) to highlight the main contributions in conflict research and to test if research on conflict has in fact evolved to represent a coherent field of inquiry. Out of this vast dataset, 49 academic works were highlighted by the CPA, suggesting a coherent field of inquiry: researchers in the field acknowledge seminal contributions and share a common knowledge base. Other conflict concepts that were also analyzed (such as interpersonal conflict or conflict among pharmaceuticals) did not form their own critical path. A single path formed, meaning that there was a cohesive set of ideas that built upon previous research. This is in contrast to a main path analysis of conflict from 1957-1971, where ideas did not persist: multiple paths existed and died or emerged, reflecting a lack of scientific coherence (Carley, Hummon, and Harty, 1993). The critical path had a number of key features: 1) Concepts that built throughout include the notion that resource availability drives conflict, which emerged in the 1960s-1990s and continued on until 2011. More recent intrastate studies that focused on inequalities emerged from interstate studies on the democracy of peace earlier on the path. 2) Recent research on the path focused on forecasting conflict, which depends on well-developed metrics and theories to model. 3) We used keyword analysis to independently show how the critical path was topically linked (i.e., through democracy, modeling, resources, and geography). Publicly available conflict datasets developed early on helped shape the

  13. SatelliteDL: a Toolkit for Analysis of Heterogeneous Satellite Datasets

    NASA Astrophysics Data System (ADS)

    Galloy, M. D.; Fillmore, D.

    2014-12-01

    SatelliteDL is an IDL toolkit for the analysis of satellite Earth observations from a diverse set of platforms and sensors. The core function of the toolkit is the spatial and temporal alignment of satellite swath and geostationary data. The design features an abstraction layer that allows for easy inclusion of new datasets in a modular way. Our overarching objective is to create utilities that automate the mundane aspects of satellite data analysis, are extensible and maintainable, and do not place limitations on the analysis itself. IDL has a powerful suite of statistical and visualization tools that can be used in conjunction with SatelliteDL. Toward this end we have constructed SatelliteDL to include (1) HTML and LaTeX API document generation, (2) a unit test framework, (3) automatic message and error logs, (4) HTML and LaTeX plot and table generation, and (5) several real world examples with bundled datasets available for download. For ease of use, datasets, variables and optional workflows may be specified in a flexible format configuration file. Configuration statements may specify, for example, a region and date range, and the creation of images, plots and statistical summary tables for a long list of variables. SatelliteDL enforces data provenance; all data should be traceable and reproducible. The output NetCDF file metadata holds a complete history of the original datasets and their transformations, and a method exists to reconstruct a configuration file from this information. Release 0.1.0 distributes with ingest methods for GOES, MODIS, VIIRS and CERES radiance data (L1) as well as select 2D atmosphere products (L2) such as aerosol and cloud (MODIS and VIIRS) and radiant flux (CERES). Future releases will provide ingest methods for ocean and land surface products, gridded and time averaged datasets (L3 Daily, Monthly and Yearly), and support for 3D products such as temperature and water vapor profiles. Emphasis will be on NPP Sensor, Environmental and

  14. Level 2 Ancillary Products and Datasets Algorithm Theoretical Basis

    NASA Technical Reports Server (NTRS)

    Diner, D.; Abdou, W.; Gordon, H.; Kahn, R.; Knyazikhin, Y.; Martonchik, J.; McDonald, D.; McMuldroch, S.; Myneni, R.; West, R.

    1999-01-01

    This Algorithm Theoretical Basis (ATB) document describes the algorithms used to generate the parameters of certain ancillary products and datasets used during Level 2 processing of Multi-angle Imaging SpectroRadiometer (MISR) data.

  15. Treetrimmer: a method for phylogenetic dataset size reduction.

    PubMed

    Maruyama, Shinichiro; Eveleigh, Robert J M; Archibald, John M

    2013-04-12

    With rapid advances in genome sequencing and bioinformatics, it is now possible to generate phylogenetic trees containing thousands of operational taxonomic units (OTUs) from a wide range of organisms. However, use of rigorous tree-building methods on such large datasets is prohibitive and manual 'pruning' of sequence alignments is time consuming and raises concerns over reproducibility. There is a need for bioinformatic tools with which to objectively carry out such pruning procedures. Here we present 'TreeTrimmer', a bioinformatics procedure that removes unnecessary redundancy in large phylogenetic datasets, alleviating the size effect on more rigorous downstream analyses. The method identifies and removes user-defined 'redundant' sequences, e.g., orthologous sequences from closely related organisms and 'recently' evolved lineage-specific paralogs. Representative OTUs are retained for more rigorous re-analysis. TreeTrimmer reduces the OTU density of phylogenetic trees without sacrificing taxonomic diversity while retaining the original tree topology, thereby speeding up downstream computer-intensive analyses, e.g., Bayesian and maximum likelihood tree reconstructions, in a reproducible fashion.
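    The pruning idea behind TreeTrimmer can be sketched in a few lines. This is an illustrative reduction of the method, not the published implementation: it assumes leaves arrive in tree (depth-first) order, so runs of same-taxon leaves stand in for lineage-specific clades, and it keeps one representative per run.

```python
from itertools import groupby

def prune_redundant(leaves, max_per_taxon=1):
    """Keep at most `max_per_taxon` representatives of each taxon within a
    run of consecutive same-taxon leaves (a stand-in for a clade)."""
    kept = []
    for taxon, group in groupby(leaves, key=lambda leaf: leaf[1]):
        kept.extend(list(group)[:max_per_taxon])
    return kept

# Leaves in tree (depth-first) order: (sequence id, taxon)
leaves = [("seq1", "E.coli"), ("seq2", "E.coli"), ("seq3", "E.coli"),
          ("seq4", "S.cerevisiae"), ("seq5", "E.coli")]
print(prune_redundant(leaves))
# → [('seq1', 'E.coli'), ('seq4', 'S.cerevisiae'), ('seq5', 'E.coli')]
```

    In the actual tool, the groups come from the tree topology and user-defined taxonomic categories rather than simple adjacency, which is what preserves the original topology while reducing OTU density.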

  16. Collaboration tools and techniques for large model datasets

    USGS Publications Warehouse

    Signell, R.P.; Carniel, S.; Chiggiato, J.; Janekovic, I.; Pullen, J.; Sherwood, C.R.

    2008-01-01

    In MREA and many other marine applications, it is common to have multiple models running with different grids, run by different institutions. Techniques and tools are described for low-bandwidth delivery of data from large multidimensional datasets, such as those from meteorological and oceanographic models, directly into generic analysis and visualization tools. Output is stored using the NetCDF CF Metadata Conventions, and then delivered to collaborators over the web via OPeNDAP. OPeNDAP datasets served by different institutions are then organized via THREDDS catalogs. Tools and procedures are then used which enable scientists to explore data on the original model grids using tools they are familiar with. It is also low-bandwidth, enabling users to extract just the data they require, an important feature for access from ship or remote areas. The entire implementation is simple enough to be handled by modelers working with their webmasters - no advanced programming support is necessary. © 2007 Elsevier B.V. All rights reserved.
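    The low-bandwidth access rests on OPeNDAP constraint expressions, which let a client request a single hyperslab of one variable instead of downloading the whole file. A minimal sketch of building such a request URL (the server path and variable name are hypothetical):

```python
def opendap_subset_url(base, var, slices):
    """Build an OPeNDAP constraint expression asking the server for just
    one hyperslab of `var`; each (start, stop) pair subsets one dimension."""
    idx = "".join("[%d:%d]" % (a, b) for a, b in slices)
    return "%s.dods?%s%s" % (base, var, idx)

# e.g. one time step, 20 depth levels, and a small lat/lon window
url = opendap_subset_url("http://example.org/thredds/dodsC/model_out.nc",
                         "salinity", [(0, 0), (0, 19), (100, 150), (200, 260)])
print(url)
```

    In practice OPeNDAP-aware clients build these expressions automatically from ordinary array-slicing syntax; only the requested subset crosses the network, which is what makes shipboard access practical.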

  17. Datasets collected in general practice: an international comparison using the example of obesity.

    PubMed

    Sturgiss, Elizabeth; van Boven, Kees

    2018-06-04

    International datasets from general practice enable the comparison of how conditions are managed within consultations in different primary healthcare settings. The Australian Bettering the Evaluation and Care of Health (BEACH) and TransHIS from the Netherlands collect in-consultation general practice data that have been used extensively to inform local policy and practice. Obesity is a global health issue, with different countries applying varying approaches to management. The objective of the present paper is to compare the primary care management of obesity in Australia and the Netherlands using data collected from consultations. Despite the different prevalence of obesity in the two countries, the number of patients per 1000 patient-years seen with obesity is similar. Patients in Australia with obesity are referred to allied health practitioners more often than Dutch patients. Without quality general practice data, primary care researchers will not have data about the management of conditions within consultations. We use obesity to highlight the strengths of these general practice data sources and to compare their differences. What is known about the topic? Australia had one of the longest-running consecutive datasets about general practice activity in the world, but it has recently lost government funding. The Netherlands has a longitudinal general practice dataset of information collected within consultations since 1985. What does this paper add? We discuss the benefits of general practice-collected data in two countries. Using obesity as a case example, we compare management in general practice between Australia and the Netherlands. This type of analysis should be the starting point for any international collaboration on the primary care management of a health condition. Having a national general practice dataset allows international comparisons of the management of conditions within primary care.
Without a current, quality general practice dataset, primary care researchers will not

  18. Associating uncertainty with datasets using Linked Data and allowing propagation via provenance chains

    NASA Astrophysics Data System (ADS)

    Car, Nicholas; Cox, Simon; Fitch, Peter

    2015-04-01

    With earth-science datasets increasingly being published to enable re-use in projects disassociated from the original data acquisition or generation, there is an urgent need for associated metadata to be connected, in order to guide their application. In particular, provenance traces should support the evaluation of data quality and reliability. However, while standards for describing provenance are emerging (e.g. PROV-O), these do not include the necessary statistical descriptors and confidence assessments. UncertML has a mature conceptual model that may be used to record uncertainty metadata. However, by itself UncertML does not support the representation of uncertainty of multi-part datasets, and provides no direct way of associating the uncertainty information - metadata in relation to a dataset - with dataset objects. We present a method to address both these issues by combining UncertML with PROV-O, and delivering the resulting uncertainty-enriched provenance traces through the Linked Data API. UncertProv extends the PROV-O provenance ontology with an RDF formulation of the UncertML conceptual model elements, adds further elements to support uncertainty representation without a conceptual model, and integrates UncertML through links to documents. The Linked Data API provides a systematic way of navigating from dataset objects to their UncertProv metadata and back again. The Linked Data API's 'views' capability enables access to UncertML and non-UncertML uncertainty metadata representations for a dataset. With this approach, it is possible to access and navigate the uncertainty metadata associated with a published dataset using standard semantic web tools, such as SPARQL queries. Where the uncertainty data follows the UncertML model it can be automatically interpreted and may also support automatic uncertainty propagation.
Repositories wishing to enable uncertainty propagation for all datasets must ensure that all elements that are associated with uncertainty
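    The navigation pattern described — from a dataset object to its uncertainty metadata via typed links — can be illustrated with a toy triple store. The `ex:`, `prov:` and `un:` terms below are illustrative stand-ins, not terms taken from the actual UncertProv ontology:

```python
# Triples as (subject, predicate, object) tuples; a real deployment would
# use an RDF store and SPARQL rather than this in-memory sketch.
triples = {
    ("ex:dataset1", "prov:wasDerivedFrom", "ex:rawObs"),
    ("ex:dataset1", "un:distribution", "un:NormalDistribution"),
    ("ex:dataset1", "un:standardDeviation", "0.25"),
}

def query(s=None, p=None, o=None):
    """Match triples against an optional pattern (None = wildcard),
    mimicking a basic SPARQL triple pattern."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Navigate from a dataset object to one piece of its uncertainty metadata:
print(query(s="ex:dataset1", p="un:standardDeviation"))
```

    Because the uncertainty statements hang off the same node as the provenance statements, a client that walks the `prov:` links can pick up the `un:` metadata in the same traversal, which is the property that enables propagation along provenance chains.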

  19. Comparison of present global reanalysis datasets in the context of a statistical downscaling method for precipitation prediction

    NASA Astrophysics Data System (ADS)

    Horton, Pascal; Weingartner, Rolf; Brönnimann, Stefan

    2017-04-01

    The analogue method is a statistical downscaling method for precipitation prediction. It uses similarity in terms of synoptic-scale predictors with situations in the past in order to provide a probabilistic prediction for the day of interest. It has been used for decades in the context of weather and flood forecasting, and has more recently also been applied to climate studies, whether for the reconstruction of past weather conditions or for future climate impact studies. In order to evaluate the relationship between synoptic-scale predictors and the local weather variable of interest, e.g. precipitation, reanalysis datasets are necessary. The number of available reanalysis datasets is steadily increasing. These are generated by different atmospheric models with different assimilation techniques and offer various spatial and temporal resolutions. A major difference between these datasets is also the length of the archive they provide. While some datasets start at the beginning of the satellite era (1980) and assimilate these data, others aim at homogeneity over a longer period (e.g. the 20th century) and only assimilate conventional observations. The context of the application of analogue methods might drive the choice of an appropriate dataset, for example when the archive length is a leading criterion. However, in many studies, a reanalysis dataset is subjectively chosen, according to the user's preferences or the ease of access. The impact of this choice on the results of the downscaling procedure is rarely considered and no comprehensive comparison has been undertaken so far. In order to fill this gap and to advise on the choice of appropriate datasets, nine different global reanalysis datasets were compared in seven distinct versions of analogue methods, over 300 precipitation stations in Switzerland. Significant differences in terms of prediction performance were identified.
Although the impact of the reanalysis dataset on the skill score varies according to the chosen predictor, be
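    The core of an analogue method can be sketched as a nearest-neighbour search over an archive of past days. Real implementations score similarity over whole synoptic-scale predictor fields with calibrated analogy criteria, so this is only a schematic, and the predictor values below are invented:

```python
import math

def analogue_forecast(target, archive, k=3):
    """Rank archive days by Euclidean distance between predictor vectors;
    the precipitation of the k best analogues forms an empirical
    (probabilistic) prediction for the target day."""
    ranked = sorted(archive, key=lambda day: math.dist(target, day["predictors"]))
    return [day["precip"] for day in ranked[:k]]

# Toy archive: (geopotential-like value, humidity-like value) per past day
archive = [
    {"predictors": (500.0, 0.8), "precip": 12.0},
    {"predictors": (540.0, 0.2), "precip": 0.0},
    {"predictors": (505.0, 0.7), "precip": 8.0},
    {"predictors": (520.0, 0.4), "precip": 2.0},
]
print(analogue_forecast((502.0, 0.75), archive, k=2))
# → [12.0, 8.0]
```

    The choice of reanalysis dataset enters exactly here: it supplies the `predictors` values for both the target day and the archive, so a different reanalysis changes which past days rank as the best analogues.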

  20. PLPD: reliable protein localization prediction from imbalanced and overlapped datasets

    PubMed Central

    Lee, KiYoung; Kim, Dae-Won; Na, DoKyun; Lee, Kwang H.; Lee, Doheon

    2006-01-01

    Subcellular localization is one of the key functional characteristics of proteins. An automatic and efficient prediction method for protein subcellular localization is highly desirable, given the needs of large-scale genome analysis. From a machine learning point of view, a dataset of protein localization has several characteristics: the dataset has many classes (there are more than 10 localizations in a cell), it is a multi-label dataset (a protein may occur in several different subcellular locations), and it is imbalanced (the number of proteins in each localization differs markedly). Even though many previous works have addressed the prediction of protein subcellular localization, none of them effectively tackles all these characteristics at the same time. Thus, a new computational method for protein localization is needed for more reliable outcomes. To address the issue, we present a protein localization predictor based on D-SVDD (PLPD) for the prediction of protein localization, which can find the likelihood of a specific localization of a protein more easily and more correctly. Moreover, we introduce three measurements for the more precise evaluation of a protein localization predictor. Across various datasets constructed from the experiments of Huh et al. (2003), the proposed PLPD method represents a different approach that might play a complementary role to existing methods, such as the Nearest Neighbor method and the discriminant covariant method. Finally, after finding a good boundary for each localization using the 5184 classified proteins as training data, we predicted 138 proteins whose subcellular localizations could not be clearly observed by the experiments of Huh et al. (2003). PMID:16966337

  1. Discovery and Analysis of Intersecting Datasets: JMARS as a Comparative Science Platform

    NASA Astrophysics Data System (ADS)

    Carter, S.; Christensen, P. R.; Dickenshied, S.; Anwar, S.; Noss, D.

    2014-12-01

    A great deal can be discovered from comparing and studying a chosen region or area on a planetary body. Science now has an enormous number of instruments and datasets to draw from; often the first obstacle is finding the right information. Developed at Arizona State University, Java Mission-planning and Analysis for Remote Sensing (JMARS) enables users to easily find and study related datasets. JMARS supports a long list of planetary bodies in our solar system, including Earth, the Moon, Mars, and other planets, satellites, and asteroids. Within JMARS a user can start with a particular area and search for all datasets that have images/information intersecting that region of interest. Once a user has found data they are interested in comparing, they can view the images together and see the numeric information at each location. This information can be analyzed in a few powerful ways. If the dataset of interest varies with time but the location stays constant, then the user may want to compare specific locations through time. This can be done with the Investigate Tool in JMARS. Users can create a Data Spike and the information at that point will be plotted through time. If the region does not have a temporal dataset, then a different method would be suitable, one that involves a profile line. Also using the Investigate Tool, a user can create a Data Profile (a line which can contain as many vertices as necessary) and all numeric data underneath the line will be plotted on one graph for easy comparison. This can be used to compare differences between similar datasets - perhaps the same measurement but from different instruments - or to find correlations from one dataset to another. A third form of analysis is planned for future development. This method involves entire areas (polygons). Sampling of the different data sources beneath an area can reveal statistics like maximum, minimum, and average values, and standard deviation.
These values can be compared to other data
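    The Data Profile idea — plotting all numeric data underneath a line — amounts to sampling a raster along a transect. A simplified nearest-neighbour version (the elevation grid and endpoints are invented; JMARS interpolates over real instrument data):

```python
def sample_profile(grid, p0, p1, n):
    """Sample a raster at n evenly spaced points along the line from
    grid cell p0 to p1 using nearest-neighbour lookup."""
    (r0, c0), (r1, c1) = p0, p1
    samples = []
    for i in range(n):
        t = i / (n - 1)                     # fraction of the way along the line
        r = round(r0 + t * (r1 - r0))
        c = round(c0 + t * (c1 - c0))
        samples.append(grid[r][c])
    return samples

elevation = [[10, 12, 14, 16],
             [11, 13, 15, 17],
             [12, 14, 16, 18]]
print(sample_profile(elevation, (0, 0), (2, 3), 4))
# → [10, 13, 15, 18]
```

    Running the same transect over two rasters (say, the same measurement from two instruments) yields two aligned series whose differences and correlations can be read directly off one plot, which is the comparison the Data Profile enables.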

  2. GLEAM v3: updated land evaporation and root-zone soil moisture datasets

    NASA Astrophysics Data System (ADS)

    Martens, Brecht; Miralles, Diego; Lievens, Hans; van der Schalie, Robin; de Jeu, Richard; Fernández-Prieto, Diego; Verhoest, Niko

    2016-04-01

    Evaporation determines the availability of surface water resources and the requirements for irrigation. In addition, through its impacts on the water, carbon and energy budgets, evaporation influences the occurrence of rainfall and the dynamics of air temperature. Therefore, reliable estimates of this flux at regional to global scales are of major importance for water management and meteorological forecasting of extreme events. However, the global-scale magnitude and variability of the flux, and the sensitivity of the underlying physical process to changes in environmental factors, are still poorly understood due to the limited global coverage of in situ measurements. Remote sensing techniques can help to overcome the lack of ground data. However, evaporation is not directly observable from satellite systems. As a result, recent efforts have focussed on combining the observable drivers of evaporation within process-based models. The Global Land Evaporation Amsterdam Model (GLEAM, www.gleam.eu) estimates terrestrial evaporation based on daily satellite observations of meteorological drivers of terrestrial evaporation, vegetation characteristics and soil moisture. Since the publication of the first version of the model in 2011, GLEAM has been widely applied for the study of trends in the water cycle, interactions between land and atmosphere and hydrometeorological extreme events. A third version of the GLEAM global datasets will be available from the beginning of 2016 and will be distributed using www.gleam.eu as gateway. The updated datasets include separate estimates for the different components of the evaporative flux (i.e. transpiration, bare-soil evaporation, interception loss, open-water evaporation and snow sublimation), as well as variables like the evaporative stress, potential evaporation, root-zone soil moisture and surface soil moisture. A new dataset using SMOS-based input data of surface soil moisture and vegetation optical depth will also be

  3. Identification of fungi in shotgun metagenomics datasets

    PubMed Central

    Donovan, Paul D.; Gonzalez, Gabriel; Higgins, Desmond G.

    2018-01-01

    Metagenomics uses nucleic acid sequencing to characterize species diversity in different niches such as environmental biomes or the human microbiome. Most studies have used 16S rRNA amplicon sequencing to identify bacteria. However, the decreasing cost of sequencing has resulted in a gradual shift away from amplicon analyses and towards shotgun metagenomic sequencing. Shotgun metagenomic data can be used to identify a wide range of species, but have rarely been applied to fungal identification. Here, we develop a sequence classification pipeline, FindFungi, and use it to identify fungal sequences in public metagenome datasets. We focus primarily on animal metagenomes, especially those from pig and mouse microbiomes. We identified fungi in 39 of 70 datasets comprising 71 fungal species. At least 11 pathogenic species with zoonotic potential were identified, including Candida tropicalis. We identified Pseudogymnoascus species from 13 Antarctic soil samples initially analyzed for the presence of bacteria capable of degrading diesel oil. We also show that Candida tropicalis and Candida loboi are likely the same species. In addition, we identify several examples where contaminating DNA was erroneously included in fungal genome assemblies. PMID:29444186

  4. A Dataset of Three Educational Technology Experiments on Differentiation, Formative Testing and Feedback

    ERIC Educational Resources Information Center

    Haelermans, Carla; Ghysels, Joris; Prince, Fernao

    2015-01-01

    This paper describes a dataset with data from three individually randomized educational technology experiments on differentiation, formative testing and feedback during one school year for a group of 8th grade students in the Netherlands, using administrative data and the online motivation questionnaire of Boekaerts. The dataset consists of pre-…

  5. New public dataset for spotting patterns in medieval document images

    NASA Astrophysics Data System (ADS)

    En, Sovann; Nicolas, Stéphane; Petitjean, Caroline; Jurie, Frédéric; Heutte, Laurent

    2017-01-01

    With advances in technology, a large part of our cultural heritage is becoming digitally available. In particular, in the field of historical document image analysis, there is now a growing need for indexing and data mining tools, thus allowing us to spot and retrieve the occurrences of an object of interest, called a pattern, in a large database of document images. Patterns may present some variability in terms of color, shape, or context, making the spotting of patterns a challenging task. Pattern spotting is a relatively new field of research, still hampered by the lack of available annotated resources. We present a new publicly available dataset named DocExplore dedicated to spotting patterns in historical document images. The dataset contains 1500 images and 1464 queries, and allows the evaluation of two tasks: image retrieval and pattern localization. A standardized benchmark protocol along with ad hoc metrics is provided for a fair comparison of the submitted approaches. We also provide some first results obtained with our baseline system on this new dataset, which show that there is room for improvement and should encourage researchers in the document image analysis community to design new systems and submit improved results.

  6. Advanced Neuropsychological Diagnostics Infrastructure (ANDI): A Normative Database Created from Control Datasets.

    PubMed

    de Vent, Nathalie R; Agelink van Rentergem, Joost A; Schmand, Ben A; Murre, Jaap M J; Huizenga, Hilde M

    2016-01-01

    In the Advanced Neuropsychological Diagnostics Infrastructure (ANDI), datasets of several research groups are combined into a single database, containing scores on neuropsychological tests from healthy participants. For the most popular neuropsychological tests, the quantity and range of these data surpass those of traditional normative data, thereby enabling more accurate neuropsychological assessment. Because of the unique structure of the database, it facilitates normative comparison methods that were not feasible before, in particular those in which entire profiles of scores are evaluated. In this article, we describe the steps that were necessary to combine the separate datasets into a single database. These steps involve matching variables from multiple datasets, removing outlying values, determining the influence of demographic variables, and finding appropriate transformations to normality. Also, a brief description of the current contents of the ANDI database is given.
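    The basic normative comparison such a database supports is locating a patient's score within the healthy-control distribution. A minimal single-test sketch (the scores are invented; ANDI's actual methods additionally correct for demographic variables and evaluate whole profiles of scores):

```python
from statistics import mean, stdev

def normative_z(score, controls):
    """Express a patient's test score as a z-score relative to the
    distribution of healthy-control scores."""
    return (score - mean(controls)) / stdev(controls)

controls = [48, 52, 50, 47, 53, 49, 51, 50]  # invented control scores
z = normative_z(41, controls)
print(round(z, 2))
# → -4.5
```

    A strongly negative z-score flags a score that is unusually low for the control population; pooling many control datasets, as ANDI does, tightens the estimates of the mean and standard deviation that this comparison depends on.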

  7. GLEAM version 3: Global Land Evaporation Datasets and Model

    NASA Astrophysics Data System (ADS)

    Martens, B.; Miralles, D. G.; Lievens, H.; van der Schalie, R.; de Jeu, R.; Fernandez-Prieto, D.; Verhoest, N.

    2015-12-01

    Terrestrial evaporation links energy, water and carbon cycles over land and is therefore a key variable of the climate system. However, the global-scale magnitude and variability of the flux, and the sensitivity of the underlying physical process to changes in environmental factors, are still poorly understood due to limitations in in situ measurements. As a result, several methods have arisen to estimate global patterns of land evaporation from satellite observations. However, these algorithms generally differ in their approach to model evaporation, resulting in large differences in their estimates. One of these methods is GLEAM, the Global Land Evaporation: the Amsterdam Methodology. GLEAM estimates terrestrial evaporation based on daily satellite observations of meteorological variables, vegetation characteristics and soil moisture. Since the publication of the first version of the algorithm (2011), the model has been widely applied to analyse trends in the water cycle and land-atmospheric feedbacks during extreme hydrometeorological events. A third version of the GLEAM global datasets is foreseen by the end of 2015. Given the relevance of having a continuous and reliable record of global-scale evaporation estimates for climate and hydrological research, the establishment of an online data portal to serve these data to the public is also foreseen. In this new release of the GLEAM datasets, different components of the model have been updated, with the most significant change being the revision of the data assimilation algorithm. In this presentation, we will highlight the most important changes of the methodology and present three new GLEAM datasets and their validation against in situ observations and an alternative dataset of terrestrial evaporation (ERA-Land). Results of the validation exercise indicate that the magnitude and the spatiotemporal variability of the modelled evaporation agree reasonably well with the estimates of ERA-Land and the in situ

  8. Dataset of herbarium specimens of threatened vascular plants in Catalonia

    PubMed Central

    Nualart, Neus; Ibáñez, Neus; Luque, Pere; Pedrol, Joan; Vilar, Lluís; Guàrdia, Roser

    2017-01-01

    This data paper describes a specimen dataset of the Catalonian threatened vascular plants conserved in five public Catalonian herbaria (BC, BCN, HGI, HBIL and MTTE). Catalonia is an administrative region of Spain that harbours a large diversity of autochthonous plants, including 199 taxa with IUCN threatened categories (EX, EW, RE, CR, EN and VU). This dataset includes 1,618 records collected from the 17th century to the present day. For each specimen, the species name, locality indication, collection date, collector, ecology and revision label are recorded. More than 94% of the taxa are represented in the herbaria, which evidences the role of botanical collections as an essential source of occurrence data. PMID:28814919

  9. Detecting and Quantifying Forest Change: The Potential of Existing C- and X-Band Radar Datasets.

    PubMed

    Tanase, Mihai A; Ismail, Ismail; Lowell, Kim; Karyanto, Oka; Santoro, Maurizio

    2015-01-01

    This paper evaluates the opportunity provided by global interferometric radar datasets for monitoring deforestation, degradation and forest regrowth in tropical and semi-arid environments. The paper describes an easy-to-implement method for detecting spatial changes in forests and estimating their magnitude. The datasets were acquired by space-borne high-spatial-resolution radar missions at near-global scales, and are thus significant for monitoring systems developed under the United Nations Framework Convention on Climate Change (UNFCCC). The approach presented in this paper was tested in two areas located in Indonesia and Australia. Forest change estimation was based on differences between a reference dataset acquired in February 2000 by the Shuttle Radar Topography Mission (SRTM) and TanDEM-X mission (TDM) datasets acquired in 2011 and 2013. The synergy between SRTM and TDM datasets allowed not only identifying changes in forest extent but also estimating their magnitude with respect to the reference through variations in forest height.
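    The change-detection logic described — differencing a later TanDEM-X acquisition against the SRTM reference and flagging large height drops — can be sketched as follows. The grids and the loss threshold are invented for illustration:

```python
def height_change(srtm, tdm, loss_threshold=-5.0):
    """Per-pixel height difference between a TanDEM-X grid and the SRTM
    reference; large negative differences flag candidate forest loss."""
    diff = [[t - s for s, t in zip(row_s, row_t)]
            for row_s, row_t in zip(srtm, tdm)]
    loss = [[d <= loss_threshold for d in row] for row in diff]
    return diff, loss

srtm = [[30.0, 28.0], [25.0, 5.0]]   # reference surface heights (Feb 2000)
tdm  = [[29.0, 12.0], [24.0, 5.0]]   # later acquisition (2011/2013)
diff, loss = height_change(srtm, tdm)
print(loss)
# → [[False, True], [False, False]]
```

    The magnitude of the negative difference, not just its sign, is what lets the method estimate how much forest height was lost rather than only where the extent changed.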

  10. A Benchmark Dataset and Saliency-guided Stacked Autoencoders for Video-based Salient Object Detection.

    PubMed

    Li, Jia; Xia, Changqun; Chen, Xiaowu

    2017-10-12

    Image-based salient object detection (SOD) has been extensively studied in past decades. However, video-based SOD is much less explored due to the lack of large-scale video datasets within which salient objects are unambiguously defined and annotated. Toward this end, this paper proposes a video-based SOD dataset that consists of 200 videos. In constructing the dataset, we manually annotate all objects and regions over 7,650 uniformly sampled keyframes and collect the eye-tracking data of 23 subjects who free-view all videos. From the user data, we find that salient objects in a video can be defined as objects that consistently pop-out throughout the video, and objects with such attributes can be unambiguously annotated by combining manually annotated object/region masks with eye-tracking data of multiple subjects. To the best of our knowledge, it is currently the largest dataset for video-based salient object detection. Based on this dataset, this paper proposes an unsupervised baseline approach for video-based SOD by using saliency-guided stacked autoencoders. In the proposed approach, multiple spatiotemporal saliency cues are first extracted at the pixel, superpixel and object levels. With these saliency cues, stacked autoencoders are constructed in an unsupervised manner that automatically infers a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. In experiments, the proposed unsupervised approach is compared with 31 state-of-the-art models on the proposed dataset and outperforms 30 of them, including 19 image-based classic (unsupervised or non-deep learning) models, six image-based deep learning models, and five video-based unsupervised models. Moreover, benchmarking results show that the proposed dataset is very challenging and has the potential to boost the development of video-based SOD.

  11. Evaluation of Remote Sensing and Hydrological Model Based Soil Moisture Datasets in Drought Perspective

    NASA Astrophysics Data System (ADS)

    Hüsami Afşar, M.; Bulut, B.; Yilmaz, M. T.

    2017-12-01

    Soil moisture is one of the fundamental parameters of the environment and plays a major role in the carbon, energy and water cycles. The spatial distribution and temporal changes of soil moisture are important components of climatic, ecological and natural-hazard studies at global, regional and local scales. Therefore, the retrieval of soil moisture datasets is of great importance in these studies. Given that soil moisture can be retrieved through different platforms (i.e., in-situ measurements, numerical modeling, and remote sensing) for the same location and time period, it is often desirable to evaluate these different datasets to identify the most accurate estimates for different purposes. During the last decades, efforts have been made to evaluate different soil moisture products based on various statistical analyses of the soil moisture time series (i.e., comparison of correlation, bias, and error standard deviation). On the other hand, there is still a need for comparisons of soil moisture products in a drought analysis context. In this study, LPRM and NOAH Land Surface Model soil moisture datasets are investigated in a drought analysis context using station-based watershed-average datasets obtained over four USDA ARS watersheds as ground truth. Here, the drought analysis is performed using standardized soil moisture datasets (i.e., zero mean and unit standard deviation), while droughts are defined as consecutive negative anomalies of less than -1 lasting longer than 3 months. Accordingly, the drought characteristics (duration and severity) and the false alarm and hit/miss ratios of the LPRM and NOAH datasets are validated using the station-based datasets as ground truth. Results showed that although the NOAH soil moisture products have better correlations, LPRM-based soil moisture retrievals show better consistency in drought analysis. This project is supported by TUBITAK Project number 114Y676.
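    The drought definition used here — runs of standardized anomalies below -1 lasting at least several months — is straightforward to operationalize. A sketch, assuming the monthly series is already standardized to zero mean and unit standard deviation (the anomaly values are invented):

```python
def droughts(anomalies, threshold=-1.0, min_len=3):
    """Find droughts: runs of consecutive anomalies below `threshold`
    lasting at least `min_len` months. Returns (start index, duration
    in months, severity = sum of anomalies over the run)."""
    events, start = [], None
    for i, a in enumerate(anomalies + [0.0]):   # sentinel closes a final run
        if a < threshold and start is None:
            start = i
        elif a >= threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i - start, sum(anomalies[start:i])))
            start = None
    return events

anoms = [0.5, -1.2, -1.5, -1.1, -1.3, 0.2, -1.4, -1.2, 0.1]
print(droughts(anoms))
```

    On this series only the first dry spell qualifies (four months, severity about -5.1); the later two-month dip falls short of the duration criterion. Hit/miss and false-alarm rates then follow from comparing the event lists of two products against the station-based events.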

  12. ECOALIM: A Dataset of Environmental Impacts of Feed Ingredients Used in French Animal Production.

    PubMed

    Wilfart, Aurélie; Espagnol, Sandrine; Dauguet, Sylvie; Tailleur, Aurélie; Gac, Armelle; Garcia-Launay, Florence

    2016-01-01

    Feeds contribute substantially to the environmental impacts of livestock products. Therefore, formulating low-impact feeds requires data on environmental impacts of feed ingredients with consistent perimeters and methodology for life cycle assessment (LCA). We created the ECOALIM dataset of life cycle inventories (LCIs) and associated impacts of feed ingredients used in animal production in France. It provides several perimeters for LCIs (field gate, storage agency gate, plant gate and harbour gate) with homogeneously collected data from French R&D institutes covering the 2005-2012 period. The dataset of environmental impacts is available as a Microsoft® Excel spreadsheet on the ECOALIM website and provides climate change, acidification, eutrophication, non-renewable and total cumulative energy demand, phosphorus demand, and land occupation. LCIs in the ECOALIM dataset are available in the AGRIBALYSE® database in SimaPro® software. The typology performed on the dataset classified the 149 average feed ingredients into categories of low impact (co-products of plant origin and minerals), high impact (feed-use amino acids, fats and vitamins) and intermediate impact (cereals, oilseeds, oil meals and protein crops). Therefore, the ECOALIM dataset can be used by feed manufacturers and LCA practitioners to investigate formulation of low-impact feeds. It also provides data for environmental evaluation of feeds and animal production systems. Included in the AGRIBALYSE® database and SimaPro®, the ECOALIM dataset will benefit from their procedures for maintenance and regular updating. Future use can also include environmental labelling of commercial products from livestock production.

  13. Integrating disparate lidar datasets for a regional storm tide inundation analysis of Hurricane Katrina

    USGS Publications Warehouse

    Stoker, Jason M.; Tyler, Dean J.; Turnipseed, D. Phil; Van Wilson, K.; Oimoen, Michael J.

    2009-01-01

    Hurricane Katrina was one of the largest natural disasters in U.S. history. Due to the sheer size of the affected areas, an unprecedented regional analysis at very high resolution and accuracy was needed to properly quantify and understand the effects of the hurricane and the storm tide. Many disparate sources of lidar data were acquired and processed for varying environmental reasons by pre- and post-Katrina projects. The datasets were in several formats and projections and were processed to varying phases of completion, and as a result the task of producing a seamless digital elevation dataset required a high level of coordination, research, and revision. To create a seamless digital elevation dataset, many technical issues had to be resolved before producing the desired 1/9-arc-second (3-meter) grid needed as the map base for projecting the Katrina peak storm tide throughout the affected coastal region. This report presents the methodology that was developed to construct seamless digital elevation datasets from multipurpose, multi-use, and disparate lidar datasets, and describes an easily accessible Web application for viewing the maximum storm tide caused by Hurricane Katrina in southeastern Louisiana, Mississippi, and Alabama.

  14. Data Sharing and the Development of the Cleveland Clinic Statistical Education Dataset Repository

    ERIC Educational Resources Information Center

    Nowacki, Amy S.

    2013-01-01

    Examples are highly sought by both students and teachers. This is particularly true as many statistical instructors aim to engage their students and increase active participation. While simulated datasets are functional, they lack real perspective and the intricacies of actual data. In order to obtain real datasets, the principal investigator of a…

  15. Dataset for reporting of thymic epithelial tumours: recommendations from the International Collaboration on Cancer Reporting (ICCR).

    PubMed

    Nicholson, Andrew G; Detterbeck, Frank; Marx, Alexander; Roden, Anja C; Marchevsky, Alberto M; Mukai, Kiyoshi; Chen, Gang; Marino, Mirella; den Bakker, Michael A; Yang, Woo-Ick; Judge, Meagan; Hirschowitz, Lynn

    2017-03-01

    The International Collaboration on Cancer Reporting (ICCR) is a not-for-profit organization formed by the Royal Colleges of Pathologists of Australasia and the United Kingdom, the College of American Pathologists, the Canadian Association of Pathologists-Association canadienne des pathologistes in association with the Canadian Partnership Against Cancer, and the European Society of Pathology. Its goal is to produce standardized, internationally agreed, evidence-based datasets for use throughout the world. This article describes the development of a cancer dataset by the multidisciplinary ICCR expert panel for the reporting of thymic epithelial tumours. The dataset includes 'required' (mandatory) and 'recommended' (non-mandatory) elements, which are validated by a review of current evidence and supported by explanatory text. Seven required elements and 12 recommended elements were agreed by the international dataset authoring committee to represent the essential information for the reporting of thymic epithelial tumours. The use of an internationally agreed, structured pathology dataset for reporting thymic tumours provides all of the necessary information for optimal patient management, facilitates consistent and accurate data collection, and provides valuable data for research and international benchmarking. The dataset also provides a valuable resource for those countries and institutions that are not in a position to develop their own datasets. © 2016 John Wiley & Sons Ltd.

  16. Field Research Facility Data Integration Framework Data Management Plan: Survey Lines Dataset

    DTIC Science & Technology

    2016-08-01

    ...CHL and its District partners. The beach morphology surveys on which this report focuses provide quantitative measures of the dynamic nature of... topography, volume change. 1.4 Data description: The morphology surveys are conducted over a series of 26 shore-perpendicular profile lines spaced 50... Table 1. FRF survey lines dataset input data and products (Input Data, FDIF Product, Description; ASCII LARC survey text...).

  17. Determining similarity of scientific entities in annotation datasets.

    PubMed

    Palma, Guillermo; Vidal, Maria-Esther; Haag, Eric; Raschid, Louiqa; Thor, Andreas

    2015-01-01

    Linked Open Data initiatives have made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms from ontologies. Annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities becomes a building block for pattern mining, e.g. identifying drug-drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called 'AnnSim' that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1-1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities or where it could lead to predicting new links or identifying potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures. Database URL: http://www.yeastgenome.org/ © The Author(s) 2015. Published by Oxford University Press.
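    The core of AnnSim as described above is a 1-1 maximum-weight bipartite matching between the two annotation sets. The sketch below brute-forces the matching over permutations and normalizes by the average set size; this is an illustrative toy (the normalization choice is assumed, and the paper uses an efficient matching solver rather than enumeration).

    ```python
    from itertools import permutations

    def annsim(sim):
        """AnnSim-style relatedness between two annotated entities.

        sim[i][j] is the similarity between annotation i of entity A and
        annotation j of entity B. We find the 1-1 matching that maximizes
        total similarity and normalize: 2 * best / (|A| + |B|).
        Brute force over permutations -- only feasible for tiny inputs.
        """
        m, n = len(sim), len(sim[0])
        if m > n:                               # transpose so rows <= cols
            sim = [list(col) for col in zip(*sim)]
            m, n = n, m
        best = max(sum(sim[i][p[i]] for i in range(m))
                   for p in permutations(range(n), m))
        return 2.0 * best / (m + n)

    # Two entities whose annotations pair off strongly on the diagonal.
    sim = [[1.0, 0.2],
           [0.3, 0.9]]
    score = annsim(sim)   # optimal matching picks (0,0) and (1,1)
    ```

    In practice the matching is solved with a polynomial-time algorithm (e.g. the Hungarian method); the brute-force version only serves to make the definition concrete.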

  18. Attitudes of pre-service and in-service science teachers toward the use of computerized tools in science classes [Actitudes de los candidatos y maestros de ciencias en servicio acerca del uso de las herramientas computadorizadas en las clases de ciencias]

    NASA Astrophysics Data System (ADS)

    Bayuelo, Ezequiel

    This study examined and compared the attitudes of pre-service science teachers (teacher candidates) and in-service science teachers toward the use of computerized tools in science classes. It also identified and differentiated the uses they make of these tools in science classes. The study follows an exploratory descriptive design. The sample consisted of three hundred and ten subjects who were either science teacher candidates or in-service science teachers. To collect the data, a thirty-one-item questionnaire was constructed and validated. The non-parametric Kruskal-Wallis and chi-squared (homogeneity) tests were used to establish differences between the subjects' attitudes toward the use of computerized tools in science classes. The findings showed that the attitudes of both teacher candidates and in-service teachers toward the use of computerized tools are positive and very similar. There were no differences between candidates and in-service teachers in terms of confidence in, and empathy toward, the use of computerized tools in science classes. In aspects such as the use of the ERIC bibliographic database and the use of computerized tools in educational activities such as exploring concepts, concept formation, applying what has been learned, and doing assignments, there were statistically significant differences between candidates and in-service teachers. Comparing observed with expected frequencies, more in-service teachers and fewer candidates reported using the aforementioned database and computerized tools in those educational activities.

  19. Wild Type and PPAR KO Dataset

    EPA Pesticide Factsheets

    Data set 1 consists of the experimental data for the Wild Type and PPAR KO animal study and includes data used to prepare Figures 1-4 and Table 1 of the Das et al., 2016 paper. This dataset is associated with the following publication: Das, K., C. Wood, M. Lin, A.A. Starkov, C. Lau, K.B. Wallace, C. Corton, and B. Abbott. Perfluoroalkyl acids-induced liver steatosis: Effects on genes controlling lipid homeostasis. TOXICOLOGY. Elsevier Science Ltd, New York, NY, USA, 378: 32-52, (2017).

  20. The Most Common Geometric and Semantic Errors in CityGML Datasets

    NASA Astrophysics Data System (ADS)

    Biljecki, F.; Ledoux, H.; Du, X.; Stoter, J.; Soon, K. H.; Khoo, V. H. S.

    2016-10-01

    To be used as input in most simulation and modelling software, 3D city models should be geometrically and topologically valid, and semantically rich. In this paper we investigate the quality of currently available CityGML datasets: we validate the geometry/topology of the 3D primitives (Solid and MultiSurface), and we validate whether the semantics of the boundary surfaces of buildings are correct. We have analysed all the CityGML datasets we could find, both from portals of cities and on different websites, plus a few that were made available to us. We have thus validated 40M surfaces in 16M 3D primitives and 3.6M buildings found in 37 CityGML datasets originating from 9 countries, and produced by several companies with diverse software and acquisition techniques. The results indicate that CityGML datasets without errors are rare, and those that are nearly valid are mostly simple LOD1 models. We report on the most common errors we have found, and analyse them. One main observation is that many of these errors could be automatically fixed or prevented with simple modifications to the modelling software. Our principal aim is to highlight the most common errors so that these are not repeated in the future. We hope that our paper and the open-source software we have developed will help raise awareness of data quality among data providers and 3D GIS software producers.

  1. A modern plant-climate research dataset for modelling eastern North American plant taxa.

    NASA Astrophysics Data System (ADS)

    Gonzales, L. M.; Grimm, E. C.; Williams, J. W.; Nordheim, E. V.

    2008-12-01

    Continental-scale modern pollen-climate data repositories are a primary data source for paleoclimate reconstructions. However, these repositories can contain artifacts, such as records from different depositional environments and replicate records, that can influence the observed pollen-climate relationships as well as the paleoclimate reconstructions derived from these relationships. In this paper, we address the issues related to these artifacts as we define the methods used to create a research dataset from the North American Modern Pollen Database (NAMPD; Whitmore et al., 2005). Additionally, we define the methods used to select the environmental variables that are best for modeling regional pollen-climate relationships from the research dataset. Because the depositional environment determines the relative strengths of the local and regional pollen signals, combining data from different depositional environments results in pollen abundances that can be influenced by the local pollen signal. Replicate records in pollen-climate datasets can skew pollen-climate relationships by causing an over- or under-representation of pollen abundances in climate space. When these two artifacts are combined, the errors introduced into pollen-climate relationship modeling are compounded. The research dataset we present consists of 2,613 records in eastern North America, of which 70.9% are lacustrine sites. We demonstrate that this new research dataset improves upon the modeling of regional pollen-climate relationships for eastern North American taxa. It encompasses the majority of the NAMPD's temperature and mean summer precipitation ranges and 40% of its mean winter precipitation range. NAMPD sites with higher winter precipitation are located along the northwestern coast of North America, where a rainshadow effect produces abundant winter precipitation. We present our analysis of the research dataset for use in paleoclimate reconstructions, and

  2. Meta-Analysis in Genome-Wide Association Datasets: Strategies and Application in Parkinson Disease

    PubMed Central

    Evangelou, Evangelos; Maraganore, Demetrius M.; Ioannidis, John P.A.

    2007-01-01

    Background Genome-wide association studies hold substantial promise for identifying common genetic variants that regulate susceptibility to complex diseases. However, for the detection of small genetic effects, single studies may be underpowered. Power may be improved by combining genome-wide datasets with meta-analytic techniques. Methodology/Principal Findings Both single and two-stage genome-wide data may be combined and there are several possible strategies. In the two-stage framework, we considered the options of (1) enhancement of replication data and (2) enhancement of first-stage data, and then, we also considered (3) joint meta-analyses including all first-stage and second-stage data. These strategies were examined empirically using data from two genome-wide association studies (three datasets) on Parkinson disease. In the three strategies, we derived 12, 5, and 49 single nucleotide polymorphisms that show significant associations at conventional levels of statistical significance. None of these remained significant after conservative adjustment for the number of performed analyses in each strategy. However, some may warrant further consideration: 6 SNPs were identified with at least 2 of the 3 strategies and 3 SNPs [rs1000291 on chromosome 3, rs2241743 on chromosome 4 and rs3018626 on chromosome 11] were identified with all 3 strategies and had no or minimal between-dataset heterogeneity (I2 = 0, 0 and 15%, respectively). Analyses were primarily limited by the suboptimal overlap of tested polymorphisms across different datasets (e.g., only 31,192 shared polymorphisms between the two tier 1 datasets). Conclusions/Significance Meta-analysis may be used to improve the power and examine the between-dataset heterogeneity of genome-wide association studies. Prospective designs may be most efficient, if they try to maximize the overlap of genotyping platforms and anticipate the combination of data across many genome-wide association studies. PMID:17332845
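    The per-SNP heterogeneity values quoted above (I² = 0, 0 and 15%) come from standard fixed-effect meta-analysis formulas: inverse-variance weights, Cochran's Q, and I² = max(0, (Q − df)/Q). A minimal sketch with made-up effect estimates:

    ```python
    def fixed_effect_meta(effects, variances):
        """Fixed-effect meta-analysis of per-dataset effect estimates.

        Returns the inverse-variance-weighted pooled effect and the I^2
        heterogeneity statistic: w_i = 1/v_i,
        Q = sum w_i * (y_i - y_pooled)^2, I^2 = max(0, (Q - df) / Q).
        """
        w = [1.0 / v for v in variances]
        pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
        q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
        df = len(effects) - 1
        i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
        return pooled, i2

    # Three concordant datasets -> no between-dataset heterogeneity.
    _, i2_hom = fixed_effect_meta([0.2, 0.2, 0.2], [0.01, 0.02, 0.01])
    # Two strongly discordant datasets -> high heterogeneity.
    _, i2_het = fixed_effect_meta([0.0, 1.0], [0.01, 0.01])
    ```

    An SNP with I² near 0 across strategies, as for the three variants highlighted in the abstract, is one whose effect estimates are mutually consistent across the combined datasets.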

  3. The impact of the resolution of meteorological datasets on catchment-scale drought studies

    NASA Astrophysics Data System (ADS)

    Hellwig, Jost; Stahl, Kerstin

    2017-04-01

    Gridded meteorological datasets provide the basis for studying drought at a range of scales, including catchment-scale drought studies in hydrology. They are readily available for studying past weather conditions and often serve real-time monitoring as well. As these datasets differ in spatial/temporal coverage and spatial/temporal resolution, most studies face a tradeoff between these features. Our investigation examines whether biases occur when studying drought at the catchment scale with low-resolution input data. For that, a comparison among the datasets HYRAS (covering Central Europe, 1x1 km grid, daily data, 1951-2005), E-OBS (Europe, 0.25° grid, daily data, 1950-2015) and GPCC (whole world, 0.5° grid, monthly data, 1901-2013) is carried out. Generally, biases in precipitation increase with decreasing resolution, and the largest deviations are found during summer. In the low mountain ranges of Central Europe, the coarser-resolution datasets (E-OBS, GPCC) overestimate dry days and underestimate total precipitation, since they cannot capture the high spatial variability. However, relative measures like the correlation coefficient reveal good consistency of dry and wet periods, both for absolute precipitation values and for standardized indices like the Standardized Precipitation Index (SPI) or the Standardized Precipitation Evapotranspiration Index (SPEI). In particular, the most severe droughts derived from the different datasets match very well. These results indicate that absolute values from coarse-resolution datasets can be critical to use for assessing hydrological drought at the catchment scale, whereas relative measures for determining periods of drought are more trustworthy. Studies on drought that downscale meteorological data should therefore carefully consider their data needs and focus on relative measures for dry periods if these are sufficient for the task.
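    The robustness of relative measures noted above can be illustrated with a deliberately simplified standardized index: a plain z-score of an accumulated precipitation series. (The operational SPI fits a gamma distribution and maps through the normal quantile function; the z-score used here is only a stand-in that preserves the point about invariance to absolute bias.)

    ```python
    from statistics import mean, pstdev

    def standardized_index(series):
        """Simplified SPI-like standardization: z-score of an accumulated
        precipitation series. Relative wet/dry ranking is unchanged by any
        multiplicative bias in the input."""
        mu, sigma = mean(series), pstdev(series)
        return [(x - mu) / sigma for x in series]

    fine = [80, 120, 40, 160]    # high-resolution seasonal totals (mm)
    coarse = [40, 60, 20, 80]    # same pattern, halved by a coarse-grid bias
    z_fine = standardized_index(fine)
    z_coarse = standardized_index(coarse)
    ```

    Although the coarse dataset underestimates every total by a factor of two, both series yield identical standardized values, so they identify the same driest period; absolute totals, by contrast, disagree everywhere.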

  4. Handling limited datasets with neural networks in medical applications: A small-data approach.

    PubMed

    Shaikhina, Torgyn; Khovanova, Natalia A

    2017-01-01

    Single-centre studies in the medical domain are often characterised by limited samples due to the complexity and high costs of patient data collection. Machine learning methods for regression modelling of small datasets (fewer than 10 observations per predictor variable) remain scarce. Our work bridges this gap by developing a novel framework for the application of artificial neural networks (NNs) to regression tasks involving small medical datasets. In order to address the sporadic fluctuations and validation issues that appear in regression NNs trained on small datasets, the methods of multiple runs and surrogate data analysis are proposed in this work. The approach was compared to state-of-the-art ensemble NNs, and the effect of dataset size on NN performance was also investigated. The proposed framework was applied to the prediction of the compressive strength (CS) of femoral trabecular bone in patients suffering from severe osteoarthritis. The NN model was able to estimate the CS of osteoarthritic trabecular bone from its structural and biological properties with a standard error of 0.85 MPa. When evaluated on independent test samples, the NN achieved an accuracy of 98.3%, outperforming an ensemble NN model by 11%. We reproduce this result on CS data of another porous solid (concrete) and demonstrate that the proposed framework allows an NN modelled with as few as 56 samples to generalise to 300 independent test samples with 86.5% accuracy, which is comparable to the performance of an NN developed with an 18 times larger dataset (1030 samples). The significance of this work is two-fold: the practical application allows for non-destructive prediction of bone fracture risk, while the novel methodology extends beyond the task considered in this study and provides a general framework for the application of regression NNs to medical problems characterised by limited dataset sizes. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.

  5. Looking Deep with Infrared Eyes

    NASA Astrophysics Data System (ADS)

    2006-07-01

    Today, British astronomers are releasing the first data from the largest and most sensitive survey of the heavens in infrared light to the ESO user community. The UKIRT Infrared Deep Sky Survey (UKIDSS) has completed the first of seven years of data collection, studying objects that are too faint to see at visible wavelengths, such as very distant or very cool objects. New data on young galaxies are already challenging current thinking on galaxy formation, revealing galaxies that are massive at a much earlier stage of development than expected. These first science results already show how powerful the full survey will be at finding rare objects that hold vital clues to how stars and galaxies in our Universe formed. UKIDSS will make an atlas of large areas of the sky in the infrared. The data become available to the entire ESO user community immediately after they are entered into the archive [2]. Release to the world follows 18 months after each release to ESO. "Astronomers across Europe will jump on these exciting new data. We are moving into new territory - our survey is both wide and deep, so we are mapping huge volumes of space. That's how we will locate rare objects - the very nearest and smallest stars, and young galaxies at the edge of the universe," said Andy Lawrence from the University of Edinburgh, UKIDSS Principal Investigator. The UKIDSS data are collected by the United Kingdom Infrared Telescope [3] situated near the summit of Mauna Kea in Hawaii using the Wide Field Camera (WFCAM) built by the United Kingdom Astronomy Technology Centre (UKATC) in Edinburgh. WFCAM is the most powerful infrared imager in the world, generating enormous amounts of data - 150 gigabytes per night (equivalent to more than 200 CDs) - and approximately 10.5 terabytes in total so far (or 15,000 CDs). Mark Casali, now at ESO, was the Project Scientist in charge of the WFCAM instrument construction at the UKATC. "WFCAM was a bold technological undertaking," said Mark Casali

  6. Using routine clinical and administrative data to produce a dataset of attendances at Emergency Departments following self-harm.

    PubMed

    Polling, C; Tulloch, A; Banerjee, S; Cross, S; Dutta, R; Wood, D M; Dargan, P I; Hotopf, M

    2015-07-16

    Self-harm is a significant public health concern in the UK. This is reflected in the recent addition to the English Public Health Outcomes Framework of rates of attendance at Emergency Departments (EDs) following self-harm. However, there is currently no source of data to measure this outcome, and routinely available data for inpatient admissions following self-harm miss the majority of cases presenting to services. We aimed to investigate (i) whether a dataset of ED presentations could be produced using a combination of routinely collected clinical and administrative data and (ii) whether this dataset could be validated against one produced using methods similar to those used in previous studies. Using the Clinical Record Interactive Search system, the electronic health records (EHRs) used in four EDs were linked to Hospital Episode Statistics to create a dataset of attendances following self-harm. This dataset was compared with an audit dataset of ED attendances created by manual searching of ED records, and the proportion of total cases detected by each dataset was compared. There were 1932 attendances detected by the EHR dataset and 1906 by the audit. The EHR and audit datasets detected 77% and 76% of all attendances, respectively, and both detected 82% of individual patients. There were no differences in terms of age, sex, ethnicity or marital status between those detected and those missed using the EHR method. Both datasets revealed more than double the number of self-harm incidents that could be identified from inpatient admission records. It was possible to use routinely collected EHR data to create a dataset of attendances at EDs following self-harm. The dataset detected the same proportion of attendances and individuals as the audit dataset, proved more comprehensive than the use of inpatient admission records, and did not show a systematic bias in those cases it missed.

  7. Elsevier’s approach to the bioCADDIE 2016 Dataset Retrieval Challenge

    PubMed Central

    Scerri, Antony; Kuriakose, John; Deshmane, Amit Ajit; Stanger, Mark; Moore, Rebekah; Naik, Raj; de Waard, Anita

    2017-01-01

    We developed a two-stream, Apache Solr-based information retrieval system in response to the bioCADDIE 2016 Dataset Retrieval Challenge. One stream was based on the principle of word embeddings; the other was rooted in ontology-based indexing. Despite encountering several issues in the data, the evaluation procedure and the technologies used, the system performed quite well. We provide some pointers towards future work: in particular, we suggest that more work on query expansion could benefit future biomedical search engines. Database URL: https://data.mendeley.com/datasets/zd9dxpyybg/1 PMID:29220454

  8. The UKIRT Hemisphere Survey: definition and J-band data release

    NASA Astrophysics Data System (ADS)

    Dye, S.; Lawrence, A.; Read, M. A.; Fan, X.; Kerr, T.; Varricatt, W.; Furnell, K. E.; Edge, A. C.; Irwin, M.; Hambly, N.; Lucas, P.; Almaini, O.; Chambers, K.; Green, R.; Hewett, P.; Liu, M. C.; McGreer, I.; Best, W.; Zhang, Z.; Sutorius, E.; Froebrich, D.; Magnier, E.; Hasinger, G.; Lederer, S. M.; Bold, M.; Tedds, J. A.

    2018-02-01

    This paper defines the UK Infra-Red Telescope (UKIRT) Hemisphere Survey (UHS) and release of the remaining ∼12 700 deg2 of J-band survey data products. The UHS will provide continuous J- and K-band coverage in the Northern hemisphere from a declination of 0° to 60° by combining the existing Large Area Survey, Galactic Plane Survey and Galactic Clusters Survey conducted under the UKIRT Infra-red Deep Sky Survey (UKIDSS) programme with this new additional area not covered by UKIDSS. The released data include J-band imaging and source catalogues over the new area, which, together with UKIDSS, completes the J-band UHS coverage over the full ∼17 900 deg2 area. 98 per cent of the data in this release have passed quality control criteria. The remaining 2 per cent have been scheduled for re-observation. The median 5σ point source sensitivity of the released data is 19.6 mag (Vega). The median full width at half-maximum of the point spread function across the data set is 0.75 arcsec. In this paper, we outline the survey management, data acquisition, processing and calibration, quality control and archiving as well as summarizing the characteristics of the released data products. The data are initially available to a limited consortium with a world-wide release scheduled for 2018 August.

  9. On the comparison of the strength of morphological integration across morphometric datasets.

    PubMed

    Adams, Dean C; Collyer, Michael L

    2016-11-01

    Evolutionary morphologists frequently wish to understand the extent to which organisms are integrated, and whether the strength of morphological integration among subsets of phenotypic variables differs among taxa or other groups. However, comparisons of the strength of integration across datasets are difficult, in part because the summary measures that characterize these patterns (RV coefficient and r_PLS) are dependent both on sample size and on the number of variables. As a solution to this issue, we propose a standardized test statistic (a z-score) for measuring the degree of morphological integration between sets of variables. The approach is based on a partial least squares analysis of trait covariation, and its permutation-based sampling distribution. Under the null hypothesis of a random association of variables, the method displays a constant expected value and confidence intervals for datasets of differing sample sizes and variable number, thereby providing a consistent measure of integration suitable for comparisons across datasets. A two-sample test is also proposed to statistically determine whether levels of integration differ between datasets, and an empirical example examining cranial shape integration in Mediterranean wall lizards illustrates its use. Some extensions of the procedure are also discussed. © 2016 The Author(s). Evolution © 2016 The Society for the Study of Evolution.
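    The permutation-based z-score idea can be sketched as follows: take a PLS-style association statistic between two trait blocks, rebuild its null distribution by permuting rows of one block, and standardize the observed value against that null. This is a sketch in the spirit of the approach, not the authors' implementation; the statistic used here (first singular value of the cross-covariance matrix) and all data are illustrative.

    ```python
    import numpy as np

    def integration_z(x, y, n_perm=499, seed=1):
        """Permutation-based effect size for association between two
        trait blocks (rows = specimens). Statistic: first singular value
        of the cross-covariance matrix (PLS association strength).
        z = (observed - mean(null)) / sd(null)."""
        rng = np.random.default_rng(seed)
        xc = x - x.mean(axis=0)

        def strength(yy):
            yc = yy - yy.mean(axis=0)
            return np.linalg.svd(xc.T @ yc / (len(x) - 1),
                                 compute_uv=False)[0]

        obs = strength(y)
        null = np.array([strength(y[rng.permutation(len(y))])
                         for _ in range(n_perm)])
        return (obs - null.mean()) / null.std()

    rng = np.random.default_rng(0)
    x = rng.normal(size=(40, 3))
    y_int = x[:, :2] + 0.1 * rng.normal(size=(40, 2))   # integrated with x
    y_rand = rng.normal(size=(40, 2))                   # independent of x
    z_int, z_rand = integration_z(x, y_int), integration_z(x, y_rand)
    ```

    Because z is standardized against its own permutation null, values from datasets with different sample sizes and variable counts sit on a common scale, which is what makes the cross-dataset comparison in the abstract possible.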

  10. Evaluation of Global Observations-Based Evapotranspiration Datasets and IPCC AR4 Simulations

    NASA Technical Reports Server (NTRS)

    Mueller, B.; Seneviratne, S. I.; Jimenez, C.; Corti, T.; Hirschi, M.; Balsamo, G.; Ciais, P.; Dirmeyer, P.; Fisher, J. B.; Guo, Z.; hide

    2011-01-01

    Quantification of global land evapotranspiration (ET) has long been associated with large uncertainties due to the lack of reference observations. Several recently developed products now provide the capacity to estimate ET at global scales. These products, partly based on observational data, include satellite-based products, land surface model (LSM) simulations, atmospheric reanalysis output, estimates based on empirical upscaling of eddy-covariance flux measurements, and atmospheric water balance datasets. The LandFlux-EVAL project aims to evaluate and compare these newly developed datasets. Additionally, an evaluation of IPCC AR4 global climate model (GCM) simulations is presented, providing an assessment of their capacity to reproduce flux behavior relative to the observations-based products. Though differently constrained with observations, the analyzed reference datasets display similar large-scale ET patterns. ET from the IPCC AR4 simulations was significantly smaller than that from the other products for India (up to 1 mm/d) and parts of eastern South America, and larger in the western USA, Australia and China. The inter-product variance is lower across the IPCC AR4 simulations than across the reference datasets in several regions, which indicates that uncertainties may be underestimated in the IPCC AR4 models due to shared biases of these simulations.

  11. Multimedia Content Development as a Facial Expression Datasets for Recognition of Human Emotions

    NASA Astrophysics Data System (ADS)

    Mamonto, N. E.; Maulana, H.; Liliana, D. Y.; Basaruddin, T.

    2018-02-01

    Previously developed datasets contain facial expressions from foreign subjects. The development of this multimedia content aims to answer the problems experienced by the research team and other researchers who will conduct similar research. The method used in the development of multimedia content as a facial expression dataset for human emotion recognition is the Villamil-Molina version of the multimedia development method. The multimedia content was developed with 10 subjects (talents), each performing 3 shots and demonstrating 19 facial expressions in each shot. After the process of editing and rendering, tests were carried out, with the conclusion that the multimedia content can be used as a facial expression dataset for the recognition of human emotions.

  12. Development of a consensus core dataset in juvenile dermatomyositis for clinical use to inform research

    PubMed Central

    McCann, Liza J; Pilkington, Clarissa A; Huber, Adam M; Ravelli, Angelo; Appelbe, Duncan; Kirkham, Jamie J; Williamson, Paula R; Aggarwal, Amita; Christopher-Stine, Lisa; Constantin, Tamas; Feldman, Brian M; Lundberg, Ingrid; Maillard, Sue; Mathiesen, Pernille; Murphy, Ruth; Pachman, Lauren M; Reed, Ann M; Rider, Lisa G; van Royen-Kerkof, Annet; Russo, Ricardo; Spinty, Stefan; Wedderburn, Lucy R

    2018-01-01

    Objectives This study aimed to develop consensus on an internationally agreed dataset for juvenile dermatomyositis (JDM), designed for clinical use, to enhance collaborative research and allow integration of data between centres. Methods A prototype dataset was developed through a formal process that included analysing items within existing databases of patients with idiopathic inflammatory myopathies. This template was used to aid a structured multistage consensus process. Exploiting Delphi methodology, two web-based questionnaires were distributed to healthcare professionals caring for patients with JDM identified through email distribution lists of international paediatric rheumatology and myositis research groups. A separate questionnaire was sent to parents of children with JDM and patients with JDM, identified through established research networks and patient support groups. The results of these parallel processes informed a face-to-face nominal group consensus meeting of international myositis experts, tasked with defining the content of the dataset. This developed dataset was tested in routine clinical practice before review and finalisation. Results A dataset containing 123 items was formulated with an accompanying glossary. Demographic and diagnostic data are contained within form A collected at baseline visit only, disease activity measures are included within form B collected at every visit and disease damage items within form C collected at baseline and annual visits thereafter. Conclusions Through a robust international process, a consensus dataset for JDM has been formulated that can capture disease activity and damage over time. This dataset can be incorporated into national and international collaborative efforts, including existing clinical research databases. PMID:29084729

  13. Advanced Neuropsychological Diagnostics Infrastructure (ANDI): A Normative Database Created from Control Datasets

    PubMed Central

    de Vent, Nathalie R.; Agelink van Rentergem, Joost A.; Schmand, Ben A.; Murre, Jaap M. J.; Huizenga, Hilde M.

    2016-01-01

    In the Advanced Neuropsychological Diagnostics Infrastructure (ANDI), datasets of several research groups are combined into a single database containing scores on neuropsychological tests from healthy participants. For the most popular neuropsychological tests, the quantity and range of these data surpass those of traditional normative data, thereby enabling more accurate neuropsychological assessment. Because of the unique structure of the database, it facilitates normative comparison methods that were not feasible before, in particular those in which entire profiles of scores are evaluated. In this article, we describe the steps that were necessary to combine the separate datasets into a single database. These steps involve matching variables from multiple datasets, removing outlying values, determining the influence of demographic variables, and finding appropriate transformations to normality. A brief description of the current contents of the ANDI database is also given. PMID:27812340

  14. Resolution testing and limitations of geodetic and tsunami datasets for finite fault inversions along subduction zones

    NASA Astrophysics Data System (ADS)

    Williamson, A.; Newman, A. V.

    2017-12-01

    Finite fault inversions utilizing multiple datasets have become commonplace for large earthquakes, depending on data availability. The mixture of geodetic datasets such as Global Navigation Satellite System (GNSS) and InSAR data, seismic waveforms, and, when applicable, tsunami waveforms from Deep-ocean Assessment and Reporting of Tsunamis (DART) gauges provides slightly different observations that, when incorporated together, lead to a more robust model of the fault slip distribution. The merging of different datasets is of particular importance along subduction zones, where direct observations of seafloor deformation over the rupture area are extremely limited. Instead, instrumentation measures related ground motion from tens to hundreds of kilometers away. The distance from the event and the dataset type can lead to a variable degree of resolution, affecting the ability to accurately model the spatial distribution of slip. This study analyzes the spatial resolution attained individually from geodetic and tsunami datasets as well as from a combined dataset. We constrain how the distance between estimated parameters and observed data affects resolution, and how that varies between land-based and open-ocean datasets. The analysis focuses on accurately scaled subduction zone synthetic models as well as on the relationship between slip and data in recent large subduction zone earthquakes. This study shows that datasets sensitive to seafloor deformation, such as open-ocean tsunami waveforms or seafloor geodetic instrumentation, can provide unique offshore resolution for understanding most large, and particularly tsunamigenic, megathrust earthquake activity. In most environments, we simply lack the capability to resolve static displacements using land-based geodetic observations.
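
    The resolution question can be made concrete with the model resolution matrix of a damped least-squares slip inversion d = Gm. The sketch below uses a toy random Green's-function matrix, not a real subduction geometry; diagonal entries of R near 1 indicate well-resolved fault patches, entries near 0 indicate smearing.

```python
import numpy as np

# Damped least-squares inversion of d = G m, with model resolution
# matrix R = (G^T G + alpha^2 I)^-1 G^T G. The estimated slip is
# m_est = R m_true, so diag(R) measures how faithfully each patch's
# slip is recovered by the data geometry.
rng = np.random.default_rng(0)
n_data, n_patches = 30, 10
G = rng.normal(size=(n_data, n_patches))   # toy Green's functions
alpha = 0.1                                # damping weight

GtG = G.T @ G
R = np.linalg.solve(GtG + alpha**2 * np.eye(n_patches), GtG)
resolution = np.diag(R)                    # per-patch resolvability in (0, 1)
```

    For real geometries, rows of G from distant land-based stations carry little independent information about offshore patches, which shows up directly as small diagonal entries of R for those patches.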

  15. Watershed Boundary Dataset for Mississippi

    USGS Publications Warehouse

    Wilson, K. Van; Clair, Michael G.; Turnipseed, D. Phil; Rebich, Richard A.

    2009-01-01

    The U.S. Geological Survey, in cooperation with the Mississippi Department of Environmental Quality, U.S. Department of Agriculture-Natural Resources Conservation Service, Mississippi Department of Transportation, U.S. Department of Agriculture-Forest Service, and the Mississippi Automated Resource Information System, developed a 1:24,000-scale Watershed Boundary Dataset for Mississippi, including watershed and subwatershed boundaries, codes, names, and areas. The Watershed Boundary Dataset for Mississippi provides a standard geographical framework for water-resources and selected land-resources planning. The original 8-digit subbasins (hydrologic unit codes) were further subdivided into 10-digit watersheds (62.5 to 391 square miles (mi2)) and 12-digit subwatersheds (15.6 to 62.5 mi2); the exceptions were the Delta part of Mississippi and the Mississippi River inside levees, which were subdivided into 10-digit watersheds only. Also, large water bodies in the Mississippi Sound along the coast were not delineated as finely as a typical 12-digit subwatershed. All of the data - including watershed and subwatershed boundaries, subdivision codes and names, and drainage-area data - are stored in a Geographic Information System database, which is available at http://ms.water.usgs.gov/. This map shows information on drainage and hydrography in the form of U.S. Geological Survey hydrologic unit boundaries for water-resource 2-digit regions, 4-digit subregions, 6-digit basins (formerly called accounting units), 8-digit subbasins (formerly called cataloging units), 10-digit watersheds, and 12-digit subwatersheds in Mississippi. A description of the project study area, the methods used in the development of watershed and subwatershed boundaries for Mississippi, and the results are presented in Wilson and others (2008). The data presented in this map and by Wilson and others (2008) supersede the data presented for Mississippi by Seaber and others (1987) and U.S. Geological Survey (1977).
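
    The nested hydrologic unit codes described above can be unpacked programmatically: each even-length prefix of a 12-digit code identifies the enclosing unit. A small sketch (the example code is hypothetical, not a real Mississippi unit):

```python
def huc_hierarchy(huc12):
    """Split a 12-digit hydrologic unit code (HUC) into its nested levels:
    2 digits = region, 4 = subregion, 6 = basin, 8 = subbasin,
    10 = watershed, 12 = subwatershed."""
    if len(huc12) != 12 or not huc12.isdigit():
        raise ValueError("expected a 12-digit HUC string")
    levels = ("region", "subregion", "basin", "subbasin",
              "watershed", "subwatershed")
    return {name: huc12[:2 * (i + 1)] for i, name in enumerate(levels)}

units = huc_hierarchy("031800020104")   # hypothetical example code
# units["region"] -> "03", units["subbasin"] -> "03180002", etc.
```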

  16. An integrated pan-tropical biomass map using multiple reference datasets.

    PubMed

    Avitabile, Valerio; Herold, Martin; Heuvelink, Gerard B M; Lewis, Simon L; Phillips, Oliver L; Asner, Gregory P; Armston, John; Ashton, Peter S; Banin, Lindsay; Bayol, Nicolas; Berry, Nicholas J; Boeckx, Pascal; de Jong, Bernardus H J; DeVries, Ben; Girardin, Cecile A J; Kearsley, Elizabeth; Lindsell, Jeremy A; Lopez-Gonzalez, Gabriela; Lucas, Richard; Malhi, Yadvinder; Morel, Alexandra; Mitchard, Edward T A; Nagy, Laszlo; Qie, Lan; Quinones, Marcela J; Ryan, Casey M; Slik, J W Ferry; Sunderland, Terry; Laurin, Gaia Vaglio; Gatti, Roberto Cazzolla; Valentini, Riccardo; Verbeeck, Hans; Wijaya, Arief; Willcock, Simon

    2016-04-01

    We combined two existing datasets of vegetation aboveground biomass (AGB) (Proceedings of the National Academy of Sciences of the United States of America, 108, 2011, 9899; Nature Climate Change, 2, 2012, 182) into a pan-tropical AGB map at 1-km resolution using an independent reference dataset of field observations and locally calibrated high-resolution biomass maps, harmonized and upscaled to 14 477 1-km AGB estimates. Our data fusion approach uses bias removal and weighted linear averaging that incorporates and spatializes the biomass patterns indicated by the reference data. The method was applied independently in areas (strata) with homogeneous error patterns of the input (Saatchi and Baccini) maps, which were estimated from the reference data and additional covariates. Based on the fused map, we estimated an AGB stock for the tropics (23.4°N-23.4°S) of 375 Pg dry mass, 9-18% lower than the Saatchi and Baccini estimates. The fused map also showed differing spatial patterns of AGB over large areas, with higher AGB density in the dense forest areas of the Congo basin, Eastern Amazon, and South-East Asia, and lower values in Central America and in most dry vegetation areas of Africa, than either of the input maps. The validation exercise, based on 2118 estimates from the reference dataset not used in the fusion process, showed that the fused map had an RMSE 15-21% lower than those of the input maps and, most importantly, nearly unbiased estimates (mean bias 5 Mg dry mass ha^-1 vs. 21 and 28 Mg ha^-1 for the input maps). The fusion method can be applied at any scale, including the policy-relevant national level, where it can provide improved biomass estimates by integrating existing regional biomass maps as input maps and additional, country-specific reference datasets. © 2015 John Wiley & Sons Ltd.
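
    The core of the fusion step, bias removal followed by weighted linear averaging, can be sketched as below. The per-stratum biases and error variances here are illustrative constants; in the paper they are estimated from the reference plots within each stratum.

```python
import numpy as np

def fuse_maps(map_a, map_b, bias_a, bias_b, var_a, var_b):
    """Bias-correct two AGB maps, then combine them with
    inverse-variance weights (the lower-error map counts more)."""
    a = map_a - bias_a
    b = map_b - bias_b
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    return (w_a * a + w_b * b) / (w_a + w_b)

# Toy 1-km pixels in Mg dry mass per hectare; all numbers illustrative.
map_a = np.array([310.0, 120.0, 45.0])
map_b = np.array([280.0, 150.0, 60.0])
fused = fuse_maps(map_a, map_b, bias_a=20.0, bias_b=-5.0,
                  var_a=900.0, var_b=400.0)
```

    Because the weights sum to one after normalization, each fused pixel is a convex combination of the two bias-corrected inputs.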

  17. New fuzzy support vector machine for the class imbalance problem in medical datasets classification.

    PubMed

    Gu, Xiaoqing; Ni, Tongguang; Wang, Hongyuan

    2014-01-01

    In medical dataset classification, the support vector machine (SVM) is considered to be one of the most successful methods. However, most real-world medical datasets contain outliers/noise, and the data often have class imbalance problems. In this paper, a fuzzy support vector machine (FSVM) for the class imbalance problem (called FSVM-CIP) is presented, which can be seen as a modified FSVM extended with manifold regularization and with separate misclassification costs assigned to the two classes. The proposed FSVM-CIP can be used to handle the class imbalance problem in the presence of outliers/noise, and it enhances the local maximum margin. Five real-world medical datasets, breast, heart, hepatitis, BUPA liver, and Pima diabetes, from the UCI medical database are employed to illustrate the method presented in this paper. Experimental results on these datasets show the superior or comparable effectiveness of FSVM-CIP.
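
    The two ingredients named above, fuzzy memberships that down-weight outliers/noise and class-dependent misclassification costs, can be sketched independently of the full FSVM-CIP formulation. The distance-to-class-centre membership below is a common FSVM heuristic, not necessarily the exact function used in the paper.

```python
import numpy as np

def fuzzy_memberships(X, delta=1e-6):
    """Membership decays with distance from the class centre, so likely
    outliers/noise get a small weight on their slack penalty."""
    centre = X.mean(axis=0)
    d = np.linalg.norm(X - centre, axis=1)
    return 1.0 - d / (d.max() + delta)

def class_costs(y, minority_label):
    """Per-sample misclassification costs: weight the minority class by
    the imbalance ratio so both classes matter equally in the objective."""
    y = np.asarray(y)
    ratio = (y != minority_label).sum() / (y == minority_label).sum()
    return np.where(y == minority_label, ratio, 1.0)

X_pos = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0]])  # last point is noisy
m = fuzzy_memberships(X_pos)          # noisy point gets near-zero membership
c = class_costs([1, 1, 1, 0, 0, 0, 0, 0, 0], minority_label=1)
```

    In training, the effective penalty for sample i would then be the product of the base SVM penalty C, its membership, and its class cost.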

  18. Retinal fundus images for glaucoma analysis: the RIGA dataset

    NASA Astrophysics Data System (ADS)

    Almazroa, Ahmed; Alodhayb, Sami; Osman, Essameldin; Ramadan, Eslam; Hummadi, Mohammed; Dlaim, Mohammed; Alkatee, Muhannad; Raahemifar, Kaamran; Lakshminarayanan, Vasudevan

    2018-03-01

    Glaucoma is a major cause of irreversible blindness worldwide. Current models of chronic care will not be able to keep pace with the growing prevalence of glaucoma and the challenges of access to healthcare services. Tele-ophthalmology is being developed to close this gap. In order to develop automated techniques for glaucoma detection that can be used in tele-ophthalmology, we have developed a large retinal fundus dataset. A de-identified dataset of retinal fundus images for glaucoma analysis (RIGA) was derived from three sources, for a total of 750 images. The optic cup and disc boundaries for each image were marked and annotated manually by six experienced ophthalmologists, and cup-to-disc ratio (CDR) estimates were included. Six parameters were extracted and assessed (the disc area and centroid, cup area and centroid, and horizontal and vertical cup-to-disc ratios) among the ophthalmologists. The inter-observer annotations were compared by calculating the standard deviation (SD) for every image between the six ophthalmologists, in order to identify outliers amongst the six and to filter out the corresponding images. The dataset will be made available to the research community in order to crowd-source further analyses from other research groups, so as to develop, validate, and implement analysis algorithms appropriate for tele-glaucoma assessment. The RIGA dataset can be freely accessed online through the University of Michigan Deep Blue website (doi:10.7302/Z23R0R29).
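
    The inter-observer filtering step can be reproduced in a few lines. The CDR values and the SD threshold below are illustrative, not taken from the RIGA paper.

```python
import numpy as np

# Rows = images, columns = the six ophthalmologists; values are
# vertical cup-to-disc ratios (illustrative numbers only).
annotations = np.array([
    [0.42, 0.45, 0.43, 0.44, 0.41, 0.46],   # good agreement
    [0.30, 0.62, 0.35, 0.58, 0.31, 0.70],   # poor agreement -> filter out
    [0.55, 0.53, 0.56, 0.54, 0.57, 0.52],
])

per_image_sd = annotations.std(axis=1, ddof=1)   # SD across the 6 observers
sd_cut = 0.1                                     # illustrative threshold
keep = per_image_sd <= sd_cut                    # images retained
```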

  19. WIND Toolkit Offshore Summary Dataset

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Draxl, Caroline; Musial, Walt; Scott, George

    This dataset contains summary statistics for offshore wind resources for the continental United States derived from the Wind Integration National Dataset (WIND) Toolkit. These data are available in two formats: GDB - compressed geodatabases containing statistical summaries aligned with lease blocks (aliquots), stored in a GIS format; these data are partitioned into Pacific, Atlantic, and Gulf resource regions. HDF5 - statistical summaries of all points in the offshore Pacific, Atlantic, and Gulf regions; these data are located on the original WIND Toolkit grid and have not been reassigned or downsampled to lease blocks. These data were developed under contract by NREL for the Bureau of Ocean Energy Management (BOEM).

  20. Highlights of the Version 8 SBUV and TOMS Datasets Released at this Symposium

    NASA Technical Reports Server (NTRS)

    Bhartia, Pawan K.; McPeters, Richard D.; Flynn, Lawrence E.; Wellemeyer, Charles G.

    2004-01-01

    Last October was the 25th anniversary of the launch of the SBUV and TOMS instruments on NASA's Nimbus-7 satellite. Total ozone and ozone profile datasets produced by these and subsequent instruments form a quarter-century-long record. Over time we have released several versions of these datasets to incorporate advances in UV radiative transfer, inverse modeling, and instrument characterization. At this meeting we are releasing datasets produced from the version 8 algorithms. They replace the previous versions (V6 SBUV and V7 TOMS) released about a decade ago. About a dozen companion papers in this meeting provide details of the new algorithms and intercomparisons of the new data with external data. In this paper we present key features of the new algorithm and discuss how the new results differ from those released previously. We show that the new datasets have better internal consistency and also agree better with external datasets. A key feature of the V8 SBUV algorithm is that the climatology has no influence on inter-annual variability and trends; it only affects the mean values and, to a limited extent, the seasonal dependence. By contrast, climatology does have some influence on TOMS total O3 trends, particularly at large solar zenith angles. For this reason, and also because the TOMS record has gaps and EP/TOMS is suffering from data quality problems, we recommend using SBUV total ozone data for applications where the high spatial resolution of TOMS is not essential.