Deriving novel relationships from the scientific literature is an important adjunct to data-mining activities for complex datasets in genomics and high-throughput screening. Automated text-mining algorithms can be used to extract relevant content from the literature and...
Improve Data Mining and Knowledge Discovery Through the Use of MatLab
NASA Technical Reports Server (NTRS)
Shaykhian, Gholam Ali; Martin, Dawn (Elliott); Beil, Robert
2011-01-01
Data mining is widely used to mine business, engineering, and scientific data. Data mining uses pattern-based queries, searches, or other analyses of one or more electronic databases/datasets in order to discover or locate a predictive pattern or anomaly indicative of system failure, criminal or terrorist activity, etc. There are various algorithms, techniques and methods used to mine data, including neural networks, genetic algorithms, decision trees, the nearest neighbor method, rule induction, association analysis, slice and dice, segmentation, and clustering. These algorithms, techniques and methods for detecting patterns in a dataset have been used in the development of numerous open source and commercially available products and technology for data mining. Data mining is best realized when latent information stored in a large quantity of data is discovered. No one technique solves all data mining problems; the challenge is to select algorithms or methods appropriate to strengthen data/text mining and trending within given datasets. In recent years, throughout industry, academia and government agencies, thousands of data systems have been designed and tailored to serve specific engineering and business needs. Many of these systems use databases with relational algebra and structured query language to categorize and retrieve data. In these systems, data analyses are limited and require prior explicit knowledge of metadata and database relations, lacking exploratory data mining and discovery of latent information. This presentation introduces MatLab(R) (MATrix LABoratory), an engineering and scientific data analysis tool, to perform data mining. MatLab was originally intended to perform purely numerical calculations (a glorified calculator). Now, in addition to having hundreds of mathematical functions, it is a programming language with hundreds of built-in standard functions and numerous available toolboxes. MatLab's ease of data processing and visualization and its wide selection of built-in functions and toolboxes make it suitable for numerical computation and simulation as well as for data mining. Engineers and scientists can take advantage of the readily available functions/toolboxes to gain wider insight into their respective data mining experiments.
Distributed data mining on grids: services, tools, and applications.
Cannataro, Mario; Congiusta, Antonio; Pugliese, Andrea; Talia, Domenico; Trunfio, Paolo
2004-12-01
Data mining algorithms are widely used today for the analysis of large corporate and scientific datasets stored in databases and data archives. Industry, science, and commerce fields often need to analyze very large datasets maintained over geographically distributed sites by using the computational power of distributed and parallel systems. The grid can play a significant role in providing an effective computational support for distributed knowledge discovery applications. For the development of data mining applications on grids we designed a system called Knowledge Grid. This paper describes the Knowledge Grid framework and presents the toolset provided by the Knowledge Grid for implementing distributed knowledge discovery. The paper discusses how to design and implement data mining applications by using the Knowledge Grid tools starting from searching grid resources, composing software and data components, and executing the resulting data mining process on a grid. Some performance results are also discussed.
Proceedings: Fourth Workshop on Mining Scientific Datasets
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kamath, C
Commercial applications of data mining in areas such as e-commerce, market-basket analysis, text-mining, and web-mining have taken on a central focus in the KDD community. However, there is a significant amount of innovative data mining work taking place in the context of scientific and engineering applications that is not well represented in the mainstream KDD conferences. For example, scientific data mining techniques are being developed and applied to diverse fields such as remote sensing, physics, chemistry, biology, astronomy, structural mechanics, computational fluid dynamics, etc. In these areas, data mining frequently complements and enhances existing analysis methods based on statistics, exploratory data analysis, and domain-specific approaches. On the surface, it may appear that data from one scientific field, say genomics, is very different from another field, such as physics. However, despite their diversity, there is much that is common across the mining of scientific and engineering data. For example, techniques used to identify objects in images are very similar, regardless of whether the images came from a remote sensing application, a physics experiment, an astronomy observation, or a medical study. Further, with data mining being applied to new types of data, such as mesh data from scientific simulations, there is the opportunity to apply and extend data mining to new scientific domains. This one-day workshop brings together data miners analyzing science data and scientists from diverse fields to share their experiences, learn how techniques developed in one field can be applied in another, and better understand some of the newer techniques being developed in the KDD community. This is the fourth workshop on the topic of Mining Scientific Datasets; for information on earlier workshops, see http://www.ahpcrc.org/conferences/. This workshop continues the tradition of addressing challenging problems in a field where the diversity of applications is matched only by the opportunities that await a practitioner.
Blazing Signature Filter: a library for fast pairwise similarity comparisons
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lee, Joon-Yong; Fujimoto, Grant M.; Wilson, Ryan
Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impacts as they allow scientists to more completely explore the wealth of scientific data. A significant practical drawback of large-scale data mining is that the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to utilize the computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise comparisons. Two bioinformatics applications of the tool are presented to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach.
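A minimal sketch of the BSF idea, assuming nothing about the authors' implementation: binarize each profile, pack the bits, and use a cheap bitwise AND plus popcount as a coarse filter before any expensive similarity calculation. The data, the median-based binarization, and the threshold are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 64))             # 300 profiles, 64 features

# Binarize: 1 where a feature is above its column median ("signature" bits)
bits = (data > np.median(data, axis=0)).astype(np.uint8)
packed = np.packbits(bits, axis=1)            # pack 64 booleans into 8 bytes

def shared_bits(a, b):
    """Popcount of the bitwise AND: signature bits two rows share."""
    return int(np.unpackbits(a & b).sum())

# Coarse filter: keep only pairs sharing enough bits for a full comparison
threshold = 24
candidates = [(i, j)
              for i in range(len(packed))
              for j in range(i + 1, len(packed))
              if shared_bits(packed[i], packed[j]) >= threshold]
print(len(candidates), "of", 300 * 299 // 2, "pairs pass the coarse filter")
```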
SparkText: Biomedical Text Mining on Big Data Framework.
Ye, Zhan; Tafti, Ahmad P; He, Karen Y; Wang, Kai; He, Max M
Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.
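SparkText itself runs on Apache Spark with a Cassandra back end; as a rough stand-in, the sketch below shows the same classification task (TF-IDF features plus a linear SVM) in scikit-learn. The four-document corpus is fabricated for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["BRCA1 mutation carriers and mammography screening",
         "PSA velocity in prostate adenocarcinoma patients",
         "EGFR-targeted therapy in non-small cell lung cancer",
         "tamoxifen response in estrogen receptor positive breast tumors"]
labels = ["breast", "prostate", "lung", "breast"]

# TF-IDF unigrams/bigrams feeding a linear SVM, as in the paper's SVM model
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["gefitinib resistance in lung adenocarcinoma"]))
```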
Process mining in oncology using the MIMIC-III dataset
NASA Astrophysics Data System (ADS)
Prima Kurniati, Angelina; Hall, Geoff; Hogg, David; Johnson, Owen
2018-03-01
Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology, with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option, and this paper describes the potential of using MIMIC-III for process mining in oncology. MIMIC-III is a large open access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them have used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining, and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.
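A minimal sketch of the first step such a study needs: converting timestamped event records (of the kind MIMIC-III's event tables hold) into per-case activity traces and counting pathway variants. The table and its values are fabricated; the paper's L* lifecycle method involves much more than this.

```python
import pandas as pd

events = pd.DataFrame({
    "case_id":  [1, 1, 1, 2, 2, 2],
    "activity": ["admission", "chemotherapy", "discharge",
                 "admission", "surgery", "discharge"],
    "time":     pd.to_datetime(["2012-01-01", "2012-01-03", "2012-01-09",
                                "2012-02-01", "2012-02-02", "2012-02-12"]),
})

# Order each case's events in time and collapse them into a trace (tuple)
traces = (events.sort_values("time")
                .groupby("case_id")["activity"]
                .apply(tuple))
print(traces.value_counts())   # frequency of each pathway variant
```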
User Guidelines for the Brassica Database: BRAD.
Wang, Xiaobo; Cheng, Feng; Wang, Xiaowu
2016-01-01
The genome sequence of Brassica rapa was first released in 2011. Since then, further Brassica genomes have been sequenced or are undergoing sequencing. It is therefore necessary to develop tools that help users to mine information from genomic data efficiently. This will greatly aid scientific exploration and breeding applications, especially for those with low levels of bioinformatic training. Therefore, the Brassica database (BRAD) was built to collect, integrate, illustrate, and visualize Brassica genomic datasets. BRAD provides useful searching and data mining tools, and facilitates the search of gene annotation datasets, syntenic or non-syntenic orthologs, and flanking regions of functional genomic elements. It also includes genome-analysis tools such as BLAST and GBrowse. One of the important aims of BRAD is to build a bridge between Brassica crop genomes and the genome of the model species Arabidopsis thaliana, thus transferring the bulk of A. thaliana gene study information for use with newly sequenced Brassica crops.
Abandoned Uranium Mine (AUM) Points, Navajo Nation, 2016, US EPA Region 9
This GIS dataset contains point features of all Abandoned Uranium Mines (AUMs) on or within one mile of the Navajo Nation. Points are centroids developed from the Navajo Nation production mines polygon dataset, which comprises productive and unproductive Abandoned Uranium Mines. Attributes include mine names, aliases, links to AUM reports, indicators whether an AUM was mined above or below ground, indicators whether an AUM was mined above or below the local water table, and the region in which an AUM is located. This dataset contains 608 features.
NASA Astrophysics Data System (ADS)
Li, Y.; Jiang, Y.; Yang, C. P.; Armstrong, E. M.; Huang, T.; Moroni, D. F.; McGibbney, L. J.
2016-12-01
Big oceanographic data have been produced, archived and made available online, but finding the right data for scientific research and application development is still a significant challenge. A long-standing problem in data discovery is how to find the interrelationships between keywords and data, as well as the intrarelationships of the two individually. Most previous research attempted to solve this problem by building domain-specific ontology either manually or through automatic machine learning techniques. The former is costly, labor intensive and hard to keep up-to-date, while the latter is prone to noise and may be difficult for humans to understand. Large-scale user behavior data modelling represents a largely untapped, unique, and valuable source for discovering semantic relationships among domain-specific vocabulary. In this article, we propose a search engine framework for mining and utilizing dataset relevancy from oceanographic dataset metadata, user behaviors, and existing ontology. The objective is to improve the discovery accuracy of oceanographic data and reduce the time scientists spend discovering, downloading and reformatting data for their projects. Experiments and a search example show that the proposed search engine helps both scientists and general users search with better ranking results, recommendation, and ontology navigation.
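One way to read the user-behavior idea, sketched under the assumption that session logs record which datasets a user touched together: co-occurrence counts become a crude dataset-relevancy score. The session contents and dataset IDs below are fabricated.

```python
from collections import Counter
from itertools import combinations

sessions = [["sst_l4", "chlorophyll_a"],
            ["sst_l4", "sea_surface_height"],
            ["sst_l4", "chlorophyll_a", "wind_speed"]]

# Count how often each pair of datasets appears in the same session
co_counts = Counter()
for s in sessions:
    for a, b in combinations(sorted(set(s)), 2):
        co_counts[(a, b)] += 1

# Datasets most often used together are candidate "related data"
for pair, n in co_counts.most_common(3):
    print(pair, n)
```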
Abandoned Uranium Mine (AUM) Trust Mine Points, Navajo Nation, 2016, US EPA Region 9
This GIS dataset contains point features that represent mines included in the Navajo Environmental Response Trust. This mine category also includes Priority mines. USEPA and NNEPA prioritized mines based on gamma radiation levels, proximity to homes and potential for water contamination identified in the preliminary assessments. Attributes include mine names, reclaimed status, links to US EPA AUM reports, and the region in which the mine is located. This dataset contains 19 features.
APPLYING DATA MINING APPROACHES TO FURTHER ...
This dataset will be used to illustrate various data mining techniques to biologically profile the chemical space.
NASA Astrophysics Data System (ADS)
Mote, P.; Foster, J. G.; Daley-Laursen, S. B.
2014-12-01
The Northwest has the nation's strongest geographic, institutional, and scientific alignment between NOAA RISA, DOI Climate Science Center, USDA Climate Hub, and participating universities. Considering each of those institutions' distinct missions, funding structures, governance, stakeholder engagement, methods of priority-setting, and deliverables, it is a challenge to find areas of common interest and ways for these institutions to work together. In view of the rich history of stakeholder engagement and the deep base of previous research on climate change in the region, these institutions are cooperating in developing a regional capacity to mine the vast available data in ways that are mutually beneficial, synergistic, and regionally relevant. Fundamentally, data mining means exploring connections across and within multiple datasets using advanced statistical techniques, development of multidimensional indices, machine learning, and more. The challenge is not just what we do with big datasets, but how we integrate the wide variety and types of data coming out of scenario analyses to create knowledge and inform decision-making. Federal agencies and their partners need to learn to integrate big data on climate change and develop useful tools for important stakeholders to assist them in anticipating the main stresses of climate change on their own resources and preparing to abate those stresses.
Abandoned Uranium Mine (AUM) Surface Areas, Navajo Nation, 2016, US EPA Region 9
This GIS dataset contains polygon features that represent all Abandoned Uranium Mines (AUMs) on or within one mile of the Navajo Nation. Attributes include mine names, aliases, Potentially Responsible Parties, reclamation status, EPA mine status, links to AUM reports, and the region in which an AUM is located. This dataset contains 608 features.
Bray, Mark-Anthony; Gustafsdottir, Sigrun M; Rohban, Mohammad H; Singh, Shantanu; Ljosa, Vebjorn; Sokolnicki, Katherine L; Bittker, Joshua A; Bodycombe, Nicole E; Dančík, Vlado; Hasaka, Thomas P; Hon, Cindy S; Kemp, Melissa M; Li, Kejie; Walpita, Deepika; Wawer, Mathias J; Golub, Todd R; Schreiber, Stuart L; Clemons, Paul A; Shamji, Alykhan F
2017-01-01
Background Large-scale image sets acquired by automated microscopy of perturbed samples enable a detailed comparison of cell states induced by each perturbation, such as a small molecule from a diverse library. Highly multiplexed measurements of cellular morphology can be extracted from each image and subsequently mined for a number of applications. Findings This microscopy dataset includes 919 265 five-channel fields of view, representing 30 616 tested compounds, available at “The Cell Image Library” (CIL) repository. It also includes data files containing morphological features derived from each cell in each image, both at the single-cell level and population-averaged (i.e., per-well) level; the image analysis workflows that generated the morphological features are also provided. Quality-control metrics are provided as metadata, indicating fields of view that are out-of-focus or containing highly fluorescent material or debris. Lastly, chemical annotations are supplied for the compound treatments applied. Conclusions Because computational algorithms and methods for handling single-cell morphological measurements are not yet routine, the dataset serves as a useful resource for the wider scientific community applying morphological (image-based) profiling. The dataset can be mined for many purposes, including small-molecule library enrichment and chemical mechanism-of-action studies, such as target identification. Integration with genetically perturbed datasets could enable identification of small-molecule mimetics of particular disease- or gene-related phenotypes that could be useful as probes or potential starting points for development of future therapeutics. PMID:28327978
Mining high-throughput experimental data to link gene and function
Blaby-Haas, Crysten E.; de Crécy-Lagard, Valérie
2011-01-01
Nearly 2200 genomes encoding some 6 million proteins have now been sequenced. Around 40% of these proteins are of unknown function even when function is loosely and minimally defined as “belonging to a superfamily”. In addition to in silico methods, the swelling stream of high-throughput experimental data can give valuable clues for linking these “unknowns” with precise biological roles. The goal is to develop integrative data-mining platforms that allow the scientific community at large to access and utilize this rich source of experimental knowledge. To this end, we review recent advances in generating whole-genome experimental datasets, where this data can be accessed, and how it can be used to drive prediction of gene function. PMID:21310501
Determining similarity of scientific entities in annotation datasets
Palma, Guillermo; Vidal, Maria-Esther; Haag, Eric; Raschid, Louiqa; Thor, Andreas
2015-01-01
Linked Open Data initiatives have made available a diversity of scientific collections where scientists have annotated entities in the datasets with controlled vocabulary terms from ontologies. Annotations encode scientific knowledge, which is captured in annotation datasets. Determining relatedness between annotated entities becomes a building block for pattern mining, e.g. identifying drug–drug relationships may depend on the similarity of the targets that interact with each drug. A diversity of similarity measures has been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including the name, function, relationships with other entities, taxonomic neighborhood and semantic knowledge. We propose a novel general-purpose annotation similarity measure called ‘AnnSim’ that measures the relatedness between two entities based on the similarity of their annotations. We model AnnSim as a 1–1 maximum weight bipartite match and exploit properties of existing solvers to provide an efficient solution. We empirically study the performance of AnnSim on real-world datasets of drugs and disease associations from clinical trials and relationships between drugs and (genomic) targets. Using baselines that include a variety of measures, we identify where AnnSim can provide a deeper understanding of the semantics underlying the relatedness of a pair of entities or where it could lead to predicting new links or identifying potential novel patterns. Although AnnSim does not exploit knowledge or properties of a particular domain, its performance compares well with a variety of state-of-the-art domain-specific measures. Database URL: http://www.yeastgenome.org/ PMID:25725057
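A minimal sketch of AnnSim's core step as described: relate two entities by a 1-1 maximum-weight bipartite matching over their annotations, solved here with scipy's assignment solver. The term-similarity matrix and the normalization by the larger annotation set are illustrative assumptions; the paper derives both from ontology structure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# sim[i, j] = similarity between annotation i of entity A and j of entity B
# (fabricated values; the paper computes these from ontology knowledge)
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.4]])

rows, cols = linear_sum_assignment(-sim)        # maximize total matched weight
annsim = sim[rows, cols].sum() / max(sim.shape)  # one plausible normalization
print(round(annsim, 3))
```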
Boosting association rule mining in large datasets via Gibbs sampling.
Qian, Guoqi; Rao, Calyampudi Radhakrishna; Sun, Xiaoying; Wu, Yuehua
2016-05-03
Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and perform rule mining from the reduced transaction dataset generated by the sample. Also a general rule importance measure is proposed to direct the stochastic search so that, as a result of the randomly generated association rules constituting an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In the simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.
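A toy sketch, not the paper's procedure: a Metropolis-style random walk over the itemset lattice whose acceptance ratio favors higher support, so frequently co-occurring itemsets are visited most often. The transactions are fabricated, and the general rule-importance measure is reduced here to plain support.

```python
import random

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"},
                {"a", "c"}, {"a", "b", "c"}]
items = sorted(set().union(*transactions))

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

random.seed(0)
current, visits = {"a"}, {}
for _ in range(2000):
    proposal = current ^ {random.choice(items)}   # toggle one item
    if proposal and random.random() < support(proposal) / max(support(current), 1e-9):
        current = proposal                        # Metropolis-style acceptance
    visits[frozenset(current)] = visits.get(frozenset(current), 0) + 1

# Most-visited itemsets approximate the most important ones under this score
print(sorted(visits.items(), key=lambda kv: -kv[1])[:3])
```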
Olorisade, Babatunde Kazeem; Brereton, Pearl; Andras, Peter
2017-09-01
Independent validation of published scientific results through study replication is a pre-condition for accepting the validity of such results. In computational research, full replication is often unrealistic for independent results validation; therefore, study reproduction has been justified as the minimum acceptable standard to evaluate the validity of scientific claims. The application of text mining techniques to citation screening in the context of systematic literature reviews is a relatively young and growing computational field with high relevance for software engineering, medical research and other fields. However, there is little work so far on reproduction studies in the field. In this paper, we investigate the reproducibility of studies in this area based on information contained in published articles, and we propose reporting guidelines that could improve reproducibility. The study was approached in two ways. Initially we attempted to reproduce results from six studies, which were based on the same raw dataset. Then, based on this experience, we identified steps considered essential to successful reproduction of text mining experiments and characterized them to measure how reproducible a study is given the information provided on these steps. Thirty-three articles were systematically assessed for reproducibility using this approach. Our work revealed that it is currently difficult if not impossible to independently reproduce the results published in any of the studies investigated. The lack of information about the datasets used limits reproducibility of about 80% of the studies assessed. Also, information about the machine learning algorithms is inadequate in about 27% of the papers. On the plus side, the third-party software tools used are mostly free and available. The reproducibility potential of most of the studies can be significantly improved if more attention is paid to information provided on the datasets used, how they were partitioned and utilized, and how any randomization was controlled. We introduce a checklist of information that needs to be provided in order to ensure that a published study can be reproduced. Copyright © 2017 Elsevier Inc. All rights reserved.
Finding Spatio-Temporal Patterns in Large Sensor Datasets
ERIC Educational Resources Information Center
McGuire, Michael Patrick
2010-01-01
Spatial or temporal data mining tasks are performed in the context of the relevant space, defined by a spatial neighborhood, and the relevant time period, defined by a specific time interval. Furthermore, when mining large spatio-temporal datasets, interesting patterns typically emerge where the dataset is most dynamic. This dissertation is…
Scalable Machine Learning for Massive Astronomical Datasets
NASA Astrophysics Data System (ADS)
Ball, Nicholas M.; Gray, A.
2014-04-01
We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors. This is likely of particular interest to the radio astronomy community given, for example, that survey projects contain groups dedicated to this topic. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.
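A minimal sketch of one listed operation, nearest-neighbor-based outlier scoring, on synthetic data standing in for a photometric catalog; it uses scikit-learn rather than the Skytree Server software described.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
catalog = rng.normal(size=(10000, 3))        # stand-in for J, H, K magnitudes

# Outlier score: mean distance to the 5 nearest other objects
nn = NearestNeighbors(n_neighbors=6).fit(catalog)
dists, _ = nn.kneighbors(catalog)            # column 0 is the point itself
score = dists[:, 1:].mean(axis=1)
print("most isolated objects:", np.argsort(score)[-5:])
```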
Knowledge mining from clinical datasets using rough sets and backpropagation neural network.
Nahato, Kindie Biredagn; Harichandran, Khanna Nehemiah; Arputharaj, Kannan
2015-01-01
The availability of clinical datasets and knowledge mining methodologies encourages researchers to pursue research in extracting knowledge from clinical datasets. Different data mining techniques have been used for mining rules, and mathematical models have been developed to assist the clinician in decision making. The objective of this research is to build a classifier that will predict the presence or absence of a disease by learning from the minimal set of attributes that has been extracted from the clinical dataset. In this work the rough set indiscernibility relation method with a backpropagation neural network (RS-BPNN) is used. This work has two stages. The first stage is handling of missing values to obtain a smooth data set and selection of appropriate attributes from the clinical dataset by the indiscernibility relation method. The second stage is classification using a backpropagation neural network on the selected reducts of the dataset. The classifier has been tested with hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI) machine learning repository. The accuracy obtained from the proposed method is 97.3%, 98.6%, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. The proposed system provides an effective classification model for clinical datasets.
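A rough sketch of the two-stage design, with an openly substituted first stage: mutual-information feature selection stands in for the rough-set indiscernibility reduct, followed by a backpropagation network (scikit-learn's MLPClassifier) on the reduced attributes. The Wisconsin breast cancer dataset ships with scikit-learn.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Stage 1: attribute reduction (proxy for the rough-set reduct);
# Stage 2: backpropagation network on the selected attributes
model = make_pipeline(StandardScaler(),
                      SelectKBest(mutual_info_classif, k=10),
                      MLPClassifier(hidden_layer_sizes=(16,),
                                    max_iter=2000, random_state=0))
model.fit(Xtr, ytr)
print("test accuracy:", round(model.score(Xte, yte), 3))
```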
Abar, Orhan; Charnigo, Richard J.; Rayapati, Abner
2017-01-01
Association rule mining has received significant attention from both the data mining and machine learning communities. While data mining researchers focus more on designing efficient algorithms to mine rules from large datasets, the learning community has explored applications of rule mining to classification. A major problem with rule mining algorithms is the explosion of rules even for moderate sized datasets, making it very difficult for end users to identify both statistically significant and potentially novel rules that could lead to interesting new insights and hypotheses. Researchers have proposed many domain independent interestingness measures with which one can rank the rules and potentially glean useful rules from the top ranked ones. However, these measures have not been fully explored for rule mining in clinical datasets owing to the relatively large sizes of the datasets often encountered in healthcare and also due to limited access to domain experts for review/analysis. In this paper, using an electronic medical record (EMR) dataset of diagnoses and medications from over three million patient visits to the University of Kentucky medical center and affiliated clinics, we conduct a thorough evaluation of dozens of interestingness measures proposed in data mining literature, including some new composite measures. Using cumulative relevance metrics from information retrieval, we compare these interestingness measures against human judgments obtained from a practicing psychiatrist for association rules involving the depressive disorders class as the consequent. Our results not only surface new interesting associations for depressive disorders but also indicate classes of interestingness measures that weight rule novelty and statistical strength in contrasting ways, offering new insights for end users in identifying interesting rules. PMID:28736771
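For concreteness, a minimal sketch of three classic interestingness measures such studies compare, computed for a rule A -> B from co-occurrence counts; the counts below are fabricated.

```python
def measures(n_ab, n_a, n_b, n):
    """Lift, leverage, and conviction for rule A -> B from raw counts."""
    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    conf = p_ab / p_a                          # confidence P(B|A)
    return {
        "lift": p_ab / (p_a * p_b),            # >1 means positive association
        "leverage": p_ab - p_a * p_b,          # difference from independence
        "conviction": (1 - p_b) / (1 - conf) if conf < 1 else float("inf"),
    }

# e.g. 120 visits with both a diagnosis A and a medication B, out of 10,000
print(measures(n_ab=120, n_a=400, n_b=900, n=10_000))
```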
Use Hierarchical Storage and Analysis to Exploit Intrinsic Parallelism
NASA Astrophysics Data System (ADS)
Zender, C. S.; Wang, W.; Vicente, P.
2013-12-01
Big Data is an ugly name for the scientific opportunities and challenges created by the growing wealth of geoscience data. How to weave large, disparate datasets together to best reveal their underlying properties, to exploit their strengths and minimize their weaknesses, to continually aggregate more information than the world knew yesterday and less than we will learn tomorrow? Data analytics techniques (statistics, data mining, machine learning, etc.) can accelerate pattern recognition and discovery. However, often researchers must, prior to analysis, organize multiple related datasets into a coherent framework. Hierarchical organization permits an entire dataset to be stored in nested groups that reflect its intrinsic relationships and similarities. Hierarchical data can be simpler and faster to analyze by coding operators to automatically parallelize processes over isomorphic storage units, i.e., groups. The newest generation of netCDF Operators (NCO) embodies this hierarchical approach, while still supporting traditional analysis approaches. We will use NCO to demonstrate the trade-offs involved in processing a prototypical Big Data application (analysis of CMIP5 datasets) using hierarchical and traditional analysis approaches.
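A minimal sketch of the hierarchical idea using the netCDF4 Python library rather than NCO: every group is an isomorphic storage unit, so one recursive walk applies the same reduction to each. The tiny file built here is a stand-in for a CMIP5-style multi-model archive.

```python
import netCDF4
import numpy as np

# Build a tiny two-group file so the walk below is self-contained
with netCDF4.Dataset("demo.nc", "w") as nc:
    for model in ("modelA", "modelB"):        # isomorphic groups
        g = nc.createGroup(model)
        g.createDimension("time", 4)
        v = g.createVariable("tas", "f4", ("time",))
        v[:] = np.random.rand(4)

def walk(group, path=""):
    """Apply the same reduction (here, a mean) to every group's variables."""
    for name, var in group.variables.items():
        print(f"{path}/{name}: mean={var[:].mean():.3f}")
    for sub_name, sub in group.groups.items():
        walk(sub, f"{path}/{sub_name}")       # recurse into nested groups

with netCDF4.Dataset("demo.nc") as nc:
    walk(nc)
```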
Analysis and Visualization of ChIP-Seq and RNA-Seq Sequence Alignments Using ngs.plot.
Loh, Yong-Hwee Eddie; Shen, Li
2016-01-01
The continual maturation and increasing applications of next-generation sequencing technology in scientific research have yielded ever-increasing amounts of data that need to be effectively and efficiently analyzed and innovatively mined for new biological insights. We have developed ngs.plot, a quick and easy-to-use bioinformatics tool that performs visualizations of the spatial relationships between sequencing alignment enrichment and specific genomic features or regions. More importantly, ngs.plot is customizable beyond the use of standard genomic feature databases to allow the analysis and visualization of user-specified regions of interest generated by the user's own hypotheses. In this protocol, we demonstrate and explain the use of ngs.plot using command-line executions, as well as a web-based workflow on the Galaxy framework. We replicate the underlying commands used in the analysis of a true biological dataset that we had reported and published earlier and demonstrate how ngs.plot can easily generate publication-ready figures. With ngs.plot, users are able to efficiently and innovatively mine their own datasets without having to be involved in the technical aspects of sequence coverage calculations and genomic databases.
MetReS, an Efficient Database for Genomic Applications.
Vilaplana, Jordi; Alves, Rui; Solsona, Francesc; Mateo, Jordi; Teixidó, Ivan; Pifarré, Marc
2018-02-01
MetReS (Metabolic Reconstruction Server) is a genomic database that is shared between two software applications that address important biological problems. Biblio-MetReS is a data-mining tool that enables the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the processes of interest and their function. The main goal of this work was to identify the areas where the performance of the MetReS database could be improved and to test whether this improvement would scale to larger datasets and more complex types of analysis. The study was started with a relational database, MySQL, which is the current database server used by the applications. We also tested the performance of an alternative data-handling framework, Apache Hadoop. Hadoop is currently used for large-scale data processing. We found that this data-handling framework is likely to greatly improve the efficiency of the MetReS applications as the dataset and the processing needs increase by several orders of magnitude, as is expected to happen in the near future.
Active Storage with Analytics Capabilities and I/O Runtime System for Petascale Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Choudhary, Alok
Computational scientists must understand results from experimental, observational and computational simulation generated data to gain insights and perform knowledge discovery. As systems approach the petascale range, problems that were unimaginable a few years ago are within reach. With the increasing volume and complexity of data produced by ultra-scale simulations and high-throughput experiments, understanding the science is largely hampered by the lack of comprehensive I/O, storage, acceleration of data manipulation, analysis, and mining tools. Scientists require techniques, tools and infrastructure to facilitate better understanding of their data, in particular the ability to effectively perform complex data analysis, statistical analysis and knowledge discovery. The goal of this work is to enable more effective analysis of scientific datasets through the integration of enhancements in the I/O stack, from active storage support at the file system layer to MPI-IO and high-level I/O library layers. We propose to provide software components to accelerate data analytics, mining, I/O, and knowledge discovery for large-scale scientific applications, thereby increasing the productivity of both scientists and the systems. Our approaches include 1) designing the interfaces in high-level I/O libraries, such as parallel netCDF, for applications to activate data mining operations at the lower I/O layers; 2) enhancing MPI-IO runtime systems to incorporate the functionality developed as a part of the runtime system design; 3) developing parallel data mining programs as part of the runtime library for the server-side file system in the PVFS file system; and 4) prototyping an active storage cluster, which will utilize multicore CPUs, GPUs, and FPGAs to carry out the data mining workload.
Automated Analysis of Renewable Energy Datasets ('EE/RE Data Mining')
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bush, Brian; Elmore, Ryan; Getman, Dan
This poster illustrates methods to substantially improve the understanding of renewable energy data sets and the depth and efficiency of their analysis through the application of statistical learning methods ('data mining') in the intelligent processing of these often large and messy information sources. The six examples apply methods for anomaly detection, data cleansing, and pattern mining to time-series data (measurements from metering points in buildings) and spatiotemporal data (renewable energy resource datasets).
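A minimal sketch of the kind of time-series anomaly detection the poster describes, assuming meter readings as input: flag points that deviate strongly from a rolling median. The series and thresholds are synthetic and illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
load = pd.Series(10 + np.sin(np.linspace(0, 20, 500))
                 + rng.normal(0, 0.2, 500))
load.iloc[[100, 300]] = [20, -5]                 # inject two faults

# Residual from a rolling median; large residuals are flagged as anomalies
resid = (load - load.rolling(25, center=True, min_periods=1).median()).abs()
anomalies = load[resid > 5 * resid.median()]
print(anomalies)
```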
A Big Data Guide to Understanding Climate Change: The Case for Theory-Guided Data Science.
Faghmous, James H; Kumar, Vipin
2014-09-01
Global climate change and its impact on human life has become one of our era's greatest challenges. Despite the urgency, data science has had little impact on furthering our understanding of our planet in spite of the abundance of climate data. This is a stark contrast from other fields such as advertising or electronic commerce where big data has been a great success story. This discrepancy stems from the complex nature of climate data as well as the scientific questions climate science brings forth. This article introduces a data science audience to the challenges and opportunities of mining large climate datasets, with an emphasis on the nuanced difference between mining climate data and traditional big data approaches. We focus on data, methods, and application challenges that must be addressed in order for big data to fulfill their promise with regard to climate science applications. More importantly, we highlight research showing that solely relying on traditional big data techniques results in dubious findings, and we instead propose a theory-guided data science paradigm that uses scientific theory to constrain both the big data techniques and the results-interpretation process to extract accurate insight from large climate data.
Biomedical text mining and its applications in cancer research.
Zhu, Fei; Patumcharoenpol, Preecha; Zhang, Cheng; Yang, Yang; Chan, Jonathan; Meechai, Asawin; Vongsangnak, Wanwipa; Shen, Bairong
2013-04-01
Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over 100 years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer has led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing how much these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and we also provide some resources for cancer text mining. With the development of systems biology, researchers tend to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work of this field; (ii) help researchers to choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research. Copyright © 2012 Elsevier Inc. All rights reserved.
A review of contrast pattern based data mining
NASA Astrophysics Data System (ADS)
Zhu, Shiwei; Ju, Meilong; Yu, Junfeng; Cai, Binlei; Wang, Aiping
2015-07-01
Contrast pattern based data mining is concerned with the mining of patterns and models that contrast two or more datasets. Contrast patterns can describe similarities or differences between the datasets. They represent strong contrast knowledge and have been shown to be very successful for constructing accurate and robust clusters and classifiers. The increasing use of contrast pattern data mining has initiated a great deal of research and development attempts in the field of data mining. A comprehensive review of existing contrast pattern based data mining research is given in this paper. The work is categorized into background and representation, definitions and mining algorithms, contrast pattern based classification, clustering, other applications, and future research trends. The primary aim of this paper is to serve as a glossary for interested researchers, giving them an overall picture of current contrast based data mining development and helping them identify potential research directions for future investigation.
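For concreteness, a minimal sketch of the central contrast-pattern quantity, the growth rate of a pattern between two datasets; patterns whose support differs sharply across datasets carry the strong contrast knowledge the review discusses. The data are fabricated.

```python
def support(pattern, dataset):
    """Fraction of rows (itemsets) that contain the whole pattern."""
    return sum(pattern <= row for row in dataset) / len(dataset)

d1 = [{"fever", "cough"}, {"fever", "cough", "fatigue"}, {"cough"}]
d2 = [{"fatigue"}, {"cough"}, {"fatigue", "headache"}]

pattern = {"fever", "cough"}
s1, s2 = support(pattern, d1), support(pattern, d2)
growth = float("inf") if s2 == 0 else s1 / s2    # infinite: emerging pattern
print(f"support D1={s1:.2f}, D2={s2:.2f}, growth rate={growth}")
```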
Morrison, James J; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L
2015-02-01
Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist, including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was converted into Structured Query Language (SQL) tables hosted on a web server, and a web-based JavaScript application was developed which performs real-time queries. JavaScript is used as both the server-side and client-side language, allowing for rapid development of a robust client interface and server-side data layer. Real-time data mining of user-specified patient cohorts achieved a rapid return of cohort cancer statistics and lung nodule distribution information. This system demonstrates the potential of individualized real-time data mining using large high-quality clinical trial datasets to drive evidence-based clinical decision-making.
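A minimal sketch of the query pattern described, using SQLite in place of the paper's web-hosted SQL server; the schema, column names, and rows are illustrative, not the NLST's.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE participants (age INT, pack_years REAL, "
           "nodule_mm REAL, cancer INT)")
db.executemany("INSERT INTO participants VALUES (?,?,?,?)",
               [(62, 35, 8.0, 0), (66, 40, 9.5, 1), (58, 30, 4.0, 0)])

# Cohort matching: filter on demographics and nodule size, return statistics
row = db.execute("""SELECT COUNT(*), AVG(cancer) FROM participants
                    WHERE age BETWEEN ? AND ? AND nodule_mm BETWEEN ? AND ?""",
                 (60, 70, 6.0, 12.0)).fetchone()
print(f"matched cohort: n={row[0]}, cancer rate={row[1]:.2f}")
```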
BCCTBbp: the Breast Cancer Campaign Tissue Bank bioinformatics portal.
Cutts, Rosalind J; Guerra-Assunção, José Afonso; Gadaleta, Emanuela; Dayem Ullah, Abu Z; Chelala, Claude
2015-01-01
BCCTBbp (http://bioinformatics.breastcancertissuebank.org) was initially developed as the data-mining portal of the Breast Cancer Campaign Tissue Bank (BCCTB), a vital resource of breast cancer tissue for researchers to support and promote cutting-edge research. BCCTBbp is dedicated to maximising research on patient tissues by initially storing genomics, methylomics, transcriptomics, proteomics and microRNA data that has been mined from the literature and linking to pathways and mechanisms involved in breast cancer. Currently, the portal holds 146 datasets comprising over 227,795 expression/genomic measurements from various breast tissues (e.g. normal, malignant or benign lesions), cell lines and body fluids. BCCTBbp can be used to build on breast cancer knowledge and maximise the value of existing research. By recording a large number of annotations on samples and studies, and linking to other databases, such as NCBI, Ensembl and Reactome, a wide variety of different investigations can be carried out. Additionally, BCCTBbp has a dedicated analytical layer allowing researchers to further analyse stored datasets. A future important role for BCCTBbp is to make available all data generated on BCCTB tissues thus building a valuable resource of information on the tissues in BCCTB that will save repetition of experiments and expand scientific knowledge. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Yu, Yao; Tu, Kang; Zheng, Siyuan; Li, Yun; Ding, Guohui; Ping, Jie; Hao, Pei; Li, Yixue
2009-08-25
In the post-genomic era, the development of high-throughput gene expression detection technology provides huge amounts of experimental data, which challenges the traditional pipelines for data processing and analysis in scientific research. In our work, we integrated gene expression information from Gene Expression Omnibus (GEO), biomedical ontology from Medical Subject Headings (MeSH) and signaling pathway knowledge from sigPathway entries to develop a context mining tool for gene expression analysis - GEOGLE. GEOGLE offers a rapid and convenient way to search relevant experimental datasets, pathways and biological terms according to multiple types of queries, including biomedical vocabularies, GDS IDs, gene IDs, pathway names and signature lists. Moreover, GEOGLE summarizes the signature genes from a subset of GDSes and estimates the correlation between gene expression and the phenotypic distinction with an integrated p value. This approach, performing global searches of expression data, may expand the traditional way of collecting heterogeneous gene expression experiment data. GEOGLE is a novel tool that provides researchers a quantitative way to understand the correlation between gene expression and phenotypic distinction through meta-analysis of gene expression datasets from different experiments, as well as the biological meaning behind it. The web site and user guide of GEOGLE are available at: http://omics.biosino.org:14000/kweb/workflow.jsp?id=00020.
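The abstract does not specify how the integrated p value is computed; a common choice, shown here purely as an assumption, is Fisher's method for combining one p value per dataset.

```python
from scipy.stats import combine_pvalues

# One test result per GEO dataset for the same gene (fabricated values)
p_per_dataset = [0.04, 0.20, 0.01, 0.08]

stat, p_combined = combine_pvalues(p_per_dataset, method="fisher")
print(f"integrated p value: {p_combined:.4f}")
```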
ERIC Educational Resources Information Center
Schoech, Dick; Quinn, Andrew; Rycraft, Joan R.
2000-01-01
Examines the historical and larger context of data mining and describes data mining processes, techniques, and tools. Illustrates these using a child welfare dataset concerning employee turnover, which is mined using logistic regression and a Bayesian neural network. Discusses the data mining process, the resulting models, their predictive…
Confidentiality issues for medical data miners.
Berman, Jules J
2002-01-01
The first task in any medical data mining effort is ensuring patient confidentiality. In the past, most data mining efforts ensured confidentiality by the dubious policy of withholding their raw data from colleagues and the public. A cursory review of medical informatics literature in the past decade reveals that much of what we have "learned" consists of assertions derived from confidential datasets unavailable for anyone's review. Without access to the original data, it is impossible to validate or improve upon a researcher's conclusions. Without access to research data, we are asked to accept findings as an act of faith, rather than as a scientific conclusion. This special issue of Artificial Intelligence in Medicine is devoted to medical data mining. The medical data miner has an obligation to conduct valid research in a way that protects human subjects. Today, data miners have the technical tools to merge large data collections and to distribute queries over disparate databases. In order to include patient-related data in shared databases, data miners will need methods to anonymize and deidentify data. This article reviews the human subject risks associated with medical data mining. This article also describes some of the innovative computational remedies that will permit researchers to conduct research AND share their data without risk to patient or institution.
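A minimal sketch of one deidentification technique consistent with the article's argument: replacing identifiers with salted one-way hashes so records stay linkable across tables without exposing identity. This is illustrative only; real deidentification must also handle quasi-identifiers and regulatory requirements such as HIPAA.

```python
import hashlib

SALT = b"site-secret-salt"   # kept private by the data holder

def pseudonym(patient_id: str) -> str:
    """Salted one-way hash: stable pseudonym, not reversible without the salt."""
    return hashlib.sha256(SALT + patient_id.encode()).hexdigest()[:16]

record = {"patient_id": "MRN-0042731", "dx": "J45.909"}
shared = {"pid": pseudonym(record["patient_id"]), "dx": record["dx"]}
print(shared)
```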
Phobos spectral clustering: first results using the MRO-CRISM 0.4-2.5 micron dataset
NASA Astrophysics Data System (ADS)
Pajola, M.; Roush, T. L.; Marzo, G. A.; Simioni, E.
2016-12-01
Whether Phobos is a captured asteroid or formed in situ around Mars is still an outstanding question within the scientific community. The proposed Japanese Mars Moon eXploration (MMX) sample return mission has the chief scientific objective of solving this conundrum, reaching Phobos in the early 2020s and returning Phobos samples to Earth a few years later. Nonetheless, well before surface samples are returned to Earth, there are important spectral datasets that can be mined in order to constrain Phobos' surface properties and address implications regarding Phobos' origin. One of these is the MRO-CRISM multispectral observations of Phobos. The MRO-CRISM visible and infrared observations (0.4-2.5 micron) are here corrected for the incidence and emission angles of the observation. Unlike previous studies of the MRO-CRISM data that selected specific regions for analysis, we apply a statistical technique that identifies different clusters based on a K-means partitioning algorithm. Selecting specific wavelength ranges of Phobos' reflectance spectra permits identification of possible mineralogical compounds and their spatial distribution on the surface of Phobos. This work paves the way to a deeper analysis of the available dataset regarding Phobos, potentially identifying regions of interest on the surface that may warrant more detailed investigation by the MMX mission as potential sampling areas. Acknowledgments: M. Pajola was supported for this research by an appointment to the NASA Postdoctoral Program at the Ames Research Center administered by USRA.
NASA Technical Reports Server (NTRS)
Robinson, Julie A.; Slack, Kelley; Olson, V.; Trenchard, M.; Willis, K.; Baskin, P.
2006-01-01
This viewgraph presentation asks the question "Is the observation of Earth from the ISS a positive (salutogenic) experience for crew members?" All images are distributed to the public via the Gateway to Astronaut Photography of Earth at http://eol.jsc.nasa.gov. The objectives of the study are to (1) mine the dataset of Earth observation photography to learn what it can tell us about the importance of viewing the Earth as a positive experience for the crewmembers; (2) quantify the extent to which photography was self-initiated (not requested by scientists); and (3) identify patterns in self-initiated photography activities versus scientifically requested photography.
Data Mining and Optimization Tools for Developing Engine Parameters Tools
NASA Technical Reports Server (NTRS)
Dhawan, Atam P.
1998-01-01
This project was awarded for understanding the problem and developing a plan for Data Mining tools for use in designing and implementing an Engine Condition Monitoring System. Tricia Erhardt and I studied the problem domain for developing an Engine Condition Monitoring system using the sparse and non-standardized datasets to be made available through a consortium at NASA Lewis Research Center. We visited NASA three times to discuss additional issues related to the dataset, which was not made available to us. We discussed and developed a general framework of data mining and optimization tools to extract useful information from sparse and non-standard datasets. These discussions led to the training of Tricia Erhardt to develop Genetic Algorithm based search programs, written in C++, that were used to demonstrate the capability of the GA in searching for an optimal solution in noisy datasets. From the study and discussions with NASA LeRC personnel, we then prepared a proposal, which is being submitted to NASA for future work on the development of data mining algorithms for engine condition monitoring. The proposed set of algorithms uses wavelet processing to create a multi-resolution pyramid of the data for a GA based multi-resolution optimal search.
Rare itemsets mining algorithm based on RP-Tree and spark framework
NASA Astrophysics Data System (ADS)
Liu, Sainan; Pan, Haoan
2018-05-01
To address rare itemset mining in big data, this paper proposes a rare itemsets mining algorithm based on RP-Tree and the Spark framework. First, it arranges the data vertically by transaction identifier; to avoid repeatedly scanning the entire dataset, the vertical dataset is divided into a frequent vertical dataset and a rare vertical dataset. Then, it adopts the RP-Tree algorithm to construct the frequent pattern tree that contains rare items and to generate rare 1-itemsets. After that, it calculates the support of the itemsets by scanning the two vertical datasets; finally, it uses an iterative process to generate rare itemsets. The experiments show that the algorithm can effectively mine rare itemsets and offers a clear advantage in execution time.
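A minimal sketch of the vertical (tid-list) representation the paper builds on: each item maps to the set of transaction identifiers containing it, so the support of an itemset is the size of the intersection of its members' tid-lists, with no rescan of the raw data (the transactions are illustrative; the RP-Tree and Spark distribution steps are not reproduced).

```python
from functools import reduce

transactions = {
    1: {"a", "b", "c"},
    2: {"a", "c"},
    3: {"b", "d"},
    4: {"a", "c", "d"},
}

# Vertical layout: item -> set of transaction ids (tid-list).
vertical = {}
for tid, items in transactions.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

def support(itemset):
    """Support of an itemset = size of the intersection of its tid-lists."""
    tids = reduce(set.intersection, (vertical[i] for i in itemset))
    return len(tids) / len(transactions)

print(support({"a", "c"}))  # 0.75: appears in transactions 1, 2 and 4
```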
A Big Data Guide to Understanding Climate Change: The Case for Theory-Guided Data Science
Kumar, Vipin
2014-01-01
Global climate change and its impact on human life has become one of our era's greatest challenges. Despite the urgency and the abundance of climate data, data science has had little impact on furthering our understanding of our planet. This is a stark contrast from other fields such as advertising or electronic commerce where big data has been a great success story. This discrepancy stems from the complex nature of climate data as well as the scientific questions climate science brings forth. This article introduces a data science audience to the challenges and opportunities to mine large climate datasets, with an emphasis on the nuanced difference between mining climate data and traditional big data approaches. We focus on data, methods, and application challenges that must be addressed in order for big data to fulfill their promise with regard to climate science applications. More importantly, we highlight research showing that solely relying on traditional big data techniques results in dubious findings, and we instead propose a theory-guided data science paradigm that uses scientific theory to constrain both the big data techniques as well as the results-interpretation process to extract accurate insight from large climate data. PMID:25276499
Konias, Sokratis; Chouvarda, Ioanna; Vlahavas, Ioannis; Maglaveras, Nicos
2005-09-01
Current approaches for mining association rules usually assume that the mining is performed on a static database, where the problem of missing attribute values practically does not exist. However, these assumptions do not hold in some medical databases, such as a home care system. In this paper, a novel uncertainty rule algorithm is illustrated, namely URG-2 (Uncertainty Rule Generator), which addresses the problem of mining dynamic databases containing missing values. This algorithm requires only one pass over the initial dataset in order to generate the item set, while new metrics corresponding to the notions of Support and Confidence are used. URG-2 was evaluated on two medical databases, randomly introducing multiple missing values for each record's attributes (rate: 5-20% in 5% increments) in the initial dataset. Compared with the classical approach (records with missing values are ignored), the proposed algorithm was more robust in mining rules from datasets containing missing values. In all cases, the difference in preserving the initial rules ranged between 30% and 60% in favour of URG-2. Moreover, due to its incremental nature, URG-2 saved over 90% of the time required for thorough re-mining. Thus, the proposed algorithm can offer a preferable solution for mining in dynamic relational databases.
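URG-2's exact metrics are not given in the abstract; the sketch below only illustrates the general idea of adjusting support so that records missing the relevant attributes are excluded from the denominator rather than dropped from the dataset entirely (the records and field names are hypothetical).

```python
records = [
    {"age": "old", "bp": "high", "risk": "yes"},
    {"age": "old", "bp": None,   "risk": "yes"},   # missing blood pressure
    {"age": "young", "bp": "low", "risk": "no"},
    {"age": "old", "bp": "high", "risk": None},    # missing outcome
]

def adjusted_support(rule_attrs, rule_values):
    """Count matches only among records where every attribute is present."""
    usable = [r for r in records if all(r[a] is not None for a in rule_attrs)]
    hits = [r for r in usable
            if all(r[a] == v for a, v in zip(rule_attrs, rule_values))]
    return len(hits) / len(usable) if usable else 0.0

# Support of {age=old, bp=high} over records where both fields are known.
print(adjusted_support(["age", "bp"], ["old", "high"]))  # 2/3
```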
Zhang, Yiyan; Xin, Yi; Li, Qin; Ma, Jianshe; Li, Shuai; Lv, Xiaodan; Lv, Weiqi
2017-11-02
New data mining algorithms are continually proposed as related disciplines develop, and their applicable scopes and performances differ. Hence, finding a suitable algorithm for a dataset is becoming an important concern for biomedical researchers seeking to solve practical problems promptly. In this paper, seven established algorithms, namely, C4.5, support vector machine, AdaBoost, k-nearest neighbor, naïve Bayes, random forest, and logistic regression, were selected as the research objects. The seven algorithms were applied to the 12 most-accessed UCI public datasets for classification tasks, and their performances were compared through induction and analysis. The sample size, number of attributes, number of missing values, sample size of each class, correlation coefficients between variables, class entropy of the task variable, and the ratio of the sample size of the largest class to that of the smallest class were calculated to characterize the 12 research datasets. The two ensemble algorithms reach high classification accuracy on most datasets. Moreover, random forest performs better than AdaBoost on unbalanced multi-class datasets. Simple algorithms, such as naïve Bayes and the logistic regression model, are suitable for small datasets with high correlation between the task and other non-task attribute variables. The k-nearest neighbor and C4.5 decision tree algorithms perform well on binary- and multi-class task datasets. The support vector machine is more adept on balanced small datasets of binary-class tasks. No algorithm can maintain the best performance on all datasets. The applicability of the seven data mining algorithms to datasets with different characteristics was summarized to provide a reference for biomedical researchers and beginners in different fields.
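A minimal scikit-learn sketch of the kind of comparison the study performs, with cross-validated accuracy on one built-in dataset standing in for the 12 UCI datasets (scikit-learn has no C4.5, so an entropy-criterion CART tree serves as a stand-in).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

classifiers = {
    "C4.5-like tree": DecisionTreeClassifier(criterion="entropy"),
    "SVM": SVC(),
    "AdaBoost": AdaBoostClassifier(),
    "k-NN": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(),
    "logistic regression": LogisticRegression(max_iter=5000),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold accuracy
    print(f"{name:20s} mean accuracy = {scores.mean():.3f}")
```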
@Note: a workbench for biomedical text mining.
Lourenço, Anália; Carreira, Rafael; Carneiro, Sónia; Maia, Paulo; Glez-Peña, Daniel; Fdez-Riverola, Florentino; Ferreira, Eugénio C; Rocha, Isabel; Rocha, Miguel
2009-08-01
Biomedical Text Mining (BioTM) is providing valuable approaches to the automated curation of scientific literature. However, most efforts have addressed the benchmarking of new algorithms rather than user operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial to solve real-world problems and promote further research. We present @Note, a platform for BioTM that aims at the effective translation of the advances between three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are the ability to process abstracts and full-texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows users to correct annotations; and a Text Mining Module supporting dataset preparation and algorithm evaluation. @Note improves interoperability, modularity and flexibility when integrating in-house and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasizing the principles of transparency and simplicity of use. Although it is still ongoing, it has already allowed the development of applications that are currently being used.
Toward edge minability for role mining in bipartite networks
NASA Astrophysics Data System (ADS)
Dong, Lijun; Wang, Yi; Liu, Ran; Pi, Benjie; Wu, Liuyi
2016-11-01
Bipartite network models have been extensively used in information security to automatically generate role-based access control (RBAC) from datasets. This process is called role mining. However, not all the topologies of bipartite networks are suitable for role mining; some edges may even reduce the quality of role mining. This causes unnecessary time consumption, as role mining is NP-hard. Therefore, to promote the quality of role mining results, the capability of an edge to compose roles with other edges, called the minability of the edge, needs to be identified. We tackle the problem from the angle of edge importance in complex networks; that is, an edge easily covered by roles is considered more important. Based on this idea, the k-shell decomposition of complex networks is extended to reveal the different minability of edges. In this way, a bipartite network can be quickly purified by excluding the low-minability edges from role mining, and thus the quality of role mining can be effectively improved. Extensive experiments on real-world datasets are conducted to confirm the above claims.
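The paper's edge-minability measure extends k-shell decomposition to bipartite edges; as a rough proxy under that idea, the sketch below scores each edge of a small user-permission graph by the smaller core number of its endpoints, so low-scoring edges would be candidates for exclusion (the graph, scoring rule and threshold are all hypothetical simplifications, not the authors' exact measure).

```python
import networkx as nx

# Tiny bipartite user-permission graph (hypothetical).
G = nx.Graph()
G.add_edges_from([
    ("u1", "p1"), ("u1", "p2"), ("u2", "p1"),
    ("u2", "p2"), ("u3", "p3"),
])

core = nx.core_number(G)  # standard k-shell/k-core index per vertex

# Proxy minability: an edge whose endpoints both sit in deep shells is
# more likely to be covered by roles shared with other edges.
minability = {(u, v): min(core[u], core[v]) for u, v in G.edges()}

# Purify: drop low-minability edges before running role mining.
threshold = 2
kept = [e for e, m in minability.items() if m >= threshold]
print(kept)  # ('u3', 'p3') is excluded: its endpoints lie in the 1-shell
```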
NASA Astrophysics Data System (ADS)
Pierce, S. A.
2014-12-01
Geosciences are becoming increasingly data intensive, particularly in relation to sustainability problems, which are multi-dimensional, weakly structured and characterized by high levels of uncertainty. In the case of complex resource management problems, the challenge is to extract meaningful information from data and make sense of it. Simultaneously, scientific knowledge alone is insufficient to change practice. Creating tools, and group decision support processes for end users to interact with data are key challenges to transforming science-based information into actionable knowledge. The ENCOMPASS project began as a multi-year case study in the Atacama Desert of Chile to design and implement a knowledge transfer model for energy-water-mining conflicts in the region. ENCOMPASS combines the use of cyberinfrastructure (CI), automated data collection, interactive interfaces for dynamic decision support, and participatory modelling to support social learning. A pilot version of the ENCOMPASS CI uses open source systems and serves as a structure to integrate and store multiple forms of data and knowledge, such as DEM, meteorological, water quality, geomicrobiological, energy demand, and groundwater models. In the case study, informatics and data fusion needs related to scientific uncertainty around deep groundwater flowpaths and energy-water connections. Users may upload data from field sites with handheld devices or desktops. Once uploaded, data assets are accessible for a variety of uses. To address multi-attributed decision problems in the Atacama region a standalone application with touch-enabled interfaces was created to improve real-time interactions with datasets by groups. The tool was used to merge datasets from the ENCOMPASS CI to support exploration among alternatives and build shared understanding among stakeholders. To date, the project has increased technical capacity among stakeholders, resulted in the creation of both for-profit and non-profit entities, enabled cross-sector collaboration with mining-indigenous stakeholders, and produced an interactive application for group decision support. ENCOMPASS leverages advances in computational tools to deliver data and models for group decision support applied to sustainability science problems.
Tree-based approach for exploring marine spatial patterns with raster datasets.
Liao, Xiaohan; Xue, Cunjin; Su, Fenzhen
2017-01-01
From multiple raster datasets to spatial association patterns, the data-mining task divides into three subtasks, i.e., raster dataset pretreatment, mining algorithm design, and spatial pattern exploration from the mining results. Compared with the first two subtasks, the third remains largely unresolved. Confronted with interrelated marine environmental parameters, we propose a Tree-based Approach for eXploring Marine Spatial Patterns with multiple raster datasets, called TAXMarSP, which includes two models. One is the Tree-based Cascading Organization Model (TCOM), and the other is the Spatial Neighborhood-based CAlculation Model (SNCAM). TCOM designs "Spatial node→Pattern node" layers from top to bottom to store the table-formatted frequent patterns. Together with TCOM, SNCAM considers spatial neighborhood contributions to calculate the pattern-matching degree between the specified marine parameters and the table-formatted frequent patterns and then explores the marine spatial patterns. Using the prevalent quantification Apriori algorithm and a real remote sensing dataset from January 1998 to December 2014, a successful application of TAXMarSP to marine spatial patterns in the Pacific Ocean is described; the obtained patterns include not only well-known ones but also patterns new to Earth scientists.
Scalable Regression Tree Learning on Hadoop using OpenPlanet
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yin, Wei; Simmhan, Yogesh; Prasanna, Viktor
As scientific and engineering domains attempt to effectively analyze the deluge of data arriving from sensors and instruments, machine learning is becoming a key data mining tool for building prediction models. The regression tree is a popular learning model that combines decision trees and linear regression to forecast numerical target variables based on a set of input features. MapReduce is well suited for addressing such data-intensive learning applications, and a proprietary regression tree algorithm, PLANET, using MapReduce has been proposed earlier. In this paper, we describe an open source implementation of this algorithm, OpenPlanet, on the Hadoop framework using a hybrid approach. Further, we evaluate the performance of OpenPlanet using real-world datasets from the Smart Power Grid domain to perform energy use forecasting, and propose tuning strategies for Hadoop parameters that improve the performance of the default configuration by 75% for a training dataset of 17 million tuples on a 64-core Hadoop cluster on FutureGrid.
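PLANET/OpenPlanet distribute the split search over MapReduce; a single-node analogue of the underlying model, a regression tree forecasting a numerical target from input features, can be sketched with scikit-learn (the synthetic energy-use data below is purely illustrative).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Illustrative features: hour of day and outdoor temperature.
hours = rng.integers(0, 24, size=1000)
temps = rng.normal(20, 8, size=1000)
X = np.column_stack([hours, temps])

# Synthetic energy use: daytime base load plus cooling demand.
y = 5 + 3 * ((hours >= 8) & (hours <= 18)) + 0.4 * np.maximum(temps - 22, 0)

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(tree.predict([[14, 30]]))  # forecast for 2 pm at 30 degrees
```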
HC StratoMineR: A Web-Based Tool for the Rapid Analysis of High-Content Datasets.
Omta, Wienand A; van Heesbeen, Roy G; Pagliero, Romina J; van der Velden, Lieke M; Lelieveld, Daphne; Nellen, Mehdi; Kramer, Maik; Yeong, Marley; Saeidi, Amir M; Medema, Rene H; Spruit, Marco; Brinkkemper, Sjaak; Klumperman, Judith; Egan, David A
2016-10-01
High-content screening (HCS) can generate large multidimensional datasets and, when aligned with the appropriate data mining tools, can yield valuable insights into the mechanism of action of bioactive molecules. However, easy-to-use data mining tools are not widely available, with the result that these datasets are frequently underutilized. Here, we present HC StratoMineR, a web-based tool for high-content data analysis. It is a decision-supportive platform that guides even non-expert users through a high-content data analysis workflow. HC StratoMineR is built using MySQL for storage and querying, PHP: Hypertext Preprocessor as the main programming language, and jQuery for additional user interface functionality. R is used for statistical calculations, logic and data visualizations. Furthermore, C++ and graphics processing unit power are diffusely embedded in R through the rcpp and rpud libraries for operations that are computationally highly intensive. We show that we can use HC StratoMineR for the analysis of multivariate data from a high-content siRNA knock-down screen and a small-molecule screen. It can be used to rapidly filter out undesirable data; to select relevant data; and to perform quality control, data reduction, data exploration, morphological hit picking, and data clustering. Our results demonstrate that HC StratoMineR can be used to functionally categorize HCS hits and, thus, provide valuable information for hit prioritization.
Ontology for Transforming Geo-Spatial Data for Discovery and Integration of Scientific Data
NASA Astrophysics Data System (ADS)
Nguyen, L.; Chee, T.; Minnis, P.
2013-12-01
Discovery of and access to geo-spatial scientific data across heterogeneous repositories and multi-discipline datasets can present challenges for scientists. We propose to build a workflow for transforming geo-spatial datasets into a semantic environment by using relationships to describe each resource with the OWL Web Ontology Language, RDF, and a proposed geo-spatial vocabulary. We will present methods for transforming a traditional scientific dataset, the use of a semantic repository, and querying with SPARQL to integrate and access datasets. This unique repository will enable discovery of scientific data by geospatial bounds or other criteria.
Adverse Drug Event Discovery Using Biomedical Literature: A Big Data Neural Network Adventure
Badger, Jonathan; LaRose, Eric; Shirzadi, Ehsan; Mahnke, Andrea; Mayer, John; Ye, Zhan; Page, David; Peissig, Peggy
2017-01-01
Background: The study of adverse drug events (ADEs) is a tenured topic in medical literature. In recent years, increasing numbers of scientific articles and health-related social media posts have been generated and shared daily, albeit with very limited use for ADE study and with little known about the content with respect to ADEs. Objective: The aim of this study was to develop a big data analytics strategy that mines the content of scientific articles and health-related Web-based social media to detect and identify ADEs. Methods: We analyzed the following two data sources: (1) biomedical articles and (2) health-related social media blog posts. We developed an intelligent and scalable text mining solution on big data infrastructures composed of Apache Spark, natural language processing, and machine learning. This was combined with an Elasticsearch No-SQL distributed database to explore and visualize ADEs. Results: The accuracy, precision, recall, and area under receiver operating characteristic of the system were 92.7%, 93.6%, 93.0%, and 0.905, respectively, and showed better results in comparison with traditional approaches in the literature. This work not only detected and classified ADE sentences from big data biomedical literature but also scientifically visualized ADE interactions. Conclusions: To the best of our knowledge, this work is the first to investigate a big data machine learning strategy for ADE discovery on massive datasets downloaded from PubMed Central and social media. This contribution illustrates possible capacities in big data biomedical text analysis using advanced computational methods with real-time update from new data published on a daily basis. PMID:29222076
An Evaluation of Text Mining Tools as Applied to Selected Scientific and Engineering Literature.
ERIC Educational Resources Information Center
Trybula, Walter J.; Wyllys, Ronald E.
2000-01-01
Addresses an approach to the discovery of scientific knowledge through an examination of data mining and text mining techniques. Presents the results of experiments that investigated knowledge acquisition from a selected set of technical documents by domain experts. (Contains 15 references.) (Author/LRW)
Özgür, Arzucan; Hur, Junguk; He, Yongqun
2016-01-01
The Interaction Network Ontology (INO) logically represents biological interactions, pathways, and networks. INO has been demonstrated to be valuable in providing a set of structured ontological terms and associated keywords to support literature mining of gene-gene interactions from biomedical literature. However, previous work using INO focused on single keyword matching, while many interactions are represented with two or more interaction keywords used in combination. This paper reports our extension of INO to include combinatory patterns of two or more literature mining keywords co-existing in one sentence to represent specific INO interaction classes. Such keyword combinations and related INO interaction type information could be automatically obtained via SPARQL queries, exported in Excel format, and used in an INO-supported SciMiner, an in-house literature mining program. We studied the gene interaction sentences from the commonly used benchmark Learning Logic in Language (LLL) dataset and one internally generated vaccine-related dataset to identify and analyze interaction types containing multiple keywords. Patterns obtained from the dependency parse trees of the sentences were used to identify the interaction keywords that are related to each other and collectively represent an interaction type. The INO ontology currently has 575 terms, including 202 terms under the interaction branch. The relations between the INO interaction types and associated keywords are represented using the INO annotation relations 'has literature mining keywords' and 'has keyword dependency pattern'. The keyword dependency patterns were generated by running the Stanford Parser to obtain dependency relation types. Out of the 107 interactions in the LLL dataset represented with two-keyword interaction types, 86 were identified using the direct dependency relations. The LLL dataset contained 34 gene regulation interaction types, each of which was associated with multiple keywords. A hierarchical display of these 34 interaction types and their ancestor terms in INO resulted in the identification of specific gene-gene interaction patterns from the LLL dataset. The phenomenon of having multi-keyword interaction types was also frequently observed in the vaccine dataset. By modeling and representing multiple textual keywords for interaction types, the extended INO enabled the identification of complex biological gene-gene interactions represented with multiple keywords.
Improving integrative searching of systems chemical biology data using semantic annotation.
Chen, Bin; Ding, Ying; Wild, David J
2012-03-08
Systems chemical biology and chemogenomics are considered critical, integrative disciplines in modern biomedical research, but require data mining of large, integrated, heterogeneous datasets from chemistry and biology. We previously developed an RDF-based resource called Chem2Bio2RDF that enabled querying of such data using the SPARQL query language. Whilst this work has proved useful in its own right as one of the first major resources in these disciplines, its utility could be greatly improved by the application of an ontology for annotation of the nodes and edges in the RDF graph, enabling a much richer range of semantic queries to be issued. We developed a generalized chemogenomics and systems chemical biology OWL ontology called Chem2Bio2OWL that describes the semantics of chemical compounds, drugs, protein targets, pathways, genes, diseases and side-effects, and the relationships between them. The ontology also includes data provenance. We used it to annotate our Chem2Bio2RDF dataset, making it a rich semantic resource. Through a series of scientific case studies we demonstrate how this (i) simplifies the process of building SPARQL queries, (ii) enables useful new kinds of queries on the data and (iii) makes possible intelligent reasoning and semantic graph mining in chemogenomics and systems chemical biology. Chem2Bio2OWL is available at http://chem2bio2rdf.org/owl. The document is available at http://chem2bio2owl.wikispaces.com.
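A sketch of the kind of ontology-annotated SPARQL query the authors describe, issued here with rdflib against a local copy of the data; the file name and the class/property IRIs are hypothetical stand-ins for actual Chem2Bio2OWL terms.

```python
from rdflib import Graph

g = Graph()
g.parse("chem2bio2rdf_subset.ttl", format="turtle")  # hypothetical local dump

# Hypothetical Chem2Bio2OWL-style terms: find compounds binding a target
# that is itself associated with a disease.
query = """
PREFIX c2b2o: <http://chem2bio2rdf.org/owl#>
SELECT ?compound ?target ?disease WHERE {
    ?compound c2b2o:binds ?target .
    ?target   c2b2o:associatedWith ?disease .
}
"""

for compound, target, disease in g.query(query):
    print(compound, target, disease)
```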
ASSISTments Dataset from Multiple Randomized Controlled Experiments
ERIC Educational Resources Information Center
Selent, Douglas; Patikorn, Thanaporn; Heffernan, Neil
2016-01-01
In this paper, we present a dataset consisting of data generated from 22 previously and currently running randomized controlled experiments inside the ASSISTments online learning platform. This dataset provides data mining opportunities for researchers to analyze ASSISTments data in a convenient format across multiple experiments at the same time.…
NASA Astrophysics Data System (ADS)
Place, Joachim; Malehmir, Alireza; Högdahl, Karin; Juhlin, Christopher; Persson Nilsson, Katarina
2014-05-01
A reflection seismic investigation has been conducted over the Grängesberg apatite iron deposit, where over 150 Mt of iron ore were produced until the mine closed in 1989. The mine infrastructure with shafts and tunnels extends down to ca. 650 m below the surface. Both natural and mine-induced fracture and fault systems are today water-filled (some of them extending to the surface). The disputed ore genesis of the apatite-iron ores and the exploration potential of large remaining quantities once again attract both scientific and commercial interest. A good understanding of the geometry of mineral deposits and their host-rock structures at depth is essential for optimizing their exploration and exploitation. In addition, a deep understanding of the fracture system is vital if mining activity is resumed, as these fractures may affect terrain stability and seismicity, which may put newly populated and industrial areas at risk. To address some of these challenging issues related to past mining, and to obtain information about the depth continuation of the existing deposit, two E-W oriented reflection lines with a total length of 3.5 km were acquired in May 2013 by Uppsala University. A weight drop mounted on a hydraulic bobcat truck (traditionally used for concrete breaking in the demolition sector) was used to generate the seismic signal. To increase the signal-to-noise ratio, several impacts were generated at each shot point and stacked together. The seismic lines intersect the Grängesberg ore body and open pit at a high angle, as well as several mining-induced faults. A combination of cabled and wireless receivers placed every 10 m was used for data recording. Wireless receivers were necessary because deploying cabled sensors was not possible amid city infrastructure, roads and houses. A careful analysis of the data suggested that several field-related issues, such as (1) the crooked geometry of the lines (due to the available path and road network), (2) electric and vibration noise from populated areas and roads, as well as a rock crusher working close to the line, and (3) significant static variations (due to the variable nature of the terrain: forest, roads, an open pit filled with tens of meters of unconsolidated materials), make this dataset similar to an urban dataset. Preprocessing first required the cable- and wireless-recorded sub-datasets to be merged using GPS time stamps (nanosecond accuracy) registered in the active data. Vertical shift and stack was then carried out to stack all data available at each shot point. Preliminary results from brute stacks exhibit evidence of reflections or diffractions. It is anticipated that further processing, involving especially appropriate static corrections and velocity analysis, will help interpret such seismic events in relation to the ore body and its surrounding structures. Possible along-profile compartmentalization is also detected; further processing is, however, necessary before interpreting such features in terms of recent faulting. Acknowledgements: Formas, Grängesberg Iron AB, Spendrups and the Geological Survey of Sweden
Schoech, D; Quinn, A; Rycraft, J R
2000-01-01
Data mining is the sifting through of voluminous data to extract knowledge for decision making. This article illustrates the context, concepts, processes, techniques, and tools of data mining, using statistical and neural network analyses on a dataset concerning employee turnover. The resulting models and their predictive capability, advantages and disadvantages, and implications for decision support are highlighted.
Data Mining and Privacy of Social Network Sites' Users: Implications of the Data Mining Problem.
Al-Saggaf, Yeslam; Islam, Md Zahidul
2015-08-01
This paper explores the potential of data mining as a technique that could be used by malicious data miners to threaten the privacy of social network site (SNS) users. It applies a data mining algorithm to a real dataset to provide empirically based evidence of the ease with which characteristics of SNS users can be discovered and used in ways that could invade their privacy. One major contribution of this article is the application of the decision forest data mining algorithm (SysFor) to the context of SNSs; SysFor builds not just a single decision tree but a forest, allowing the exploration of more logic rules from a dataset. One logic rule that SysFor built in this study, for example, revealed that anyone having a profile picture showing just the face or a picture showing a family is less likely to be lonely. Another contribution of this article is the discussion of the implications of the data mining problem for governments, businesses, developers and the SNS users themselves.
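SysFor itself is not available in common libraries; as a stand-in illustrating the forest-of-rules idea, the sketch below grows a small random forest with scikit-learn and prints each tree's logic rules (the profile attributes and the loneliness label are hypothetical, synthetic data).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

rng = np.random.default_rng(1)

# Hypothetical SNS profile features: face picture, family picture, friends.
X = np.column_stack([
    rng.integers(0, 2, 200),     # profile picture shows just the face
    rng.integers(0, 2, 200),     # profile picture shows a family
    rng.integers(0, 500, 200),   # number of friends
])
y = rng.integers(0, 2, 200)      # 1 = reports loneliness (synthetic labels)

forest = RandomForestClassifier(n_estimators=3, max_depth=3).fit(X, y)

# Each tree contributes its own set of human-readable logic rules.
for i, tree in enumerate(forest.estimators_):
    print(f"--- tree {i} ---")
    print(export_text(tree, feature_names=["face_pic", "family_pic", "n_friends"]))
```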
A Node Linkage Approach for Sequential Pattern Mining
Navarro, Osvaldo; Cumplido, René; Villaseñor-Pineda, Luis; Feregrino-Uribe, Claudia; Carrasco-Ochoa, Jesús Ariel
2014-01-01
Sequential Pattern Mining is a widely addressed problem in data mining, with applications such as analyzing Web usage, examining purchase behavior, and text mining, among others. Nevertheless, with the dramatic increase in data volume, current approaches prove inefficient when dealing with large input datasets, a large number of different symbols and low minimum supports. In this paper, we propose a new sequential pattern mining algorithm, which follows a pattern-growth scheme to discover sequential patterns. Unlike most pattern-growth algorithms, our approach does not build a data structure to represent the input dataset, but instead accesses the required sequences through pseudo-projection databases, achieving better runtime and reducing memory requirements. Our algorithm traverses the search space in a depth-first fashion and only preserves in memory a pattern node linkage and the pseudo-projections required for the branch being explored at the time. Experimental results show that our new approach, the Node Linkage Depth-First Traversal algorithm (NLDFT), has better performance and scalability in comparison with state-of-the-art algorithms. PMID:24933123
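A minimal sketch of the pseudo-projection idea the abstract describes: rather than materializing projected databases, each projection is just a list of (sequence index, position) pointers into the original dataset, here inside a plain depth-first PrefixSpan-style search (NLDFT's node-linkage structure is not reproduced).

```python
def mine(db, min_support):
    """Depth-first sequential pattern mining with pseudo-projections.

    db: list of sequences (lists of items). A projection is a list of
    (sequence_id, start_position) pointers into db, never a copied database.
    """
    results = []

    def grow(prefix, projection):
        # Count items occurring after the current prefix in each sequence.
        counts = {}
        for sid, pos in projection:
            for item in set(db[sid][pos:]):
                counts[item] = counts.get(item, 0) + 1
        for item, count in counts.items():
            if count < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, count))
            # New pseudo-projection: pointer just past the first match.
            new_proj = []
            for sid, pos in projection:
                seq = db[sid]
                for i in range(pos, len(seq)):
                    if seq[i] == item:
                        new_proj.append((sid, i + 1))
                        break
            grow(pattern, new_proj)

    grow([], [(sid, 0) for sid in range(len(db))])
    return results

db = [["a", "b", "c"], ["a", "c", "b"], ["a", "b", "b"]]
print(mine(db, min_support=2))  # e.g. (['a'], 3), (['a', 'b'], 3), ...
```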
Bravo, Àlex; Piñero, Janet; Queralt-Rosinach, Núria; Rautschka, Michael; Furlong, Laura I
2015-02-21
Current biomedical research needs to leverage and exploit the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identification of actionable knowledge from free text repositories. We present the BeFree system aimed at identifying relationships between biomedical entities with a special focus on genes and their associated diseases. By exploiting morpho-syntactic information of the text, BeFree is able to identify gene-disease, drug-disease and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and integration with other data sources. BeFree succeeds in identifying genes associated to a major cause of morbidity worldwide, depression, which are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and integration with current biomedical knowledge, provided interesting insights on the kind of information that can be found in the literature, and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by using BeFree is collected in expert-curated databases. Thus, there is a pressing need to find alternative strategies to manual curation, in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. BeFree is a novel text mining system that performs competitively for the identification of gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE results in a large dataset of gene-disease associations, and only a small proportion of this dataset is actually recorded in curated resources (2%), raising several issues on data prioritization and curation. We propose that joint analysis of text mined data with data curated by experts appears as a suitable approach to both assess data quality and highlight novel and interesting information.
Adverse Drug Event Discovery Using Biomedical Literature: A Big Data Neural Network Adventure.
P Tafti, Ahmad; Badger, Jonathan; LaRose, Eric; Shirzadi, Ehsan; Mahnke, Andrea; Mayer, John; Ye, Zhan; Page, David; Peissig, Peggy
2017-12-08
The study of adverse drug events (ADEs) is a tenured topic in medical literature. In recent years, increasing numbers of scientific articles and health-related social media posts have been generated and shared daily, albeit with very limited use for ADE study and with little known about the content with respect to ADEs. The aim of this study was to develop a big data analytics strategy that mines the content of scientific articles and health-related Web-based social media to detect and identify ADEs. We analyzed the following two data sources: (1) biomedical articles and (2) health-related social media blog posts. We developed an intelligent and scalable text mining solution on big data infrastructures composed of Apache Spark, natural language processing, and machine learning. This was combined with an Elasticsearch No-SQL distributed database to explore and visualize ADEs. The accuracy, precision, recall, and area under receiver operating characteristic of the system were 92.7%, 93.6%, 93.0%, and 0.905, respectively, and showed better results in comparison with traditional approaches in the literature. This work not only detected and classified ADE sentences from big data biomedical literature but also scientifically visualized ADE interactions. To the best of our knowledge, this work is the first to investigate a big data machine learning strategy for ADE discovery on massive datasets downloaded from PubMed Central and social media. This contribution illustrates possible capacities in big data biomedical text analysis using advanced computational methods with real-time update from new data published on a daily basis.
NASA Astrophysics Data System (ADS)
Jiang, Y.
2015-12-01
Oceanographic resource discovery is a critical step for developing ocean science applications. With the increasing number of resources available online, many Spatial Data Infrastructure (SDI) components (e.g. catalogues and portals) have been developed to help manage and discover oceanographic resources. However, efficient and accurate resource discovery is still a big challenge because of the lack of data relevancy information. In this article, we propose a search engine framework for mining and utilizing dataset relevancy from oceanographic dataset metadata, usage metrics, and user feedback. The objective is to improve the discovery accuracy of oceanographic data and reduce the time for scientists to discover, download and reformat data for their projects. Experiments and a search example show that the proposed engine helps both scientists and general users find more accurate results with enhanced performance and user experience through a user-friendly interface.
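One plausible reading of the framework is to rank datasets by metadata text similarity to the query and then boost by usage metrics and user feedback; the sketch below combines a scikit-learn TF-IDF score with hypothetical download counts and ratings as a weighted sum (the metadata strings, metrics and weights are all illustrative assumptions).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

metadata = [
    "sea surface temperature pacific monthly grid",
    "ocean salinity argo float profiles",
    "chlorophyll concentration modis aqua level 3",
]
downloads = np.array([1200, 300, 800])  # hypothetical usage metrics
ratings = np.array([4.5, 3.0, 4.0])     # hypothetical user feedback

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(metadata)

def rank(query, w_text=0.7, w_usage=0.2, w_feedback=0.1):
    text_score = cosine_similarity(vec.transform([query]), doc_matrix).ravel()
    usage_score = downloads / downloads.max()
    feedback_score = ratings / 5.0
    total = w_text * text_score + w_usage * usage_score + w_feedback * feedback_score
    return total.argsort()[::-1]  # dataset indices, best match first

for idx in rank("pacific sea surface temperature"):
    print(metadata[idx])
```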
Mining Gene Regulatory Networks by Neural Modeling of Expression Time-Series.
Rubiolo, Mariano; Milone, Diego H; Stegmayer, Georgina
2015-01-01
Discovering gene regulatory networks from data has been one of the most studied topics in recent years. Neural networks can be successfully used to infer an underlying gene network by modeling expression profiles as time series. This work proposes a novel method based on a pool of neural networks for obtaining a gene regulatory network from a gene expression dataset. The networks are used to model each possible interaction between pairs of genes in the dataset, and a set of mining rules is applied to accurately detect the underlying relations among genes. The results obtained on artificial and real datasets confirm the method's effectiveness for discovering regulatory networks from a proper modeling of the temporal dynamics of gene expression profiles.
Dataset of Scientific Inquiry Learning Environment
ERIC Educational Resources Information Center
Ting, Choo-Yee; Ho, Chiung Ching
2015-01-01
This paper presents the dataset collected from student interactions with INQPRO, a computer-based scientific inquiry learning environment. The dataset contains records of 100 students and is divided into two portions. The first portion comprises (1) "raw log data", capturing the student's name, interfaces visited, the interface…
Exploiting PubChem for Virtual Screening
Xie, Xiang-Qun
2011-01-01
Importance of the field: PubChem is a public molecular information repository, a scientific showcase of the NIH Roadmap Initiative. The PubChem database holds over 27 million records of unique chemical structures of compounds (CID) derived from nearly 70 million substance depositions (SID), and contains more than 449,000 bioassay records, with thousands of in vitro biochemical and cell-based screening bioassays established, targeting more than 7000 proteins and genes and linking to over 1.8 million substances. Areas covered in this review: This review builds on recent PubChem-related computational chemistry research reported by other authors while providing readers with an overview of the PubChem database, focusing on its increasing role in cheminformatics, virtual screening and toxicity prediction modeling. What the reader will gain: These publicly available datasets in PubChem provide great opportunities for scientists to perform cheminformatics and virtual screening research for computer-aided drug design. However, the high volume and complexity of the datasets, in particular the bioassay-associated false positives/negatives and highly imbalanced datasets in PubChem, also create major challenges. Several approaches to modeling PubChem datasets and developing virtual screening models for bioactivity and toxicity prediction are also reviewed. Take home message: Novel data-mining cheminformatics tools and virtual screening algorithms are being developed and used to retrieve, annotate and analyze the large-scale and highly complex PubChem biological screening data for drug design. PMID:21691435
Handwritten mathematical symbols dataset.
Chajri, Yassine; Bouikhalene, Belaid
2016-06-01
Due to the technological advances in recent years, paper scientific documents are used less and less. Thus, the trend in the scientific community to use digital documents has increased considerably. Among these documents, there are scientific documents and more specifically mathematics documents. In this context, we present our own dataset of handwritten mathematical symbols composed of 10,379 images. This dataset gathers Arabic characters, Latin characters, Arabic numerals, Latin numerals, arithmetic operators, set-symbols, comparison symbols, delimiters, etc.
Mining of Business-Oriented Conversations at a Call Center
NASA Astrophysics Data System (ADS)
Takeuchi, Hironori; Nasukawa, Tetsuya; Watanabe, Hideo
Recently it has become feasible to transcribe textual records from telephone conversations at call centers by using automatic speech recognition. In this research, we extended a text mining system for call summary records and constructed a conversation mining system for business-oriented conversations at a call center. To acquire useful business insights from conversational data through a text mining system, it is critical to identify appropriate textual segments and expressions as the viewpoints to focus on. In the analysis of call summary data using a text mining system, experts defined the viewpoints for the analysis by looking at sample records and preparing dictionaries based on frequent keywords in the sample dataset. With full conversations, however, it is difficult to identify such viewpoints manually and in advance, because the target data consists of complete transcripts that are often lengthy and redundant. In this research, we defined a model of business-oriented conversations and proposed a mining method that identifies segments with impact on the outcomes of the conversations and then extracts useful expressions in each of these identified segments. In the experiment, we processed real datasets from a car rental service center and constructed a mining system. With this system, we show the effectiveness of the method based on the defined conversation model.
Mapping extent and change in surface mines within the United States for 2001 to 2006
Soulard, Christopher E.; Acevedo, William; Stehman, Stephen V.; Parker, Owen P.
2016-01-01
A complete, spatially explicit dataset illustrating the 21st century mining footprint for the conterminous United States does not exist. To address this need, we developed a semi-automated procedure to map the country's mining footprint (30-m pixel) and establish a baseline to monitor changes in mine extent over time. The process uses mine seed points derived from the U.S. Energy Information Administration (EIA), U.S. Geological Survey (USGS) Mineral Resources Data System (MRDS), and USGS National Land Cover Dataset (NLCD) and recodes patches of barren land that meet a “distance to seed” requirement and a patch area requirement before mapping a pixel as mining. Seed points derived from EIA coal points, an edited MRDS point file, and 1992 NLCD mine points were used in three separate efforts using different distance and patch area parameters for each. The three products were then merged to create a 2001 map of moderate-to-large mines in the United States, which was subsequently manually edited to reduce omission and commission errors. This process was replicated using NLCD 2006 barren pixels as a base layer to create a 2006 mine map and a 2001–2006 mine change map focusing on areas with surface mine expansion. In 2001, 8,324 km2 of surface mines were mapped. The footprint increased to 9,181 km2 in 2006, representing a 10.3% increase over 5 years. These methods exhibit merit as a timely approach to generate wall-to-wall, spatially explicit maps representing the recent extent of a wide range of surface mining activities across the country.
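A sketch of the patch-recoding step under stated assumptions: label connected patches of barren pixels, then keep a patch as mining only if it contains a seed point and meets the minimum area (the raster and thresholds are toy values; the published workflow also applies a separate distance-to-seed rule not shown here).

```python
import numpy as np
from scipy import ndimage

# Toy 30-m land-cover raster: 1 = barren, 0 = other.
barren = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
], dtype=int)

# Mine seed points (row, col), e.g. derived from EIA/MRDS locations.
seeds = [(0, 0)]

labels, n_patches = ndimage.label(barren)

min_area = 3  # minimum patch size, in pixels
mine_mask = np.zeros_like(barren)
for patch_id in range(1, n_patches + 1):
    patch = labels == patch_id
    has_seed = any(patch[r, c] for r, c in seeds)
    if has_seed and patch.sum() >= min_area:
        mine_mask[patch] = 1

print(mine_mask)  # only the seeded upper-left patch is mapped as mining
```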
Booma, P M; Prabhakaran, S; Dhanalakshmi, R
2014-01-01
Microarray gene expression datasets have attracted great interest among molecular biologists, statisticians, and computer scientists. Data mining that extracts the hidden and useful information from datasets fails to identify the most significant biological associations between genes. A heuristic search for standard biological processes measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process was not efficiently addressed. To monitor higher rates of expression levels between genes, a hierarchical clustering model was proposed, where the biological association between genes is measured simultaneously using a proximity measure of improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify associations between the clusters. Moreover, GL-PCPHC applies a pattern-growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for the GL-PCPHC model with standard benchmark gene expression datasets extracted from the UCI repository and GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality.
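The core of the approach, hierarchical average-linkage clustering with a Pearson-correlation proximity measure, can be sketched with SciPy (the expression matrix is synthetic; the paper's improved correlation measure and pattern-growing steps are not reproduced).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)

# Synthetic expression matrix: 6 genes x 10 conditions, two co-regulated groups.
base_a, base_b = rng.normal(size=10), rng.normal(size=10)
genes = np.vstack([base_a + rng.normal(0, 0.1, 10) for _ in range(3)] +
                  [base_b + rng.normal(0, 0.1, 10) for _ in range(3)])

# Correlation distance = 1 - Pearson correlation between gene profiles.
dist = pdist(genes, metric="correlation")

# Average linkage on rows, as in the Seed Augment step.
tree = linkage(dist, method="average")
print(fcluster(tree, t=2, criterion="maxclust"))  # e.g. [1 1 1 2 2 2]
```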
Data Mining Citizen Science Results
NASA Astrophysics Data System (ADS)
Borne, K. D.
2012-12-01
Scientific discovery from big data is enabled through multiple channels, including data mining (through the application of machine learning algorithms) and human computation (commonly implemented through citizen science tasks). We will describe the results of new data mining experiments on the outputs of citizen science activities. Discovering patterns, trends, and anomalies in data is among the powerful contributions of citizen science. Establishing scientific algorithms that can subsequently re-discover the same types of patterns, trends, and anomalies in automatic data processing pipelines will ultimately result from the transformation of those human algorithms into computer algorithms, which can then be applied to much larger data collections. Scientific discovery from big data is thus greatly amplified through the marriage of data mining with citizen science.
Tele-infrasonic studies of hard-rock mining explosions.
McKenna, Mihan H; Stump, Brian W; Hayek, Sylvia; McKenna, Jason R; Stanton, Terry R
2007-07-01
The Lac-du-Bonnet infrasound station, IS-10, and the Minnesota iron mines 390 km to the southeast are ideally located to assess the accuracy of atmospheric profiles needed for infrasound modeling. Infrasonic data from 2003 associated with explosions at the iron mine were analyzed for effects of explosion size and atmospheric conditions on observations with well-constrained ground truth. Noise was the determining factor for observation; high noise conditions sometimes prevented unequivocal identification of infrasound arrivals. Observed arrivals had frequencies of 0.5 to 5 Hz, with a dominant frequency of 2 Hz, and generally had durations on the order of 10 s or less. There was no correlation between explosive amount and observability. Tele-infrasonic propagation distances (greater than 250 km) produce thermospheric ray paths. Modeling is based upon MSIS/HWM (Mass Spectrometer Incoherent Scatter/Horizontal Wind Model) and NRL-G2S (Naval Research Laboratory Ground to Space) datasets. The NRL-G2S dataset provided more accurate travel time predictions than the MSIS/HWM dataset. PE modeling for the NRL-G2S dataset indicates energy loss at higher frequencies (around 4 Hz). Additionally, applying the Sutherland/Bass model through the NRL-G2S realization of the atmosphere in InfraMAP results in predicted amplitudes too small to be observed.
Westergaard, David; Stærfeldt, Hans-Henrik; Tønsberg, Christian; Jensen, Lars Juhl; Brunak, Søren
2018-02-01
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
Data Mining and Optimization Tools for Developing Engine Parameters Tools
NASA Technical Reports Server (NTRS)
Dhawan, Atam P.
1998-01-01
This project was awarded for understanding the problem and developing a plan for Data Mining tools for use in designing and implementing an Engine Condition Monitoring System. From the total budget of $5,000, Tricia and I studied the problem domain for developing an Engine Condition Monitoring system using the sparse and non-standardized datasets to be made available through a consortium at NASA Lewis Research Center. We visited NASA three times to discuss additional issues related to the dataset, which was not made available to us. We discussed and developed a general framework of data mining and optimization tools to extract useful information from sparse and non-standard datasets. These discussions led to the training of Tricia Erhardt to develop Genetic Algorithm based search programs, written in C++, that were used to demonstrate the capability of the GA in searching for an optimal solution in noisy datasets. From the study and discussions with NASA LeRC personnel, we then prepared a proposal, which is being submitted to NASA for future work on the development of data mining algorithms for engine condition monitoring. The proposed set of algorithms uses wavelet processing to create a multi-resolution pyramid of the data for a GA based multi-resolution optimal search. Wavelet processing is proposed to create a coarse-resolution representation of the data, providing two advantages in the GA based search: (1) there will be less data to begin with when building search sub-spaces, and (2) the search will be robust against noise, because at every level of the wavelet-based decomposition the signal is decomposed into low-pass and high-pass components.
Harvey, Matthew J; Mason, Nicholas J; McLean, Andrew; Rzepa, Henry S
2015-01-01
We describe three different procedures, based on metadata standards, for enabling automated retrieval of scientific data from digital repositories, utilising the persistent identifier of the dataset with optional specification of attributes of the data document such as filename or media type. The procedures are demonstrated using the JSmol molecular visualizer as a component of a web page and Avogadro as a stand-alone modelling program. We compare our methods for automated retrieval of data from a standards-compliant data repository with those currently in operation for a selection of existing molecular databases and repositories. Our methods illustrate the importance of adopting a standards-based approach of using metadata declarations to increase access to and discoverability of repository-based data.
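For identifier-driven retrieval, one widely deployed mechanism consistent with the paper's goal is content negotiation on a dataset DOI; the sketch below asks the doi.org resolver for a machine-readable metadata record using the requests library (the DOI itself is a placeholder, and this is an illustration of the general pattern rather than the authors' exact procedures).

```python
import requests

doi = "10.5061/dryad.example"  # placeholder dataset DOI

# DataCite/Crossref resolvers honour content negotiation on doi.org:
# asking for citeproc JSON returns the dataset's metadata record.
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
resp.raise_for_status()
metadata = resp.json()
print(metadata.get("title"))
```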
Data mining and the human genome
DOE Office of Scientific and Technical Information (OSTI.GOV)
Abarbanel, Henry; Callan, Curtis; Dally, William
2000-01-07
As genomics research moves from an era of data acquisition to one of both acquisition and interpretation, new methods are required for organizing and prioritizing the data. These methods would allow an initial level of data analysis to be carried out before committing resources to a particular genetic locus. This JASON study sought to delineate the main problems that must be faced in bioinformatics and to identify information technologies that can help to overcome those problems. While the current influx of data greatly exceeds what biologists have experienced in the past, other scientific disciplines and the commercial sector have been handling much larger datasets for many years. Powerful data mining techniques have been developed in other fields that, with appropriate modification, could be applied to the biological sciences.
Scientific Visualization Tools for Enhancement of Undergraduate Research
NASA Astrophysics Data System (ADS)
Rodriguez, W. J.; Chaudhury, S. R.
2001-05-01
Undergraduate research projects that utilize remote sensing satellite instrument data to investigate atmospheric phenomena pose many challenges. A significant challenge is processing large amounts of multi-dimensional data. Remote sensing data initially require mining; filtering of undesirable spectral, instrumental, or environmental features; and subsequently sorting and reformatting to files for easy and quick access. The data must then be transformed according to the needs of the investigation(s) and displayed for interpretation. These multidimensional datasets require views that can range from two-dimensional plots to multivariable-multidimensional scientific visualizations with animations. Science undergraduate students generally find these data processing tasks daunting. Typically, researchers are required to fully understand the intricacies of the dataset and write computer programs or rely on commercially available software, which may not be trivial to use. In the time that undergraduate researchers have available for their research projects, learning the data formats, programming languages, and/or visualization packages is impractical. When dealing with large multi-dimensional data sets, appropriate Scientific Visualization tools are imperative in allowing students to have a meaningful and pleasant research experience, while producing valuable scientific research results. The BEST Lab at Norfolk State University has been creating tools for multivariable-multidimensional analysis of Earth Science data. EzSAGE and SAGE4D have been developed to sort, analyze and visualize SAGE II (Stratospheric Aerosol and Gas Experiment) data with ease. Three- and four-dimensional visualizations in interactive environments can be produced. EzSAGE provides atmospheric slices in three dimensions, where the researcher can interactively change the scales in all three dimensions, the color tables, and the degree of smoothing to focus on particular phenomena. SAGE4D provides a navigable four-dimensional interactive environment. These tools allow students to make higher-order decisions based on large multidimensional sets of data while diminishing the level of frustration that results from dealing with the details of processing large data sets.
Handwritten mathematical symbols dataset
Chajri, Yassine; Bouikhalene, Belaid
2016-01-01
Due to the technological advances in recent years, paper scientific documents are used less and less. Thus, the trend in the scientific community to use digital documents has increased considerably. Among these are scientific documents and, more specifically, mathematics documents. In this context, we present our own dataset of handwritten mathematical symbols composed of 10,379 images. This dataset gathers Arabic characters, Latin characters, Arabic numerals, Latin numerals, arithmetic operators, set symbols, comparison symbols, delimiters, etc. PMID:27006975
Trend Motif: A Graph Mining Approach for Analysis of Dynamic Complex Networks
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, R; McCallen, S; Almaas, E
2007-05-28
Complex networks have been used successfully in scientific disciplines ranging from sociology to microbiology to describe systems of interacting units. Until recently, studies of complex networks have mainly focused on their network topology. However, in many real world applications, the edges and vertices have associated attributes that are frequently represented as vertex or edge weights. Furthermore, these weights are often not static, instead changing with time and forming a time series. Hence, to fully understand the dynamics of the complex network, we have to consider both network topology and related time series data. In this work, we propose a motif mining approach to identify trend motifs for such purposes. Simply stated, a trend motif describes a recurring subgraph where each of its vertices or edges displays similar dynamics over a user-defined period. Given this, each trend motif occurrence can help reveal significant events in a complex system; frequent trend motifs may aid in uncovering dynamic rules of change for the system, and the distribution of trend motifs may characterize the global dynamics of the system. Here, we have developed efficient mining algorithms to extract trend motifs. Our experimental validation using three disparate empirical datasets, ranging from the stock market and world trade to a protein interaction network, has demonstrated the efficiency and effectiveness of our approach.
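The following sketch is a much-simplified take on the trend-motif idea, not the paper's algorithm: assuming edge weights recorded as time series, it labels each edge by the sign of its fitted trend over a window and extracts connected subgraphs whose edges share the same label. The graph and time series are toy data.

```python
import networkx as nx
import numpy as np

# Hypothetical input: edge -> weight time series over 6 steps.
series = {
    ("a", "b"): [1, 2, 3, 4, 5, 6],
    ("b", "c"): [2, 3, 5, 6, 8, 9],
    ("c", "d"): [9, 7, 6, 4, 3, 1],
    ("d", "e"): [5, 5, 5, 5, 5, 5],
}

def trend(ts):
    """Classify a series as up/down/flat by its least-squares slope."""
    slope = np.polyfit(np.arange(len(ts)), ts, 1)[0]
    return "up" if slope > 0.5 else "down" if slope < -0.5 else "flat"

for label in ("up", "down"):
    # Subgraph of edges whose dynamics share this trend label.
    g = nx.Graph([(u, v) for (u, v), ts in series.items() if trend(ts) == label])
    for comp in nx.connected_components(g):
        if len(comp) > 2:  # more than one coherently trending edge
            print(label, "trend motif over vertices:", sorted(comp))
```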
NASA Astrophysics Data System (ADS)
Znikina, Ludmila; Rozhneva, Elena
2017-11-01
The article deals with the distribution of informative intensity in English-language scientific text, based on structural features that contribute to the formalization of scientific text and preserve the semantic adequacy of a derived text in relation to the primary one. The discourse analysis is built on specific compositional and semantic examples of scientific texts taken from the mining field. The article also analyzes the adequacy of translating foreign texts into another language, the relationships between elements of linguistic systems, the degree of formal conformance, and translation in line with the specific objectives and information needs of the recipient. Key words and ideas are highlighted in the paragraphs of English-language mining scientific texts. The article describes the characteristic structure of paragraphs in technical text and gives examples of constructions in English scientific texts on a mining theme, with the aim of explaining possible ways of translating them adequately.
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Ulbricht, Damian; Bertelmann, Roland
2017-04-01
Open access to research data is an increasing international request and includes not only data underlying scholarly publications, but also raw and curated data. Especially in the framework of the observed shift in many scientific fields towards data science and data mining, data repositories are becoming important players as data archives and access points to curated research data. While general and institutional data repositories are available across all scientific disciplines, domain-specific data repositories are specialised for particular disciplines, e.g. the bio- or geosciences, with the possibility to use more discipline-specific and richer metadata models than general repositories. Data publication is increasingly regarded as an important scientific achievement, and datasets with a digital object identifier (DOI) are now fully citable in journal articles. Moreover, following their signature of the "Statement of Commitment of the Coalition on Publishing Data in the Earth and Space Sciences" (COPDESS), many publishers have adapted their data policies and recommend, or even request, that data underlying scholarly publications be stored and published in (domain-specific) data repositories rather than as classical supplementary material attached directly to the respective article. The curation of large dynamic data from global networks in, e.g., seismology, magnetics or geodesy has always required a high degree of professional, IT-supported data management, simply to be able to store and access the huge number of files and manage dynamic datasets. In contrast, the vast amount of research data acquired by individual investigators or small teams, known as 'long-tail data', has often not been the focus of data curation infrastructures. Nevertheless, even though such datasets are small in size and highly variable, in total they represent a significant portion of the total scientific output. The curation of long-tail data requires more individual approaches and the personal involvement of the data curator, especially regarding the data description. Here we introduce best practices for the publication of long-tail data that help reduce the individual effort and improve the quality of the data description. The data repository of GFZ Data Services, which is hosted at the GFZ German Research Centre for Geosciences in Potsdam, is a domain-specific data repository for the geosciences. In addition to large dynamic datasets from different disciplines, it has a strong focus on the DOI-referenced publication of long-tail data, with the aim of reaching a high degree of reusability through a comprehensive data description while at the same time providing and distributing standardised, machine-actionable metadata for data discovery (FAIR data). The development of templates for data reports, metadata provision by scientists via an XML metadata editor, and discipline-specific DOI landing pages help both the data curators, who handle all kinds of datasets, and the scientists, i.e. the users, who can quickly decide whether a published dataset fulfils their needs. In addition, GFZ Data Services has developed DOI-registration services for several international networks (e.g. ICGEM, World Stress Map, IGETS, etc.), as well as project- or network-specific designs of the DOI landing pages with the logo or design of the networks or projects.
Parallel Index and Query for Large Scale Data Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chou, Jerry; Wu, Kesheng; Ruebel, Oliver
2011-07-18
Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in terms of designing a system for processing general scientific datasets. The system needs to be able to run on distributed multi-core platforms, efficiently utilize underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that addresses these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process massive datasets on modern supercomputing platforms. We apply FastQuery to processing of a massive 50TB dataset generated by a large scale accelerator modeling code. We demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.
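The sketch below is not FastBit/FastQuery itself, just a toy illustration of the bitmap-index idea behind them: precompute one bitmap per value bin so that range queries reduce to cheap bitwise operations. The attribute name, bin count, and query range are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
energy = rng.uniform(0.0, 10.0, size=1_000_000)  # one attribute of a dataset

# Build a binned bitmap index: bitmaps[b][i] is True iff record i is in bin b.
edges = np.linspace(0.0, 10.0, 11)       # 10 equal-width bins over [0, 10)
bins = np.digitize(energy, edges) - 1
bitmaps = [bins == b for b in range(10)]

# Query: which records have 7.0 <= energy < 9.0?  OR the two covering bitmaps.
hits = bitmaps[7] | bitmaps[8]
print(hits.sum(), "records matched")
# A candidate check against the raw values is only needed when the query
# range cuts through the middle of a bin.
```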
A software tool for determination of breast cancer treatment methods using data mining approach.
Cakır, Abdülkadir; Demirel, Burçin
2011-12-01
In this work, breast cancer treatment methods are determined using data mining. For this purpose, software was developed to help oncology doctors suggest appropriate treatment methods for breast cancer patients. Data from 462 breast cancer patients, obtained from Ankara Oncology Hospital, are used to determine treatment methods for new patients. This dataset is processed with the Weka data mining tool. Classification algorithms are applied one by one to this dataset and the results are compared to find the proper treatment method. The developed software program, called "Treatment Assistant", uses different algorithms (IB1, Multilayer Perceptron and Decision Table) to find which one gives the better result for each predicted attribute, with a user interface built in Java NetBeans. Treatment methods are determined for the post-surgical treatment of breast cancer patients using this software tool. At the modeling step of the data mining process, different Weka algorithms are used for the output attributes: for the hormonotherapy output IB1, for the tamoxifen and radiotherapy outputs Multilayer Perceptron, and for the chemotherapy output the Decision Table algorithm shows the best accuracy. In conclusion, this work shows that the data mining approach can be a useful tool for medical applications, particularly at the treatment decision step, and helps the doctor to decide in a short time.
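A rough scikit-learn analogue of the per-output model selection described above (Weka itself is Java): compare a nearest-neighbour learner akin to IB1, a multilayer perceptron, and a decision tree standing in for Weka's Decision Table, keeping whichever cross-validates best. The data is synthetic, not the Ankara Oncology Hospital dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 462 patients, one binary output (e.g. chemotherapy).
X, y = make_classification(n_samples=462, n_features=20, random_state=0)

candidates = {
    "IB1-like kNN": KNeighborsClassifier(n_neighbors=1),
    "Multilayer Perceptron": MLPClassifier(max_iter=1000, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
# 10-fold cross-validated accuracy per candidate; repeat per output attribute.
scores = {name: cross_val_score(model, X, y, cv=10).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best model for this output:", best)
```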
Wang, Chao; Guo, Xiao-Jing; Xu, Jin-Fang; Wu, Cheng; Sun, Ya-Lin; Ye, Xiao-Fei; Qian, Wei; Ma, Xiu-Qiang; Du, Wen-Min; He, Jia
2012-01-01
The detection of signals of adverse drug events (ADEs) has increased because of the use of data mining algorithms in spontaneous reporting systems (SRSs). However, different data mining algorithms have different traits and conditions for application. The objective of our study was to explore the application of association rule (AR) mining in ADE signal detection and to compare its performance with that of other algorithms. Monte Carlo simulation was applied to generate drug-ADE reports randomly according to the characteristics of SRS datasets. A thousand simulated datasets were mined by AR and other algorithms. On average, 108,337 reports were generated by the Monte Carlo simulation. Based on the predefined criterion that 10% of the drug-ADE combinations were true signals, with RR equal to 10, 4.9, 1.5, and 1.2, AR detected, on average, 284 suspected associations with a minimum support of 3 and a minimum lift of 1.2. The area under the receiver operating characteristic (ROC) curve of the AR was 0.788, which was equivalent to that shown for other algorithms. Additionally, AR was applied to reports submitted to the Shanghai SRS in 2009. Five hundred seventy combinations were detected using AR from 24,297 SRS reports, and they were compared with recognized ADEs identified by clinical experts and various other sources. AR appears to be an effective method for ADE signal detection, both in simulated and real SRS datasets. The limitations of this method exposed in our study, i.e., non-uniform threshold settings and redundant rules, require further research.
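The two thresholds mentioned above (minimum support of 3 reports, minimum lift of 1.2) can be made concrete with a small sketch. The report table below is invented for illustration; only pairs clearing both thresholds are flagged as suspected signals.

```python
from itertools import combinations
from collections import Counter

reports = [  # each report lists the drug and the adverse events observed
    {"drugX", "nausea"}, {"drugX", "nausea"}, {"drugX", "nausea"},
    {"drugX", "nausea"}, {"drugY", "rash"}, {"drugY", "rash"},
]
n = len(reports)
item_count = Counter(i for r in reports for i in r)
pair_count = Counter(frozenset(p) for r in reports
                     for p in combinations(sorted(r), 2))

for pair, support in pair_count.items():
    a, b = tuple(pair)
    # lift = P(a and b) / (P(a) * P(b)); > 1 means co-reporting above chance
    lift = (support / n) / ((item_count[a] / n) * (item_count[b] / n))
    if support >= 3 and lift >= 1.2:
        print(f"signal: {a} ~ {b} (support={support}, lift={lift:.2f})")
```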
Becnel, Lauren B; Darlington, Yolanda F; Ochsner, Scott A; Easton-Marks, Jeremy R; Watkins, Christopher M; McOwiti, Apollo; Kankanamge, Wasula H; Wise, Michael W; DeHart, Michael; Margolis, Ronald N; McKenna, Neil J
2015-01-01
Signaling pathways involving nuclear receptors (NRs), their ligands and coregulators, regulate tissue-specific transcriptomes in diverse processes, including development, metabolism, reproduction, the immune response and neuronal function, as well as in their associated pathologies. The Nuclear Receptor Signaling Atlas (NURSA) is a Consortium focused around a Hub website (www.nursa.org) that annotates and integrates diverse 'omics datasets originating from the published literature and NURSA-funded Data Source Projects (NDSPs). These datasets are then exposed to the scientific community on an Open Access basis through user-friendly data browsing and search interfaces. Here, we describe the redesign of the Hub, version 3.0, to deploy "Web 2.0" technologies and add richer, more diverse content. The Molecule Pages, which aggregate information relevant to NR signaling pathways from myriad external databases, have been enhanced to include resources for basic scientists, such as post-translational modification sites and targeting miRNAs, and for clinicians, such as clinical trials. A portal to NURSA's Open Access, PubMed-indexed journal Nuclear Receptor Signaling has been added to facilitate manuscript submissions. Datasets and information on reagents generated by NDSPs are available, as is information concerning periodic new NDSP funding solicitations. Finally, the new website integrates the Transcriptomine analysis tool, which allows for mining of millions of richly annotated public transcriptomic data points in the field, providing an environment for dataset re-use and citation, bench data validation and hypothesis generation. We anticipate that this new release of the NURSA database will have tangible, long term benefits for both basic and clinical research in this field.
Fast Spatio-Temporal Data Mining from Large Geophysical Datasets
NASA Technical Reports Server (NTRS)
Stolorz, P.; Mesrobian, E.; Muntz, R.; Santos, J. R.; Shek, E.; Yi, J.; Mechoso, C.; Farrara, J.
1995-01-01
The use of UCLA's CONQUEST (CONtent-based Querying in Space and Time) for automatic cyclone extraction and detection of spatio-temporal blocking conditions on MPPs is reviewed. CONQUEST is a data analysis environment for knowledge discovery and data mining in support of high-resolution climate modeling.
Mining on Big Data Using Hadoop MapReduce Model
NASA Astrophysics Data System (ADS)
Salman Ahmed, G.; Bhattacharya, Sweta
2017-11-01
Conventional parallel algorithms for mining frequent item-sets attempt to balance loads by distributing similar data among nodes. This paper reviews that process by analyzing the critical performance drawback of common parallel frequent item-set mining algorithms: given a very large dataset, the data partitioning strategies in current solutions suffer high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this drawback by developing a data partitioning approach on Hadoop using the MapReduce programming model, with the goal of speeding up the execution of parallel frequent item-set mining on Hadoop clusters. Incorporating a similarity metric and the locality-sensitive hashing technique, the approach places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement the approach on a 34-node Hadoop cluster, driven by a range of datasets created by the IBM Quest market-basket synthetic data generator. Experiments reveal that the approach reduces network and computing loads by eliminating redundant transactions on Hadoop nodes, and that it considerably outperforms the other models.
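For readers unfamiliar with the MapReduce model referred to above, the pure-Python sketch below mimics its three phases on a toy transaction list; a real deployment would run on Hadoop itself, where the shuffle is done by the framework and map/reduce tasks run in parallel across nodes.

```python
from collections import defaultdict

transactions = [["bread", "milk"], ["bread", "beer"],
                ["milk", "beer", "bread"]]

# Map phase: emit one (key, value) pair per item occurrence.
mapped = [(item, 1) for tx in transactions for item in tx]

# Shuffle phase: group values by key (handled by Hadoop on a cluster).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values independently (parallelizable).
counts = {item: sum(values) for item, values in groups.items()}
print(counts)  # {'bread': 3, 'milk': 2, 'beer': 2}
```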
Semantically-enabled Knowledge Discovery in the Deep Carbon Observatory
NASA Astrophysics Data System (ADS)
Wang, H.; Chen, Y.; Ma, X.; Erickson, J. S.; West, P.; Fox, P. A.
2013-12-01
The Deep Carbon Observatory (DCO) is a decadal effort aimed at transforming scientific and public understanding of carbon in the complex deep earth system from the perspectives of Deep Energy, Deep Life, Extreme Physics and Chemistry, and Reservoirs and Fluxes. Over the course of the decade DCO scientific activities will generate a massive volume of data across a variety of disciplines, presenting significant challenges in terms of data integration, management, analysis and visualization, and ultimately limiting the ability of scientists across disciplines to make insights and unlock new knowledge. The DCO Data Science Team (DCO-DS) is applying Semantic Web methodologies to construct a knowledge representation focused on the DCO Earth science disciplines, and use it together with other technologies (e.g. natural language processing and data mining) to create a more expressive representation of the distributed corpus of DCO artifacts including datasets, metadata, instruments, sensors, platforms, deployments, researchers, organizations, funding agencies, grants and various awards. The embodiment of this knowledge representation is the DCO Data Science Infrastructure, in which unique entities within the DCO domain and the relations between them are recognized and explicitly identified. The DCO-DS Infrastructure will serve as a platform for more efficient and reliable searching, discovery, access, and publication of information and knowledge for the DCO scientific community and beyond.
Gerhard, Stephan; Daducci, Alessandro; Lemkaddem, Alia; Meuli, Reto; Thiran, Jean-Philippe; Hagmann, Patric
2011-01-01
Advanced neuroinformatics tools are required for methods of connectome mapping, analysis, and visualization. The inherent multi-modality of connectome datasets poses new challenges for data organization, integration, and sharing. We have designed and implemented the Connectome Viewer Toolkit - a set of free and extensible open source neuroimaging tools written in Python. The key components of the toolkit are as follows: (1) The Connectome File Format is an XML-based container format to standardize multi-modal data integration and structured metadata annotation. (2) The Connectome File Format Library enables management and sharing of connectome files. (3) The Connectome Viewer is an integrated research and development environment for visualization and analysis of multi-modal connectome data. The Connectome Viewer's plugin architecture supports extensions with network analysis packages and an interactive scripting shell, to enable easy development and community contributions. Integration with tools from the scientific Python community allows the leveraging of numerous existing libraries for powerful connectome data mining, exploration, and comparison. We demonstrate the applicability of the Connectome Viewer Toolkit using Diffusion MRI datasets processed by the Connectome Mapper. The Connectome Viewer Toolkit is available from http://www.cmtk.org/
TISSUES 2.0: an integrative web resource on mammalian tissue expression
Palasca, Oana; Santos, Alberto; Stolte, Christian; Gorodkin, Jan; Jensen, Lars Juhl
2018-01-01
Abstract Physiological and molecular similarities between organisms make it possible to translate findings from simpler experimental systems—model organisms—into more complex ones, such as human. This translation facilitates the understanding of biological processes under normal or disease conditions. Researchers aiming to identify the similarities and differences between organisms at the molecular level need resources collecting multi-organism tissue expression data. We have developed a database of gene–tissue associations in human, mouse, rat and pig by integrating multiple sources of evidence: transcriptomics covering all four species and proteomics (human only), manually curated and mined from the scientific literature. Through a scoring scheme, these associations are made comparable across all sources of evidence and across organisms. Furthermore, the scoring produces a confidence score assigned to each of the associations. The TISSUES database (version 2.0) is publicly accessible through a user-friendly web interface and as part of the STRING app for Cytoscape. In addition, we analyzed the agreement between datasets, across and within organisms, and identified that the agreement is mainly affected by the quality of the datasets rather than by the technologies used or organisms compared. Database URL: http://tissues.jensenlab.org/ PMID:29617745
NASA Astrophysics Data System (ADS)
Heather, David
2016-07-01
Introduction: The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces (e.g. FTP browser, Map based, Advanced search, and Machine interface): http://archives.esac.esa.int/psa All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. Updating the PSA: The PSA is currently implementing a number of significant changes, both to its web-based interface to the scientific community, and to its database structure. The new PSA will be up-to-date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's upcoming ExoMars and BepiColombo missions. The newly designed PSA homepage will provide direct access to scientific datasets via a text search for targets or missions. This will significantly reduce the complexity for users to find their data and will promote one-click access to the datasets. Additionally, the homepage will provide direct access to advanced views and searches of the datasets. Users will have direct access to documentation, information and tools that are relevant to the scientific use of the dataset, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. A login mechanism will provide additional functionalities to the users to aid / ease their searches (e.g. saving queries, managing default views). Queries to the PSA database will be possible either via the homepage (for simple searches of missions or targets), or through a filter menu for more tailored queries. The filter menu will offer multiple options to search for a particular dataset or product, and will manage queries for both in-situ and remote sensing instruments. Parameters such as start-time, phase angle, and heliocentric distance will be emphasized. A further advanced search function will allow users to query all the metadata present in the PSA database. Results will be displayed in 3 different ways: 1) A table listing all the corresponding data matching the criteria in the filter menu, 2) a projection of the products onto the surface of the object when applicable (i.e. planets, small bodies), and 3) a list of images for the relevant instruments to enjoy the beauty of our Solar System. These different ways of viewing the datasets will ensure that scientists and non-professionals alike will have access to the specific data they are looking for, regardless of their background. Conclusions: The new PSA will maintain the various interfaces and services it had in the past, and will include significant improvements designed to allow easier and more effective access to the scientific data and supporting materials. The new PSA is expected to be released by mid-2016. It will support the past, present and future missions, ancillary datasets, and will enhance the scientific output of ESA's missions. As such, the PSA will become a unique archive ensuring the long-term preservation and usage of scientific datasets together with user-friendly access.
NASA Astrophysics Data System (ADS)
Heather, David; Besse, Sebastien; Barbarisi, Isa; Arviset, Christophe; de Marchi, Guido; Barthelemy, Maud; Docasal, Ruben; Fraga, Diego; Grotheer, Emmanuel; Lim, Tanya; Macfarlane, Alan; Martinez, Santa; Rios, Carlos
2016-04-01
Introduction: The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces (e.g. FTP browser, Map based, Advanced search, and Machine interface): http://archives.esac.esa.int/psa All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. Updating the PSA: The PSA is currently implementing a number of significant changes, both to its web-based interface to the scientific community, and to its database structure. The new PSA will be up-to-date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's upcoming ExoMars and BepiColombo missions. The newly designed PSA homepage will provide direct access to scientific datasets via a text search for targets or missions. This will significantly reduce the complexity for users to find their data and will promote one-click access to the datasets. Additionally, the homepage will provide direct access to advanced views and searches of the datasets. Users will have direct access to documentation, information and tools that are relevant to the scientific use of the dataset, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. A login mechanism will provide additional functionalities to the users to aid / ease their searches (e.g. saving queries, managing default views). Queries to the PSA database will be possible either via the homepage (for simple searches of missions or targets), or through a filter menu for more tailored queries. The filter menu will offer multiple options to search for a particular dataset or product, and will manage queries for both in-situ and remote sensing instruments. Parameters such as start-time, phase angle, and heliocentric distance will be emphasized. A further advanced search function will allow users to query all the metadata present in the PSA database. Results will be displayed in 3 different ways: 1) A table listing all the corresponding data matching the criteria in the filter menu, 2) a projection of the products onto the surface of the object when applicable (i.e. planets, small bodies), and 3) a list of images for the relevant instruments to enjoy the beauty of our Solar System. These different ways of viewing the datasets will ensure that scientists and non-professionals alike will have access to the specific data they are looking for, regardless of their background. Conclusions: The new PSA will maintain the various interfaces and services it had in the past, and will include significant improvements designed to allow easier and more effective access to the scientific data and supporting materials. The new PSA is expected to be released by mid-2016. It will support the past, present and future missions, ancillary datasets, and will enhance the scientific output of ESA's missions. As such, the PSA will become a unique archive ensuring the long-term preservation and usage of scientific datasets together with user-friendly access.
Toward an integrated knowledge environment to support modern oncology.
Blake, Patrick M; Decker, David A; Glennon, Timothy M; Liang, Yong Michael; Losko, Sascha; Navin, Nicholas; Suh, K Stephen
2011-01-01
Around the world, teams of researchers continue to develop a wide range of systems to capture, store, and analyze data including treatment, patient outcomes, tumor registries, next-generation sequencing, single-nucleotide polymorphism, copy number, gene expression, drug chemistry, drug safety, and toxicity. Scientists mine, curate, and manually annotate growing mountains of data to produce high-quality databases, while clinical information is aggregated in distant systems. Databases are currently scattered, and relationships between variables coded in disparate datasets are frequently invisible. The challenge is to evolve oncology informatics from a "systems" orientation of standalone platforms and silos into an "integrated knowledge environment" that will connect "knowable" research data with patient clinical information. The aim of this article is to review progress toward an integrated knowledge environment to support modern oncology with a focus on supporting scientific discovery and improving cancer care.
Behavior Correlates of Post-Stroke Disability Using Data Mining and Infographics
Yoon, Sunmoo; Gutierrez, Jose
2015-01-01
Purpose Disability is a potential risk for stroke survivors. This study aims to identify disability risk factors associated with stroke and their relative importance and relationships from a national behavioral risk factor dataset. Methods Data of post-stroke individuals in the U.S. (n=19,603) including 397 variables were extracted from a publicly available national dataset and analyzed. Data mining algorithms including C4.5 and linear regression with M5s methods were applied to build association models for post-stroke disability using Weka software. The relative importance and relationship of 70 variables associated with disability were presented in infographics for clinicians to understand easily. Results Fifty-five percent of post-stroke patients experience disability. Exercise, employment and satisfaction with life were relatively important factors associated with disability among stroke patients. Modifiable behavior factors strongly associated with disability include exercise (OR: 0.46, P<0.01) and good rest (OR 0.37, P<0.01). Conclusions Data mining is promising for discovering factors associated with post-stroke disability from a large population dataset. The findings can be potentially valuable for establishing priorities for clinicians and researchers and for stroke patient education. The methods may generalize to other health conditions. PMID:26835413
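The odds ratios quoted above (e.g. OR 0.46 for exercise) come from the association models; as a reference, the sketch below shows only how an odds ratio and its confidence interval are computed from a 2x2 table. The counts are hypothetical, not values from the study's dataset.

```python
import math

#                 disability   no disability
exposed   = [          60,            140]   # e.g. exercises regularly
unexposed = [         150,            160]   # does not exercise

# Odds ratio: (a*d) / (b*c) for the 2x2 table [[a, b], [c, d]].
odds_ratio = (exposed[0] * unexposed[1]) / (exposed[1] * unexposed[0])
# Standard error of log(OR) = sqrt(1/a + 1/b + 1/c + 1/d).
se_log_or = math.sqrt(sum(1 / c for c in exposed + unexposed))
ci = [math.exp(math.log(odds_ratio) + z * se_log_or) for z in (-1.96, 1.96)]
print(f"OR = {odds_ratio:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```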
Enhancing the Performance of LibSVM Classifier by Kernel F-Score Feature Selection
NASA Astrophysics Data System (ADS)
Sarojini, Balakrishnan; Ramaraj, Narayanasamy; Nickolas, Savarimuthu
Medical data mining is the search for relationships and patterns within medical datasets that could provide useful knowledge for effective clinical decisions. The inclusion of irrelevant, redundant and noisy features in the process model results in poor predictive accuracy. Much research work in data mining has gone into improving the predictive accuracy of classifiers by applying techniques of feature selection. Feature selection in medical data mining is appreciable, as the diagnosis of the disease can be made in this patient-care activity with a minimum number of significant features. The objective of this work is to show that selecting the more significant features improves the performance of the classifier. We empirically evaluate the classification effectiveness of the LibSVM classifier on the reduced feature subset of a diabetes dataset. The evaluations suggest that the selected feature subset improves the predictive accuracy of the classifier and reduces false negatives and false positives.
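A sketch of F-score feature ranking ahead of an SVM, assuming the standard F-score definition of Chen and Lin; the kernel-space variant named in the title would compute the same statistic on kernel-transformed data. The data here is a synthetic stand-in, not the actual diabetes dataset, and scikit-learn's SVC (which wraps LibSVM) is used as the classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC  # LibSVM-based implementation

X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           random_state=0)

def f_score(X, y):
    """Chen & Lin F-score per feature: between-class separation over
    within-class scatter; higher means more discriminative."""
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(0) - X.mean(0)) ** 2 + (neg.mean(0) - X.mean(0)) ** 2
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / den

keep = np.argsort(f_score(X, y))[::-1][:4]          # top-4 features
full = cross_val_score(SVC(), X, y, cv=5).mean()
reduced = cross_val_score(SVC(), X[:, keep], y, cv=5).mean()
print(f"all features: {full:.3f}  selected subset: {reduced:.3f}")
```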
Interpreter of maladies: redescription mining applied to biomedical data analysis.
Waltman, Peter; Pearlman, Alex; Mishra, Bud
2006-04-01
Comprehensive, systematic and integrated data-centric statistical approaches to disease modeling can provide powerful frameworks for understanding disease etiology. Here, one such computational framework based on redescription mining in both its incarnations, static and dynamic, is discussed. The static framework provides bioinformatic tools applicable to multifaceted datasets, containing genetic, transcriptomic, proteomic, and clinical data for diseased patients and normal subjects. The dynamic redescription framework provides systems biology tools to model complex sets of regulatory, metabolic and signaling pathways in the initiation and progression of a disease. As an example, the case of chronic fatigue syndrome (CFS) is considered, which has so far remained intractable and unpredictable in its etiology and nosology. The redescription mining approaches can be applied to the Centers for Disease Control and Prevention's Wichita (KS, USA) dataset, integrating transcriptomic, epidemiological and clinical data, and can also be used to study how pathways in the hypothalamic-pituitary-adrenal axis affect CFS patients.
BRAD, the genetics and genomics database for Brassica plants.
Cheng, Feng; Liu, Shengyi; Wu, Jian; Fang, Lu; Sun, Silong; Liu, Bo; Li, Pingxia; Hua, Wei; Wang, Xiaowu
2011-10-13
Brassica species include both vegetable and oilseed crops, which are very important to daily human life. The Brassica species also represent an excellent system for studying numerous aspects of plant biology, specifically the analysis of genome evolution following polyploidy, so they are also very important for scientific research. Now that the genome of Brassica rapa has been assembled, it is time to mine the genome data deeply. BRAD, the Brassica database, is a web-based resource focusing on genome-scale genetic and genomic data for important Brassica crops. BRAD was built from the first whole genome sequence and further data analysis of the Brassica A genome species, Brassica rapa (Chiifu-401-42). It provides datasets such as the complete genome sequence of B. rapa, which was de novo assembled from Illumina GA II short reads and from BAC clone sequences, predicted genes and associated annotations, non-coding RNAs, transposable elements (TE), B. rapa genes orthologous to those in A. thaliana, as well as genetic markers and linkage maps. BRAD offers useful searching and data mining tools, including search across annotation datasets, search for syntenic or non-syntenic orthologs, and search of the flanking regions of a given target, as well as BLAST and GBrowse tools. BRAD allows users to enter almost any kind of information, such as a B. rapa or A. thaliana gene ID, physical position or genetic marker. BRAD, a new database focusing on the genetics and genomics of the Brassica plants, has been developed; it aims to help scientists and breeders fully and efficiently use the genome data of Brassica plants. BRAD will be continuously updated and can be accessed through http://brassicadb.org.
ERIC Educational Resources Information Center
Yu, Pulan
2012-01-01
Classification, clustering and association mining are major tasks of data mining and have been widely used for knowledge discovery. Associative classification mining, the combination of both association rule mining and classification, has emerged as an indispensable way to support decision making and scientific research. In particular, it offers a…
Using PIDs to Support the Full Research Data Publishing Lifecycle
NASA Astrophysics Data System (ADS)
Waard, A. D.
2016-12-01
Persistent identifiers (PIDs) can help support scientific research, track scientific impact and let researchers achieve recognition for their work. We discuss a number of ways in which Elsevier utilizes PIDs to support the scholarly lifecycle: To improve the process of storing and sharing data, Mendeley Data (http://data.mendeley.com) makes use of persistent identifiers to support the dynamic nature of data and software, by tracking and recording the provenance and versioning of datasets. This system now allows the comparison of different versions of a dataset, to see precisely what was changed during a versioning update. To present research data in context for the reader, we include PIDs in research articles as hyperlinks: https://www.elsevier.com/books-and-journals/content-innovation/data-base-linking. In some cases, PIDs fetch data files from repositories that allow the embedding of visualizations, e.g. with PANGAEA and PubChem: https://www.elsevier.com/books-and-journals/content-innovation/protein-viewer; https://www.elsevier.com/books-and-journals/content-innovation/pubchem. To normalize referenced data elements, the Resource Identification Initiative - which we developed together with members of the Force11 RRID group - introduces a unified standard for resource identifiers (RRIDs) that can easily be interpreted by both humans and text mining tools (https://www.force11.org/group/resource-identification-initiative/update-resource-identification-initiative), as can be seen in our Antibody Data app: https://www.elsevier.com/books-and-journals/content-innovation/antibody-data. To enable better citation practices and support a robust metrics system for sharing research data, we have helped develop, and are early adopters of, the Force11 Data Citation Principles and Implementation groups (https://www.force11.org/group/dcip). Lastly, through our work with the Research Data Alliance Publishing Data Services group, we helped create a set of guidelines (http://www.scholix.org/guidelines) and a demonstrator service (http://dliservice.research-infrastructures.eu/#/) for a linked data network connecting datasets, articles, and individuals, all of which rely on robust PIDs.
Data-Mining Technologies for Diabetes: A Systematic Review
Marinov, Miroslav; Mosa, Abu Saleh Mohammad; Yoo, Illhoi; Boren, Suzanne Austin
2011-01-01
Background The objective of this study is to conduct a systematic review of applications of data-mining techniques in the field of diabetes research. Method We searched the MEDLINE database through PubMed. We initially identified 31 articles by the search, and selected 17 articles representing various data-mining methods used for diabetes research. Our main interest was to identify research goals, diabetes types, data sets, data-mining methods, data-mining software and technologies, and outcomes. Results The applications of data-mining techniques in the selected articles were useful for extracting valuable knowledge and generating new hypotheses for further scientific research/experimentation and improving health care for diabetes patients. The results could be used for both scientific research and real-life practice to improve the quality of health care for diabetes patients. Conclusions Data mining has played an important role in diabetes research. Data mining would be a valuable asset for diabetes researchers because it can unearth hidden knowledge from a huge amount of diabetes-related data. We believe that data mining can significantly help diabetes research and ultimately improve the quality of health care for diabetes patients. PMID:22226277
Salem, Saeed; Ozcaglar, Cagri
2014-01-01
Advances in genomic technologies have enabled the accumulation of vast amounts of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression dataset, which suffers from spurious coexpression. We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on human gene expression datasets show that the reported modules are functionally homogeneous, as evident by their enrichment with biological process GO terms and KEGG pathways.
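A condensed sketch of the hybrid-graph construction described above, assuming coexpression links have already been extracted per dataset: nodes of the hybrid graph are the links themselves, and edge weights linearly combine a topological similarity (shared genes) with a co-appearance similarity (shared datasets). The Jaccard similarities, the alpha weighting, and the toy links are illustrative choices, not the paper's exact definitions.

```python
import networkx as nx

# Hypothetical input: coexpression link -> set of datasets supporting it.
links = {
    ("g1", "g2"): {"ds1", "ds2", "ds3"},
    ("g2", "g3"): {"ds1", "ds2"},
    ("g4", "g5"): {"ds3"},
}

def topo_sim(l1, l2):   # Jaccard over the genes of the two links
    return len(set(l1) & set(l2)) / len(set(l1) | set(l2))

def coapp_sim(d1, d2):  # Jaccard over the datasets supporting them
    return len(d1 & d2) / len(d1 | d2)

alpha, hybrid = 0.5, nx.Graph()
keys = list(links)
for i, a in enumerate(keys):
    for b in keys[i + 1:]:
        w = alpha * topo_sim(a, b) + (1 - alpha) * coapp_sim(links[a], links[b])
        if w > 0:
            hybrid.add_edge(a, b, weight=w)

# Clustering this weighted graph (e.g. with a community detection routine)
# would yield the recurrent coexpression modules.
print(hybrid.edges(data=True))
```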
Benchmark Dataset for Whole Genome Sequence Compression.
C L, Biji; S Nair, Achuthsankar
2017-01-01
Research in DNA data compression lacks a standard dataset on which to test compression tools specific to DNA. This paper argues that the current state of achievement in DNA compression cannot be benchmarked in the absence of such a scientifically compiled whole genome sequence dataset, and proposes a benchmark dataset built using a multistage sampling procedure. Considering the genome sequences of organisms available in the National Center for Biotechnology Information (NCBI) as the universe, the proposed dataset selects 1,105 prokaryotes, 200 plasmids, 164 viruses, and 65 eukaryotes. This paper reports the results of using three established tools on the newly compiled dataset and shows that their strengths and weaknesses become evident only through a comparison based on the scientifically compiled benchmark dataset. The sample dataset and the respective links are available @ https://sourceforge.net/projects/benchmarkdnacompressiondataset/.
NASA Astrophysics Data System (ADS)
Vallat, C.; Besse, S.; Barbarisi, I.; Arviset, C.; De Marchi, G.; Barthelemy, M.; Coia, D.; Costa, M.; Docasal, R.; Fraga, D.; Heather, D. J.; Lim, T.; Macfarlane, A.; Martinez, S.; Rios, C.; Vallejo, F.; Said, J.
2017-09-01
The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces at http://psa.esa.int. All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. The PSA has started to implement a number of significant improvements, mostly driven by the evolution of the PDS standards, and the growing need for better interfaces and advanced applications to support science exploitation.
Biblio-MetReS for user-friendly mining of genes and biological processes in scientific documents.
Usie, Anabel; Karathia, Hiren; Teixidó, Ivan; Alves, Rui; Solsona, Francesc
2014-01-01
One way to initiate the reconstruction of molecular circuits is by using automated text-mining techniques. Developing more efficient methods for such reconstruction is a topic of active research, and those methods are typically included by bioinformaticians in pipelines used to mine and curate large literature datasets. Nevertheless, experimental biologists have a limited number of available user-friendly tools that use text mining for network reconstruction and require no programming skills. One of these tools is Biblio-MetReS. Originally, this tool permitted an on-the-fly analysis of documents contained in a number of web-based literature databases to identify co-occurrence of proteins/genes. This approach ensured results that were always up-to-date with the latest live version of the databases. However, this 'up-to-dateness' came at the cost of large execution times. Here we report an evolution of the application Biblio-MetReS that permits constructing co-occurrence networks for genes, GO processes, pathways, or any combination of the three types of entities, and graphically representing those entities. We show that the performance of Biblio-MetReS in identifying gene co-occurrence is at least as good as that of other comparable applications (STRING and iHOP). In addition, we show that the identification of GO processes is on par with that reported in the latest BioCreAtIvE challenge. Finally, we report the implementation of a new strategy that combines on-the-fly analysis of new documents with preprocessed information from documents encountered in previous analyses. This combination simultaneously decreases program run time and maintains the 'up-to-dateness' of the results. http://metres.udl.cat/index.php/downloads, metres.cmb@gmail.com.
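Gene co-occurrence counting of the kind Biblio-MetReS automates, sketched over a toy corpus: any pair of genes mentioned in the same document is a candidate interaction, weighted by the number of supporting documents. The document snippets and gene list are invented; a real pipeline would also handle synonyms and tokenization.

```python
from itertools import combinations
from collections import Counter

genes = {"TP53", "MDM2", "BRCA1"}
documents = [
    "TP53 is stabilized when MDM2 is inhibited ...",
    "We review MDM2 and TP53 feedback regulation ...",
    "BRCA1 mutations were screened in this cohort ...",
]

cooccur = Counter()
for doc in documents:
    found = sorted(g for g in genes if g in doc)  # genes mentioned in doc
    cooccur.update(combinations(found, 2))        # count every co-mention

for (a, b), n_docs in cooccur.most_common():
    print(f"{a} -- {b}: co-mentioned in {n_docs} document(s)")
```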
Mining Institutional Datasets to Support Policy Making and Implementation
ERIC Educational Resources Information Center
Yorke, Mantz; Barnett, Greg; Evanson, Peter; Haines, Chris; Jenkins, Don; Knight, Peter; Scurry, Dave; Stowell, Marie; Woolf, Harvey
2005-01-01
Datasets are often under-exploited by institutions, yet they contain evidence that is potentially of high value for planning and decision-making. This article shows how institutional data were used to determine whether the demographic background of students might have an influence on their performance: this is a matter of particular interest where…
Empirical evaluation of interest-level criteria
NASA Astrophysics Data System (ADS)
Sahar, Sigal; Mansour, Yishay
1999-02-01
Efficient association rule mining algorithms already exist; however, as the size of databases increases, the number of patterns mined by the algorithms increases to such an extent that their manual evaluation becomes impractical. Automatic evaluation methods are, therefore, required in order to sift through the initial list of rules that the data mining algorithm outputs. These evaluation methods, or criteria, rank the association rules mined from the dataset. We empirically examined several such statistical criteria: new criteria, as well as previously known ones. The empirical evaluation was conducted using several databases, including a large real-life dataset acquired from an order-by-phone grocery store, a dataset composed from WWW proxy logs, and several datasets from the UCI repository. We were interested in discovering whether the rankings performed by the various criteria are similar or easily distinguishable. Where significant differences exist, our evaluation detected three patterns of behavior in the eight criteria we examined. There is an obvious dilemma in determining how many association rules to choose (in accordance with support and confidence parameters). The tradeoff is between having stringent parameters and, therefore, few rules, or lenient parameters and, thus, a multitude of rules. In many cases, our empirical evaluation revealed that most of the rules found with the comparably strict parameters ranked highly according to the interestingness criteria when using lax parameters (which produce significantly more association rules). Finally, we discuss the association rules that ranked highest, explain why these results are sound, and show how they direct future research.
High-throughput neuroimaging-genetics computational infrastructure
Dinov, Ivo D.; Petrosyan, Petros; Liu, Zhizhong; Eggert, Paul; Hobel, Sam; Vespa, Paul; Woo Moon, Seok; Van Horn, John D.; Franco, Joseph; Toga, Arthur W.
2014-01-01
Many contemporary neuroscientific investigations face significant challenges in terms of data management, computational processing, data mining, and results interpretation. These four pillars define the core infrastructure necessary to plan, organize, orchestrate, validate, and disseminate novel scientific methods, computational resources, and translational healthcare findings. Data management includes protocols for data acquisition, archival, query, transfer, retrieval, and aggregation. Computational processing involves the necessary software, hardware, and networking infrastructure required to handle large amounts of heterogeneous neuroimaging, genetics, clinical, and phenotypic data and meta-data. Data mining refers to the process of automatically extracting data features, characteristics and associations, which are not readily visible by human exploration of the raw dataset. Result interpretation includes scientific visualization, community validation of findings and reproducible findings. In this manuscript we describe the novel high-throughput neuroimaging-genetics computational infrastructure available at the Institute for Neuroimaging and Informatics (INI) and the Laboratory of Neuro Imaging (LONI) at the University of Southern California (USC). INI and LONI include ultra-high-field and standard-field MRI brain scanners along with an imaging-genetics database for storing the complete provenance of the raw and derived data and meta-data. In addition, the institute provides a large number of software tools for image and shape analysis, mathematical modeling, genomic sequence processing, and scientific visualization. A unique feature of this architecture is the Pipeline environment, which integrates the data management, processing, transfer, and visualization. Through its client-server architecture, the Pipeline environment provides a graphical user interface for designing, executing, monitoring, validating, and disseminating complex protocols that utilize diverse suites of software tools and web-services. These pipeline workflows are represented as portable XML objects, which transfer the execution instructions and user specifications from the client user machine to remote pipeline servers for distributed computing. Using Alzheimer's and Parkinson's data, we provide several examples of translational applications using this infrastructure. PMID:24795619
Data Mining Research with the LSST
NASA Astrophysics Data System (ADS)
Borne, Kirk D.; Strauss, M. A.; Tyson, J. A.
2007-12-01
The LSST catalog database will exceed 10 petabytes, comprising several hundred attributes for 5 billion galaxies, 10 billion stars, and over 1 billion variable sources (optical variables, transients, or moving objects), extracted from over 20,000 square degrees of deep imaging in 5 passbands with thorough time domain coverage: 1000 visits over the 10-year LSST survey lifetime. The opportunities are enormous for novel scientific discoveries within this rich time-domain ultra-deep multi-band survey database. Data Mining, Machine Learning, and Knowledge Discovery research opportunities with the LSST are now under study, with a potential for new collaborations to develop to contribute to these investigations. We will describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. We also give some illustrative examples of current scientific data mining research in astronomy, and point out where new research is needed. In particular, the data mining research community will need to address several issues in the coming years as we prepare for the LSST data deluge. The data mining research agenda includes: scalability (at petabytes scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; designing a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for exploration of petascale databases; visual data mining algorithms for visual exploration of the data; indexing of multi-attribute multi-dimensional astronomical databases (beyond RA-Dec spatial indexing) for rapid querying of petabyte databases; and more. Finally, we will identify opportunities for synergistic collaboration between the data mining research group and the LSST Data Management and Science Collaboration teams.
Mining algorithm for association rules in big data based on Hadoop
NASA Astrophysics Data System (ADS)
Fu, Chunhua; Wang, Xiaojing; Zhang, Lijun; Qiao, Liying
2018-04-01
To address the problem that traditional association rule mining algorithms cannot meet the efficiency and scalability needs of mining large amounts of data, FP-Growth is taken as an example and parallelized on the Hadoop framework using the MapReduce model. On this basis, the algorithm is improved with a transaction-reduction method to further enhance its mining efficiency. Experiments, consisting of verification of the parallel mining results, comparison of efficiency between the serial and parallel versions, and the relationships between mining time and node number and between mining time and data volume, are carried out on a Hadoop cluster. The experiments show that the parallelized FP-Growth algorithm is able to accurately mine frequent item sets, with better performance and scalability. It can better meet the requirements of big data mining and efficiently extract frequent item sets and association rules from large datasets.
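FP-Growth itself can be demonstrated single-node with the mlxtend library as a stand-in for the paper's Hadoop-parallelized version (a cluster implementation would partition transactions across nodes before this step). The transactions and thresholds below are toy choices.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["milk", "bread", "beer"], ["milk", "bread"],
                ["bread", "beer"], ["milk", "diapers", "beer"]]

# One-hot encode the transaction list into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Mine frequent item sets with FP-Growth, then derive association rules.
frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```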
Khashan, Raed; Zheng, Weifan; Tropsha, Alexander
2014-03-01
We present a novel approach to generating fragment-based molecular descriptors. Each molecule is represented by a labeled undirected chemical graph. Fast Frequent Subgraph Mining (FFSM) is used to find chemical fragments (subgraphs) that occur in at least a subset of all molecules in a dataset. The collection of frequent subgraphs (FSG) forms a set of dataset-specific descriptors whose values for each molecule are defined by the number of times each frequent fragment occurs in that molecule. We have employed the FSG descriptors to develop variable-selection k Nearest Neighbor (kNN) QSAR models of several datasets with binary target properties, including Maximum Recommended Therapeutic Dose (MRTD), Salmonella Mutagenicity (Ames Genotoxicity), and P-Glycoprotein (PGP) data. Each dataset was divided into training, test, and validation sets to establish the statistical figures of merit reflecting the models' validated predictive power. The classification accuracies of models for both training and test sets for all datasets exceeded 75%, and the accuracy for the external validation sets exceeded 72%. The model accuracies were comparable to or better than those reported earlier in the literature for the same datasets. Furthermore, the use of fragment-based descriptors affords mechanistic interpretation of validated QSAR models in terms of essential chemical fragments responsible for the compounds' target property. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
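A hedged sketch of the modeling step follows: a hypothetical, randomly generated fragment-occurrence matrix stands in for real FFSM output, and a kNN classifier of the kind described is fit with scikit-learn (all dimensions and the toy activity rule are assumptions).

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical descriptor matrix: rows are molecules, columns are
# frequent subgraphs; entry (i, j) counts occurrences of fragment j in
# molecule i (in practice produced by an FSM tool such as FFSM).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 50))            # fragment occurrence counts
# Toy structure: activity driven by the first five fragments.
y = (X[:, :5].sum(axis=1) > 5).astype(int)       # binary target property

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy: %.2f" % model.score(X_test, y_test))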
ERIC Educational Resources Information Center
Pechenizkiy, Mykola; Calders, Toon; Conati, Cristina; Ventura, Sebastian; Romero, Cristobal; Stamper, John
2011-01-01
The 4th International Conference on Educational Data Mining (EDM 2011) brings together researchers from computer science, education, psychology, psychometrics, and statistics to analyze large datasets to answer educational research questions. The conference, held in Eindhoven, The Netherlands, July 6-9, 2011, follows the three previous editions…
78 FR 25266 - An Assessment of Potential Mining Impacts on Salmon Ecosystems of Bristol Bay, Alaska
Federal Register 2010, 2011, 2012, 2013, 2014
2013-04-30
... information presented in the report, the realistic mining scenario used, the data and information used to... additional data or scientific or technical information about Bristol Bay resources or large-scale mining that... Potential Mining Impacts on Salmon Ecosystems of Bristol Bay, Alaska AGENCY: Environmental Protection Agency...
78 FR 34093 - An Assessment of Potential Mining Impacts on Salmon Ecosystems of Bristol Bay, Alaska
Federal Register 2010, 2011, 2012, 2013, 2014
2013-06-06
... scientific and technical information presented in the report, the realistic mining scenario used, the data... Potential Mining Impacts on Salmon Ecosystems of Bristol Bay, Alaska AGENCY: Environmental Protection Agency... document titled, ``An Assessment of Potential Mining Impacts on Salmon Ecosystems of Bristol Bay, Alaska...
Building a better search engine for earth science data
NASA Astrophysics Data System (ADS)
Armstrong, E. M.; Yang, C. P.; Moroni, D. F.; McGibbney, L. J.; Jiang, Y.; Huang, T.; Greguska, F. R., III; Li, Y.; Finch, C. J.
2017-12-01
Free-text searching of earth science datasets has been implemented with varying degrees of success and completeness across the spectrum of the 12 NASA earth science data centers. At the JPL Physical Oceanography Distributed Active Archive Center (PO.DAAC) the search engine has been developed around the Solr/Lucene platform; others have chosen popular enterprise search platforms such as Elasticsearch. Regardless, the default implementations of these search engines, which leverage factors such as dataset popularity, term frequency, and inverse document frequency, do not fully meet the needs of precise relevancy and ranking of earth science search results. For the PO.DAAC, this shortcoming has been identified for several years by its external User Working Group, which has issued several recommendations to improve the relevancy and discoverability of datasets related to remotely sensed sea surface temperature, ocean wind, waves, salinity, height, and gravity, together comprising over 500 publicly available datasets. Recently, the PO.DAAC has teamed with an effort led by George Mason University to improve the search and relevancy ranking of oceanographic data via a simple search interface and powerful backend services called MUDROD (Mining and Utilizing Dataset Relevancy from Oceanographic Datasets to Improve Data Discovery), funded by the NASA AIST program. MUDROD mines and utilizes the combination of PO.DAAC earth science dataset metadata, usage metrics, and user feedback and search history to objectively extract relevance for improved data discovery and access. In addition to improved dataset relevance and ranking, the MUDROD search engine also returns recommendations of related datasets and related user queries. This presentation will report on the use cases that drove the architecture and development, and on the success metrics and improvements in search precision and recall that MUDROD has demonstrated over the existing PO.DAAC search interfaces.
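As a toy illustration of blending text relevance with usage signals in the spirit of MUDROD, consider the following sketch; the weights, scores, and dataset identifiers are invented, not MUDROD's actual model.

def rank_datasets(query_scores, click_counts, w_text=0.7, w_usage=0.3):
    """Blend a text-relevance score with a normalized usage signal.
    Both inputs are dicts keyed by dataset id; weights are illustrative."""
    max_clicks = max(click_counts.values()) or 1
    blended = {
        ds: w_text * query_scores.get(ds, 0.0)
            + w_usage * click_counts.get(ds, 0) / max_clicks
        for ds in query_scores
    }
    return sorted(blended, key=blended.get, reverse=True)

# Toy example: Lucene-style text scores plus download counts.
scores = {"sst_l4_mur": 2.1, "sst_avhrr": 1.8, "wind_ascat": 0.4}
clicks = {"sst_l4_mur": 120, "sst_avhrr": 900, "wind_ascat": 15}
print(rank_datasets(scores, clicks))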
Becnel, Lauren B.; Darlington, Yolanda F.; Ochsner, Scott A.; Easton-Marks, Jeremy R.; Watkins, Christopher M.; McOwiti, Apollo; Kankanamge, Wasula H.; Wise, Michael W.; DeHart, Michael; Margolis, Ronald N.; McKenna, Neil J.
2015-01-01
Signaling pathways involving nuclear receptors (NRs), their ligands and coregulators, regulate tissue-specific transcriptomes in diverse processes, including development, metabolism, reproduction, the immune response and neuronal function, as well as in their associated pathologies. The Nuclear Receptor Signaling Atlas (NURSA) is a Consortium focused around a Hub website (www.nursa.org) that annotates and integrates diverse ‘omics datasets originating from the published literature and NURSA-funded Data Source Projects (NDSPs). These datasets are then exposed to the scientific community on an Open Access basis through user-friendly data browsing and search interfaces. Here, we describe the redesign of the Hub, version 3.0, to deploy “Web 2.0” technologies and add richer, more diverse content. The Molecule Pages, which aggregate information relevant to NR signaling pathways from myriad external databases, have been enhanced to include resources for basic scientists, such as post-translational modification sites and targeting miRNAs, and for clinicians, such as clinical trials. A portal to NURSA’s Open Access, PubMed-indexed journal Nuclear Receptor Signaling has been added to facilitate manuscript submissions. Datasets and information on reagents generated by NDSPs are available, as is information concerning periodic new NDSP funding solicitations. Finally, the new website integrates the Transcriptomine analysis tool, which allows for mining of millions of richly annotated public transcriptomic data points in the field, providing an environment for dataset re-use and citation, bench data validation and hypothesis generation. We anticipate that this new release of the NURSA database will have tangible, long term benefits for both basic and clinical research in this field. PMID:26325041
Kharat, Amit T; Singh, Amarjit; Kulkarni, Vilas M; Shah, Digish
2014-01-01
Data mining facilitates the study of radiology data in various dimensions. It converts large patient image and text datasets into useful information that helps in improving patient care and provides informative reports. Data mining technology analyzes data within the Radiology Information System and Hospital Information System using specialized software that assesses relationships and agreement in the available information. By using such data analysis tools, radiologists can make informed decisions and predict the future outcome of a particular imaging finding. Data, information, and knowledge are the components of data mining. Classes, clusters, associations, sequential patterns, classification, prediction, and decision trees are the principal types of data mining. Data mining has the potential to make the delivery of health care affordable and to ensure that the best imaging practices are followed. It is also a tool for academic research. Data mining is considered to be ethically neutral; however, concerns regarding privacy and legality exist and need to be addressed to ensure the success of data mining. PMID:25024513
Mining Concept Maps from News Stories for Measuring Civic Scientific Literacy in Media
ERIC Educational Resources Information Center
Tseng, Yuen-Hsien; Chang, Chun-Yen; Rundgren, Shu-Nu Chang; Rundgren, Carl-Johan
2010-01-01
Motivated by a long-term goal in education for measuring Taiwanese civic scientific literacy in media (SLiM), this work reports the detailed techniques to efficiently mine a concept map from 2 years of Chinese news articles (901,446 in total) for SLiM instrument development. From the Chinese news stories, key terms (important words or phrases),…
NASA Astrophysics Data System (ADS)
Martinez, Santa; Besse, Sebastien; Heather, Dave; Barbarisi, Isa; Arviset, Christophe; De Marchi, Guido; Barthelemy, Maud; Docasal, Ruben; Fraga, Diego; Grotheer, Emmanuel; Lim, Tanya; Macfarlane, Alan; Rios, Carlos; Vallejo, Fran; Saiz, Jaime; ESDC (European Space Data Centre) Team
2016-10-01
The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces at http://archives.esac.esa.int/psa. All datasets are scientifically peer-reviewed by independent scientists, and are compliant with the Planetary Data System (PDS) standards. The PSA is currently implementing a number of significant improvements, mostly driven by the evolution of the PDS standard and the growing need for better interfaces and advanced applications to support science exploitation. The newly designed PSA will enhance the user experience and will significantly reduce the complexity for users to find their data, promoting one-click access to the scientific datasets with more specialised views when needed. This includes better integration with planetary GIS analysis tools and planetary interoperability services (to search and retrieve data, supporting e.g. PDAP and EPN-TAP). It will also be kept up to date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's ExoMars and upcoming BepiColombo missions. Users will have direct access to documentation, information and tools that are relevant to the scientific use of the dataset, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. A login mechanism will provide additional functionalities to aid and ease user searches (e.g. saving queries, managing default views). This contribution will introduce the new PSA, its key features and access interfaces.
Alkahest NuclearBLAST : a user-friendly BLAST management and analysis system
Diener, Stephen E; Houfek, Thomas D; Kalat, Sam E; Windham, DE; Burke, Mark; Opperman, Charles; Dean, Ralph A
2005-01-01
Background - Sequencing of EST and BAC end datasets is no longer limited to large research groups. Drops in per-base pricing have made high-throughput sequencing accessible to individual investigators. However, there are few options available which provide a free and user-friendly solution to the BLAST result storage and data mining needs of biologists. Results - Here we describe NuclearBLAST, a batch BLAST analysis, storage and management system designed for the biologist. It is a wrapper for NCBI BLAST that provides a user-friendly web interface, including a request wizard and the ability to view and mine the results. All BLAST results are stored in a MySQL database, which allows for more advanced data mining through supplied command-line utilities or direct database access. NuclearBLAST can be installed on a single machine or clustered amongst a number of machines to improve analysis throughput. NuclearBLAST provides a platform which eases data mining of multiple BLAST results. With the supplied scripts, the program can export data into a spreadsheet-friendly format, automatically assign Gene Ontology terms to sequences, and provide bi-directional best hits between two datasets. Users with SQL experience can use the database to ask even more complex questions and extract any subset of data they require. Conclusion - This tool provides a user-friendly interface for requesting, viewing and mining of BLAST results, which makes the management and data mining of large sets of BLAST analyses tractable to biologists. PMID:15958161
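A minimal sketch of the bi-directional (reciprocal) best-hit query pattern that such a SQL-backed store enables, here against an in-memory SQLite stand-in; the table schema, dataset labels, and rows are hypothetical, not NuclearBLAST's actual schema.

import sqlite3

# Hypothetical minimal schema for stored BLAST hits: each row is the
# best-scoring hit of a query sequence against the other dataset.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE best_hits (query TEXT, subject TEXT, dataset TEXT, bitscore REAL);
INSERT INTO best_hits VALUES
  ('geneA', 'orf1', 'A_vs_B', 310.0),
  ('orf1',  'geneA', 'B_vs_A', 305.0),
  ('geneB', 'orf2', 'A_vs_B', 120.0),
  ('orf2',  'geneC', 'B_vs_A', 118.0);
""")

# Reciprocal best hits: a forward best hit whose subject's own best hit
# points back at the original query.
rbh = con.execute("""
SELECT f.query, f.subject
FROM best_hits f
JOIN best_hits r ON r.query = f.subject AND r.subject = f.query
WHERE f.dataset = 'A_vs_B' AND r.dataset = 'B_vs_A'
""").fetchall()
print(rbh)  # [('geneA', 'orf1')]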
Mining the wealth of Phobos multispectral data contained in MEX-OMEGA and MRO-CRISM datasets
NASA Astrophysics Data System (ADS)
Pajola, Maurizio; Roush, Ted
2016-04-01
The origin of Phobos is still strongly debated between in situ formation and asteroid capture scenarios. Growing interest in Phobos as a scientific destination is demonstrated by the multiple NASA Discovery missions proposed in 2015, the proposed ESA M-class mission PhoDEx, and JAXA's Mars moons sample return mission (MMX, Mars Moons eXploration). We believe that there is still a wealth of data that can be mined in order to constrain Phobos' surface properties and possibly its origin. Several multispectral Phobos datasets are available on the NASA Planetary Data System. In this work we focus on the first steps of a new spectral analysis of the Mars satellite Phobos. We made use of the available Mars Express (MEX) OMEGA spectral cubes, obtained throughout the MEX mission, and Mars Reconnaissance Orbiter (MRO) CRISM hyperspectral data obtained during early MRO orbits around Mars. Previous analyses (Fraeman et al. 2012, 2014) mainly focused on three specific Regions of Interest (ROIs) located inside the so-called blue and red regions of Phobos, i.e. inside or close to Stickney crater (the biggest crater on Phobos, 8 km in diameter) and its eastern ejecta blanket. We extended this analysis by considering multiple ROIs located in both the leading and the trailing hemispheres of the satellite, taking advantage of the broader coverage of the OMEGA and CRISM data. This provides the possibility to detect spectral properties of intermediate areas located in the transition region between the blue and the red spectral units. The analyses can enable i) documentation of how the spectral slope changes between these two units, ii) delineation of their boundaries and surface extent, and iii) identification of additional surface materials, if present. This work paves the way to a more thorough analysis, foreseen in the near future, in which the widest spectral wavelength range available (from near-UV to far-IR) will be mined, returning a complete picture of surface properties to enable planning for, and validation/confirmation by, future Phobos spacecraft. Acknowledgements: We make use of the public NASA Planetary Data System MEX-OMEGA and MRO-CRISM spectral data of Phobos. M.P. was supported for this research by an appointment to the NASA Postdoctoral Program at the Ames Research Center, administered by Oak Ridge Associated Universities through a contract with NASA. References: Fraeman et al. 2012, J. Geophys. Res., E00J15, 10.1029/2012JE004137; Fraeman et al., 2014, Icarus, 229, 196-205, 10.1016/j.icarus.2013.11.021
NASA Astrophysics Data System (ADS)
Bonner, J.
2006-05-01
Differences in the energy partitioning of seismic phases from earthquakes and explosions provide an opportunity for event identification. In this talk, I will briefly review teleseismic Ms:mb and P/S ratio techniques that help identify events based on differences in compressional, shear, and surface wave energy generation from explosions and earthquakes. With the push to identify smaller-yield explosions, the identification process has become increasingly complex, as varied types of explosions, including chemical, mining, and nuclear, must be identified at regional distances. Thus, I will highlight some of the current views and problems associated with the energy partitioning of seismic phases from single- and delay-fired chemical explosions. One problem yet to have a universally accepted answer is whether the explosion and earthquake populations, based on the Ms:mb discriminants, should separate at smaller magnitudes. I will briefly describe the datasets and theory that support either converging or parallel behavior of these populations. Also, I will discuss improvements to the currently used methods that will better constrain this problem in the future. I will also discuss the role of regional P/S ratios in identifying explosions. In particular, recent datasets from South Africa, Scandinavia, and the Western United States, collected from earthquakes, single-fired chemical explosions, and/or delay-fired mining explosions, have provided new insight into regional P, S, Lg, and Rg energy partitioning. Data from co-located mining and chemical explosions suggest that some mining explosions may be used for limited calibration of regional discriminants in regions where no historic explosion data are available.
FraudMiner: A Novel Credit Card Fraud Detection Model Based on Frequent Itemset Mining
Seeja, K. R.; Zareapoor, Masoumeh
2014-01-01
This paper proposes an intelligent credit card fraud detection model for detecting fraud in highly imbalanced and anonymous credit card transaction datasets. The class imbalance problem is handled by finding legal as well as fraud transaction patterns for each customer using frequent itemset mining. A matching algorithm is also proposed to determine whether an incoming transaction of a particular customer is closer to the legal or the fraud pattern, and a decision is made accordingly. In order to handle the anonymous nature of the data, no preference is given to any of the attributes, and each attribute is considered equally when finding the patterns. The performance evaluation of the proposed model is done on the UCSD Data Mining Contest 2009 Dataset (anonymous and imbalanced), and it is found that the proposed model achieves a higher fraud detection rate, balanced classification rate, and Matthews correlation coefficient, and a much lower false alarm rate, than other state-of-the-art classifiers. PMID:25302317
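A toy sketch of the matching idea follows, assuming per-customer frequent attribute-value patterns have already been mined; the pattern contents and the overlap score are illustrative, not the paper's exact algorithm.

def overlap(pattern, transaction):
    # Fraction of the pattern's attribute values found in the transaction.
    return len(pattern & transaction) / len(pattern)

def classify(transaction, legal_patterns, fraud_patterns):
    """Label a transaction by whichever mined pattern set it matches best.
    Patterns here are frequent attribute-value sets mined per customer."""
    legal_score = max(overlap(p, transaction) for p in legal_patterns)
    fraud_score = max(overlap(p, transaction) for p in fraud_patterns)
    return "fraud" if fraud_score > legal_score else "legal"

legal = [frozenset({"amt:low", "hour:day", "country:home"})]
fraud = [frozenset({"amt:high", "hour:night", "country:foreign"})]
print(classify(frozenset({"amt:high", "hour:night", "country:home"}),
               legal, fraud))  # fraud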
AstroML: Python-powered Machine Learning for Astronomy
NASA Astrophysics Data System (ADS)
Vander Plas, Jake; Connolly, A. J.; Ivezic, Z.
2014-01-01
As astronomical data sets grow in size and complexity, automated machine learning and data mining methods are becoming an increasingly fundamental component of research in the field. The astroML project (http://astroML.org) provides a common repository for practical examples of the data mining and machine learning tools used and developed by astronomical researchers, written in Python. The astroML module contains a host of general-purpose data analysis and machine learning routines, loaders for openly-available astronomical datasets, and fast implementations of specific computational methods often used in astronomy and astrophysics. The associated website features hundreds of examples of these routines being used for analysis of real astronomical datasets, while the associated textbook provides a curriculum resource for graduate-level courses focusing on practical statistics, machine learning, and data mining approaches within Astronomical research. This poster will highlight several of the more powerful and unique examples of analysis performed with astroML, all of which can be reproduced in their entirety on any computer with the proper packages installed.
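The style of analysis astroML supports can be sketched with scikit-learn alone, since astroML builds on the same Python stack; the synthetic two-population color data below is a stand-in for what astroML's dataset loaders would supply, and all numbers are invented.

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for a photometric color-color diagram; in practice
# one of astroML's dataset loaders would supply real survey data.
rng = np.random.default_rng(42)
stars = rng.normal([0.5, 0.2], 0.05, size=(500, 2))
quasars = rng.normal([0.2, 0.5], 0.08, size=(100, 2))
colors = np.vstack([stars, quasars])

# Unsupervised separation of the two populations with a 2-component
# Gaussian mixture, a common pattern in astronomical classification.
gmm = GaussianMixture(n_components=2, random_state=0).fit(colors)
labels = gmm.predict(colors)
print(np.bincount(labels))  # rough star/quasar split, e.g. [500 100]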
The conference goal was to provide a forum for the exchange of scientific information on current and emerging approaches to assessing characterization, monitoring, source control, treatment and/or remediation on mining-influenced waters. The conference was aimed at mining remedi...
Ontology-based meta-analysis of global collections of high-throughput public data.
Kupershmidt, Ilya; Su, Qiaojuan Jane; Grewal, Anoop; Sundaresh, Suman; Halperin, Inbal; Flynn, James; Shekar, Mamatha; Wang, Helen; Park, Jenny; Cui, Wenwu; Wall, Gregory D; Wisotzkey, Robert; Alag, Satnam; Akhtari, Saeid; Ronaghi, Mostafa
2010-09-29
The investigation of the interconnections between the molecular and genetic events that govern biological systems is essential if we are to understand the development of disease and design effective novel treatments. Microarray and next-generation sequencing technologies have the potential to provide this information. However, taking full advantage of these approaches requires that biological connections be made across large quantities of highly heterogeneous genomic datasets. Leveraging the increasingly huge quantities of genomic data in the public domain is fast becoming one of the key challenges in the research community today. We have developed a novel data mining framework that enables researchers to use this growing collection of public high-throughput data to investigate any set of genes or proteins. The connectivity between molecular states across thousands of heterogeneous datasets from microarrays and other genomic platforms is determined through a combination of rank-based enrichment statistics, meta-analyses, and biomedical ontologies. We address data quality concerns through dataset replication and meta-analysis, and ensure that the majority of the findings are derived using multiple lines of evidence. As an example of our strategy and the utility of this framework, we apply our data mining approach to explore the biology of brown fat within the context of the thousands of publicly available gene expression datasets. Our work presents a practical strategy for organizing, mining, and correlating global collections of large-scale genomic data to explore normal and disease biology. Using a hypothesis-free approach, we demonstrate how a data-driven analysis across very large collections of genomic data can reveal novel discoveries and evidence to support existing hypotheses.
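A minimal sketch of one ingredient, rank-based enrichment of a gene set within a single ranked dataset, using a Mann-Whitney test; the gene names, ranking, and gene set are illustrative assumptions, not the framework's actual statistics.

import numpy as np
from scipy.stats import mannwhitneyu

def enrichment_p(ranked_genes, gene_set):
    """Rank-based enrichment of a gene set within one ranked dataset:
    are the set's genes ranked higher (closer to the top) than the rest?"""
    ranks = {g: i for i, g in enumerate(ranked_genes)}
    in_set = [ranks[g] for g in gene_set if g in ranks]
    background = [r for g, r in ranks.items() if g not in gene_set]
    # 'less' because a smaller rank index means stronger up-regulation.
    return mannwhitneyu(in_set, background, alternative="less").pvalue

# Toy dataset ranked by differential expression; brown-fat markers near
# the top would suggest enrichment (illustrative gene names only).
ranked = ["UCP1", "PPARGC1A", "CIDEA", "GAPDH", "ACTB", "ALB", "INS"]
print(enrichment_p(ranked, {"UCP1", "PPARGC1A", "CIDEA"}))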
Younger, Paul L; Coulton, Richard H; Froggatt, Eric C
2005-02-01
The use of risk-based decision-making in environmental management is often assumed to rely primarily on the availability of robust scientific data and insights, while in practice socio-economic criteria are often of considerable importance. However, the relative contributions to decision-making made by scientific and socio-economic inputs are rarely assessed, and even less commonly reported. Such an assessment has been made for a major remediation project in southwest England, in which some 300 l/s of highly acidic, metalliferous mine waters are now being treated using oxidation and chemical neutralisation. In the process of reaching the decision to commission the treatment plant, a wide range of scientific studies were undertaken, including: biological impact assessments, hydrogeological investigations of the effect of pumping on the flooded mine system, and hydrological and geochemical characterisation, together with integrated catchment modelling, of pollutant sources and pathways. These investigations revealed that, despite the spectacular nature of the original mine water outburst in 1992, the ecology of the Fal estuary remains remarkably robust. No scientific evidence emerged of any grounds for concern over the estuarine ecology, even if mine water were left to flow untreated. However, a rare ecological resource known as "maerl" (a form of calcified seaweed) is harvested annually in the estuary, providing significant revenue to the local economy and underpinning the 'clean' image of local sea water. Social and environmental benefit surveys revealed strong public perceptions that any visible discoloration in the estuary must indicate a diminution in quality of the maerl, to the detriment of both the public image and economy of the area. This factor proved sufficient to justify the continued pump-and-treat operations at the mine site. Although the decisive factor in the end was socio-economic in nature, robust assessment of this factor could not have been made without robust scientific evidence. It is concluded that investment in investigating and contributing to the formation of public perceptions is just as important as investing in scientific investigations per se.
NASA Astrophysics Data System (ADS)
Boetius, Antje; Haeckel, Matthias
2018-01-01
As human use of rare metals has diversified and risen with global development, metal ore deposits from the deep ocean floor are increasingly seen as an attractive future resource. Japan recently completed the first successful test for zinc extraction from the deep seabed, and the number of seafloor exploration licenses filed at the International Seabed Authority (ISA) has tripled in the past 5 years. Seafloor-mining equipment is being tested, and industrial-scale production in national waters could start in a few years. We call for integrated scientific studies of global metal resources, the fluxes and fates of metal uses, and the ecological footprints of mining on land and in the sea, to critically assess the risks of deep-sea mining and the chances for alternative technologies. Given the increasing scientific evidence for long-lasting impacts of mining on the abyssal environment, precautionary regulations for commercial deep-sea mining are essential to protect marine ecosystems and their biodiversity.
2014-01-01
Background: Advances in genomic technologies have enabled the accumulation of vast amounts of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression dataset, which suffers from spurious coexpression. Results: We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on human gene expression datasets show that the reported modules are functionally homogeneous, as evidenced by their enrichment with biological process GO terms and KEGG pathways. PMID:25221624
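A small sketch of the hybrid graph construction, assuming precomputed topological and co-appearance similarities; the link names, values, and the weight alpha are illustrative, and the component search is a simple stand-in for the paper's clustering step.

import networkx as nx

# Nodes are coexpression links (gene pairs); edge weight blends a
# topological similarity with a co-appearance similarity, as in the
# hybrid graph described above (alpha and the toy values are invented).
alpha = 0.6
links = [("g1-g2", "g1-g3"), ("g1-g2", "g4-g5"), ("g1-g3", "g4-g5")]
topo = {links[0]: 0.9, links[1]: 0.1, links[2]: 0.2}
coapp = {links[0]: 0.8, links[1]: 0.3, links[2]: 0.1}

G = nx.Graph()
for pair in links:
    w = alpha * topo[pair] + (1 - alpha) * coapp[pair]
    G.add_edge(*pair, weight=w)

# Clustering this weighted graph yields recurrent coexpression modules;
# a simple stand-in is thresholding weak edges before component search.
G.remove_edges_from([(u, v) for u, v, d in G.edges(data=True)
                     if d["weight"] < 0.5])
print(list(nx.connected_components(G)))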
NASA Astrophysics Data System (ADS)
Shahiri, Amirah Mohamed; Husain, Wahidah; Rashid, Nur'Aini Abd
2017-10-01
The huge amounts of data in educational datasets can make it difficult to produce quality data. Recently, data mining approaches have been increasingly used by educational data mining researchers to analyze patterns in the data. However, many research studies have concentrated on selecting suitable learning algorithms instead of performing a feature selection process. As a result, the resulting models suffer from high computational complexity and long classification times. The main objective of this research is to provide an overview of the feature selection techniques that have been used to identify the most significant features. This research then proposes a framework to improve the quality of students' datasets. The proposed framework uses filter- and wrapper-based techniques to support the prediction process in future studies.
Assembling Large, Multi-Sensor Climate Datasets Using the SciFlo Grid Workflow System
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Manipon, G.; Xing, Z.; Fetzer, E.
2008-12-01
NASA's Earth Observing System (EOS) is the world's most ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the A-Train platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the cloud scenes from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time matchups between instruments swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, and assemble merged datasets for further scientific and statistical analysis. To meet these large-scale challenges, we are utilizing a Grid computing and dataflow framework, named SciFlo, in which we are deploying a set of versatile and reusable operators for data query, access, subsetting, co-registration, mining, fusion, and advanced statistical analysis. SciFlo is a semantically-enabled ("smart") Grid Workflow system that ties together a peer-to-peer network of computers into an efficient engine for distributed computation. The SciFlo workflow engine enables scientists to do multi-instrument Earth Science by assembling remotely-invokable Web Services (SOAP or http GET URLs), native executables, command-line scripts, and Python codes into a distributed computing flow. A scientist visually authors the graph of operation in the VizFlow GUI, or uses a text editor to modify the simple XML workflow documents. The SciFlo client & server engines optimize the execution of such distributed workflows and allow the user to transparently find and use datasets and operators without worrying about the actual location of the Grid resources. The engine transparently moves data to the operators, and moves operators to the data (on the dozen trusted SciFlo nodes). SciFlo also deploys a variety of Data Grid services to: query datasets in space and time, locate & retrieve on-line data granules, provide on-the-fly variable and spatial subsetting, and perform pairwise instrument matchups for A-Train datasets. These services are combined into efficient workflows to assemble the desired large-scale, merged climate datasets. SciFlo is currently being applied in several large climate studies: comparisons of aerosol optical depth between MODIS, MISR, AERONET ground network, and U. Michigan's IMPACT aerosol transport model; characterization of long-term biases in microwave and infrared instruments (AIRS, MLS) by comparisons to GPS temperature retrievals accurate to 0.1 degrees Kelvin; and construction of a decade-long, multi-sensor water vapor climatology stratified by classified cloud scene by bringing together datasets from AIRS/AMSU, AMSR-E, MLS, MODIS, and CloudSat (NASA MEASUREs grant, Fetzer PI). The presentation will discuss the SciFlo technologies, their application in these distributed workflows, and the many challenges encountered in assembling and analyzing these massive datasets.
Assembling Large, Multi-Sensor Climate Datasets Using the SciFlo Grid Workflow System
NASA Astrophysics Data System (ADS)
Wilson, B.; Manipon, G.; Xing, Z.; Fetzer, E.
2009-04-01
NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time "matchups" between instruments swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, assemble merged datasets, and compute fused products for further scientific and statistical analysis. To meet these large-scale challenges, we are utilizing a Grid computing and dataflow framework, named SciFlo, in which we are deploying a set of versatile and reusable operators for data query, access, subsetting, co-registration, mining, fusion, and advanced statistical analysis. SciFlo is a semantically-enabled ("smart") Grid Workflow system that ties together a peer-to-peer network of computers into an efficient engine for distributed computation. The SciFlo workflow engine enables scientists to do multi-instrument Earth Science by assembling remotely-invokable Web Services (SOAP or http GET URLs), native executables, command-line scripts, and Python codes into a distributed computing flow. A scientist visually authors the graph of operation in the VizFlow GUI, or uses a text editor to modify the simple XML workflow documents. The SciFlo client & server engines optimize the execution of such distributed workflows and allow the user to transparently find and use datasets and operators without worrying about the actual location of the Grid resources. The engine transparently moves data to the operators, and moves operators to the data (on the dozen trusted SciFlo nodes). SciFlo also deploys a variety of Data Grid services to: query datasets in space and time, locate & retrieve on-line data granules, provide on-the-fly variable and spatial subsetting, perform pairwise instrument matchups for A-Train datasets, and compute fused products. These services are combined into efficient workflows to assemble the desired large-scale, merged climate datasets. SciFlo is currently being applied in several large climate studies: comparisons of aerosol optical depth between MODIS, MISR, AERONET ground network, and U. Michigan's IMPACT aerosol transport model; characterization of long-term biases in microwave and infrared instruments (AIRS, MLS) by comparisons to GPS temperature retrievals accurate to 0.1 degrees Kelvin; and construction of a decade-long, multi-sensor water vapor climatology stratified by classified cloud scene by bringing together datasets from AIRS/AMSU, AMSR-E, MLS, MODIS, and CloudSat (NASA MEASUREs grant, Fetzer PI). 
The presentation will discuss the SciFlo technologies, their application in these distributed workflows, and the many challenges encountered in assembling and analyzing these massive datasets.
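The service-chaining pattern that workflow engines like SciFlo automate can be sketched in a few lines; the "services" below are stand-in local functions with invented names and response fields, so the chaining logic itself is runnable without real endpoints.

# Hedged sketch of the dataflow pattern such engines automate: each
# stage is a remotely-invokable service; here the services are local
# stand-ins so that only the chaining idea is demonstrated.
def query_service(dataset, start, end):
    # Stand-in for a granule-query web service (SOAP or HTTP GET).
    return {"ids": [f"{dataset}.{start}.g{i}" for i in range(3)]}

def matchup_service(granule_ids, other):
    # Stand-in for a pairwise instrument-matchup service.
    return {"pairs": [(g, f"{other}.match.{i}")
                      for i, g in enumerate(granule_ids)]}

# A two-stage workflow: the engine would move data to operators (or
# operators to data) transparently; here we simply pipe outputs along.
granules = query_service("AIRS", "2008-01-01", "2008-01-02")
matched = matchup_service(granules["ids"], "MODIS")
print(len(matched["pairs"]), "swath matchups found")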
Mineral resources management based on GIS and RS: a case study of the Laozhaiwan Gold Mine
NASA Astrophysics Data System (ADS)
Wu, Hao; Hua, Xianghong; Wang, Xinzhou; Ma, Liguang; Yuan, Yanbin
2005-10-01
With the development of digital information technology in the mining industry, the concepts of DM (Digital Mining) and MGIS (Mining Geographical Information System) have become a research focus, but remain imperfect. How to effectively manage geological, surveying, and mineral-product-grade datasets is a key issue for sustainable development and standardized management in the mining industry. Building on existing combined GIS and remote sensing technology, we propose a model named DMMIS (Digital Mining Management Information System), which is composed of a database layer, an ActiveX layer, and a user interface layer. The system is used in the Laozhaiwan Gold Mine, Yunnan Province of China, demonstrating the feasibility of the research and development work presented in this paper. Finally, some conclusions and constructive suggestions for future research are given.
Microarray Data Mining for Potential Selenium Targets in Chemoprevention of Prostate Cancer
ZHANG, HAITAO; DONG, YAN; ZHAO, HONGJUAN; BROOKS, JAMES D.; HAWTHORN, LESLEYANN; NOWAK, NORMA; MARSHALL, JAMES R.; GAO, ALLEN C.; IP, CLEMENT
2008-01-01
Background: A previous clinical trial showed that selenium supplementation significantly reduced the incidence of prostate cancer. We report here a bioinformatics approach to gain new insights into selenium molecular targets that might be relevant to prostate cancer chemoprevention. Materials and Methods: We first performed a data mining analysis to identify genes which are consistently dysregulated in prostate cancer, using published datasets from gene expression profiling of clinical prostate specimens. We then devised a method to systematically analyze three selenium microarray datasets from LNCaP human prostate cancer cells, and to match the analysis to the cohort of genes implicated in prostate carcinogenesis. Moreover, we compared the selenium datasets with two datasets obtained from expression profiling of androgen-stimulated LNCaP cells. Results: We found that selenium reverses the expression of genes implicated in prostate carcinogenesis. In addition, we found that selenium could counteract the effect of androgen on the expression of a subset of androgen-regulated genes. Conclusions: The above information provides us with a wealth of new clues to investigate the mechanism of selenium chemoprevention of prostate cancer. Furthermore, these selenium target genes could also serve as biomarkers in future clinical trials to gauge the efficacy of selenium intervention. PMID:18548127
Large-scale machine learning and evaluation platform for real-time traffic surveillance
NASA Astrophysics Data System (ADS)
Eichel, Justin A.; Mishra, Akshaya; Miller, Nicholas; Jankovic, Nicholas; Thomas, Mohan A.; Abbott, Tyler; Swanson, Douglas; Keller, Joel
2016-09-01
In traffic engineering, vehicle detectors are trained on limited datasets, resulting in poor accuracy when deployed in real-world surveillance applications. Annotating large-scale, high-quality datasets is challenging. Typically, these datasets have limited diversity; they do not reflect the real-world operating environment. There is a need for a large-scale, cloud-based positive and negative mining process and a large-scale learning and evaluation system for the application of automatic traffic measurements and classification. The proposed positive and negative mining process addresses the quality of crowd-sourced ground truth data through machine learning review and human feedback mechanisms. The proposed learning and evaluation system uses a distributed cloud computing framework to handle the data-scaling issues associated with large numbers of samples and a high-dimensional feature space. The system is trained using AdaBoost on 1,000,000 Haar-like features extracted from 70,000 annotated video frames. The trained real-time vehicle detector achieves an accuracy of at least 95% for half, and about 78% for 19/20, of the time when tested on ~7,500,000 video frames. At the end of 2016, the dataset is expected to have over 1 billion annotated video frames.
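A reduced sketch of the training step, substituting a small synthetic feature matrix for the real Haar-like features; the dimensions, labels, and estimator count are invented, not the deployed system's configuration.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Haar-like features extracted from annotated
# frames (the real system uses ~1e6 features over 70,000 frames).
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 64))               # 64 Haar-like responses
w = rng.normal(size=64)
y = (X @ w + rng.normal(size=5000)) > 0       # vehicle / non-vehicle

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                           random_state=0)
clf = AdaBoostClassifier(n_estimators=100).fit(X_tr, y_tr)
print("held-out accuracy: %.2f" % clf.score(X_te, y_te))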
The Stream-Catchment (StreamCat) and Lake-Catchment ...
Background/Question/Methods: Lake and stream conditions respond to both natural and human-related landscape features. Characterizing these features within the contributing areas (i.e., delineated watersheds) of streams and lakes could improve our understanding of how biological conditions vary spatially and improve the use, management, and restoration of these aquatic resources. However, the specialized geospatial techniques required to define and characterize stream and lake watersheds have limited their widespread use in both scientific and management efforts at large spatial scales. We developed the StreamCat and LakeCat Datasets to model, predict, and map the probable biological conditions of streams and lakes across the conterminous US (CONUS). Both StreamCat and LakeCat contain watershed-level characterizations of several hundred natural (e.g., soils, geology, climate, and land cover) and anthropogenic (e.g., urbanization, agriculture, mining, and forest management) landscape features for ca. 2.6 million stream segments and 376,000 lakes across the CONUS, respectively. These datasets can be paired with field samples to provide independent variables for modeling and other analyses. We paired 1,380 stream and 1,073 lake samples from the USEPA's National Aquatic Resource Surveys with StreamCat and LakeCat and used random forest (RF) to model and then map an invertebrate condition index and chlorophyll a concentration, respectively. Results/Conclusions: The invertebrate
Design of FastQuery: How to Generalize Indexing and Querying System for Scientific Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Jerry; Wu, Kesheng
2011-04-18
Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies such as FastBit are critical for facilitating interactive exploration of large datasets. These technologies rely on adding auxiliary information to existing datasets to accelerate query processing. To use these indices, we need to match the relational data model used by the indexing systems with the array data model used by most scientific data, and to provide an efficient input and output layer for reading and writing the indices. In this work, we present a flexible design that can be easily applied to most scientific data formats. We demonstrate this flexibility by applying it to two of the most commonly used scientific data formats, HDF5 and NetCDF. We present two case studies using simulation data from the particle accelerator and climate simulation communities. To demonstrate the effectiveness of the new design, we also present a detailed performance study using both synthetic and real scientific workloads.
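A minimal sketch of the kind of range query FastQuery accelerates, expressed here as an unindexed boolean selection over an HDF5 variable; the file and variable names are made up, and the mask is the plain equivalent of what a FastBit bitmap index would answer without scanning the full array.

import h5py
import numpy as np

# Write a small HDF5 array, then run a range query over it. Index-based
# systems pair queries like this with bitmap indices so that only
# matching regions need to be read; the boolean mask below is the
# unindexed equivalent of the same selection.
with h5py.File("particles.h5", "w") as f:
    f["energy"] = np.random.default_rng(0).exponential(1.0, 1_000_000)

with h5py.File("particles.h5", "r") as f:
    energy = f["energy"][:]
    hits = np.nonzero(energy > 5.0)[0]   # "energy > 5.0" range query
    print(len(hits), "records satisfy the query")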
ERIC Educational Resources Information Center
Faulkner, Robert; Davidson, Jane W.; McPherson, Gary E.
2010-01-01
The use of data mining for the analysis of data collected in natural settings is increasingly recognized as a legitimate mode of enquiry. This rule-inductive paradigm is an effective means of discovering relationships within large datasets--especially in research that has limited experimental design--and for the subsequent formulation of…
Effective and efficient analysis of spatio-temporal data
NASA Astrophysics Data System (ADS)
Zhang, Zhongnan
Spatio-temporal data mining, i.e., mining knowledge from large amounts of spatio-temporal data, is a highly demanding field, because huge amounts of spatio-temporal data have been collected in applications ranging from remote sensing to geographical information systems (GIS), computer cartography, environmental assessment and planning. The collected data far exceed humans' ability to analyze them, which makes it crucial to develop analysis tools. Recent studies have extended the scope of data mining from relational and transactional datasets to spatial and temporal datasets. Among the various forms of spatio-temporal data, remote sensing images play an important role, owing to the widespread deployment of satellites. In this dissertation, we propose two approaches to analyzing remote sensing data. The first applies association rule mining to image processing. Each image is divided into a number of image blocks, and a spatial relationship is built among these blocks during the dividing process. Because each image is shot in a time series, a large number of images thereby becomes a spatio-temporal dataset. The second approach discovers co-occurrence patterns from these images; the generated patterns represent subsets of spatial features that are located together in space and time. A weather analysis is composed of individual analyses of several meteorological variables, including temperature, pressure, dew point, wind, clouds, and visibility. Local-scale models provide detailed analysis and forecasts of meteorological phenomena ranging from a few kilometers to about 100 kilometers in size. When some of these meteorological variables exhibit particular change tendencies, severe weather will follow in most cases. Using association rule discovery, we found that changes in certain meteorological variables are closely related to severe weather that follows shortly afterwards. This dissertation is composed of three parts: an introduction; some basic knowledge and related work; and my three contributions to the development of approaches for spatio-temporal data mining: the DYSTAL algorithm, the STARSI algorithm, and the COSTCOP+ algorithm.
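A toy sketch of the image-blocking step that turns a series of images into a spatio-temporal dataset of block features; the block size and the mean-intensity feature are illustrative choices, not the dissertation's algorithms.

import numpy as np

def image_to_blocks(img, bs):
    """Split an image into bs x bs blocks and return {(row, col): block};
    the grid coordinates encode spatial relationships between blocks."""
    h, w = img.shape
    return {(r // bs, c // bs): img[r:r + bs, c:c + bs]
            for r in range(0, h, bs) for c in range(0, w, bs)}

# Toy time series of images: block features (here, mean intensity)
# indexed by (time, row, col) form the spatio-temporal dataset from
# which association rules or co-occurrence patterns can be mined.
frames = [np.random.default_rng(t).random((64, 64)) for t in range(3)]
records = {(t,) + pos: blk.mean()
           for t, img in enumerate(frames)
           for pos, blk in image_to_blocks(img, 16).items()}
print(len(records), "block records")  # 3 frames x 16 blocks = 48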
Code of Federal Regulations, 2014 CFR
2014-07-01
... surface mine, including any delivery, office, or scientific worker or occasional, short-term maintenance... construction, or who is regularly exposed to mine hazards, or who is a maintenance or service worker employed by the operator or a maintenance or service worker contracted by the operator to work at the mine for...
Code of Federal Regulations, 2012 CFR
2012-07-01
... surface mine, including any delivery, office, or scientific worker or occasional, short-term maintenance... construction, or who is regularly exposed to mine hazards, or who is a maintenance or service worker employed by the operator or a maintenance or service worker contracted by the operator to work at the mine for...
Code of Federal Regulations, 2013 CFR
2013-07-01
... surface mine, including any delivery, office, or scientific worker or occasional, short-term maintenance... construction, or who is regularly exposed to mine hazards, or who is a maintenance or service worker employed by the operator or a maintenance or service worker contracted by the operator to work at the mine for...
Code of Federal Regulations, 2011 CFR
2011-07-01
... surface mine, including any delivery, office, or scientific worker or occasional, short-term maintenance... construction, or who is regularly exposed to mine hazards, or who is a maintenance or service worker employed by the operator or a maintenance or service worker contracted by the operator to work at the mine for...
Long-Range Trends in the Development of Higher Education in Mining
ERIC Educational Resources Information Center
Pzhevskii, V. V.
1978-01-01
Reviews the relationship between scientific and technological progress, needs of the Soviet economy, and capacities of higher education institutions in terms of the future of the mining industry. Mining specialists will have to prepare for technical progress in the extracting industries, growth in production, complex mechanization, and automation…
An Investigation of the Intellectual Structure of Opinion Mining Research
ERIC Educational Resources Information Center
Zhu, Yongjun; Kim, Meen Chul; Chen, Chaomei
2017-01-01
Introduction: Opinion mining has been receiving increasing attention from a broad range of scientific communities since the early 2000s. The present study aims to systematically investigate the intellectual structure of opinion mining research. Method: Using topic search, citation expansion, and patent search, we collected 5,596 bibliographic records…
Online Tools for Bioinformatics Analyses in Nutrition Sciences
Malkaram, Sridhar A.; Hassan, Yousef I.; Zempleni, Janos
2012-01-01
Recent advances in “omics” research have resulted in the creation of large datasets that were generated by consortiums and centers, small datasets that were generated by individual investigators, and bioinformatics tools for mining these datasets. It is important for nutrition laboratories to take full advantage of the analysis tools to interrogate datasets for information relevant to genomics, epigenomics, transcriptomics, proteomics, and metabolomics. This review provides guidance regarding bioinformatics resources that are currently available in the public domain, with the intent to provide a starting point for investigators who want to take advantage of the opportunities provided by the bioinformatics field. PMID:22983844
NASA Astrophysics Data System (ADS)
Currell, Matthew J.; Werner, Adrian D.; McGrath, Chris; Webb, John A.; Berkman, Michael
2017-05-01
Understanding and managing impacts from mining on groundwater-dependent ecosystems (GDEs) and other groundwater users requires development of defensible science supported by adequate field data. This usually leads to the creation of predictive models and analysis of the likely impacts of mining and their accompanying uncertainties. The identification, monitoring and management of impacts on GDEs are often a key component of mine approvals, which need to consider and attempt to minimise the risks that negative impacts may arise. Here we examine a case study where approval for a large mining project in Australia (Carmichael Coal Mine) was challenged in court on the basis that it may result in more extensive impacts on a GDE (Doongmabulla Springs) of high ecological and cultural significance than predicted by the proponent. We show that throughout the environmental assessment and approval process, significant data gaps and scientific uncertainties remained unresolved. Evidence shows that the assumed conceptual hydrogeological model for the springs could be incorrect, and that at least one alternative conceptualisation (that the springs are dependent on a deep fault) is consistent with the available field data. Assumptions made about changes to spring flow as a consequence of mine-induced drawdown also appear problematic, with significant implications for the spring-fed wetlands. Despite the large scale of the project, it appears that critical scientific data required to resolve uncertainties and construct robust models of the springs' relationship to the groundwater system were lacking at the time of approval, contributing to uncertainty and conflict. For this reason, we recommend changes to the approval process that would require a higher standard of scientific information to be collected and reviewed, particularly in relation to key environmental assets during the environmental impact assessment process in future projects.
Research on Occupational Safety, Health Management and Risk Control Technology in Coal Mines
Zhou, Lu-jie; Cao, Qing-gui; Yu, Kai; Wang, Lin-lin; Wang, Hai-bin
2018-01-01
This paper studies the occupational safety and health management methods as well as risk control technology associated with the coal mining industry, including daily management of occupational safety and health, identification and assessment of risks, early warning and dynamic monitoring of risks, etc.; also, a B/S mode software (Geting Coal Mine, Jining, Shandong, China), i.e., Coal Mine Occupational Safety and Health Management and Risk Control System, is developed to attain the aforementioned objectives, namely promoting the coal mine occupational safety and health management based on early warning and dynamic monitoring of risks. Furthermore, the practical effectiveness and the associated pattern for applying this software package to coal mining is analyzed. The study indicates that the presently developed coal mine occupational safety and health management and risk control technology and the associated software can support the occupational safety and health management efforts in coal mines in a standardized and effective manner. It can also control the accident risks scientifically and effectively; its effective implementation can further improve the coal mine occupational safety and health management mechanism, and further enhance the risk management approaches. Besides, its implementation indicates that the occupational safety and health management and risk control technology has been established based on a benign cycle involving dynamic feedback and scientific development, which can provide a reliable assurance to the safe operation of coal mines. PMID:29701715
A new dataset validation system for the Planetary Science Archive
NASA Astrophysics Data System (ADS)
Manaud, N.; Zender, J.; Heather, D.; Martinez, S.
2007-08-01
The Planetary Science Archive is the official archive for the Mars Express mission; it received its first data at the end of 2004. These data are delivered by the PI teams to the PSA team as datasets formatted in conformance with the Planetary Data System (PDS) standard. The PI teams are responsible for analyzing and calibrating the instrument data, for the production of reduced and calibrated data, and for the scientific validation of these data. ESA is responsible for long-term data archiving and distribution to the scientific community, and must ensure in this regard that all archived products meet quality standards. To do so, an archive peer review is used to control the quality of the Mars Express science data archiving process; however, a full validation of the archive's content has been missing. An independent review board recently recommended that the completeness of the archive, as well as the consistency of the delivered data, should be validated following well-defined procedures. A new validation software tool is being developed to complete the overall data quality control system. This new tool aims to improve the quality of data and services provided to the scientific community through the PSA, and shall allow anomalies in datasets to be tracked and their completeness to be controlled. It shall ensure that PSA end-users (1) can rely on the results of their queries, (2) will get data products that are suitable for scientific analysis, and (3) can find all science data acquired during a mission. We define dataset validation as the verification and assessment process that checks dataset content against pre-defined top-level criteria, which represent the general characteristics of good-quality datasets. The dataset content that is checked includes the data and all types of information that are essential in the process of deriving scientific results, as well as those interfacing with the PSA database. The validation software tool is a multi-mission tool designed to give the user the flexibility to define and implement various types of validation criteria, to validate datasets iteratively and incrementally, and to generate validation reports.
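A minimal sketch of a top-level completeness criterion of the kind such a tool might check; the required PDS3-style items listed, and the example dataset path, are assumptions for illustration, not the tool's actual criteria.

from pathlib import Path

# Illustrative top-level completeness check: verify that the expected
# volume components exist before deeper consistency checks are run.
REQUIRED = ["VOLDESC.CAT", "INDEX/INDEX.TAB", "DATA", "DOCUMENT"]

def validate_dataset(root):
    root = Path(root)
    report = {item: (root / item).exists() for item in REQUIRED}
    missing = [k for k, ok in report.items() if not ok]
    return {"complete": not missing, "missing": missing}

print(validate_dataset("/archive/MEX-M-OMEGA-2-EDR-FLIGHT-V1.0"))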
Solutions for Mining Distributed Scientific Data
NASA Astrophysics Data System (ADS)
Lynnes, C.; Pham, L.; Graves, S.; Ramachandran, R.; Maskey, M.; Keiser, K.
2007-12-01
Researchers at the University of Alabama in Huntsville (UAH) and the Goddard Earth Sciences Data and Information Services Center (GES DISC) are working on approaches and methodologies facilitating the analysis of large amounts of distributed scientific data. Despite the existence of full-featured analysis tools, such as the Algorithm Development and Mining (ADaM) toolkit from UAH, and data repositories, such as the GES DISC, that provide online access to large amounts of data, there remain obstacles to getting the analysis tools and the data together in a workable environment. Does one bring the data to the tools or deploy the tools close to the data? The large size of many current Earth science datasets incurs significant overhead in network transfer for analysis workflows, even with the advanced networking capabilities available between many educational and government facilities. The UAH and GES DISC teams are developing a capability to define analysis workflows using distributed services and online data resources. We are developing two solutions for this problem that address different analysis scenarios. The first is a Data Center Deployment of the analysis services for large data selections, orchestrated by a remotely defined analysis workflow. The second is a Data Mining Center approach that provides a cohesive analysis solution for smaller subsets of data. The two approaches can be complementary and thus give researchers the flexibility to exploit the best solution for their data requirements. The Data Center Deployment of the analysis services has been implemented by deploying ADaM web services at the GES DISC so they can access the data directly, without the need for network transfers. Using the Mining Workflow Composer, a user can define an analysis workflow that is then submitted through a Web Services interface to the GES DISC for execution by a processing engine. The workflow definition is composed, maintained and executed at a distributed location, but most of the actual services comprising the workflow are local to the GES DISC data repository. Additional refinements will ultimately provide a package that is easily implemented and configured at additional data centers for analysis of additional science datasets. Enhancements to the ADaM toolkit allow the staging of distributed data wherever the services are deployed, to support a Data Mining Center that can provide additional computational resources, large storage for output, easier addition and updating of available services, and access to data from multiple repositories. The Data Mining Center case gives researchers more flexibility to quickly try different workflow configurations and refine the process, using smaller amounts of data that can be transferred from distributed online repositories. This environment is sufficient for some analyses, but can also be used as an initial sandbox to test and refine a solution before staging the execution at a Data Center Deployment. Detection of airborne dust over both water and land in MODIS imagery, using the mining services of both solutions, will be presented. Dust detection is just one example of the mining and analysis capabilities the proposed mining services solutions will provide to the science community. More information about the available services and the current status of this project is available at http://www.itsc.uah.edu/mws/
Data Mining Techniques Applied to Hydrogen Lactose Breath Test.
Rubio-Escudero, Cristina; Valverde-Fernández, Justo; Nepomuceno-Chamorro, Isabel; Pontes-Balanza, Beatriz; Hernández-Mendoza, Yoedusvany; Rodríguez-Herrera, Alfonso
2017-01-01
Our objective was to analyze a set of hydrogen breath test data using data mining tools and to identify new patterns of H2 production. We applied k-means clustering, as the data mining technique, to hydrogen breath test datasets from 2571 patients. Six different patterns were extracted upon analysis of the hydrogen breath test data, and we also show the relevance of each of the samples taken throughout the test. Analysis of the hydrogen breath test datasets using data mining techniques thus identified new patterns of hydrogen generation upon lactose absorption, demonstrating the potential of applying data mining techniques to clinical datasets. These results offer promising data for future research on the relation between hydrogen produced by the gut microbiota and clinical symptoms.
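A minimal sketch of the clustering step described above, on synthetic stand-in data (the array shapes, gamma-distributed values, and eight-sample curves are invented for illustration; only the choice of k-means and the six-pattern count come from the abstract):

```python
# Hedged sketch: clustering synthetic hydrogen breath-test curves with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Each row: H2 (ppm) sampled at fixed intervals during one patient's test
# (synthetic stand-in values, not study data).
curves = rng.gamma(shape=2.0, scale=10.0, size=(2571, 8))

model = KMeans(n_clusters=6, n_init=10, random_state=0).fit(curves)
print(np.bincount(model.labels_))   # patients assigned to each extracted pattern
```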
NASA Astrophysics Data System (ADS)
Qadi, A.; Cloutis, E.; Samson, C.; Whyte, L.; Ellery, A.; Bell, J. F.; Berard, G.; Boivin, A.; Haddad, E.; Lavoie, J.; Jamroz, W.; Kruzelecky, R.; Mack, A.; Mann, P.; Olsen, K.; Perrot, M.; Popa, D.; Rhind, T.; Sharma, R.; Stromberg, J.; Strong, K.; Tremblay, A.; Wilhelm, R.; Wing, B.; Wong, B.
2015-05-01
The Canadian Space Agency (CSA), through its Analogue Missions program, supported a microrover-based analogue mission designed to simulate a Mars rover mission geared toward identifying and characterizing methane emissions on Mars. The analogue mission included two progressively more complex deployments in open-pit asbestos mines, where methane can be generated from the weathering of olivine into serpentine: the Jeffrey mine deployment (June 2011) and the Norbestos mine deployment (June 2012). At the Jeffrey Mine, testing was conducted over 4 days using a modified off-the-shelf Pioneer rover and scientific instruments, including a Raman spectrometer, a Picarro methane detector, a hyperspectral point spectrometer, and an electromagnetic induction sounder, for testing rock and gas samples. At the Norbestos Mine, we used the research Kapvik microrover, which features enhanced autonomous navigation capabilities and a wider array of scientific instruments. This paper describes the rover operations in terms of planning, deployment, communication and equipment setup, rover path parameters, and instrument performance. Overall, the deployments suggest that a search strategy of “follow the methane” is not practical given the mechanisms of methane dispersion. Rather, identifying features related to methane sources based on image tone/color and texture from panoramic imagery is more fruitful.
Lohmann, Ingrid
2012-01-01
In multi-cellular organisms, the spatiotemporal activity of cis-regulatory DNA elements depends on their occupancy by different transcription factors (TFs). In recent years, genome-wide ChIP-on-chip, ChIP-seq and DamID assays have been extensively used to unravel the combinatorial interaction of TFs with cis-regulatory modules (CRMs) in the genome. Even though genome-wide binding profiles are increasingly becoming available for different TFs, single-TF binding profiles are in most cases not sufficient for dissecting complex regulatory networks. Thus, potent computational tools detecting statistically significant and biologically relevant TF-motif co-occurrences in genome-wide datasets are essential for analyzing context-dependent transcriptional regulation. We have developed COPS (Co-Occurrence Pattern Search), a new bioinformatics tool based on a combination of association rules and Markov chain models, which detects co-occurring TF binding sites (BSs) on genomic regions of interest. COPS scans DNA sequences for frequent motif patterns using a frequent-pattern-tree-based data mining approach, which makes the software efficient with respect to both data structure and implementation speed, in particular when mining large datasets. Since transcriptional gene regulation very often relies on the formation of regulatory protein complexes mediated by closely adjoining TF binding sites on CRMs, COPS additionally detects preferred short distances between co-occurring TF motifs. The performance of our software with respect to biological significance was evaluated using three published datasets containing genomic regions that are independently bound by several TFs involved in a defined biological process. In sum, COPS is a fast, efficient and user-friendly tool for mining statistically and biologically significant TFBS co-occurrences, and therefore allows the identification of TFs that combinatorially regulate gene expression. PMID:23272209
Jamal, Salma; Scaria, Vinod
2014-01-01
Background. Traditional Chinese medicine encompasses a well-established alternative system of medicine based on a broad range of herbal formulations and is practiced extensively in the region for the treatment of a wide variety of diseases. In recent years, several reports have described in-depth studies of the molecular ingredients of traditional Chinese medicines and their biological activities, including antibacterial activities. The availability of a well-curated dataset of molecular ingredients of traditional Chinese medicines, accurate in silico cheminformatics models for mining antitubercular agents, and computational filters to prioritize molecules prompted us to search for potential hits in these datasets. Results. We used a consensus approach to predict molecules with potential antitubercular activity from a large, publicly available dataset of molecular ingredients of traditional Chinese medicines. We further prioritized 160 molecules based on five computational filters (SMARTSfilter) so as to avoid potentially undesirable molecules. We then examined the molecules for permeability across the mycobacterial cell wall and for potential activity against non-replicating and drug-tolerant Mycobacteria. Additional in-depth literature surveys of the reported antitubercular activities of the molecular ingredients and their sources were used to support the prioritization. Conclusions. Our analysis suggests that datasets of molecular ingredients of traditional Chinese medicines offer a new opportunity to mine for potential biological activities. In this report, we suggest a proof-of-concept methodology for prioritizing molecules for further experimental assays using a variety of computational tools. We additionally suggest that a subset of the prioritized molecules could be evaluated against tuberculosis, given their effect against non-replicating tuberculosis as well as the hepato-protection offered by the sources of these ingredients. PMID:25081126
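The SMARTS-based filtering step described above can be illustrated with a short, hedged sketch using RDKit (the two SMARTS patterns and the candidate SMILES below are invented examples, not the study's actual five filters):

```python
# Hedged sketch of SMARTS-based substructure filtering with RDKit;
# patterns and molecules are illustrative only.
from rdkit import Chem

undesirable = [Chem.MolFromSmarts(s) for s in (
    "[N+](=O)[O-]",   # nitro group (example pattern, not from the study)
    "C(=O)Cl",        # acyl chloride (example pattern, not from the study)
)]

def passes_filters(smiles: str) -> bool:
    """Keep a molecule only if none of the undesirable substructures match."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return not any(mol.HasSubstructMatch(p) for p in undesirable)

candidates = ["CCO", "c1ccccc1[N+](=O)[O-]"]
print([s for s in candidates if passes_filters(s)])   # ['CCO']
```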
Connecting the Public to Scientific Research Data - Science On a Sphere®
NASA Astrophysics Data System (ADS)
Henderson, M. A.; Russell, E. L.; Science on a Sphere Datasets
2011-12-01
Science On a Sphere® (SOS) is a six-foot animated globe developed by the National Oceanic and Atmospheric Administration (NOAA) as a means to display global scientific research data in an intuitive, engaging format in public forums. With over 70 permanent installations of SOS around the world in science museums, visitor centers and universities, the audience that enjoys SOS yearly is substantial, wide-ranging, and diverse. Through partnerships with the National Aeronautics and Space Administration (NASA), the SOS Data Catalog (http://sos.noaa.gov/datasets/) has grown to a collection of over 350 datasets from NOAA, NASA, and many others. Using an external projection system, these datasets are displayed onto the sphere, creating a seamless global image. In a cross-site evaluation of Science On a Sphere®, 82% of participants said that seeing information displayed on a sphere changed their understanding of the information. This unique technology captivates viewers and exposes them to scientific research data in a way that is accessible, presentable, and understandable. The datasets that comprise the SOS Data Catalog are scientific research data that have been formatted for display on SOS. By formatting research data into visualizations that can be used on SOS, NOAA and NASA are able to turn research data into educational materials that are easily accessible to users. In many cases, visualizations do not need to be modified because SOS uses a common map projection. The SOS Data Catalog has become a "one-stop shop" for a broad range of global datasets from across NOAA and NASA, and as a result, the traffic on the site comes from more than just SOS users. While the target audience for the site is SOS users, many inquiries come from teachers, book editors, film producers and students interested in using the available datasets. The online SOS Data Catalog includes a written description of each dataset, rendered images of the data, animated movies of the data, links to more information, details on the data source and creator, and a link to an FTP server where each dataset can be downloaded. Many of the datasets are also displayed on the SOS YouTube channel and Facebook page. In addition, NASA has developed NASA Earth Observations (NEO), a collection of global satellite datasets. The NEO website allows users to layer multiple datasets and perform basic analysis. Through a new iPad application, the NASA Earth Observations datasets can be exported to SOS and analyzed on the sphere. This new capability greatly expands the number of datasets that can be shown on SOS and adds a new element of interactivity with the datasets.
Hahn, P; Dullweber, F; Unglaub, F; Spies, C K
2014-06-01
Searching for relevant publications is becoming more difficult with the increasing number of scientific articles. Text mining, as a specific form of computer-based data analysis, may be helpful in this context. Highlighting relations between authors and finding relevant publications concerning a specific subject using text analysis programs are illustrated graphically with two worked examples.
Federal Register 2010, 2011, 2012, 2013, 2014
2011-05-27
... scientific reports assessing the environmental and water quality effects of mountaintop coal mining on Appalachian streams. Both reports, prepared by EPA scientists in the Agency's Office of Research and... responsible for the review of surface coal mining operations under the Clean Water Act. The two reports...
Chapter 2: The forestry reclamation approach
Jim Burger; Don Graves; Patrick Angel; Vic Davis; Carl Zipper
2017-01-01
The Forestry Reclamation Approach (FRA) is a method for reclaiming coal-mined land to forest under the federal Surface Mining Control and Reclamation Act of 1977 (SMCRA). The FRA is based on knowledge gained from both scientific research and experience (Fig. 2-1). The FRA can achieve cost-effective regulatory compliance for mine operators while creating productive...
Convalescing Cluster Configuration Using a Superlative Framework
Sabitha, R.; Karthik, S.
2015-01-01
Competent data mining methods are vital for discovering knowledge from the databases built as a result of the enormous growth of data. Various techniques of data mining are applied to obtain knowledge from these databases. Data clustering is one such descriptive data mining technique, which guides the partitioning of data objects into disjoint segments. The K-means algorithm is a versatile algorithm among the various approaches used in data clustering, but the algorithm and its diverse adaptations suffer from certain performance problems. To overcome these issues, a superlative algorithm is proposed in this paper for data clustering. The specific features of the proposed algorithm are discretizing the dataset, thereby improving the accuracy of clustering, and adopting a binary search initialization method to generate the cluster centroids. The generated centroids are fed as input to the K-means approach, which iteratively segments the data objects into their respective clusters. The clustered results are measured for accuracy and validity. Experiments conducted on datasets from the UC Irvine Machine Learning Repository show that the accuracy and validity measures are higher than those of the other two approaches, namely simple K-means and the binary search method. Thus, the proposed approach demonstrates that the discretization process improves the efficacy of descriptive data mining tasks. PMID:26543895
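A hedged sketch of the two ingredients named above: discretization, and a search-based centroid initialization fed into k-means. The paper does not spell out its initialization, so the sort-and-split reading below (and the iris data, bin count, and helper name) is an illustrative assumption:

```python
# Hedged sketch: discretize features, seed k-means with centroids from
# ordered splits of the data (one plausible reading of the paper's
# "binary search initialization"; details are assumptions).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Discretize each feature into equal-width bins (10 is an arbitrary choice).
bins = 10
mins, maxs = X.min(axis=0), X.max(axis=0)
Xd = np.floor((X - mins) / (maxs - mins + 1e-9) * bins)

def ordered_split_centroids(data, k):
    """Seed centroids from the means of k chunks of the data sorted on
    its first feature (simplification for illustration)."""
    order = np.argsort(data[:, 0])
    chunks = np.array_split(data[order], k)
    return np.vstack([chunk.mean(axis=0) for chunk in chunks])

init = ordered_split_centroids(Xd, k=3)
labels = KMeans(n_clusters=3, init=init, n_init=1).fit_predict(Xd)
print(np.bincount(labels))   # objects per cluster
```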
Highly scalable and robust rule learner: performance evaluation and comparison.
Kurgan, Lukasz A; Cios, Krzysztof J; Dick, Scott
2006-02-01
Business intelligence and bioinformatics applications increasingly require the mining of datasets consisting of millions of data points, or the crafting of real-time enterprise-level decision support systems for large corporations and drug companies. In all cases, there needs to be an underlying data mining system, and this mining system must be highly scalable. To this end, we describe a new rule learner called DataSqueezer. The learner belongs to the family of inductive supervised rule extraction algorithms. DataSqueezer is a simple, greedy rule builder that generates a set of production rules from labeled input data. In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules generated by the algorithm are compact, comprehensible, and have accuracy comparable to rules generated by other state-of-the-art rule extraction algorithms. The main advantages of DataSqueezer are its very high efficiency and its resistance to missing data. DataSqueezer exhibits log-linear asymptotic complexity in the number of training examples, and it is faster than other state-of-the-art rule learners. The learner is also robust to large quantities of missing data, as verified by extensive experimental comparison with the other learners. DataSqueezer is thus well suited to modern data mining and business intelligence tasks, which commonly involve huge datasets with a large fraction of missing data.
Wide-Open: Accelerating public data release by automating detection of overdue datasets
Grechkin, Maxim; Poon, Hoifung; Howe, Bill
2017-01-01
Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819
SOMA: A Proposed Framework for Trend Mining in Large UK Diabetic Retinopathy Temporal Databases
NASA Astrophysics Data System (ADS)
Somaraki, Vassiliki; Harding, Simon; Broadbent, Deborah; Coenen, Frans
In this paper, we present SOMA, a new trend mining framework, and Aretaeus, its associated trend mining algorithm. The proposed framework is able to detect different kinds of trends within longitudinal datasets. The prototype trends are defined mathematically so that they can be mapped onto the temporal patterns, and trends are defined and generated in terms of the frequency of occurrence of pattern changes over time. To evaluate the proposed framework, the process was applied to a large collection of medical records forming part of the diabetic retinopathy screening programme at the Royal Liverpool University Hospital.
Survey of Analysis of Crime Detection Techniques Using Data Mining and Machine Learning
NASA Astrophysics Data System (ADS)
Prabakaran, S.; Mitra, Shilpa
2018-04-01
Data mining is the field comprising procedures for finding patterns in huge datasets; it includes strategies at the convergence of machine learning and database systems. It can be applied to various fields such as healthcare, market basket analysis, education, manufacturing engineering, and crime investigation. Among these, crime investigation is an interesting application in which crime characteristics are processed to help society live more safely. This paper surveys various data mining techniques used in this domain. The study may be helpful in designing new strategies for crime prediction and analysis.
Drilling informatics: data-driven challenges of scientific drilling
NASA Astrophysics Data System (ADS)
Yamada, Yasuhiro; Kyaw, Moe; Saito, Sanny
2017-04-01
The primary aim of scientific drilling is to precisely understand the dynamic nature of the Earth. This is why we investigate the subsurface materials (rock and fluids, including microbial communities) existing under particular environmental conditions. Doing so requires sample collection, the production of analytical data from the samples, and in-situ measurements in boreholes. Currently available data come from cores, cuttings, mud logging, geophysical logging, and exploration geophysics, but these datasets are difficult to integrate because of their differing kinds and scales. We are now producing more useful datasets to fill the gaps between the existing data, extracting more information from these datasets, and finally integrating the information. In particular, drilling parameters are very useful datasets for deriving geomechanical properties. We believe such an approach, 'drilling informatics', is the most appropriate way to obtain a comprehensive and dynamic picture of our scientific targets, such as seismogenic fault zones and the Moho discontinuity. This presentation introduces our initiative and the current achievements of drilling informatics.
On mining complex sequential data by means of FCA and pattern structures
NASA Astrophysics Data System (ADS)
Buzmakov, Aleksey; Egho, Elias; Jay, Nicolas; Kuznetsov, Sergei O.; Napoli, Amedeo; Raïssi, Chedy
2016-02-01
Nowadays, datasets are available in very complex and heterogeneous forms. Mining such data collections is essential to support many real-world applications, ranging from healthcare to marketing. In this work, we focus on the analysis of "complex" sequential data by means of interesting sequential patterns. We approach the problem using the elegant mathematical framework of formal concept analysis (FCA) and its extension based on "pattern structures". Pattern structures are used for mining complex data (such as sequences or graphs) and are based on a subsumption operation, which in our case is defined with respect to the partial order on sequences. We show how pattern structures, along with projections (i.e., a data reduction of sequential structures), are able to enumerate more meaningful patterns and increase the computational efficiency of the approach. Finally, we show the applicability of the presented method for discovering and analysing interesting patient patterns from a French healthcare dataset on cancer. The quantitative and qualitative results (with annotations and analysis from a physician) are reported for this use case, which is the main motivation for this work.
Automating an integrated spatial data-mining model for landfill site selection
NASA Astrophysics Data System (ADS)
Abujayyab, Sohaib K. M.; Ahamad, Mohd Sanusi S.; Yahya, Ahmad Shukri; Ahmad, Siti Zubaidah; Aziz, Hamidi Abdul
2017-10-01
An integrated programming environment represents a robust approach to building a valid model for landfill site selection. One of the main challenges in an integrated model is the complicated processing and modelling arising from the programming stages and their several limitations. An automation process helps avoid these limitations and improves the interoperability between integrated programming environments. This work targets the automation of a spatial data-mining model for landfill site selection by integrating a spatial programming environment (Python-ArcGIS) with a non-spatial environment (MATLAB). The model was constructed using neural networks and is divided into nine stages distributed between MATLAB and Python-ArcGIS. A case study was taken from the northern part of Peninsular Malaysia. Twenty-two criteria were selected as input data and used to build the training and testing datasets. The outcomes show a high accuracy of 98.2% on the testing dataset using 10-fold cross-validation. The automated spatial data mining model provides a solid platform for decision makers to perform landfill site selection and planning operations on a regional scale.
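The validation step reported above is easy to mirror in a short, hedged sketch (the synthetic data, network size, and labels below are stand-ins; only the 22 input criteria and the 10-fold cross-validation come from the abstract):

```python
# Hedged sketch: a neural-network classifier scored with 10-fold cross
# validation, on synthetic stand-ins for 22 site-selection criteria.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 22))               # 22 criteria per candidate site
y = (X[:, :3].sum(axis=1) > 0).astype(int)   # suitable / unsuitable (invented rule)

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross validation
print(f"10-fold accuracy: {scores.mean():.3f}")
```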
sbtools: A package connecting R to cloud-based data for collaborative online research
Winslow, Luke; Chamberlain, Scott; Appling, Alison P.; Read, Jordan S.
2016-01-01
The adoption of high-quality tools for collaboration and reproducible research, such as R and GitHub, is becoming more common in many research fields. While GitHub and other version management systems are excellent resources, they were originally designed to handle code and scale poorly to large text-based or binary datasets. A number of scientific data repositories are coming online and are often focused on dataset archival and publication. To handle collaborative workflows using large scientific datasets, there is an increasing need to connect cloud-based online data storage to R. In this article, we describe how the new R package sbtools enables direct access to the advanced online data functionality provided by ScienceBase, the U.S. Geological Survey’s online scientific data storage platform.
Text Mining in Organizational Research
Kobayashi, Vladimer B.; Mol, Stefan T.; Berkers, Hannah A.; Kismihók, Gábor; Den Hartog, Deanne N.
2017-01-01
Despite the ubiquity of textual data, so far few researchers have applied text mining to answer organizational research questions. Text mining, which essentially entails a quantitative approach to the analysis of (usually) voluminous textual data, helps accelerate knowledge discovery by radically increasing the amount of data that can be analyzed. This article aims to acquaint organizational researchers with the fundamental logic underpinning text mining, the analytical stages involved, and contemporary techniques that may be used to achieve different types of objectives. The specific analytical techniques reviewed are (a) dimensionality reduction, (b) distance and similarity computing, (c) clustering, (d) topic modeling, and (e) classification. We describe how text mining may extend contemporary organizational research by allowing the testing of existing or new research questions with data that are likely to be rich, contextualized, and ecologically valid. After an exploration of how evidence for the validity of text mining output may be generated, we conclude the article by illustrating the text mining process in a job analysis setting using a dataset composed of job vacancies. PMID:29881248
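Three of the reviewed stages — vectorization, dimensionality reduction, and clustering — chain together naturally; the hedged sketch below uses scikit-learn on four invented toy vacancies (the texts and cluster count are illustrative only):

```python
# Hedged sketch chaining three text-mining stages on toy job-vacancy text.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

vacancies = [
    "data scientist machine learning python statistics",
    "nurse patient care hospital clinical",
    "python developer software engineering backend",
    "clinical nurse ward care",
]

pipeline = make_pipeline(
    TfidfVectorizer(),             # text -> sparse term weights
    TruncatedSVD(n_components=2),  # dimensionality reduction
    KMeans(n_clusters=2, n_init=10, random_state=0),  # clustering
)
print(pipeline.fit_predict(vacancies))   # cluster label per vacancy
```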
ERIC Educational Resources Information Center
Konecny, Peter
2012-01-01
In the course of the 18th century, a new type of scientifically educated functional elite developed, trained to administer mines. The educational project that led to the formation of a corps of mining engineers was part of a programme of administrative and economic reforms that led to a new configuration of bonds between state, economy…
Example Building Damage Caused by Mining Exploitation in Disturbed Rock Mass
NASA Astrophysics Data System (ADS)
Florkowska, Lucyna
2013-06-01
Issues concerning the protection of buildings against the impact of underground coal mining pose significant scientific and engineering challenges. In Poland, where mining is a potent and prominent industry assuring domestic energy security, regions within reach of mining influences are plentiful; moreover, owing to their industrial character, they are also densely built-up areas. Because minerals have been extracted on an industrial scale in the majority of those areas for many years, the rock mass structure has been significantly disturbed. Hence, exploitation of successive layers of multi-seam deposits can cause considerable damage, both to the surface and to existing infrastructure networks. In the light of these facts, the means of mining and building prevention have to be improved on a regular basis, and they have to be underpinned by reliable analyses that holistically capture the comprehensive picture of the mining, geotechnical and constructional situation of structures. Scientific research conducted on the basis of observations and measurements of mining-induced strain in buildings serves to do just that. The examples presented in this paper of damage sustained by buildings equipped with protection against mining influences give an account of the impact that mining exploitation in a disturbed rock mass can have. The paper is based on analyses of mining damage to a church and a nursing home owned by the Evangelical Augsburg Parish in Bytom-Miechowice. The neighbouring buildings differ in construction date, structural design, building technology, geometry of the building body and the protection fitted against mining damage. Both buildings, however, have lately sustained significant deformation and damage caused by repeated mining exploitation. Selected damage is discussed here: the structures are characterised, and their current situation and mining history, which have influenced the character and magnitude of the damage, are outlined. The description is supplemented with photographic documentation.
Global Data Spatially Interrelate System for Scientific Big Data Spatial-Seamless Sharing
NASA Astrophysics Data System (ADS)
Yu, J.; Wu, L.; Yang, Y.; Lei, X.; He, W.
2014-04-01
A good data sharing system with spatially seamless services spares scientists the tedious and time-consuming work of spatial transformation, and hence encourages the use of scientific data and increases scientific innovation. Having been adopted as a framework for Earth datasets by the Group on Earth Observations (GEO), the Earth System Spatial Grid (ESSG) is a potential spatial reference for Earth datasets. Based on SDOG-ESSG, an implementation of ESSG, a data sharing system named the global data spatially interrelate system (GASE) was designed to make data sharing spatially seamless. The architecture of GASE is introduced, and the implementations of its two key components, the V-Pools and the interrelating engine, are presented together with a prototype. Each dataset is first resampled into SDOG-ESSG, divided into small blocks, and then mapped into the hierarchical system of the distributed file system in the V-Pools, which together serves the data at a uniform spatial reference and with high efficiency. In addition, datasets from different data centres are interrelated by the interrelating engine at the uniform spatial reference of SDOG-ESSG, which enables the system to share the open datasets on the internet in a spatially seamless manner.
Improving the discoverability, accessibility, and citability of omics datasets: a case report.
Darlington, Yolanda F; Naumov, Alexey; McOwiti, Apollo; Kankanamge, Wasula H; Becnel, Lauren B; McKenna, Neil J
2017-03-01
Although omics datasets represent valuable assets for hypothesis generation, model testing, and data validation, the infrastructure supporting their reuse lacks organization and consistency. Using nuclear receptor signaling transcriptomic datasets as proof of principle, we developed a model to improve the discoverability, accessibility, and citability of published omics datasets. Primary datasets were retrieved from archives, processed to extract data points, then subjected to metadata enrichment and gap filling. The resulting secondary datasets were exposed on responsive web pages to support mining of gene lists, discovery of related datasets, and single-click citation integration with popular reference managers. Automated processes were established to embed digital object identifier-driven links to the secondary datasets in associated journal articles, small molecule and gene-centric databases, and a dataset search engine. Our model creates multiple points of access to reprocessed and reannotated derivative datasets across the digital biomedical research ecosystem, promoting their visibility and usability across disparate research communities.
Policy and data-intensive scientific discovery in the beginning of the 21st century.
Ozdemir, Vural; Smith, Charles; Bongiovanni, Kathleen; Cullen, David; Knoppers, Bartha M; Lowe, Andrew; Peters, Mette; Robbins, Robert; Stewart, Elizabeth; Yee, Gene; Yu, Yi-Kuo; Kolker, Eugene
2011-04-01
Recent developments in our ability to capture, curate, and analyze data, the field of data-intensive science (DIS), have indeed made these interesting and challenging times for scientific practice as well as for policy making in real time. We are confronted with immense datasets that challenge our ability to pool, transfer, analyze, or interpret scientific observations. We have more data available than ever before, yet more questions to be answered as well, and no clear path to answer them. We are excited by the potential for science-based solutions to humankind's problems, yet stymied by the limitations of our current cyberinfrastructure and existing public policies. Importantly, DIS signals a transformation of the hypothesis-driven tradition of science ("first hypothesize, then experiment") to one typified by a "first experiment, then hypothesize" mode of discovery. Another hallmark of DIS is that it amasses data that are public goods (i.e., it creates a "commons") that can be creatively mined for various applications in different sectors. As such, this calls for a science policy vision that is long term. We herein reflect on how best to approach policy making at this critical inflection point, when DIS applications are being diversified in agriculture, ecology, marine biology, and environmental research internationally. This article outlines the key policy issues and gaps that emerged from the multidisciplinary discussions at the NSF-funded DIS workshop held at the Seattle Children's Research Institute in Seattle, on September 19-20, 2010.
An Integrated Suite of Text and Data Mining Tools - Phase II
2005-08-30
…with Georgia Tech Research Corporation developed a desktop text-mining software tool named TechOASIS (known commercially as VantagePoint). …of this dataset and groups the Corporate Source items that co-occur with the found items. He decides he is only interested in the institutions…
Functional evaluation of out-of-the-box text-mining tools for data-mining tasks
Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H
2015-01-01
Objective: The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems on three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug–drug interactions, and learning used-to-treat relationships between drugs and indications. Materials: We first benchmarked the accuracy of the NCBO Annotator and REVEAL on a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for the three research tasks. Results: There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. Conclusions: For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population-level analyses can advance the adoption of NLP in practice. PMID:25336595
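The dictionary-based term recognition benchmarked above amounts to simple phrase lookup. A hedged sketch follows (the term dictionary, concept codes, and longest-match policy are invented; real annotators add tokenization and normalization far beyond this):

```python
# Hedged sketch of dictionary-based term recognition: plain longest-match
# phrase lookup, the kind of simple method that scales to millions of notes.
TERMS = {
    "diabetes mellitus": "C0011849",   # illustrative concept codes
    "diabetes": "C0011847",
    "obesity": "C0028754",
}

def recognize(text: str, terms=TERMS):
    """Return (phrase, code) pairs, preferring the longest dictionary match."""
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):   # try longest span first
            phrase = " ".join(tokens[i:j])
            if phrase in terms:
                found.append((phrase, terms[phrase]))
                i = j
                break
        else:
            i += 1
    return found

print(recognize("Patient with diabetes mellitus and obesity"))
# [('diabetes mellitus', 'C0011849'), ('obesity', 'C0028754')]
```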
Tandon, Disha; Haque, Mohammed Monzoorul; Mande, Sharmila S.
2016-01-01
The nature of inter-microbial metabolic interactions defines the stability of microbial communities residing in any ecological niche. Deciphering these interaction patterns is crucial for understanding the mode/mechanism(s) through which an individual microbial community transitions from one state to another (e.g. from a healthy to a diseased state). Statistical correlation techniques have traditionally been employed for mining microbial interaction patterns from taxonomic abundance data corresponding to a given microbial community. In spite of their efficiency, these correlation techniques can capture only 'pair-wise interactions'. Moreover, their emphasis on statistical significance can potentially result in missing several interactions that are relevant from a biological standpoint. This study explores the applicability of one of the earliest association rule mining algorithms, the 'Apriori algorithm', for deriving 'microbial association rules' from the taxonomic profile of a given microbial community. The classical Apriori approach derives association rules by analysing patterns of co-occurrence/co-exclusion between various '(subsets of) features/items' across various samples. Using real-world microbiome data, the efficiency/utility of this rule mining approach in deciphering multiple (biologically meaningful) association patterns between 'subsets/subgroups' of microbes (constituting microbiome samples) is demonstrated. As an example, association rules derived from publicly available gut microbiome datasets indicate an association between a group of microbes (Faecalibacterium, Dorea, and Blautia) that are known to have mutualistic metabolic associations among themselves. Application of the rule mining approach to gut microbiomes (sourced from the Human Microbiome Project) further indicated similar microbial association patterns in gut microbiomes irrespective of the gender of the subjects. A Linux implementation of the Association Rule Mining (ARM) software (customised for deriving 'microbial association rules' from microbiome data) is freely available for download from the following link: http://metagenomics.atc.tcs.com/arm. PMID:27124399
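The Apriori principle at work here — a set of taxa can only be frequent if all of its subsets are — fits in a few lines. A hedged sketch on invented presence/absence profiles (the samples, genera, and support threshold are illustrative; this is not the ARM software's implementation):

```python
# Hedged sketch of Apriori-style itemset mining on toy microbiome profiles:
# find genus groups present together in at least min_support samples.
from itertools import combinations

samples = [
    {"Faecalibacterium", "Dorea", "Blautia"},
    {"Faecalibacterium", "Dorea", "Blautia", "Prevotella"},
    {"Faecalibacterium", "Blautia"},
    {"Prevotella", "Bacteroides"},
]

def apriori(samples, min_support=2):
    frequent = []
    k_sets = [frozenset([taxon]) for taxon in set.union(*samples)]
    while k_sets:
        counts = {s: sum(s <= sample for sample in samples) for s in k_sets}
        survivors = [s for s, c in counts.items() if c >= min_support]
        frequent += [(set(s), counts[s]) for s in survivors]
        # Grow candidates one item larger; anti-monotonicity prunes the rest.
        k_sets = list({a | b for a, b in combinations(survivors, 2)
                       if len(a | b) == len(a) + 1})
    return frequent

for itemset, support in apriori(samples):
    print(itemset, support)   # includes {Faecalibacterium, Dorea, Blautia}: 2
```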
Moment tensor clustering: a tool to monitor mining induced seismicity
NASA Astrophysics Data System (ADS)
Cesca, Simone; Dahm, Torsten; Tolga Sen, Ali
2013-04-01
Automated moment tensor inversion routines have been set up in recent decades for the analysis of global and regional seismicity, and recent developments allow smaller events and larger datasets to be analysed. In particular, applications to microseismicity, e.g. in mining environments, have led to the generation of large moment tensor catalogues. Moment tensor catalogues provide valuable information about the earthquake source and the details of rupture processes taking place in the seismogenic region. Earthquake focal mechanisms can be used to discuss the local stress field and possible orientations of the fault system, or to evaluate the presence of shear and/or tensile cracks. Focal mechanism and moment tensor solutions are typically analysed for selected events, and quick and robust tools for the automated analysis of larger catalogues are needed. We propose here a method to perform cluster analysis on large moment tensor catalogues and to identify families of events that characterize the studied microseismicity. Clusters include events with similar focal mechanisms, which first requires the definition of a distance between focal mechanisms. Different metrics are proposed, for the cases of pure double-couple, constrained moment tensor, and full moment tensor catalogues. Different clustering approaches are implemented and discussed. The method is applied to synthetic and real datasets from mining environments to demonstrate its potential: the proposed clustering techniques prove able to automatically recognise major clusters. An important application for mine monitoring concerns the early identification of anomalous rupture processes, which is relevant for hazard assessment. This study is funded by the project MINE, which is part of the R&D programme GEOTECHNOLOGIEN and is funded by the German Ministry of Education and Research (BMBF), grant BMBF03G0737.
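As a hedged sketch of the clustering step, the snippet below groups random symmetric 3x3 stand-in tensors with hierarchical clustering under a simple normalized tensor-dot distance; this metric is only one plausible choice, not necessarily among those the study compares:

```python
# Hedged sketch: hierarchical clustering of moment-tensor stand-ins under a
# normalized tensor-dot distance (0 for identical mechanisms).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
mts = []
for _ in range(20):
    a = rng.normal(size=(3, 3))
    mts.append((a + a.T) / 2)          # random symmetric tensor stand-in
flat = np.array([m.flatten() for m in mts])

def mt_distance(u, v):
    """1 - normalized tensor dot product between two flattened tensors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

Z = linkage(pdist(flat, metric=mt_distance), method="average")
print(fcluster(Z, t=0.5, criterion="distance"))   # family label per event
```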
NASA Astrophysics Data System (ADS)
Shrestha, S. R.; Collow, T. W.; Rose, B.
2016-12-01
Scientific datasets are generated from various sources and platforms, but they are typically produced either by Earth observation systems or by modelling systems. They are widely used for monitoring, simulating, or analyzing measurements associated with physical, chemical, and biological phenomena over the ocean, atmosphere, or land. A significant subset of scientific datasets stores values directly as rasters or in a form that can be rasterized, in which a value exists at every cell in a regular grid spanning the spatial extent of the dataset. Government agencies like NOAA, NASA, EPA, and USGS produce large volumes of near-real-time, forecast, and historical data that drive climatological and meteorological studies and underpin operations ranging from weather prediction to monitoring sea ice loss. Modern science is computationally intensive because of the availability of an enormous amount of scientific data, the adoption of data-driven analysis, and the need to share these datasets and research results with the public. ArcGIS as a platform is sophisticated and capable of handling such a complex domain. We discuss constructs and capabilities applicable to multidimensional gridded data that can be conceptualized as a multivariate space-time cube. Building on the concept of a two-dimensional raster, a typical multidimensional raster dataset could contain several "slices" within the same spatial extent. We share a case from the NOAA Climate Forecast System Reanalysis (CFSR) multidimensional data as an example of how large collections of rasters can be efficiently organized and managed through a geodatabase data model called the "mosaic dataset" and dynamically transformed and analyzed using raster functions. A raster function is a lightweight, raster-valued transformation defined over a mixed set of raster and scalar inputs; just like any tool, you can provide a raster function with input parameters. It enables dynamic processing of only the data being displayed on the screen or requested by an application. We present the dynamic processing and analysis of CFSR data using chains of raster functions and share the result as a dynamic multidimensional image service. This workflow and these capabilities can easily be applied to any scientific data format supported in a mosaic dataset.
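The raster-function chaining described above is, at heart, function composition over array windows. A hedged, ArcGIS-free sketch in plain numpy (the chain helper, unit conversion, and tile values are all invented for illustration):

```python
# Hedged sketch of the raster-function idea in plain numpy: lightweight
# transformations composed into a chain and applied only to the window
# being requested (ArcGIS specifics omitted).
import numpy as np

def chain(*funcs):
    """Compose raster functions, applied left to right to a window."""
    def apply(window):
        for f in funcs:
            window = f(window)
        return window
    return apply

mask_invalid = lambda r: np.where(np.isfinite(r), r, np.nan)  # drop bad cells
kelvin_to_celsius = lambda r: r - 273.15                      # unit conversion

process = chain(mask_invalid, kelvin_to_celsius)
tile = np.array([[280.0, np.inf], [290.5, 300.0]])   # one requested window
print(process(tile))
```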
Antarctic and Sub-Antarctic Asteroidea database.
Moreau, Camille; Mah, Christopher; Agüera, Antonio; Améziane, Nadia; Barnes, David; Crokaert, Guillaume; Eléaume, Marc; Griffiths, Huw; Guillaumot, Charlène; Hemery, Lenaïg G.; Jażdżewska, Anna; Jossart, Quentin; Laptikhovsky, Vladimir; Linse, Katrin; Neill, Kate; Sands, Chester; Saucède, Thomas; Schiaparelli, Stefano; Siciński, Jacek; Vasset, Noémie; Danis, Bruno
2018-01-01
The present dataset is a compilation of georeferenced occurrences of asteroids (Echinodermata: Asteroidea) in the Southern Ocean. Occurrence data south of 45°S latitude were mined from various sources together with information regarding the taxonomy, the sampling source and sampling sites when available. Records from 1872 to 2016 were thoroughly checked to ensure the quality of a dataset that reaches a total of 13,840 occurrences from 4,580 unique sampling events. Information regarding the reproductive strategy (brooders vs. broadcasters) of 63 species is also made available. This dataset represents the most exhaustive occurrence database on Antarctic and Sub-Antarctic asteroids.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Julia, J; Nyblade, A; Gok, R
2009-07-06
In this project, we are developing and exploiting a unique seismic dataset to address the characteristics of small seismic events and the associated seismic signals observed at local (< 200 km) and regional (< 2000 km) distances. The dataset is being developed using mining-induced events from three deep gold mines in South Africa recorded on in-mine networks (< 1 km) composed of tens of high-frequency sensors, a network of four broadband stations installed as part of this project at the surface around the mines (1-10 km), and a network of existing broadband seismic stations at local/regional distances (50-1000 km) from the mines. Data acquisition has now been completed and includes: (1) ~2 years (2007 and 2008) of continuous recording by the surface broadband array, and (2) tens of thousands of mine tremors in the -3.4 < ML < 4.4 local magnitude range. Events with positive magnitudes are generally well recorded by the surface-mine stations, while magnitudes of 3.0 and larger are seen at regional distances (up to ~600 km) in high-pass filtered recordings. We have now completed the quality control of the in-mine data gathered at the three gold mines included in this project. The quality control consisted of: (1) identification and analysis of outliers among the P- and S-wave travel-time picks reported by the in-mine network operator, and (2) verification of sensor orientations. The outliers were identified through a 'Wadati filter' that searches for the largest subset of P- and S-wave travel-time picks consistent with a medium of uniform wave speed. We observed that outliers are generally picked at a few select stations. We also detected that trigger times were mistakenly reported as origin times by the in-mine network operator, and corrections were obtained from the intercept times in the Wadati diagrams. Sensor orientations were verified through rotations into the local ray-coordinate system and, when possible, corrected by correlating waveforms obtained from theoretical and empirical rotation angles. Full moment tensor solutions have been obtained for selected events within the Savuka network volume, with moment magnitudes in the 0.5 < Mw < 2.6 range. The solutions were obtained by inverting P-, SV-, and SH-spectral amplitudes measured on the theoretically rotated waveforms with visually assigned polarities. Most of the solutions have a non-zero implosive contribution (47 out of 76), while a small percentage are purely deviatoric (10 out of 76). The deviatoric moment tensors range from pure double-couple to pure non-double-couple mechanisms. We have also calibrated the regional stations for seismic coda-derived source spectra and moment magnitude using the envelope methodology of Mayeda et al. (2003), tying the coda Mw to independent values from waveform modeling. The resulting coda-based source spectra of shallow mining-related events show significant spectral peaking that is not seen in deeper tectonic earthquakes. This coda peaking may offer an independent method of identifying shallow events and is similar to the coda peaking previously observed for Nevada explosions, where the frequency of the observed spectral peak correlates with the depth of burial (Murphy et al., 2009).
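The 'Wadati filter' mentioned above builds on the Wadati diagram: in a uniform-wavespeed medium, S-P times grow linearly with P arrival time, and the line's zero crossing recovers the origin time. A hedged sketch with synthetic picks (the pick values and Vp/Vs ratio are invented; the outlier-subset search is omitted):

```python
# Hedged sketch of a Wadati-diagram fit on synthetic travel-time picks.
import numpy as np

t0, vp_vs = 10.0, 1.73                    # true origin time (s), Vp/Vs ratio
tp = t0 + np.array([1.2, 2.5, 3.1, 4.0])  # P arrival picks (s)
ts = t0 + vp_vs * (tp - t0)               # S picks consistent with uniform speed

# Fit S-P time as a linear function of P arrival time.
slope, intercept = np.polyfit(tp, ts - tp, 1)
origin = -intercept / slope               # where S-P time extrapolates to zero
print(f"Vp/Vs ~ {slope + 1:.2f}, origin time ~ {origin:.2f} s")  # 1.73, 10.00
```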
AstroML: "better, faster, cheaper" towards state-of-the-art data mining and machine learning
NASA Astrophysics Data System (ADS)
Ivezic, Zeljko; Connolly, Andrew J.; Vanderplas, Jacob
2015-01-01
We present AstroML, a Python module for machine learning and data mining built on numpy, scipy, scikit-learn, matplotlib, and astropy, and distributed under an open license. AstroML contains a growing library of statistical and machine learning routines for analyzing astronomical data in Python, loaders for several open astronomical datasets (such as SDSS and other recent major surveys), and a large suite of examples of analyzing and visualizing astronomical datasets. AstroML is especially suitable for introducing undergraduate students to numerical research projects and for graduate students to rapidly undertake cutting-edge research. The long-term goal of astroML is to provide a community repository for fast Python implementations of common tools and routines used for statistical data analysis in astronomy and astrophysics (see http://www.astroml.org).
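A brief, hedged sketch of the intended workflow: fetch a survey dataset with an astroML loader, then analyze it with the scikit-learn machinery astroML builds on. The loader name fetch_sdss_specgals and the 'z' field are assumptions to be checked against the astroML documentation:

```python
# Hedged sketch of an astroML + scikit-learn workflow; the loader and the
# structured-array field name are assumptions (verify against astroML docs).
import numpy as np
from astroML.datasets import fetch_sdss_specgals   # downloads on first use
from sklearn.mixture import GaussianMixture

data = fetch_sdss_specgals()          # structured array of galaxy properties
z = np.asarray(data["z"]).reshape(-1, 1)   # redshift column (field name assumed)

gmm = GaussianMixture(n_components=2, random_state=0).fit(z[:5000])
print(gmm.means_.ravel())             # rough redshift sub-populations
```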
Ye, Kai; Kosters, Walter A; Ijzerman, Adriaan P
2007-03-15
Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/) are used to directly identify frequent patterns from unaligned biological sequences, without any attempt to align them. Here we propose a new algorithm with more efficiency and more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets. In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns, but has additional functionality such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets. The source code for the pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.
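The pattern growth idea — extend a pattern one residue at a time and prune any extension that falls below the support threshold — can be sketched briefly. The toy sequences below are invented, and wildcard (type II/III) patterns are omitted; this is not the authors' implementation:

```python
# Hedged sketch of pattern-growth mining of frequent exact substrings from
# unaligned toy sequences (wildcard pattern types omitted).
def grow(sequences, min_support=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    frequent = {}

    def support(pattern):
        return sum(pattern in s for s in sequences)

    def extend(prefix):
        for aa in alphabet:
            candidate = prefix + aa
            count = support(candidate)
            if count >= min_support:   # anti-monotone pruning
                frequent[candidate] = count
                extend(candidate)      # grow the pattern depth-first

    extend("")
    return frequent

seqs = ["MKTAYIAK", "MKTAHIAK", "GGMKTAYI"]   # invented toy sequences
for pattern, count in sorted(grow(seqs).items()):
    print(pattern, count)   # e.g. MKTA appears in all three sequences
```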
NASA Technical Reports Server (NTRS)
Lee, Lesley R.; Mason, Sara S.; Babiak-Vazquez, Adriana; Ray, Stacie L.; Van Baalen, Mary
2015-01-01
Since the 2010 NASA authorization to make the Life Sciences Data Archive (LSDA) and Lifetime Surveillance of Astronaut Health (LSAH) data archives more accessible to the research and operational communities, demand for data has greatly increased. Correspondingly, both the number and scope of requests have increased, from 142 requests fulfilled in 2011 to 224 in 2014, with some datasets comprising up to 1 million data points. To meet the demand, the LSAH and LSDA Repositories project was launched, which allows active and retired astronauts to authorize full, partial, or no access to their data for research without individual, study-specific informed consent. A one-on-one personal informed consent briefing is required to fully communicate the implications of the several tiers of consent. Due to the need for personal contact to conduct Repositories consent meetings, the rate of consenting has not kept up with demand for individualized, possibly attributable data. As a result, other methods had to be implemented to allow the release of large datasets, such as release of only de-identified data. However, the compilation of large, de-identified datasets places a significant resource burden on LSAH and LSDA and may result in diminished scientific usefulness of the dataset. As a result, LSAH and LSDA worked with the JSC Institutional Review Board Chair, Astronaut Office physicians, and NASA Office of General Counsel personnel to develop a "Remote Consenting" process for retrospective data mining studies. This is particularly useful since the majority of the astronaut cohort is retired from the agency and living outside the Houston area. Originally planned as a method to send informed consent briefing slides and consent forms only by mail, Remote Consenting has evolved into a means to accept crewmember decisions on individual studies via their method of choice: email or paper copy by mail. To date, 100 emails have been sent to request participation in eight HRP-funded studies. The development of the Remote Consent process, the laws allowing transmission of consent via electronic means, total metrics to date, and remaining challenges (e.g., response issues, use of International Partner data, biospecimens/genetic data) for the research use of LSAH/LSDA data will be described.
Structuring intuition with theory: The high-throughput way
NASA Astrophysics Data System (ADS)
Fornari, Marco
2015-03-01
First principles methodologies have grown in accuracy and applicability to the point where large databases can be built, shared, and analyzed with the goal of predicting novel compositions, optimizing functional properties, and discovering unexpected relationships between the data. In order to be useful to a large community of users, data should be standardized, validated, and distributed. In addition, tools to easily manage large datasets should be made available to effectively lead to materials development. Within the AFLOW consortium we have developed a simple framework to expand, validate, and mine data repositories: the MTFrame. Our minimalistic approach complements AFLOW and other existing high-throughput infrastructures and aims to integrate data generation with data analysis. We present a few examples from our work on materials for energy conversion. Our intent is to pinpoint the usefulness of high-throughput methodologies to guide the discovery process by quantitatively structuring the scientific intuition. This work was supported by ONR-MURI under Contract N00014-13-1-0635 and the Duke University Center for Materials Genomics.
A Testbed Demonstration of an Intelligent Archive in a Knowledge Building System
NASA Technical Reports Server (NTRS)
Ramapriyan, Hampapuram; Isaac, David; Morse, Steve; Yang, Wenli; Bonnlander, Brian; McConaughy, Gail; Di, Liping; Danks, David
2005-01-01
The last decade's influx of raw data and derived geophysical parameters from several Earth observing satellites to NASA data centers has created a data-rich environment for Earth science research and applications. While advances in hardware and information management have made it possible to archive petabytes of data and distribute terabytes of data daily to a broad community of users, further progress is necessary in the transformation of data into information, and information into knowledge that can be used in particular applications, in order to realize the full potential of these valuable datasets. In examining what is needed to enable this progress in the data provider environment that exists today and is expected to evolve in the next several years, we arrived at the concept of an Intelligent Archive in the context of a Knowledge Building System (IA/KBS). Our prior work and associated papers investigated usage scenarios, required capabilities, system architecture, data volume issues, and supporting technologies. We identified six key capabilities of an IA/KBS: Virtual Product Generation, Significant Event Detection, Automated Data Quality Assessment, Large-Scale Data Mining, Dynamic Feedback Loop, and Data Discovery and Efficient Requesting. Among these capabilities, large-scale data mining is perceived by many in the community to be an area of technical risk. One of the main reasons for this is that standard data mining research and algorithms operate on datasets that are several orders of magnitude smaller than the actual sizes of datasets maintained by realistic Earth science data archives. Therefore, we defined a testbed activity to implement a large-scale data mining algorithm in a pseudo-operational scale environment and to examine any issues involved. The application chosen for the data mining algorithm is wildfire prediction over the continental U.S. This paper reports a number of observations based on our experience with this testbed. While proof-of-concept for data mining scalability and utility has been a major goal for the research reported here, it was not the only one. The other five capabilities of an IA/KBS named above have been considered as well, and an assessment of the implications of our experience for these other areas will also be presented. The lessons learned through the testbed effort and presented in this paper will benefit technologists, scientists, and system operators as they consider introducing IA/KBS capabilities into production systems.
Kumar, Ashwani; Singh, Tiratha Raj
2017-03-01
Alzheimer's disease (AD) is a progressive, incurable and terminal neurodegenerative disorder of the brain and is associated with mutations in amyloid precursor protein, presenilin 1, presenilin 2 or apolipoprotein E, but its underlying mechanisms are still not fully understood. The healthcare sector is generating a large amount of information corresponding to the diagnosis, disease identification and treatment of individuals. Mining knowledge and providing scientific decision-making for the diagnosis and treatment of disease from clinical datasets are therefore increasingly becoming necessary. The current study deals with the construction of classifiers that are human readable as well as robust in performance for an AD gene dataset, using a decision tree. Models of classification for different AD genes were generated according to Mini-Mental State Examination scores and all other vital parameters to identify the expression levels of different proteins of the disorder that may determine the involvement of genes in various AD pathogenesis pathways. The effectiveness of the decision tree in AD diagnosis is determined by information gain, with a confidence value of 0.96, specificity of 92%, sensitivity of 98% and accuracy of 77%. Besides this functional gene classification using different parameters and enrichment analysis, our findings indicate that the measures of all the genes assessed in single cohorts are sufficient to diagnose AD and will help in the prediction of important parameters for other relevant assessments.
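The evaluation reported above can be reproduced in outline with a standard decision-tree implementation. The sketch below uses scikit-learn on synthetic stand-ins for the clinical parameters and derives sensitivity and specificity from the confusion matrix; the data and tree depth are assumptions, not the study's.

```python
# A hedged sketch of decision-tree classification with sensitivity/
# specificity reporting. Features are synthetic, not the study's genes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))                 # 8 hypothetical parameters
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, tree.predict(X_te)).ravel()
print(f"sensitivity: {tp / (tp + fn):.2f}")   # true-positive rate
print(f"specificity: {tn / (tn + fp):.2f}")   # true-negative rate
```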
Predicting future discoveries from current scientific literature.
Petrič, Ingrid; Cestnik, Bojan
2014-01-01
Knowledge discovery in biomedicine is a time-consuming process starting from basic research, through preclinical testing, towards possible clinical applications. Crossing of conceptual boundaries is often needed for groundbreaking biomedical research that generates highly inventive discoveries. We demonstrate the ability of a creative literature mining method to advance valuable new discoveries based on rare ideas from existing literature. When emerging ideas from scientific literature are put together as fragments of knowledge in a systematic way, they may lead to original, sometimes surprising, research findings. If enough scientific evidence is already published for the association of such findings, they can be considered as scientific hypotheses. In this chapter, we describe a method for the computer-aided generation of such hypotheses based on the existing scientific literature. Our literature-based discovery of NF-kappaB and its possible connections to autism was recently validated by the scientific community, which confirms the ability of our literature mining methodology to accelerate future discoveries based on rare ideas from existing literature.
NASA Astrophysics Data System (ADS)
Landsfeld, M. F.; Daudert, B.; Friedrichs, M.; Morton, C.; Hegewisch, K.; Husak, G. J.; Funk, C. C.; Peterson, P.; Huntington, J. L.; Abatzoglou, J. T.; Verdin, J. P.; Williams, E. L.
2015-12-01
The Famine Early Warning Systems Network (FEWS NET) focuses on food insecurity in developing nations and provides objective, evidence-based analysis to help government decision-makers and relief agencies plan for and respond to humanitarian emergencies. The Google Earth Engine (GEE) is a platform provided by Google Inc. to support scientific research and analysis of environmental data in their cloud environment. The intent is to allow scientists and independent researchers to mine massive collections of environmental data and leverage Google's vast computational resources to detect changes and monitor the Earth's surface and climate. GEE hosts an enormous amount of satellite imagery and climate archives, one of which is the Climate Hazards Group Infrared Precipitation with Stations dataset (CHIRPS). The CHIRPS dataset is land based, quasi-global (latitude 50N-50S), 0.05 degree resolution, and has a relatively long period of record (1981-present). CHIRPS is on a continuous monthly feed into the GEE as new data fields are generated each month. This precipitation dataset is a key input for FEWS NET monitoring and forecasting efforts. FEWS NET intends to leverage the GEE in order to provide analysts and scientists with flexible, interactive tools to aid in their monitoring and research efforts. These scientists often work in bandwidth-limited regions, so lightweight Internet tools and services that bypass the need to download massive datasets for analysis are preferred for their work. The GEE provides just this type of service. We present a tool designed specifically for FEWS NET scientists to be used interactively for investigating and monitoring agro-climatological issues. We are able to utilize the enormous GEE computing power to generate on-the-fly statistics to calculate precipitation anomalies, z-scores, percentiles and band ratios, and allow the user to interactively select custom areas for statistical time series comparisons and predictions.
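The on-the-fly statistics mentioned above can be sketched with the Earth Engine Python API. The example below computes a seasonal precipitation anomaly and z-score from the public CHIRPS pentad collection; it assumes an authenticated Earth Engine account, and the season, epoch, and band-name handling are illustrative rather than taken from the FEWS NET tool.

```python
# A hedged sketch of a CHIRPS seasonal anomaly/z-score computation in
# Google Earth Engine. Collection ID is the public CHIRPS pentad asset;
# dates and climatology window are assumptions.
import ee
ee.Initialize()

chirps = ee.ImageCollection('UCSB-CHG/CHIRPS/PENTAD')

# Seasonal total for the season of interest (Jan-Mar 2015 here).
season = chirps.filterDate('2015-01-01', '2015-04-01').sum()

# Climatology of the same season over 1981-2010.
def season_total(year):
    start = ee.Date.fromYMD(ee.Number(year), 1, 1)
    return chirps.filterDate(start, start.advance(3, 'month')).sum()

climatology = ee.ImageCollection(
    ee.List.sequence(1981, 2010).map(season_total))
mean = climatology.mean()
# Rename so bands match for division (CHIRPS band is 'precipitation').
std = climatology.reduce(ee.Reducer.stdDev()).rename('precipitation')

anomaly = season.subtract(mean)   # departure from normal (mm)
zscore = anomaly.divide(std)      # standardized anomaly
```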
Argue, Denise M.; Kiah, Richard G.; Piatak, Nadine M.; Seal, Robert R.; Hammarstrom, Jane M.; Hathaway, Edward; Coles, James F.
2008-01-01
The data contained in this report are a compilation of selected water- and sediment-quality, aquatic biology, and mine-waste data collected at the Ely Copper Mine Superfund site in Vershire, VT, from August 1998 through May 2007. The Ely Copper Mine Superfund site is in east-central Vermont (fig. 1) within the Vermont Copper Belt (Hammarstrom and others, 2001). The Ely Copper Mine site was placed on the U.S. Environmental Protection Agency (USEPA) National Priorities List in 2001. Previous investigations conducted at the site documented that the mine is contributing metals and highly acidic waters to local streams (Hammarstrom and others, 2001; Holmes and others, 2002; Piatak and others, 2003, 2004, and 2006). The U.S. Geological Survey (USGS), in cooperation with the USEPA, compiled selected data from previous investigations into uniform datasets that will be used to help characterize the extent of contamination at the mine. The data may be used to determine the magnitude of biological impacts from the contamination and in the development of remediation activities. This report contains analytical data for samples collected from 98 stream locations, 6 pond locations, 21 surface-water seeps, and 29 mine-waste locations. The 98 stream locations are within 3 streams and their tributaries. Ely Brook flows directly through the Ely Copper Mine and then into Schoolhouse Brook (fig. 2), which joins the Ompompanoosuc River (fig. 1). The six pond locations are along Ely Brook Tributary 2 (fig. 2). The surface-water seeps and mine-waste locations are near the headwaters of Ely Brook (fig. 2 and fig. 3). The datasets 'Site_Directory' and 'Coordinates' contain specific information about each of the sample locations including stream name, number of meters from the mouth of the stream, geographic coordinates, types of samples collected (matrix of sample), and the figure on which the sample location is depicted. Data have been collected at the Ely Copper Mine Superfund site by the USEPA, the Vermont Department of Environmental Conservation (VTDEC), and the USGS. Data also have been collected on behalf of USEPA by the following agencies: Arthur D. Little Incorporated (ADL), U.S. Army Cold Region Research and Engineering Laboratory (CRREL), URS Corporation (URS), USEPA, and USGS. These data provide information about the aquatic communities and their habitats, including chemical analyses of surface water, pore water, sediments, and fish tissue; assessments of macroinvertebrate and fish assemblages; physical characteristics of sediments; and chemical analyses of soil and soil leachate collected in and around the piles of mine waste.
Primer for Using RE-Powering Data to Screen Sites for Renewable Energy Potential
This reference guide provides users with tips for using the RE-Powering Screening Dataset spreadsheet, which contains detailed site information on over 60,000 contaminated lands, landfills, and mine sites.
76 FR 7868 - Center for Scientific Review; Notice of Closed Meetings
Federal Register 2010, 2011, 2012, 2013, 2014
2011-02-11
... Special Emphasis Panel, Small Business: Computational Biology, Image Processing and Data Mining. Date... for Scientific Review Special Emphasis Panel, Quick Trial on Imaging and Image-Guided Intervention...
Data Mining as a Service (DMaaS)
NASA Astrophysics Data System (ADS)
Tejedor, E.; Piparo, D.; Mascetti, L.; Moscicki, J.; Lamanna, M.; Mato, P.
2016-10-01
Data Mining as a Service (DMaaS) is a software and computing infrastructure that allows interactive mining of scientific data in the cloud. It allows users to run advanced data analyses by leveraging the widely adopted Jupyter notebook interface. Furthermore, the system makes it easier to share results and scientific code, access scientific software, produce tutorials and demonstrations as well as preserve the analyses of scientists. This paper describes how a first pilot of the DMaaS service is being deployed at CERN, starting from the notebook interface that has been fully integrated with the ROOT analysis framework, in order to provide all the tools for scientists to run their analyses. Additionally, we characterise the service backend, which combines a set of IT services such as user authentication, virtual computing infrastructure, mass storage, file synchronisation, development portals or batch systems. The added value acquired by the combination of the aforementioned categories of services is discussed, focusing on the opportunities offered by the CERNBox synchronisation service and its massive storage backend, EOS.
Liu, L L; Liu, M J; Ma, M
2015-09-28
The central task of this study was to mine the gene-to-medium relationship. Adequate knowledge of this relationship could potentially improve the accuracy of differentially expressed gene mining. One of the approaches to differentially expressed gene mining uses conventional clustering algorithms to identify the gene-to-medium relationship. Compared to conventional clustering algorithms, self-organizing maps (SOMs) identify the nonlinear aspects of the gene-to-medium relationships by mapping the input space into another, higher dimensional feature space. However, SOMs are not suitable for huge datasets consisting of millions of samples. Therefore, a new computational model, the Function Clustering Self-Organization Maps (FCSOMs), was developed. FCSOMs take advantage of the theory of granular computing as well as advanced statistical learning methodologies, and are built specifically for each information granule (a function cluster of genes), intelligently partitioned by the clustering algorithm provided by the DAVID 6.7 software platform. However, only the gene functions, and not their expression values, are considered in the fuzzy clustering algorithm of DAVID. Compared to the clustering algorithm of DAVID, the experimental results show a marked improvement in the accuracy of classification with the application of FCSOMs. FCSOMs can handle huge datasets and their complex classification problems, as each FCSOM (modeled for each function cluster) can be easily parallelized.
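While the FCSOM code itself is not reproduced here, the SOM building block it extends can be illustrated with the third-party MiniSom package: genes with similar expression profiles map to the same grid unit. The synthetic expression matrix, grid size, and training length below are assumptions.

```python
# Not the FCSOM implementation: a minimal SOM illustration with the
# MiniSom package, showing the nonlinear mapping the study builds on.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(7)
expression = rng.random((200, 10))        # 200 genes x 10 conditions

som = MiniSom(6, 6, input_len=10, sigma=1.0, learning_rate=0.5,
              random_seed=0)
som.train_random(expression, num_iteration=2000)

# Each gene maps to its best-matching unit on the 6x6 grid; genes
# sharing a unit behave similarly across the media/conditions.
clusters = {}
for i, vec in enumerate(expression):
    clusters.setdefault(som.winner(vec), []).append(i)
print({unit: len(genes) for unit, genes in clusters.items()})
```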
ParaBTM: A Parallel Processing Framework for Biomedical Text Mining on Supercomputers.
Xing, Yuting; Wu, Chengkun; Yang, Xi; Wang, Wei; Zhu, En; Yin, Jianping
2018-04-27
A prevailing way of extracting valuable information from biomedical literature is to apply text mining methods on unstructured texts. However, the massive amount of literature that needs to be analyzed poses a big data challenge to the processing efficiency of text mining. In this paper, we address this challenge by introducing parallel processing on a supercomputer. We developed paraBTM, a runnable framework that enables parallel text mining on the Tianhe-2 supercomputer. It employs a low-cost yet effective load balancing strategy to maximize the efficiency of parallel processing. We evaluated the performance of paraBTM on several datasets, utilizing three types of named entity recognition tasks as demonstration. Results show that, in most cases, the processing efficiency can be greatly improved with parallel processing, and the proposed load balancing strategy is simple and effective. In addition, our framework can be readily applied to other tasks of biomedical text mining besides NER.
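The load-balancing problem paraBTM addresses can be illustrated with the classic longest-processing-time-first heuristic: sort work items by size and always hand the next one to the least-loaded worker. The sketch below is a generic illustration, not the paper's strategy; document lengths and worker count are made up.

```python
# A hedged sketch of greedy longest-processing-time-first balancing.
import heapq

def balance(doc_lengths, n_workers):
    """Assign documents to workers, largest first, always to the
    currently least-loaded worker. Returns per-worker document lists."""
    heap = [(0, w) for w in range(n_workers)]   # (load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    for doc, length in sorted(enumerate(doc_lengths),
                              key=lambda p: -p[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(doc)
        heapq.heappush(heap, (load + length, w))
    return assignment

print(balance([90, 10, 35, 60, 20, 75, 40], n_workers=3))
```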
Activity recognition from minimal distinguishing subsequence mining
NASA Astrophysics Data System (ADS)
Iqbal, Mohammad; Pao, Hsing-Kuo
2017-08-01
Human activity recognition is one of the most important research topics in the era of Internet of Things. To separate different activities given sensory data, we utilize a Minimal Distinguishing Subsequence (MDS) mining approach to efficiently find distinguishing patterns among different activities. We first transform the sensory data into a series of sensor triggering events and operate the MDS mining procedure afterwards. The gap constraints are also considered in the MDS mining. Given the multi-class nature of most activity recognition tasks, we modify the MDS mining approach from a binary case to a multi-class one to fit the need for multiple activity recognition. We also study how to select the best parameter set including the minimal and the maximal support thresholds in finding the MDSs for effective activity recognition. Overall, the prediction accuracy is 86.59% on the van Kasteren dataset which consists of four different activities for recognition.
Sams, James I.; Veloski, Garret; Ackman, T.E.
2003-01-01
Nighttime high-resolution airborne thermal infrared imagery (TIR) data were collected in the predawn hours during Feb 5-8 and March 11-12, 1999, from a helicopter platform for 72.4 km of the Youghiogheny River, from Connellsville to McKeesport, in southwestern Pennsylvania. The TIR data were used to identify sources of mine drainage from abandoned mines that discharge directly into the Youghiogheny River. Image-processing and geographic information systems (GIS) techniques were used to identify 70 sites within the study area as possible mine drainage sources. The combination of GIS datasets and the airborne TIR data provided a fast and accurate method to target the possible sources. After field reconnaissance, it was determined that 24 of the 70 sites were mine drainage. This paper summarizes the procedures used to process the TIR data and extract potential mine-drainage sites, the methods used for verification of the TIR data, the factors affecting the TIR data, and the water-quality findings.
Effective application of improved profit-mining algorithm for the interday trading model.
Hsieh, Yu-Lung; Yang, Don-Lin; Wu, Jungpin
2014-01-01
Many real world applications of association rule mining from large databases help users make better decisions. However, they do not work well in financial markets at this time. In addition to a high profit, an investor also looks for a low risk trading with a better rate of winning. The traditional approach of using minimum confidence and support thresholds needs to be changed. Based on an interday model of trading, we proposed effective profit-mining algorithms which provide investors with profit rules including information about profit, risk, and winning rate. Since profit-mining in the financial market is still in its infant stage, it is important to detail the inner working of mining algorithms and illustrate the best way to apply them. In this paper we go into details of our improved profit-mining algorithm and showcase effective applications with experiments using real world trading data. The results show that our approach is practical and effective with good performance for various datasets.
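The shift from support/confidence to profit-oriented measures can be made concrete with a small sketch: given the returns of trades that matched one candidate rule, report its winning rate, mean profit, and risk. The simulated returns below are illustrative; the rule-mining step itself is elided.

```python
# A hedged, minimal illustration of evaluating a candidate trading rule
# by profit, risk, and winning rate rather than support/confidence alone.
import numpy as np

rng = np.random.default_rng(3)
# Simulated per-trade returns (%) for trades matching one candidate rule.
returns = rng.normal(loc=0.4, scale=2.0, size=250)

winning_rate = float(np.mean(returns > 0))
avg_profit = float(np.mean(returns))
risk = float(np.std(returns))                # volatility as a risk proxy
downside = float(np.percentile(returns, 5))  # 5% worst-case return

print(f"win rate {winning_rate:.2%}, mean profit {avg_profit:.2f}%, "
      f"risk {risk:.2f}%, 5th percentile {downside:.2f}%")
```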
Interactive Hard Rock Mining 2006 Conference CD
This interactive conference CD provides a forum for the exchange of scientific information on current and future environmental issues which shape the future of the mining industry, including: base metals, precious metals, industrial minerals, and aggregates. The focus of the conf...
NASA Astrophysics Data System (ADS)
Landsfeld, M. F.; Hegewisch, K.; Daudert, B.; Morton, C.; Husak, G. J.; Friedrichs, M.; Funk, C. C.; Huntington, J. L.; Abatzoglou, J. T.; Verdin, J. P.
2016-12-01
The Famine Early Warning Systems Network (FEWS NET) focuses on food insecurity in developing nations and provides objective, evidence-based analysis to help government decision-makers and relief agencies plan for and respond to humanitarian emergencies. The network of FEWS NET analysts and scientists requires flexible, interactive tools to aid in their monitoring and research efforts. Because they often work in bandwidth-limited regions, lightweight Internet tools and services that bypass the need for downloading massive datasets are preferred for their work. To support food security analysis, FEWS NET developed a custom interface for the Google Earth Engine (GEE). GEE is a platform developed by Google to support scientific analysis of environmental data in their cloud computing environment. This platform allows scientists and independent researchers to mine massive collections of environmental data, leveraging Google's vast computational resources for purposes of detecting changes and monitoring the Earth's surface and climate. GEE hosts an enormous amount of satellite imagery and climate archives, one of which is the Climate Hazards Group Infrared Precipitation with Stations dataset (CHIRPS). The CHIRPS precipitation dataset is a key input for FEWS NET monitoring and forecasting efforts. In this talk we introduce the FEWS Engine interface. We present an application that highlights the utility of FEWS Engine for forecasting the upcoming seasonal precipitation of southern Africa. Specifically, the current state of ENSO is assessed and used to identify similar historical seasons. The FEWS Engine compositing tool is used to examine rainfall and other environmental data for these analog seasons. The application illustrates the unique benefits of using FEWS Engine for on-the-fly food security scenario development.
Librarians and Scientists: Combining Forces for Better Metrics
NASA Astrophysics Data System (ADS)
Rots, Arnold H.; Winkelman, Sherry
2015-08-01
Traditionally, observatory bibliographies rely mainly on two parameters derived from carefully compiled lists of publications associated, in a well-defined way, with the observatory's contribution to the advancement of science: numbers of articles and numbers of citations, in addition to the bibliographic metadata relating to those articles. The information that can be extracted from metrics based on these parameters is limited. This is a realization that is not unique to astronomy and astrophysics, but one that is felt across many disciplines. Relating articles to very specific datasets allows us to join those datasets' metadata with the bibliographic metadata, which opens a much richer field of information to mine for knowledge concerning the performance not only of the observatory as a whole, but also of its parts: instruments, types of observations, length of observations, etc. We have experimented extensively with such new metrics in the Chandra Data Archive in the Chandra X-ray Center at SAO. The linking of articles with individual datasets requires a level of scientific expertise that is usually not in the, otherwise extensive, skill set of the librarians, but is something that is crucial on the road to more informative bibliographic metrics. This talk is a plea for librarians and research scientists to join forces to make this happen. The added benefit of such a collaboration is a powerful research tool for navigating data and literature through a single interface. This work has been supported by NASA under contract NAS 8-03060 to the Smithsonian Astrophysical Observatory for operation of the Chandra X-ray Center. It depends critically on the services provided by the ADS, which is funded by NASA Grant NNX12AG54G.
Space environment data storage and access: lessons learned and recommendations for the future
NASA Astrophysics Data System (ADS)
Evans, Hugh; Heynderickx, Daniel
2012-07-01
With the ever-increasing volume of space environment data available at present and planned for the near future, the demands on data storage and access methods are increasing as well. In addition, continued access to historical, archived data remains crucial. On the basis of many years of experience, the authors identify the following issues as important for continued and efficient handling of datasets now and in the future. The huge data volumes currently or very soon available from a number of space missions will limit direct Internet download access to even relatively short epoch ranges of data; therefore, data providers should establish or extend standardised data (post-)processing services so that only data query results need be downloaded. Although a single standardised data format will in all likelihood remain utopia, data providers should at least include extensive metadata with their data products, according to established standards and practices (e.g. ISTP, SPASE); standardisation of (sets of) metadata greatly facilitates data mining and querying. The use of SQL database storage should be considered instead of, or in parallel with, classic storage of data files: SQL does away with having to handle file parsing and processing, while at the same time standard access protocols can be used to (remotely) connect to such data repositories. Many data holdings are still lacking in extensive descriptions of data provenance (e.g. instrument description), content and format; unfortunately, detailed data information is usually rejected by scientific and technical journals. Re-processing of historical archived datasets into modern formats, making them easily available and usable, is urgently required, as knowledge is being lost. A global data directory has still not been achieved; policy makers should enforce stricter rules for "broadcasting" dataset information.
A test and re-estimation of Taylor's empirical capacity-reserve relationship
Long, K.R.
2009-01-01
In 1977, Taylor proposed a constant elasticity model relating capacity choice in mines to reserves. A test of this model using a very large (n = 1,195) dataset confirms its validity but obtains significantly different estimated values for the model coefficients. Capacity is somewhat inelastic with respect to reserves, with an elasticity of 0.65 estimated for open-pit plus block-cave underground mines and 0.56 for all other underground mines. These new estimates should be useful for capacity determinations in scoping studies and as a starting point for feasibility studies. The results are robust over a wide range of deposit types, deposit sizes, and time, consistent with physical constraints on mine capacity that are largely independent of technology. © 2009 International Association for Mathematical Geology.
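Taylor's model and its re-estimation reduce to an ordinary least-squares fit in log-log space, C = a * R^b. The sketch below shows the procedure on synthetic reserve/capacity pairs; only the method, not the data or the published coefficients, is reproduced.

```python
# A hedged sketch of re-estimating a constant-elasticity model
# C = a * R^b by OLS in log-log space. Synthetic data only.
import numpy as np

rng = np.random.default_rng(9)
reserves = 10 ** rng.uniform(5, 9, size=400)              # tonnes
capacity = 0.02 * reserves ** 0.65 * rng.lognormal(0, 0.3, 400)

b, log_a = np.polyfit(np.log(reserves), np.log(capacity), 1)
print(f"elasticity b = {b:.2f}, scale a = {np.exp(log_a):.3g}")
```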
Nahar, Jesmin; Imam, Tasadduq; Tickle, Kevin S; Garcia-Alonso, Debora
2013-01-01
This chapter is a review of data mining techniques used in medical research. It covers the existing applications of these techniques in the identification of diseases, and also presents the authors' research experiences in medical disease diagnosis and analysis. A computational diagnosis approach can have a significant impact on accurate diagnosis and result in time- and cost-effective solutions. The chapter begins with an overview of computational intelligence concepts, followed by details on different classification algorithms. Use of association learning, a well recognised data mining procedure, is also discussed. Many of the datasets considered in existing medical data mining research are imbalanced, and the chapter focuses on this issue as well. Lastly, the chapter outlines the need for data governance in this research domain.
Aggregating todays data for tomorrows science: a geological use case
NASA Astrophysics Data System (ADS)
Glaves, H.; Kingdon, A.; Nayembil, M.; Baker, G.
2016-12-01
Geoscience data are made up of diverse and complex smaller datasets that, when aggregated together, build towards what is recognised as 'big data'. The British Geological Survey (BGS), which acts as a repository for all subsurface data from the United Kingdom, has been collating these disparate small datasets, accumulated from the activities of a large number of geoscientists over many years. Recently this picture has been further complicated by the addition of new data sources such as near real-time sensor data, and industry or community data that is increasingly delivered via automatic donations. Many of these datasets have been aggregated in relational databases to form larger ones that are used to address a variety of issues, ranging from the development of national infrastructure to disaster response. These complex, domain-specific SQL databases deliver effective data management using normalised, subject-based database designs in a secure environment. However, the isolated, subject-oriented design of these systems inhibits efficient cross-domain querying of the datasets. Additionally, the tools provided often do not enable effective data discovery, as they have problems resolving the complex underlying normalised structures. Recent requirements to understand sub-surface geology in three dimensions have led BGS to develop new data systems. One such solution is PropBase, which delivers a generic, denormalised data structure within an RDBMS to store geological property data. PropBase facilitates rapid and standardised data discovery and access, incorporating 2D and 3D physical and chemical property data with associated metadata. It also provides a dedicated web interface to deliver multiple complex datasets from a single database in standardised, common output formats (e.g. CSV, GIS shape files) without the need for complex data conditioning. PropBase facilitates new scientific research, previously considered impractical, by enabling property data searches across multiple databases. Using the PropBase exemplar, this presentation will illustrate how BGS has developed systems for aggregating 'small datasets' to create the 'big data' necessary for the data analytics, mining, processing and visualisation needed for future geoscientific research.
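The denormalized design behind a PropBase-style store can be suggested in a few lines via Python's sqlite3: a single generic property table lets one query span sources that would otherwise live in separate normalized schemas. The schema, column names, and values below are illustrative assumptions, not BGS's.

```python
# A hedged, minimal sketch of a denormalised property store: one
# generic table holds heterogeneous property observations so a single
# query spans sources. Schema and values are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE properties (
    source TEXT, x REAL, y REAL, depth_m REAL,
    property TEXT, value REAL, unit TEXT)""")
rows = [
    ("borehole_db", -1.5, 52.0, 35.0, "porosity", 0.18, "fraction"),
    ("lab_db",      -1.5, 52.0, 35.0, "density",  2.45, "g/cm3"),
    ("sensor_net",  -1.6, 52.1,  2.0, "temperature", 11.2, "degC"),
]
con.executemany("INSERT INTO properties VALUES (?,?,?,?,?,?,?)", rows)

# One cross-domain query instead of three schema-specific ones.
for row in con.execute(
        "SELECT source, property, value, unit FROM properties "
        "WHERE depth_m < 50 ORDER BY property"):
    print(row)
```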
CANFAR + Skytree: Mining Massive Datasets as an Essential Part of the Future of Astronomy
NASA Astrophysics Data System (ADS)
Ball, Nicholas M.
2013-01-01
The future study of large astronomical datasets, consisting of hundreds of millions to billions of objects, will be dominated by large computing resources, and by analysis tools of the necessary scalability and sophistication to extract useful information. Significant effort will be required to fulfil the potential of these datasets as providers of the next generation of science results. To date, computing systems have allowed either sophisticated analysis of small datasets, e.g., most astronomy software, or simple analysis of large datasets, e.g., database queries. At the Canadian Astronomy Data Centre, we have combined our cloud computing system, the Canadian Advanced Network for Astronomical Research (CANFAR), with the world's most advanced machine learning software, Skytree, to create the world's first cloud computing system for data mining in astronomy. This allows the full sophistication of the huge fields of data mining and machine learning to be applied to the hundreds of millions of objects that make up current large datasets. CANFAR works by utilizing virtual machines, which appear to the user as equivalent to a desktop. Each machine is replicated as desired to perform large-scale parallel processing. Such an arrangement carries far more flexibility than other cloud systems, because it enables the user to immediately install and run the same code that they already utilize for science on their desktop. We demonstrate the utility of the CANFAR + Skytree system by showing science results obtained, including assigning photometric redshifts with full probability density functions (PDFs) to a catalog of approximately 133 million galaxies from the MegaPipe reductions of the Canada-France-Hawaii Telescope Legacy Wide and Deep surveys. Each PDF is produced nonparametrically from 100 instances of the photometric parameters for each galaxy, generated by perturbing within the errors on the measurements. Hence, we produce, store, and assign redshifts to a catalog of over 13 billion object instances. This catalog is comparable in size to those expected from next-generation surveys, such as the Large Synoptic Survey Telescope. The CANFAR + Skytree system is open for use by any interested member of the astronomical community.
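The PDF construction described above is straightforward to sketch: perturb each object's photometry within its errors, push every instance through a trained regressor, and histogram the predictions. The model, bands, and error sizes below are synthetic stand-ins for the MegaPipe catalog and the CANFAR + Skytree machinery.

```python
# A hedged sketch of per-object photo-z PDFs via error perturbation.
# Training data and regressor are synthetic, not the survey's.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
mags = rng.uniform(18, 24, size=(2000, 5))             # 5 bands
z = 0.1 * (mags[:, 0] - mags[:, 4]) + 0.05 * mags.mean(axis=1)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(mags, z)

def photoz_pdf(mag, mag_err, n_instances=100,
               bins=np.linspace(0, 3, 61)):
    # Generate perturbed instances within the photometric errors.
    instances = rng.normal(mag, mag_err, size=(n_instances, len(mag)))
    z_samples = model.predict(instances)
    pdf, _ = np.histogram(z_samples, bins=bins, density=True)
    return pdf

pdf = photoz_pdf(mags[0], mag_err=np.full(5, 0.05))
```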
Data-Oriented Astrophysics at NOAO: The Science Archive & The Data Lab
NASA Astrophysics Data System (ADS)
Juneau, Stephanie; NOAO Data Lab, NOAO Science Archive
2018-06-01
As we keep progressing into an era of increasingly large astronomy datasets, NOAO's data-oriented mission is growing in prominence. The NOAO Science Archive, which captures and processes the pixel data from mountaintops in Chile and Arizona, now contains holdings at Petabyte scales. Working at the intersection of astronomy and data science, the NOAO Data Lab's main goal is to provide users with a suite of tools to work close to this data, the catalogs derived from it, and externally provided datasets, and thus optimize the scientific productivity of the astronomy community. These tools and services include databases, query tools, virtual storage space, workflows through our Jupyter Notebook server, and scripted analysis. We currently host datasets from NOAO facilities such as the Dark Energy Survey (DES), the DESI imaging Legacy Surveys (LS), the Dark Energy Camera Plane Survey (DECaPS), and the nearly all-sky NOAO Source Catalog (NSC). We are further preparing for large spectroscopy datasets such as DESI. After a brief overview of the Science Archive, the Data Lab and the datasets, I will briefly showcase scientific applications that make use of our data holdings. Lastly, I will describe our vision for future developments as we tackle the next technical and scientific challenges.
The LSST Data Mining Research Agenda
NASA Astrophysics Data System (ADS)
Borne, K.; Becla, J.; Davidson, I.; Szalay, A.; Tyson, J. A.
2008-12-01
We describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. The data mining research agenda includes: scalability (at petabyte scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; design of a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for exploration of petascale databases; indexing of multi-attribute, multi-dimensional astronomical databases (beyond spatial indexing) for rapid querying of petabyte databases; and more.
Data quality enhancement and knowledge discovery from relevant signals in acoustic emission
NASA Astrophysics Data System (ADS)
Mejia, Felipe; Shyu, Mei-Ling; Nanni, Antonio
2015-10-01
The increasing popularity of structural health monitoring has brought with it a growing need for automated data management and data analysis tools. Of great importance are filters that can systematically detect unwanted signals in acoustic emission datasets. This study presents a semi-supervised data mining scheme that detects data belonging to unfamiliar distributions. This type of outlier detection scheme is useful for detecting the presence of new acoustic emission sources, given a training dataset of unwanted signals. In addition to classifying new observations (herein referred to as "outliers") within a dataset, the scheme generates a decision tree that classifies sub-clusters within the outlier context set. The obtained tree can be interpreted as a series of characterization rules for newly observed data, and they can potentially describe the basic structure of different modes within the outlier distribution. The data mining scheme is first validated on a synthetic dataset, and an attempt is made to confirm the algorithm's ability to discriminate outlier acoustic emission sources from a controlled pencil-lead-break experiment. Finally, the scheme is applied to data from two fatigue crack-growth steel specimens, where it is shown that the extracted rules can adequately describe crack-growth-related acoustic emission sources while filtering out background "noise." Results show promising performance in filter generation, thereby allowing analysts to extract, characterize, and focus only on meaningful signals.
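A hedged sketch of the two-stage idea follows: a novelty detector trained only on the unwanted signals flags observations from unfamiliar distributions, and an interpretable decision tree is then fit to sub-clusters of those outliers to yield characterization rules. The one-class SVM, the k-means step, and the synthetic AE features are stand-ins for the paper's specific choices.

```python
# A hedged sketch of semi-supervised outlier detection plus rule
# extraction. AE features (amplitude, duration, counts) are synthetic.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(11)
noise = rng.normal(0, 1, size=(500, 3))        # training: unwanted signals
new = np.vstack([rng.normal(0, 1, (200, 3)),   # mixed new observations
                 rng.normal(4, 0.5, (60, 3)),
                 rng.normal(-4, 0.5, (40, 3))])

detector = OneClassSVM(nu=0.05, gamma="scale").fit(noise)
outliers = new[detector.predict(new) == -1]    # unfamiliar distributions

# Characterize outlier sub-clusters with an interpretable tree.
labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(outliers)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(outliers, labels)
print(export_text(tree, feature_names=["amplitude", "duration", "counts"]))
```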
Extracting Cross-Ontology Weighted Association Rules from Gene Ontology Annotations.
Agapito, Giuseppe; Milano, Marianna; Guzzi, Pietro Hiram; Cannataro, Mario
2016-01-01
Gene Ontology (GO) is a structured repository of concepts (GO terms) that are associated to one or more gene products through a process referred to as annotation. The analysis of annotated data is an important opportunity for bioinformatics. There are different approaches to analysis; among them, the use of association rules (AR) provides useful knowledge by discovering biologically relevant associations between terms of GO that were not previously known. In a previous work, we introduced GO-WAR (Gene Ontology-based Weighted Association Rules), a methodology for extracting weighted association rules from ontology-based annotated datasets. We here adapt the GO-WAR algorithm to mine cross-ontology association rules, i.e., rules that involve GO terms present in the three sub-ontologies of GO. We conduct a deep performance evaluation of GO-WAR by mining publicly available GO annotated datasets, showing how GO-WAR outperforms current state-of-the-art approaches.
Code of Federal Regulations, 2012 CFR
2012-07-01
... independent contractors who are engaged in mining operations; and (ii) Any construction worker who is exposed to hazards of mining operations. (2) The definition of “miner” does not include scientific workers; delivery workers; customers (including commercial over-the-road truck drivers); vendors; or visitors. This...
Code of Federal Regulations, 2013 CFR
2013-07-01
... independent contractors who are engaged in mining operations; and (ii) Any construction worker who is exposed to hazards of mining operations. (2) The definition of “miner” does not include scientific workers; delivery workers; customers (including commercial over-the-road truck drivers); vendors; or visitors. This...
Code of Federal Regulations, 2014 CFR
2014-07-01
... independent contractors who are engaged in mining operations; and (ii) Any construction worker who is exposed to hazards of mining operations. (2) The definition of “miner” does not include scientific workers; delivery workers; customers (including commercial over-the-road truck drivers); vendors; or visitors. This...
NASA Astrophysics Data System (ADS)
Versteeg, R. J.; Johnson, T.; Henrie, A.; Johnson, D.
2013-12-01
The Hanford 300 Area, located adjacent to the Columbia River in south-central Washington, USA, is the site of former research and uranium fuel rod fabrication facilities. Waste disposal practices at the site included discharging between 33 and 59 metric tons of uranium over a 40-year period into shallow infiltration galleries, resulting in persistent uranium contamination within the vadose and saturated zones. Uranium transport from the vadose zone to the saturated zone is intimately linked with water table fluctuations and river water driven by upstream dam operations. Different remedial efforts have occurred at the site to address uranium contamination. Numerous investigations are occurring at the site, both to investigate remedial performance and to increase the understanding of uranium dynamics. Several of these studies include the acquisition of large hydrological and time-lapse electrical geophysical datasets. Such datasets contain large amounts of information on hydrological processes. There are substantial challenges in how to effectively deal with the data volumes of such datasets, how to process them, and how to provide users with the ability to effectively access and synergize the hydrological information contained in raw and processed data. These challenges motivated the development of a cloud-based cyberinfrastructure for dealing with large electrical hydrogeophysical datasets. This cyberinfrastructure is modular and extensible and includes data management, data processing, visualization and result mining capabilities. Specifically, it provides for data transmission to a central server, data parsing in a relational database, and processing of the data using a PNNL-developed parallel inversion code on either dedicated or commodity compute clusters. Access to results is done through a browser, with interactive tools allowing for generation of on-demand visualization of the inversion results as well as interactive data mining and statistical calculation. This infrastructure was used for the acquisition and processing of an electrical geophysical time-lapse survey collected over a highly instrumented field site in the Hanford 300 Area. Over a 13-month period between November 2011 and December 2012, 1,823 time-lapse datasets were collected (roughly 5 datasets a day, for a total of 23 million individual measurements) on three parallel resistivity lines of 30 m each with 0.5-meter electrode spacing. In addition, hydrological and environmental data were collected from dedicated and general purpose sensors. This dataset contains rich information on near-surface processes on a range of spatial and temporal scales (from hourly to seasonal). We will show how this cyberinfrastructure was used to manage and process this dataset and how it can be used to access, mine and visualize the resulting data and information.
Functional evaluation of out-of-the-box text-mining tools for data-mining tasks.
Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H
2015-01-01
The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications. We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population-level analyses can advance the adoption of NLP in practice. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.
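The speed/simplicity end of the trade-off, dictionary-based term recognition, can be shown in a few lines: a greedy longest-match scan of the token stream against a term dictionary. The sketch below is generic; the two terms and concept IDs are invented, and the NCBO Annotator's actual pipeline is considerably richer.

```python
# A hedged sketch of dictionary-based term recognition via greedy
# longest-match lookup. The dictionary entries are illustrative.
def annotate(text, dictionary, max_len=4):
    tokens = [t.strip(".,;:()") for t in text.lower().split()]
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                hits.append((i, phrase, dictionary[phrase]))
                i += n
                break
        else:
            i += 1  # no phrase starts here; advance one token
    return hits

terms = {"type 2 diabetes": "C0011860", "metformin": "C0025598"}
print(annotate("Patient with type 2 diabetes started on metformin.",
               terms))
```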
Gurbaxani, Brian M; Jones, James F; Goertzel, Benjamin N; Maloney, Elizabeth M
2006-04-01
To provide a mathematical introduction to the Wichita (KS, USA) clinical dataset, which is all of the nongenetic data (no microarray or single nucleotide polymorphism data) from the 2-day clinical evaluation, and to show the preliminary findings and limitations of popular, matrix algebra-based data mining techniques. An initial matrix of 440 variables by 227 human subjects was reduced to 183 variables by 164 subjects. Variables were excluded that strongly correlated with chronic fatigue syndrome (CFS) case classification by design (for example, the multidimensional fatigue inventory [MFI] data), that were otherwise self-reporting in nature and also tended to correlate strongly with CFS classification, or that were sparse or nonvarying between case and control. Subjects were excluded if they did not clearly fall into well-defined CFS classifications, had comorbid depression with melancholic features, or met other medical or psychiatric exclusions. The popular data mining techniques principal components analysis (PCA) and linear discriminant analysis (LDA) were used to determine how well the data separated into groups. Two different feature selection methods helped identify the most discriminating parameters. Although purely biological features (variables) were found to separate CFS cases from controls, including many allostatic load and sleep-related variables, most parameters were not statistically significant individually. However, biological correlates of CFS, such as heart rate and heart rate variability, require further investigation. Feature selection of a limited number of variables from the purely biological dataset produced better separation between groups than a PCA of the entire dataset. Feature selection highlighted the importance of many of the allostatic load variables studied in more detail by Maloney and colleagues in this issue [1], as well as some sleep-related variables. Nonetheless, matrix linear algebra-based data mining approaches appeared to be of limited utility when compared with more sophisticated nonlinear analyses on richer data types, such as those found in Maloney and colleagues [1] and Goertzel and colleagues [2] in this issue.
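The matrix-algebra pipeline used in the study is easy to outline with scikit-learn: project the variables onto principal components, then test case/control separation with LDA. The simulated 164 x 183 matrix below mimics only the dataset's shape; the signal injected into it is an assumption.

```python
# A hedged sketch of the PCA + LDA separation analysis. The clinical
# matrix is simulated; only its 164 x 183 shape matches the study.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = rng.normal(size=(164, 183))               # subjects x variables
y = rng.integers(0, 2, size=164)              # CFS case vs control
X[y == 1, :5] += 0.8                          # weak injected signal

scores = PCA(n_components=10).fit_transform(X)
lda = LinearDiscriminantAnalysis().fit(scores, y)
print(f"in-sample separation accuracy: {lda.score(scores, y):.2f}")
```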
Sampling and monitoring for closure
McLemore, V.T.; Russell, C.C.; Smith, K.S.
2004-01-01
The Metals Mining Sector of the Acid Drainage Technology Initiative (ADTI-MMS) addresses technical drainage-quality issues related to metal mining and related metallurgical operations, for future and active mines as well as for historical mines and mining districts. One of the first projects of ADTI-MMS is to develop a handbook describing best practices for sampling, monitoring, predicting, mitigating, and modeling drainage from metal mines, pit lakes, and related metallurgical facilities, based upon current science and engineering. One of the important aspects of planning a new mine in today's regulatory environment is the philosophy of designing a new or existing mine, or an expansion of operations, for ultimate closure. The holistic philosophy taken in the ADTI-MMS handbook maintains that sampling and monitoring programs should be designed to take into account all aspects of the mine-life cycle. Data required for the closure of the operation are obtained throughout the mine-life cycle, from exploration through post-closure.
Astroinformatics, data mining and the future of astronomical research
NASA Astrophysics Data System (ADS)
Brescia, Massimo; Longo, Giuseppe
2013-08-01
Astronomy, like many other scientific disciplines, is facing a true data deluge that is bound to change both the praxis and the methodology of everyday research work. The emerging field of astroinformatics, while on the one hand crucial to facing the technological challenges, on the other is opening new and exciting perspectives for astronomical discovery through the implementation of advanced data mining procedures. The complexity of astronomical data and the variety of scientific problems, however, call for innovative algorithms and methods as well as for an extreme usage of ICT technologies.
Uvf - Unified Volume Format: A General System for Efficient Handling of Large Volumetric Datasets.
Krüger, Jens; Potter, Kristin; Macleod, Rob S; Johnson, Christopher
2008-01-01
With the continual increase in computing power, volumetric datasets with sizes ranging from only a few megabytes to petascale are generated thousands of times per day. Such data may come from an ordinary source such as simple everyday medical imaging procedures, while larger datasets may be generated from cluster-based scientific simulations or measurements of large scale experiments. In computer science an incredible amount of work worldwide is put into the efficient visualization of these datasets. As researchers in the field of scientific visualization, we often have to face the task of handling very large data from various sources. This data usually comes in many different data formats. In medical imaging, the DICOM standard is well established, however, most research labs use their own data formats to store and process data. To simplify the task of reading the many different formats used with all of the different visualization programs, we present a system for the efficient handling of many types of large scientific datasets (see Figure 1 for just a few examples). While primarily targeted at structured volumetric data, UVF can store just about any type of structured and unstructured data. The system is composed of a file format specification with a reference implementation of a reader. It is not only a common, easy to implement format but also allows for efficient rendering of most datasets without the need to convert the data in memory.
Fault Tolerant Frequent Pattern Mining
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shohdy, Sameh; Vishnu, Abhinav; Agrawal, Gagan
The FP-Growth algorithm is a Frequent Pattern Mining (FPM) algorithm that has been extensively used to study correlations and patterns in large scale datasets. While several researchers have designed distributed memory FP-Growth algorithms, it is pivotal to consider fault tolerant FP-Growth, which can address the increasing fault rates in large scale systems. In this work, we propose a novel parallel, algorithm-level fault-tolerant FP-Growth algorithm. We leverage algorithmic properties and MPI advanced features to guarantee an O(1) space complexity, achieved by using the dataset memory space itself for checkpointing. We also propose a recovery algorithm that can use in-memory and disk-based checkpointing, though in many cases the recovery can be completed without any disk access, incurring no memory overhead for checkpointing. We evaluate our FT algorithm on a large scale InfiniBand cluster with several large datasets using up to 2K cores. Our evaluation demonstrates excellent efficiency for checkpointing and recovery in comparison to the disk-based approach. We have also observed 20x average speed-up in comparison to Spark, establishing that a well designed algorithm can easily outperform a solution based on a general fault-tolerant programming model.
NASA Astrophysics Data System (ADS)
Aldeen Yousra, S.; Mazleena, Salleh
2018-05-01
Recent advancements in Information and Communication Technologies (ICT) have increased the demand for cloud services that share users' private data. Data from various organizations are a vital information source for analysis and research. Generally, this sensitive or private information involves medical, census, voter registration, social network, and customer-service data. A primary concern of cloud service providers in data publishing is to hide the sensitive information of individuals. One of the cloud services that addresses these confidentiality concerns is Privacy Preserving Data Mining (PPDM). The PPDM service in Cloud Computing (CC) enables data publishing with minimized distortion and absolute privacy. In this method, datasets are anonymized via generalization to satisfy the privacy requirements. However, the well-known privacy preserving data mining technique called K-anonymity suffers from several limitations. To surmount those shortcomings, we propose a new heuristic anonymization framework for preserving the privacy of sensitive datasets when publishing on the cloud. The advantages of the K-anonymity, L-diversity and (α, k)-anonymity methods for efficient information utilization and privacy protection are emphasized. Experimental results revealed that the developed technique outperforms K-anonymity, L-diversity, and (α, k)-anonymity measures.
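The generalization step behind k-anonymity can be sketched briefly: coarsen quasi-identifiers (age bands, ZIP prefixes) until every combination covers at least k records. The columns, generalization rules, and k below are illustrative, not the proposed framework's heuristics.

```python
# A hedged sketch of k-anonymity via generalization of quasi-
# identifiers. Data and rules are illustrative.
import pandas as pd

df = pd.DataFrame({
    "age":  [23, 27, 31, 34, 36, 38, 52, 55, 57, 59],
    "zip":  ["12345", "12349", "12355", "12359", "12361",
             "12365", "12411", "12419", "12423", "12429"],
    "diagnosis": list("ABABABABAB"),           # sensitive attribute
})

def generalize(df):
    out = df.copy()
    out["age"] = (out["age"] // 10 * 10).astype(str) + "s"  # age bands
    out["zip"] = out["zip"].str[:3] + "**"                  # zip prefix
    return out

def is_k_anonymous(df, quasi, k):
    # Every quasi-identifier combination must cover at least k rows.
    return df.groupby(quasi).size().min() >= k

anon = generalize(df)
print(is_k_anonymous(anon, ["age", "zip"], k=2))
```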
Meneghetti, Natascia; Facco, Pierantonio; Bezzo, Fabrizio; Himawan, Chrismono; Zomer, Simeone; Barolo, Massimiliano
2016-05-30
In this proof-of-concept study, a methodology is proposed to systematically analyze large data historians of secondary pharmaceutical manufacturing systems using data mining techniques. The objective is to develop an approach that automatically retrieves operation-relevant information to assist management in the periodic review of a manufacturing system. The proposed methodology automatically performs three tasks: the identification of single batches within the entire data sequence of the historical dataset, the identification of distinct operating phases within each batch, and the characterization of a batch with respect to an assigned multivariate set of operating characteristics. The approach is tested on a six-month dataset of a commercial-scale granulation/drying system, where several million data entries are recorded. The quality of the results and the generality of the approach indicate that there is strong potential for extending the method to even larger historical datasets and to different operations, thus making it an advanced PAT tool that can assist the implementation of continual-improvement paradigms within a quality-by-design framework. Copyright © 2016 Elsevier B.V. All rights reserved.
Enrichment of OpenStreetMap Data Completeness with Sidewalk Geometries Using Data Mining Techniques.
Mobasheri, Amin; Huang, Haosheng; Degrossi, Lívia Castro; Zipf, Alexander
2018-02-08
Tailored routing and navigation services for wheelchair users require certain information about sidewalk geometries and their attributes to work efficiently. Except for a few regions/cities, such detailed information is not present in current versions of crowdsourced mapping databases, including OpenStreetMap. The CAP4Access European project aimed to use (and enrich) OpenStreetMap to make it fit for the purpose of wheelchair routing. In this respect, this study presents a modified methodology based on data mining techniques for constructing sidewalk geometries using multiple GPS traces collected by wheelchair users during an urban travel experiment. The derived sidewalk geometries can be used to enrich OpenStreetMap to support wheelchair routing. The proposed method was applied to a case study in Heidelberg, Germany. The constructed sidewalk geometries were compared to an official reference dataset ("ground truth dataset"). The case study shows that the constructed sidewalk network overlays 96% of the official reference dataset. Furthermore, in terms of positional accuracy, a low Root Mean Square Error (RMSE) value (0.93 m) is achieved. The article presents our discussion of the results as well as the conclusion and future research directions.
Data publication, documentation and user friendly landing pages - improving data discovery and reuse
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Ulbricht, Damian; Bertelmann, Roland
2016-04-01
Research data are the basis for scientific research and are often irreplaceable (e.g. observational data). Storage of such data in appropriate, theme-specific or institutional repositories is an essential part of ensuring their long-term preservation and access. Free and open access to research data for reuse and scrutiny has been identified as a key issue by the scientific community as well as by research agencies and the public. To make datasets intelligible and usable for others, they must be accompanied by comprehensive data descriptions and standardized metadata for data discovery, and they should ideally be published using a digital object identifier (DOI). DOIs make datasets citable, ensure their long-term accessibility, and are accepted in reference lists of journal articles (http://www.copdess.org/statement-of-commitment/). The GFZ German Research Centre for Geosciences is the national laboratory for geosciences in Germany and part of the Helmholtz Association, Germany's largest scientific organization. The development and maintenance of data systems is a key component of 'GFZ Data Services' to support state-of-the-art research. The datasets archived in and published by the GFZ Data Repository cover all geoscientific disciplines and range from large dynamic datasets deriving from global monitoring seismic or geodetic networks with real-time data acquisition, to remotely sensed satellite products, to automatically generated data publications from a database of micro-meteorological stations, to various model results, to geochemical and rock-mechanical analyses from various labs, and field observations. The user-friendly presentation of published datasets via a DOI landing page is as important for reuse as the storage itself, and the required information is highly specific to each scientific discipline. If dataset descriptions are too general, or require the download of a dataset before its suitability can be judged, researchers often decide not to reuse a published dataset. In contrast to large data repositories without thematic specification, theme-specific data repositories have extensive expertise in data discovery and the opportunity to develop usable, discipline-specific formats and layouts for specific datasets, including advice on different formats for the data description (e.g., via a Data Report or an article in a Data Journal) with full consideration of international metadata standards.
Integrating Entropy and Closed Frequent Pattern Mining for Social Network Modelling and Analysis
NASA Astrophysics Data System (ADS)
Adnan, Muhaimenul; Alhajj, Reda; Rokne, Jon
The recent increase in explicitly available social networks has attracted the attention of the research community to investigate how it might be possible to benefit from such a powerful model in producing effective solutions for problems in other domains where the social network is implicit; we argue that social networks do exist around us, but the key issue is how to realize and analyze them. This chapter presents a novel approach for constructing a social network model via an integrated framework that first prepares the data to be analyzed and then applies entropy and frequent closed pattern mining for network construction. For a given problem, we first prepare the data by identifying items and transactions, which are the basic ingredients for frequent closed pattern mining. Items are the main objects in the problem, and a transaction is a set of items that could exist together at one time (e.g., items purchased in one visit to the supermarket). Transactions can be analyzed to discover frequent closed patterns using any of the well-known techniques. Frequent closed patterns have the advantage that they successfully capture the inherent information content of the dataset and are applicable to a broader set of domains. Entropies of the frequent closed patterns are used to keep the dimensionality of the feature vectors to a reasonable size; this is a kind of feature reduction process. Finally, we analyze the dynamic behavior of the constructed social network. Experiments were conducted on a synthetic dataset and on the Enron corpus email dataset. The results presented in the chapter show that social networks extracted from a feature set of frequent closed patterns successfully carry the community structure information. Moreover, for the Enron email dataset, we present an analysis to dynamically indicate deviations from each user's individual and community profile. These indications of deviations can be very useful to identify unusual events.
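One plausible reading of the entropy-based feature reduction step is sketched below: each frequent closed pattern is scored by the Shannon entropy of its presence/absence across transactions, and only informative (mid-frequency) patterns are kept as features. The data and threshold are invented for illustration.

```python
import math

def pattern_entropy(pattern, transactions):
    """Shannon entropy of a pattern's presence/absence across transactions;
    patterns occurring in almost none or almost all transactions carry
    little information and can be dropped."""
    p = sum(1 for t in transactions if pattern <= set(t)) / len(transactions)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

emails = [{"alice", "bob"}, {"alice"}, {"alice", "bob", "carol"}, {"bob"}]
closed_patterns = [{"alice"}, {"alice", "bob"}, {"carol"}]
features = [f for f in closed_patterns if pattern_entropy(f, emails) >= 0.9]
print(features)  # only {'alice', 'bob'} (p = 0.5, entropy = 1.0) survives
```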
A Survey of Bioinformatics Database and Software Usage through Mining the Literature.
Duck, Geraint; Nenadic, Goran; Filannino, Michele; Brass, Andy; Robertson, David L; Stevens, Robert
2016-01-01
Computer-based resources are central to much, if not most, biological and medical research. However, while there is an ever-expanding choice of bioinformatics resources to use, described within the biomedical literature, little work to date has evaluated the full range of availability or levels of usage of database and software resources. Here we use text mining to process the PubMed Central full-text corpus, identifying mentions of databases or software within the scientific literature. We provide an audit of the resources contained within the biomedical literature, and a comparison of their relative usage, both over time and between the sub-disciplines of bioinformatics, biology and medicine. We find that trends in resource usage differ between these domains. The bioinformatics literature emphasises novel resource development, while database and software usage within biology and medicine is more stable and conservative. Many resources are only mentioned in the bioinformatics literature, with a relatively small number making it out into general biology, and fewer still into the medical literature. In addition, many resources are seeing a steady decline in their usage (e.g., BLAST, SWISS-PROT), though some are instead seeing rapid growth (e.g., the GO, R). We find a striking imbalance in resource usage, with the top 5% of resource names (133 names) accounting for 47% of total usage, and over 70% of extracted resources being mentioned only once each. While these results highlight the dynamic and creative nature of bioinformatics research, they raise questions about software reuse, choice and the sharing of bioinformatics practice. Is it acceptable that so many resources are apparently never reused? Finally, our work is a step towards automated extraction of scientific method from text. We make the dataset generated by our study available under the CC0 license here: http://dx.doi.org/10.6084/m9.figshare.1281371.
TERATOLOGY v2.0 – building a path forward
Unraveling the complex relationships between environmental factors and early life susceptibility in assessing the risk for adverse pregnancy outcomes requires advanced knowledge of biological systems. Large datasets and deep data-mining tools are emerging resources for predictive...
Data Mining at NASA: From Theory to Applications
NASA Technical Reports Server (NTRS)
Srivastava, Ashok N.
2009-01-01
This slide presentation demonstrates the data mining/machine learning capabilities of NASA Ames and the Intelligent Data Understanding (IDU) group, encompassing recent work by various group members. The IDU group develops novel algorithms to detect, classify, and predict events in large data streams for scientific and engineering systems. The presentation was prepared for Knowledge Discovery and Data Mining 2009.
Mining of high utility-probability sequential patterns from uncertain databases
Zhang, Binbin; Fournier-Viger, Philippe; Li, Ting
2017-01-01
High-utility sequential pattern mining (HUSPM) has become an important issue in the field of data mining. Several HUSPM algorithms have been designed to mine high-utility sequential patterns (HUSPs). They have been applied in several real-life situations, such as consumer behavior analysis and event detection in sensor networks. Nonetheless, most studies on HUSPM have focused on mining HUSPs in precise data. But in real life, uncertainty is an important factor, as data are collected using various types of sensors that are more or less accurate. Hence, data collected in a real-life database can be annotated with existence probabilities. This paper presents a novel pattern mining framework called high utility-probability sequential pattern mining (HUPSPM) for mining high utility-probability sequential patterns (HUPSPs) in uncertain sequence databases. A baseline algorithm with three optional pruning strategies is presented to mine HUPSPs. Moreover, to speed up the mining process, a projection mechanism is designed to create a database projection for each processed sequence, which is smaller than the original database. Thus, the number of unpromising candidates can be greatly reduced, as well as the execution time for mining HUPSPs. Substantial experiments on both real-life and synthetic datasets show that the designed algorithm performs well in terms of runtime, number of candidates, memory usage, and scalability for different minimum utility and minimum probability thresholds. PMID:28742847
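The double-threshold idea can be sketched as follows. This is an editor's simplification: utility is taken from the leftmost occurrence of the pattern, whereas full HUSPM maximizes over all occurrences, and the paper's pruning strategies and projection mechanism are omitted.

```python
def sequence_utility(seq, pattern):
    """Utility of the leftmost in-order occurrence of pattern in seq,
    or None if the pattern does not occur (a simplification)."""
    total, idx = 0, 0
    for item, util in seq:
        if idx < len(pattern) and item == pattern[idx]:
            total += util
            idx += 1
    return total if idx == len(pattern) else None

def is_hupsp(db, pattern, min_util, min_prob):
    """A pattern qualifies if its summed utility and summed existence
    probability over supporting sequences reach both thresholds."""
    total_util, total_prob = 0, 0.0
    for seq, prob in db:
        u = sequence_utility(seq, pattern)
        if u is not None:
            total_util += u
            total_prob += prob
    return total_util >= min_util and total_prob >= min_prob

# each entry: (sequence of (item, utility), existence probability)
db = [([("a", 2), ("b", 3), ("c", 1)], 0.9),
      ([("a", 1), ("c", 4)], 0.6),
      ([("b", 2), ("a", 2), ("c", 2)], 0.8)]
print(is_hupsp(db, ("a", "c"), min_util=8, min_prob=1.2))  # True
```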
75 FR 33816 - Center for Scientific Review; Notice of Closed Meetings
Federal Register 2010, 2011, 2012, 2013, 2014
2010-06-15
... Scientific Review Special Emphasis Panel; Small Business: Computational Biology, Image Processing, and Data Mining. Date: July 21, 2010. Time: 8 a.m. to 6 p.m. Agenda: To review and evaluate grant applications...
Federal Register 2010, 2011, 2012, 2013, 2014
2013-08-21
... Scientific Counselors, National Institute for Occupational Safety and Health (BSC, NIOSH) In accordance with... demonstrations relating to occupational safety and health and to mine health. The Board of Scientific Counselors shall provide guidance to the Director, National Institute for Occupational Safety and Health on...
Bomb Strike Experiment for Mine Clearance Operations
2006-03-01
avenue of trade is responsible for 25 percent of the U.S. Gross Domestic Product (GDP), up from 11 percent in 1970, and experts agree that this figure...of Recent Changes and Developments. Defense Scientific Establishment, Auckland, New Zealand, Report 149. Johnson, S., Damn The Mines --- Full Speed
Code of Federal Regulations, 2010 CFR
2010-07-01
... working in an underground mine and who is engaged in the extraction and production process, or engaged in... workers, such as drillers and blasters, who are engaged in the extraction and production process or... working in an underground mine, including any delivery, office, or scientific worker or occasional, short...
Code of Federal Regulations, 2010 CFR
2010-07-01
... underground mine and who is engaged in the extraction and production process, or engaged in shaft or slope... and blasters, who are engaged in the extraction and production process or engaged in shaft or slope... surface mine, including any delivery, office, or scientific worker or occasional, short-term maintenance...
Literature Mining Methods for Toxicology and Construction of ...
Webinar presentation on text-mining methodologies in use at NCCT and how they can be used to assist with the OECD Retinoid project. Presented to the 1st Workshop/Scientific Expert Group meeting on the OECD Retinoid Project, April 26, 2016, Brussels; presented remotely via web.
Teaching the Scientific Method: It's All in the Perspective
ERIC Educational Resources Information Center
Ayers, James M.; Ayers, Kathleen M.
2007-01-01
A three unit module of inquiry, including morphological comparison, cladogram construction, and data mining has been developed to teach students the nature of experimental science. Students generate angiosperm morphological data, form cladistic hypotheses, then mine taxonomic, bioinformatic and historical data from many sources to replicate and…
NASA Astrophysics Data System (ADS)
Obulesu, O.; Rama Mohan Reddy, A., Dr; Mahendra, M.
2017-08-01
Detecting regular and efficient cyclic patterns is a demanding activity for data analysts because of the unstructured, dynamic, and enormous raw information produced by the web. Many existing approaches generate large numbers of candidate patterns when faced with huge and complex databases. In this work, two novel algorithms are proposed and a comparative examination is performed considering scalability and performance parameters. The first algorithm, EFPMA (Extended Regular Model Detection Algorithm), is used to find frequent sequential patterns from spatiotemporal datasets, and the second, ETMA (Enhanced Tree-based Mining Algorithm), detects effective cyclic models with a symbolic database representation. EFPMA grows patterns from both ends (prefixes and suffixes) of detected patterns, which results in faster pattern growth because of fewer levels of database projection compared to existing approaches such as PrefixSpan and SPADE. ETMA stores and manages transaction data horizontally using distinct notions such as segments, sequences, and individual symbols, and exploits a partition-and-conquer method to find maximal patterns using symbolic notations. Using this algorithm, cyclic models can be mined in full-series sequential patterns, including subsection series. ETMA reduces memory consumption and makes use of efficient symbolic operations. Furthermore, ETMA records time-series instances dynamically, in terms of character, series and section approaches, respectively. Proving the efficiency of the reduction and retrieval techniques on synthetic and real datasets remains an open and challenging mining problem. These techniques are useful in data streams, traffic risk analysis, medical diagnosis, DNA sequence mining, and earthquake prediction applications. Extensive experimental results illustrate that the algorithms outperform the ECLAT, STNR and MAFIA approaches in efficiency and scalability.
Educational and Scientific Applications of Climate Model Diagnostic Analyzer
NASA Astrophysics Data System (ADS)
Lee, S.; Pan, L.; Zhai, C.; Tang, B.; Kubar, T. L.; Zhang, J.; Bao, Q.
2016-12-01
Climate Model Diagnostic Analyzer (CMDA) is a web-based information system designed for the climate modeling and model analysis community to analyze climate data from models and observations. CMDA provides tools to diagnostically analyze climate data for model validation and improvement, and to systematically manage analysis provenance for sharing results with other investigators. CMDA utilizes cloud computing resources, multi-threaded computing, machine-learning algorithms, web service technologies, and provenance-supporting technologies to address technical challenges that the Earth science modeling and model analysis community faces in evaluating and diagnosing climate models. As CMDA infrastructure and technology have matured, we have developed educational and scientific applications of CMDA. Educationally, CMDA has supported the summer school of the JPL Center for Climate Sciences for three years, starting in 2014. In the summer school, students work on group research projects for which CMDA provides datasets and analysis tools. Each student is assigned a virtual machine with CMDA installed in Amazon Web Services. A provenance management system for CMDA was developed to keep track of students' usage of CMDA and to recommend datasets and analysis tools for their research topics. The provenance system also allows students to revisit their analysis results and share them with their group. Scientifically, we have developed several science use cases of CMDA covering various topics, datasets, and analysis types. Each use case is described in terms of a scientific goal, the datasets used, the analysis tools used, the scientific results discovered, an analysis result such as output plots and data files, and a link to the exact analysis service call with all input arguments filled in. For example, one science use case is the evaluation of the NCAR CAM5 model against MODIS total cloud fraction. The analysis service used is the Difference Plot Service of Two Variables, and the datasets used are NCAR CAM total cloud fraction and MODIS total cloud fraction. The scientific highlight of the use case is that the CAM5 model overall does a fairly decent job of simulating total cloud cover, though it simulates too few clouds, especially near and offshore of the eastern ocean basins where low clouds are dominant.
FastQuery: A Parallel Indexing System for Scientific Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chou, Jerry; Wu, Kesheng; Prabhat,
2011-07-29
Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies such as FastBit can significantly improve accesses to these datasets by augmenting the user data with indexes and other secondary information. However, a challenge is that the indexes assume the relational data model while scientific data generally follow the array data model. To match the two data models, we design a generic mapping mechanism and implement an efficient input and output interface for reading and writing the data and their corresponding indexes. To take advantage of emerging many-core architectures, we also develop a parallel strategy for indexing using threading technology. This approach complements our ongoing MPI-based parallelization efforts. We demonstrate the flexibility of our software by applying it to two of the most commonly used scientific data formats, HDF5 and NetCDF. We present two case studies using data from a particle accelerator model and a global climate model. We also conducted a detailed performance study using these scientific datasets. The results show that FastQuery speeds up query time by a factor of 2.5x to 50x, and reduces indexing time by a factor of 16 on 24 cores.
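A hedged sketch of the array-to-index mapping idea: flatten an array-model variable, build a secondary index over its values, and answer value-range queries with binary search instead of a full scan. FastBit itself uses bitmap indexes; a sorted index via argsort stands in here.

```python
import numpy as np

rng = np.random.default_rng(2)
temperature = rng.normal(15.0, 5.0, size=(100, 100))  # array-model variable

flat = temperature.ravel()
order = np.argsort(flat)          # secondary index over the array's values
sorted_vals = flat[order]

def range_query(lo, hi):
    """Flat indices i with lo <= flat[i] < hi, found with two binary
    searches on the sorted index rather than scanning every cell."""
    a = np.searchsorted(sorted_vals, lo, side="left")
    b = np.searchsorted(sorted_vals, hi, side="left")
    return order[a:b]

hits = range_query(25.0, np.inf)
rows, cols = np.unravel_index(hits, temperature.shape)  # back to array coords
print(len(hits), "cells at or above 25 degrees")
```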
Unsupervised learning on scientific ocean drilling datasets from the South China Sea
NASA Astrophysics Data System (ADS)
Tse, Kevin C.; Chiu, Hon-Chim; Tsang, Man-Yin; Li, Yiliang; Lam, Edmund Y.
2018-06-01
Unsupervised learning methods were applied to explore data patterns in multivariate geophysical datasets collected from ocean-floor sediment core samples obtained through scientific ocean drilling in the South China Sea. Unlike supervised learning methods, which are designed to make predictions from labeled training data, unsupervised learning methods require no a priori information and focus only on the input data. In this study, popular unsupervised learning methods including K-means, self-organizing maps, hierarchical clustering and random forest were coupled with different distance metrics to form exploratory data clusters. The resulting data clusters were externally validated against lithologic units and geologic time scales assigned to the datasets by conventional methods. Compact and connected data clusters displayed varying degrees of correspondence with the existing classification by lithologic units and geologic time scales. K-means and self-organizing maps were observed to perform better with lithologic units, while random forest corresponded best with geologic time scales. This study sets a pioneering example of how unsupervised machine learning methods can be used as an automatic processing tool for the increasingly high volume of scientific ocean drilling data.
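As an illustration of coupling an unsupervised method with external validation against an existing classification, here is a short scikit-learn sketch on synthetic stand-in data; the feature semantics and labels are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# synthetic stand-in for multivariate core-log measurements
# (e.g. density, porosity, magnetic susceptibility)
X = np.vstack([rng.normal(loc=m, size=(100, 3)) for m in (0.0, 3.0, 6.0)])
lithologic_units = np.repeat([0, 1, 2], 100)  # conventional labels

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))
# external validation of the unsupervised clusters against the conventional
# classification; an adjusted Rand index of 1.0 means perfect agreement
print(adjusted_rand_score(lithologic_units, clusters))
```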
Process mining techniques: an application to time management
NASA Astrophysics Data System (ADS)
Khowaja, Ali Raza
2018-04-01
In any environment, people must ensure that all of their work is completed within a given time and to the required quality. To realize the full potential of process mining, one needs to understand all of these processes in detail. Personal information and communication have always been prominent issues on the internet; in everyday life, information and communication tools capture daily schedules, location and environmental context, and, more generally, social media applications make such data available through event logs, supporting both data analysis and process analysis that combines environmental and location information. Process mining can exploit these real-life processes with the help of event logs already available in such datasets, through user-censored or user-labeled data. These processes can then be used to redesign a user's workflow and to understand it in more detail. One way to increase the quality of the processes we go through in our daily lives is to examine each process closely and, after analyzing it, make changes to obtain better results. Accordingly, we applied process mining techniques to a single dataset, collected in Korea, combining seven different subjects. The paper comments on the efficiency of the processes in the event logs as they relate to time management.
Romaní, Franco; Cabezas, César; Espinoza, Manuel; Minaya, Gabriela; Huaripata, José; Ureta, Juan Manuel; Yazuda, Myriam; Gastañaga, María del Carmen; Miraval, María Luz; Aparco, Juan Pablo; Anaya, Elizabeth; Castro, José; Esquivel, Silvia
2012-01-01
The development of scientific health research requires a sustained and articulated research system that is consistent with research priorities, as well as both internal and external funding and the availability of competent human resources. The Mining Canon, a constitutional right, has been partly used to foster applied scientific research in public universities (PUs). In addition, the National Health Institute (Instituto Nacional de Salud, INS) is devoted, among other missions, to promoting, managing and disseminating health research development at the national level. As part of these activities, a technical team was created to provide technical assistance to PUs for research development using Mining Canon funds, by making local adjustments to research protocols promoted by the INS and taken up by professor-researchers at the universities. This article aims to describe the state of research at Peruvian public universities that have access to Mining Canon funds, and to elaborate on the work the INS is carrying out to strengthen research capabilities, starting with the development of research proposals that could potentially be funded by the Mining Canon.
Reinventing the U.S. Bureau of Mines
DOE Office of Scientific and Technical Information (OSTI.GOV)
Graham, R.L.
1995-10-01
The changes in the US Bureau of Mines (USBM) are discussed. Beginning this year, the Bureau will focus its scientific research efforts in four critical areas: (1) health and safety; (2) materials research (to be accomplished in partnership with the private sector); (3) environmental remediation; and (4) pollution prevention and control. In addition to this important scientific research, the US Bureau of Mines collects and publishes consumption and supply data and information on 100 commodities for the 60 most important mineral producing, consuming and trading countries; analyzes minerals-related issues, such as proposed mining-law reform legislation and proposed mine-waste regulations; and assists in federal ecosystem planning efforts. On December 30, 1994, Secretary Babbitt signed a secretarial order implementing an organizational structure to allow the USBM to go forward with this new research focus. In Fiscal Year 1995, Congress appropriated funds to the USBM to focus research within this organizational structure. The USBM has experienced major funding reductions under the Clinton Administration. The whole Federal government is getting smaller. The USBM workforce is comparable in size to what it was three decades ago.
Rinaldi, Fabio; Schneider, Gerold; Kaljurand, Kaarel; Hess, Michael; Andronis, Christos; Konstandi, Ourania; Persidis, Andreas
2007-02-01
The amount of new discoveries (as published in the scientific literature) in the biomedical area is growing at an exponential rate. This growth makes it very difficult to filter the most relevant results, and thus the extraction of the core information becomes very expensive. Therefore, there is growing interest in text processing approaches that can deliver selected information from scientific publications, limiting the amount of human intervention normally needed to gather those results. This paper presents and evaluates an approach aimed at automating the process of extracting functional relations (e.g. interactions between genes and proteins) from scientific literature in the biomedical domain. The approach, using a novel dependency-based parser, is based on a complete syntactic analysis of the corpus. We have implemented a state-of-the-art text mining system for biomedical literature, based on a deep-linguistic, full-parsing approach. The results are validated on two different corpora: the manually annotated genomics information access (GENIA) corpus and the automatically annotated Arabidopsis thaliana circadian rhythms (ATCR) corpus. We show how a deep-linguistic approach (contrary to common belief) can be used in a real-world text mining application, offering high-precision relation extraction while retaining sufficient recall.
Zhang, Haitao; Wu, Chenxue; Chen, Zewei; Liu, Zhao; Zhu, Yunhong
2017-01-01
Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, we first defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large-scale anonymity datasets. We then proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are identified by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity dataset database. We conducted extensive experiments to test the performance of the proposed method and to explore the influence of the parameter K. The results demonstrated that our proposed approach is faster and more effective at hiding privacy-sensitive sequence rules to eliminate inference attacks, has fewer side effects in terms of the ratio of newly generated sensitive rules than the traditional spatial-temporal k-anonymity method, and shows essentially the same variation ratio of non-sensitive rules as the traditional method. Furthermore, we characterized how performance varies with the parameter K, which can help hide the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules.
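A heavily simplified sketch of the on-line avoidance idea: if publishing a trajectory would expose a privacy-sensitive sequence rule, cells are generalized (from the end of the trajectory) to coarser regions until the sensitive pattern no longer matches. All cell names, rules, and the generalization map are hypothetical.

```python
SENSITIVE_RULES = [(("cell_12", "cell_40"), ("hospital_area",))]

def subsequence(seq, pattern):
    """True if pattern occurs in order (not necessarily contiguously)."""
    it = iter(seq)
    return all(any(p == s for s in it) for p in pattern)

def exposes(trajectory, rules):
    return any(subsequence(trajectory, ante + cons) for ante, cons in rules)

def sanitize(trajectory, rules, parent_region):
    """Generalize cells from the trajectory's end to their parent regions
    until no sensitive rule matches (one avoidance strategy of many)."""
    out = list(trajectory)
    for i in reversed(range(len(out))):
        if not exposes(out, rules):
            break
        out[i] = parent_region.get(out[i], out[i])
    return out

parents = {"cell_12": "district_1", "cell_40": "district_3",
           "hospital_area": "district_3"}
print(sanitize(["cell_7", "cell_12", "cell_40", "hospital_area"],
               SENSITIVE_RULES, parents))
# -> ['cell_7', 'cell_12', 'cell_40', 'district_3']
```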
Using Fisher information to track stability in multivariate systems
With the current proliferation of data, the proficient use of statistical and mining techniques offer substantial benefits to capture useful information from any dataset. As numerous approaches make use of information theory concepts, here, we discuss how Fisher information (FI...
Review of Studies of Mechanoelectrical Transformations in Rocks in Russia and Abroad
NASA Astrophysics Data System (ADS)
Pomishin, E.; Yavorovich, L.
2016-06-01
The problem of monitoring and forecasting dynamic manifestations of rock masses is becoming urgent in the mining industry because of the growing intensity of mining work and the shift of mining operations to deeper levels. The article presents a short review of the scientific works of foreign researchers toward a more complete and in-depth study of geophysical methods for monitoring the stress-strain state and bump hazard of rock masses.
Maulik, Ujjwal; Mallik, Saurav; Mukhopadhyay, Anirban; Bandyopadhyay, Sanghamitra
2015-01-01
Microarray and beadchip are two of the most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule-mining framework, StatBicRM (i.e., statistical biclustering-based rule mining), to identify special types of rules and potential biomarkers using integrated statistical and binary inclusion-maximal biclustering techniques on biological datasets. First, a novel statistical strategy is utilized to eliminate insignificant/low-significance/redundant genes in such a way that the significance level satisfies the data distribution property (viz., either normal or non-normal distribution). The data are then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. Corresponding special types of rules are then extracted from the selected itemsets. Our proposed rule-mining method performs better than other rule-mining algorithms because it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets; it thus saves elapsed time and can work on big datasets. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using the DAVID database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we classify the data to determine how accurately the evolved rules describe the remaining (unknown) test data. Subsequently, we compare the average classification accuracy and other related factors with those of other rule-based classifiers. Statistical significance tests are also performed to verify the statistical relevance of the comparative results. Each of the other rule-mining methods or rule-based classifiers starts from the same post-discretized data matrix. Finally, we include an integrated analysis of gene expression and methylation to determine the epigenetic effect (viz., the effect of methylation) on gene expression level. PMID:25830807
Development of mine explosion ground truth smart sensors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Taylor, Steven R.; Harben, Phillip E.; Jarpe, Steve
Accurate seismo-acoustic source location is one of the fundamental aspects of nuclear explosion monitoring. Critical to improved location is the compilation of ground truth datasets for which origin time and location are accurately known. Substantial effort by the National Laboratories and other seismic monitoring groups has been undertaken to acquire and develop ground truth catalogs that form the basis of location efforts (e.g. Sweeney, 1998; Bergmann et al., 2009; Waldhauser and Richards, 2004). In particular, more GT1 (Ground Truth 1 km) events are required to improve three-dimensional velocity models that are currently under development. Mine seismicity can form the basis of accurate ground truth datasets. Although the location of mining explosions can often be accurately determined using array methods (e.g. Harris, 1991) and from overhead observations (e.g. MacCarthy et al., 2008), accurate origin time estimation can be difficult. Occasionally, mine operators will share shot time, location, explosion size and even shot configuration, but this is rarely done, especially in foreign countries. Additionally, shot times provided by mine operators are often inaccurate. An inexpensive ground truth event detector that could be mailed to a contact, placed in close proximity (< 5 km) to mining regions or earthquake aftershock regions, and that automatically transmits back ground-truth parameters would greatly aid the development of ground truth datasets for improving nuclear explosion monitoring capabilities. We are developing an inexpensive, compact, lightweight smart sensor unit (or units) for this purpose. The units must be easy to deploy, able to operate autonomously for a significant period of time (> 6 months), and inexpensive enough to be discarded after useful operations have expired (although this may not be part of our business plan). Key parameters to be automatically determined are event origin time (within 0.1 sec), location (within 1 km) and size (within 0.3 magnitude units), without any human intervention. The key ground truth parameters from explosions greater than magnitude 2.5 will be transmitted to a recording and transmitting site. Because we have identified a limited-bandwidth, inexpensive two-way satellite communication system (ORBCOMM), we have devised the concept of an accompanying Ground-Truth Processing Center that would enable calibration and ground-truth accuracy to improve over the duration of a deployment.
Mining and Indexing Graph Databases
ERIC Educational Resources Information Center
Yuan, Dayu
2013-01-01
Graphs are widely used to model structures and relationships of objects in various scientific and commercial fields. Chemical molecules, proteins, malware system-call dependencies and three-dimensional mechanical parts are all modeled as graphs. In this dissertation, we propose to mine and index those graph data to enable fast and scalable search.…
1999-01-01
The Abandoned Mine Lands (AML) Initiative is part of a larger strategy of the U.S. Department of the Interior and the U.S. Department of Agriculture to clean up Federal lands contaminated by abandoned mines. Thousands of abandoned hard-rock metal mines (such as gold, copper, lead, and zinc) have left a dual legacy across the Western United States. They reflect the historic development of the West, yet at the same time represent a possible threat to human health and local ecosystems. Abandoned Mine Lands (AMLs) are areas adjacent to or affected by abandoned mines. AMLs often contain unmined mineral deposits, mine dumps (the ore and rock removed to get to the ore deposits), and tailings (the material left over from ore processing) that contaminate the surrounding watershed and ecosystem. For example, streams near AMLs can contain metals and (or) be so acidic that fish and aquatic insects cannot live in them. Many of these abandoned hard-rock mines are located on or adjacent to public lands administered by the Bureau of Land Management, National Park Service, and U.S. Forest Service. These federal land management agencies and the USGS are committed to mitigating the adverse effects that AMLs can have on water quality and stream habitats. The USGS AML Initiative began in 1997 and will continue through 2001 in two pilot watersheds: the Boulder River basin in southwestern Montana and the upper Animas River basin in southwestern Colorado. The USGS is providing a wide range of scientific expertise to help land managers minimize and, where possible, eliminate the adverse environmental effects of AMLs. USGS ecologists, geologists, water quality experts, hydrologists, geochemists, and mapping and digital data collection experts are collaborating to provide the scientific knowledge needed for an effective cleanup of AMLs.
NASA Astrophysics Data System (ADS)
Pereira, Dolores; Pereira, Alcides; Neves, Luis
2015-04-01
The study of radioactivity in natural stones is a subject of great interest from scientific, social and economic points of view. Several previous studies have demonstrated that radioactivity depends not only on the uranium content, but also on the structures, textures, minerals hosting the uranium, and degree of weathering of the natural stone. Villavieja granite is extracted in a village where uranium mining was an important activity during the 20th century. Today the mine is closed but the granite is still extracted. Incorrect information about natural radioactivity given to natural stone users, policy makers, construction managers and the general public has caused turmoil in the media for many years. This paper considers problems associated with communicating reliable information, as well as its uncertainties, on natural radioactivity to these audiences.
Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou
2006-06-15
The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying fungus genes. The approach involves elucidating the enzyme classification rules hidden in the UniProt protein knowledgebase and then applying them for classification. The association algorithm Apriori is utilized to mine the relationship between enzyme classes and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. Five datasets were collected from Swiss-Prot for establishing the annotation rules and treated as training sets; the TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets, and a similar rate of about 80% for the eukaryote datasets. The fungus training dataset, which lacks enzyme class descriptions, was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched by the fungus rule set; these were otherwise poorly annotated by their functional descriptions. The feasibility of using the method presented here to classify enzyme classes based on enzyme domain rules is evident.
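The rule-mining step can be illustrated with a brute-force stand-in for Apriori that emits rules from InterPro entry combinations to an enzyme class, filtered by support and confidence; the records and thresholds below are invented.

```python
from collections import defaultdict
from itertools import combinations

def classification_rules(records, min_support, min_confidence):
    """Emit rules {InterPro entries} -> enzyme class whose support and
    confidence meet the thresholds (exhaustive; Apriori prunes instead)."""
    n = len(records)
    ante, joint = defaultdict(int), defaultdict(int)
    for domains, ec in records:
        for size in range(1, len(domains) + 1):
            for combo in combinations(sorted(domains), size):
                ante[combo] += 1
                joint[(combo, ec)] += 1
    for (combo, ec), cnt in joint.items():
        if cnt / n >= min_support and cnt / ante[combo] >= min_confidence:
            yield combo, ec, cnt / n, cnt / ante[combo]

data = [({"IPR001128", "IPR002401"}, "EC 1.14"),
        ({"IPR001128", "IPR002401"}, "EC 1.14"),
        ({"IPR000719"}, "EC 2.7"),
        ({"IPR000719", "IPR008271"}, "EC 2.7")]
for combo, ec, s, c in classification_rules(data, 0.25, 0.9):
    print(" & ".join(combo), "->", ec, f"(support={s:.2f}, confidence={c:.2f})")
```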
Raja, Kalpana; Natarajan, Jeyakumar
2018-07-01
Extraction of protein phosphorylation information from the biomedical literature has gained much attention because of its importance in numerous biological processes. In this study, we propose a text mining methodology consisting of two phases, NLP parsing and SVM classification, to extract phosphorylation information from the literature. First, using NLP parsing we divide the data into three base-forms depending on the biomedical entities related to phosphorylation, and further classify them into ten sub-forms based on their distribution with respect to the phosphorylation keyword. Next, we extract the phosphorylation entity singles/pairs/triplets and apply an SVM to classify them using a set of features applicable to each sub-form. The performance of our methodology was evaluated on three corpora: PLC, iProLink and the hPP corpus. We obtained promising results of >85% F-score on the ten sub-forms of the training datasets in cross-validation tests. Our system achieved an overall F-score of 93.0% on the iProLink and 96.3% on the hPP corpus test datasets. Furthermore, our proposed system achieved the best performance in cross-corpus evaluation and outperformed the existing system with a recall of 90.1%. The performance analysis of our system on the three corpora shows that it extracts protein phosphorylation information efficiently from both non-organism-specific general datasets, such as PLC and iProLink, and human-specific datasets, such as the hPP corpus.
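A minimal sketch of the second phase, assuming plain TF-IDF features in place of the paper's sub-form-specific feature set; the snippets and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# 1 = snippet states a phosphorylation event, 0 = it does not
snippets = ["MAPK phosphorylates Elk-1 at Ser383",
            "Elk-1 and MAPK were co-expressed in HeLa cells",
            "CDK2 phosphorylation of Rb occurs at Thr821",
            "Rb interacts with E2F1 in vivo"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(snippets, labels)
print(clf.predict(["GSK3 phosphorylates tau at Ser396"]))
```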
Relationships between palaeogeography and opal occurrence in Australia: A data-mining approach
NASA Astrophysics Data System (ADS)
Landgrebe, T. C. W.; Merdith, A.; Dutkiewicz, A.; Müller, R. D.
2013-07-01
Age-coded multi-layered geological datasets are becoming increasingly prevalent with the surge in open-access geodata, yet there are few methodologies for extracting geological information and knowledge from these data. We present a novel methodology, based on the open-source GPlates software in which age-coded digital palaeogeographic maps are used to “data-mine” spatio-temporal patterns related to the occurrence of Australian opal. Our aim is to test the concept that only a particular sequence of depositional/erosional environments may lead to conditions suitable for the formation of gem quality sedimentary opal. Time-varying geographic environment properties are extracted from a digital palaeogeographic dataset of the eastern Australian Great Artesian Basin (GAB) at 1036 opal localities. We obtain a total of 52 independent ordinal sequences sampling 19 time slices from the Early Cretaceous to the present-day. We find that 95% of the known opal deposits are tied to only 27 sequences all comprising fluvial and shallow marine depositional sequences followed by a prolonged phase of erosion. We then map the total area of the GAB that matches these 27 opal-specific sequences, resulting in an opal-prospective region of only about 10% of the total area of the basin. The key patterns underlying this association involve only a small number of key environmental transitions. We demonstrate that these key associations are generally absent at arbitrary locations in the basin. This new methodology allows for the simplification of a complex time-varying geological dataset into a single map view, enabling straightforward application for opal exploration and for future co-assessment with other datasets/geological criteria. This approach may help unravel the poorly understood opal formation process using an empirical spatio-temporal data-mining methodology and readily available datasets to aid hypothesis testing.
Beckmann, Jacques S; Lew, Daniel
2016-12-19
This era of groundbreaking scientific developments in high-resolution, high-throughput technologies is allowing the cost-effective collection and analysis of huge, disparate datasets on individual health. Proper data mining and translation of the vast datasets into clinically actionable knowledge will require the application of clinical bioinformatics. These developments have triggered multiple national initiatives in precision medicine, a data-driven approach centering on the individual. However, clinical implementation of precision medicine poses numerous challenges. Foremost, precision medicine needs to be contrasted with the powerful and widely used practice of evidence-based medicine, which is informed by meta-analyses or group-centered studies from which mean recommendations are derived. This "one size fits all" approach can provide inadequate solutions for outliers. Such outliers, which are far from an oddity as all of us fall into this category for some traits, can be better managed using precision medicine. Here, we argue that it is necessary and possible to bridge between precision medicine and evidence-based medicine. This will require worldwide and responsible data sharing, as well as regularly updated training programs. We also discuss the challenges and opportunities for achieving clinical utility in precision medicine. We project that, through collection, analyses and sharing of standardized medically relevant data globally, evidence-based precision medicine will shift progressively from therapy to prevention, thus leading eventually to improved clinician-to-patient communication, citizen-centered healthcare and sustained well-being.
Dynamic association rules for gene expression data analysis.
Chen, Shu-Chuan; Tsai, Tsung-Hsien; Chung, Cheng-Han; Li, Wen-Hsiung
2015-10-14
The purpose of gene expression analysis is to look for associations between the regulation of gene expression levels and phenotypic variations. Such associations, based on gene expression profiles, have been used to determine whether the induction/repression of genes corresponds to phenotypic variations, including cell regulation, clinical diagnoses and drug development. Statistical analyses of microarray data have been developed to resolve the gene selection issue, but these methods do not inform us of the causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm), which helps one efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical way, based on constructing a one-sided confidence interval and hypothesis testing, to determine whether an association rule is meaningful. Based on the proposed statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to analyze four microarray datasets and one Next Generation Sequencing (NGS) dataset: the mice Apo A1 dataset, the whole-genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of leukemia patients, the Microarray Quality Control (MAQC) dataset, and the RNA-seq dataset of a mouse genomic imprinting study. A comparison of the proposed method with the t-test on the expression profiling of the bone marrow of leukemia patients was conducted. We developed a statistical way, based on the concept of a confidence interval, to determine the minimum support and minimum confidence for mining association relationships among items. With the minimum support and minimum confidence, one can find significant rules in a single step. The four gene expression datasets showed that the proposed DAR algorithm not only identified a set of differentially expressed genes that largely agreed with those of other methods, but also provided an efficient and accurate way to find influential genes of a disease. In this paper, the well-established association rule mining technique from marketing has been successfully modified to determine the minimum support and minimum confidence based on the concept of a confidence interval and hypothesis testing. It can be applied to gene expression data to mine significant association rules between gene regulation and phenotype. The proposed DAR algorithm provides an efficient way to find influential genes that underlie phenotypic variance.
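A hedged reading of the confidence-interval idea in code: a rule A -> B is accepted only if one-sided lower confidence bounds on its support and confidence clear chosen baselines. The normal approximation and baseline values are the editor's assumptions, not the paper's exact construction.

```python
import math

def lower_bound(count, n, z=1.645):
    """One-sided 95% lower confidence bound for a binomial proportion
    (normal approximation)."""
    p = count / n
    return p - z * math.sqrt(p * (1 - p) / n)

def rule_is_meaningful(joint, ante, n, support_floor=0.1, confidence_floor=0.5):
    """Accept rule A -> B only if the lower bounds on its support
    (joint/n) and confidence (joint/ante) both clear the floors,
    i.e. reject H0: support <= floor and H0: confidence <= floor."""
    return (lower_bound(joint, n) > support_floor and
            lower_bound(joint, ante) > confidence_floor)

# 300 expression profiles; the antecedent holds in 80, the full rule in 60
print(rule_is_meaningful(joint=60, ante=80, n=300))  # True
```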
Long-range prediction of Indian summer monsoon rainfall using data mining and statistical approaches
NASA Astrophysics Data System (ADS)
H, Vathsala; Koolagudi, Shashidhar G.
2017-10-01
This paper presents a hybrid model to better predict Indian summer monsoon rainfall. The algorithm considers techniques suitable for processing dense datasets. The proposed three-step algorithm comprises closed-itemset-generation-based association rule mining for feature selection, cluster membership for dimensionality reduction, and a simple logistic function for prediction. An application classifying rainfall into flood, excess, normal, deficit, and drought categories based on 36 predictors consisting of land and ocean variables is presented. Results show good accuracy over the considered study period of 37 years (1969-2005).
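Steps two and three of the pipeline can be sketched with scikit-learn: cluster-membership (distance-to-centroid) features feed a simple logistic classifier. The data below are random stand-ins for the 36 predictors, and step one (rule-based feature selection) is assumed already done.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(37, 36))    # 37 monsoon years x 36 land/ocean predictors
y = rng.integers(0, 5, size=37)  # flood/excess/normal/deficit/drought classes

# KMeans.transform yields distances to the 5 centroids, reducing the 36
# predictors to 5 cluster-membership features for the logistic model
model = make_pipeline(KMeans(n_clusters=5, n_init=10, random_state=0),
                      LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.predict(X[:3]))
```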
Customizing FP-growth algorithm to parallel mining with Charm++ library
NASA Astrophysics Data System (ADS)
Puścian, Marek
2017-08-01
This paper presents a frequent itemset mining algorithm customized to handle growing data repositories. The proposed solution applies a master-slave scheme to the frequent pattern growth technique. Efficient utilization of the available computation units is achieved by dynamic reallocation of tasks: conditional frequent trees are assigned to parallel workers based on their workload. The proposed enhancements have been successfully implemented using the Charm++ library. The paper discusses the performance of the parallelized FP-Growth algorithm on different datasets; the approach is illustrated with many experiments and measurements performed on a multiprocessor, multithreaded computer.
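The master-slave decomposition can be sketched with plain Python multiprocessing as a stand-in for Charm++: the master builds one conditional database per frequent item, and idle workers pull tasks dynamically. For brevity each worker mines only size-2 patterns; the paper's workers process full conditional trees.

```python
from collections import defaultdict
from multiprocessing import Pool

def item_counts(transactions):
    counts = defaultdict(int)
    for t in transactions:
        for item in set(t):
            counts[item] += 1
    return counts

def mine_conditional(task):
    """Worker task: one item's conditional database (one conditional
    tree, in the paper's terms); here only pairs are mined."""
    item, conditional_db, min_support = task
    patterns = {(item,): len(conditional_db)}
    for other, c in item_counts(conditional_db).items():
        if c >= min_support:
            patterns[(item, other)] = c
    return patterns

if __name__ == "__main__":
    baskets = [["milk", "bread", "eggs"], ["milk", "bread"],
               ["bread", "eggs"], ["milk", "eggs"], ["milk", "bread", "eggs"]]
    min_support = 3
    counts = item_counts(baskets)
    frequent = sorted(i for i, c in counts.items() if c >= min_support)
    rank = {i: r for r, i in enumerate(frequent)}
    # master: one task per frequent item; each conditional transaction
    # keeps only higher-ranked items, so no pattern is mined twice
    tasks = [(i, [[j for j in t if j in rank and rank[j] > rank[i]]
                  for t in baskets if i in t], min_support)
             for i in frequent]
    with Pool(2) as pool:
        # imap_unordered hands the next task to whichever worker is free,
        # a simple analogue of the paper's workload-based reallocation
        for patterns in pool.imap_unordered(mine_conditional, tasks):
            print(patterns)
```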
Mining Context-Aware Association Rules Using Grammar-Based Genetic Programming.
Luna, Jose Maria; Pechenizkiy, Mykola; Del Jesus, Maria Jose; Ventura, Sebastian
2017-09-25
Real-world data usually comprise features whose interpretation depends on some contextual information. Such context-sensitive features and patterns are of high interest to discover and analyze in order to obtain the right meaning. This paper formulates the problem of mining context-aware association rules, which refers to the search for associations between itemsets such that the strength of their implication depends on a contextual feature. For the discovery of this type of association, a model that restricts the search space and includes syntax constraints by means of a grammar-based genetic programming methodology is proposed. Grammars can be considered a useful way of introducing subjective knowledge into the pattern mining process, as they are highly related to the background knowledge of the user. The performance and usefulness of the proposed approach are examined on synthetically generated datasets. A posteriori analyses on different domains are also carried out to demonstrate the utility of this kind of association. For example, in educational domains it is essential to identify and understand the contextual and context-sensitive factors that affect overall and individual student behavior and performance. The results of the experiments suggest that the approach is feasible and automatically identifies interesting context-aware associations from real-world datasets.
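The grammar-as-search-space idea can be illustrated by randomly deriving candidate context-aware rules from a toy grammar; the terminals use invented educational-domain features, and a full GP would add fitness evaluation plus crossover and mutation over derivation trees.

```python
import random

# toy context-free grammar restricting candidates to associations whose
# strength is conditioned on a contextual feature
GRAMMAR = {
    "<rule>": [["<ante>", "->", "<cons>", "IF CONTEXT", "<ctx>"]],
    "<ante>": [["<item>"], ["<item>", "AND", "<ante>"]],
    "<cons>": [["<item>"]],
    "<item>": [["grade=high"], ["attendance=low"], ["forum_posts=many"]],
    "<ctx>":  [["semester=1"], ["semester=2"], ["course=online"]],
}

def expand(symbol, rng, depth=0, max_depth=4):
    """Randomly derive a sentence from the grammar (the initialization
    step of grammar-based GP)."""
    if symbol not in GRAMMAR:
        return [symbol]                      # terminal token
    options = GRAMMAR[symbol]
    if depth >= max_depth:                   # bound recursion: shortest rule
        options = [min(options, key=len)]
    production = rng.choice(options)
    return [tok for part in production
            for tok in expand(part, rng, depth + 1, max_depth)]

rng = random.Random(7)
for _ in range(3):
    print(" ".join(expand("<rule>", rng)))
```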
NASA Astrophysics Data System (ADS)
Cehlár, Michal; Janočko, Juraj; Demirel, Nuray; Anyona, Seroni; Vöth, Stefan; Tyulenev, Maxim; Zhironkin, Sergey
2017-11-01
On 24-26 April 2017, the scientific-practical conference "International Innovative Mining Symposium (in memory of Prof. Vladimir Pronoza)" was held at T.F. Gorbachev Kuzbass State Technical University (KuzSTU), Kemerovo, Russia. More than 20 participants presented their papers in person, and more than 40 authors submitted papers for virtual participation. The keynote speakers paid special attention to the development of international cooperation among technical universities as a means of accelerating the transfer of innovations in mineral resources mining and processing.
NASA Astrophysics Data System (ADS)
Willis, D. M.; Coffey, H. E.; Henwood, R.; Erwin, E. H.; Hoyt, D. V.; Wild, M. N.; Denig, W. F.
2013-11-01
The measurements of sunspot positions and areas that were published initially by the Royal Observatory, Greenwich, and subsequently by the Royal Greenwich Observatory (RGO), as the Greenwich Photo-heliographic Results (GPR), 1874-1976, exist in both printed and digital forms. These printed and digital sunspot datasets have been archived in various libraries and data centres. Unfortunately, however, typographic, systematic and isolated errors can be found in the various datasets. The purpose of the present paper is to begin the task of identifying and correcting these errors. In particular, the intention is to provide in one foundational paper all the necessary background information on the original solar observations, their various applications in scientific research, the format of the different digital datasets, the necessary definitions of the quantities measured, and the initial identification of errors in both the printed publications and the digital datasets. Two companion papers address the question of specific identifiable errors; namely, typographic errors in the printed publications, and both isolated and systematic errors in the digital datasets. The existence of two independently prepared digital datasets, which both contain information on sunspot positions and areas, makes it possible to outline a preliminary strategy for the development of an even more accurate digital dataset. Further work is in progress to generate an extremely reliable sunspot digital dataset, based on the programme of solar observations supported for more than a century by the Royal Observatory, Greenwich, and the Royal Greenwich Observatory. This improved dataset should be of value in many future scientific investigations.
Large-Scale, Parallel, Multi-Sensor Data Fusion in the Cloud
NASA Astrophysics Data System (ADS)
Wilson, B. D.; Manipon, G.; Hua, H.
2012-12-01
NASA's Earth Observing System (EOS) is an ambitious facility for studying global climate change. The mandate now is to combine measurements from the instruments on the "A-Train" platforms (AIRS, AMSR-E, MODIS, MISR, MLS, and CloudSat) and other Earth probes to enable large-scale studies of climate change over periods of years to decades. However, moving from predominantly single-instrument studies to a multi-sensor, measurement-based model for long-duration analysis of important climate variables presents serious challenges for large-scale data mining and data fusion. For example, one might want to compare temperature and water vapor retrievals from one instrument (AIRS) to another instrument (MODIS), and to a model (ECMWF), stratify the comparisons using a classification of the "cloud scenes" from CloudSat, and repeat the entire analysis over years of AIRS data. To perform such an analysis, one must discover & access multiple datasets from remote sites, find the space/time "matchups" between instrument swaths and model grids, understand the quality flags and uncertainties for retrieved physical variables, assemble merged datasets, and compute fused products for further scientific and statistical analysis. To efficiently assemble such decade-scale datasets in a timely manner, we are utilizing Elastic Computing in the Cloud and parallel map/reduce-based algorithms. "SciReduce" is a Hadoop-like parallel analysis system, programmed in parallel Python, that is designed from the ground up for Earth science. SciReduce executes inside VMWare images and scales to any number of nodes in the Cloud. Unlike Hadoop, in which simple tuples (keys & values) are passed between the map and reduce functions, SciReduce operates on bundles of named numeric arrays, which can be passed in memory or serialized to disk in netCDF4 or HDF5. Thus, SciReduce uses the native datatypes (geolocated grids, swaths, and points) that geo-scientists are familiar with. We are deploying within SciReduce a versatile set of Python operators for data lookup, access, subsetting, co-registration, mining, fusion, and statistical analysis. All operators take in sets of geo-located arrays and generate more arrays. Large, multi-year satellite and model datasets are automatically "sharded" by time and space across a cluster of nodes so that years of data (millions of granules) can be compared or fused in a massively parallel way. Input variables (arrays) are pulled on-demand into the Cloud using OPeNDAP or webification URLs, thereby minimizing the size of the stored input and intermediate datasets. A typical map function might assemble and quality control AIRS Level-2 water vapor profiles for a year of data in parallel, then a reduce function would average the profiles in lat/lon bins (again, in parallel), and a final reduce would aggregate the climatology and write it to output files. We are using SciReduce to automate the production of multiple versions of a multi-year water vapor climatology (AIRS & MODIS), stratified by CloudSat cloud classification, and compare it to models (ECMWF & MERRA reanalysis). We will present the architecture of SciReduce, describe the achieved "clock time" speedups in fusing huge datasets on our own nodes and in the Amazon Cloud, and discuss the Cloud cost tradeoffs for storage, compute, and data transfer.
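The map-then-reduce pattern described in the final paragraph can be sketched with Python's multiprocessing standing in for SciReduce. The granule contents, bin sizes, and quality flag below are invented; only the shape of the computation (per-granule map, binned-average reduce) follows the abstract.

    import numpy as np
    from multiprocessing import Pool

    def map_granule(seed):
        """'Map': quality-control one (fake) granule of water-vapor
        retrievals and return per-bin sums and counts on a 10-degree grid."""
        rng = np.random.default_rng(seed)
        lat = rng.uniform(-90, 90, 1000)
        lon = rng.uniform(-180, 180, 1000)
        wv = rng.gamma(2.0, 1.5, 1000)           # fake retrievals
        ok = wv < 10.0                           # fake quality flag
        sums, counts = np.zeros((18, 36)), np.zeros((18, 36))
        ilat = ((lat[ok] + 90) // 10).astype(int).clip(0, 17)
        ilon = ((lon[ok] + 180) // 10).astype(int).clip(0, 35)
        np.add.at(sums, (ilat, ilon), wv[ok])
        np.add.at(counts, (ilat, ilon), 1)
        return sums, counts

    def reduce_bins(results):
        """'Reduce': merge per-granule sums/counts into a binned mean."""
        total = sum(s for s, _ in results)
        n = sum(c for _, c in results)
        return np.divide(total, n, out=np.full_like(total, np.nan), where=n > 0)

    if __name__ == "__main__":
        with Pool(4) as pool:
            climatology = reduce_bins(pool.map(map_granule, range(100)))
        print(climatology.shape)                 # (18, 36) lat/lon bins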
Schoof, Rosalind A; Johnson, Dina L; Handziuk, Emma R; Landingham, Cynthia Van; Feldpausch, Alma M; Gallagher, Alexa E; Dell, Linda D; Kephart, Amy
2016-10-01
Lead exposure and blood lead levels (BLLs) in the United States have declined dramatically since the 1970s as many widespread lead uses have been discontinued. Large-scale mining and mineral processing represents an additional localized source of potential lead exposure in many historical mining communities, such as Butte, Montana. After 25 years of ongoing remediation efforts and a residential metals abatement program that includes blood lead monitoring of Butte children, examination of blood lead trends offers a unique opportunity to assess the effectiveness of Butte's lead source and exposure reduction measures. This study examined BLL trends in Butte children ages 1-5 (n = 2796) from 2003-2010 as compared to a reference dataset matched for similar demographic characteristics over the same period. Blood lead differences across Butte during the same period are also examined. Findings are interpreted with respect to the effectiveness of remediation and other factors potentially contributing to ongoing exposure concerns. BLLs from Butte were compared with a reference dataset (n = 2937) derived from the National Health and Nutrition Examination Survey. The reference dataset was initially matched for child age and sample dates. Additional demographic factors associated with higher BLLs were then evaluated. Weights were applied to make the reference dataset more consistent with the Butte dataset for the three factors that were most disparate (poverty-to-income ratio, house age, and race/ethnicity). A weighted linear mixed regression model showed Butte geometric mean BLLs were higher than reference BLLs for 2003-2004 (3.48 vs. 2.05 µg/dL), 2005-2006 (2.65 vs. 1.80 µg/dL), and 2007-2008 (2.2 vs. 1.72 µg/dL), but comparable for 2009-2010 (1.53 vs. 1.51 µg/dL). This trend suggests that, over time, the impact of other factors that may be associated with Butte BLLs has been reduced. Neighborhood differences were examined by dividing the Butte dataset into the older area called "Uptown", located at higher elevation atop historical mine workings, and "the Flats", at lower elevation and more recently developed. Significant declines in BLLs were observed over time in both areas, though Uptown had slightly higher BLLs than the Flats (2003-2004: 3.57 vs. 3.45 µg/dL, p = 0.7; 2005-2006: 2.84 vs. 2.52 µg/dL, p = 0.1; 2007-2008: 2.58 vs. 1.99 µg/dL, p = 0.001; 2009-2010: 1.71 vs. 1.44 µg/dL, p = 0.02). BLLs were higher when tested in summer/fall than in winter/spring for both neighborhoods, and statistically higher BLLs were found for children in Uptown living in properties built before 1940. Neighborhood differences and the persistence of a greater percentage of high BLLs (>5 µg/dL) in Butte vs. the reference dataset support continuation of the home lead abatement program. Butte BLL declines likely reflect the cumulative effectiveness of screening efforts, community-wide remediation, and the ongoing metals abatement program in Butte, in addition to other factors not accounted for by this study. As evidenced in Butte, abatement programs that include home evaluations and assistance in addressing multiple sources of lead exposure can be an important complement to community-wide soil remediation activities. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
An NCME Instructional Module on Data Mining Methods for Classification and Regression
ERIC Educational Resources Information Center
Sinharay, Sandip
2016-01-01
Data mining methods for classification and regression are becoming increasingly popular in various scientific fields. However, these methods have not been explored much in educational measurement. This module first provides a review, which should be accessible to a wide audience in educational measurement, of some of these methods. The module then…
The Reserve Mining Case is regarded as the first landmark environmental and public health case and originated from Lake Superior Enforcement Conferences conducted under the Federal Water Pollution Control Act before U.S. EPA was formed. In 1973 just before the federal trial bega...
A science-based, watershed strategy to support effective remediation of abandoned mine lands
Buxton, Herbert T.; Nimick, David A.; Von Guerard, Paul; Church, Stan E.; Frazier, Ann G.; Gray, John R.; Lipin, Bruce R.; Marsh, Sherman P.; Woodward, Daniel F.; Kimball, Briant A.; Finger, Susan E.; Ischinger, Lee S.; Fordham, John C.; Power, Martha S.; Bunch, Christine M.; Jones, John W.
1997-01-01
A U.S. Geological Survey Abandoned Mine Lands Initiative will develop a strategy for gathering and communicating the scientific information needed to formulate effective and cost-efficient remediation of abandoned mine lands. A watershed approach will identify, characterize, and remediate contaminated sites that have the most profound effect on water and ecosystem quality within a watershed. The Initiative will be conducted from 1997 through 2001 in two pilot watersheds, the Upper Animas River watershed in Colorado and the Boulder River watershed in Montana. Initiative efforts are being coordinated with the U.S. Forest Service, Bureau of Land Management, National Park Service, and other stakeholders, which are using the resulting scientific information to design and implement remediation activities. The Initiative has the following eight objective-oriented components: estimate background (pre-mining) conditions; define baseline (current) conditions; identify target sites (major contaminant sources); characterize target sites and processes affecting contaminant dispersal; characterize ecosystem health and controlling processes at target sites; develop remediation goals and a monitoring network; provide an integrated, quality-assured and accessible data network; and document lessons learned for future applications of the watershed approach.
A web server for mining Comparative Genomic Hybridization (CGH) data
NASA Astrophysics Data System (ADS)
Liu, Jun; Ranka, Sanjay; Kahveci, Tamer
2007-11-01
Advances in cytogenetics and molecular biology have established that chromosomal alterations are critical in the pathogenesis of human cancer. Recurrent chromosomal alterations provide cytological and molecular markers for the diagnosis and prognosis of disease. They also facilitate the identification of genes that are important in carcinogenesis, which in the future may help in the development of targeted therapy. A large and growing amount of cancer genetic data is now publicly available, and there is a need for public-domain tools that allow users to analyze their data and visualize the results. This chapter describes a web-based software tool that allows researchers to analyze and visualize Comparative Genomic Hybridization (CGH) datasets. It employs novel data mining methodologies for clustering and classification of CGH datasets, as well as algorithms for identifying important markers (small sets of genomic intervals with aberrations) that are potentially cancer signatures. The developed software will help in understanding the relationships between genomic aberrations and cancer types.
Spatial Variation of Selenium in Appalachian Coal Seams
NASA Astrophysics Data System (ADS)
Le, L.; Tyner, J. S.; Perfect, E.; Yoder, D. C.
2013-12-01
The potential environmental impacts of coal extraction have led to many investigations of the geochemistry of coal. Previous studies have shown that selenium (Se) is an environmental contaminant due to its mutagenic effects on sensitive macro-organisms as a result of bioaccumulation in affected waters. Some regulatory authorities have responded by requiring the sampling of coal seams and adjacent rock for Se prior to authorizing a given coal mining permit. In at least one case, a single continuous rock core was sampled for Se to determine the Se threshold across a 2.2-square-kilometer proposed surface coal mine. To examine the adequacy of such an approach, we investigated the spatial variability and correlation of a West Virginia Geological and Economic Survey (WVGES) dataset of Se concentrations from coal seams collected within Appalachia (1088 samples). We conducted semivariogram and kriging cross-validation analyses on six coal seams from the dataset. Our findings suggest no significant spatial correlation of Se within a given coal seam.
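The spatial-correlation test rests on the empirical semivariogram; a minimal NumPy version is sketched below with synthetic coordinates and Se values (the WVGES data are not reproduced). A roughly flat semivariogram across lags, consistent with the finding above, indicates little spatial correlation.

    import numpy as np

    def empirical_semivariogram(xy, z, lag_edges):
        """gamma(h) = 0.5 * mean((z_i - z_j)^2) over point pairs whose
        separation distance falls in each lag bin."""
        dist = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1))
        halfsq = 0.5 * (z[:, None] - z[None, :]) ** 2
        i, j = np.triu_indices(len(z), k=1)      # count each pair once
        dist, halfsq = dist[i, j], halfsq[i, j]
        gamma = [halfsq[(dist >= lo) & (dist < hi)].mean()
                 if ((dist >= lo) & (dist < hi)).any() else np.nan
                 for lo, hi in zip(lag_edges[:-1], lag_edges[1:])]
        return np.asarray(gamma)

    rng = np.random.default_rng(42)
    xy = rng.uniform(0, 1500, size=(200, 2))     # synthetic sample locations (m)
    se = rng.lognormal(0.5, 0.4, 200)            # synthetic Se concentrations
    print(empirical_semivariogram(xy, se, np.linspace(0, 800, 9)))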
NASA Astrophysics Data System (ADS)
Jung, Chinte; Sun, Chih-Hong
2006-10-01
Motivated by the increasing accessibility of technology, more and more spatial data are being made digitally available. Extracting valuable knowledge from these large spatial databases is becoming increasingly important to businesses. It is essential to be able to analyze these large datasets, convert them into useful knowledge, and transmit them through GIS-enabled instruments and the Internet, conveying the key information to business decision-makers effectively. In this research, we combine the techniques of GIS, spatial decision support systems (SDSS), spatial data mining (SDM), and ArcGIS Server to achieve the following goals: (1) integrate databases from spatial and non-spatial datasets about the locations of businesses in Taipei, Taiwan; (2) use association rules, one of the SDM methods, to extract knowledge from the integrated databases; and (3) develop a Web-based SDSS GIService, built on ArcGIS Server, as a location-selection tool for businesses.
NASA Astrophysics Data System (ADS)
Heather, David; Besse, Sebastien; Vallat, Claire; Barbarisi, Isa; Arviset, Christophe; De Marchi, Guido; Barthelemy, Maud; Coia, Daniela; Costa, Marc; Docasal, Ruben; Fraga, Diego; Grotheer, Emmanuel; Lim, Tanya; MacFarlane, Alan; Martinez, Santa; Rios, Carlos; Vallejo, Fran; Saiz, Jaime
2017-04-01
The Planetary Science Archive (PSA) is the European Space Agency's (ESA) repository of science data from all planetary science and exploration missions. The PSA provides access to scientific datasets through various interfaces at http://psa.esa.int. All datasets are scientifically peer-reviewed by independent scientists and are compliant with the Planetary Data System (PDS) standards. The PSA is currently implementing a number of significant improvements, mostly driven by the evolution of the PDS standard and the growing need for better interfaces and advanced applications to support science exploitation. As of the end of 2016, the PSA hosts data from all of ESA's planetary missions. This includes ESA's first planetary mission, Giotto, which encountered comet 1P/Halley in 1986 with a flyby at 800 km. Science data from Venus Express, Mars Express, Huygens and the SMART-1 mission are also all available at the PSA. The PSA also contains all science data from Rosetta, which explored comet 67P/Churyumov-Gerasimenko and the asteroids Steins and Lutetia. The year 2016 saw the arrival of the ExoMars 2016 data in the archive. In the upcoming years, at least three new projects are foreseen to be fully archived at the PSA. The BepiColombo mission is scheduled for launch in 2018, followed by the ExoMars Rover and Surface Platform (RSP) in 2020, and then the JUpiter ICy moons Explorer (JUICE). All of these will archive their data in the PSA. In addition, a few ground-based support programmes are also available, especially for the Venus Express and Rosetta missions. The newly designed PSA will enhance the user experience and significantly reduce the complexity for users of finding their data, promoting one-click access to the scientific datasets, with more customized views when needed. This includes better integration with planetary GIS analysis tools and planetary interoperability services (to search and retrieve data, supporting e.g. PDAP and EPN-TAP). It will also be kept up to date with versions 3 and 4 of the PDS standards, as PDS4 will be used for ESA's ExoMars and upcoming BepiColombo missions. Users will have direct access to documentation, information and tools that are relevant to the scientific use of a dataset, including ancillary datasets, Software Interface Specification (SIS) documents, and any tools/help that the PSA team can provide. The new PSA interface was released in January 2017. The home page provides direct and simple access to the scientific data, aiming to help scientists discover and explore its content. The archive can be explored through a set of parameters that allow the selection of products in space and time, and quick views provide the information needed for the selection of appropriate scientific products. During 2017, the PSA team will focus their efforts on developing a map search interface using GIS technologies to display ESA planetary datasets, an image gallery providing navigation through images to explore the datasets, and interoperability with international partners. This will be done in parallel with making additional metadata (e.g., geometry) searchable through the interface, and with dedicated work to improve the content covering 20 years of space exploration.
Analysis, Mining and Visualization Service at NCSA
NASA Astrophysics Data System (ADS)
Wilhelmson, R.; Cox, D.; Welge, M.
2004-12-01
NCSA's goal is to create a balanced system that fully supports high-end computing as well as: 1) high-end data management and analysis; 2) visualization of massive, highly complex data collections; 3) large databases; 4) geographically distributed Grid computing; and 5) collaboratories, all based on a secure computational environment and driven with workflow-based services. To this end, NCSA has defined a new technology path that includes the integration and provision of cyberservices in support of data analysis, mining, and visualization. NCSA has begun to develop and apply a data mining system, NCSA Data-to-Knowledge (D2K), in conjunction with both the application and research communities. NCSA D2K will enable the formation of model-based application workflows and visual programming interfaces for rapid data analysis. The Java-based D2K framework, which integrates analytical data mining methods with data management, data transformation, and information visualization tools, will be configurable from the cyberservices (web and grid services, tools, etc.) viewpoint to solve a wide range of important data mining problems. This effort will use modules, such as new classification methods for the detection of high-risk geoscience events, and existing D2K data management, machine learning, and information visualization modules. A D2K cyberservices interface will be developed to seamlessly connect client applications with remote back-end D2K servers, providing computational resources for data mining and integration with local or remote data stores. This work is being coordinated with SDSC's data and services efforts. The new NCSA Visualization embedded workflow environment (NVIEW) will be integrated with D2K functionality to tightly couple informatics and scientific visualization with the data analysis and management services. Visualization services will access and filter disparate data sources, simplifying tasks such as fusing related data from distinct sources into a coherent visual representation. This approach enables collaboration among geographically dispersed researchers via portals and front-end clients, and the coupling with data management services enables recording associations among datasets and building annotation systems into visualization tools and portals, giving scientists a persistent, shareable, virtual lab notebook. To facilitate provision of these cyberservices to the national community, NCSA will be providing a computational environment for large-scale data assimilation, analysis, mining, and visualization. This will initially be implemented on the new 512-processor shared-memory SGIs recently purchased by NCSA. In addition to standard batch capabilities, NCSA will provide on-demand capabilities for those projects requiring rapid response (e.g., development of severe weather, earthquake events) for decision makers. It will also be used for non-sequential interactive analysis of datasets where it is important to have access to large data volumes over space and time.
SpecTIR and SEBASS analysis of the National Mining District, Humboldt County, Nevada
NASA Astrophysics Data System (ADS)
Morken, Todd O.
The purpose of this study was to evaluate the minerals and materials that could be uniquely identified and mapped from measurements made with airborne hyperspectral SpecTIR VNIR/SWIR and SEBASS TIR sensors over areas in the National Mining District. SpecTIR Corporation and Aerospace Corporation acquired hyperspectral measurements on June 26, 2008 using their ProSpecTIR and SEBASS sensors, respectively. In addition, the effects of vegetation, elevation, and the atmosphere on spectral measurements were evaluated to determine their impact on the data analysis and target identification. The National Mining District is located approximately 75 miles northeast of Winnemucca, Nevada, at the northern end of the Santa Rosa Mountains. Precious metal mining has been dormant in this area since the 1940s; however, with increased metal prices over the last decade, economic interest in the region has increased substantially. Buckskin Mountain has a preserved alteration assemblage that is exposed in topographically steep terrain, ideal for exploring what hydrothermal alteration products can be identified and mapped in these datasets. The Visible Near Infrared (VNIR), Short Wave Infrared (SWIR), and Long Wave Infrared (LWIR) hyperspectral datasets were used to identify and map kaolinite, alunite, quartz, opal, and illite/muscovite, all of which are useful exploration target identifiers and can indicate regions of alteration. The mapping results were then combined with and compared to other geospatial data in a geographic information systems (GIS) database. The TIR hyperspectral data provided significant additional information and demonstrated its usefulness as an additional tool for geological exploration.
NASA Astrophysics Data System (ADS)
Gibril, Mohamed Barakat A.; Idrees, Mohammed Oludare; Yao, Kouame; Shafri, Helmi Zulhaidi Mohd
2018-01-01
The growing use of optimization for geographic object-based image analysis, and the possibility of deriving a wide range of information about the image in textual form, make machine learning (data mining) a versatile tool for information extraction from multiple data sources. This paper presents an application of data mining to land-cover classification by fusing SPOT-6, RADARSAT-2, and derived datasets. First, the images and other derived indices (normalized difference vegetation index, normalized difference water index, and soil-adjusted vegetation index) were combined and subjected to a segmentation process, with optimal segmentation parameters obtained using a combination of spatial and Taguchi statistical optimization. The image objects, which carry all the attributes of the input datasets, were extracted and related to the target land-cover classes through data mining algorithms (decision tree) for classification. To evaluate the performance, the result was compared with two nonparametric classifiers: support vector machine (SVM) and random forest (RF). Furthermore, the decision tree classification result was evaluated against six unoptimized trials segmented using arbitrary parameter combinations. The results show that the optimized process produces better land-use/land-cover classification, with an overall classification accuracy of 91.79%, against 87.25% and 88.69% for SVM and RF, respectively, while the six unoptimized classifications yield overall accuracies between 84.44% and 88.08%. The higher accuracy of the optimized data mining classification approach compared to the unoptimized results indicates that the optimization process has a significant impact on classification quality.
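The classifier comparison itself can be sketched with scikit-learn stand-ins; the synthetic features below replace the segmented image objects, and the printed accuracies are not the paper's figures.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # Synthetic stand-in for per-object attributes and land-cover labels.
    X, y = make_classification(n_samples=1500, n_features=12, n_informative=8,
                               n_classes=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

    for name, model in [("decision tree", DecisionTreeClassifier(random_state=0)),
                        ("SVM", SVC(kernel="rbf", C=10.0)),
                        ("random forest", RandomForestClassifier(random_state=0))]:
        model.fit(X_tr, y_tr)
        acc = accuracy_score(y_te, model.predict(X_te))
        print(f"{name}: overall accuracy = {acc:.4f}")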
A novel artificial immune clonal selection classification and rule mining with swarm learning model
NASA Astrophysics Data System (ADS)
Al-Sheshtawi, Khaled A.; Abdul-Kader, Hatem M.; Elsisi, Ashraf B.
2013-06-01
Metaheuristic optimisation algorithms have become a popular choice for solving complex problems. By integrating the artificial immune clonal selection algorithm (CSA) and the particle swarm optimisation (PSO) algorithm, a novel hybrid Clonal Selection Classification and Rule Mining with Swarm Learning Algorithm (CS2) is proposed. The main goal of the approach is to exploit and explore the parallel computation merit of clonal selection and the speed and self-organisation merits of particle swarm by sharing information between the clonal selection population and the particle swarm. Hence, we employ the advantages of PSO to improve the mutation mechanism of the artificial immune CSA and to mine classification rules within datasets. Consequently, our proposed algorithm requires less training time and fewer memory cells in comparison to other AIS algorithms. In this paper, classification rule mining is modelled as a multiobjective optimisation problem, with predictive accuracy and comprehensibility as objectives. The multiobjective approach is intended to allow the PSO algorithm to return an approximation to the accuracy-comprehensibility border, containing solutions that are spread across the border. We compared the classification accuracy of CS2 with that of five commonly used CSAs, namely AIRS1, AIRS2, AIRS-Parallel, CLONALG, and CSCA, using eight benchmark datasets. We also compared the classification accuracy of CS2 with five other methods, namely Naïve Bayes, SVM, MLP, CART, and RBF. The results show that the proposed algorithm is comparable to the 10 studied algorithms. The hybridisation of CSA and PSO thus lets each technique contribute its strengths, compensate for the other's weaknesses, and improve both search quality and speed.
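A heavily simplified sketch of the integration pattern, under invented constants and a toy objective: the clones of an antibody are perturbed with a PSO-style velocity toward personal and global bests instead of purely random hypermutation. This illustrates the CSA-plus-PSO idea only; it is not the paper's CS2 algorithm.

    import numpy as np

    rng = np.random.default_rng(3)

    def fitness(x):
        return -np.sum((x - 0.5) ** 2)       # toy objective: peak at 0.5

    dim, n_clones, iters = 8, 20, 50
    x = rng.random(dim)                      # current antibody
    v = np.zeros((n_clones, dim))            # one PSO velocity per clone
    pbest = np.tile(x, (n_clones, 1))
    gbest = x.copy()

    for _ in range(iters):
        clones = np.tile(x, (n_clones, 1))
        r1, r2 = rng.random((2, n_clones, dim))
        # PSO update replaces the usual random hypermutation of CSA clones.
        v = 0.7 * v + 1.5 * r1 * (pbest - clones) + 1.5 * r2 * (gbest - clones)
        clones = np.clip(clones + v, 0, 1)
        scores = np.array([fitness(c) for c in clones])
        better = scores > np.array([fitness(p) for p in pbest])
        pbest[better] = clones[better]
        if scores.max() > fitness(gbest):
            gbest = clones[scores.argmax()].copy()
        x = gbest                            # elitist selection of best clone

    print("best solution:", np.round(gbest, 3))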
Driscoll, Heather E; Murray, Janet M; English, Erika L; Hunter, Timothy C; Pivarski, Kara; Dolci, Elizabeth D
2017-08-01
Here we describe microarray expression data (raw and normalized), experimental metadata, and gene-level data with expression statistics from Saccharomyces cerevisiae exposed to simulated asbestos mine drainage from the Vermont Asbestos Group (VAG) Mine on Belvidere Mountain in northern Vermont, USA. For nearly 100 years (between the late 1890s and 1993), chrysotile asbestos fibers were extracted from serpentinized ultramafic rock at the VAG Mine for use in construction and manufacturing industries. Studies have shown that water courses and streambeds nearby have become contaminated with asbestos mine tailings runoff, including elevated levels of magnesium, nickel, chromium, and arsenic, elevated pH, and chrysotile asbestos-laden mine tailings, due to leaching and gradual erosion of massive piles of mine waste covering approximately 9 km². We exposed yeast to simulated VAG Mine tailings leachate to help gain insight on how eukaryotic cells exposed to VAG Mine drainage may respond in the mine environment. Affymetrix GeneChip® Yeast Genome 2.0 Arrays were utilized to assess gene expression after 24-h exposure to simulated VAG Mine tailings runoff. The chemistry of mine-tailings leachate, mine-tailings leachate plus yeast extract peptone dextrose media, and control yeast extract peptone dextrose media is also reported. To our knowledge this is the first dataset to assess global gene expression patterns in a eukaryotic model system simulating asbestos mine tailings runoff exposure. Raw and normalized gene expression data are accessible through the National Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO) Database Series GSE89875 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89875).
Song, Min
2016-01-01
In biomedicine, the scientific literature is a valuable source for knowledge discovery. Mining knowledge from textual data has become an ever more important task as the volume of scientific literature grows unprecedentedly. In this paper, we propose a framework for examining a certain disease based on existing information provided by the scientific literature. Disease-related entities, including diseases, drugs, and genes, are systematically extracted and analyzed using a three-level network-based approach. A paper-entity network and an entity co-occurrence network (macro level) are explored and used to construct six entity-specific networks (meso level). Important diseases, drugs, and genes, as well as salient entity relations (micro level), are identified from these networks. Results obtained from this literature-based mining can assist clinical applications. PMID:27195695
The non-participation of organic sulphur in acid mine drainage generation
Casagrande, D.J.; Finkelman, R.B.; Caruccio, F.T.
1989-01-01
Acid mine drainage is commonly associated with land disturbances that encounter and expose iron sulphides to oxidising atmospheric conditions. The attendant acidic conditions solubilise a host of trace metals. Within this flow regime the potential exists to contaminate surface drinking water supplies with a variety of trace materials. Accordingly, in evaluating applications for mines located in the headwaters of watersheds, the pre-mining prediction of the occurrence of acid mine drainage is of paramount importance. There is general agreement among investigators that coal organic sulphur is a non-participant in acid mine drainage generation; however, there is no scientific documentation to support this consensus. Using simulated weathering, kinetic, mass balance and petrographic analyses and a peroxide oxidation procedure, coal organic sulphur is shown to be a non-participant in acid mine drainage generation. Calculations for assessing the acid-generating potential of a sedimentary rock should not include the organic sulphur content. © 1989 Science and Technology Letters.
Developing an EarthCube Governance Structure for Big Data Preservation and Access
NASA Astrophysics Data System (ADS)
Leetaru, H. E.; Leetaru, K. H.
2012-12-01
The underlying vision of the NSF EarthCube initiative is of an enduring resource serving the needs of the earth sciences for today and the future. We must therefore view this effort through the lens of what the earth sciences will need tomorrow and of how the underlying processes of data compilation, preservation, and access interplay with the scientific processes within the communities EarthCube will serve. Key issues that must be incorporated into the EarthCube governance structure include authentication, retrieval, and unintended use cases; the emerging role of whole-corpus data mining; and how inventory, citation, and archive practices will impact the ability of scientists to use EarthCube's collections into the future. According to the National Academies, the US federal government spends over $140 billion a year in support of the nation's research base. Yet a critical issue confronting all of the major scientific disciplines in building upon this investment is the lack of processes that guide how data are preserved for the long term, ensuring that studies can be replicated and that experimental data remain accessible as new analytic methods become available or theories evolve. As datasets are used years or even decades after their creation, far richer metadata are needed to describe the underlying simulation, smoothing algorithms or bounding parameters of the data collection process. This is even truer as data are increasingly used outside their intended disciplines, as geoscience researchers apply algorithms from one discipline to datasets from another, where their analytical techniques may make extensive assumptions about the data. As science becomes increasingly interdisciplinary and emerging computational approaches begin applying whole-corpus methodologies and blending multiple archives together, we are facing new data access modalities distinct from the needs of the past, drawing into focus the question of centralized versus distributed architectures. In the past, geoscience data have been distributed, with each site maintaining its own collections and centralized inventory metadata supporting discovery. This was based on the historical search-browse-download modality, where access primarily meant downloading a copy to a researcher's own machine and datasets were measured in gigabytes. New "big data" approaches to the geosciences are already demonstrating the need to analyze the entirety of multiple collections from multiple sites totaling hundreds of terabytes in size. Yet datasets are outpacing the ability of networks to share them, forcing a new paradigm in high-performance computing where computation must migrate to centralized data stores. The next generation of geoscientists is going to need a system designed for exploring and understanding data from multiple scientific domains and vantages, where data are preserved for decades. We are not alone in this endeavor, and there are many lessons we can learn from similar initiatives, such as more than 40 years of governance policies for data warehouses and 15 years of open web archives, all of which face the same challenges. The entire EarthCube project will fail if the new governance structure does not account for the needs of an integrated cyberinfrastructure that allows big data to be stored, archived, analyzed, and made accessible to large numbers of scientists.
Advancing Science through Mining Libraries, Ontologies, and Communities
Evans, James A.; Rzhetsky, Andrey
2011-01-01
Life scientists today cannot hope to read everything relevant to their research. Emerging text-mining tools can help by identifying topics and distilling statements from books and articles with increased accuracy. Researchers often organize these statements into ontologies, consistent systems of reality claims. Like scientific thinking and interchange, however, text-mined information (even when accurately captured) is complex, redundant, sometimes incoherent, and often contradictory: it is rooted in a mixture of only partially consistent ontologies. We review work that models scientific reason and suggest how computational reasoning across ontologies and the broader distribution of textual statements can assess the certainty of statements and the process by which statements become certain. With the emergence of digitized data regarding networks of scientific authorship, institutions, and resources, we explore the possibility of accounting for social dependences and cultural biases in reasoning models. Computational reasoning is starting to fill out ontologies and flag internal inconsistencies in several areas of bioscience. In the not too distant future, scientists may be able to use statements and rich models of the processes that produced them to identify underexplored areas, resurrect forgotten findings and ideas, deconvolute the spaghetti of underlying ontologies, and synthesize novel knowledge and hypotheses. PMID:21566119
Sahadevan, S; Hofmann-Apitius, M; Schellander, K; Tesfaye, D; Fluck, J; Friedrich, C M
2012-10-01
In biological research, establishing the prior art by searching and collecting information already present in the domain is as important as the experiments themselves. To obtain a complete overview of the relevant knowledge, researchers mainly rely on two major information sources: i) various biological databases and ii) scientific publications in the field. The major difference between the two information sources is that information from databases is typically well structured and condensed, whereas the information content of the scientific literature is vastly unstructured; that is, dispersed among the many different sections of scientific text. The traditional method of information extraction from the scientific literature is to generate a list of relevant publications in the field of interest and manually scan these texts for relevant information, which is very time consuming. It is more than likely that with this "classical" approach the researcher misses some relevant information mentioned in the literature or has to go through biological databases to extract further information. Text mining and named entity recognition methods have already been used in human genomics and related fields as a solution to this problem. These methods can process and extract information from large volumes of scientific text. Text mining is defined as the automatic extraction of previously unknown and potentially useful information from text. Named entity recognition (NER) is defined as the method of identifying named entities (names of real-world objects; for example, gene/protein names, drugs, enzymes) in text. In the animal sciences, text mining and related methods have been used only briefly, in murine genomics and associated fields, leaving behind other fields such as livestock genomics. The aim of this work was to develop an information retrieval platform in the livestock domain, focusing on livestock publications and the recognition of relevant data from cattle and pigs. For this purpose, the rather noncomprehensive resources of pig and cattle gene and protein terminologies were enriched with orthologue synonyms and integrated into the NER platform ProMiner, which is successfully used in the human genomics domain. Based on the performance tests done, the present system achieved a fair performance, with a precision of 0.64, a recall of 0.74, and an F1 measure of 0.69 in a test scenario based on cattle literature.
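As a quick arithmetic check, the harmonic mean of the reported precision and recall reproduces the stated F1:

    precision, recall = 0.64, 0.74
    f1 = 2 * precision * recall / (precision + recall)
    print(round(f1, 2))   # 0.69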
A hybrid organic-inorganic perovskite dataset
NASA Astrophysics Data System (ADS)
Kim, Chiho; Huan, Tran Doan; Krishnan, Sridevi; Ramprasad, Rampi
2017-05-01
Hybrid organic-inorganic perovskites (HOIPs) have been attracting a great deal of attention due to their versatility of electronic properties and fabrication methods. We prepare a dataset of 1,346 HOIPs, which features 16 organic cations, 3 group-IV cations and 4 halide anions. Using a combination of an atomic structure search method and density functional theory calculations, the optimized structures, the bandgap, the dielectric constant, and the relative energies of the HOIPs are uniformly prepared and validated by comparing with relevant experimental and/or theoretical data. We make the dataset available at Dryad Digital Repository, NoMaD Repository, and Khazana Repository (http://khazana.uconn.edu/), hoping that it could be useful for future data-mining efforts that can explore possible structure-property relationships and phenomenological models. Progressive extension of the dataset is expected as new organic cations become appropriate within the HOIP framework, and as additional properties are calculated for the new compounds found.
Decision tree methods: applications for classification and prediction.
Song, Yan-Yan; Lu, Ying
2015-04-25
Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that construct an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently deal with large, complicated datasets without imposing a complicated parametric structure. When the sample size is large enough, study data can be divided into training and validation datasets: the training dataset is used to build a decision tree model, and the validation dataset to decide on the appropriate tree size needed to achieve the optimal final model. This paper introduces frequently used algorithms for developing decision trees (including CART, C4.5, CHAID, and QUEST) and describes the SPSS and SAS programs that can be used to visualize tree structure.
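A minimal sketch of the train/validation procedure described above, using scikit-learn's CART-style tree on synthetic data; the depth grid stands in for "tree size" and all numbers are illustrative.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.33, random_state=1)

    # Grow trees of increasing size on the training set; pick the depth
    # that performs best on the held-out validation set.
    best_depth, best_score = None, -np.inf
    for depth in range(1, 16):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
        tree.fit(X_train, y_train)
        score = tree.score(X_valid, y_valid)
        if score > best_score:
            best_depth, best_score = depth, score

    print(f"selected tree depth: {best_depth} "
          f"(validation accuracy {best_score:.3f})")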
Multicriteria analysis for sources of renewable energy using data from remote sensing
NASA Astrophysics Data System (ADS)
Matejicek, L.
2015-04-01
Renewable energy sources are major components of the strategy to reduce harmful emissions and to replace depleting fossil energy resources. Data from remote sensing can provide information for multicriteria analysis of sources of renewable energy. Advanced land cover quantification makes it possible to search for suitable sites. Multicriteria analysis, together with other data, is used to determine the energy potential and social acceptability of suggested locations. The described case study focuses on an area of surface coal mines in the northwestern region of the Czech Republic, where the impacts of surface mining and reclamation constitute a dominant force in land cover changes. High-resolution satellite images represent the main input datasets for identification of suitable sites. Solar mapping, wind predictions, the location of weirs in watersheds, road maps and demographic information complement the data from remote sensing for multicriteria analysis, which is implemented in a geographic information system (GIS). The input spatial datasets for multicriteria analysis in GIS are reclassified to a common scale and processed with raster algebra tools to identify suitable sites for sources of renewable energy. The selection of suitable sites is limited by the CORINE land cover database to mining and agricultural areas. The case study covers long-term land cover changes over the 1985-2015 period. Multicriteria analysis based on CORINE data shows moderate changes in the mapping of suitable sites for utilization of selected sources of renewable energy in 1990, 2000, 2006 and 2012. The results are map layers showing the energy potential on a scale of seven preference classes (1-7), where the first class corresponds to minimum preference and the last class to maximum preference. The attached histograms show moderate variability of the preference classes due to land cover changes caused by mining activities. The results also show a slight increase in the more preferred classes for utilization of sources of renewable energy due to an increased area of reclaimed sites. Using data from remote sensing, such as multispectral images and the CORINE land cover datasets, can reduce the financial resources currently required for finding and assessing suitable areas.
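The reclassify-to-a-common-scale and raster-algebra overlay step might look like the NumPy sketch below; the rasters, weights, class breaks, and the CORINE stand-in mask are all invented for illustration.

    import numpy as np

    rng = np.random.default_rng(7)
    solar = rng.uniform(900, 1300, size=(100, 100))    # kWh/m2/yr
    wind = rng.uniform(2.0, 8.0, size=(100, 100))      # m/s
    road_dist = rng.uniform(0, 5000, size=(100, 100))  # m to nearest road

    def reclassify(raster, breaks):
        """Map raster values onto the common 1-7 preference scale."""
        return np.digitize(raster, breaks) + 1

    # Weighted overlay of the reclassified input layers (weights sum to 1).
    suitability = (0.5 * reclassify(solar, np.linspace(950, 1250, 6)) +
                   0.3 * reclassify(wind, np.linspace(2.5, 7.5, 6)) +
                   0.2 * reclassify(-road_dist, np.linspace(-4500, -500, 6)))

    # Restrict the result to mining/agricultural pixels, as with CORINE.
    corine_mask = rng.random((100, 100)) < 0.4
    preference = np.where(corine_mask, np.rint(suitability), 0).astype(int)
    print(np.bincount(preference.ravel()))   # pixel counts per class 0..7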
USDA-ARS?s Scientific Manuscript database
Tomato Functional Genomics Database (TFGD; http://ted.bti.cornell.edu) provides a comprehensive systems biology resource to store, mine, analyze, visualize and integrate large-scale tomato functional genomics datasets. The database is expanded from the previously described Tomato Expression Database...
Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value.
Bishop, Dorothy V M; Thompson, Paul A
2016-01-01
Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions. The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
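The ghost-variable simulation translates naturally from the paper's R code into Python: each simulated experiment measures several dependent variables under the null and reports only the smallest p-value when it clears .05. The sample sizes and counts below are invented, not the paper's settings.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments, n_per_group, k = 20000, 30, 5

    reported = []
    for _ in range(n_experiments):
        a = rng.normal(size=(n_per_group, k))   # k uncorrelated DVs, group A
        b = rng.normal(size=(n_per_group, k))   # no true effect in group B
        p = stats.ttest_ind(a, b).pvalue        # one t-test per DV
        if p.min() < 0.05:
            reported.append(p.min())            # publish only the 'best' result

    reported = np.asarray(reported)
    counts = np.histogram(reported, bins=np.arange(0, 0.055, 0.005))[0]
    print(counts)  # with uncorrelated DVs, no bump appears just below .05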
The purpose of this workshop is to provide a forum for the exchange of scientific infomation on current approaches for assessing the characterization, monitoring, treatment and/or remediation of impacts on aquatic ecosystems including pit lakes from mining-related contamination i...
Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou
2006-01-01
Background The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying fungal genes. The approach involves elucidating the enzyme classification rules hidden in the UniProt protein knowledgebase and then applying them for classification. The association algorithm Apriori is utilized to mine the relationship between enzyme classes and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. Results There were five datasets collected from Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% for the eukaryote datasets. The fungus training dataset, which lacks enzyme class descriptions, was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set; these were otherwise poorly annotated by their functional descriptions. Conclusion The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may also be employed by protein annotators in manual annotation or implemented in an automatic annotation flowchart. PMID:16776838
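The rule-mining step can be sketched with mlxtend's Apriori implementation standing in for the authors'; the protein "transactions" pairing InterPro entries with EC classes below are invented for illustration.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # Each 'transaction' is one protein: its InterPro entries plus its EC class.
    proteins = [
        ["IPR000719", "IPR008271", "EC:2.7"],
        ["IPR000719", "IPR017441", "EC:2.7"],
        ["IPR013785", "EC:4.1"],
        ["IPR000719", "IPR008271", "EC:2.7"],
        ["IPR013785", "IPR000741", "EC:4.1"],
    ]

    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(proteins).transform(proteins),
                          columns=te.columns_)
    frequent = apriori(onehot, min_support=0.3, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.9)

    # Keep rules whose consequent is an enzyme class: InterPro -> EC.
    ec_rules = rules[rules["consequents"].apply(
        lambda s: any(i.startswith("EC:") for i in s))]
    print(ec_rules[["antecedents", "consequents", "support", "confidence"]])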
Kavakiotis, Ioannis; Xochelli, Aliki; Agathangelidis, Andreas; Tsoumakas, Grigorios; Maglaveras, Nicos; Stamatopoulos, Kostas; Hadzidimitriou, Anastasia; Vlahavas, Ioannis; Chouvarda, Ioanna
2016-06-06
Somatic hypermutation (SHM) refers to the introduction of mutations within rearranged V(D)J genes, a process that increases the diversity of immunoglobulins (IGs). The analysis of SHM has offered critical insight into the physiology and pathology of B cells, leading to strong prognostic markers of clinical outcome in chronic lymphocytic leukaemia (CLL), the most frequent adult B-cell malignancy. In this paper we present a methodology for integrating multiple immunogenetic and clinicobiological data sources in order to extract features and create high-quality datasets for SHM analysis in the IG receptors of CLL patients. This dataset is used as the basis for a higher-level integration procedure, inspired by social choice theory. This is applied in the Towards Analysis, our attempt to investigate the potential ontogenetic transformation of genes belonging to specific stereotyped CLL subsets towards other genes or gene families through SHM. The data integration process, followed by feature extraction, resulted in the generation of a dataset containing information about mutations occurring through SHM. The Towards analysis, performed on the integrated dataset using voting techniques, revealed the distinct behaviour of subset #201 compared to other subsets as regards SHM-related movements among gene clans, in both allele-conserved and non-conserved gene areas. With respect to movement between genes, a high percentage of movement towards pseudogenes was found in all CLL subsets. This data integration and feature extraction process can set the basis for exploratory analysis or a fully automated computational data mining approach to many as yet unanswered, clinically relevant biological questions.
RANWAR: rank-based weighted association rule mining from gene expression and methylation data.
Mallik, Saurav; Mukhopadhyay, Anirban; Maulik, Ujjwal
2015-01-01
Ranking of association rules is currently an interesting topic in data mining and bioinformatics. The huge number of rules over items (or genes) produced by association rule mining (ARM) algorithms can overwhelm the decision maker. In this article, we propose a weighted rule-mining technique (RANWAR, or rank-based weighted association rule mining) that ranks the rules using two novel rule-interestingness measures, viz., the rank-based weighted condensed support (wcs) and weighted condensed confidence (wcc) measures, to bypass this problem. These measures depend on the ranks of the items (genes); using the rank, we assign a weight to each item. RANWAR generates far fewer frequent itemsets than state-of-the-art association rule mining algorithms, saving execution time. We ran RANWAR on gene expression and methylation datasets. The genes of the top rules were biologically validated by Gene Ontology (GO) and KEGG pathway analyses. Many top-ranked rules extracted by RANWAR that hold poor ranks in traditional Apriori are highly biologically significant to the related diseases. Finally, the top rules produced by RANWAR that are not found by Apriori are reported.
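A loose sketch of the rank-weighting idea: items receive weights from their ranks, and a rule's support is scaled by the mean weight of its items. The weighting scheme and data below are illustrative stand-ins; the paper's actual wcs/wcc formulas are not reproduced here.

    import numpy as np

    genes = ["g1", "g2", "g3", "g4"]
    rank = {"g1": 1, "g2": 2, "g3": 3, "g4": 4}   # 1 = most significant gene
    n = len(genes)
    weight = {g: (n - r + 1) / n for g, r in rank.items()}  # g1 -> 1.0, g4 -> 0.25

    transactions = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2", "g4"}, {"g2", "g3"}]

    def weighted_support(itemset):
        """Plain support scaled by the mean rank-based weight of the items."""
        w = np.mean([weight[g] for g in itemset])
        count = sum(itemset <= t for t in transactions)   # subset test
        return w * count / len(transactions)

    print(weighted_support({"g1", "g2"}))   # rank-weighted analogue of support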
Introduction to Agent Mining Interaction and Integration
NASA Astrophysics Data System (ADS)
Cao, Longbing
In recent years, more and more researchers have been involved in research on both agent technology and data mining. A clear disciplinary effort has emerged toward removing the boundary between them, that is, toward the interaction and integration of agent technology and data mining, which we refer to as agent mining, a new area. The marriage of agents and data mining is driven by challenges faced by both communities and by the need to develop more advanced intelligence, information processing and systems. This chapter presents an overall picture of agent mining from the perspective of positioning it as an emerging area. We summarize the main driving forces, complementary essence, disciplinary framework, applications, case studies, and trends and directions, as well as brief observations on agent-driven data mining, data mining-driven agents, and mutual issues in agent mining. Arguably, we draw the following conclusions: (1) agent mining is emerging as a new area in the scientific family; (2) both agent technology and data mining can greatly benefit from agent mining; and (3) agent mining promises additional advances in intelligent information processing and systems. However, as a new and open area, there are many issues waiting for research and development from theoretical, technological and practical perspectives.
New investigations of the October 1999 Hector Mine Earthquake surface rupture
NASA Astrophysics Data System (ADS)
Sousa, F.; Harvey, J. C.; Hudnut, K. W.; Akciz, S. O.; Stock, J. M.
2013-12-01
We report on new field- and computer-based investigations of the surface rupture of the October 16, 1999 Hector Mine Earthquake. In cooperation with the United States Marine Corps Air Ground Combat Center Twentynine Palms (MCAGCC), our team was allowed ground and aerial access to the extent of the surface rupture for limited times during October - December 2012. As far as we know, this was the first scientific access granted to the entire surface rupture since the immediate aftermath of the earthquake, and the first scientific access of any kind to some parts of the maximum slip zone since before the event. This locale is an excellent natural laboratory for detailed study of a major earthquake surface rupture because: 1) complete circumscription within the boundaries of MCAGCC severely limits both past and future human disruption of the rupture, particularly in the mountainous maximum slip zone; 2) a groundbreaking aerial LiDAR survey carried out six months after the earthquake was followed up by a higher density, wider swath LiDAR survey in May 2012, making the temporal evolution of this rupture perhaps the most completely physically documented of any major rupture; and 3) field investigation immediately following the event was followed up by computer-based offset measurements using the April 2000 LiDAR dataset, providing a database of published offset measurements. Due to time constraints imposed by MCAGCC, we focused our new research effort along the ~8 km long maximum slip zone of the rupture, roughly corresponding to the zone of >4 m dextral offset. Our investigation includes 1) walking this entire section of the fault and making >30 measurements of dextral slip while photo-documenting the current state of the rupture; 2) creating a difference raster for the entire 8 km maximum slip zone from exactly congruent DEMs made from the 2000 and 2012 LiDAR datasets; 3) documenting the fault traces with a Trimble GeoXH high precision handheld GPS unit (+/- 10 cm); 4) carrying out field checks of a small number of computer-based offset measurements made using the 2000 LiDAR dataset; and 5) high-resolution low-altitude (<100 m AGL) photography of the maximum slip zone during a helicopter overflight. To date, important results include 1) identification of two new maximum slip locations where features are offset 7.9 m +/- 0.5 m and 6.7 m +/- 0.5 m; 2) a database of >30 offset measurements (georeferenced and photo-documented) made by our team on the ground; 3) clear changes in fracture visibility in the field, with some fractures more visible, and others no longer visible, compared to the 1999-2000 studies; and 4) examples of a few field checks that both strongly agree and disagree with computer-based LiDAR offset measurements.
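As a hedged sketch of the DEM-differencing step described above (item 2 of the investigation), the following Python code subtracts two congruent elevation rasters and writes the change surface; the file names are hypothetical, and the grids are assumed to share extent, resolution and alignment, as the abstract states for the 2000 and 2012 LiDAR DEMs.

```python
# Hedged sketch: difference two congruent DEM rasters to expose vertical
# change between surveys. File names are hypothetical placeholders.
import numpy as np
import rasterio

with rasterio.open("dem_2000.tif") as a, rasterio.open("dem_2012.tif") as b:
    dem_a = a.read(1, masked=True)     # masked array honors nodata cells
    dem_b = b.read(1, masked=True)
    profile = a.profile                # reuse georeferencing for the output

diff = dem_b.astype(np.float64) - dem_a.astype(np.float64)  # positive = gain

profile.update(dtype="float64")
with rasterio.open("dem_difference.tif", "w", **profile) as dst:
    dst.write(diff.filled(np.nan), 1)  # nodata cells become NaN in the output
```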
Enrichment of Data Publications in Earth Sciences - Data Reports as a Missing Link
NASA Astrophysics Data System (ADS)
Elger, Kirsten; Bertelmann, Roland; Haberland, Christian; Evans, Peter L.
2015-04-01
During the past decade, the relevance of research data stewardship has risen significantly. Preservation and publication of scientific data for long-term use, including storage in adequate repositories, has been identified as a key issue by the scientific community as well as by bodies like research agencies. Essential for any kind of re-use is a proper description of the datasets. As a result of the increasing interest, data repositories have been developed and the included research data are accompanied by at least a minimum set of metadata. This metadata is useful for data discovery and a first insight into the content of a dataset, but data re-use often needs more extensive information. Many datasets are accompanied by a small 'readme' file with basic information on the data structure, or by other accompanying documents. A source of additional information could be an article published in one of the newly emerging data journals (e.g. Copernicus's ESSD Earth System Science Data or Nature's Scientific Data). Obviously there is an information gap between a 'readme' file, which is only accessible after data download (and often leads to less usage of published datasets than if the information were available beforehand), and the much larger effort of preparing an article for a peer-reviewed data journal. For many years, the GFZ German Research Centre for Geosciences has published 'Scientific Technical Reports (STR)', a report series that is electronically and persistently available and citable with assigned DOIs. This series was opened for the description of parallel published datasets as 'STR Data'. These are internally reviewed and offer a flexible publication format describing published data in depth, suitable for different datasets ranging from long-term monitoring time series of observatories to field data, (meta-)databases, and software publications. STR Data reports offer a full and consistent overview and description of all relevant parameters of a linked published dataset. These reports are readable and citable on their own, but are, of course, closely connected to the respective datasets. Therefore, they give full insight into the framework of the data before data download. This is especially relevant for large and often heterogeneous datasets, e.g. controlled-source seismic data gathered with instruments of the 'Geophysical Instrument Pool Potsdam GIPP'. Here, details of the instrumentation, data organization, data format, accuracy, geographical coordinates, timing and data completeness, etc. need to be documented. STR Data reports are also attractive for the publication of historic datasets, e.g. seismic experiments 30-40 years old. It is also possible for one STR Data report to describe several datasets, e.g. from multiple diverse instrument types, or distinct regions of interest. The publication of DOI-assigned data reports is a helpful tool to fill the gap between basic metadata and restricted 'readme' information on the one hand and extended journal articles on the other. They open the way for informed re-use and, with their comprehensive data description, may act as an 'appetizer' for the re-use of published datasets.
PubRunner: A light-weight framework for updating text mining results.
Anekalla, Kishore R; Courneya, J P; Fiorini, Nicolas; Lever, Jake; Muchow, Michael; Busby, Ben
2017-01-01
Biomedical text mining promises to assist biologists in quickly navigating the combined knowledge in their domain. This would allow improved understanding of the complex interactions within biological systems and faster hypothesis generation. New biomedical research articles are published daily and text mining tools are only as good as the corpus from which they work. Many text mining tools are underused because their results are static and do not reflect the constantly expanding knowledge in the field. In order for biomedical text mining to become an indispensable tool used by researchers, this problem must be addressed. To this end, we present PubRunner, a framework for regularly running text mining tools on the latest publications. PubRunner is lightweight, simple to use, and can be integrated with an existing text mining tool. The workflow involves downloading the latest abstracts from PubMed, executing a user-defined tool, pushing the resulting data to a public FTP or Zenodo dataset, and publicizing the location of these results on the public PubRunner website. We illustrate the use of this tool by re-running the commonly used word2vec tool on the latest PubMed abstracts to generate up-to-date word vector representations for the biomedical domain. This shows a proof of concept that we hope will encourage text mining developers to build tools that truly will aid biologists in exploring the latest publications.
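A minimal sketch of a PubRunner-style update cycle is shown below, assuming Biopython's Entrez utilities for the PubMed download; the query, the stand-in "tool", and the output file are placeholders and not part of PubRunner's actual code.

```python
# Sketch of a PubRunner-like cycle: fetch recent abstracts, run a
# user-defined tool over them, and save the result for publishing.
from Bio import Entrez

Entrez.email = "you@example.org"  # required by NCBI; placeholder address

def fetch_recent_abstracts(query, n=100):
    handle = Entrez.esearch(db="pubmed", term=query, retmax=n, sort="pub_date")
    ids = Entrez.read(handle)["IdList"]
    handle = Entrez.efetch(db="pubmed", id=",".join(ids),
                           rettype="abstract", retmode="text")
    return handle.read()

def user_defined_tool(text):
    return text.lower().split()            # stand-in for e.g. word2vec training

corpus = fetch_recent_abstracts('"text mining"[Title/Abstract]')
tokens = user_defined_tool(corpus)
with open("results_for_upload.txt", "w") as f:   # would then go to FTP/Zenodo
    f.write(" ".join(tokens[:1000]))
```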
A Genetic Algorithm That Exchanges Neighboring Centers for Fuzzy c-Means Clustering
ERIC Educational Resources Information Center
Chahine, Firas Safwan
2012-01-01
Clustering algorithms are widely used in pattern recognition and data mining applications. Due to their computational efficiency, partitional clustering algorithms are better suited for applications with large datasets than hierarchical clustering algorithms. K-means is among the most popular partitional clustering algorithms, but has a major…
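The abstract is truncated by the source, but fuzzy c-means itself is standard; below is a compact numpy sketch of the classical FCM iteration (not the thesis's genetic-exchange variant), alternating membership and center updates.

```python
# Classical fuzzy c-means: soft memberships U, fuzzifier m, iterative updates.
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)               # memberships sum to 1
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))              # inverse-distance weights
        U /= U.sum(axis=1, keepdims=True)           # renormalize per point
    return centers, U

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, U = fuzzy_c_means(X, c=2)
print(centers)
```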
Oztekin, Asil; Delen, Dursun; Kong, Zhenyu James
2009-12-01
Predicting the survival of heart-lung transplant patients has the potential to play a critical role in understanding and improving the matching procedure between recipient and graft. Although voluminous data related to transplantation procedures are being collected and stored, only a small subset of the predictive factors has been used in modeling heart-lung transplantation outcomes. Previous studies have mainly focused on applying statistical techniques to a small set of factors selected by domain experts in order to reveal simple linear relationships between the factors and survival. The collection of methods known as 'data mining' offers significant advantages over conventional statistical techniques in dealing with the latter's limitations, such as the normality assumption for observations, independence of observations from each other, and linearity of the relationship between the observations and the output measure(s). There are statistical methods that overcome these limitations, yet they are computationally more expensive and do not provide the fast and flexible solutions that data mining techniques offer on large datasets. The main objective of this study is to improve the prediction of outcomes following combined heart-lung transplantation by proposing an integrated data-mining methodology. A large and feature-rich dataset (16,604 cases with 283 variables) is used to (1) develop machine-learning-based predictive models and (2) extract the most important predictive factors. Then, using three different variable selection methods, namely (i) machine-learning-driven variables (using decision trees, neural networks and logistic regression), (ii) literature-review-based expert-defined variables, and (iii) common-sense-based interaction variables, a consolidated set of factors is generated and used to develop Cox regression models for heart-lung graft survival. The predictive models' performance in terms of 10-fold cross-validation accuracy rates for two multi-imputed datasets ranged from 79% to 86% for neural networks, from 78% to 86% for logistic regression, and from 71% to 79% for decision trees. The results indicate that the proposed integrated data mining methodology using Cox hazard models predicted graft survival better, with different variables, than the conventional approaches commonly used in the literature. This result is validated by comparing the corresponding gains charts for our proposed methodology and the literature-review-based Cox results, and by comparing the Akaike information criterion (AIC) values obtained from each. The data-mining-based methodology proposed in this study reveals that there are undiscovered relationships (i.e. interactions of the existing variables) among the survival-related variables, which helps better predict the survival of heart-lung transplants. It also brings a different set of variables into the scene to be evaluated by domain experts and considered prior to organ transplantation.
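As a hedged sketch of the model-comparison step, the following code runs 10-fold cross-validation for two of the classifier families named in the study; synthetic data stands in for the transplant registry, which is not reproduced here.

```python
# 10-fold cross-validation of two classifier families on stand-in data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("neural network", MLPClassifier(max_iter=500, random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```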
Anawar, Hossain Md
2015-08-01
The oxidative dissolution of sulfidic minerals releases extremely acidic leachate, sulfate and potentially toxic elements (e.g., As, Ag, Cd, Cr, Cu, Hg, Ni, Pb, Sb, Th, U, Zn) from different mine tailings and waste dumps. For the sustainable rehabilitation and disposal of mining waste, the sources and mechanisms of contaminant generation and the fate and transport of contaminants should be clearly understood. Therefore, this study provides a critical review of (1) recent insights into the mechanisms of oxidation of sulfidic minerals, (2) environmental contamination by mining waste, and (3) remediation and rehabilitation techniques, and (4) then develops the GEMTEC conceptual model/guide [(bio)geochemistry - mine type - mineralogy - geological texture - ore extraction process - climatic knowledge] to provide a new scientific approach and knowledge for the remediation of mining wastes and acid mine drainage. This study suggests pre-mining geological, geochemical, mineralogical and microtextural characterization of different mineral deposits, and post-mining studies of ore extraction processes, physical, geochemical, mineralogical and microbial reactions, natural attenuation and the effect of climate change, for the sustainable rehabilitation of mining waste. All components of this model should be considered for effective and integrated management of mining waste and acid mine drainage. Copyright © 2015 Elsevier Ltd. All rights reserved.
Data Mining Web Services for Science Data Repositories
NASA Astrophysics Data System (ADS)
Graves, S.; Ramachandran, R.; Keiser, K.; Maskey, M.; Lynnes, C.; Pham, L.
2006-12-01
The maturation of web services standards and technologies sets the stage for a distributed "Service-Oriented Architecture" (SOA) for NASA's next generation science data processing. This architecture will allow members of the scientific community to create and combine persistent distributed data processing services and make them available to other users over the Internet. NASA has initiated a project to create a suite of specialized data mining web services designed specifically for science data. The project leverages the Algorithm Development and Mining (ADaM) toolkit as its basis. The ADaM toolkit is a robust, mature and freely available science data mining toolkit that is being used by several research organizations and educational institutions worldwide. These mining services will give the scientific community a powerful and versatile data mining capability that can be used to create higher order products such as thematic maps from current and future NASA satellite data records with methods that are not currently available. The package of mining and related services is being developed using Web Services standards so that community-based measurement processing systems can access and interoperate with them. These standards-based services allow users different options for utilizing them, from direct remote invocation by a client application to deployment of a Business Process Execution Language (BPEL) solutions package where a complex data mining workflow is exposed to others as a single service. The ability to deploy and operate these services at a data archive allows the data mining algorithms to be run where the data are stored, a more efficient scenario than moving large amounts of data over the network. This will be demonstrated in a scenario in which a user uses a remote Web-Service-enabled clustering algorithm to create cloud masks from satellite imagery at the Goddard Earth Sciences Data and Information Services Center (GES DISC).
VizioMetrics: Mining the Scientific Visual Literature
ERIC Educational Resources Information Center
Lee, Po-Shen
2017-01-01
Scientific results are communicated visually in the literature through diagrams, visualizations, and photographs. In this thesis, we developed a figure processing pipeline to classify more than 8 million figures from PubMed Central into different figure types and study the resulting patterns of visual information as they relate to scholarly…
Federal Register 2010, 2011, 2012, 2013, 2014
2013-02-19
... Scientific Counselors, National Institute for Occupational Safety and Health (BSC, NIOSH) In accordance with..., research, experiments, and demonstrations relating to occupational safety and health and to mine health... Occupational Safety and Health on research and prevention programs. Specifically, the Board shall provide...
Federal Register 2010, 2011, 2012, 2013, 2014
2012-08-10
... Scientific Counselors, National Institute for Occupational Safety and Health (BSC, NIOSH) In accordance with..., research, experiments, and demonstrations relating to occupational safety and health and to mine health... Occupational Safety and Health on research and prevention programs. Specifically, the Board shall provide...
Automatic detection of referral patients due to retinal pathologies through data mining.
Quellec, Gwenolé; Lamard, Mathieu; Erginay, Ali; Chabouis, Agnès; Massin, Pascale; Cochener, Béatrice; Cazuguel, Guy
2016-04-01
With the increased prevalence of retinal pathologies, automating the detection of these pathologies is becoming more and more relevant. In the past few years, many algorithms have been developed for the automated detection of a specific pathology, typically diabetic retinopathy, using eye fundus photography. No matter how good these algorithms are, we believe many clinicians would not use automatic detection tools focusing on a single pathology and ignoring any other pathology present in the patient's retinas. To solve this issue, an algorithm for characterizing the appearance of abnormal retinas, as well as the appearance of the normal ones, is presented. This algorithm does not focus on individual images: it considers examination records consisting of multiple photographs of each retina, together with contextual information about the patient. Specifically, it relies on data mining in order to learn diagnosis rules from characterizations of fundus examination records. The main novelty is that the content of examination records (images and context) is characterized at multiple levels of spatial and lexical granularity: 1) spatial flexibility is ensured by an adaptive decomposition of composite retinal images into a cascade of regions, 2) lexical granularity is ensured by an adaptive decomposition of the feature space into a cascade of visual words. This multigranular representation allows for great flexibility in automatically characterizing normality and abnormality: it is possible to generate diagnosis rules whose precision and generalization ability can be traded off depending on data availability. A variation on usual data mining algorithms, originally designed to mine static data, is proposed so that contextual and visual data at adaptive granularity levels can be mined. This framework was evaluated in e-ophtha, a dataset of 25,702 examination records from the OPHDIAT screening network, as well as in the publicly-available Messidor dataset. It was successfully applied to the detection of patients that should be referred to an ophthalmologist and also to the specific detection of several pathologies. Copyright © 2016 Elsevier B.V. All rights reserved.
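The "visual words" component can be illustrated with a standard bag-of-visual-words sketch: local descriptors are quantized with k-means and each image becomes a normalized histogram over the learned vocabulary. The random descriptors below are stand-ins for real patch features, and this is the generic technique rather than the paper's adaptive cascade of visual words.

```python
# Bag-of-visual-words: quantize descriptors, then histogram per image.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5000, 64))      # stand-in patch features

vocab = KMeans(n_clusters=50, n_init=10, random_state=0).fit(descriptors)

def bag_of_visual_words(image_descriptors, vocab):
    words = vocab.predict(image_descriptors)   # nearest visual word per patch
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()                   # normalized word histogram

one_image = rng.normal(size=(120, 64))
print(bag_of_visual_words(one_image, vocab)[:10])
```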
Arguments for the need of mining education continuity and development in Romania
NASA Astrophysics Data System (ADS)
Bud, I.; Duma, S.; Pasca, I.; Gusat, D.
2018-01-01
Mining is considered the oldest conscious human activity. In the beginning, people searched for hard rocks in surface outcrops and used them to make weapons and ornaments. Subsequently, civilizations evolved through the development of infrastructure, buildings and monuments and, finally, weapons. All of these required mineral raw materials, obtained under increasingly difficult conditions through increasingly advanced techniques. In this way, the art of mining and metallurgy was born, which in time acquired scientific foundations. Mining was equally art and science. Both have been learned and taught in schools since ancient times, and continue to be taught today in large universities covering mining engineering, metallurgy, mining topography, mining environmental protection, and geology. Lately, in Romania, mining high-school education has reached a deadlock, and middle and vocational schooling has collapsed. The development of infrastructure, construction, etc. requires the exploitation and valorisation of mineral resources by specialists. The paper warns against the danger of losing tradition and skills in the formation of mining engineers and militates for the re-establishment of professional and technical schools.
75 FR 28281 - Notice of Proposed Withdrawal Extension and Public Meeting; Oregon
Federal Register 2010, 2011, 2012, 2013, 2014
2010-05-20
... approximately 11,675.51 acres of National Forest System land from mining in order to protect the scientific and... Natural Area. The withdrawal created by PLO No. 6880 will expire on September 29, 2011, unless extended... County, Oregon, from location and entry under the United States mining laws (30 U.S.C. ch. 2). The area...
Large Scale Data Mining to Improve Usability of Data: An Intelligent Archive Testbed
NASA Technical Reports Server (NTRS)
Ramapriyan, Hampapuram; Isaac, David; Yang, Wenli; Morse, Steve
2005-01-01
Research in certain scientific disciplines - including Earth science, particle physics, and astrophysics - continually faces the challenge that the volume of data needed to perform valid scientific research can at times overwhelm even a sizable research community. The desire to improve utilization of this data gave rise to the Intelligent Archives project, which seeks to make data archives active participants in a knowledge building system capable of discovering events or patterns that represent new information or knowledge. Data mining can automatically discover patterns and events, but it is generally viewed as unsuited for large-scale use in disciplines like Earth science that routinely involve very high data volumes. Dozens of research projects have shown promising uses of data mining in Earth science, but all of these are based on experiments with data subsets of a few gigabytes or less, rather than the terabytes or petabytes typically encountered in operational systems. To bridge this gap, the Intelligent Archives project is establishing a testbed with the goal of demonstrating the use of data mining techniques in an operationally-relevant environment. This paper discusses the goals of the testbed and the design choices surrounding critical issues that arose during testbed implementation.
Daily monitoring of the land surface of the Earth
NASA Astrophysics Data System (ADS)
Mascaro, J.
2016-12-01
Planet is an integrated aerospace and data analytics company that operates the largest fleet of Earth-imaging satellites. With more than 140 CubeSats successfully launched to date, Planet is now collecting approximately 10 million square kilometers of imagery per day (3-5 m per pixel, in red, green, blue and near-infrared spectral bands). By early 2017, Planet's constellation will image the entire land surface of the Earth on a daily basis. Due to investments in cloud storage and computing, approximately 75% of the imagery collected is available to Planet's partners within 24 hours of capture through an Application Program Interface. This unique dataset has enormous potential for monitoring the status of Earth's natural ecosystems, as well as human settlements and agricultural welfare. Through our Ambassadors Program, Planet has made data available for researchers in areas as disparate as human rights monitoring in refugee camps, assessments of the impact of hydroelectric installations, tracking of illegal gold mining in Amazon forests, and assessments of the status of the cryosphere. Here, we share early results from Planet's research partner network, including enhanced spatial and temporal resolution of NDVI data for agricultural health in Saudi Arabia, computation of rates of illegal deforestation in Southern Peru, estimates of tropical forest carbon stocks based on data integration with active sensors, and estimates of glacial flow rates. We synthesize the potentially enormous research and scientific value of Planet's persistent monitoring capability, and discuss methods by which the data will be disseminated to the scientific community.
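The NDVI products mentioned above follow the standard formula (NIR - Red) / (NIR + Red); here is a minimal numpy sketch, with small arrays standing in for PlanetScope pixel values.

```python
# NDVI from red and near-infrared bands; toy arrays stand in for real pixels.
import numpy as np

def ndvi(nir, red, eps=1e-9):
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)   # eps guards against divide-by-zero

red = np.array([[0.12, 0.10], [0.30, 0.25]])
nir = np.array([[0.45, 0.50], [0.35, 0.28]])
print(ndvi(nir, red))   # values near +1 indicate dense healthy vegetation
```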
Kokel, David; Rennekamp, Andrew J; Shah, Asmi H; Liebel, Urban; Peterson, Randall T
2012-08-01
For decades, studying the behavioral effects of individual drugs and genetic mutations has been at the heart of efforts to understand and treat nervous system disorders. High-throughput technologies adapted from other disciplines (e.g., high-throughput chemical screening, genomics) are changing the scale of data acquisition in behavioral neuroscience. Massive behavioral datasets are beginning to emerge, particularly from zebrafish labs, where behavioral assays can be performed rapidly and reproducibly in 96-well, high-throughput format. Mining these datasets and making comparisons across different assays are major challenges for the field. Here, we review behavioral barcoding, a process by which complex behavioral assays are reduced to a string of numeric features, facilitating analysis and comparison within and across datasets. Copyright © 2012 Elsevier Ltd. All rights reserved.
Anonymizing 1:M microdata with high utility
Gong, Qiyuan; Luo, Junzhou; Yang, Ming; Ni, Weiwei; Li, Xiao-Bai
2016-01-01
Preserving privacy and utility during data publishing and data mining is essential for individuals, data providers and researchers. However, studies in this area typically assume that one individual has only one record in a dataset, which is unrealistic in many applications. Having multiple records for an individual leads to new privacy leakages. We call such a dataset a 1:M dataset. In this paper, we propose a novel privacy model called (k, l)-diversity that addresses disclosure risks in 1:M data publishing. Based on this model, we develop an efficient algorithm named 1:M-Generalization to preserve privacy and data utility, and compare it with alternative approaches. Extensive experiments on real-world data show that our approach outperforms the state-of-the-art technique, in terms of data utility and computational cost. PMID:28603388
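Under a simplified reading of the model, where every quasi-identifier group must contain at least k distinct individuals and at least l distinct sensitive values, a (k, l)-style check over 1:M data can be sketched as below; the paper's formal definition should be consulted for the exact semantics, and the toy table is invented.

```python
# Simplified (k, l)-style check on 1:M microdata; illustrative only.
import pandas as pd

df = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3, 4],               # one person, many records (1:M)
    "zip_age":   ["A", "A", "A", "B", "B", "B"],   # generalized quasi-identifier
    "diagnosis": ["flu", "cold", "flu", "hiv", "flu", "cold"],  # sensitive value
})

def satisfies_kl(df, qi, sensitive, k, l):
    g = df.groupby(qi)
    individuals = g["person_id"].nunique()          # distinct people per group
    values = g[sensitive].nunique()                 # distinct sensitive values
    return bool((individuals >= k).all() and (values >= l).all())

print(satisfies_kl(df, "zip_age", "diagnosis", k=2, l=2))
```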
A Computational Approach to Qualitative Analysis in Large Textual Datasets
Evans, Michael S.
2014-01-01
In this paper I introduce computational techniques to extend qualitative analysis into the study of large textual datasets. I demonstrate these techniques by using probabilistic topic modeling to analyze a broad sample of 14,952 documents published in major American newspapers from 1980 through 2012. I show how computational data mining techniques can identify and evaluate the significance of qualitatively distinct subjects of discussion across a wide range of public discourse. I also show how examining large textual datasets with computational methods can overcome methodological limitations of conventional qualitative methods, such as how to measure the impact of particular cases on broader discourse, how to validate substantive inferences from small samples of textual data, and how to determine if identified cases are part of a consistent temporal pattern. PMID:24498398
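A minimal sketch of the probabilistic topic-modeling step, using scikit-learn's LDA on a toy corpus that stands in for the 14,952 newspaper documents analyzed in the paper:

```python
# LDA topic modeling on a toy corpus; prints top terms per topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the senate debated the science funding bill",
        "the telescope recorded a distant galaxy",
        "voters weighed the new education funding plan",
        "astronomers measured the galaxy's rotation"]

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = counts.get_feature_names_out()
for t, comp in enumerate(lda.components_):
    top = comp.argsort()[-4:][::-1]                # highest-weight terms
    print(f"topic {t}:", [terms[i] for i in top])
```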
Combination of minimum enclosing balls classifier with SVM in coal-rock recognition.
Song, QingJun; Jiang, HaiYan; Song, Qinghui; Zhao, XieGuang; Wu, Xiaoxuan
2017-01-01
Top-coal caving technology is a productive and efficient method in modern mechanized coal mining, and the study of coal-rock recognition is key to realizing automation in comprehensive mechanized coal mining. In this paper we propose a new discriminant analysis framework for coal-rock recognition. In the framework, a data acquisition model with vibration and acoustic signals is designed, and a caving dataset with 10 feature variables and three classes is obtained. The best combination of feature variables is decided automatically using multi-class F-score (MF-Score) feature selection. To handle the nonlinear mapping in this real-world optimization problem, an effective minimum enclosing ball (MEB) algorithm plus Support Vector Machine (SVM) is proposed for rapid detection of coal-rock in the caving process. In particular, we illustrate how to construct the MEB-SVM classifier for coal-rock recognition, whose data exhibit inherently complex distributions. The proposed method is examined on UCI data sets and the caving dataset, and compared with several recent SVM classifiers. We conduct experiments using accuracy and the Friedman test to compare multiple classifiers over multiple UCI data sets. Experimental results demonstrate that the proposed algorithm has good robustness and generalization ability. The results of experiments on the caving dataset show better performance, pointing to promising feature selection and multi-class recognition in coal-rock recognition.
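As a hedged sketch of the pipeline's overall shape, the code below scores features, keeps the best ones and classifies with an SVM; the ANOVA F-score stands in for the paper's multi-class F-score, plain SVC stands in for the MEB-SVM combination, and random data replaces the vibration/acoustic features.

```python
# Feature scoring + SVM classification; stand-ins for MF-Score and MEB-SVM.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

pipe = make_pipeline(SelectKBest(f_classif, k=6), SVC(kernel="rbf"))
print(cross_val_score(pipe, X, y, cv=5).mean())
```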
Gurulingappa, Harsha; Toldo, Luca; Rajput, Abdul Mateen; Kors, Jan A; Taweel, Adel; Tayrouz, Yorki
2013-11-01
The aim of this study was to assess the impact of automatically detected adverse event signals from text and open-source data on the prediction of drug label changes. Open-source adverse effect data were collected from the FAERS, Yellow Card and SIDER databases. A shallow linguistic relation extraction system (JSRE) was applied for the extraction of adverse effects from MEDLINE case reports. A statistical approach was applied to the extracted datasets for signal detection and the subsequent prediction of label changes issued for 29 drugs by the UK Regulatory Authority in 2009. 76% of drug label changes were automatically predicted; of these, 6% were detected only by text mining. JSRE enabled precise identification of four adverse drug events from MEDLINE that were otherwise undetectable. Changes in drug labels can be predicted automatically using data and text mining techniques. Text mining technology is mature and well placed to support pharmacovigilance tasks. Copyright © 2013 John Wiley & Sons, Ltd.
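The abstract does not name the signal-detection statistic; the proportional reporting ratio (PRR) below is one widely used choice in pharmacovigilance, shown purely for illustration with toy counts.

```python
# Proportional reporting ratio (PRR) on a 2x2 report-count table.
def prr(a, b, c, d):
    """a: drug & event, b: drug & other events,
       c: other drugs & event, d: other drugs & other events."""
    return (a / (a + b)) / (c / (c + d))

# Toy counts from a spontaneous-report database:
print(prr(a=20, b=480, c=100, d=9400))   # PRR > 2 is a common signal threshold
```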
Clinical diabetes research using data mining: a Canadian perspective.
Shah, Baiju R; Lipscombe, Lorraine L
2015-06-01
With the advent of the digitization of large amounts of information and the computer power capable of analyzing this volume of information, data mining is increasingly being applied to medical research. Datasets created for administration of the healthcare system provide a wealth of information from different healthcare sectors, and Canadian provinces' single-payer universal healthcare systems mean that data are more comprehensive and complete in this country than in many other jurisdictions. The increasing ability to also link clinical information, such as electronic medical records, laboratory test results and disease registries, has broadened the types of data available for analysis. Data-mining methods have been used in many different areas of diabetes clinical research, including classic epidemiology, effectiveness research, population health and health services research. Although methodologic challenges and privacy concerns remain important barriers to using these techniques, data mining remains a powerful tool for clinical research. Copyright © 2015 Canadian Diabetes Association. Published by Elsevier Inc. All rights reserved.
Due Diligence Processes for Public Acquisition of Mining-Impacted Landscapes
NASA Astrophysics Data System (ADS)
Martin, E.; Monohan, C.; Keeble-Toll, A. K.
2016-12-01
The acquisition of public land is critical for achieving conservation and habitat goals in rural regions projected to experience continuously high rates of population growth. To ensure that public funds are used responsibly in the purchase of conservation easements, appropriate due diligence processes must be established that limit landowner liability post-acquisition. Traditional methods of characterizing contamination in regions where legacy mining activities were prevalent may not utilize current scientific knowledge and understanding of contaminant fate, transport and bioavailability, and are therefore prone to Type II errors. Agency-prescribed assessment methods used under CERCLA in many cases fail to detect contamination that presents liability issues, because they do not require the water-quality sampling that would reveal the offsite transport potential of contaminants posing human health risks, including mercury. Historical analysis can be used to inform judgmental sampling to identify hotspots and contaminants of concern. Land acquisition projects at two historic mine sites in Nevada County, California, the Champion Mine Complex and the Black Swan Preserve, have established the necessity of re-thinking due diligence processes for mining-impacted landscapes. These pilot projects demonstrate that pre-acquisition assessment in the Gold Country must include judgmental sampling and evaluation of contaminant transport. Best practices using current scientific knowledge must be codified by agencies, consultants, and NGOs in order to ensure responsible use of public funds and to safeguard public health.
An environmental partnership - hawks and highwalls at the Jim Bridger mine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Harshbarger, R.M.
1996-12-31
Can industry and the environment coexist to the benefit of both? More specifically, can coal mining and the environment have beneficial partnerships? Based on information normally available in the popular press, a partnership would seem impossible. Millions of pages and thousands of hours of air time have been used to convey a misinformed message that mining is always detrimental to the environment. The message is clear and one-sided. Mining companies destroy the environment in their greed to extract mineral wealth from the ground. They leave barren, disturbed landscapes devoid of life. Mining and the environment cannot coexist. There can be no partnership. Bridger Coal Company's environmental programs, including the raptor mitigation program, demonstrate that mining companies in cooperation with regulatory agencies can utilize the latest scientific developments to protect environmental resources, maintain operational efficiency and, in the process, raise the industry standard.
Mining hydrogeological data from existing AEM datasets for mineral mining
NASA Astrophysics Data System (ADS)
Menghini, Antonio; Viezzoli, Andrea; Teatini, Pietro; Cattarossi, Andrea
2017-04-01
Large amounts of existing Airborne Electromagnetic (AEM) data are potentially available all over the world. Originally acquired for mining purposes, AEM data traditionally do not get processed in detail and inverted: most orebodies can be easily detected by analyzing just the peak anomaly directly evidenced by voltage values (the so-called "bump detection"). However, AEM acquisitions can be accurately re-processed and inverted to provide detailed 3D models of resistivity: a first step towards hydrogeological studies and modelling. This is a great opportunity especially for the African continent, where the detection of exploitable groundwater resources is a crucial issue. In many cases, a while after AEM data have been acquired by a mining company, governments become owners of those datasets and have the opportunity to develop detailed hydrogeological characterizations at very low cost. We report a case in which existing VTEM (Versatile Time Domain Electromagnetic - Geotech Ltd) data, originally acquired to detect gold deposits, are used to improve the hydrogeological knowledge of a roughly 50 km2 pilot-test area in Sierra Leone. Thanks to an accurate processing workflow and an advanced data inversion based on the Spatially Constrained Inversion (SCI) algorithm, we have been able to resolve the thickness of the regolith aquifer and the top of the granitic-gneiss or greenstone belt bedrock. Moreover, the occurrence of different lithological units (more or less conductive) directly related to groundwater flow, sometimes also having high chargeability (e.g. in the case of lateritic units), has been detailed within the regolith. The most promising areas for drilling new productive wells have been recognized where the bedrock is deeper and the regolith thickness is larger. A further piece of information considered in the hydrogeological mapping is the resistivity of the regolith, given that the most permeable layers coincide with the most resistive units. The resistivity model thus produced has allowed us to detect some alignments of conductive dykes, perforating the greenstone belt (made of volcanic mafic and ultramafic rocks or metasedimentary formations) and correlated with the gold mineralization. Moreover, the conductive response of the basal serpentine-chloritized ultramafic volcanic rocks has allowed reconstructing the deeper structural features of the area. Therefore, the advantage of re-processing existing AEM data has been twofold, i.e. for both hydrogeological and geological-structural (hence mining) purposes. In conclusion, we advocate re-using existing AEM datasets covering wide areas in underdeveloped and developing countries to improve the hydrogeological characterization of nations where groundwater resources could help meet the need to provide fresh, safe water to the population.
MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format.
Ahmed, Zeeshan; Dandekar, Thomas
2015-01-01
Published scientific literature contains millions of figures, including information about the results obtained from different scientific experiments, e.g. PCR-ELISA data, microarray analysis, gel electrophoresis, mass spectrometry data, DNA/RNA sequencing, diagnostic imaging (CT/MRI and ultrasound scans), and medical imaging like electroencephalography (EEG), magnetoencephalography (MEG), electrocardiography (ECG) and positron-emission tomography (PET) images. The importance of biomedical figures has been widely recognized in the scientific and medical communities, as they play a vital role in providing major original data and experimental and computational results in concise form. One major challenge for implementing a system for scientific literature analysis is extracting and analyzing text and figures from published PDF files by physical and logical document analysis. Here we present a product-line-architecture-based bioinformatics tool, 'Mining Scientific Literature (MSL)', which supports the extraction of text and images by interpreting all kinds of published PDF files using advanced data mining and image processing techniques. It provides modules for the marginalization of extracted text based on different coordinates and keywords, visualization of extracted figures, and extraction of embedded text from all kinds of biological and biomedical figures using Optical Character Recognition (OCR). Moreover, for further analysis and usage, it generates the system's output in different formats including text, PDF, XML and image files. Hence, MSL is an easy-to-install and easy-to-use analysis tool for interpreting published scientific literature in PDF format.
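MSL itself is not distributed with this abstract, so the sketch below uses PyMuPDF (the fitz module) to illustrate the basic physical-analysis step: pulling text and embedded images, page by page, from a published PDF. The file name is hypothetical.

```python
# PDF physical analysis with PyMuPDF: extract text and embedded images.
import fitz  # pip install pymupdf

doc = fitz.open("article.pdf")                      # hypothetical input file
for page_number, page in enumerate(doc, start=1):
    text = page.get_text()                          # logical text content
    images = page.get_images(full=True)             # refs to embedded images
    print(f"page {page_number}: {len(text)} chars, {len(images)} images")
    for img_index, img in enumerate(images):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)                # decode the embedded image
        if pix.n - pix.alpha >= 4:                  # convert CMYK to RGB
            pix = fitz.Pixmap(fitz.csRGB, pix)
        pix.save(f"page{page_number}_img{img_index}.png")
doc.close()
```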
The Weather Forecast Using Data Mining Research Based on Cloud Computing.
NASA Astrophysics Data System (ADS)
Wang, ZhanJie; Mazharul Mujib, A. B. M.
2017-10-01
Weather forecasting has been an important application in meteorology and one of the most scientifically and technologically challenging problems around the world. In this study, we have analyzed the use of data mining techniques in forecasting weather. This paper proposes a modern method for developing a service-oriented architecture for weather information systems that forecast weather using these data mining techniques. This can be carried out using artificial neural network and decision tree algorithms applied to meteorological data collected over a specific time period. The algorithm produced the best results when generating classification rules for the mean weather variables. The results showed that these data mining techniques can be sufficient for weather forecasting.
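A minimal sketch of the decision-tree step: classifying next-day rain from a few mean weather variables. The tiny dataset is fabricated purely for illustration.

```python
# Decision tree on mean weather variables; prints the learned rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: mean temperature (C), mean humidity (%), mean pressure (hPa)
X = [[28, 85, 1005], [31, 40, 1018], [22, 90, 1000],
     [25, 45, 1020], [27, 80, 1007], [33, 35, 1022]]
y = ["rain", "dry", "rain", "dry", "rain", "dry"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["temp", "humidity", "pressure"]))
print(tree.predict([[26, 75, 1008]]))   # classify a new day's means
```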
Kim, Tae-Goun
2009-10-01
This article develops a dynamic model of the efficient use of exhaustible marine sand resources in the context of marine mining externalities. The classical Hotelling extraction model is applied to sand mining in Ongjin, Korea and extended to include the estimated marginal external costs that mining imposes on marine fisheries. The socially efficient sand extraction plan is compared with the extraction paths suggested by scientific research. If marginal environmental costs are correctly estimated, the resulting efficient extraction plan, which accounts for the resource rent, may increase social welfare and reduce conflicts among marine sand resource users. The empirical results are interpreted with an emphasis on guidelines for coastal resource management policy.
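A hedged sketch of the efficiency condition such a model builds on, assuming a constant marginal extraction cost c and discount rate r; the article's actual specification may differ.

```latex
% Classical Hotelling rule: the scarcity rent (price p_t net of marginal
% extraction cost c) grows at the discount rate r,
\[
  p_t - c = (p_0 - c)\, e^{r t}.
\]
% Adding the estimated marginal external cost on fisheries, MEC_t, the
% socially efficient path makes the rent net of the externality grow at r:
\[
  p_t - c - \mathrm{MEC}_t = \left(p_0 - c - \mathrm{MEC}_0\right) e^{r t},
\]
% so a positive externality raises the full marginal cost of extraction and
% slows the socially efficient depletion of the sand stock.
```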
Hypergraph Based Feature Selection Technique for Medical Diagnosis.
Somu, Nivethitha; Raman, M R Gauthama; Kirthivasan, Kannan; Sriram, V S Shankar
2016-11-01
The impact of the internet and information systems across various domains has resulted in the substantial generation of multidimensional datasets. The use of data mining and knowledge discovery techniques to extract the original information contained in multidimensional datasets plays a significant role in exploiting the complete benefit they provide. The presence of a large number of features in high-dimensional datasets incurs high computational cost in terms of computing power and time. Hence, feature selection techniques have commonly been used to build robust machine learning models by selecting a subset of relevant features that projects the maximal information content of the original dataset. In this paper, a novel Rough Set based K-Helly feature selection technique (RSKHT), which hybridizes Rough Set Theory (RST) with the K-Helly property of hypergraph representation, was designed to identify the optimal feature subset or reduct for medical diagnostic applications. Experiments carried out using medical datasets from the UCI repository prove the dominance of RSKHT over other feature selection techniques with respect to reduct size, classification accuracy and time complexity. The performance of RSKHT was validated using the WEKA tool, which shows that RSKHT is computationally attractive and flexible over massive datasets.
The emergence of spatial cyberinfrastructure.
Wright, Dawn J; Wang, Shaowen
2011-04-05
Cyberinfrastructure integrates advanced computer, information, and communication technologies to empower computation-based and data-driven scientific practice and improve the synthesis and analysis of scientific data in a collaborative and shared fashion. As such, it now represents a paradigm shift in scientific research that has facilitated easy access to computational utilities and streamlined collaboration across distance and disciplines, thereby enabling scientific breakthroughs to be reached more quickly and efficiently. Spatial cyberinfrastructure seeks to resolve longstanding complex problems of handling and analyzing massive and heterogeneous spatial datasets as well as the necessity and benefits of sharing spatial data flexibly and securely. This article provides an overview and potential future directions of spatial cyberinfrastructure. The remaining four articles of the special feature are introduced and situated in the context of providing empirical examples of how spatial cyberinfrastructure is extending and enhancing scientific practice for improved synthesis and analysis of both physical and social science data. The primary focus of the articles is spatial analyses using distributed and high-performance computing, sensor networks, and other advanced information technology capabilities to transform massive spatial datasets into insights and knowledge.
NASA Astrophysics Data System (ADS)
Assous, Franck; Chaskalovic, Joël
2011-06-01
We propose a new approach that consists of using data mining techniques for scientific computing. Indeed, data mining has proved to be efficient in other contexts that deal with huge volumes of data, such as biology, medicine, marketing, advertising and communications. Our aim here is to address the important problem of exploiting the results produced by any numerical method: more and more data are created today by numerical simulations, so it seems necessary to look for efficient tools to analyze them. In this work, we focus our presentation on a test case dedicated to an asymptotic paraxial approximation for modeling ultrarelativistic particles. Our method deals directly with the numerical results of simulations and tries to understand what each order of the asymptotic expansion brings to the simulation results over what could be obtained by other lower-order or less accurate means. This new heuristic approach offers new potential applications for treating numerical solutions to mathematical models.
Huang, Zhenzhen; Duan, Huilong; Li, Haomin
2015-01-01
Large-scale human cancer genomics projects, such as TCGA, have generated large genomic datasets for further study. Exploring and mining these data to obtain meaningful analysis results can help researchers find potential genomic alterations that intervene in the development and metastasis of tumors. We developed a web-based gene analysis platform, named TCGA4U, which uses statistical methods and models to help translational investigators explore, mine and visualize human cancer genomic characteristics from the TCGA datasets. Furthermore, through Gene Ontology (GO) annotation and clinical data integration, the genomic data are transformed into biological process, molecular function and cellular component annotations as well as survival curves, helping researchers identify potential driver genes. Clinical researchers without expertise in data analysis will benefit from such a user-friendly genomic analysis platform.
75 FR 51111 - Notice of Proposed Withdrawal Extension and Opportunity for Public Meeting; Oregon
Federal Register 2010, 2011, 2012, 2013, 2014
2010-08-18
... protection of the scientific and ecological research values at the Steamboat Mountain Research Natural Area...,400 acres of National Forest System land from mining in order to protect scientific and ecological values at the Steamboat Mountain Research Natural Area. The withdrawal created by PLO No. 6870 will...
NASA Astrophysics Data System (ADS)
Abdaal, Ahmed; Jordan, Gyozo; Bartha, Andras; Fugedi, Ubul
2013-04-01
The Mine Waste Directive 2006/21/EC requires a risk-based inventory of all mine waste sites in Europe. The geochemical documentation concerning inert classification and ranking of mine wastes requires detailed field study and laboratory testing and analyses of waste material to assess the Acid Mine Drainage potential and toxic element mobility. The procedure applied in this study used a multi-level decision support scheme including: 1) expert judgment, 2) data review, 3) representative field sampling and laboratory analysis of formations listed in the Inert Mining Waste List, and 4) requesting available laboratory analysis data from selected operating mines. Based on expert judgment, the listed formations were classified into three categories: A: inert; B: probably inert, but has to be checked; C: probably not inert, has to be examined. This paper discusses the heavy metal contamination risk assessment (RA) of leached quarry-mine waste sites in Hungary. In total, 34 mine waste sites (including tailing lagoons and heaps of both abandoned mines and active quarries) have been selected for scientific testing using the EU Pre-selection Protocol. Over 93 field samples have been collected from the mine sites, covering ore (andesite and rhyolite), coal (lignite, black and brown coals), peat, alginite, bauxite, clay and limestone. Laboratory analyses covered the total toxic element content (aqua regia extraction), the mobile toxic element content (deionized water leaching) and the different forms of sulfur (sulfuric acid potential), on the basis of Hungarian GKM Decree No. 14/2008 (IV. 3) concerning mining waste management. A detailed geochemical study, together with spatial analysis and GIS, has been performed to derive a geochemically sound contamination RA of the mine waste sites. Key parameters such as heavy metal and sulphur content, together with the distance to the nearest surface and groundwater bodies and to sensitive receptors such as settlements and protected areas, are calculated and statistically evaluated using STATGRAPHICS® in order to calibrate the RA methods. Results show that some of the waste rock materials assumed to be inert were found not to be inert. Thus, regional RA needs more spatial and petrological examination, with special attention to rock and mineral deposit genetics.
DEVELOPMENT WORK FOR IMPROVED HEAVY-DUTY VEHICLE MODELING CAPABILITY DATA MINING--FHWA DATASETS
A heavy-duty vehicle can produce 10 to 100 times the emissions (of NOx and PM emissions especially) of a light-duty vehicle, so heavy-duty vehicle activity needs to be well characterized. Key uncertainties with the use of MOBILE6 regarding heavy-duty vehicle emissions include th...
Profiling Students Who Take Online Courses Using Data Mining Methods
ERIC Educational Resources Information Center
Yu, Chong Ho; Digangi, Samuel; Jannasch-Pennell, Angel Kay; Kaprolet, Charles
2008-01-01
The efficacy of online learning programs is tied to the suitability of the program in relation to the target audience. Based on the dataset that provides information on student enrollment, academic performance, and demographics extracted from a data warehouse of a large Southwest institution, this study explored the factors that could distinguish…
Mining relational paths in integrated biomedical data.
He, Bing; Tang, Jie; Ding, Ying; Wang, Huijun; Sun, Yuyin; Shin, Jae Hong; Chen, Bin; Moorthy, Ganesh; Qiu, Judy; Desai, Pankaj; Wild, David J
2011-01-01
Much life science and biology research requires an understanding of complex relationships between biological entities (genes, compounds, pathways, diseases, and so on). There is a wealth of data on such relationships in publicly available datasets and publications, but these sources overlap and are distributed, so that finding pertinent relational data is increasingly difficult. Whilst most public datasets have associated tools for searching, there is a lack of search methods that can cross data sources and, in particular, search not only on the biological entities themselves but also on the relationships between them. In this paper, we demonstrate how graph-theoretic algorithms for mining relational paths can be used together with Chem2Bio2RDF, an integrative data resource we developed previously, to extract new biological insights about the relationships between such entities. In particular, we use these methods to investigate the genetic basis of side-effects of thiazolidinedione drugs, and in particular make a hypothesis for the recently discovered cardiac side-effects of Rosiglitazone (Avandia) and a prediction for Pioglitazone which is backed up by recent clinical studies.
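The relational-path idea can be sketched on a toy heterogeneous graph with networkx: enumerate bounded-length paths between a drug and a gene. The edges below are fabricated for illustration; Chem2Bio2RDF supplies the real linked data.

```python
# Enumerate bounded-length relational paths on a toy biomedical graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Rosiglitazone", "PPARG"),      # drug - target
    ("PPARG", "adipogenesis"),       # gene - pathway
    ("adipogenesis", "KCNJ11"),      # pathway - gene
    ("Rosiglitazone", "edema"),      # drug - side effect
    ("edema", "KCNJ11"),             # side effect - gene
])

for path in nx.all_simple_paths(G, "Rosiglitazone", "KCNJ11", cutoff=3):
    print(" -> ".join(path))
```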
Bashir Surfraz, M; Fowkes, Adrian; Plante, Jeffrey P
2017-08-01
The need to find an alternative to costly animal studies for developmental and reproductive toxicity testing has shifted the focus considerably to the assessment of in vitro developmental toxicology models and the exploitation of pharmacological data for relevant molecular initiating events. We hereby demonstrate how automation can be applied successfully to handle heterogeneous oestrogen receptor data from ChEMBL. Applying expert-derived thresholds to specific bioactivities allowed an activity call to be attributed to each data entry. Human intervention further improved this mechanistic dataset which was mined to develop structure-activity relationship alerts and an expert model covering 45 chemical classes for the prediction of oestrogen receptor modulation. The evaluation of the model using FDA EDKB and Tox21 data was quite encouraging. This model can also provide a teratogenicity prediction along with the additional information it provides relevant to the query compound, all of which will require careful assessment of potential risk by experts. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
The Homestake Interim Laboratory and Homestake DUSEL
NASA Astrophysics Data System (ADS)
Lesko, Kevin T.
2011-12-01
The former Homestake gold mine in Lead, South Dakota is proposed for the National Science Foundation's Deep Underground Science and Engineering Laboratory (DUSEL). The gold mine provides expedient access to depths in excess of 8000 feet below the surface (>7000 mwe). Homestake's long history of promoting scientific endeavours includes the Davis Solar Neutrino Experiment, a chlorine-based experiment that was hosted at the 4850 Level for more than 30 years. As DUSEL, Homestake would be uncompromised by competition with mining interests or other shared uses. The facility's 600 km of drifts would be available for conversion to scientific and educational uses. The State of South Dakota, under Governor Rounds' leadership, has demonstrated exceptionally strong support for Homestake and the creation of DUSEL. The State has provided funding totalling $46M for the preservation of the site for DUSEL and for the conversion and operation of the Homestake Interim Laboratory. Motivated by the strong educational and outreach potential of Homestake, the State contracted a Conversion Plan from a world-recognized mine-engineering contractor to define the process of rehabilitating the facility, establishing the appropriate safety program, and regaining access to the facility. The State of South Dakota has established the South Dakota Science and Technology Authority to oversee the transfer of the Homestake property to the State and the rehabilitation and preservation of the facility. The Homestake Scientific Collaboration and the State's Science and Technology Authority have called for Letters of Interest from scientific, educational and engineering collaborations and institutions interested in hosting experiments and uses in the Homestake Interim Facility in advance of the NSF's DUSEL, to define experiments starting as early as 2007. The Homestake Program Advisory Committee has reviewed these Letters, and its initial report has been released. Options for developing the Homestake Interim Laboratory and evolving this facility into DUSEL are presented.
Online breakage detection of multitooth tools using classifier ensembles for imbalanced data
NASA Astrophysics Data System (ADS)
Bustillo, Andrés; Rodríguez, Juan J.
2014-12-01
Cutting tool breakage detection is an important task, due to its economic impact on mass production lines in the automobile industry. The task faces a central limitation: real data-sets are extremely imbalanced, because breakage occurs in very few cases compared with normal operation of the cutting process. In this paper, we present an analysis of different data-mining techniques applied to the detection of insert breakage in multitooth tools. The analysis uses only one experimental variable, the electrical power consumption of the tool drive, which reflects real industrial conditions more faithfully than other physical variables, such as acoustic or vibration signals, that are not so easily measured. Many efforts have been made to design a method able to identify breakages with a high degree of reliability within a short period of time. The solution is based on classifier ensembles for imbalanced data-sets. Classifier ensembles are combinations of classifiers, which in many situations are more accurate than individual classifiers. Six different base classifiers are tested: Decision Trees, Rules, Naïve Bayes, Nearest Neighbour, Multilayer Perceptrons and Logistic Regression. Three balancing strategies are tested with each of the classifier ensembles and compared with performance on the original data-set: Synthetic Minority Over-sampling Technique (SMOTE), undersampling, and a combination of SMOTE and undersampling. To identify the most suitable data-mining solution, ROC (receiver operating characteristic) and recall-precision graphs are generated and discussed. Logistic regression ensembles on the data-set balanced with the combination of SMOTE and undersampling turned out to be the most suitable technique. Finally, a comparison using industrial performance measures is presented, which concludes that this technique is also better suited to this industrial problem than the other techniques presented in the literature.
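A minimal sketch of the winning configuration described above, using imbalanced-learn and scikit-learn (the estimator= keyword assumes scikit-learn >= 1.2) on synthetic data in place of the proprietary power-consumption signals; the sampling ratios are illustrative assumptions, not the paper's settings.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.pipeline import Pipeline

    # Synthetic stand-in: ~2% of cycles end in insert breakage.
    X, y = make_classification(n_samples=2000, weights=[0.98], random_state=0)

    model = Pipeline([
        ("smote", SMOTE(sampling_strategy=0.5, random_state=0)),               # grow minority class
        ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=0)),  # shrink majority class
        ("bagged_lr", BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                                        n_estimators=25, random_state=0)),
    ])
    # ROC AUC mirrors the ROC-graph comparison used in the paper.
    print(cross_val_score(model, X, y, scoring="roc_auc", cv=5).mean())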
Minutes of the CD-ROM Workshop
NASA Technical Reports Server (NTRS)
King, Joseph H.; Grayzeck, Edwin J.
1989-01-01
The workshop described in this document had two goals: (1) to establish guidelines for the CD-ROM as a tool to distribute datasets; and (2) to evaluate current scientific CD-ROM projects as an archive. Workshop attendees were urged to coordinate with European groups to develop CD-ROM, which is already available at low cost in the U.S., as a distribution medium for astronomical datasets. It was noted that NASA has made the CD Publisher at the National Space Science Data Center (NSSDC) available to the scientific community when the Publisher is not needed for NASA work. NSSDC's goal is to provide the Publisher's user with the hardware and software tools needed to design a user's dataset for distribution. This includes producing a master CD and copies. The prerequisite premastering process is described, as well as guidelines for CD-ROM construction. The production of discs was evaluated. CD-ROM projects, guidelines, and problems of the technology were discussed.
pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts.
Rani, Jyoti; Shah, A B Rauf; Ramachandran, Srinivasan
2015-10-01
The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature, with more than 24 million citations. Data-mining of such voluminous literature is a challenging task. Although several text-mining algorithms have been developed in recent years with a focus on data visualization, they have limitations: they are slow, inflexible, and not available as open source. We have developed an R package, pubmed.mineR, in which we have combined the advantages of existing algorithms, overcome their limitations, and offer user flexibility and links to other packages in Bioconductor and the Comprehensive R Archive Network (CRAN) in order to expand the user's capabilities for executing multifaceted approaches. Three case studies are presented, namely 'Evolving role of diabetes educators', 'Cancer risk assessment' and 'Dynamic concepts on disease and comorbidity', to illustrate the use of pubmed.mineR. The package generally runs fast, with small elapsed times on regular workstations, even on large corpora and with compute-intensive functions. pubmed.mineR is available at http://cran.r-project.org/web/packages/pubmed.mineR.
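pubmed.mineR itself is an R package; as a rough Python analogue of the abstract-retrieval and word-frequency steps it automates, the sketch below pulls abstracts for one of the case-study topics through Biopython's Entrez interface (network access and an NCBI contact e-mail are required; the query string and e-mail are placeholders).

    from collections import Counter
    from Bio import Entrez

    Entrez.email = "you@example.org"   # NCBI requires a contact address (placeholder)

    # Fetch abstracts for one of the case-study topics (query is illustrative).
    ids = Entrez.read(Entrez.esearch(db="pubmed", term="diabetes educator", retmax=50))["IdList"]
    text = Entrez.efetch(db="pubmed", id=",".join(ids), rettype="abstract", retmode="text").read()

    # Crude word atomization, analogous to pubmed.mineR's word-frequency functions.
    words = [w.strip(".,;:()").lower() for w in text.split() if len(w) > 4]
    print(Counter(words).most_common(15))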
Modeling Spatial Dependencies and Semantic Concepts in Data Mining
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vatsavai, Raju
Data mining is the process of discovering new patterns and relationships in large datasets. However, several studies have shown that general data mining techniques often fail to extract meaningful patterns and relationships from spatial data owing to the violation of fundamental geospatial principles. In this tutorial, we introduce basic principles behind explicit modeling of spatial and semantic concepts in data mining. In particular, we focus on modeling these concepts in the widely used classification, clustering, and prediction algorithms. Classification is the process of learning a structure or model (from user-given inputs) and applying the known model to new data. Clustering is the process of discovering groups and structures in the data that are "similar," without applying any known structures in the data. Prediction is the process of finding a function that models (explains) the data with the least error. One common assumption among all these methods is that the data are independent and identically distributed. Such assumptions do not hold well in spatial data, where spatial dependency and spatial heterogeneity are the norm. In addition, spatial semantics are often ignored by data mining algorithms. In this tutorial we cover recent advances in the explicit modeling of spatial dependencies and semantic concepts in data mining.
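As a concrete instance of the spatial-dependency point, the following sketch computes Moran's I, a standard spatial autocorrelation statistic: values near zero are consistent with the i.i.d. assumption, while positive values signal the spatial dependence that breaks it. The weight matrix and site values are invented.

    import numpy as np

    def morans_i(x, W):
        """Moran's I: ~0 for spatially independent data, >0 when
        neighbouring sites carry similar values."""
        z = np.asarray(x, float) - np.mean(x)
        return len(z) * (W * np.outer(z, z)).sum() / (W.sum() * (z ** 2).sum())

    # Four sites along a line; W marks adjacent pairs (invented weights).
    W = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], float)
    print(morans_i([1.0, 1.2, 3.1, 3.0], W))   # positive: spatial dependence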
NASA Astrophysics Data System (ADS)
Conrads, P. A.; Roehl, E. A.
2010-12-01
Natural-resource managers face the difficult problem of controlling the interactions between hydrologic and man-made systems in ways that preserve resources while optimally meeting the needs of disparate stakeholders. Finding success depends on obtaining and employing detailed scientific knowledge about the cause-effect relations that govern the physics of these hydrologic systems. This knowledge is most credible when derived from large field-based datasets that encompass the wide range of variability in the parameters of interest. The means of converting data into knowledge of the hydrologic system often involves developing computer models that predict the consequences of alternative management practices to guide resource managers towards the best path forward. Complex hydrologic systems are typically modeled using computer programs that implement traditional, generalized, physical equations, which are calibrated to match the field data as closely as possible. This type of model is commonly limited in terms of demonstrable predictive accuracy, development time, and cost. The science of data mining presents a powerful complement to physics-based models. Data mining is a relatively new science that assists in converting large databases into knowledge and is uniquely able to leverage the real-time, multivariate data now being collected for hydrologic systems. In side-by-side comparisons with state-of-the-art physics-based hydrologic models, the authors have found data-mining solutions to be substantially more accurate, less time-consuming to develop, and embeddable into spreadsheets and sophisticated decision support systems (DSS), making them easy to use by regulators and stakeholders. Three data-mining applications will be presented that demonstrate how data-mining techniques can be applied to existing environmental databases to address regional concerns about long-term consequences. In each case, data were transformed into information, and ultimately, into knowledge. In each case, DSSs were developed that facilitated the use of simulation models and analysis of model output by a broad range of end users with various technical abilities. When compared to other modeling projects of comparable scope and complexity, these DSSs were able to pass through needed technical reviews much more quickly. Unlike programs such as finite-element flow models, DSSs are by design open systems that are easy to use and readily disseminated directly to decision makers. The DSSs provide direct coupling of predictive models with the real-time databases that drive them, graphical user interfaces for point-and-click program control, and streaming displays of numerical and graphical results so that users can monitor the progress of long-term simulations. Customizations for specific problems include numerical optimization loops that invert predictive models; integrations with a three-dimensional finite-element flow model, GIS packages, and a plant ecology model; and color contouring of simulation output data.
Connecting Provenance with Semantic Descriptions in the NASA Earth Exchange (NEX)
NASA Astrophysics Data System (ADS)
Votava, P.; Michaelis, A.; Nemani, R. R.
2012-12-01
NASA Earth Exchange (NEX) is a data, modeling and knowledge collaboratory that houses NASA satellite data, climate data and ancillary data, where a focused community may come together to share modeling and analysis codes, scientific results, knowledge and expertise on a centralized platform. Among the main goals of NEX are transparency and repeatability, and to that end we have been adding components that enable tracking of the provenance of both scientific processes and the datasets produced by these processes. As scientific processes become more complex, they are often developed collaboratively, and it becomes increasingly important for the research team to be able to track the development of the process and the datasets produced along the way. Additionally, we want to be able to link the processes and datasets developed on NEX to existing information and knowledge, so that users can query and compare the provenance of any dataset or process with regard to component-specific attributes such as data quality, geographic location, related publications, user comments and annotations, etc. We have developed several ontologies that describe the datasets and workflow components available on NEX using the OWL ontology language, as well as a simple ontology that provides a linking mechanism to the collected provenance information. The provenance is captured in two ways: we utilize the existing provenance infrastructure of VisTrails, which is used as the workflow engine on NEX, and we extend the captured provenance using the PROV data model expressed through the PROV-O ontology. We do this in order to link and query the provenance more easily in the context of the existing NEX information and knowledge. The captured provenance graph is processed and stored using RDFlib with a MySQL backend and can be queried using either RDFlib or SPARQL. As a concrete example, we show how this information is captured during an anomaly-detection process in large satellite datasets.
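A minimal sketch of the kind of PROV-O statements described, using the rdflib library named above; the resource names and the NEX namespace URI are invented for illustration.

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    PROV = Namespace("http://www.w3.org/ns/prov#")
    NEX = Namespace("http://example.org/nex/")   # invented namespace

    g = Graph()
    g.bind("prov", PROV)
    # A dataset produced by an anomaly-detection run (names illustrative).
    g.add((NEX.anomaly_map, RDF.type, PROV.Entity))
    g.add((NEX.detect_run_42, RDF.type, PROV.Activity))
    g.add((NEX.anomaly_map, PROV.wasGeneratedBy, NEX.detect_run_42))
    g.add((NEX.detect_run_42, PROV.used, NEX.modis_ndvi))

    # SPARQL over the provenance graph: which activity generated which dataset?
    for row in g.query("SELECT ?ds ?act WHERE { ?ds prov:wasGeneratedBy ?act }",
                       initNs={"prov": PROV}):
        print(row.ds, "<-", row.act)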
Enabling Linked Science in Global Climate Uncertainty Quantification (UQ) Research
NASA Astrophysics Data System (ADS)
Elsethagen, T.; Stephan, E.; Lin, G.; Williams, D.; Banks, E.
2012-12-01
This paper shares a real-world global climate UQ science use case and illustrates how a linked science application called Provenance Environment (ProvEn), currently being developed, enables and facilitates scientific teams to publish, share, link, and discover new links over their UQ research results. UQ results include terascale datasets that are published to an Earth Systems Grid Federation (ESGF) repository. ProvEn demonstrates how a scientific team conducting UQ studies can discover dataset links using its domain knowledgebase, allowing them to better understand the UQ study research objectives, the experimental protocol used, the resulting dataset lineage, related analytical findings, and ancillary literature citations, along with the social network of scientists associated with the study. This research claims that the linked science approach not only allows scientists to understand a particular dataset within a knowledge context; a further benefit comes from cross-referencing knowledge among the numerous UQ studies stored in ESGF. ProvEn collects native forms of data provenance resources as the UQ study is carried out. The native data provenance resources can be collected from a variety of sources such as scripts, a workflow engine log, simulation log files, scientific team members, etc. Schema alignment is used to translate the native forms of provenance into a set of W3C PROV-O semantic statements used as a common interchange format, which also contains URI references back to resources in the UQ study dataset for querying and cross-referencing. ProvEn leverages Fedora Commons' digital object model in a Resource Oriented Architecture (ROA) (i.e. a RESTful framework) to logically organize and partition native and translated provenance resources by UQ study. The ROA also provides scientists with the means to search both native and translated forms of provenance.
A Swarm Optimization approach for clinical knowledge mining.
Christopher, J Jabez; Nehemiah, H Khanna; Kannan, A
2015-10-01
Rule-based classification is a typical data mining task used in several medical diagnosis and decision support systems. The rules stored in the rule base have an impact on classification efficiency. Rule sets extracted with data mining tools and techniques are optimized using heuristic or meta-heuristic approaches in order to improve the quality of the rule base. In this work, a meta-heuristic approach called Wind-driven Swarm Optimization (WSO) is used. The uniqueness of this work lies in the biological inspiration that underlies the algorithm. WSO uses Jval, a new metric, to evaluate the efficiency of a rule-based classifier. Rules are extracted from decision trees. WSO is used to obtain different permutations and combinations of rules, whereby the optimal ruleset that satisfies the requirements of the developer is used for predicting the test data. The performance of various extensions of decision trees, namely RIPPER, PART, FURIA and Decision Tables, is analyzed. The efficiency of WSO is also compared with traditional Particle Swarm Optimization (PSO). Experiments were carried out with six benchmark medical datasets. The traditional C4.5 algorithm yields 62.89% accuracy with 43 rules for the liver disorders dataset, whereas WSO yields 64.60% with 19 rules. For the heart disease dataset, C4.5 is 68.64% accurate with 98 rules, whereas WSO is 77.8% accurate with 34 rules. The normalized standard deviations for the accuracy of PSO and WSO are 0.5921 and 0.5846, respectively. WSO provides accurate and concise rulesets. PSO yields results similar to those of WSO, but the novelty of WSO lies in its biological motivation and its customization for rule-base optimization. The trade-off between prediction accuracy and the size of the rule base is optimized during the design and development of a rule-based clinical decision support system, whose efficiency relies on the content of the rule base and classification accuracy. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
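The rule-extraction step that precedes the swarm search can be illustrated with scikit-learn; WSO itself is not a standard library, so only a Jval-style trade-off function is sketched, with assumed weights that are not the paper's.

    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Rule extraction from a decision tree, the classifier-building step that
    # precedes the swarm search over rule permutations.
    data = load_breast_cancer()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)
    print(export_text(tree, feature_names=list(data.feature_names)))

    def jval(accuracy, n_rules, w1=0.8, w2=0.2, max_rules=100):
        # A Jval-style objective rewarding accuracy and penalizing rule count;
        # the paper's exact formula and weights are not reproduced here.
        return w1 * accuracy + w2 * (1 - n_rules / max_rules)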
Two-step web-mining approach to study geology/geophysics-related open-source software projects
NASA Astrophysics Data System (ADS)
Behrends, Knut; Conze, Ronald
2013-04-01
Geology/geophysics is a highly interdisciplinary science, overlapping with, for instance, physics, biology and chemistry. In today's software-intensive work environments, geoscientists often encounter new open-source software from scientific fields that are only remotely related to their own field of expertise. We show how web-mining techniques can help to carry out systematic discovery and evaluation of such software. In a first step, we downloaded ~500 abstracts (each consisting of ~1 kb of UTF-8 text) from agu-fm12.abstractcentral.com. This web site hosts the abstracts of all publications presented at AGU Fall Meeting 2012, the world's largest annual geology/geophysics conference. All abstracts belonged to the category "Earth and Space Science Informatics", an interdisciplinary label cross-cutting many disciplines such as "deep biosphere", "atmospheric research", and "mineral physics". Each publication was represented by a highly structured record with ~20 short data attributes, the largest being the unstructured "abstract" field. We processed the texts of the abstracts with the statistics software "R" to calculate a corpus and a term-document matrix. Using the R package "tm", we applied text-mining techniques to filter data and develop hypotheses about software-development activities happening in various geology/geophysics fields. By analyzing the term-document matrix with basic techniques (e.g., word frequencies, co-occurrences, weighting) as well as more complex methods (clustering, classification), several key pieces of information were extracted. For example, text-mining can be used to identify scientists who are also developers of open-source scientific software, and the names of their programming projects and codes can also be identified. In a second step, based on the intermediate results found by processing the conference abstracts, any new hypotheses can be tested in another web-mining subproject: by merging the dataset with open data from github.com and stackoverflow.com. These popular, developer-centric websites have powerful application-programming interfaces and follow an open-data policy. In this regard, these sites offer a web-accessible reservoir of information that can be tapped to study questions such as: which open-source software projects are eminent in the various geoscience fields? What are the most popular programming languages? How are they trending? Are there any interesting temporal patterns in committer activities? How large are programming teams and how do they change over time? What free software packages exist in the vast realms of related fields? Does the software from these fields have capabilities that might still be useful to me as a researcher, or can help me perform my work better? Are there any open-source projects that might be commercially interesting? This evaluation strategy tends to reveal newer programming projects; as many important legacy codes are not hosted on open-source code repositories, the presented search method might overlook some older projects.
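The corpus and term-document-matrix step was done in R with the tm package; below is a rough Python equivalent built on scikit-learn's CountVectorizer, using three invented stand-in abstracts.

    from sklearn.feature_extraction.text import CountVectorizer

    abstracts = [
        "Open-source workflow for seismic data processing in Python",
        "Text mining AGU abstracts to map software development activity",
        "A new open-source package for mineral physics simulations",
    ]   # stand-ins for the ~500 downloaded abstracts

    vec = CountVectorizer(stop_words="english")
    tdm = vec.fit_transform(abstracts)     # documents x terms matrix
    freqs = tdm.sum(axis=0).A1             # corpus-wide term frequencies
    terms = vec.get_feature_names_out()
    for t, f in sorted(zip(terms, freqs), key=lambda p: -p[1])[:10]:
        print(t, f)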
The EB Factory: Fundamental Stellar Astrophysics with Eclipsing Binary Stars Discovered by Kepler
NASA Astrophysics Data System (ADS)
Stassun, Keivan
Eclipsing binaries (EBs) are key laboratories for determining the fundamental properties of stars. EBs are therefore foundational objects for constraining stellar evolution models, which in turn are central to determinations of stellar mass functions, of exoplanet properties, and many other areas. The primary goal of this proposal is to mine the Kepler mission light curves for: (1) EBs that include a subgiant star, from which precise ages can be derived and which can thus serve as critically needed age benchmarks; and within these, (2) long-period EBs that include low-mass M stars or brown dwarfs, which are increasingly becoming the focus of exoplanet searches, but for which there are the fewest available fundamental mass-radius-age benchmarks. A secondary goal of this proposal is to develop an end-to-end computational pipeline -- the Kepler EB Factory -- that allows automatic processing of Kepler light curves for EBs, from period finding, to object classification, to determination of EB physical properties for the most scientifically interesting EBs, and finally to accurate modeling of these EBs for detailed tests and benchmarking of theoretical stellar evolution models. We will integrate the most successful algorithms into a single, cohesive workflow environment, and apply this 'Kepler EB Factory' to the full public Kepler dataset to find and characterize new "benchmark grade" EBs, and will disseminate both the enhanced data products from this pipeline and the pipeline itself to the broader NASA science community. The proposed work responds directly to two of the defined Research Areas of the NASA Astrophysics Data Analysis Program (ADAP), specifically Research Area #2 (Stellar Astrophysics) and Research Area #9 (Astrophysical Databases). To be clear, our primary goal is the fundamental stellar astrophysics that will be enabled by the discovery and analysis of relatively rare, benchmark-grade EBs in the Kepler dataset. At the same time, enabling this goal will require bringing a suite of extant and new custom algorithms to bear on the Kepler data, and thus our development of the Kepler EB Factory represents a value-added product that will allow the widest scientific impact of the information locked within the vast reservoir of the Kepler light curves.
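The period-finding stage of such a pipeline is commonly done with a Lomb-Scargle periodogram; a minimal sketch with astropy on a synthetic eclipse-like light curve follows. This is not the proposal's actual algorithm suite, and note that for EB light curves the strongest peak may sit at a harmonic of the true orbital period, a well-known caveat.

    import numpy as np
    from astropy.timeseries import LombScargle

    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 90, 2000))              # observation times [d]
    dips = 0.02 * np.sin(2 * np.pi * t / 2.47) ** 8    # eclipse-like dips
    flux = 1.0 - dips + rng.normal(0, 0.002, t.size)

    freq, power = LombScargle(t, flux).autopower()
    # The strongest peak estimates the period (or one of its harmonics).
    print("best period [d]:", 1 / freq[np.argmax(power)])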
Igloo-Plot: a tool for visualization of multidimensional datasets.
Kuntal, Bhusan K; Ghosh, Tarini Shankar; Mande, Sharmila S
2014-01-01
Advances in science and technology have resulted in an exponential growth of multivariate (or multidimensional) datasets, generated across many research areas, especially in the biological sciences. Visualization and analysis of such data, with the objective of uncovering the hidden patterns therein, is an important and challenging task. We present a tool, called Igloo-Plot, for efficient visualization of multidimensional datasets. The tool addresses some of the key limitations of contemporary multivariate visualization and analysis tools. The visualization layout not only facilitates easy identification of clusters of data points with similar feature compositions, but also highlights the 'marker features' specific to each of these clusters. The applicability of the various functionalities implemented herein is demonstrated using several well-studied multidimensional datasets. Igloo-Plot is expected to be a valuable resource for researchers working in multivariate data mining studies. Igloo-Plot is available for download from: http://metagenomics.atc.tcs.com/IglooPlot/. Copyright © 2014 Elsevier Inc. All rights reserved.
Wang, Liqin; Haug, Peter J; Del Fiol, Guilherme
2017-05-01
Mining disease-specific associations from existing knowledge resources can be useful for building disease-specific ontologies and supporting knowledge-based applications. Many association mining techniques have been explored. However, a challenge remains: the extracted associations contain much noise. It is unreliable to determine the relevance of an association by simply setting arbitrary cut-off points on multiple relevance scores, and it would be expensive to ask human experts to manually review a large number of associations. We propose that machine-learning-based classification can be used to separate the signal from the noise, and to provide a feasible approach to creating and maintaining disease-specific vocabularies. We initially focused on disease-medication associations for simplicity. For a disease of interest, we extracted potentially treatment-related drug concepts from biomedical literature citations and from a local clinical data repository. Each concept was associated with multiple measures of relevance (i.e., features), such as frequency of occurrence. For machine learning, we formed nine datasets covering three diseases: for each disease, two single-source datasets and one combining the two. All the datasets were labeled using existing reference standards. Thereafter, we conducted two experiments: (1) to test whether adding features from the clinical data repository would improve classification performance over features from the biomedical literature alone, and (2) to determine whether classifiers trained on known medication-disease datasets would generalize to new diseases. Simple logistic regression and LogitBoost were identified as the preferred models for the biomedical-literature datasets and the combined datasets, respectively. Classification using combined features provided a significant improvement over biomedical-literature features alone (p-value < 0.001). The performance of the classifier built from known diseases to predict associated concepts for new diseases showed no significant difference from the performance of a classifier built and tested using the new disease's own dataset. It is feasible to use classification approaches to automatically predict the relevance of a concept to a disease of interest. It is useful to combine features from disparate sources for the task of classification. Classifiers built from known diseases were generalizable to new diseases. Copyright © 2017 Elsevier Inc. All rights reserved.
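Experiment (2), training on known diseases and testing on a new one, reduces to a simple train/transfer evaluation; the sketch below uses synthetic stand-in feature matrices, so its AUC is meaningless in itself (with real data the authors found transfer performance statistically indistinguishable from disease-specific training).

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Invented stand-ins: relevance features for candidate drug concepts,
    # one labelled dataset per disease.
    X_known, y_known = make_classification(n_samples=500, n_features=8, random_state=1)
    X_new, y_new = make_classification(n_samples=200, n_features=8, random_state=2)

    clf = LogisticRegression(max_iter=1000).fit(X_known, y_known)  # train: known disease
    print("AUC on new disease:", roc_auc_score(y_new, clf.predict_proba(X_new)[:, 1]))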
Towards an effective data peer review
NASA Astrophysics Data System (ADS)
Düsterhus, André; Hense, Andreas
2014-05-01
Peer review is an established procedure for ensuring the quality of scientific publications and is currently a prerequisite for acceptance of papers in the scientific community. In recent years the publication of raw data and its metadata has received increased attention, which has led to the idea of bringing data publication up to the standards of traditional journal publications. One missing element for achieving this is a comparable peer review scheme. This contribution introduces the idea of a quality evaluation process designed to analyse the technical quality as well as the content of a dataset. It is based on quality tests, whose results are evaluated with the help of expert knowledge. The test results and the expert knowledge are evaluated probabilistically and combined statistically, so that the quality of a dataset is estimated by a single value. This approach allows the reviewer to quickly identify the potential weaknesses of a dataset and to generate a transparent and comprehensible report. To demonstrate the scheme, an application to a large meteorological dataset will be shown. Furthermore, the potentials and risks of such a scheme will be introduced and the practical implications of its possible introduction at data centres investigated. In particular, the effects of reducing the quality estimate of a dataset to a single number will be critically discussed.
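In its simplest form, the combination of test results and expert knowledge into one number can be sketched as an expert-weighted average of per-test quality probabilities; the paper's actual probabilistic scheme is more elaborate, and the scores and weights below are invented.

    def dataset_quality(test_scores, expert_weights):
        """Collapse per-test quality probabilities into a single value via an
        expert-weighted average (a simplification of the paper's scheme)."""
        total = sum(expert_weights)
        return sum(s * w for s, w in zip(test_scores, expert_weights)) / total

    # Three technical checks and one content check, weighted by an expert.
    print(dataset_quality([0.99, 0.90, 0.75, 0.60], [1, 1, 2, 4]))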
NASA Astrophysics Data System (ADS)
Chegwidden, O.; Nijssen, B.; Pytlak, E.
2017-12-01
Any model simulation has errors, including errors in meteorological data, process understanding, model structure, and model parameters. These errors may express themselves as bias, timing lags, and differences in sensitivity between the model and the physical world. The evaluation and handling of these errors can greatly affect the legitimacy, validity and usefulness of the resulting scientific product. In this presentation we will discuss a case study of handling and communicating model errors during the development of a hydrologic climate change dataset for the Pacific Northwestern United States. The dataset was the result of a four-year collaboration between the University of Washington, Oregon State University, the Bonneville Power Administration, the United States Army Corps of Engineers and the Bureau of Reclamation. Along the way, the partnership facilitated the discovery of multiple systematic errors in the streamflow dataset. Through an iterative review process, some of those errors could be resolved. For the errors that remained, honest communication of the shortcomings promoted the dataset's legitimacy. Thoroughly explaining errors also improved the ways in which the dataset would be used in follow-on impact studies. Finally, we will discuss the development of the "streamflow bias-correction" step often applied to climate change datasets that will be used in impact modeling contexts. We will describe the development of a series of bias-correction techniques through close collaboration among universities and stakeholders. Through that process, both universities and stakeholders learned about each other's expectations and workflows. This mutual learning process allowed for the development of methods that accommodated the stakeholders' specific engineering requirements. The iterative revision process also produced a functional and actionable dataset while preserving its scientific merit. We will describe how encountering earlier techniques' pitfalls allowed us to develop improved methods for scientists and practitioners alike.
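One common form of the streamflow bias-correction step is empirical quantile mapping; the sketch below shows the idea on synthetic flows (the project's production method may well differ from this simple form).

    import numpy as np

    def quantile_map(model, obs, model_values):
        """Empirical quantile mapping: replace each simulated value with the
        observed value at the same quantile."""
        q = np.linspace(0, 1, 101)
        return np.interp(model_values, np.quantile(model, q), np.quantile(obs, q))

    rng = np.random.default_rng(0)
    obs = rng.gamma(2.0, 500.0, 5000)       # observed daily flows (synthetic)
    sim = rng.gamma(2.0, 650.0, 5000)       # simulation biased high
    print(quantile_map(sim, obs, sim[:5]))  # bias-corrected values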
The GAAIN Entity Mapper: An Active-Learning System for Medical Data Mapping.
Ashish, Naveen; Dewan, Peehoo; Toga, Arthur W
2015-01-01
This work is focused on mapping biomedical datasets to a common representation, as an integral part of data harmonization for integrated biomedical data access and sharing. We present GEM, an intelligent software assistant for automated data mapping across different datasets or from a dataset to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. It employs unsupervised text mining techniques to determine similarity between data elements and also employs machine-learning classifiers to identify element matches. It further provides an active-learning capability where the process of training the GEM system is optimized. Our experimental evaluations show that the GEM system provides highly accurate data mappings (over 90% accuracy) for real datasets of thousands of data elements each, in the Alzheimer's disease research domain. Further, the effort in training the system for new datasets is also optimized. We are currently employing the GEM system to map Alzheimer's disease datasets from around the globe into a common representation, as part of a global Alzheimer's disease integrated data sharing and analysis network called GAAIN. GEM achieves significantly higher data mapping accuracy for biomedical datasets compared to other state-of-the-art tools for database schema matching that have similar functionality. With the use of active-learning capabilities, the user effort in training the system is minimal.
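The unsupervised text-similarity step can be sketched with TF-IDF cosine similarity over data-dictionary descriptions; GEM layers machine-learned classifiers and active learning on top of such scores, which are omitted here. All strings are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Data-dictionary descriptions from two studies (illustrative strings).
    src = ["age of subject at baseline visit in years",
           "mini mental state examination total score"]
    tgt = ["subject age (years) at enrollment",
           "MMSE total", "apoe genotype"]

    vec = TfidfVectorizer().fit(src + tgt)
    sim = cosine_similarity(vec.transform(src), vec.transform(tgt))
    for i, row in enumerate(sim):
        print(src[i], "->", tgt[row.argmax()], f"(score={row.max():.2f})")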
The GAAIN Entity Mapper: An Active-Learning System for Medical Data Mapping
Ashish, Naveen; Dewan, Peehoo; Toga, Arthur W.
2016-01-01
This work is focused on mapping biomedical datasets to a common representation, as an integral part of data harmonization for integrated biomedical data access and sharing. We present GEM, an intelligent software assistant for automated data mapping across different datasets or from a dataset to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. It employs unsupervised text mining techniques to determine similarity between data elements and also employs machine-learning classifiers to identify element matches. It further provides an active-learning capability where the process of training the GEM system is optimized. Our experimental evaluations show that the GEM system provides highly accurate data mappings (over 90% accuracy) for real datasets of thousands of data elements each, in the Alzheimer's disease research domain. Further, the effort in training the system for new datasets is also optimized. We are currently employing the GEM system to map Alzheimer's disease datasets from around the globe into a common representation, as part of a global Alzheimer's disease integrated data sharing and analysis network called GAAIN. GEM achieves significantly higher data mapping accuracy for biomedical datasets compared to other state-of-the-art tools for database schema matching that have similar functionality. With the use of active-learning capabilities, the user effort in training the system is minimal. PMID:26793094
Knowledge discovery with classification rules in a cardiovascular dataset.
Podgorelec, Vili; Kokol, Peter; Stiglic, Milojka Molan; Hericko, Marjan; Rozman, Ivan
2005-12-01
In this paper we study an evolutionary machine learning approach to data mining and knowledge discovery based on the induction of classification rules. A method for automatic rule induction, called AREX, using evolutionary induction of decision trees and automatic programming is introduced. The proposed algorithm is applied to a cardiovascular dataset consisting of different groups of attributes that may reveal the presence of specific cardiovascular problems in young patients. A case study is presented that shows the use of AREX for the classification of patients and for discovering possible new medical knowledge from the dataset. The defined knowledge discovery loop comprises a medical expert's assessment of the induced rules to drive the evolution of rule sets towards more appropriate solutions. The final result is the discovery of possible new medical knowledge in the field of pediatric cardiology.
Semantic technologies improving the recall and precision of the Mercury metadata search engine
NASA Astrophysics Data System (ADS)
Pouchard, L. C.; Cook, R. B.; Green, J.; Palanisamy, G.; Noy, N.
2011-12-01
The Mercury federated metadata system [1] was developed at the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC), a NASA-sponsored effort holding datasets about biogeochemical dynamics, ecological data, and environmental processes. Mercury currently indexes over 100,000 records from several data providers conforming to community standards, e.g. EML, FGDC, FGDC Biological Profile, ISO 19115 and DIF. With the breadth of sciences represented in Mercury, the potential exists to address some key interdisciplinary scientific challenges related to climate change, its environmental and ecological impacts, and mitigation of these impacts. However, this wealth of metadata also hinders pinpointing datasets relevant to a particular inquiry. We implemented a semantic solution after concluding that traditional search approaches cannot improve the accuracy of search results in this domain because: a) unlike everyday queries, scientific queries seek to return specific datasets with numerous parameters that may or may not be exposed to search (Deep Web queries); b) the relevance of a dataset cannot be judged by its popularity, as each scientific inquiry tends to be unique; and c) each domain science has its own terminology, more or less curated, consensual, and standardized depending on the domain. The same terms may refer to different concepts across domains (homonyms), while different terms may mean the same thing (synonyms). Interdisciplinary research is arduous because an expert in one domain must become fluent in the language of another just to find relevant datasets. Thus, we decided to use scientific ontologies because they can provide a context for a free-text search in a way that string-based keywords never will. With added context, relevant datasets are more easily discoverable. To enable search and programmatic access to ontology entities in Mercury, we are using an instance of the BioPortal ontology repository. Mercury accesses ontology entities through the BioPortal REST API by passing a search parameter to BioPortal, which may return domain context, parameter attributes, or entity annotations depending on the entity's associated ontological relationships. As Mercury's faceted search is popular with users, the results are displayed as facets. Unlike a faceted search, however, the ontology-based solution implements both restrictions (improving precision) and expansions (improving recall) on the results of the initial search. For instance, "carbon" acquires a scientific context and additional key terms or phrases for discovering domain-specific datasets. One limitation of our solution is that the user must perform an additional step. Another is that the quality of the newly discovered metadata is contingent upon the quality of the ontologies we use. Our solution leverages Mercury's federated capabilities to collect records from heterogeneous domains, and BioPortal's storage, curation and access capabilities for ontology entities. With minimal additional development, our approach builds on two mature systems for finding relevant datasets for interdisciplinary inquiries. We thus indicate a path forward for linking the environmental, ecological and biological sciences. References: [1] Devarakonda, R., Palanisamy, G., Wilson, B. E., & Green, J. M. (2010). Mercury: reusable metadata management, data discovery and access system. Earth Science Informatics, 3(1-2), 87-94.
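The expansion/restriction mechanism can be caricatured with a toy lookup table standing in for the BioPortal REST calls; all terms below are invented, and the real system derives them from curated ontologies.

    # Toy ontology fragment mapping a free-text term to domain-specific
    # concepts (in Mercury this lookup goes through the BioPortal REST API).
    ONTOLOGY = {
        "carbon": {
            "expansions": ["soil carbon", "carbon flux", "NEE", "biomass carbon"],
            "restrictions": ["biogeochemistry"],
        }
    }

    def expand_query(term):
        entry = ONTOLOGY.get(term.lower(), {})
        expanded = [term] + entry.get("expansions", [])   # improves recall
        context = entry.get("restrictions", [])           # improves precision
        return expanded, context

    print(expand_query("carbon"))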
Text mining patents for biomedical knowledge.
Rodriguez-Esteban, Raul; Bundschus, Markus
2016-06-01
Biomedical text mining of scientific knowledge bases, such as Medline, has received much attention in recent years. Given that text mining is able to automatically extract biomedical facts that revolve around entities such as genes, proteins, and drugs, from unstructured text sources, it is seen as a major enabler to foster biomedical research and drug discovery. In contrast to the biomedical literature, research into the mining of biomedical patents has not reached the same level of maturity. Here, we review existing work and highlight the associated technical challenges that emerge from automatically extracting facts from patents. We conclude by outlining potential future directions in this domain that could help drive biomedical research and drug discovery. Copyright © 2016 Elsevier Ltd. All rights reserved.
Innovative Competencies of Mining engineers in Transition to the Sustainable Development
NASA Astrophysics Data System (ADS)
Krechetov, Andrey; Khoreshok, Alexey; Blumenstein, Valery
2017-11-01
The transition to sustainable development has posed new challenges for the system of mining higher education. These challenges are driven by the acceleration of scientific and technological progress, the widespread introduction of innovations, and the convergence of technologies from various industries. On the one hand, globalization and rapid technology development are constantly raising the quality requirements for the labor resources of the mineral and raw materials complex and demanding continuous improvement of their skills. On the other hand, the transition to sustainable development makes the rational use of raw materials and environmental protection a necessity. This requires improving the staff support system for mining operations and the interaction of enterprises with the universities training mining engineers, aimed at developing the innovative competencies of future miners.
Biomedical data mining in clinical routine: expanding the impact of hospital information systems.
Müller, Marcel; Markó, Kornel; Daumke, Philipp; Paetzold, Jan; Roesner, Arnold; Klar, Rüdiger
2007-01-01
In this paper we describe how the promising technology of biomedical data mining can improve the use of hospital information systems: a large set of unstructured, narrative clinical documents from a dermatological university hospital, such as discharge letters and other dermatological reports, was processed through a morpho-semantic text retrieval engine ("MorphoSaurus"), integrated with other clinical data using a web-based interface, and brought into daily clinical routine. The user evaluation showed very high user acceptance; the system appears to meet clinicians' requirements for vertical data mining in electronic patient records. What emerges is the need to integrate biomedical data mining into hospital information systems for clinical, scientific, educational and economic reasons.
NASA Astrophysics Data System (ADS)
Sakala, E.; Fourie, F.; Gomo, M.; Coetzee, H.
2018-01-01
In the last 20 years, the popular mineral systems approach has been used successfully for the exploration of various mineral commodities at various scales, owing to its scientific soundness, cost effectiveness and simplicity in mapping the critical processes required for the formation of deposits. In the present study this approach was modified for the assessment of groundwater vulnerability. In the modified approach, water drives the pollution migration processes, with various analogies derived from the mineral systems approach. The modified approach is illustrated here by a case study of acid mine drainage (AMD) pollution in the Witbank, Ermelo and Highveld coalfields of the Mpumalanga and KwaZulu-Natal Provinces in South Africa. Many AMD cases have been reported in these provinces in recent years and are a cause of concern for local municipalities, mining and environmental agencies. In the Witbank, Ermelo and Highveld coalfields, several areas have been mined out while mining has not yet started in others, hence the need to identify groundwater regions prone to AMD pollution in order to avoid further impacts on groundwater resources. A knowledge-based fuzzy expert system was built using vulnerability factors (energy sources, ligand sources, pollutant sources, transportation pathways and traps) to generate a groundwater vulnerability model of the coalfields. Highly vulnerable areas were identified in the Witbank coalfield and the eastern part of the Ermelo coalfield, which are characterised by the presence of AMD sources and good subsurface transport coupled with poor AMD pollution trapping properties. The results of the analysis indicate significant correlations between model values and both groundwater sulphate concentrations and pH. This shows that the proposed approach can indeed be used as an alternative to traditional methods of groundwater vulnerability assessment. The methodology considers AMD pollution attenuation and migration only at a regional scale and does not account for local-scale sources of pollution and attenuation. Further research to refine the approach may include the incorporation of groundwater flow direction, rock-pollution reaction time, and temporal datasets for the future prediction of groundwater vulnerability. The approach may be applied to other coalfields to assess its robustness under changing hydrogeological conditions.
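The fuzzy expert system can be caricatured with a single rule over fuzzified factors; the membership shapes, input values, and the rule itself are invented, and the real system combines many such rules per grid cell before correlating scores with sulphate and pH.

    def ramp(x, lo=0.0, hi=1.0):
        """Linear fuzzy membership rising from 0 at lo to 1 at hi."""
        return min(max((x - lo) / (hi - lo), 0.0), 1.0)

    # Fuzzified factors for one grid cell (all values invented):
    source = ramp(0.8)     # proximity/strength of AMD sources
    pathway = ramp(0.6)    # subsurface transport
    trapping = ramp(0.2)   # pollution-trapping capacity

    # One rule: vulnerable IF sources AND pathways AND NOT trapping
    # (min for AND, complement for NOT).
    print(min(source, pathway, 1.0 - trapping))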
Data Streams: An Overview and Scientific Applications
NASA Astrophysics Data System (ADS)
Aggarwal, Charu C.
In recent years, advances in hardware technology have facilitated the ability to collect data continuously. Simple transactions of everyday life, such as using a credit card, a phone, or browsing the web, lead to automated data storage. Similarly, advances in information technology have led to large flows of data across IP networks. In many cases, these large volumes of data can be mined for interesting and relevant information in a wide variety of applications. When the volume of the underlying data is very large, it leads to a number of computational and mining challenges: with an increasing volume of data, it is no longer possible to process the data efficiently by using multiple passes. Rather, one can process a data item at most once. This leads to constraints on the implementation of the underlying algorithms. Therefore, stream mining algorithms typically need to be designed so that they work with one pass over the data. In most cases, there is an inherent temporal component to the stream mining process, because the data may evolve over time. This behavior of data streams is referred to as temporal locality. Therefore, a straightforward adaptation of one-pass mining algorithms may not be an effective solution to the task. Stream mining algorithms need to be carefully designed with a clear focus on the evolution of the underlying data. Another important characteristic of data streams is that they are often mined in a distributed fashion. Furthermore, the individual processors may have limited processing power and memory. Examples include sensor networks, in which it may be desirable to perform in-network processing of data streams with limited processing and memory [1, 2]. This chapter will provide an overview of the key challenges in stream mining algorithms which arise from the unique setup in which these problems are encountered. The chapter is organized as follows. In the next section, we discuss the generic challenges that stream mining poses to a variety of data management and data mining problems, along with several issues which arise in the context of data stream management. In Sect. 3, we discuss several mining algorithms on the data stream model. Section 4 discusses various scientific applications of data streams. Section 5 discusses research directions and conclusions.
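The one-pass constraint is concretely illustrated by Welford's streaming mean/variance algorithm, which updates summary statistics item by item without ever storing the stream; the sensor readings below are invented.

    class StreamingStats:
        """One-pass (Welford) mean/variance: each item is seen at most once,
        the core constraint described above for stream mining."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
        def update(self, x):
            self.n += 1
            d = x - self.mean
            self.mean += d / self.n
            self.m2 += d * (x - self.mean)
        @property
        def variance(self):
            return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    s = StreamingStats()
    for x in (0.9, 1.1, 1.0, 5.0):   # e.g., sensor readings arriving one by one
        s.update(x)
    print(s.mean, s.variance)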
Goddard Atmospheric Composition Data Center: Aura Data and Services in One Place
NASA Technical Reports Server (NTRS)
Leptoukh, G.; Kempler, S.; Gerasimov, I.; Ahmad, S.; Johnson, J.
2005-01-01
The Goddard Atmospheric Composition Data and Information Services Center (AC-DISC) is a portal to an Atmospheric Composition-specific, user-driven, multi-sensor, on-line, easy-access archive and distribution system employing data analysis and visualization, data mining, and other user-requested techniques for better science data usage. It provides convenient access to Atmospheric Composition data and information from various remote-sensing missions, from TOMS, UARS, MODIS, and AIRS, to the most recent data from Aura OMI, MLS, and HIRDLS (once these datasets are released to the public), as well as Atmospheric Composition datasets residing at other remote archive sites.
Enabling Data Fusion via a Common Data Model and Programming Interface
NASA Astrophysics Data System (ADS)
Lindholm, D. M.; Wilson, A.
2011-12-01
Much progress has been made in scientific data interoperability, especially in the areas of metadata and discovery. However, while a data user may have improved techniques for finding data, there is often a large chasm to span when it comes to acquiring the desired subsets of various datasets and integrating them into a data processing environment. Some tools such as OPeNDAP servers and the Unidata Common Data Model (CDM) have introduced improved abstractions for accessing data via a common interface, but they alone do not go far enough to enable fusion of data from multidisciplinary sources. Although data from various scientific disciplines may represent semantically similar concepts (e.g. time series), the user may face widely varying structural representations of the data (e.g. row versus column oriented), not to mention radically different storage formats. It is not enough to convert data to a common format. The key to fusing scientific data is to represent each dataset with consistent sampling. This can best be done by using a data model that expresses the functional relationship that each dataset represents. The domain of those functions determines how the data can be combined. The Visualization for Algorithm Development (VisAD) Java API has provided a sophisticated data model for representing the functional nature of scientific datasets for well over a decade. Because VisAD is largely designed for its visualization capabilities, the data model can be cumbersome to use for numerical computation, especially for those not comfortable with Java. Although both VisAD and the implementation of the CDM are written in Java, neither defines a pure Java interface that others could implement and program to, further limiting potential for interoperability. In this talk, we will present a solution for data integration based on a simple discipline-agnostic scientific data model and programming interface that enables a dataset to be defined in terms of three variable types: Scalar (a), Tuple (a,b), and Function (a -> b). These basic building blocks can be combined and nested to represent any arbitrarily complex dataset. For example, a time series of surface temperature and pressure could be represented as: time -> ((lon,lat) -> (T,P)). Our data model is expressed in UML and can be implemented in numerous programming languages. We will demonstrate an implementation of our data model and interface using the Scala programming language. Given its functional programming constructs, sophisticated type system, and other language features, Scala enables us to construct complex data structures that can be manipulated using natural mathematical expressions while taking advantage of the language's ability to operate on collections in parallel. This API will be applied to the problem of assimilating various measurements of the solar spectrum and other proxies from multiple sources to construct a composite Lyman-alpha irradiance dataset.
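In the spirit of (though not the syntax of) the Scala implementation described, the nested-function dataset time -> ((lon, lat) -> (T, P)) can be modelled directly in Python as nested mappings; all values below are invented.

    from typing import Dict, Tuple

    # The dataset "time -> ((lon, lat) -> (T, P))" as nested mappings
    # (a Python stand-in for the Scala implementation; values invented).
    Grid = Dict[Tuple[float, float], Tuple[float, float]]
    TimeSeries = Dict[float, Grid]

    dataset: TimeSeries = {
        0.0: {(0.0, 45.0): (281.2, 1013.1), (1.0, 45.0): (280.7, 1012.8)},
        1.0: {(0.0, 45.0): (281.9, 1012.5), (1.0, 45.0): (281.1, 1012.2)},
    }

    # Datasets sharing the same domain samples (here: time) can be fused by
    # evaluating both "functions" at the common domain points.
    def at(ds: TimeSeries, t: float, lonlat: Tuple[float, float]) -> Tuple[float, float]:
        return ds[t][lonlat]

    print(at(dataset, 1.0, (0.0, 45.0)))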
MSL: Facilitating automatic and physical analysis of published scientific literature in PDF format
Ahmed, Zeeshan; Dandekar, Thomas
2018-01-01
Published scientific literature contains millions of figures, including information about the results obtained from different scientific experiments, e.g. PCR-ELISA data, microarray analysis, gel electrophoresis, mass spectrometry data, DNA/RNA sequencing, diagnostic imaging (CT/MRI and ultrasound scans), and medical imaging like electroencephalography (EEG), magnetoencephalography (MEG), electrocardiography (ECG), and positron-emission tomography (PET) images. The importance of biomedical figures has been widely recognized in the scientific and medical communities, as they play a vital role in providing major original data and experimental and computational results in concise form. One major challenge in implementing a system for scientific literature analysis is extracting and analyzing text and figures from published PDF files by physical and logical document analysis. Here we present a product-line-architecture-based bioinformatics tool, 'Mining Scientific Literature (MSL)', which supports the extraction of text and images by interpreting all kinds of published PDF files using advanced data mining and image processing techniques. It provides modules for the marginalization of extracted text based on different coordinates and keywords, visualization of extracted figures, and extraction of embedded text from all kinds of biological and biomedical figures using applied Optical Character Recognition (OCR). Moreover, for further analysis and usage, it generates output in different formats including text, PDF, XML and image files. Hence, MSL is an easy-to-install and easy-to-use analysis tool for interpreting published scientific literature in PDF format. PMID:29721305
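The generic text-layer extraction plus figure-OCR pipeline that MSL's description implies can be sketched with off-the-shelf Python libraries (pdfminer.six, pdf2image, pytesseract; the latter two need the poppler and tesseract binaries installed). This is not MSL's own code, and the file path is a placeholder.

    from pdfminer.high_level import extract_text   # logical text layer
    from pdf2image import convert_from_path        # page rasters for figures
    import pytesseract                             # OCR on figure text

    pdf = "paper.pdf"                              # placeholder path

    text = extract_text(pdf)                       # text extraction
    pages = convert_from_path(pdf, dpi=200)        # physical page images
    figure_text = [pytesseract.image_to_string(p) for p in pages]

    print(text[:200])
    print(figure_text[0][:200])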
Medical Concept Normalization in Social Media Posts with Recurrent Neural Networks.
Tutubalina, Elena; Miftahutdinov, Zulfat; Nikolenko, Sergey; Malykh, Valentin
2018-06-12
Text mining of scientific libraries and social media has already proven itself as a reliable tool for drug repurposing and hypothesis generation. The task of mapping a disease mention to a concept in a controlled vocabulary, typically to the standard thesaurus in the Unified Medical Language System (UMLS), is known as medical concept normalization. This task is challenging due to the differences in the use of medical terminology between health care professionals and social media texts coming from the lay public. To bridge this gap, we use sequence learning with recurrent neural networks and semantic representation of one- or multi-word expressions: we develop end-to-end architectures directly tailored to the task, including bidirectional Long Short-Term Memory, Gated Recurrent Units with an attention mechanism, and additional semantic similarity features based on UMLS. Our evaluation against a standard benchmark shows that recurrent neural networks improve results over an effective baseline for classification based on convolutional neural networks. A qualitative examination of mentions discovered in a dataset of user reviews collected from popular online health information platforms as well as a quantitative evaluation both show improvements in the semantic representation of health-related expressions in social media. Copyright © 2018. Published by Elsevier Inc.
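A skeletal Keras version of the paper's bidirectional-LSTM encoder is sketched below; the vocabulary, sequence-length, and concept-count constants are assumptions, and the published model additionally uses an attention mechanism and UMLS-based semantic-similarity features that are omitted here.

    import tensorflow as tf

    VOCAB, MAXLEN, N_CONCEPTS = 20000, 30, 500   # sizes are assumptions

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB, 128),                     # token embeddings
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),   # phrase encoder
        tf.keras.layers.Dense(N_CONCEPTS, activation="softmax"),   # UMLS concept
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.build(input_shape=(None, MAXLEN))
    model.summary()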
SPACE: the SPectroscopic, All-Sky Cosmic Explorer
NASA Technical Reports Server (NTRS)
Cimatti, A.; Robberto, M.; Baugh, C.; Beckwith, S. W. V.; Content, R.; Daddi, E.; deLucia, G.; Garilli, B.; Guzzo, L.; Kauffmann, G.;
2007-01-01
We describe the scientific motivations, the mission concept and the instrumentation of SPACE, a class-M mission proposed for concept study at the first call of the ESA Cosmic Vision 2015-2025 planning cycle. SPACE aims at producing the largest three-dimensional evolutionary map of the Universe over the past 10 billion years by taking near-IR spectra and measuring redshifts of more than half a billion galaxies at 0 < z < 2 down to AB approximately 23 over 3π sr of the sky. In addition, SPACE will also target a smaller sky field, performing a deep spectroscopic survey of millions of galaxies to AB approximately 26 and at 2 < z < 10+. Owing to the depth, redshift range, volume coverage and quality of its spectra, SPACE will reveal with unique sensitivity most of the fundamental cosmological signatures, including the power spectrum of density fluctuations and its turnover, the baryonic acoustic oscillations imprinted when matter and radiation decoupled, the distance-luminosity relation of cosmological supernovae, the evolution of the cosmic expansion rate, the growth rate of cosmic large-scale structure, and the large-scale distribution of galaxies. The datasets from the SPACE mission will represent a long-lasting legacy that will be data-mined for many years to come.
Systematic Processing of Clementine Data for Scientific Analyses
NASA Technical Reports Server (NTRS)
Mcewen, A. S.
1993-01-01
If fully successful, the Clementine mission will return about 3,000,000 lunar images and more than 5000 images of Geographos. Effective scientific analyses of such large datasets require systematic processing efforts. Concepts for two such efforts are described: global multispectral imaging of the Moon, and videos of Geographos.
Ontology-Driven Discovery of Scientific Computational Entities
ERIC Educational Resources Information Center
Brazier, Pearl W.
2010-01-01
Many geoscientists use modern computational resources, such as software applications, Web services, scientific workflows and datasets that are readily available on the Internet, to support their research and many common tasks. These resources are often shared via human contact and sometimes stored in data portals; however, they are not necessarily…
Sixty-five chemicals in the ToxCast high-throughput screening (HTS) dataset have been linked to cleft palate based on data from ToxRefDB (rat or rabbit prenatal developmental toxicity studies) or from literature reports. These compounds are structurally diverse and thus likely to...
Stream-Dashboard: A Big Data Stream Clustering Framework with Applications to Social Media Streams
ERIC Educational Resources Information Center
Hawwash, Basheer
2013-01-01
Data mining is concerned with detecting patterns of data in raw datasets, which are then used to unearth knowledge that might not have been discovered using conventional querying or statistical methods. This discovered knowledge has been used to empower decision makers in countless applications spanning across many multi-disciplinary areas…
Jones, Krista L.; O'Connor, Jim E.; Keith, Mackenzie K.; Mangano, Joseph F.; Wallick, J. Rose
2012-01-01
This report summarizes a preliminary study of bed-material transport, vertical and lateral channel changes, and existing datasets for the Coquille River basin, which encompasses 2,745 km2 (square kilometers) of the southwestern Oregon coast. This study, conducted to inform permitting decisions regarding instream gravel mining, revealed that:
How Do the Different Types of Computer Use Affect Math Achievement?
ERIC Educational Resources Information Center
Flores, Raymond; Inan, Fethi; Lin, Zhangxi
2013-01-01
In this study, the National Educational Longitudinal Study (ELS:2002) dataset was used and a predictive data mining technique, decision tree analysis, was implemented in order to examine which factors, in conjunction to computer use, can be used to predict high or low probability of success in high school mathematics. Specifically, this study…
Erdoğan, Onur; Aydin Son, Yeşim
2014-01-01
Single Nucleotide Polymorphisms (SNPs) are the most common genomic variations, where only a single nucleotide differs between individuals. Individual SNPs and SNP profiles associated with diseases can be utilized as biological markers. But there is a need to determine the SNP subsets and patients' clinical data that are informative for the diagnosis. Data mining approaches have the highest potential for extracting the knowledge from genomic datasets and selecting the representative SNPs as well as the most effective and informative clinical features for the clinical diagnosis of diseases. In this study, we have applied one of the most widely used data mining classification methodologies, the decision tree, to associate SNP biomarkers and significant clinical data with Alzheimer's disease (AD), the most common form of dementia. Different tree-construction parameters have been compared for optimization, and the most accurate tree for predicting AD is presented.
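As a concrete illustration of the kind of analysis described above, the following minimal Python sketch (not the authors' code) compares tree-construction parameters by cross-validation; the additive 0/1/2 genotype coding, the dataset shape, and the parameter grid are all assumptions made for the example.

```python
# Minimal sketch: cross-validated comparison of decision-tree construction
# parameters on additively coded SNP genotypes (all data here is synthetic).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50))   # 200 patients x 50 SNPs, coded 0/1/2
y = rng.integers(0, 2, size=200)         # 1 = AD, 0 = control (synthetic labels)

# Compare tree-construction parameters, as the abstract describes.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"criterion": ["gini", "entropy"],
                "max_depth": [3, 5, None],
                "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```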
Dataset of aggregate producers in New Mexico
Orris, Greta J.
2000-01-01
This report presents data, including latitude and longitude, for aggregate sites in New Mexico that were believed to be active in the period 1997-1999. The data are presented in paper form in Part A of this report and as Microsoft Excel 97 and Data Interchange Format (DIF) files in Part B. The work was undertaken as part of the effort to update information for the National Atlas. This compilation includes data from: the files of U.S. Geological Survey (USGS); company contacts; the New Mexico Bureau of Mines and Mineral Resources, New Mexico Bureau of Mine Inspection, and the Mining and Minerals Division of the New Mexico Energy, Minerals and Natural Resources Department (Hatton and others, 1998); the Bureau of Land Management Information; and direct communications with some of the aggregate operators. Additional information on most of the sites is available in Hatton and others (1998).
A fuzzy hill-climbing algorithm for the development of a compact associative classifier
NASA Astrophysics Data System (ADS)
Mitra, Soumyaroop; Lam, Sarah S.
2012-02-01
Classification, a data mining technique, has widespread applications including medical diagnosis, targeted marketing, and others. Knowledge discovery from databases in the form of association rules is one of the important data mining tasks. An integrated approach, classification based on association rules, has drawn the attention of the data mining community over the last decade. While attention has been mainly focused on increasing classifier accuracies, not much effort has been devoted to building interpretable and less complex models. This paper discusses the development of a compact associative classification model using a hill-climbing approach and fuzzy sets. The proposed methodology builds the rule base by selecting rules that contribute to increasing training accuracy, thus balancing classification accuracy against the number of classification association rules. The results indicated that the proposed associative classification model can achieve competitive accuracies on benchmark datasets with continuous attributes and lends better interpretability when compared with other rule-based systems.
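The greedy core of such a hill-climbing rule selection can be sketched in a few lines. The toy Python below (an illustration under assumed data structures, without the paper's fuzzy-set machinery) accepts a candidate rule only when it raises training accuracy, which is what keeps the rule base compact:

```python
# Toy hill-climbing selection of classification association rules.
# A rule is (antecedent item set, predicted class); the first matching rule fires.
def accuracy(rules, X, y, default):
    correct = 0
    for xi, yi in zip(X, y):
        pred = default
        for items, cls in rules:
            if items <= xi:              # antecedent is a subset of the instance
                pred = cls
                break
        correct += (pred == yi)
    return correct / len(y)

def hill_climb(candidates, X, y, default):
    selected, best = [], accuracy([], X, y, default)
    improved = True
    while improved:
        improved = False
        for rule in candidates:
            if rule in selected:
                continue
            acc = accuracy(selected + [rule], X, y, default)
            if acc > best:               # keep a rule only if accuracy rises
                best, improved = acc, True
                selected.append(rule)
    return selected, best

# Hypothetical transactions (item sets) with binary labels.
X = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
y = [1, 1, 0, 0]
cands = [(frozenset({"a"}), 1), (frozenset({"c"}), 0), (frozenset({"b"}), 0)]
print(hill_climb(cands, X, y, default=0))
```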
Kibinge, Nelson; Ono, Naoaki; Horie, Masafumi; Sato, Tetsuo; Sugiura, Tadao; Altaf-Ul-Amin, Md; Saito, Akira; Kanaya, Shigehiko
2016-06-01
Conventionally, workflows examining transcription regulation networks from gene expression data involve distinct analytical steps. There is a need for pipelines that unify data mining and inference deduction into a singular framework to enhance interpretation and hypothesis generation. We propose a workflow that merges network construction with gene expression data mining, focusing on regulation processes in the context of transcription factor driven gene regulation. The pipeline implements pathway-based modularization of expression profiles into functional units to improve biological interpretation. The integrated workflow was implemented as a web application software (TransReguloNet) with functions that enable pathway visualization and comparison of transcription factor activity between sample conditions defined in the experimental design. The pipeline merges differential expression, network construction, pathway-based abstraction, clustering and visualization. The framework was applied in the analysis of actual expression datasets related to lung, breast and prostate cancer. Copyright © 2016 Elsevier Inc. All rights reserved.
Data publication and sharing using the SciDrive service
NASA Astrophysics Data System (ADS)
Mishin, Dmitry; Medvedev, D.; Szalay, A. S.; Plante, R. L.
2014-01-01
Despite recent progress in scientific data storage, there remains a need for a public storage and sharing system for relatively small scientific datasets. These are the collections forming the “long tail” of the power-law distribution of dataset sizes. The aggregated size of the long-tail data is comparable to the size of all data collections from large archives, and the value of the data is significant. The SciDrive project's main goal is to provide the scientific community with a place to reliably and freely store such data and make it accessible to the broader scientific community. The primary target audience of the project is the astronomy community, and it will be extended to other fields. We are aiming to create a simple way of publishing a dataset, which can then be shared with other people. The data owner controls the permissions to modify and access the data and can assign a group of users or open access to everyone. The data contained in the dataset will be automatically recognized by a background process. Known data formats will be extracted according to the user's settings. Currently, tabular data can be automatically extracted to the user's MyDB table, where the user can make SQL queries against the dataset and merge it with other public CasJobs resources. Other data formats can be processed using a set of plugins that upload the data or metadata to user-defined side services. The current implementation targets some of the data formats commonly used by the astronomy communities, including FITS, ASCII and Excel tables, TIFF images, and YT simulation data archives. Along with generic metadata, format-specific metadata is also processed. For example, basic information about celestial objects is extracted from FITS files and TIFF images, if present. A 100 TB implementation has just been put into production at Johns Hopkins University. The system features a public data storage REST service supporting the VOSpace 2.0 and Dropbox protocols, an HTML5 web portal, a command-line client, and a standalone Java client to synchronize a local folder with the remote storage. We use the VAO SSO (Single Sign On) service from NCSA for user authentication, which provides free registration for everyone.
Liu, Zhao; Zhu, Yunhong; Wu, Chenxue
2016-01-01
Spatial-temporal k-anonymity has become a mainstream approach among techniques for protecting users' privacy in location-based services (LBS) applications, and has been applied to several variants such as LBS snapshot queries and continuous queries. Analyzing large-scale spatial-temporal anonymity sets may benefit several LBS applications. In this paper, we propose two location prediction methods based on transition probability matrices constructed from sequential rules mined from spatial-temporal k-anonymity datasets. First, we define single-step sequential rules mined from sequential spatial-temporal k-anonymity datasets generated from continuous LBS queries for multiple users. We then construct transition probability matrices from the mined single-step sequential rules and normalize the transition probabilities in the matrices. Next, we regard the mobility model of an LBS requester as a stationary stochastic process and compute the n-step transition probability matrices by raising the normalized transition probability matrices to the power n. Furthermore, we propose two location prediction methods: rough prediction and accurate prediction. The former obtains the probabilities of arriving at target locations along simple paths that include only current locations, target locations and transition steps. By iteratively combining the probabilities for simple paths with n steps and the probabilities for detailed paths with n-1 steps, the latter method calculates transition probabilities for detailed paths with n steps from current locations to target locations. Finally, we conduct extensive experiments, and the correctness and flexibility of our proposed algorithms are verified. PMID:27508502
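The n-step computation underlying the rough-prediction method is standard Markov-chain arithmetic. A minimal sketch, with made-up transition counts standing in for the mined single-step rules:

```python
# n-step location prediction from single-step transition counts.
# Locations are indexed 0..k-1; the counts below are hypothetical.
import numpy as np

counts = np.array([[0, 8, 2],
                   [3, 0, 7],
                   [5, 5, 0]], dtype=float)

P = counts / counts.sum(axis=1, keepdims=True)   # row-normalize to probabilities

n = 3
Pn = np.linalg.matrix_power(P, n)                # n-step transition matrix

current = 0
print("P(location j after", n, "steps | start at 0):", Pn[current])
```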
GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains
Lu, Zhiyong
2015-01-01
The automatic recognition of gene names and their associated database identifiers from biomedical text has been widely studied in recent years, as these tasks play an important role in many downstream text-mining applications. Despite significant previous research, only a small number of tools are publicly available and these tools are typically restricted to detecting only mention level gene names or only document level gene identifiers. In this work, we report GNormPlus: an end-to-end and open source system that handles both gene mention and identifier detection. We created a new corpus of 694 PubMed articles to support our development of GNormPlus, containing manual annotations for not only gene names and their identifiers, but also closely related concepts useful for gene name disambiguation, such as gene families and protein domains. GNormPlus integrates several advanced text-mining techniques, including SimConcept for resolving composite gene names. As a result, GNormPlus compares favorably to other state-of-the-art methods when evaluated on two widely used public benchmarking datasets, achieving 86.7% F1-score on the BioCreative II Gene Normalization task dataset and 50.1% F1-score on the BioCreative III Gene Normalization task dataset. The GNormPlus source code and its annotated corpus are freely available, and the results of applying GNormPlus to the entire PubMed are freely accessible through our web-based tool PubTator. PMID:26380306
An update on USGS studies of the Summitville Mine and its downstream environmental effects
Plumlee, Geoffrey S.; Edelmann, Patrick R.
1995-01-01
The Summitville gold mine, located at ~3800 meters (11,500 ft) elevation in the San Juan Mountains of southwestern Colorado, was the focus of extensive public attention in 1992 and 1993 for environmental problems stemming from recent open-pit mining activities. Summitville catalyzed national debates about the environmental effects of modern mining activities, and became the focus of arguments for proposed revisions to the 1872 Mining Law governing mining activities on public lands. In early 1993, the State of Colorado, U.S. Environmental Protection Agency (EPA), U.S. Geological Survey (USGS), U.S. Fish and Wildlife Service (USFWS), Colorado State University, San Luis Valley agencies, downstream water users, private companies, and individuals began a multi-disciplinary research program to provide needed scientific information on Summitville's environmental problems and downstream environmental effects. Detailed results of this multi-agency effort were presented, along with legal and policy issues, at the Summitville Forum in January, 1995, at Colorado State University, Fort Collins, Colorado.
What the papers say: Text mining for genomics and systems biology
2010-01-01
Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining -- the automated extraction of information from (electronically) published sources -- could potentially fulfil an important role -- but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward. PMID:21106487
The 3D widgets for exploratory scientific visualization
NASA Technical Reports Server (NTRS)
Herndon, Kenneth P.; Meyer, Tom
1995-01-01
Computational fluid dynamics (CFD) techniques are used to simulate flows of fluids like air or water around such objects as airplanes and automobiles. These techniques usually generate very large amounts of numerical data which are difficult to understand without using graphical scientific visualization techniques. There are a number of commercial scientific visualization applications available today which allow scientists to control visualization tools via textual and/or 2D user interfaces. However, these user interfaces are often difficult to use. We believe that 3D direct-manipulation techniques for interactively controlling visualization tools will provide opportunities for powerful and useful interfaces with which scientists can more effectively explore their datasets. A few systems have been developed which use these techniques. In this paper, we will present a variety of 3D interaction techniques for manipulating parameters of visualization tools used to explore CFD datasets, and discuss in detail various techniques for positioning tools in a 3D scene.
Industrial chimney monitoring - contemporary methods
NASA Astrophysics Data System (ADS)
Kaszowska, Olga; Gruchlik, Piotr; Mika, Wiesław
2018-04-01
The paper presents knowledge acquired during the monitoring of a flue-gas stack, performed as part of technical and scientific surveillance of mining activity and its impact on industrial objects. The chimney is located in an area impacted by mining activity since the 1970s, from a coal mine that is no longer in existence. In the period of 2013-16, this area was subject to mining carried out by a mining entrepreneur who currently holds a license to excavate hard coal. Periodic measurements of the deflection of the 113-meter chimney are performed using conventional geodetic methods. The GIG used three methods to observe the stack: land-based 3D laser scanning, continuous deflection monitoring with a laser sensor, and drone-based visual inspections. The drone offered the possibility to closely inspect the upper sections of the flue-gas stack, which are difficult to see from ground level.
A Review of Extra-Terrestrial Mining Robot Concepts
NASA Technical Reports Server (NTRS)
Mueller, Robert P.; Van Susante, Paul J.
2011-01-01
Outer space contains a vast amount of resources that offer virtually unlimited wealth to the humans that can access and use them for commercial purposes. One of the key technologies for harvesting these resources is robotic mining of regolith, minerals, ices and metals. The harsh environment and vast distances create challenges that are handled best by robotic machines working in collaboration with human explorers. Humans will benefit from the resources that will be mined by robots. They will visit outposts and mining camps as required for exploration, commerce and scientific research, but a continuous presence is most likely to be provided by robotic mining machines that are remotely controlled by humans. There have been a variety of extra-terrestrial robotic mining concepts proposed over the last 100 years and this paper will attempt to summarize and review concepts in the public domain (government, industry and academia) to serve as an informational resource for future mining robot developers and operators. The challenges associated with these concepts will be discussed and feasibility will be assessed. Future needs associated with commercial efforts will also be investigated.
A Review of Extra-Terrestrial Mining Concepts
NASA Technical Reports Server (NTRS)
Mueller, R. P.; van Susante, P. J.
2012-01-01
Outer space contains a vast amount of resources that offer virtually unlimited wealth to the humans that can access and use them for commercial purposes. One of the key technologies for harvesting these resources is robotic mining of regolith, minerals, ices and metals. The harsh environment and vast distances create challenges that are handled best by robotic machines working in collaboration with human explorers. Humans will benefit from the resources that will be mined by robots. They will visit outposts and mining camps as required for exploration, commerce and scientific research, but a continuous presence is most likely to be provided by robotic mining machines that are remotely controlled by humans. There have been a variety of extra-terrestrial robotic mining concepts proposed over the last 40 years and this paper will attempt to summarize and review concepts in the public domain (government, industry and academia) to serve as an informational resource for future mining robot developers and operators. The challenges associated with these concepts will be discussed and feasibility will be assessed. Future needs associated with commercial efforts will also be investigated.
Bachman, John A; Gyori, Benjamin M; Sorger, Peter K
2018-06-28
For automated reading of scientific publications to extract useful information about molecular mechanisms, it is critical that genes, proteins and other entities be correctly associated with uniform identifiers, a process known as named entity linking or "grounding." Correct grounding is essential for resolving relationships among mined information, curated interaction databases, and biological datasets. The accuracy of this process is largely dependent on the availability of machine-readable resources associating synonyms and abbreviations commonly found in biomedical literature with uniform identifiers. In a task involving automated reading of ∼215,000 articles using the REACH event extraction software we found that grounding was disproportionately inaccurate for multi-protein families (e.g., "AKT") and complexes with multiple subunits (e.g., "NF-κB"). To address this problem we constructed FamPlex, a manually curated resource defining protein families and complexes as they are commonly encountered in biomedical text. In FamPlex the gene-level constituents of families and complexes are defined in a flexible format allowing for multi-level, hierarchical membership. To create FamPlex, text strings corresponding to entities were identified empirically from literature and linked manually to uniform identifiers; these identifiers were also mapped to equivalent entries in multiple related databases. FamPlex also includes curated prefix and suffix patterns that improve named entity recognition and event extraction. Evaluation of REACH extractions on a test corpus of ∼54,000 articles showed that FamPlex significantly increased grounding accuracy for families and complexes (from 15 to 71%). The hierarchical organization of entities in FamPlex also made it possible to integrate otherwise unconnected mechanistic information across families, subfamilies, and individual proteins. Applications of FamPlex to the TRIPS/DRUM reading system and the BioCreative VI Bioentity Normalization Task dataset demonstrated the utility of FamPlex in other settings. FamPlex is an effective resource for improving named entity recognition, grounding, and relationship resolution in automated reading of biomedical text. The content in FamPlex is available in both tabular and Open Biomedical Ontology formats at https://github.com/sorgerlab/famplex under the Creative Commons CC0 license and has been integrated into the TRIPS/DRUM and REACH reading systems.
An innovative privacy preserving technique for incremental datasets on cloud computing.
Aldeen, Yousra Abdul Alsahib S; Salleh, Mazleena; Aljeroudi, Yazan
2016-08-01
Cloud computing (CC) is a magnificent service-based delivery model with gigantic computer processing power and data storage across connected communications channels. It has imparted an overwhelming technological impetus to the internet (web) mediated IT industry, where users can easily share private data for further analysis and mining. Furthermore, user-friendly CC services enable sundry applications to be deployed economically. Meanwhile, simple data sharing has impelled various phishing attacks and malware-assisted security threats. Some privacy-sensitive applications, such as health services on the cloud, that are built with several economic and operational benefits necessitate enhanced security. Thus, absolute cyberspace security and mitigation against phishing attacks became mandatory to protect overall data privacy. Typically, diverse application datasets are anonymized to give better privacy to owners without providing all secrecy requirements to newly added records. Some proposed techniques address this issue by re-anonymizing the datasets from scratch. The utmost privacy protection over incremental datasets on CC is far from being achieved. Certainly, the distribution of huge dataset volumes across multiple storage nodes limits privacy preservation. In this view, we propose a new anonymization technique to attain better privacy protection with high data utility over distributed and incremental datasets on CC. The proficiency of data privacy preservation and improved confidentiality requirements is demonstrated through performance evaluation. Copyright © 2016 Elsevier Inc. All rights reserved.
Jeyasingh, Suganthi; Veluchamy, Malathi
2017-05-01
Early diagnosis of breast cancer is essential to save the lives of patients. Usually, medical datasets include a large variety of data that can lead to confusion during diagnosis. The Knowledge Discovery in Databases (KDD) process helps to improve efficiency; it requires the elimination of inappropriate and repeated data from the dataset before final diagnosis. This can be done using any of the feature selection algorithms available in data mining, and feature selection is considered a vital step in increasing classification accuracy. This paper proposes a Modified Bat Algorithm (MBA) for feature selection to eliminate irrelevant features from an original dataset. The Bat algorithm was modified using simple random sampling to select random instances from the dataset. Ranking against the global best features was used to recognize the predominant features available in the dataset. The selected features are used to train a Random Forest (RF) classification algorithm. The MBA feature selection algorithm enhanced the classification accuracy of RF in identifying the occurrence of breast cancer. The Wisconsin Diagnosis Breast Cancer (WDBC) dataset was used for estimating the performance of the proposed MBA feature selection algorithm. The proposed algorithm achieved better performance in terms of Kappa statistic, Matthews Correlation Coefficient, Precision, F-measure, Recall, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE).
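A rough sketch of the pipeline's shape is shown below: a plain random search over feature subsets (standing in for the bat-inspired update rules, which are not reproduced here) scored by a cross-validated Random Forest, using the WDBC copy bundled with scikit-learn:

```python
# Simplified sketch (not the authors' MBA): randomized feature-subset search
# scored by a cross-validated Random Forest on the WDBC dataset.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # WDBC: 569 samples x 30 features
rng = np.random.default_rng(0)

best_subset, best_score = None, -np.inf
for _ in range(20):                          # random candidate feature subsets
    mask = rng.random(X.shape[1]) < 0.5
    if not mask.any():
        continue
    score = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X[:, mask], y, cv=5).mean()
    if score > best_score:                   # keep the global best subset
        best_subset, best_score = mask, score

print(round(best_score, 4), int(best_subset.sum()), "features kept")
```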
Differentially Private Frequent Subgraph Mining
Xu, Shengzhi; Xiong, Li; Cheng, Xiang; Xiao, Ke
2016-01-01
Mining frequent subgraphs from a collection of input graphs is an important topic in data mining research. However, if the input graphs contain sensitive information, releasing frequent subgraphs may pose considerable threats to individuals' privacy. In this paper, we study the problem of frequent subgraph mining (FGM) under the rigorous differential privacy model. We introduce a novel differentially private FGM algorithm, referred to as DFG. In this algorithm, we first privately identify frequent subgraphs from the input graphs, and then compute the noisy support of each identified frequent subgraph. In particular, to privately identify frequent subgraphs, we present a frequent subgraph identification approach which can improve the utility of frequent subgraph identification through candidate pruning. Moreover, to compute the noisy support of each identified frequent subgraph, we devise a lattice-based noisy support derivation approach, where a series of methods is proposed to improve the accuracy of the noisy supports. Through formal privacy analysis, we prove that our DFG algorithm satisfies ε-differential privacy. Extensive experimental results on real datasets show that the DFG algorithm can privately find frequent subgraphs with high data utility. PMID:27616876
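The noisy-support step can be illustrated on its own: Laplace noise with scale sensitivity/ε makes each released count ε-differentially private. The sketch below uses hypothetical subgraph counts and omits the private identification stage (and the budget split between the two stages, which a full algorithm like DFG must account for):

```python
# Laplace-mechanism sketch for releasing noisy subgraph supports.
import numpy as np

rng = np.random.default_rng(0)
true_supports = {"g1": 120, "g2": 87, "g3": 45}   # hypothetical subgraph counts

epsilon = 1.0
sensitivity = 1.0   # adding/removing one input graph changes a support by <= 1

# Each count gets Laplace(0, sensitivity/epsilon) noise; in a complete
# algorithm the privacy budget would be divided across all released counts.
noisy = {g: s + rng.laplace(0.0, sensitivity / epsilon)
         for g, s in true_supports.items()}
print(noisy)
```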
Pandey, Abhishek; Kreimeyer, Kory; Foster, Matthew; Botsis, Taxiarchis; Dang, Oanh; Ly, Thomas; Wang, Wei; Forshee, Richard
2018-01-01
Structured Product Labels follow an XML-based document markup standard approved by the Health Level Seven organization and adopted by the US Food and Drug Administration as a mechanism for exchanging medical products information. Their current organization makes their secondary use rather challenging. We used the Side Effect Resource database and DailyMed to generate a comparison dataset of 1159 Structured Product Labels. We processed the Adverse Reaction section of these Structured Product Labels with the Event-based Text-mining of Health Electronic Records system and evaluated its ability to extract and encode Adverse Event terms to Medical Dictionary for Regulatory Activities Preferred Terms. A small sample of 100 labels was then selected for further analysis. Of the 100 labels, Event-based Text-mining of Health Electronic Records achieved a precision and recall of 81 percent and 92 percent, respectively. This study demonstrated the Event-based Text-mining of Health Electronic Records system's ability to extract and encode Adverse Event terms from Structured Product Labels, which may potentially support multiple pharmacoepidemiological tasks.
Temporal and spatial changes of land use and landscape in a coal mining area in Xilingol grassland
NASA Astrophysics Data System (ADS)
Guan, Chunzhu; Zhang, Baolin; Li, Jiannan; Zhao, Junling
2017-01-01
Coal mining, particularly surface mining, inevitably disturbs land. According to Landsat images acquired over the Xilingol grassland in 2005, 2009 and 2015, land uses were divided into seven classes, i.e., open stope, stripping area, waste-dump area, mine industrial area, farmland, urban area and the original landscape (grassland), using supervised classification and human-computer interactive interpretation. The overall classification accuracies were 97.72%, 98.43% and 96.73%, respectively; the Kappa coefficients were 0.95, 0.97 and 0.95, respectively. Analysis of LUCC (Land Use and Cover Change) showed that surface coal mining disturbed the grassland ecosystem: grassland decreased by 8661.15 hm2 in 2005-2015. The area and proportion of mining operation areas (open stope, stripping area, waste-dump area, mine industrial area) increased, while those of grassland decreased continuously. The transfer matrix of land use changes showed that waste dumps had the largest impacts among mining disturbances, and that effective reclamation of waste-dump areas would mitigate eco-environmental destruction, which would be of great significance for protecting the fragile grassland ecosystem. Six landscape indices showed that landscape fragmentation increased, and the influence of human activity on the landscape was mainly reflected in the expansion of the mining area and urban area. Remote sensing monitoring of surface coal mining in grassland can accurately demonstrate the dynamics and trends of LUCC, providing scientific support for ecological reconstruction in surface mining areas.
From scientific discovery to cures: bright stars within a galaxy.
Williams, R Sanders; Lotia, Samad; Holloway, Alisha K; Pico, Alexander R
2015-09-24
We propose that data mining and network analysis utilizing public databases can identify and quantify relationships between scientific discoveries and major advances in medicine (cures). Further development of such approaches could help to increase public understanding and governmental support for life science research and could enhance decision making in the quest for cures. Copyright © 2015 Elsevier Inc. All rights reserved.
Microbial bebop: creating music from complex dynamics in microbial ecology.
Larsen, Peter; Gilbert, Jack
2013-01-01
In order for society to make effective policy decisions on complex and far-reaching subjects, such as appropriate responses to global climate change, scientists must effectively communicate complex results to the non-scientifically specialized public. However, there are few ways to transform highly complicated scientific data into formats that are engaging to the general community. Taking inspiration from patterns observed in nature and from some of the principles of jazz bebop improvisation, we have generated Microbial Bebop, a method by which microbial environmental data are transformed into music. Microbial Bebop uses meter, pitch, duration, and harmony to highlight the relationships between multiple data types in complex biological datasets. We use a comprehensive microbial ecology time-course dataset collected at the L4 marine monitoring station in the Western English Channel as an example of microbial ecological data that can be transformed into music. Four compositions were generated from L4 Station data using Microbial Bebop (www.bio.anl.gov/MicrobialBebop.htm). Each composition, though deriving from the same dataset, is created to highlight different relationships between environmental conditions and microbial community structure. The approach presented here can be applied to a wide variety of complex biological datasets.
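A toy sketch of the underlying idea, with made-up mappings (the actual Microbial Bebop rules are more elaborate): each value selects a pitch from a fixed scale, and the size of the change between successive values selects a duration:

```python
# Toy data-to-music mapping: normalized values pick a note from a pentatonic
# scale; large jumps between successive values get the longer duration.
pentatonic = ["C4", "D4", "E4", "G4", "A4"]

def to_melody(values, durations=("quarter", "eighth")):
    lo, hi = min(values), max(values)
    melody = []
    for prev, cur in zip(values, values[1:]):
        idx = int((cur - lo) / (hi - lo + 1e-9) * (len(pentatonic) - 1))
        dur = durations[0] if abs(cur - prev) > 0.1 * (hi - lo) else durations[1]
        melody.append((pentatonic[idx], dur))
    return melody

abundance = [0.12, 0.18, 0.35, 0.30, 0.50, 0.42, 0.20]   # hypothetical series
print(to_melody(abundance))
```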
Yılmaz Isıkhan, Selen; Karabulut, Erdem; Alpar, Celal Reha
2016-01-01
Background/Aim. Evaluating the success of dose prediction based on genetic or clinical data has advanced substantially in recent years. The aim of this study is to predict various clinical dose values from DNA gene expression datasets using data mining techniques. Materials and Methods. Eleven real gene expression datasets containing dose values were included. First, important genes for dose prediction were selected using iterative sure independence screening. Then, the performances of regression trees (RTs), support vector regression (SVR), RT bagging, SVR bagging, and RT boosting were examined. Results. The results demonstrated that a regression-based feature selection method substantially reduced the number of irrelevant genes from raw datasets. Overall, the best prediction performance in nine of 11 datasets was achieved using SVR; the second most accurate performance was provided using a gradient-boosting machine (GBM). Conclusion. Analysis of various dose values based on microarray gene expression data identified common genes found in our study and the referenced studies. According to our findings, SVR and GBM can be good predictors of dose-gene datasets. Another result of the study was the identification of a sample size of n = 25 as a cutoff point for RT bagging to outperform a single RT.
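A minimal sketch of this two-stage design on synthetic data: plain marginal-correlation screening stands in for iterative sure independence screening, followed by cross-validated SVR. The dataset shape, noise level, and number of retained genes are arbitrary choices for the example:

```python
# Screen genes by marginal correlation with the dose, then fit SVR.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))             # 100 samples x 2000 "genes"
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.5, size=100)  # dose values

corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = np.argsort(corr)[-25:]                # retain the top-ranked genes

model = make_pipeline(StandardScaler(), SVR(C=10.0, epsilon=0.1))
print(cross_val_score(model, X[:, keep], y, cv=5, scoring="r2").mean())
```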
Oliphant, Adam J.; Wynne, R.H.; Zipper, Carl E.; Ford, W. Mark; Donovan, P. F.; Li, Jing
2017-01-01
Invasive plants threaten native plant communities. Surface coal mines in the Appalachian Mountains are among the most disturbed landscapes in North America, but information about land cover characteristics of Appalachian mined lands is lacking. The invasive shrub autumn olive (Elaeagnus umbellata) occurs on these sites and interferes with ecosystem recovery by outcompeting native trees, thus inhibiting re-establishment of the native woody-plant community. We analyzed Landsat 8 satellite imagery to describe autumn olive’s distribution on post-mined lands in southwestern Virginia within the Appalachian coalfield. Eight images from April 2013 through January 2015 served as input data. Calibration and validation data obtained from high-resolution aerial imagery were used to develop a land cover classification model that identified areas where autumn olive was a primary component of land cover. Results indicate that autumn olive cover was sufficiently dense to enable detection on approximately 12.6 % of post-mined lands within the study area. The classified map had user’s and producer’s accuracies of 85.3 and 78.6 %, respectively, for the autumn olive coverage class. Overall accuracy was assessed in reference to an independent validation dataset at 96.8 %. Autumn olive was detected more frequently on mines disturbed prior to 2003, the last year of known plantings, than on lands disturbed by more recent mining. These results indicate that autumn olive growing on reclaimed coal mines in Virginia and elsewhere in eastern USA can be mapped using Landsat 8 Operational Land Imager imagery; and that autumn olive occurrence is a significant landscape vegetation feature on former surface coal mines in the southwestern Virginia segment of the Appalachian coalfield.
Earth Science Research at the Homestake Deep Underground Science and Engineering Laboratory
NASA Astrophysics Data System (ADS)
Roggenthen, W.; Wang, J.
2004-12-01
The Homestake Mine in South Dakota ceased gold production in 2002 and was sealed for entry in 2003. The announcement of mine closure triggered the revival of a national initiative to establish a deep underground facility, currently known as the Deep Underground Science and Engineering Laboratory (DUSEL). The National Science Foundation announced that solicitations were to be issued in 2004 and 2005, with the first one (known as S-1) issued in June 2004. The focus of S-1 is on site-nonspecific technical requirements to define the scientific program at DUSEL. Earth scientists and physicists participated in an S-1 workshop at Berkeley in August 2004. This abstract presents the prospects of the Homestake Mine to accommodate the earth science programs defined at the S-1 workshop. The Homestake Mine has hundreds of kilometers of drifts over fifty levels accessible (upon mine reopening) for water evaluation, seepage quantification, seismic monitoring, geophysical imaging, geological mapping, mineral sampling, ecology and geo-microbiology. The extensive network of drifts, ramps, and vertical shafts allows installation of 10-kilometer-scale seismograph and electromagnetic networks. Ramps connecting different levels, typically separated by 150 ft, could be instrumented for flow and transport studies, prior to implementation of coupled thermal-hydro-chemical-mechanical-biological processes testing. Numerous large rooms are available for ecological and introduced-material evaluations. Ideas for installing instruments in cubic kilometers of rock mass can be realized over multiple levels. Environmental assessment, petroleum recovery, and carbon sequestration were among the applications discussed in the S-1 workshop. If the Homestake Mine can be expediently reopened, earth scientists are ready to perform important tests with a phased approach. The drifts and ramps directly below the large open pit could be the first area for shallow testing. The 4,850 ft level is the next target area, which has a large lateral extent. Geophysical sensor stations could be installed at this level, together with stations along two main shafts accessing this level, and one winze below. After dewatering, rock mechanics and geotechnical engineering investigators could actively participate in room siting and excavation, at depths up to 8,000 ft. Geochemistry and geo-microbiology scientists would prefer additional drilling in deep zones beyond the mining and flooding perturbations. Additional earth science programs are being developed for the Homestake Mine, utilizing multiple levels and shafts. Many physics experiments require a site "as deep as possible" and special conditions to reduce background and cosmic rays. The Homestake Mine offers a very deep site and a vast amount of data and knowledge associated with its 125 years of mining operation. The cores from exploratory drilling into a mechanically strong unit, the Yates Formation, are available for scientific and engineering evaluations. A team from many institutions is being formed by Kevin Lesko, a neutrino scientist with experience in detecting neutrino oscillations with deep detectors in Canada and Japan. It is time for the United States to establish a DUSEL deep and large enough for next-generation physics and earth science long-term experiments. The Homestake Mine has these necessary attributes.
The collaboration welcomes participation and contribution from scientists and engineers in the physics and earth science community for multi-disciplinary research during and after the restoration and conversion of the Homestake Mine.
Data Container Study for Handling array-based data using Hive, Spark, MongoDB, SciDB and Rasdaman
NASA Astrophysics Data System (ADS)
Xu, M.; Hu, F.; Yang, J.; Yu, M.; Yang, C. P.
2017-12-01
Geoscience communities have come up with various big data storage solutions, such as Rasdaman and Hive, to address the grand challenges for massive Earth observation data management and processing. To examine the readiness of current solutions in supporting big Earth observation, we propose to investigate and compare five popular data container solutions: Rasdaman, Hive, Spark, SciDB and MongoDB. Using different types of spatial and non-spatial queries, datasets stored in common scientific data formats (e.g., NetCDF and HDF), and two applications (i.e., dust storm simulation data mining and MERRA data analytics), we systematically compare and evaluate the features and performance of these data containers in terms of data discovery and access. The computing resources (e.g., CPU, memory, hard drive, network) consumed while performing various queries and operations are monitored and recorded for the performance evaluation. The initial results show that 1) the popular data container clusters are able to handle large volumes of data, but their performances vary in different situations. Meanwhile, there is a trade-off between data preprocessing, disk saving, query-time saving, and resource consumption. 2) ClimateSpark, MongoDB and SciDB perform the best among all the containers in all the query tests, and Hive performs the worst. 3) These studied data containers can be applied to other array-based datasets, such as high-resolution remote sensing data and model simulation data. 4) Rasdaman clustering configuration is more complex than that of the others. A comprehensive report will detail the experimental results, and compare their pros and cons regarding system performance, ease of use, accessibility, scalability, compatibility, and flexibility.
Myneni, Sahiti; Patel, Vimla L.
2010-01-01
Biomedical researchers often work with massive, detailed and heterogeneous datasets. These datasets raise new challenges of information organization and management for scientific interpretation, as they demand much of the researchers’ time and attention. The current study investigated the nature of the problems that researchers face when dealing with such data. Four major problems identified with existing biomedical scientific information management methods were related to data organization, data sharing, collaboration, and publications. Therefore, there is a compelling need to develop an efficient and user-friendly information management system to handle the biomedical research data. This study evaluated the implementation of an information management system, which was introduced as part of the collaborative research to increase scientific productivity in a research laboratory. Laboratory members seemed to exhibit frustration during the implementation process. However, empirical findings revealed that they gained new knowledge and completed specified tasks while working together with the new system. Hence, researchers are urged to persist and persevere when dealing with any new technology, including an information management system in a research laboratory environment. PMID:20543892
Tracking Provenance of Earth Science Data
NASA Technical Reports Server (NTRS)
Tilmes, Curt; Yesha, Yelena; Halem, Milton
2010-01-01
Tremendous volumes of data have been captured, archived and analyzed. Sensors, algorithms and processing systems for transforming and analyzing the data are evolving over time. Web Portals and Services can create transient data sets on-demand. Data are transferred from organization to organization with additional transformations at every stage. Provenance in this context refers to the source of data and a record of the process that led to its current state. It encompasses the documentation of a variety of artifacts related to particular data. Provenance is important for understanding and using scientific datasets, and critical for independent confirmation of scientific results. Managing provenance throughout scientific data processing has gained interest lately and there are a variety of approaches. Large scale scientific datasets consisting of thousands to millions of individual data files and processes offer particular challenges. This paper uses the analogy of art history provenance to explore some of the concerns of applying provenance tracking to earth science data. It also illustrates some of the provenance issues with examples drawn from the Ozone Monitoring Instrument (OMI) Data Processing System (OMIDAPS) run at NASA's Goddard Space Flight Center by the first author.
Arnold, L. Rick
2010-01-01
These datasets were compiled in support of U.S. Geological Survey Scientific-Investigations Report 2010-5082-Hydrogeology and Steady-State Numerical Simulation of Groundwater Flow in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. The datasets were developed by the U.S. Geological Survey in cooperation with the Lost Creek Ground Water Management District and the Colorado Geological Survey. The four datasets are described as follows and methods used to develop the datasets are further described in Scientific-Investigations Report 2010-5082: (1) ds507_regolith_data: This point dataset contains geologic information concerning regolith (unconsolidated sediment) thickness and top-of-bedrock altitude at selected well and test-hole locations in and near the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Data were compiled from published reports, consultant reports, and from lithologic logs of wells and test holes on file with the U.S. Geological Survey Colorado Water Science Center and the Colorado Division of Water Resources. (2) ds507_regthick_contours: This dataset consists of contours showing generalized lines of equal regolith thickness overlying bedrock in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Regolith thickness was contoured manually on the basis of information provided in the dataset ds507_regolith_data. (3) ds507_regthick_grid: This dataset consists of raster-based generalized thickness of regolith overlying bedrock in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. Regolith thickness in this dataset was derived from contours presented in the dataset ds507_regthick_contours. (4) ds507_welltest_data: This point dataset contains estimates of aquifer transmissivity and hydraulic conductivity at selected well locations in the Lost Creek Designated Ground Water Basin, Weld, Adams, and Arapahoe Counties, Colorado. This dataset also contains hydrologic information used to estimate transmissivity from specific capacity at selected well locations. Data were compiled from published reports, consultant reports, and from well-test records on file with the U.S. Geological Survey Colorado Water Science Center and the Colorado Division of Water Resources.
Exploring Antarctic Land Surface Temperature Extremes Using Condensed Anomaly Databases
NASA Astrophysics Data System (ADS)
Grant, Glenn Edwin
Satellite observations have revolutionized the Earth Sciences and climate studies. However, data and imagery continue to accumulate at an accelerating rate, and efficient tools for data discovery, analysis, and quality checking lag behind. In particular, studies of long-term, continental-scale processes at high spatiotemporal resolutions are especially problematic. The traditional technique of downloading an entire dataset and using customized analysis code is often impractical or consumes too many resources. The Condensate Database Project was envisioned as an alternative method for data exploration and quality checking. The project's premise was that much of the data in any satellite dataset is unneeded and can be eliminated, compacting massive datasets into more manageable sizes. Dataset sizes are further reduced by retaining only anomalous data of high interest. Hosting the resulting "condensed" datasets in high-speed databases enables immediate availability for queries and exploration. Proof of the project's success relied on demonstrating that the anomaly database methods can enhance and accelerate scientific investigations. The hypothesis of this dissertation is that the condensed datasets are effective tools for exploring many scientific questions, spurring further investigations and revealing important information that might otherwise remain undetected. This dissertation uses condensed databases containing 17 years of Antarctic land surface temperature anomalies as its primary data. The study demonstrates the utility of the condensate database methods by discovering new information. In particular, the process revealed critical quality problems in the source satellite data. The results are used as the starting point for four case studies, investigating Antarctic temperature extremes, cloud detection errors, and the teleconnections between Antarctic temperature anomalies and climate indices. The results confirm the hypothesis that the condensate databases are a highly useful tool for Earth Science analyses. Moreover, the quality checking capabilities provide an important method for independent evaluation of dataset veracity.
Biotea: RDFizing PubMed Central in support for the paper as an interface to the Web of Data
2013-01-01
Background: The World Wide Web has become a dissemination platform for scientific and non-scientific publications. However, most of the information remains locked up in discrete documents that are not always interconnected or machine-readable. The connectivity tissue provided by RDF technology has not yet been widely used to support the generation of self-describing, machine-readable documents. Results: In this paper, we present our approach to the generation of self-describing machine-readable scholarly documents. We understand the scientific document as an entry point and interface to the Web of Data. We have semantically processed the full-text, open-access subset of PubMed Central. Our RDF model and resulting dataset make extensive use of existing ontologies and semantic enrichment services. We expose our model, services, prototype, and datasets at http://biotea.idiginfo.org/. Conclusions: The semantic processing of biomedical literature presented in this paper embeds documents within the Web of Data and facilitates the execution of concept-based queries against the entire digital library. Our approach delivers a flexible and adaptable set of tools for metadata enrichment and semantic processing of biomedical documents. Our model delivers a semantically rich and highly interconnected dataset with self-describing content so that software can make effective use of it. PMID:23734622
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gillen, David S.
Analysis activities for Nonproliferation and Arms Control verification require the use of many types of data. Tabular structured data, such as Excel spreadsheets and relational databases, have traditionally been used for data mining activities, where specific queries are issued against data to look for matching results. The application of visual analytics tools to structured data enables further exploration of datasets to promote discovery of previously unknown results. This paper discusses the application of a specific visual analytics tool to datasets related to the field of Arms Control and Nonproliferation to promote the use of visual analytics more broadly in this domain. Visual analytics focuses on analytical reasoning facilitated by interactive visual interfaces (Wong and Thomas 2004). It promotes exploratory analysis of data, and complements data mining technologies where known patterns can be mined for. Also, with a human in the loop, analysts can bring in domain knowledge and subject matter expertise. Visual analytics has not widely been applied to this domain. In this paper, we will focus on one type of data, structured data, and show the results of applying a specific visual analytics tool to answer questions in the Arms Control and Nonproliferation domain. We chose to use the T.Rex tool, a visual analytics tool developed at PNNL, which uses a variety of visual exploration patterns to discover relationships in structured datasets, including a facet view, graph view, matrix view, and timeline view. The facet view enables discovery of relationships between categorical information, such as countries and locations. The graph tool visualizes node-link relationship patterns, such as the flow of materials being shipped between parties. The matrix visualization shows highly correlated categories of information. The timeline view shows temporal patterns in data. In this paper, we will use T.Rex with two different datasets to demonstrate how interactive exploration of the data can aid an analyst with arms control and nonproliferation verification activities. Using a dataset from PIERS (PIERS 2014), we will show how container shipment imports and exports can aid an analyst in understanding the shipping patterns between two countries. We will also use T.Rex to examine a collection of research publications from the IAEA International Nuclear Information System (IAEA 2014) to discover collaborations of concern. We hope this paper will encourage the use of visual analytics for structured data in the field of nonproliferation and arms control verification. Our paper outlines some of the challenges that exist before broad adoption of these kinds of tools can occur and offers next steps to overcome these challenges.
ToxRefDB - Release user-friendly web-based tool for mining ToxRefDB
The updated URL link is for a table of NCCT ToxCast public datasets. The next-to-last row of the table has the link for the US EPA ToxCast ToxRefDB Data Release October 2014. ToxRefDB provides detailed chemical toxicity data in a publicly accessible searchable format. ToxRefD...
Mining spatiotemporal patterns of urban dwellers from taxi trajectory data
NASA Astrophysics Data System (ADS)
Mao, Feng; Ji, Minhe; Liu, Ting
2016-06-01
With the widespread adoption of location-aware technology, obtaining long-sequence, massive and high-accuracy spatiotemporal trajectory data of individuals has become increasingly popular in various geographic studies. Trajectory data of taxis, one of the most widely used inner-city travel modes, contain rich information about both road network traffic and travel behavior of passengers. Such data can be used to study the microscopic activity patterns of individuals as well as the macro system of urban spatial structures. This paper focuses on trajectories obtained from GPS-enabled taxis and their applications for mining urban commuting patterns. A novel approach is proposed to discover spatiotemporal patterns of household travel from the taxi trajectory dataset with a large number of point locations. The approach involves three critical steps: spatial clustering of taxi origin-destination (OD) points based on urban traffic grids to discover potentially meaningful places, identifying threshold values from statistics of the OD clusters to extract urban jobs-housing structures, and visualization of analytic results to understand the spatial distribution and temporal trends of the revealed urban structures and implied household commuting behavior. A case study with a taxi trajectory dataset in Shanghai, China is presented to demonstrate and evaluate the proposed method.
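A compact stand-in for the OD-clustering step (the paper uses grid-based spatial clustering; DBSCAN is substituted here for brevity, and the coordinates are fabricated):

```python
# Cluster taxi pick-up/drop-off points to find potentially meaningful places.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
home = rng.normal([121.45, 31.22], 0.002, size=(300, 2))   # synthetic "home" ODs
work = rng.normal([121.50, 31.24], 0.002, size=(300, 2))   # synthetic "work" ODs
noise = rng.uniform([121.40, 31.18], [121.55, 31.28], size=(50, 2))
od_points = np.vstack([home, work, noise])                 # (lon, lat) pairs

labels = DBSCAN(eps=0.004, min_samples=20).fit_predict(od_points)
for k in set(labels) - {-1}:                               # -1 marks noise
    print("cluster", k, "center:", od_points[labels == k].mean(axis=0))
```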
Trayhurn, Paul; Denyer, Gareth
2012-01-01
Microarray datasets are a rich source of information in nutritional investigation. Targeted mining of microarray data following initial, non-biased bioinformatic analysis can provide key insight into specific genes and metabolic processes of interest. Microarrays from human adipocytes were examined to explore the effects of macrophage secretions on the expression of the G-protein-coupled receptor (GPR) genes that encode fatty acid receptors/sensors. Exposure of the adipocytes to macrophage-conditioned medium for 4 or 24 h had no effect on GPR40 and GPR43 expression, but there was a marked stimulation of GPR84 expression (receptor for medium-chain fatty acids), the mRNA level increasing 13·5-fold at 24 h relative to unconditioned medium. Importantly, expression of GPR120, which encodes an n-3 PUFA receptor/sensor, was strongly inhibited by the conditioned medium (15-fold decrease in mRNA at 24 h). Macrophage secretions have major effects on the expression of fatty acid receptor/sensor genes in human adipocytes, which may lead to an augmentation of the inflammatory response in adipose tissue in obesity. PMID:25191551
Big Data Analytics for Demand Response: Clustering Over Space and Time
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chelmis, Charalampos; Kolte, Jahanvi; Prasanna, Viktor K.
The pervasive deployment of advanced sensing infrastructure in Cyber-Physical systems, such as the Smart Grid, has resulted in an unprecedented data explosion. Such data exhibit both large volumes and high velocity characteristics, two of the three pillars of Big Data, and have a time-series notion, as datasets in this context typically consist of successive measurements made over a time interval. Time-series data can be valuable for data mining and analytics tasks such as identifying the “right” customers among a diverse population to target for Demand Response programs. However, time series are challenging to mine due to their high dimensionality. In this paper, we motivate this problem using a real application from the smart grid domain. We explore novel representations of time-series data for Big Data analytics, and propose a clustering technique for determining natural segmentation of customers and identification of temporal consumption patterns. Our method is generalizable to large-scale, real-world scenarios, without making any assumptions about the data. We evaluate our technique using real datasets from smart meters, totaling ~18,200,000 data points, and show the efficacy of our technique in efficiently detecting the optimal number of clusters.
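As an illustration of the general approach, the sketch below clusters daily load profiles after a simple dimensionality reduction (piecewise aggregate approximation). The representation, parameters, and synthetic data are assumptions for illustration; the paper's actual representation is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def paa(series, n_segments=8):
    """Piecewise Aggregate Approximation: mean of equal-width windows."""
    return series.reshape(len(series), n_segments, -1).mean(axis=2)

# Synthetic stand-in for smart-meter data: 1000 customers x 96 readings/day
rng = np.random.default_rng(1)
profiles = rng.gamma(2.0, 1.0, size=(1000, 96))
profiles /= profiles.sum(axis=1, keepdims=True)   # normalise shape, not volume

reduced = paa(profiles, n_segments=8)             # 96-d -> 8-d per customer
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)
print(np.bincount(labels))                        # customers per consumption pattern
```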
PRESEE: An MDL/MML Algorithm to Time-Series Stream Segmenting
Jiang, Yexi; Tang, Mingjie; Yuan, Changan; Tang, Changjie
2013-01-01
Time-series streams are one of the most common data types in the data mining field. They are prevalent in fields such as the stock market, ecology, and medical care. Segmentation is a key step to accelerate the processing speed of time-series stream mining. Previous segmentation algorithms focused mainly on improving precision and paid less attention to efficiency. Moreover, the performance of these algorithms depends heavily on parameters, which are hard for users to set. In this paper, we propose PRESEE (parameter-free, real-time, and scalable time-series stream segmenting algorithm), which greatly improves the efficiency of time-series stream segmenting. PRESEE is based on both MDL (minimum description length) and MML (minimum message length) methods, which allow it to segment the data automatically. To evaluate the performance of PRESEE, we conduct several experiments on time-series streams of different types and compare it with the state-of-the-art algorithm. The empirical results show that PRESEE is very efficient on real-time stream datasets, improving segmenting speed by nearly ten times. The novelty of this algorithm is further demonstrated by the application of PRESEE to segmenting real-time stream datasets from ChinaFLUX sensor network data streams. PMID:23956693
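PRESEE's encoding scheme is more involved, but the underlying MDL idea can be sketched on a piecewise-constant model: a cut point is kept only when it shortens the total description length (model bits plus residual bits). The cost function and the 32-bit model penalty below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def seg_cost(x):
    """Bits to encode a segment under a constant-plus-Gaussian-noise model."""
    var = x.var() + 1e-12
    return 0.5 * len(x) * np.log2(2 * np.pi * np.e * var)

def mdl_segment(x, model_bits=32, lo=0, hi=None, cuts=None):
    """Top-down split: keep a cut only if it pays for its own description
    (model_bits per extra segment) in shorter residual code."""
    if cuts is None:
        cuts, hi = [], len(x)
    if hi - lo < 4:
        return cuts
    whole = seg_cost(x[lo:hi])
    best_gain, best_k = 0.0, None
    for k in range(lo + 2, hi - 2):
        gain = whole - seg_cost(x[lo:k]) - seg_cost(x[k:hi]) - model_bits
        if gain > best_gain:
            best_gain, best_k = gain, k
    if best_k is not None:
        cuts.append(best_k)
        mdl_segment(x, model_bits, lo, best_k, cuts)
        mdl_segment(x, model_bits, best_k, hi, cuts)
    return sorted(cuts)

x = np.r_[np.random.normal(0, 1, 200), np.random.normal(4, 1, 200)]
print(mdl_segment(x))   # expect a single cut near index 200
```

Because the penalty comes from the encoding itself, no user-set threshold is needed, which is the parameter-free property the abstract emphasises.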
PRESEE: an MDL/MML algorithm to time-series stream segmenting.
Xu, Kaikuo; Jiang, Yexi; Tang, Mingjie; Yuan, Changan; Tang, Changjie
2013-01-01
Time-series streams are one of the most common data types in the data mining field. They are prevalent in fields such as the stock market, ecology, and medical care. Segmentation is a key step to accelerate the processing speed of time-series stream mining. Previous segmentation algorithms focused mainly on improving precision and paid less attention to efficiency. Moreover, the performance of these algorithms depends heavily on parameters, which are hard for users to set. In this paper, we propose PRESEE (parameter-free, real-time, and scalable time-series stream segmenting algorithm), which greatly improves the efficiency of time-series stream segmenting. PRESEE is based on both MDL (minimum description length) and MML (minimum message length) methods, which allow it to segment the data automatically. To evaluate the performance of PRESEE, we conduct several experiments on time-series streams of different types and compare it with the state-of-the-art algorithm. The empirical results show that PRESEE is very efficient on real-time stream datasets, improving segmenting speed by nearly ten times. The novelty of this algorithm is further demonstrated by the application of PRESEE to segmenting real-time stream datasets from ChinaFLUX sensor network data streams.
Determining a pre-mining radiological baseline from historic airborne gamma surveys: a case study.
Bollhöfer, Andreas; Beraldo, Annamarie; Pfitzner, Kirrilly; Esparon, Andrew; Doering, Che
2014-01-15
Knowing the baseline level of radioactivity in areas naturally enriched in radionuclides is important in the uranium mining context to assess radiation doses to humans and the environment both during and after mining. This information is particularly useful in rehabilitation planning and developing closure criteria for uranium mines, as only radiation doses additional to the natural background are usually considered 'controllable' for radiation protection purposes. In this case study we tested whether contemporary groundtruthing of a historic airborne gamma survey could be used to determine the pre-mining radiological conditions at the Ranger mine in northern Australia. The airborne gamma survey was flown in 1976 before mining started and groundtruthed using ground gamma dose rate measurements made between 2007 and 2009 at an undisturbed area naturally enriched in uranium (Anomaly 2) located near the Ranger mine. Measurements of (226)Ra soil activity concentration and (222)Rn exhalation flux density at Anomaly 2 were made concurrently with the ground gamma dose rate measurements. Algorithms were developed to upscale the ground gamma data to the same spatial resolution as the historic airborne gamma survey data using a geographic information system, allowing comparison of the datasets. Linear correlation models were developed to estimate the pre-mining gamma dose rates, (226)Ra soil activity concentrations, and (222)Rn exhalation flux densities at selected areas in the greater Ranger region. The modelled levels agreed with measurements made at the Ranger Orebodies 1 and 3 before mining started, and at environmental sites in the region. The conclusion is that our approach can be used to determine baseline radiation levels, and provide a benchmark for rehabilitation of uranium mines or industrial sites where historical airborne gamma survey data are available and an undisturbed radiological analogue exists to groundtruth the data. © 2013.
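The calibration step reduces to fitting a linear model between coincident airborne and upscaled ground measurements, then applying it where only historic airborne data exist. A minimal sketch with invented numbers (not the study's actual values):

```python
import numpy as np
from scipy import stats

# Coincident pairs over the undisturbed analogue area: 1976 airborne counts
# vs upscaled ground gamma dose rate (uGy/h); values are illustrative only
airborne = np.array([120., 180., 260., 340., 410., 520.])
ground   = np.array([0.11, 0.16, 0.25, 0.31, 0.40, 0.49])

fit = stats.linregress(airborne, ground)
print(f"slope={fit.slope:.2e}  r^2={fit.rvalue**2:.3f}")

# Estimate the pre-mining dose rate where only a historic airborne value exists
orebody_airborne = 300.0
print("estimated baseline dose rate:", fit.slope * orebody_airborne + fit.intercept)
```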
Data publication - policies and procedures from the PREPARDE project
NASA Astrophysics Data System (ADS)
Callaghan, Sarah; Murphy, Fiona; Tedds, Jonathan; Kunze, John; Lawrence, Rebecca; Mayernik, Matthew S.; Whyte, Angus; Roberts, Timothy
2013-04-01
Data are widely acknowledged as a first class scientific output. Increases in researchers' abilities to create data need to be matched by corresponding infrastructures for them to manage and share their data. At the same time, the quality and persistence of the datasets need to be ensured, providing the dataset creators with the recognition they deserve for their efforts. Formal publication of data takes advantage of the processes and procedures already in place to publish academic articles about scientific results, enabling data to be reviewed and more broadly disseminated. Data are vastly more varied in format than papers, and so the policies required to manage and publish data must take into account the complexities associated with different data types, scientific fields, licensing rules etc. The Peer REview for Publication & Accreditation of Research Data in the Earth sciences (PREPARDE) project is JISC- and NERC-funded, and aims to investigate the policies and procedures required for the formal publication of research data. The project is investigating the whole workflow of data publication, from ingestion into a data repository, through to formal publication in a data journal. To limit the scope of the project, the focus is primarily on the policies required for the Royal Meteorological Society and Wiley's Geoscience Data Journal, though members of the project team include representatives from the life sciences (F1000Research), and will generalise the policies to other disciplines. PREPARDE addresses key issues arising in the data publication paradigm, such as: what criteria are needed for a repository to be considered objectively trustworthy; how does one peer-review a dataset; and how can datasets and journal publications be effectively cross-linked for the benefit of the wider research community and the completeness of the scientific record? To answer these questions, the project is hosting workshops addressing these issues, with interactions from key stakeholders, including data and repository managers, researchers, funders and publishers. The results of these workshops will be presented and further comment and interaction sought from interested parties.
Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value
Thompson, Paul A.
2016-01-01
Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the “p-hacking bump” just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions. The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed. PMID:26925335
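The authors' simulation was written in R; its outline is easy to reproduce. Below is a hedged Python sketch of ghost-variable p-hacking under the null, where only the smallest of several (possibly correlated) p-values is reported per 'study'; sample sizes and counts are illustrative.

```python
import numpy as np
from scipy import stats

def ghost_study(n=30, n_dvs=5, rho=0.0, rng=None):
    """One null 'study': n_dvs correlated DVs, two groups of n; the
    experimenter reports only the smallest p-value (ghost p-hacking)."""
    if rng is None:
        rng = np.random.default_rng()
    cov = np.full((n_dvs, n_dvs), rho) + (1 - rho) * np.eye(n_dvs)
    a = rng.multivariate_normal(np.zeros(n_dvs), cov, size=n)
    b = rng.multivariate_normal(np.zeros(n_dvs), cov, size=n)
    return min(stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(n_dvs))

rng = np.random.default_rng(42)
reported = [ghost_study(rho=0.0, rng=rng) for _ in range(5000)]
sig = [p for p in reported if p < .05]

# p-curve: distribution of significant p-values in .01-wide bins
hist, _ = np.histogram(sig, bins=np.arange(0, 0.051, 0.01))
print(hist)   # with rho = 0 there is no bump just below .05
```

Re-running with rho set near 0.8 shows how intercorrelation reshapes the curve, which is the dependence the abstract warns about.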
Mining the Temporal Dimension of the Information Propagation
NASA Astrophysics Data System (ADS)
Berlingerio, Michele; Coscia, Michele; Giannotti, Fosca
In the last decade, Social Network Analysis has been a field in which the effort devoted by researchers in the Data Mining area has grown very rapidly. Among the related topics, the study of information propagation in a network has attracted the interest of many researchers, also from the industrial world. However, only a few answers to the questions "How does information propagate over a network, why, and how fast?" have been found so far. On the other hand, these answers are of great interest, since they help in the tasks of finding experts in a network, assessing viral marketing strategies, and identifying fast or slow paths of information inside a collaborative network. In this paper we study the problem of finding frequent patterns in a network with the help of two different techniques: TAS (Temporally Annotated Sequences) mining, aimed at extracting sequential patterns where each transition between two events is annotated with a typical transition time that emerges from the input data, and Graph Mining, which is helpful for locally analyzing the nodes of the networks with their properties. Finally, we show preliminary results on mining information propagation over a network, obtained on two well-known email datasets, which demonstrate the power of combining these two approaches.
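A toy sketch of the TAS idea, with invented email-thread events: count frequent A-to-B transitions and annotate each with a typical transition time. This illustrates the concept only, not the actual TAS mining algorithm.

```python
from collections import defaultdict
from statistics import median

# Each thread: ordered (actor, hour) events; data invented for illustration
threads = [
    [("ann", 0), ("bob", 2), ("carol", 9)],
    [("ann", 1), ("bob", 4), ("carol", 10)],
    [("bob", 0), ("carol", 5)],
]

transitions = defaultdict(list)            # (a, b) -> observed transition times
for seq in threads:
    for (a, t1), (b, t2) in zip(seq, seq[1:]):
        transitions[(a, b)].append(t2 - t1)

min_support = 2
for (a, b), times in transitions.items():
    if len(times) >= min_support:          # frequent, time-annotated pattern
        print(f"{a} -> {b}: support={len(times)}, typical delay={median(times)}h")
```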
Temporal Delineation and Quantification of Short Term Clustered Mining Seismicity
NASA Astrophysics Data System (ADS)
Woodward, Kyle; Wesseloo, Johan; Potvin, Yves
2017-07-01
The assessment of the temporal characteristics of seismicity is fundamental to understanding and quantifying the seismic hazard associated with mining, the effectiveness of strategies and tactics used to manage seismic hazard, and the relationship between seismicity and changes to the mining environment. This article aims to improve the accuracy and precision with which the temporal dimension of seismic responses can be quantified and delineated. We present a review and discussion on the occurrence of time-dependent mining seismicity with a specific focus on temporal modelling and the modified Omori law (MOL). This forms the basis for the development of a simple weighted metric that allows for the consistent temporal delineation and quantification of a seismic response. The optimisation of this metric allows for the selection of the most appropriate modelling interval given the temporal attributes of time-dependent mining seismicity. We evaluate the performance of the weighted metric on the modelling of a synthetic seismic dataset. This assessment shows that seismic responses can be quantified and delineated by the MOL, with reasonable accuracy and precision, when the modelling is optimised by evaluating the weighted MLE metric. Furthermore, this assessment highlights that decreased weighted MLE metric performance can be expected if there is a lack of contrast between the temporal characteristics of events associated with different processes.
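For reference, the modified Omori law models the decaying event rate as lambda(t) = K/(c + t)^p. A sketch of fitting its parameters to event times by maximum likelihood follows; the article's weighted-metric machinery is not reproduced here, and the synthetic event times are an assumption.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, t, T):
    """Negative log-likelihood of the modified Omori law lambda(t) = K/(c+t)^p
    for event times t in (0, T], treated as an inhomogeneous Poisson process."""
    K, c, p = params
    if K <= 0 or c <= 0 or p <= 0:
        return np.inf
    rate_term = np.sum(np.log(K) - p * np.log(c + t))
    if abs(p - 1.0) < 1e-9:
        integral = K * (np.log(c + T) - np.log(c))
    else:
        integral = K * ((c + T) ** (1 - p) - c ** (1 - p)) / (1 - p)
    return integral - rate_term

# Crude synthetic decaying sequence standing in for a blast-induced response
rng = np.random.default_rng(3)
t = np.sort(rng.pareto(1.2, size=400))
t = t[t < 100.0]
res = minimize(neg_log_lik, x0=[50.0, 0.1, 1.1], args=(t, 100.0),
               method="Nelder-Mead")
print("K, c, p =", res.x)
```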
A novel water quality data analysis framework based on time-series data mining.
Deng, Weihui; Wang, Guoyin
2017-07-01
The rapid development of time-series data mining provides an emerging method for water resource management research. In this paper, based on the time-series data mining methodology, we propose a novel and general analysis framework for water quality time-series data. It consists of two parts: implementation components and common tasks of time-series data mining in water quality data. In the first part, we propose to granulate the time series into several two-dimensional normal clouds and calculate the similarities at the granulated level. On the basis of the similarity matrix, the similarity search, anomaly detection, and pattern discovery tasks on the water quality time-series instance dataset can be easily implemented in the second part. We present a case study of this analysis framework on weekly Dissolved Oxygen (DO) time-series data collected from five monitoring stations on the upper reaches of the Yangtze River, China. The case study revealed the relationship between water quality in the mainstream and its tributaries, as well as the main changing patterns of DO. The experimental results show that the proposed analysis framework is a feasible and efficient method to mine the hidden and valuable knowledge from historical water quality time-series data. Copyright © 2017 Elsevier Ltd. All rights reserved.
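A sketch of the granulate-then-compare idea: each window is summarised by cloud-model-style parameters (Ex, En), and a similarity matrix over granules supports anomaly detection. The summary statistics, similarity measure, and synthetic DO record below are simplifying assumptions, not the framework's exact cloud model.

```python
import numpy as np

def granulate(series, win=7):
    """Summarise each window by (Ex, En): the mean, and an entropy-like
    spread estimated from the mean absolute deviation (backward cloud style)."""
    w = series[: len(series) // win * win].reshape(-1, win)
    ex = w.mean(axis=1)
    en = np.sqrt(np.pi / 2) * np.abs(w - ex[:, None]).mean(axis=1)
    return np.c_[ex, en]

def similarity(g1, g2):
    """Similarity of two granules: inverse distance in (Ex, En) space."""
    return 1.0 / (1.0 + np.linalg.norm(g1 - g2))

rng = np.random.default_rng(7)
do_series = 8 + rng.normal(0, 0.5, 364)     # synthetic weekly DO record
g = granulate(do_series)
sim = np.array([[similarity(a, b) for b in g] for a in g])

# Anomaly detection: a granule is suspicious if it resembles nothing else
best_other = (sim - np.eye(len(g))).max(axis=1)
print("most anomalous window:", best_other.argmin())
```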
Code of Federal Regulations, 2010 CFR
2010-04-01
... oil and gas or other mining leases, to furnish scientific and technical information and advice, to ascertain and record the amount and value of production, and to determine and record rentals and royalties...
Code of Federal Regulations, 2014 CFR
2014-04-01
... oil and gas or other mining leases, to furnish scientific and technical information and advice, to ascertain and record the amount and value of production, and to determine and record rentals and royalties...
Code of Federal Regulations, 2011 CFR
2011-04-01
... oil and gas or other mining leases, to furnish scientific and technical information and advice, to ascertain and record the amount and value of production, and to determine and record rentals and royalties...
Code of Federal Regulations, 2012 CFR
2012-04-01
... oil and gas or other mining leases, to furnish scientific and technical information and advice, to ascertain and record the amount and value of production, and to determine and record rentals and royalties...
Code of Federal Regulations, 2013 CFR
2013-04-01
... oil and gas or other mining leases, to furnish scientific and technical information and advice, to ascertain and record the amount and value of production, and to determine and record rentals and royalties...
Topic modeling for cluster analysis of large biological and medical datasets
2014-01-01
Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets. PMID:25350106
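The 'highest probable topic assignment' variant can be sketched with scikit-learn's LDA, assuming the biological samples are encoded as a count matrix (documents = samples, words = discretised features); the data here are synthetic, not the study's datasets.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in for a samples x features count matrix (e.g., binned PFGE bands)
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 50))

lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topic = lda.fit_transform(counts)        # rows: P(topic | sample)

# Highest probable topic assignment: each sample joins its dominant topic
clusters = doc_topic.argmax(axis=1)
print(np.bincount(clusters))

# The other two variants reuse doc_topic differently: feature selection keeps
# the top words per topic; feature extraction clusters the doc_topic matrix.
```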
Topic modeling for cluster analysis of large biological and medical datasets.
Zhao, Weizhong; Zou, Wen; Chen, James J
2014-01-01
The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
[Scientific significance and prospective application of digitized virtual human].
Zhong, Shi-zhen
2003-03-01
As a cutting-edge research project, digitization of human anatomical information combines conventional medicine with information technology, computer technology, and virtual reality technology. Recent years have seen the establishment of, or the ongoing effort to establish, various virtual human models in many countries, on the basis of continuous sections of the human body that are digitized by means of computational medicine incorporating information technology to quantitatively simulate human physiological and pathological conditions, and to provide wide prospective applications in the fields of medicine and other disciplines. This article addresses 4 issues concerning the progress in virtual human model research, as follows: (1) Worldwide survey of sectioning and modeling of visible humans. The American Visible Human database was completed in 1994; it contains both a male and a female dataset and has found wide application internationally. South Korea also finished the data collection for a male visible Korean human dataset in 2000. (2) Application of the Visible Human Project (VHP) dataset. This dataset has yielded plentiful fruits in medical education and clinical research, and further plans have been proposed and practiced to construct a 'Physical Human' and a 'Physiological Human'. (3) Scientific significance and prospects of virtual human studies. A digitized human dataset may eventually contribute to the development of many new high-tech industries. (4) Progress of the virtual Chinese human project. The 174th session of the Xiangshan Science Conferences held in 2001 marked the initiation of the digitized virtual human project in China, and some key techniques have been explored. By now the data-collection process for 4 Chinese virtual human datasets has been successfully completed.
Sleep stages identification in patients with sleep disorder using k-means clustering
NASA Astrophysics Data System (ADS)
Fadhlullah, M. U.; Resahya, A.; Nugraha, D. F.; Yulita, I. N.
2018-05-01
Data mining is a computational intelligence discipline in which a large dataset is processed using a certain method to look for patterns within it. These patterns are then used in real-time applications or to develop new knowledge. Data mining is a valuable tool for solving complex problems, discovering new knowledge, analysing data, and making decisions. To find the patterns that lie inside a large dataset, clustering is used: clustering groups data that look similar, so that a pattern can be seen in the large dataset. Clustering itself has several algorithms for grouping the data into the corresponding clusters. This research used data from patients who suffer from sleep disorders and aims to help the medical community reduce the time required to classify the sleep stages of such patients. This study used the K-Means algorithm and silhouette evaluation, and found that 3 clusters is the optimal number for this dataset, meaning the data can be divided into 3 sleep stages.
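A sketch of the cluster-count selection step with scikit-learn, assuming a feature matrix derived from the sleep recordings; the synthetic three-group data below stand in for real polysomnography features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
# Synthetic stand-in: epochs x features (e.g., EEG band powers, EMG level)
X = np.vstack([rng.normal(m, 0.4, size=(150, 4)) for m in (0.0, 1.5, 3.0)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> optimal k:", best_k)   # expect k = 3 for three sleep stages
```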
Alanazi, Ibrahim O; AlYahya, Sami A; Ebrahimie, Esmaeil; Mohammadi-Dehcheshmeh, Manijeh
2018-06-15
The exponential growth of knowledge in scientific publications has resulted in the emergence of a new interdisciplinary science of literature mining. In text mining, the machine reads the published literature and transfers the discovered knowledge into mathematical formula-like representations. In this study, we took an integrative approach, using text mining in combination with network discovery, pathway analysis, and enrichment analysis of genomic regions for a better understanding of biomarkers in lung cancer. Particular attention was paid to non-coding biomarkers. In total, 60 microRNA biomarkers were reported for lung cancer, including some prognostic biomarkers. MIR21, MIR155, MALAT1, and MIR31 were the top non-coding RNA biomarkers of lung cancer. Text mining identified 447 proteins that have been studied as biomarkers in lung cancer. EGFR (receptor), TP53 (transcription factor), KRAS, CDKN2A, ENO2, KRT19, RASSF1, GRP (ligand), SHOX2 (transcription factor), and ERBB2 (receptor) were the most studied proteins. Within small molecules, thymosin-a1, oestrogen, and 8-OHdG have received more attention. We found some chromosomal bands, such as 7q32.2, 18q12.1, 6p12, 11p15.5, and 3p21.3, that are highly involved in deriving lung cancer biomarkers. Copyright © 2018 Elsevier B.V. All rights reserved.
Mata, Cristina; Malo, Juan E; Galaz, José Luis; Cadorzo, César; Lagunas, Héctor
2016-07-01
Resource extraction projects generate a diversity of negative effects on the environment that are difficult to predict and mitigate. Consequently, adaptive management approaches have been advocated to develop effective responses to impacts that were not predicted. Mammal populations living in or around mine sites are frequently of management concern; yet, there is a dearth of published information on how to minimise the negative effects of different phases of mining operations on them. Here, we present the case study of a copper mine in the Chilean Altiplano, which caused roadkills of the protected vicuña (Vicugna vicugna). This issue led to a three-step solution being implemented: (1) the initial identification of the problem and implementation of an emergency response, (2) the scientific analysis for decision making and (3) the planning and informed implementation of responses for different future scenarios and timescales. The measures taken under each of these steps provide examples of environmental management approaches that make use of scientific information to develop integrated management responses. In brief, our case study showed how (1) the timescale and the necessity/urgency of the case were addressed, (2) the various stakeholders involved were taken into account and (3) changes were included into the physical, human and organisational elements of the company to achieve the stated objectives.
Decibel: The Relational Dataset Branching System
Maddox, Michael; Goehring, David; Elmore, Aaron J.; Madden, Samuel; Parameswaran, Aditya; Deshpande, Amol
2017-01-01
As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in wasted storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs. PMID:28149668
ERIC Educational Resources Information Center
Vázquez-Alonso, Ángel; Aponte, Abdiel; Manassero-Mas, María-Antonia; Montesano, Marisa
2016-01-01
This study examines the effectiveness of a teaching-learning sequence (TLS) to improve the understanding of the influences and interactions between a technology (mining) and society. The aim of the study is also to show the possibility of both teaching and assessing the most innovative issues and aspects of scientific competence and their impact…
Xing, Jie; Zang, Meitong; Zhang, Haiying; Zhu, Mingshe
2015-10-15
Patients are usually exposed to multiple drugs, and metabolite profiling of each drug in complex biological matrices is a big challenge. This study presented a new application of improved high-resolution mass spectrometry (HRMS)-based data-mining tools in tandem for fast and comprehensive metabolite identification of combination drugs in humans. The model drug combination was metronidazole-pantoprazole-clarithromycin (MET-PAN-CLAR), which is widely used in the clinic to treat ulcers caused by Helicobacter pylori. First, mass defect filter (MDF), as a targeted data processing tool, was able to recover all relevant metabolites of MET-PAN-CLAR in human plasma and urine from the full-scan MS dataset when appropriate MDF templates for each drug were defined. Second, accurate mass-based background subtraction (BS), as an untargeted data-mining tool, worked effectively except for several trace metabolites, which were buried in the remaining background signals. Third, an integrated strategy, i.e., untargeted BS followed by improved MDF, was effective for metabolite identification of MET-PAN-CLAR. Most metabolites except for trace ones were found in the first step on the BS-processed datasets, and the results led to the setup of appropriate metabolite MDF templates for the subsequent MDF data processing. Trace metabolites were further recovered by MDF, which used both common MDF templates and the novel metabolite-based MDF templates. As a result, a total of 44 metabolites or related components were found for MET-PAN-CLAR in human plasma and urine using the integrated strategy. New metabolic pathways such as N-glucuronidation of PAN and dehydrogenation of CLAR were found. This study demonstrated that the combination of accurate mass-based multiple data-mining techniques in tandem, i.e., untargeted background subtraction followed by targeted mass defect filtering, can be a valuable tool for rapid metabolite profiling of combination drugs in vivo. Copyright © 2015 Elsevier B.V. All rights reserved.
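The MDF idea itself is compact: keep only peaks whose mass defect falls within a tolerance of the defect expected from a drug template. The window, mass range, and peak list below are illustrative assumptions, not the paper's settings.

```python
def mass_defect(mz):
    """Fractional part of the measured (singly charged) mass."""
    return mz - round(mz)

def mdf(peaks, template_mz, window=0.040, mass_range=50.0):
    """Keep peaks near a template's nominal mass whose mass defect lies
    within +/- window of the template's defect (classic drug MDF)."""
    t_defect = mass_defect(template_mz)
    return [mz for mz in peaks
            if abs(mz - template_mz) <= mass_range
            and abs(mass_defect(mz) - t_defect) <= window]

# Illustrative full-scan peak list; 172.0717 is metronidazole's [M+H]+ region
peaks = [172.0717, 188.0666, 230.1350, 171.9000, 350.5000]
print(mdf(peaks, template_mz=172.0717))   # keeps the parent and the +16 peak
```

Background ions such as the 171.9000 peak have mass defects far from the drug's, which is why the filter strips them even when their nominal mass is close.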
Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization
Wei, Chih-Hsuan; Hakala, Kai; Pyysalo, Sampo; Ananiadou, Sophia; Kao, Hung-Yu; Lu, Zhiyong; Salakoski, Tapio; Van de Peer, Yves; Ginter, Filip
2013-01-01
Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license. PMID:23613707
Jayashree, B; Hanspal, Manindra S; Srinivasan, Rajgopal; Vigneshwaran, R; Varshney, Rajeev K; Spurthi, N; Eshwar, K; Ramesh, N; Chandra, S; Hoisington, David A
2007-01-01
The large amounts of EST sequence data available from a single species of an organism, as well as for several species within a genus, provide an easy source for the identification of intra- and interspecies single nucleotide polymorphisms (SNPs). In the case of model organisms, the data available are numerous, given the degree of redundancy in the deposited EST data. There are several available bioinformatics tools that can be used to mine this data; however, using them requires a certain level of expertise: the tools have to be used sequentially, with accompanying format conversions, and steps like clustering and assembly of sequences become time-intensive jobs even for moderately sized datasets. We report here a pipeline of open source software, extended to run on multiple CPU architectures, that can be used to mine large EST datasets for SNPs and identify restriction sites for assaying the SNPs, so that cost-effective CAPS assays can be developed for SNP genotyping in genetics and breeding applications. At the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), the pipeline has been implemented to run on a Paracel high-performance system consisting of four dual AMD Opteron processors running Linux with MPICH. The pipeline can be accessed through user-friendly web interfaces at http://hpc.icrisat.cgiar.org/PBSWeb and is available on request for academic use. We have validated the developed pipeline by mining chickpea ESTs for interspecies SNPs, development of CAPS assays for SNP genotyping, and confirmation of the restriction digestion pattern at the sequence level.
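The CAPS step reduces to asking whether a restriction enzyme recognition site is present for exactly one SNP allele, so that digestion distinguishes the genotypes. A minimal sketch with two common enzymes (the pipeline's actual enzyme list and flank handling are assumptions here):

```python
ENZYMES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT"}   # recognition sites

def caps_candidate(flank5, allele_a, allele_b, flank3, enzymes=ENZYMES):
    """Return enzymes whose site is present for exactly one SNP allele,
    i.e. a digestion pattern that can serve as a CAPS assay."""
    hits = {}
    for name, site in enzymes.items():
        in_a = site in (flank5 + allele_a + flank3)
        in_b = site in (flank5 + allele_b + flank3)
        if in_a != in_b:
            hits[name] = "cuts allele " + (allele_a if in_a else allele_b)
    return hits

# A/G SNP that completes an EcoRI site only on the A allele
print(caps_candidate("TTGA", "A", "G", "TTCAA"))   # {'EcoRI': 'cuts allele A'}
```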
DiMeX: A Text Mining System for Mutation-Disease Association Extraction.
Mahmood, A S M Ashique; Wu, Tsung-Jung; Mazumder, Raja; Vijay-Shanker, K
2016-01-01
The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases.
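DiMeX's pattern component is not spelled out in the abstract; as a hedged illustration only, two simple regular expressions catch common protein- and DNA-level mutation nomenclature (the real system is considerably more sophisticated and also normalizes mentions to genes):

```python
import re

# Illustrative HGVS-like patterns; not DiMeX's actual extraction rules
PROTEIN = re.compile(r"\b(?:p\.)?([ARNDCEQGHILKMFPSTWYV])(\d+)([ARNDCEQGHILKMFPSTWYV])\b")
DNA     = re.compile(r"\bc\.(\d+)([ACGT])>([ACGT])\b")

text = ("The BRAF V600E substitution and the CFTR variant c.1521G>T were "
        "both associated with disease in the cohort.")

print(PROTEIN.findall(text))   # [('V', '600', 'E')]
print(DNA.findall(text))       # [('1521', 'G', 'T')]
```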
DiMeX: A Text Mining System for Mutation-Disease Association Extraction
Mahmood, A. S. M. Ashique; Wu, Tsung-Jung; Mazumder, Raja; Vijay-Shanker, K.
2016-01-01
The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases. PMID:27073839
Interim Scientific Report: AFOSR-81-0122.
1983-05-05
Maximum likelihood. 2 Periton Lane, Minehead, TA24 8AQ, England. Attachment 5
A Comprehensive Framework for Use of NEXRAD Data in Hydrometeorology and Hydrology
NASA Astrophysics Data System (ADS)
Krajewski, W. F.; Bradley, A.; Kruger, A.; Lawrence, R. E.; Smith, J. A.; Steiner, M.; Ramamurthy, M. K.; del Greco, S. A.
2004-12-01
The overall objective of this project is to provide the broad science and engineering communities with ready access to the vast archives and real-time information collected by the national network of NEXRAD weather radars. The main focus is on radar-rainfall data for use in hydrology, hydrometeorology, and water resources. Currently, the NEXRAD data, which are archived at NOAA's National Climatic Data Center (NCDC), are converted to operational products and used by forecasters in real time. The scientific use of the full resolution NEXRAD information is presently limited because current methods of accessing this data require considerable expertise in weather radars, data quality control, formatting and handling, and radar-rainfall algorithms. The goal is to provide professionals in the scientific, engineering, education, and public policy sectors with on-demand NEXRAD data and custom products that are at high spatial and temporal resolutions. Furthermore, the data and custom products will be of a quality suitable for scientific discovery in hydrology and hydrometeorology and in data formats that are convenient to a wide spectrum of users. We are developing a framework and a set of tools for access, visualization, management, rainfall estimation algorithms, and scientific analysis of full resolution NEXRAD data. The framework will address the issues of data dissemination, format conversions and compression, management of terabyte-sized datasets, rapid browsing and visualization, metadata selection and calculation, relational and XML databases, integration with geographic information systems, data queries and knowledge mining, and Web Services. The tools will perform instantaneous comprehensive quality control and radar-rainfall estimation using a variety of algorithms. The algorithms that the user can select will range from "quick look" to complex, and computing-intensive and will include operational algorithms used by federal agencies as well as research grade experimental methods. Options available to the user will include user-specified spatial and temporal resolution, ancillary products such as storm advection velocity fields, estimation of uncertainty associated with rainfall maps, and mathematical synthesis of the products. The data and the developed tools will be provided to the community via the services and the infrastructure of Unidata and the NCDC.
Highly comparative time-series analysis: the empirical structure of time series and their methods.
Fulcher, Ben D; Little, Max A; Jones, Nick S
2013-06-06
The process of collecting and organizing sets of observations represents a common theme throughout the history of science. However, despite the ubiquity of scientists measuring, recording and analysing the dynamics of different processes, an extensive organization of scientific time-series data and analysis methods has never been performed. Addressing this, annotated collections of over 35 000 real-world and model-generated time series, and over 9000 time-series analysis algorithms are analysed in this work. We introduce reduced representations of both time series, in terms of their properties measured by diverse scientific methods, and of time-series analysis methods, in terms of their behaviour on empirical time series, and use them to organize these interdisciplinary resources. This new approach to comparing across diverse scientific data and methods allows us to organize time-series datasets automatically according to their properties, retrieve alternatives to particular analysis methods developed in other scientific disciplines and automate the selection of useful methods for time-series classification and regression tasks. The broad scientific utility of these tools is demonstrated on datasets of electroencephalograms, self-affine time series, heartbeat intervals, speech signals and others, in each case contributing novel analysis techniques to the existing literature. Highly comparative techniques that compare across an interdisciplinary literature can thus be used to guide more focused research in time-series analysis for applications across the scientific disciplines.
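A toy version of the feature-based representation: summarise each series by a small vector of measured properties, after which any standard clustering or classification method can operate across heterogeneous time-series collections. The three features below are illustrative stand-ins for the thousands used in the study.

```python
import numpy as np

def feature_vector(x):
    """A tiny stand-in for large-scale feature extraction: each series is
    represented by a handful of its measured properties."""
    x = (x - x.mean()) / (x.std() + 1e-12)
    lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]        # autocorrelation at lag 1
    above = (x > 0).mean()                         # fraction of time above mean
    rough = np.abs(np.diff(x)).mean()              # roughness of the signal
    return np.array([lag1, above, rough])

rng = np.random.default_rng(9)
noise = rng.normal(size=500)
walk = np.cumsum(rng.normal(size=500))
print(feature_vector(noise))   # low lag-1 autocorrelation
print(feature_vector(walk))    # lag-1 autocorrelation near 1
```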
Highly comparative time-series analysis: the empirical structure of time series and their methods
Fulcher, Ben D.; Little, Max A.; Jones, Nick S.
2013-01-01
The process of collecting and organizing sets of observations represents a common theme throughout the history of science. However, despite the ubiquity of scientists measuring, recording and analysing the dynamics of different processes, an extensive organization of scientific time-series data and analysis methods has never been performed. Addressing this, annotated collections of over 35 000 real-world and model-generated time series, and over 9000 time-series analysis algorithms are analysed in this work. We introduce reduced representations of both time series, in terms of their properties measured by diverse scientific methods, and of time-series analysis methods, in terms of their behaviour on empirical time series, and use them to organize these interdisciplinary resources. This new approach to comparing across diverse scientific data and methods allows us to organize time-series datasets automatically according to their properties, retrieve alternatives to particular analysis methods developed in other scientific disciplines and automate the selection of useful methods for time-series classification and regression tasks. The broad scientific utility of these tools is demonstrated on datasets of electroencephalograms, self-affine time series, heartbeat intervals, speech signals and others, in each case contributing novel analysis techniques to the existing literature. Highly comparative techniques that compare across an interdisciplinary literature can thus be used to guide more focused research in time-series analysis for applications across the scientific disciplines. PMID:23554344
Sustainable Remediation of Legacy Mine Drainage: A Case Study of the Flight 93 National Memorial.
Emili, Lisa A; Pizarchik, Joseph; Mahan, Carolyn G
2016-03-01
Pollution from mining activities is a global environmental concern, not limited to areas of current resource extraction, but including a broader geographic area of historic (legacy) and abandoned mines. The pollution of surface waters from acid mine drainage is a persistent problem and requires a holistic and sustainable approach to addressing the spatial and temporal complexity of mining-specific problems. In this paper, we focus on the environmental, socio-economic, and legal challenges associated with the concurrent activities to remediate a coal mine site and to develop a national memorial following a catastrophic event. We provide a conceptual construct of a socio-ecological system defined at several spatial, temporal, and organizational scales and a critical synthesis of the technical and social learning processes necessary to achieving sustainable environmental remediation. Our case study is an example of a multi-disciplinary management approach, whereby collaborative interaction of stakeholders, the emergence of functional linkages for information exchange, and mediation led to scientifically informed decision making, creative management solutions, and ultimately environmental policy change.
A report on the medieval mining and ore processing complex: Zilan valley, Van, Turkey.
NASA Astrophysics Data System (ADS)
Ateş, Yusuf; Kılıç, Sinan
The literature records the use of obsidian, showing that a knowledge base on raw material resources around Lake Van existed from very ancient times. Against this background, very little information about the precise locations of historical mining activities in the region can be obtained from the literature today. An ancient mining and processing complex, located northwest of the city of Van (Turkey), was discovered by chance in 2007. The purpose of this article is to describe this historical mining area. The site contains mining structures such as shafts and galleries, and heaps of stone chips indicating that some ore enrichment activities took place there. XRD and chemical analyses show that the samples taken from the ore vein are rich in manganese (Mn) and barium (Ba), and it is concluded that the Zilan Valley mining and processing complex was worked for pyrolusite (MnO2), barium, or both. The site is described here for the first time in the literature and fills a gap in the record of regional mining history. The discovery and description of the site also have implications for a wide multidisciplinary scientific community, including metallurgy, archeology, and world heritage.
Advances in Machine Learning and Data Mining for Astronomy
NASA Astrophysics Data System (ADS)
Way, Michael J.; Scargle, Jeffrey D.; Ali, Kamal M.; Srivastava, Ashok N.
2012-03-01
Advances in Machine Learning and Data Mining for Astronomy documents numerous successful collaborations among computer scientists, statisticians, and astronomers who illustrate the application of state-of-the-art machine learning and data mining techniques in astronomy. Due to the massive amount and complexity of data in most scientific disciplines, the material discussed in this text transcends traditional boundaries between various areas in the sciences and computer science. The book's introductory part provides context to issues in the astronomical sciences that are also important to health, social, and physical sciences, particularly probabilistic and statistical aspects of classification and cluster analysis. The next part describes a number of astrophysics case studies that leverage a range of machine learning and data mining technologies. In the last part, developers of algorithms and practitioners of machine learning and data mining show how these tools and techniques are used in astronomical applications. With contributions from leading astronomers and computer scientists, this book is a practical guide to many of the most important developments in machine learning, data mining, and statistics. It explores how these advances can solve current and future problems in astronomy and looks at how they could lead to the creation of entirely new algorithms within the data mining community.
Optimizing data collection for public health decisions: a data mining approach
2014-01-01
Background Collecting data can be cumbersome and expensive. Lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test whether the careful removal of items from two community nutrition surveys, guided by a data mining technique called feature selection, can (a) identify a reduced dataset, while (b) not damaging the signal inside that data. Methods The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed on 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight items in the reduced datasets. Weighted item values were summed with the error term to compute reduced item survey scores. Scores produced by the full survey were compared to the reduced item scores using a Wilcoxon rank-sum test. Results Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R2 values of 92% and 94% for restaurant and grocery store data, respectively. Conclusions While there are many potentially important variables in any domain, the most useful set may only be a small subset. The use of feature selection in the initial phase of data collection to identify the most influential variables may be a useful tool to greatly reduce the amount of data needed, thereby reducing cost. PMID:24919484
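A sketch of the reduce-then-reweight approach with scikit-learn, using synthetic stand-ins for the survey items; SelectKBest plays the role of the feature selection step here, which is an assumption rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
items = rng.integers(0, 4, size=(885, 40)).astype(float)  # 40 synthetic items
full_score = items @ rng.uniform(0, 2, size=40)           # full-survey score

# (a) identify a reduced item set
selector = SelectKBest(f_regression, k=9).fit(items, full_score)
reduced = selector.transform(items)

# (b) reweight the surviving items against the full score
model = LinearRegression().fit(reduced, full_score)
print("R^2 of 9-item proxy:", round(model.score(reduced, full_score), 3))
```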
Yager, Douglas B.; Johnson, Raymond H.; Rockwell, Barnaby W.; Caine, Jonathan S.; Smith, Kathleen S.
2013-01-01
Hydrothermally altered bedrock in the Silverton mining area, southwest Colorado, USA, contains sulfide minerals that weather to produce acidic and metal-rich leachate that is toxic to aquatic life. This study utilized a geographic information system (GIS) and statistical approach to identify watershed-scale geologic variables in the Silverton area that influence water quality. GIS analysis of mineral maps produced using remote sensing datasets including Landsat Thematic Mapper, advanced spaceborne thermal emission and reflection radiometer, and a hybrid airborne visible infrared imaging spectrometer and field-based product enabled areas of alteration to be quantified. Correlations between water quality signatures determined at watershed outlets, and alteration types intersecting both total watershed areas and GIS-buffered areas along streams were tested using linear regression analysis. Despite remote sensing datasets having varying watershed area coverage due to vegetation cover and differing mineral mapping capabilities, each dataset was useful for delineating acid-generating bedrock. Areas of quartz–sericite–pyrite mapped by AVIRIS have the highest correlations with acidic surface water and elevated iron and aluminum concentrations. Alkalinity was only correlated with area of acid neutralizing, propylitically altered bedrock containing calcite and chlorite mapped by AVIRIS. Total watershed area of acid-generating bedrock is more significantly correlated with acidic and metal-rich surface water when compared with acid-generating bedrock intersected by GIS-buffered areas along streams. This methodology could be useful in assessing the possible effects that alteration type area has in either generating or neutralizing acidity in unmined watersheds and in areas where new mining is planned.
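The statistical core is a per-watershed regression of mapped alteration area against outlet chemistry. A minimal sketch with invented numbers follows; the study's actual GIS zonal statistics and remote-sensing-derived areas are not reproduced.

```python
import numpy as np

# Fraction of each watershed mapped as quartz-sericite-pyrite (QSP) and the
# pH measured at its outlet (values invented for illustration)
qsp_fraction = np.array([0.02, 0.10, 0.18, 0.25, 0.33, 0.41])
outlet_ph    = np.array([7.1,  6.2,  5.0,  4.4,  3.9,  3.5])

r = np.corrcoef(qsp_fraction, outlet_ph)[0, 1]
slope, intercept = np.polyfit(qsp_fraction, outlet_ph, 1)
print(f"r = {r:.2f}; pH ~ {slope:.1f} * fraction + {intercept:.1f}")
```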
Optimizing data collection for public health decisions: a data mining approach.
Partington, Susan N; Papakroni, Vasil; Menzies, Tim
2014-06-12
Collecting data can be cumbersome and expensive. Lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test whether the careful removal of items from two community nutrition surveys, guided by a data mining technique called feature selection, can (a) identify a reduced dataset, while (b) not damaging the signal inside that data. The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed on 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight items in the reduced datasets. Weighted item values were summed with the error term to compute reduced item survey scores. Scores produced by the full survey were compared to the reduced item scores using a Wilcoxon rank-sum test. Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R2 values of 92% and 94% for restaurant and grocery store data, respectively. While there are many potentially important variables in any domain, the most useful set may only be a small subset. The use of feature selection in the initial phase of data collection to identify the most influential variables may be a useful tool to greatly reduce the amount of data needed, thereby reducing cost.
LSST Astroinformatics And Astrostatistics: Data-oriented Astronomical Research
NASA Astrophysics Data System (ADS)
Borne, Kirk D.; Stassun, K.; Brunner, R. J.; Djorgovski, S. G.; Graham, M.; Hakkila, J.; Mahabal, A.; Paegert, M.; Pesenson, M.; Ptak, A.; Scargle, J.; Informatics, LSST; Statistics Team
2011-01-01
The LSST Informatics and Statistics Science Collaboration (ISSC) focuses on research and scientific discovery challenges posed by the very large and complex data collection that LSST will generate. Application areas include astroinformatics, machine learning, data mining, astrostatistics, visualization, scientific data semantics, time series analysis, and advanced signal processing. Research problems to be addressed with these methodologies include transient event characterization and classification, rare class discovery, correlation mining, outlier/anomaly/surprise detection, improved estimators (e.g., for photometric redshift or early onset supernova classification), exploration of highly dimensional (multivariate) data catalogs, and more. We present sample science results from these data-oriented approaches to large-data astronomical research, including results from LSST ISSC team members: the EB (Eclipsing Binary) Factory, environmental variations in the fundamental plane of elliptical galaxies, and outlier detection in multivariate catalogs.
2000-01-01
INTRODUCTION The joint U.S. Department of the Interior and U.S. Department of Agriculture Abandoned Mine Lands Initiative (AMLI) was developed as a collaborative effort between the Federal land management agencies (FLMA, that is the U.S. Bureau of Land Management and the U.S. Forest Service) and the U.S. Geological Survey (USGS) in 1996. The stated goal of the AML Initiative was to develop a strategy for gathering and communicating the scientific information needed to develop effective and cost-efficient remediation of abandoned mines within the framework of a watershed. Four primary objectives of the AMLI are to: 1. Provide the scientific information needed (in the short-term) by the FLMAs to make decisions related to the design and implementation of cleanup actions, 2. Develop a multi-disciplined, multi-division approach that integrates geologic, hydrologic, geochemical and ecological information into a knowledge base for sound decision making, 3. Transfer technologies developed within the scientific programs of the USGS to the field and demonstrate their suitability to solve real, practical problems, and 4. Establish working relationships among involved members of land management and regulatory agencies within the framework of a watershed approach to the cleanup of abandoned mines. Long-term process-based research, including development of analytical tools, is recognized as being critical to the long-term success in remediating watersheds impacted by historical mining activities (AML 5-year plan, http://amli.usgs.gov/amli). In a meeting of Federal agencies (U.S. Bureau of Land Management [BLM], U.S. Bureau of Reclamation [BOR], U.S. National Park Service [NPS], U.S. Forest Service [USFS], the U.S. Environmental Protection Agency [EPA], the U.S. Fish and Wildlife Service [F&WS]) and State agencies (Colorado Division of Public Health and Environment, Colorado Division of Mines and Geology), several watersheds within the state whose water quality was presumed to be impacted by historical mining activities were examined. The Animas River watershed (fig. 1) was selected by the State and Federal agencies as one of two watersheds in the U.S. to be studied in detail by the USGS in the AML Initiative. Beginning in October 1997, each of the four Divisions of the USGS (Water Resources, Geologic, Biological Resources, and National Mapping) initiated a collaborative integrated science study of the watershed. Funds were provided from USGS base funding to each of the four Divisions in response to the priorities set by Congressional action and within the flexibility provided by the budgetary framework funding individual research programs. The AML Initiative provides for a five-year focused scientific effort in the two watersheds, with a final synthesis of the scientific results from each to be published in 2001. Publications are released on the AML web site on a regular basis (http://amli.usgs.gov/amli). On March 29, 2000, the USGS hosted a meeting for the BLM and USFS to discuss remediation options that were under consideration for the summer of 2000. The purpose of this report is to provide an overview of the scientific rationale provided by the USGS to meet objective one above, and to summarize our preliminary interpretations of our data. Additional information from sites on private lands has been collected by the State of Colorado, EPA, and the ARSG. Unfortunately, these data have not been fully supplied to the USGS, so our conclusions are based only upon our data.
These interpretations provide science-based constraints on possible remediation options to be considered by the FLMAs, the State, and local property owners in the Animas River watershed. The report is presented in outline format to facilitate discussion of remediation options at the March 29, 2000 meeting. Not all historical mining sites within the watershed are on public lands. This should not be construed as a final report of the USGS.
[Soil seed bank research of China mining areas: necessity and challenges].
Chang, Qing; Zhang, Da-Wei; Li, Xue; Peng, Jian; Guan, Ai-Nong; Liu, Xiao-Si
2011-05-01
The soil seed bank consists of all living seeds existing in the soil and its surface litter, especially in topsoil, and can reflect the characteristics of regional biodiversity. As the basis of vegetation restoration and a potential greening material, topsoil and its seed bank are limited and non-renewable resources in mining areas. The study of the soil seed bank has become one of the hotspots in the research field of vegetation restoration and land reclamation in China's mining areas. Owing to the special characteristics of the mining industry, soil seed bank studies in mining areas should not only be concerned with seed species, quantities, and their relations with ground surface vegetation, but should also make use of the research results on the soil seed banks of other fragile habitats. Besides, a breakthrough should be sought in ways of thinking and in research approaches. This paper analyzed the particularity of soil seed bank research in mining areas, summarized the research progress on the soil seed banks of mining areas and other fragile habitats, and put forward the challenges we are facing. It is expected that this paper can help to reinforce soil seed bank research in China's mining areas, and provide scientific guidelines for taking advantage of the significant roles of the soil seed bank in land reclamation and vegetation restoration in the future.
Realistic computer network simulation for network intrusion detection dataset generation
NASA Astrophysics Data System (ADS)
Payer, Garrett
2015-05-01
The KDD-99 Cup dataset is dead. While it can continue to be used as a toy example, the age of this dataset makes it all but useless for intrusion detection research and data mining. Many of the attacks used within the dataset are obsolete and do not reflect the features important for intrusion detection in today's networks. Creating a new dataset encompassing a large cross section of the attacks found on the Internet today could be useful, but would eventually fall to the same problem as the KDD-99 Cup; its usefulness would diminish after a period of time. To continue research into intrusion detection, the generation of new datasets needs to be as dynamic and as quick as the attacker. Simply examining existing network traffic and using domain experts such as intrusion analysts to label traffic is inefficient, expensive, and not scalable. The only viable methodology is simulation using technologies including virtualization, attack-toolsets such as Metasploit and Armitage, and sophisticated emulation of threat and user behavior. Simulating actual user behavior and network intrusion events dynamically not only allows researchers to vary scenarios quickly, but enables online testing of intrusion detection mechanisms by interacting with data as it is generated. As new threat behaviors are identified, they can be added to the simulation to make quicker determinations as to the effectiveness of existing and ongoing network intrusion technology, methodology and models.
Biological responses to disturbance from simulated deep-sea polymetallic nodule mining.
Jones, Daniel O B; Kaiser, Stefanie; Sweetman, Andrew K; Smith, Craig R; Menot, Lenaick; Vink, Annemiek; Trueblood, Dwight; Greinert, Jens; Billett, David S M; Arbizu, Pedro Martinez; Radziejewska, Teresa; Singh, Ravail; Ingole, Baban; Stratmann, Tanja; Simon-Lledó, Erik; Durden, Jennifer M; Clark, Malcolm R
2017-01-01
Commercial-scale mining for polymetallic nodules could have a major impact on the deep-sea environment, but the effects of these mining activities on deep-sea ecosystems are very poorly known. The first commercial test mining for polymetallic nodules was carried out in 1970. Since then a number of small-scale commercial test mining or scientific disturbance studies have been carried out. Here we evaluate changes in faunal densities and diversity of benthic communities measured in response to these 11 simulated or test nodule mining disturbances using meta-analysis techniques. We find that impacts are often severe immediately after mining, with major negative changes in density and diversity of most groups occurring. However, in some cases, the mobile fauna and small-sized fauna experienced less negative impacts over the longer term. At seven sites in the Pacific, multiple surveys assessed recovery in fauna over periods of up to 26 years. Almost all studies show some recovery in faunal density and diversity for meiofauna and mobile megafauna, often within one year. However, very few faunal groups return to baseline or control conditions after two decades. The effects of polymetallic nodule mining are likely to be long term. Our analyses show considerable negative biological effects of seafloor nodule mining, even at the small scale of test mining experiments, although there is variation in sensitivity amongst organisms of different sizes and functional groups, which have important implications for ecosystem responses. Unfortunately, many past studies have limitations that reduce their effectiveness in determining responses. We provide recommendations to improve future mining impact test studies. Further research to assess the effects of test-mining activities will inform ways to improve mining practices and guide effective environmental management of mining activities.
Biological responses to disturbance from simulated deep-sea polymetallic nodule mining
Kaiser, Stefanie; Sweetman, Andrew K.; Smith, Craig R.; Menot, Lenaick; Vink, Annemiek; Trueblood, Dwight; Greinert, Jens; Billett, David S. M.; Arbizu, Pedro Martinez; Radziejewska, Teresa; Singh, Ravail; Ingole, Baban; Stratmann, Tanja; Simon-Lledó, Erik; Durden, Jennifer M.; Clark, Malcolm R.
2017-01-01
Commercial-scale mining for polymetallic nodules could have a major impact on the deep-sea environment, but the effects of these mining activities on deep-sea ecosystems are very poorly known. The first commercial test mining for polymetallic nodules was carried out in 1970. Since then a number of small-scale commercial test mining or scientific disturbance studies have been carried out. Here we evaluate changes in faunal densities and diversity of benthic communities measured in response to these 11 simulated or test nodule mining disturbances using meta-analysis techniques. We find that impacts are often severe immediately after mining, with major negative changes in density and diversity of most groups occurring. However, in some cases, the mobile fauna and small-sized fauna experienced less negative impacts over the longer term. At seven sites in the Pacific, multiple surveys assessed recovery in fauna over periods of up to 26 years. Almost all studies show some recovery in faunal density and diversity for meiofauna and mobile megafauna, often within one year. However, very few faunal groups return to baseline or control conditions after two decades. The effects of polymetallic nodule mining are likely to be long term. Our analyses show considerable negative biological effects of seafloor nodule mining, even at the small scale of test mining experiments, although there is variation in sensitivity amongst organisms of different sizes and functional groups, which have important implications for ecosystem responses. Unfortunately, many past studies have limitations that reduce their effectiveness in determining responses. We provide recommendations to improve future mining impact test studies. Further research to assess the effects of test-mining activities will inform ways to improve mining practices and guide effective environmental management of mining activities. PMID:28178346
Association Rule Based Feature Extraction for Character Recognition
NASA Astrophysics Data System (ADS)
Dua, Sumeet; Singh, Harpreet
Association rules that represent isomorphisms among data have gained importance in exploratory data analysis because they can find inherent, implicit, and interesting relationships among data. They are also commonly used in data mining to extract the conditions among attribute values that occur together frequently in a dataset [1]. These rules have a wide range of applications, notably in the financial and retail sectors, marketing, sales, and medicine.
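To make the frequent-itemset step that underlies association rule mining concrete, the Python sketch below hand-rolls a minimal Apriori-style level-wise search. The toy character-feature transactions and the min_support value are invented for illustration and are not data from the paper.

    def frequent_itemsets(transactions, min_support):
        """Level-wise (Apriori-style) search for frequent itemsets."""
        n = len(transactions)
        items = {i for t in transactions for i in t}
        frequent = {}
        k, candidates = 1, [frozenset([i]) for i in items]
        while candidates:
            # Count support of each candidate as the fraction of transactions containing it.
            support = {c: sum(c <= t for t in transactions) / n for c in candidates}
            level = {c: s for c, s in support.items() if s >= min_support}
            frequent.update(level)
            k += 1
            # Join step: unions of frequent itemsets that yield size-k candidates.
            candidates = list({a | b for a in level for b in level if len(a | b) == k})
        return frequent

    # Toy transactions of co-occurring character features (hypothetical data).
    txns = [frozenset(t) for t in (["loop", "tail"], ["loop", "bar"], ["loop", "tail", "bar"])]
    for itemset, s in frequent_itemsets(txns, min_support=0.6).items():
        print(sorted(itemset), round(s, 2))

From the frequent itemsets, rules such as "tail implies loop" would then be scored by confidence; that step is omitted here for brevity.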
The data set contains the raw data used to develop the figures and tables associated with the published manuscriptThis dataset is associated with the following publication:Eckley , C., T. Luxton , J. McKernan , J. Goetz , and J. Goulet. Influence of Reservoir Water-Level Fluctuations on Mercury Methylation Downstream of the Historic Black Butte Mercury Mine, OR. Michael Kersten APPLIED GEOCHEMISTRY. Elsevier Science Ltd, New York, NY, USA, 61: 284-293, (2015).
NASA Astrophysics Data System (ADS)
Klump, J. F.; Ulbricht, D.; Conze, R.
2014-12-01
The Continental Deep Drilling Programme (KTB) was a scientific drilling project from 1987 to 1995 near Windischeschenbach, Bavaria. The main super-deep borehole reached a depth of 9,101 meters into the Earth's continental crust. The project used the most current equipment for data capture and processing. After the end of the project, key data were disseminated through the web portal of the International Continental Scientific Drilling Program (ICDP). The scientific reports were published as printed volumes. As similar projects have also experienced, it becomes increasingly difficult to maintain a data portal over a long time. Changes in software and underlying hardware make a migration of the entire system inevitable. Around 2009, the data presented on the ICDP web portal were migrated to the Scientific Drilling Database (SDDB) and published through DataCite using Digital Object Identifiers (DOI) as persistent identifiers. The SDDB portal used a relational database with a complex data model to store data and metadata. A PHP-based Content Management System with custom modifications made it possible to navigate and browse datasets using the metadata and then download datasets. The data repository software eSciDoc allows storing self-contained packages consistent with the OAIS reference model. Each package consists of binary data files and XML metadata. Using a REST API, the packages can be stored in the eSciDoc repository and can be searched using the XML metadata. During the last maintenance cycle of the SDDB, the data and metadata were migrated into the eSciDoc repository. Discovery metadata was generated following the GCMD-DIF, ISO 19115 and DataCite schemas. The eSciDoc repository can store an arbitrary number of XML metadata records with each data object. In addition to descriptive metadata, each data object may contain pointers to related materials, such as IGSN metadata to link datasets to physical specimens, or identifiers of literature interpreting the data. Datasets are presented by XSLT stylesheet transformation using the stored metadata. The presentation shows several migration cycles of data and metadata, which were driven by aging software systems. Currently, the datasets reside as self-contained entities in a repository system that is ready for digital preservation.
EPA RE-Powering Mapper Completed Installations
The U.S. Environmental Protection Agency (EPA) Office of Land and Emergency Management (OLEM) Office of Communications, Partnerships and Analysis (OCPA) initiated the RE-Powering America's Land Initiative to demonstrate the enormous potential that contaminated lands, landfills, and mine sites provide for developing renewable energy in the United States. Using publicly available information, RE-Powering maintains a list of completed renewable energy installations on contaminated sites and landfills. To date, the RE-Powering Initiative has identified 179 renewable energy installations on 171 contaminated lands, landfills, and mine sites, with a cumulative installed capacity of just over 1,124 megawatts (MW) and consistent growth in total installations since the inception of the RE-Powering Initiative. This dataset is current as of April 2016.
The tragedy of the biodiversity data commons: a data impediment creeping nigher?
Galicia, David; Ariño, Arturo H
2018-01-01
Researchers are embracing the open access movement to facilitate unrestricted availability of scientific results. One sign of this willingness is the steady increase in data freely shared online, which has prompted a corresponding increase in the number of papers using such data. Publishing datasets is a time-consuming process that is often seen as a courtesy, rather than a necessary step in the research process. Making data accessible allows further research, provides basic information for decision-making and contributes to transparency in science. Nevertheless, the ease of access to heaps of data carries a perception of ‘free lunch for all’, and the work of data publishers is largely going unnoticed. Acknowledging such a significant effort involving the creation, management and publication of a dataset remains a flimsy, not well-established practice in the scientific community. In a meta-analysis of published literature, we have observed various dataset citation practices, but mostly (92%) consisting of merely citing the data repository rather than the data publisher. Failing to recognize the work of data publishers might lead to a decrease in the number of quality datasets shared online, compromising potential research that is dependent on the availability of such data. We make an urgent appeal to raise awareness about this issue. PMID:29688384
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hirdt, J.A.; Brown, D.A., E-mail: dbrown@bnl.gov
The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of social networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.
Resources for Functional Genomics Studies in Drosophila melanogaster
Mohr, Stephanie E.; Hu, Yanhui; Kim, Kevin; Housden, Benjamin E.; Perrimon, Norbert
2014-01-01
Drosophila melanogaster has become a system of choice for functional genomic studies. Many resources, including online databases and software tools, are now available to support design or identification of relevant fly stocks and reagents or analysis and mining of existing functional genomic, transcriptomic, proteomic, etc. datasets. These include large community collections of fly stocks and plasmid clones, “meta” information sites like FlyBase and FlyMine, and an increasing number of more specialized reagents, databases, and online tools. Here, we introduce key resources useful to plan large-scale functional genomics studies in Drosophila and to analyze, integrate, and mine the results of those studies in ways that facilitate identification of highest-confidence results and generation of new hypotheses. We also discuss ways in which existing resources can be used and might be improved and suggest a few areas of future development that would further support large- and small-scale studies in Drosophila and facilitate use of Drosophila information by the research community more generally. PMID:24653003
A New Approach for Mining Order-Preserving Submatrices Based on All Common Subsequences.
Xue, Yun; Liao, Zhengling; Li, Meihang; Luo, Jie; Kuang, Qiuhua; Hu, Xiaohui; Li, Tiechen
2015-01-01
Order-preserving submatrices (OPSMs) have been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems, as an important unsupervised learning model. Unfortunately, most existing methods are heuristic algorithms that are unable to reveal all OPSMs, since the problem is NP-complete. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method to discover all OPSMs based on frequent sequential pattern mining. First, an existing algorithm was adapted to disclose all common subsequences (ACS) between every pair of row sequences, so that no deep OPSM is missed. Then, an improved prefix-tree data structure was used to store and traverse the ACS, and the Apriori principle was employed to efficiently mine the frequent sequential patterns. Finally, experiments were conducted on gene and synthetic datasets. Results demonstrated the effectiveness and efficiency of this method.
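The support test at the heart of OPSM mining can be sketched briefly: each row induces an ordering of the columns, and a column pattern is supported by a row when the pattern is a subsequence of that ordering. The Python below illustrates only that test on a synthetic matrix; it is not the paper's ACS and prefix-tree algorithm.

    import numpy as np

    def row_order(row):
        # Column indices sorted by value: the ordering this row induces.
        return list(np.argsort(row))

    def supports(perm, pattern):
        # True if 'pattern' (a column-index sequence) is a subsequence of 'perm'.
        it = iter(perm)
        return all(col in it for col in pattern)

    def opsm_support(matrix, pattern):
        # Number of rows whose induced ordering contains the pattern.
        return sum(supports(row_order(r), pattern) for r in matrix)

    data = np.array([[0.1, 0.5, 0.9],
                     [0.2, 0.4, 0.8],
                     [0.9, 0.1, 0.5]])
    print(opsm_support(data, [0, 1, 2]))  # rows preserving c0 < c1 < c2 -> 2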
Knowledge acquisition, semantic text mining, and security risks in health and biomedical informatics
Huang, Jingshan; Dou, Dejing; Dang, Jiangbo; Pardue, J Harold; Qin, Xiao; Huan, Jun; Gerthoffer, William T; Tan, Ming
2012-01-01
Computational techniques have been adopted in medical and biological systems for a long time. There is no doubt that the development and application of computational methods will render great help in better understanding biomedical and biological functions. Large amounts of datasets have been produced by biomedical and biological experiments and simulations. In order for researchers to gain knowledge from original data, nontrivial transformation is necessary, which is regarded as a critical link in the chain of knowledge acquisition, sharing, and reuse. Challenges that have been encountered include: how to efficiently and effectively represent human knowledge in formal computing models, how to take advantage of semantic text mining techniques rather than traditional syntactic text mining, and how to handle security issues during the knowledge sharing and reuse. This paper summarizes the state-of-the-art in these research directions. We aim to provide readers with an introduction of major computing themes to be applied to the medical and biological research. PMID:22371823
Applying Data Mining Techniques to Improve Breast Cancer Diagnosis.
Diz, Joana; Marreiros, Goreti; Freitas, Alberto
2016-09-01
In the field of breast cancer research, and more than ever, new computer-aided diagnosis systems have been developed aiming to reduce diagnostic-test false positives. Within this work, we present a data mining based approach which might support oncologists in the process of breast cancer classification and diagnosis. The present study aims to compare two breast cancer datasets and find the best methods for predicting benign/malignant lesions, breast density classification, and finding identification (mass/microcalcification distinction). To carry out these tasks, two matrices of extracted texture features were built using Matlab and classified using data mining algorithms in WEKA. Results revealed good percentages of accuracy for each class: 89.3 to 64.7% (benign/malignant); 75.8 to 78.3% (dense/fatty tissue); 71.0 to 83.1% (finding identification). Among the tested classifiers, Naive Bayes was the best at identifying mass texture, and Random Forests was the first or second best classifier for the majority of tested groups.
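A rough sense of the classification step, with scikit-learn standing in for WEKA (an assumption, not the study's setup): the feature matrix below is a random placeholder for the Matlab-extracted texture features, and the two classifiers named in the abstract are compared by cross-validated accuracy.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 12))      # 12 texture features per lesion (placeholder)
    y = rng.integers(0, 2, size=200)    # 0 = benign, 1 = malignant (placeholder)

    for clf in (GaussianNB(), RandomForestClassifier(n_estimators=100, random_state=0)):
        acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
        print(type(clf).__name__, round(acc, 3))

On random placeholder data the accuracies hover near chance; with real texture features the comparison mirrors the benign/malignant experiment described above.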
NASA Astrophysics Data System (ADS)
Hirdt, J. A.; Brown, D. A.
2016-01-01
The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of social networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.
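The graph analysis described above can be sketched as follows, assuming networkx and invented node names and ENTRY counts in place of the real EXFOR data: build the undirected reaction/quantity graph, score nodes by centrality, and look for structurally important nodes with few measurements.

    import networkx as nx

    # Nodes are reaction/quantity pairs; edges are shared-monitor connections (toy data).
    G = nx.Graph()
    G.add_edges_from([("n+Fe56 SIG", "n+Al27 SIG"), ("n+Al27 SIG", "p+Cu63 SIG"),
                      ("p+Cu63 SIG", "p+Ti48 SIG"), ("n+Fe56 SIG", "p+Ti48 SIG")])
    entry_counts = {"n+Fe56 SIG": 250, "n+Al27 SIG": 300, "p+Cu63 SIG": 12, "p+Ti48 SIG": 8}

    # High betweenness but low ENTRY count suggests an important yet understudied quantity.
    centrality = nx.betweenness_centrality(G)
    for node in sorted(G.nodes, key=lambda n: (-centrality[n], entry_counts[n])):
        print(node, round(centrality[node], 3), entry_counts[node])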
Non-invasive flow path characterization in a mining-impacted wetland
Bethune, James; Randell, Jackie; Runkel, Robert L.; Singha, Kamini
2015-01-01
Time-lapse electrical resistivity (ER) was used to capture the dilution of a seasonal pulse of acid mine drainage (AMD) contamination in the subsurface of a wetland downgradient of the abandoned Pennsylvania mine workings in central Colorado. Data were collected monthly from mid-July to late October of 2013, with an additional dataset collected in June of 2014. Inversion of the ER data shows the development through time of multiple resistive anomalies in the subsurface, which corroborating data suggest are driven by changes in total dissolved solids (TDS) localized in preferential flow pathways. Sensitivity analyses on a synthetic model of the site suggest that the anomalies would need to be at least several meters in diameter to be adequately resolved by the inversions. The existence of preferential flow paths would have a critical impact on the extent of attenuation mechanisms at the site, and their further characterization could be used to parameterize reactive transport models in developing quantitative predictions of remediation strategies.
Ghaibeh, A Ammar; Kasem, Asem; Ng, Xun Jin; Nair, Hema Latha Krishna; Hirose, Jun; Thiruchelvam, Vinesh
2018-01-01
The analysis of Electronic Health Records (EHRs) is attracting a lot of research attention in the medical informatics domain. Hospitals and medical institutes started to use data mining techniques to gain new insights from the massive amounts of data that can be made available through EHRs. Researchers in the medical field have often used descriptive statistics and classical statistical methods to prove assumed medical hypotheses. However, discovering new insights from large amounts of data solely based on experts' observations is difficult. Using data mining techniques and visualizations, practitioners can find hidden knowledge, identify interesting patterns, or formulate new hypotheses to be further investigated. This paper describes a work in progress on using data mining methods to analyze clinical data of Nasopharyngeal Carcinoma (NPC) cancer patients. NPC is the fifth most common cancer among Malaysians, and the data analyzed in this study was collected from three states in Malaysia (Kuala Lumpur, Sabah and Sarawak), and is considered to be the largest up-to-date dataset of its kind. This research is addressing the issue of cancer recurrence after the completion of radiotherapy and chemotherapy treatment. We describe the procedure, problems, and insights gained during the process.
CrossCheck: an open-source web tool for high-throughput screen data analysis.
Najafov, Jamil; Najafov, Ayaz
2017-07-19
Modern high-throughput screening methods allow researchers to generate large datasets that potentially contain important biological information. However, oftentimes, picking relevant hits from such screens and generating testable hypotheses requires training in bioinformatics and the skills to efficiently perform database mining. There are currently no tools available to the general public that allow users to cross-reference their screen datasets with published screen datasets. To this end, we developed CrossCheck, an online platform for high-throughput screen data analysis. CrossCheck is a centralized database that allows effortless comparison of the user-entered list of gene symbols with 16,231 published datasets. These datasets include published data from genome-wide RNAi and CRISPR screens, interactome proteomics and phosphoproteomics screens, cancer mutation databases, low-throughput studies of major cell signaling mediators, such as kinases, E3 ubiquitin ligases and phosphatases, and gene ontological information. Moreover, CrossCheck includes a novel database of predicted protein kinase substrates, which was developed using proteome-wide consensus motif searches. CrossCheck dramatically simplifies high-throughput screen data analysis and enables researchers to dig deep into the published literature and streamline data-driven hypothesis generation. CrossCheck is freely accessible as a web-based application at http://proteinguru.com/crosscheck.
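At its core, the cross-referencing CrossCheck performs can be pictured as set intersection between a user's gene-symbol list and each published dataset. The sketch below uses made-up dataset names and gene sets, not CrossCheck's actual database or API.

    # Published screen datasets keyed by name (placeholder contents).
    published = {
        "RNAi_screen_2014": {"TP53", "RIPK1", "CASP8", "MTOR"},
        "CRISPR_screen_2016": {"RIPK1", "MLKL", "TNFRSF1A"},
    }
    user_hits = {"RIPK1", "MLKL", "AKT1"}   # the user's screen hit list

    for name, genes in published.items():
        overlap = sorted(user_hits & genes)
        if overlap:
            print(f"{name}: {overlap}")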
Selection-Fusion Approach for Classification of Datasets with Missing Values
Ghannad-Rezaie, Mostafa; Soltanian-Zadeh, Hamid; Ying, Hao; Dong, Ming
2010-01-01
This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. This approach is particularly designed for classification of datasets with a small number of samples and a high percentage of missing values where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset. Then, it combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing derivation of a mathematical framework for it. A trade-off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier. To deal with this trade-off, a numerical criterion is proposed for the prediction of the overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (total of eight datasets). Experimental results show that the classification accuracy of the proposed method is superior to those of the widely used multiple imputation method and four other methods. They also show that the level of superiority depends on the pattern and percentage of missing values. PMID:20212921
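A minimal sketch of the selection-fusion idea, assuming scikit-learn and a toy matrix: group samples by missing-value pattern, train one classifier per pattern on its observed features, and fuse predictions by majority vote. The paper's clustering-based subset selection and performance criterion are not reproduced here.

    import numpy as np
    from collections import Counter
    from sklearn.linear_model import LogisticRegression

    def pattern(x):
        return tuple(np.isnan(x))            # which features are missing in this sample

    def fit_per_pattern(X, y):
        models = {}
        for p in {pattern(x) for x in X}:
            rows = [i for i, x in enumerate(X) if pattern(x) == p]
            cols = [j for j, miss in enumerate(p) if not miss]
            if len(rows) > 1 and cols:       # train on the subset's observed features
                models[p] = (cols, LogisticRegression().fit(X[np.ix_(rows, cols)], y[rows]))
        return models

    def predict_fused(models, x):
        # Each applicable subset classifier votes; majority wins.
        votes = [m.predict(x[cols].reshape(1, -1))[0]
                 for cols, m in models.values() if not np.isnan(x[cols]).any()]
        return Counter(votes).most_common(1)[0][0] if votes else None

    X = np.array([[1.0, 2.0], [2.0, 1.0], [np.nan, 3.0], [np.nan, 1.0]])
    y = np.array([0, 1, 0, 1])
    print(predict_fused(fit_per_pattern(X, y), np.array([1.5, 2.0])))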
NASA Astrophysics Data System (ADS)
Patton, E. W.; West, P.; Greer, R.; Jin, B.
2011-12-01
Following on work presented at the 2010 AGU Fall Meeting, we present a number of real-world collections of semantically-enabled scientific metadata ingested into the Tetherless World RDF2HTML system as structured data and presented and edited using that system. Two separate datasets from two different domains (oceanography and solar sciences) are made available using existing web standards and services, e.g. encoded using ontologies represented with the Web Ontology Language (OWL) and stored in a SPARQL endpoint for querying. These datasets are deployed for use in three different web environments, i.e. Drupal, MediaWiki, and a custom web portal written in Java, to highlight the cross-platform nature of the data presentation. Stylesheets used to transform concepts in each domain as well as shared terms into HTML will be presented to show the power of using common ontologies to publish data and support reuse of existing terminologies. In addition, a single domain dataset is shared between two separate portal instances to demonstrate the ability for this system to offer distributed access and modification of content across the Internet. Lastly, we will highlight challenges that arose in the software engineering process, outline the design choices we made in solving those issues, and discuss how future improvements to this and other systems will enable the evolution of distributed, decentralized collaborations for scientific data sharing across multiple research groups.
ERIC Educational Resources Information Center
Sands, Ashley Elizabeth
2017-01-01
Ground-based astronomy sky surveys are massive, decades-long investments in scientific data collection. Stakeholders expect these datasets to retain scientific value well beyond the lifetime of the sky survey. However, the necessary investments in knowledge infrastructures for managing sky survey data are not yet in place to ensure the long-term…
Modern data science for analytical chemical data - A comprehensive review.
Szymańska, Ewa
2018-10-22
Efficient and reliable analysis of chemical analytical data is a great challenge due to the increase in data size, variety and velocity. New methodologies, approaches and methods are being proposed not only by chemometrics but also by other data science communities to extract relevant information from big datasets and deliver its value to different applications. Besides the common goal of big data analysis, different perspectives on and terms for big data are being discussed in the scientific literature and public media. The aim of this comprehensive review is to present common trends in the analysis of chemical analytical data across different data scientific fields together with their data type-specific and generic challenges. Firstly, common data science terms used in different data scientific fields are summarized and discussed. Secondly, systematic methodologies to plan and run big data analysis projects are presented together with their steps. Moreover, different analysis aspects like assessing data quality, selecting data pre-processing strategies, data visualization and model validation are considered in more detail. Finally, an overview of standard and new data analysis methods is provided and their suitability for big analytical chemical datasets briefly discussed.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ruebel, Oliver
2009-11-20
Knowledge discovery from large and complex collections of today's scientific datasets is a challenging task. With the ability to measure and simulate more processes at increasingly finer spatial and temporal scales, the increasing number of data dimensions and data objects is presenting tremendous challenges for data analysis and effective data exploration methods and tools. Researchers are overwhelmed with data and standard tools are often insufficient to enable effective data analysis and knowledge discovery. The main objective of this thesis is to provide important new capabilities to accelerate scientific knowledge discovery from large, complex, and multivariate scientific data. The research covered in this thesis addresses these scientific challenges using a combination of scientific visualization, information visualization, automated data analysis, and other enabling technologies, such as efficient data management. The effectiveness of the proposed analysis methods is demonstrated via applications in two distinct scientific research fields, namely developmental biology and high-energy physics. Advances in microscopy, image analysis, and embryo registration enable for the first time measurement of gene expression at cellular resolution for entire organisms. Analysis of high-dimensional spatial gene expression datasets is a challenging task. By integrating data clustering and visualization, analysis of complex, time-varying, spatial gene expression patterns and their formation becomes possible. The MATLAB analysis framework and the visualization have been integrated, making advanced analysis tools accessible to biologists and enabling bioinformatics researchers to directly integrate their analyses with the visualization. Laser wakefield particle accelerators (LWFAs) promise to be a new compact source of high-energy particles and radiation, with wide applications ranging from medicine to physics. To gain insight into the complex physical processes of particle acceleration, physicists model LWFAs computationally. The datasets produced by LWFA simulations are (i) extremely large, (ii) of varying spatial and temporal resolution, (iii) heterogeneous, and (iv) high-dimensional, making analysis and knowledge discovery from complex LWFA simulation data a challenging task. To address these challenges, this thesis describes the integration of the visualization system VisIt and the state-of-the-art index/query system FastBit, enabling interactive visual exploration of extremely large three-dimensional particle datasets. Researchers are especially interested in beams of high-energy particles formed during the course of a simulation. This thesis describes novel methods for automatic detection and analysis of particle beams enabling a more accurate and efficient data analysis process. By integrating these automated analysis methods with visualization, this research enables more accurate, efficient, and effective analysis of LWFA simulation data than previously possible.
Management and assimilation of diverse, distributed watershed datasets
NASA Astrophysics Data System (ADS)
Varadharajan, C.; Faybishenko, B.; Versteeg, R.; Agarwal, D.; Hubbard, S. S.; Hendrix, V.
2016-12-01
The U.S. Department of Energy's (DOE) Watershed Function Scientific Focus Area (SFA) seeks to determine how perturbations to mountainous watersheds (e.g., floods, drought, early snowmelt) impact the downstream delivery of water, nutrients, carbon, and metals over seasonal to decadal timescales. We are building a software platform that enables integration of diverse and disparate field, laboratory, and simulation datasets, of various types including hydrological, geological, meteorological, geophysical, geochemical, ecological and genomic datasets across a range of spatial and temporal scales within the Rifle floodplain and the East River watershed, Colorado. We are using agile data management and assimilation approaches to enable web-based integration of heterogeneous, multi-scale data. Sensor-based observations of water level, vadose zone and groundwater temperature, water quality, and meteorology, as well as biogeochemical analyses of soil and groundwater samples, have been curated and archived in federated databases. Quality Assurance and Quality Control (QA/QC) are performed on priority datasets needed for ongoing scientific analyses, and hydrological and geochemical modeling. Automated QA/QC methods are used to identify and flag issues in the datasets. Data integration is achieved via a brokering service that dynamically integrates data from distributed databases via web services, based on user queries. The integrated results are presented to users in a portal that enables intuitive search, interactive visualization and download of integrated datasets. The concepts, approaches and codes being used are shared across various data science components of various large DOE-funded projects such as the Watershed Function SFA, Next Generation Ecosystem Experiment (NGEE) Tropics, Ameriflux/FLUXNET, and Advanced Simulation Capability for Environmental Management (ASCEM), and together contribute towards DOE's cyberinfrastructure for data management and model-data integration.
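One of the automated QA/QC passes mentioned above might look like the following pandas sketch, which flags out-of-range and flat-lined values in a sensor time series; the thresholds, window, and data are assumptions for illustration, not the SFA's actual rules.

    import pandas as pd

    def qaqc_flags(series, lo, hi, flatline_window=6):
        flags = pd.Series("ok", index=series.index)
        flags[(series < lo) | (series > hi)] = "range_fail"     # physical range check
        flat = series.rolling(flatline_window).std() == 0       # stuck-sensor check
        flags[flat] = "flatline"
        return flags

    # Hourly water temperature with one spike and a stuck-sensor tail (synthetic).
    wt = pd.Series([8.1, 8.2, 8.2, 45.0, 8.3, 8.3, 8.3, 8.3, 8.3, 8.3],
                   index=pd.date_range("2016-06-01", periods=10, freq="h"))
    print(qaqc_flags(wt, lo=-2.0, hi=35.0).value_counts())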
The NAS Computational Aerosciences Archive
NASA Technical Reports Server (NTRS)
Miceli, Kristina D.; Globus, Al; Lasinski, T. A. (Technical Monitor)
1995-01-01
In order to further the state-of-the-art in computational aerosciences (CAS) technology, researchers must be able to gather and understand existing work in the field. One aspect of this information gathering is studying published work available in scientific journals and conference proceedings. However, current scientific publications are very limited in the type and amount of information that they can disseminate. Information is typically restricted to text, a few images, and a bibliography list. Additional information that might be useful to the researcher, such as additional visual results, referenced papers, and datasets, are not available. New forms of electronic publication, such as the World Wide Web (WWW), limit publication size only by available disk space and data transmission bandwidth, both of which are improving rapidly. The Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center is in the process of creating an archive of CAS information on the WWW. This archive will be based on the large amount of information produced by researchers associated with the NAS facility. The archive will contain technical summaries and reports of research performed on NAS supercomputers, visual results (images, animations, visualization system scripts), datasets, and any other supporting meta-information. This information will be available via the WWW through the NAS homepage, located at http://www.nas.nasa.gov/, fully indexed for searching. The main components of the archive are technical summaries and reports, visual results, and datasets. Technical summaries are gathered every year by researchers who have been allotted resources on NAS supercomputers. These summaries, together with supporting visual results and references, are browsable by interested researchers. Referenced papers made available by researchers can be accessed through hypertext links. Technical reports are in-depth accounts of tools and applications research projects performed by NAS staff members and collaborators. Visual results, which may be available in the form of images, animations, and/or visualization scripts, are generated by researchers with respect to a certain research project, depicting dataset features that were determined important by the investigating researcher. For example, script files for visualization systems (e.g. FAST, PLOT3D, AVS) are provided to create visualizations on the user's local workstation to elucidate the key points of the numerical study. Users can then interact with the data starting where the investigator left off. Datasets are intended to give researchers an opportunity to understand previous work, 'mine' solutions for new information (for example, have you ever read a paper thinking "I wonder what the helicity density looks like?"), compare new techniques with older results, collaborate with remote colleagues, and perform validation. Supporting meta-information associated with the research projects is also important to provide additional context for research projects. This may include information such as the software used in the simulation (e.g. grid generators, flow solvers, visualization). In addition to serving the CAS research community, the information archive will also be helpful to students, visualization system developers and researchers, and management. Students (of any age) can use the data to study fluid dynamics, compare results from different flow solvers, learn about meshing techniques, etc., leading to better informed individuals. 
For these users it is particularly important that visualization be integrated into dataset archives. Visualization researchers can use dataset archives to test algorithms and techniques, leading to better visualization systems. Management can use the data to figure out what is really going on behind the viewgraphs. All users will benefit from fast, easy, and convenient access to CFD datasets. The CAS information archive hopes to serve as a useful resource to those interested in the computational sciences. At present, only information that may be distributed internationally is made available via the archive. Studies are underway to determine security requirements and solutions to make additional information available. By providing access to the archive via the WWW, the process of information gathering can be more productive and fruitful due to ease of access and the ability to manage many different types of information. As the archive grows, additional resources from outside NAS will be added, providing a dynamic source of research results.
Seismic safety in conducting large-scale blasts
NASA Astrophysics Data System (ADS)
Mashukov, I. V.; Chaplygin, V. V.; Domanov, V. P.; Semin, A. A.; Klimkin, M. A.
2017-09-01
In mining enterprises, a drilling and blasting method is used to prepare hard rocks for excavation. As mining operations approach settlements, the negative effects of large-scale blasts increase. To assess the level of seismic impact of large-scale blasts, the scientific staff of Siberian State Industrial University carried out expert assessments for coal mines and iron ore enterprises. The magnitude of surface seismic vibrations caused by mass explosions was determined using seismic receivers and an analog-digital converter recording to a laptop. The results of recording surface seismic vibrations during more than 280 large-scale blasts at 17 mining enterprises in 22 settlements are presented. The maximum velocity values of the Earth's surface vibrations are determined. The seismic effect was evaluated for safety against the permissible value of vibration velocity. For cases where permissible values were exceeded, recommendations were developed to reduce the level of seismic impact.
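The safety criterion used in the study reduces to comparing the peak particle velocity of a recorded trace against a permissible vibration velocity. The sketch below assumes a synthetic velocity waveform and an invented limit, purely to show the arithmetic.

    import numpy as np

    fs = 1000.0                                 # sampling rate, Hz (assumed)
    t = np.arange(0, 2.0, 1 / fs)
    velocity = 0.8 * np.exp(-2 * t) * np.sin(2 * np.pi * 12 * t)   # cm/s, synthetic trace

    ppv = np.max(np.abs(velocity))              # peak particle velocity
    permissible = 0.5                           # cm/s, assumed permissible value
    print(f"PPV = {ppv:.3f} cm/s ->",
          "exceeds limit" if ppv > permissible else "within limit")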
The health care and life sciences community profile for dataset descriptions
Alexiev, Vladimir; Ansell, Peter; Bader, Gary; Baran, Joachim; Bolleman, Jerven T.; Callahan, Alison; Cruz-Toledo, José; Gaudet, Pascale; Gombocz, Erich A.; Gonzalez-Beltran, Alejandra N.; Groth, Paul; Haendel, Melissa; Ito, Maori; Jupp, Simon; Juty, Nick; Katayama, Toshiaki; Kobayashi, Norio; Krishnaswami, Kalpana; Laibe, Camille; Le Novère, Nicolas; Lin, Simon; Malone, James; Miller, Michael; Mungall, Christopher J.; Rietveld, Laurens; Wimalaratne, Sarala M.; Yamaguchi, Atsuko
2016-01-01
Access to consistent, high-quality metadata is critical to finding, understanding, and reusing scientific data. However, while there are many relevant vocabularies for the annotation of a dataset, none sufficiently captures all the necessary metadata. This prevents uniform indexing and querying of dataset repositories. Towards providing a practical guide for producing a high quality description of biomedical datasets, the W3C Semantic Web for Health Care and the Life Sciences Interest Group (HCLSIG) identified Resource Description Framework (RDF) vocabularies that could be used to specify common metadata elements and their value sets. The resulting guideline covers elements of description, identification, attribution, versioning, provenance, and content summarization. This guideline reuses existing vocabularies, and is intended to meet key functional requirements including indexing, discovery, exchange, query, and retrieval of datasets, thereby enabling the publication of FAIR data. The resulting metadata profile is generic and could be used by other domains with an interest in providing machine readable descriptions of versioned datasets. PMID:27602295
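To make the idea of a machine-readable dataset description concrete, the following rdflib sketch expresses a few of the guideline's element types (identification, attribution, versioning) with Dublin Core and DCAT terms; the IRI and literal values are fabricated examples, and the exact property choices here are illustrative rather than a reproduction of the W3C profile.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import DCTERMS, RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    g = Graph()
    ds = URIRef("http://example.org/dataset/exampledb")   # made-up dataset IRI

    g.add((ds, RDF.type, DCAT.Dataset))
    g.add((ds, DCTERMS.title, Literal("Example biomedical dataset")))
    g.add((ds, DCTERMS.creator, Literal("Example Lab")))           # attribution
    g.add((ds, DCTERMS.hasVersion, Literal("2016-01")))            # versioning
    g.add((ds, DCAT.landingPage, URIRef("http://example.org/exampledb")))

    print(g.serialize(format="turtle"))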
Use of Electronic Health-Related Datasets in Nursing and Health-Related Research.
Al-Rawajfah, Omar M; Aloush, Sami; Hewitt, Jeanne Beauchamp
2015-07-01
Datasets of gigabyte size are common in the medical sciences. There is increasing consensus that significant untapped knowledge lies hidden in these large datasets. This review article aims to discuss Electronic Health-Related Datasets (EHRDs) in terms of types, features, advantages, limitations, and possible uses in nursing and health-related research. Major scientific databases, MEDLINE, ScienceDirect, and Scopus, were searched for studies or review articles on using EHRDs in research. A total of 442 articles were located. After application of the study inclusion criteria, 113 articles were included in the final review. EHRDs were categorized into Electronic Administrative Health-Related Datasets and Electronic Clinical Health-Related Datasets. Subcategories of each major category were identified. EHRDs are invaluable assets for nursing and health-related research. Advanced research skills, such as using analytical software, advanced statistical procedures, and dealing with missing data and missing variables, will maximize the efficient utilization of EHRDs in research.
CheS-Mapper - Chemical Space Mapping and Visualization in 3D.
Gütlein, Martin; Karwath, Andreas; Kramer, Stefan
2012-03-17
Analyzing chemical datasets is a challenging task for scientific researchers in the field of chemoinformatics. It is important, yet difficult, to understand the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. In that respect, visualization tools can help to better comprehend the underlying correlations. Our recently developed 3D molecular viewer CheS-Mapper (Chemical Space Mapper) divides large datasets into clusters of similar compounds and consequently arranges them in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity by selecting which features to employ in the process. The tool can use and calculate different kinds of features, like structural fragments as well as quantitative chemical descriptors. These features can be highlighted within CheS-Mapper, which helps the chemist to better understand patterns and regularities and to relate the observations to established scientific knowledge. As a final function, the tool can also be used to select and export specific subsets of a given dataset for further analysis.
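The cluster-then-embed workflow CheS-Mapper implements can be approximated with scikit-learn (a stand-in, not the tool's internals): cluster compounds on their feature vectors, then project to three dimensions so that spatial proximity tracks similarity. The descriptor matrix below is random placeholder data, not real chemical descriptors.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    features = rng.normal(size=(150, 20))     # 150 compounds x 20 descriptors (placeholder)

    labels = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(features)
    coords3d = PCA(n_components=3).fit_transform(features)   # 3D layout coordinates

    for cluster in range(5):
        print(cluster, coords3d[labels == cluster].mean(axis=0).round(2))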
CheS-Mapper - Chemical Space Mapping and Visualization in 3D
2012-01-01
Analyzing chemical datasets is a challenging task for scientific researchers in the field of chemoinformatics. It is important, yet difficult, to understand the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. In that respect, visualization tools can help to better comprehend the underlying correlations. Our recently developed 3D molecular viewer CheS-Mapper (Chemical Space Mapper) divides large datasets into clusters of similar compounds and consequently arranges them in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity by selecting which features to employ in the process. The tool can use and calculate different kinds of features, like structural fragments as well as quantitative chemical descriptors. These features can be highlighted within CheS-Mapper, which helps the chemist to better understand patterns and regularities and to relate the observations to established scientific knowledge. As a final function, the tool can also be used to select and export specific subsets of a given dataset for further analysis. PMID:22424447
Data always getting bigger -- A scalable DOI architecture for big and expanding scientific data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prakash, Giri; Shrestha, Biva; Younkin, Katarina
The Atmospheric Radiation Measurement (ARM) Data Archive established a data citation strategy based on Digital Object Identifiers (DOIs) for the ARM datasets in order to facilitate citing continuous and diverse ARM datasets in articles and other papers. This strategy eases the tracking of data provided as supplements to articles and papers. Additionally, it allows future data users and the ARM Climate Research Facility to easily locate the exact data used in various articles. Traditionally, DOIs are assigned to individual digital objects (a report or a data table), but for ARM datasets, these DOIs are assigned to an ARM data product. This eliminates the need for creating DOIs for numerous components of the ARM data product, in turn making it easier for users to manage and cite the ARM data with fewer DOIs. In addition, the ARM data infrastructure team, with input from scientific users, developed a citation format and an online data citation generation tool for continuous data streams. As a result, this citation format includes DOIs along with additional details such as spatial and temporal information.
Data always getting bigger -- A scalable DOI architecture for big and expanding scientific data
Prakash, Giri; Shrestha, Biva; Younkin, Katarina; ...
2016-08-31
The Atmospheric Radiation Measurement (ARM) Data Archive established a data citation strategy based on Digital Object Identifiers (DOIs) for the ARM datasets in order to facilitate citing continuous and diverse ARM datasets in articles and other papers. This strategy eases the tracking of data provided as supplements to articles and papers. Additionally, it allows future data users and the ARM Climate Research Facility to easily locate the exact data used in various articles. Traditionally, DOIs are assigned to individual digital objects (a report or a data table), but for ARM datasets, these DOIs are assigned to an ARM data product. This eliminates the need for creating DOIs for numerous components of the ARM data product, in turn making it easier for users to manage and cite the ARM data with fewer DOIs. In addition, the ARM data infrastructure team, with input from scientific users, developed a citation format and an online data citation generation tool for continuous data streams. As a result, this citation format includes DOIs along with additional details such as spatial and temporal information.
Artificial intelligence support for scientific model-building
NASA Technical Reports Server (NTRS)
Keller, Richard M.
1992-01-01
Scientific model-building can be a time-intensive and painstaking process, often involving the development of large and complex computer programs. Despite the effort involved, scientific models cannot easily be distributed and shared with other scientists. In general, implemented scientific models are complex, idiosyncratic, and difficult for anyone but the original scientific development team to understand. We believe that artificial intelligence techniques can facilitate both the model-building and model-sharing process. In this paper, we overview our effort to build a scientific modeling software tool that aids the scientist in developing and using models. This tool includes an interactive intelligent graphical interface, a high-level domain specific modeling language, a library of physics equations and experimental datasets, and a suite of data display facilities.
Haas, Emily J.; Cecala, Andrew B.; Hoebbel, Cassandra L.
2016-01-01
Research continues to investigate barriers to managing occupational health and safety behaviors among the workforce. Recent literature argues that (1) there is a lack of consistent, multilevel communication and application of health and safety practices, and (2) social scientific methods are absent when determining how to manage injury prevention in the workplace. In response, the current study developed and tested a multilevel intervention case study at two industrial mineral mines to help managers and workers communicate about and reduce respirable silica dust exposures at their mine sites. A dust assessment technology, the Helmet-CAM, was used to identify and encourage communication about potential problem areas and tasks on site that contributed to elevated exposures. The intervention involved pre- and post-assessment field visits, four weeks apart that included multiple forms of data collection from workers and managers. Results revealed that mine management can utilize dust assessment technology as a risk communication tool to prompt and communicate about healthier behaviors with their workforce. Additionally, when workers were debriefed with the Helmet-CAM data through the device software, the dust exposure data can help improve the knowledge and awareness of workers, empowering them to change subtle behaviors that could reduce future elevated exposures to respirable silica dust. This case study demonstrates that incorporating social scientific methods into the application of health and safety management strategies, such as behavioral modification and technology integration, can leverage managers’ communication practices with workers, subsequently improving health and safety behaviors. PMID:26807445
Intensification and Structure Change of Super Typhoon Flo as Related to the Large-Scale Environment.
1998-06-01
... large dataset is a challenge. Schiavone and Papathomas (1990) summarize methods currently available for visualizing scientific datasets.
Changing knowledge perspective in a changing world: The Adriatic multidisciplinary TDS approach
NASA Astrophysics Data System (ADS)
Bergamasco, Andrea; Carniel, Sandro; Nativi, Stefano; Signell, Richard P.; Benetazzo, Alvise; Falcieri, Francesco M.; Bonaldo, Davide; Minuzzo, Tiziano; Sclavo, Mauro
2013-04-01
The use and exploitation of the marine environment has been increasingly intense in recent years, calling for better description, monitoring and understanding of its behavior. However, marine scientists and managers often spend too much time accessing and reformatting data instead of focusing on discovering new knowledge from the processes observed and the data acquired. There is therefore a need to make our approach to data mining more efficient, especially in a world where rapid climate change imposes rapid choices. In this context, it is mandatory to explore ways to make large amounts of distributed data usable in an efficient and easy way, an effort that requires standardized data protocols, web services and standards-based tools. Following the US-IOOS approach, which has been adopted in many oceanographic and meteorological sectors, we present a CNR experience in the direction of setting up a national Italian IOOS framework (at the moment confined to the Adriatic Sea environment), using the THREDDS (THematic Real-time Environmental Distributed Data Services) Data Server (TDS). A TDS is middleware designed to fill the gap between data providers and data users, and provides services allowing data users to find the datasets pertaining to their scientific needs and to access, visualize and use them in an easy way, without the need to download files to the local workspace. To achieve these results, data providers must make their data available in a standard form that the TDS understands, with sufficient metadata so that the data can be read and searched in a standard way. The TDS core is a NetCDF-Java library implementing a Common Data Model (CDM), as developed by Unidata (http://www.unidata.ucar.edu), allowing access to "array-based" scientific data. Climate and Forecast (CF) compliant NetCDF files can be read directly with no modification, while non-compliant files can be modified to meet the appropriate metadata requirements. Once standardized in the CDM, the TDS makes datasets available through a series of web services such as OPeNDAP or the Open Geospatial Consortium Web Coverage Service (WCS), allowing data users to easily obtain small subsets of large datasets and to quickly visualize their content using tools such as GODIVA2 or the Integrated Data Viewer (IDV). In addition, an ISO metadata service is available through the TDS that can be harvested by catalogue broker services (e.g. GI-cat) to enable distributed search across federated data servers. Examples of TDS datasets of evolving oceanographic fields (currents, waves, sediments...) will be described and discussed; some can be accessed directly at the Venice site http://tds.ve.ismar.cnr.it:8080/thredds/catalog.html (Bergamasco et al., 2012), also within the framework of the RITMARE Project. References: Bergamasco A., Benetazzo A., Carniel S., Falcieri F., Minuzzo T., Signell R.P. and M. Sclavo, 2012. From interoperability to knowledge discovery using large model datasets in the marine environment: the THREDDS Data Server example. Advances in Oceanography and Limnology, 3(1), 41-50. DOI:10.1080/19475721.2012.669637
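As a concrete example of the data-user side described above, a small subset of a remote dataset can be sliced through a TDS's OPeNDAP service without downloading whole files; the dataset path and variable name in this Python sketch are hypothetical.

    from netCDF4 import Dataset

    # Hypothetical OPeNDAP endpoint on a THREDDS Data Server.
    url = "http://tds.ve.ismar.cnr.it:8080/thredds/dodsC/example/adriatic_model.nc"
    with Dataset(url) as nc:                  # opens the remote dataset, no file download
        sst = nc.variables["sea_surface_temperature"]
        subset = sst[-1, 100:110, 200:210]    # last time step, small spatial box, sliced server-side
        print(subset.shape, float(subset.mean()))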
ERIC Educational Resources Information Center
Lewis, Jonathan S.
2017-01-01
Paid employment is one of the most common extracurricular activities among full-time undergraduates, and an array of studies has attempted to measure its impact. Methodological concerns with the extant literature, however, make it difficult to draw reliable conclusions. Furthermore, the research on working college students has little to say about…
Computational Omics Pre-Awardees | Office of Cancer Clinical Proteomics Research
The National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) is pleased to announce the pre-awardees of the Computational Omics solicitation. Working with NVIDIA Foundation's Compute the Cure initiative and Leidos Biomedical Research Inc., the NCI, through this solicitation, seeks to leverage computational efforts to provide tools for the mining and interpretation of large-scale publicly available ‘omics’ datasets.
The need for data standards in zoomorphology.
Vogt, Lars; Nickel, Michael; Jenner, Ronald A; Deans, Andrew R
2013-07-01
eScience is a new approach to research that focuses on data mining and exploration rather than data generation or simulation. This new approach is arguably a driving force for scientific progress and requires data to be openly available, easily accessible via the Internet, and compatible with each other. eScience relies on modern standards for the reporting and documentation of data and metadata. Here, we suggest necessary components (i.e., content, concept, nomenclature, format) of such standards in the context of zoomorphology. We document the need for using data repositories to prevent data loss and how publication practice is currently changing, with the emergence of dynamic publications and the publication of digital datasets. Subsequently, we demonstrate that in zoomorphology the scientific record is still limited to published literature and that zoomorphological data are usually not accessible through data repositories. The underlying problem is that zoomorphology lacks the standards for data and metadata. As a consequence, zoomorphology cannot participate in eScience. We argue that the standardization of morphological data requires i) a standardized framework for terminologies for anatomy and ii) a formalized method of description that allows computer-parsable morphological data to be communicable, compatible, and comparable. The role of controlled vocabularies (e.g., ontologies) for developing respective terminologies and methods of description is discussed, especially in the context of data annotation and semantic enhancement of publications. Finally, we introduce the International Consortium for Zoomorphology Standards, a working group that is open to everyone and whose aim is to stimulate and synthesize dialog about standards. It is the Consortium's ultimate goal to assist the zoomorphology community in developing modern data and metadata standards, including anatomy ontologies, thereby facilitating the participation of zoomorphology in eScience.
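A minimal sketch of what such computer-parsable morphological data could look like in practice, assuming RDF via the rdflib library: a specimen observation annotated with a controlled-vocabulary term rather than free text, so that it is communicable, compatible and comparable. The specimen identifiers and property names are hypothetical; UBERON is a real anatomy ontology, but the chosen term is purely illustrative.

    from rdflib import Graph, Literal, Namespace, RDF

    OBO = Namespace("http://purl.obolibrary.org/obo/")
    EX = Namespace("http://example.org/zoomorph/")   # hypothetical local namespace

    g = Graph()
    specimen = EX["specimen/0001"]
    structure = EX["specimen/0001/structure/1"]

    # Type the described structure with an ontology class (here UBERON_0002101,
    # "limb") instead of an uncontrolled text label.
    g.add((structure, RDF.type, OBO.UBERON_0002101))
    g.add((structure, EX.partOfSpecimen, specimen))        # hypothetical property
    g.add((structure, EX.lengthMillimetres, Literal(12.4)))  # hypothetical property

    print(g.serialize(format="turtle"))

Once descriptions are expressed this way, the same ontology URIs make datasets from different labs queryable and comparable by machine, which is the interoperability the authors argue zoomorphology currently lacks.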
DOE Office of Scientific and Technical Information (OSTI.GOV)
al-Saffar, Sinan; Joslyn, Cliff A.; Chappell, Alan R.
As semantic datasets grow to be very large and divergent, there is a need to identify and exploit their inherent semantic structure for discovery and optimization. Towards that end, we present here a novel methodology to identify the semantic structures inherent in an arbitrary semantic graph dataset. We first present the concept of an extant ontology as a statistical description of the semantic relations present amongst the typed entities modeled in the graph. This serves as a model of the underlying semantic structure to aid in discovery and visualization. We then describe a method of ontological scaling in which the ontology is employed as a hierarchical scaling filter to infer different resolution levels at which the graph structures are to be viewed or analyzed. We illustrate these methods on three large and publicly available semantic datasets containing more than one billion edges each. Keywords: Semantic Web; Visualization; Ontology; Multi-resolution Data Mining.
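A toy sketch of the extant-ontology idea, under the assumption that it amounts to tallying which predicates actually connect which entity types in a concrete RDF graph. The data, function name and predicate names are invented for illustration and do not reproduce the authors' implementation.

    from collections import Counter
    from rdflib import Graph, Namespace, RDF

    EX = Namespace("http://example.org/")

    # A tiny toy graph with typed entities and one relation between them.
    g = Graph()
    g.add((EX.alice, RDF.type, EX.Person))
    g.add((EX.bob, RDF.type, EX.Person))
    g.add((EX.acme, RDF.type, EX.Company))
    g.add((EX.alice, EX.worksFor, EX.acme))
    g.add((EX.bob, EX.worksFor, EX.acme))

    def extant_ontology(graph):
        """Count (subject type, predicate, object type) signatures in the graph."""
        types = {}
        for s, _, o in graph.triples((None, RDF.type, None)):
            types.setdefault(s, set()).add(o)
        signatures = Counter()
        for s, p, o in graph:
            if p == RDF.type:
                continue
            for st in types.get(s, {None}):
                for ot in types.get(o, {None}):
                    signatures[(st, p, ot)] += 1
        return signatures

    for (st, p, ot), n in extant_ontology(g).items():
        print(st, p, ot, n)   # e.g. Person -- worksFor --> Company: 2

The resulting statistics summarize the graph's de facto schema; filtering or collapsing signatures by frequency gives a crude form of the resolution-level scaling the abstract describes.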
Extraction and LOD control of colored interval volumes
NASA Astrophysics Data System (ADS)
Miyamura, Hiroko N.; Takeshima, Yuriko; Fujishiro, Issei; Saito, Takafumi
2005-03-01
Interval volume serves as a generalized isosurface and represents a three-dimensional subvolume for which the associated scalar field values lie within a user-specified closed interval. In general, it is not an easy task for novices to specify the scalar field interval corresponding to their regions of interest (ROIs). In order to extract interval volumes from which desirable geometric features can be mined effectively, we propose a suggestive technique which extracts interval volumes automatically based on a global examination of the field contrast structure. Also proposed here is a simplification scheme for decimating the resulting triangle patches to realize efficient transmission and rendering of large-scale interval volumes. Color distributions as well as geometric features are taken into account to select the best edges to collapse. In addition, when a user wants to selectively display and analyze the original dataset, the simplified dataset is restored to the original quality. Several simulated and acquired datasets are used to demonstrate the effectiveness of the present methods.
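A minimal sketch of interval-volume extraction on a regular grid, assuming NumPy and scikit-image: voxels whose scalar values fall inside the closed interval form the subvolume, and marching cubes on the resulting mask yields its boundary surface. The suggestive interval selection and color-aware edge-collapse simplification proposed in the paper are not reproduced here.

    import numpy as np
    from skimage import measure

    # Synthetic scalar field on a 64^3 grid (radial distance from the centre).
    x, y, z = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
    field = np.sqrt(x**2 + y**2 + z**2)

    lo, hi = 0.4, 0.6   # user-specified closed interval
    mask = ((field >= lo) & (field <= hi)).astype(np.float32)

    # Boundary surface of the interval volume (a spherical shell here).
    verts, faces, normals, values = measure.marching_cubes(mask, level=0.5)
    print(f"{len(verts)} vertices, {len(faces)} triangles")

For level-of-detail control, the resulting mesh would then be decimated, e.g. by edge collapse with a cost that, as the paper proposes, weighs color as well as geometric error.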
Sambles, Christine M; Salmon, Deborah L; Florance, Hannah; Howard, Thomas P; Smirnoff, Nicholas; Nielsen, Lene R; McKinney, Lea V; Kjær, Erik D; Buggs, Richard J A; Studholme, David J; Grant, Murray
2017-12-19
European common ash, Fraxinus excelsior, is currently threatened by ash dieback (ADB), caused by the fungus Hymenoscyphus fraxineus. To detect and identify metabolites that may be products of pathways important in contributing to resistance against H. fraxineus, we performed untargeted metabolomic profiling on leaves from five high-susceptibility and five low-susceptibility F. excelsior individuals identified during Danish field trials. In this study, we describe two datasets. The first is untargeted LC-MS metabolomics raw data, in positive and negative mode, from ash leaves with high and low susceptibility to ADB. These data allow peak picking, alignment, gap-filling and retention-time correlation analyses to be performed in alternative ways. The second, a processed dataset containing abundances of aligned features across all samples, enables further mining of the data. Here we illustrate the utility of this dataset, which has previously been used to identify putative iridoid glycosides, well-known anti-herbivory terpenoid derivatives, and to show their differential abundance in tolerant and susceptible ash samples.
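A minimal sketch of the kind of further mining the processed dataset enables: comparing feature abundances between the low- and high-susceptibility groups. The file name and column naming scheme are hypothetical placeholders; the abstract describes, but does not distribute, the table assumed here.

    import pandas as pd
    from scipy import stats

    # Rows = aligned LC-MS features, columns = samples
    # (five low- and five high-susceptibility trees; hypothetical layout).
    df = pd.read_csv("ash_feature_abundances.csv", index_col="feature_id")
    low = df[[c for c in df.columns if c.startswith("low_")]]
    high = df[[c for c in df.columns if c.startswith("high_")]]

    # Per-feature Welch t-test across the two susceptibility groups.
    t, p = stats.ttest_ind(low, high, axis=1, equal_var=False)
    results = pd.DataFrame({"t": t, "p": p}, index=df.index).sort_values("p")
    print(results.head(10))   # candidate differentially abundant features

Features ranked this way would still need multiple-testing correction and annotation against metabolite databases before any claim about resistance pathways could be made.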