Sample records for mining algorithms including

  1. Runtime support for parallelizing data mining algorithms

    NASA Astrophysics Data System (ADS)

    Jin, Ruoming; Agrawal, Gagan

    2002-03-01

    With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the technique we have developed starting from a common specification of the algorithm.

  2. Advances in algorithm fusion for automated sea mine detection and classification

    NASA Astrophysics Data System (ADS)

    Dobeck, Gerald J.; Cobb, J. Tory

    2002-11-01

    Along with other sensors, the Navy uses high-resolution sonar to detect and classify sea mines in mine-hunting operations. Scientists and engineers have devoted substantial effort to the development of automated detection and classification (D/C) algorithms for these high-resolution systems. Several factors spurred these efforts, including: (1) aids for operators to reduce work overload; (2) more optimal use of all available data; and (3) the introduction of unmanned minehunting systems. The environments where sea mines are typically laid (harbor areas, shipping lanes, and the littorals) give rise to many false alarms caused by natural, biologic, and manmade clutter. The objective of the automated D/C algorithms is to eliminate most of these false alarms while maintaining a very high probability of mine detection and classification (PdPc). In recent years, the benefits of fusing the outputs of multiple D/C algorithms (Algorithm Fusion) have been studied. To date, the results have been remarkable, including reliable robustness to new environments. In this paper a brief history of existing Algorithm Fusion technology and some techniques recently used to improve performance are presented. An exploration of new developments is presented in conclusion.

  3. Data Mining: The Art of Automated Knowledge Extraction

    NASA Astrophysics Data System (ADS)

    Karimabadi, H.; Sipes, T.

    2012-12-01

    Data mining algorithms are used routinely in a wide variety of fields and they are gaining adoption in sciences. The realities of real world data analysis are that (a) data has flaws, and (b) the models and assumptions that we bring to the data are inevitably flawed, and/or biased and misspecified in some way. Data mining can improve data analysis by detecting anomalies in the data, check for consistency of the user model assumptions, and decipher complex patterns and relationships that would not be possible otherwise. The common form of data collected from in situ spacecraft measurements is multi-variate time series which represents one of the most challenging problems in data mining. We have successfully developed algorithms to deal with such data and have extended the algorithms to handle streaming data. In this talk, we illustrate the utility of our algorithms through several examples including automated detection of reconnection exhausts in the solar wind and flux ropes in the magnetotail. We also show examples from successful applications of our technique to analysis of 3D kinetic simulations. With an eye to the future, we provide an overview of our upcoming plans that include collaborative data mining, expert outsourcing data mining, computer vision for image analysis, among others. Finally, we discuss the integration of data mining algorithms with web-based services such as VxOs and other Heliophysics data centers and the resulting capabilities that it would enable.

  4. Data Mining Citizen Science Results

    NASA Astrophysics Data System (ADS)

    Borne, K. D.

    2012-12-01

    Scientific discovery from big data is enabled through multiple channels, including data mining (through the application of machine learning algorithms) and human computation (commonly implemented through citizen science tasks). We will describe the results of new data mining experiments on the results from citizen science activities. Discovering patterns, trends, and anomalies in data are among the powerful contributions of citizen science. Establishing scientific algorithms that can subsequently re-discover the same types of patterns, trends, and anomalies in automatic data processing pipelines will ultimately result from the transformation of those human algorithms into computer algorithms, which can then be applied to much larger data collections. Scientific discovery from big data is thus greatly amplified through the marriage of data mining with citizen science.

  5. Effective application of improved profit-mining algorithm for the interday trading model.

    PubMed

    Hsieh, Yu-Lung; Yang, Don-Lin; Wu, Jungpin

    2014-01-01

    Many real world applications of association rule mining from large databases help users make better decisions. However, they do not work well in financial markets at this time. In addition to a high profit, an investor also looks for a low risk trading with a better rate of winning. The traditional approach of using minimum confidence and support thresholds needs to be changed. Based on an interday model of trading, we proposed effective profit-mining algorithms which provide investors with profit rules including information about profit, risk, and winning rate. Since profit-mining in the financial market is still in its infant stage, it is important to detail the inner working of mining algorithms and illustrate the best way to apply them. In this paper we go into details of our improved profit-mining algorithm and showcase effective applications with experiments using real world trading data. The results show that our approach is practical and effective with good performance for various datasets.

  6. Effective Application of Improved Profit-Mining Algorithm for the Interday Trading Model

    PubMed Central

    Wu, Jungpin

    2014-01-01

    Many real world applications of association rule mining from large databases help users make better decisions. However, they do not work well in financial markets at this time. In addition to a high profit, an investor also looks for a low risk trading with a better rate of winning. The traditional approach of using minimum confidence and support thresholds needs to be changed. Based on an interday model of trading, we proposed effective profit-mining algorithms which provide investors with profit rules including information about profit, risk, and winning rate. Since profit-mining in the financial market is still in its infant stage, it is important to detail the inner working of mining algorithms and illustrate the best way to apply them. In this paper we go into details of our improved profit-mining algorithm and showcase effective applications with experiments using real world trading data. The results show that our approach is practical and effective with good performance for various datasets. PMID:24688442

  7. The LSST Data Mining Research Agenda

    NASA Astrophysics Data System (ADS)

    Borne, K.; Becla, J.; Davidson, I.; Szalay, A.; Tyson, J. A.

    2008-12-01

    We describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. The data mining research agenda includes: scalability (at petabytes scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; designing a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night) multi-resolution methods for exploration of petascale databases; indexing of multi-attribute multi-dimensional astronomical databases (beyond spatial indexing) for rapid querying of petabyte databases; and more.

  8. Data Mining.

    ERIC Educational Resources Information Center

    Benoit, Gerald

    2002-01-01

    Discusses data mining (DM) and knowledge discovery in databases (KDD), taking the view that KDD is the larger view of the entire process, with DM emphasizing the cleaning, warehousing, mining, and visualization of knowledge discovery in databases. Highlights include algorithms; users; the Internet; text mining; and information extraction.…

  9. Fusion of multiple quadratic penalty function support vector machines (QPFSVM) for automated sea mine detection and classification

    NASA Astrophysics Data System (ADS)

    Dobeck, Gerald J.; Cobb, J. Tory

    2002-08-01

    The high-resolution sonar is one of the principal sensors used by the Navy to detect and classify sea mines in minehunting operations. For such sonar systems, substantial effort has been devoted to the development of automated detection and classification (D/C) algorithms. These have been spurred by several factors including (1) aids for operators to reduce work overload, (2) more optimal use of all available data, and (3) the introduction of unmanned minehunting systems. The environments where sea mines are typically laid (harbor areas, shipping lanes, and the littorals) give rise to many false alarms caused by natural, biologic, and man-made clutter. The objective of the automated D/C algorithms is to eliminate most of these false alarms while still maintaining a very high probability of mine detection and classification (PdPc). In recent years, the benefits of fusing the outputs of multiple D/C algorithms have been studied. We refer to this as Algorithm Fusion. The results have been remarkable, including reliable robustness to new environments. The Quadratic Penalty Function Support Vector Machine (QPFSVM) algorithm to aid in the automated detection and classification of sea mines is introduced in this paper. The QPFSVM algorithm is easy to train, simple to implement, and robust to feature space dimension. Outputs of successive SVM algorithms are cascaded in stages (fused) to improve the Probability of Classification (Pc) and reduce the number of false alarms. Even though our experience has been gained in the area of sea mine detection and classification, the principles described herein are general and can be applied to fusion of any D/C problem (e.g., automated medical diagnosis or automatic target recognition for ballistic missile defense).

  10. Application of the pessimistic pruning to increase the accuracy of C4.5 algorithm in diagnosing chronic kidney disease

    NASA Astrophysics Data System (ADS)

    Muslim, M. A.; Herowati, A. J.; Sugiharti, E.; Prasetiyo, B.

    2018-03-01

    A technique to dig valuable information buried or hidden in data collection which is so big to be found an interesting patterns that was previously unknown is called data mining. Data mining has been applied in the healthcare industry. One technique used data mining is classification. The decision tree included in the classification of data mining and algorithm developed by decision tree is C4.5 algorithm. A classifier is designed using applying pessimistic pruning in C4.5 algorithm in diagnosing chronic kidney disease. Pessimistic pruning use to identify and remove branches that are not needed, this is done to avoid overfitting the decision tree generated by the C4.5 algorithm. In this paper, the result obtained using these classifiers are presented and discussed. Using pessimistic pruning shows increase accuracy of C4.5 algorithm of 1.5% from 95% to 96.5% in diagnosing of chronic kidney disease.

  11. Prediction model for peninsular Indian summer monsoon rainfall using data mining and statistical approaches

    NASA Astrophysics Data System (ADS)

    Vathsala, H.; Koolagudi, Shashidhar G.

    2017-01-01

    In this paper we discuss a data mining application for predicting peninsular Indian summer monsoon rainfall, and propose an algorithm that combine data mining and statistical techniques. We select likely predictors based on association rules that have the highest confidence levels. We then cluster the selected predictors to reduce their dimensions and use cluster membership values for classification. We derive the predictors from local conditions in southern India, including mean sea level pressure, wind speed, and maximum and minimum temperatures. The global condition variables include southern oscillation and Indian Ocean dipole conditions. The algorithm predicts rainfall in five categories: Flood, Excess, Normal, Deficit and Drought. We use closed itemset mining, cluster membership calculations and a multilayer perceptron function in the algorithm to predict monsoon rainfall in peninsular India. Using Indian Institute of Tropical Meteorology data, we found the prediction accuracy of our proposed approach to be exceptionally good.

  12. Application of Three Existing Stope Boundary Optimisation Methods in an Operating Underground Mine

    NASA Astrophysics Data System (ADS)

    Erdogan, Gamze; Yavuz, Mahmut

    2017-12-01

    The underground mine planning and design optimisation process have received little attention because of complexity and variability of problems in underground mines. Although a number of optimisation studies and software tools are available and some of them, in special, have been implemented effectively to determine the ultimate-pit limits in an open pit mine, there is still a lack of studies for optimisation of ultimate stope boundaries in underground mines. The proposed approaches for this purpose aim at maximizing the economic profit by selecting the best possible layout under operational, technical and physical constraints. In this paper, the existing three heuristic techniques including Floating Stope Algorithm, Maximum Value Algorithm and Mineable Shape Optimiser (MSO) are examined for optimisation of stope layout in a case study. Each technique is assessed in terms of applicability, algorithm capabilities and limitations considering the underground mine planning challenges. Finally, the results are evaluated and compared.

  13. Apriori Versions Based on MapReduce for Mining Frequent Patterns on Big Data.

    PubMed

    Luna, Jose Maria; Padillo, Francisco; Pechenizkiy, Mykola; Ventura, Sebastian

    2017-09-27

    Pattern mining is one of the most important tasks to extract meaningful and useful information from raw data. This task aims to extract item-sets that represent any type of homogeneity and regularity in data. Although many efficient algorithms have been developed in this regard, the growing interest in data has caused the performance of existing pattern mining techniques to be dropped. The goal of this paper is to propose new efficient pattern mining algorithms to work in big data. To this aim, a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation have been proposed. The proposed algorithms can be divided into three main groups. First, two algorithms [Apriori MapReduce (AprioriMR) and iterative AprioriMR] with no pruning strategy are proposed, which extract any existing item-set in data. Second, two algorithms (space pruning AprioriMR and top AprioriMR) that prune the search space by means of the well-known anti-monotone property are proposed. Finally, a last algorithm (maximal AprioriMR) is also proposed for mining condensed representations of frequent patterns. To test the performance of the proposed algorithms, a varied collection of big data datasets have been considered, comprising up to 3 · 10#x00B9;⁸ transactions and more than 5 million of distinct single-items. The experimental stage includes comparisons against highly efficient and well-known pattern mining algorithms. Results reveal the interest of applying MapReduce versions when complex problems are considered, and also the unsuitability of this paradigm when dealing with small data.

  14. Data Mining and Machine Learning in Astronomy

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Brunner, Robert J.

    We review the current state of data mining and machine learning in astronomy. Data Mining can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those in which data mining techniques directly contributed to improving science, and important current and future directions, including probability density functions, parallel algorithms, Peta-Scale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.

  15. Finding Frequent Closed Itemsets in Sliding Window in Linear Time

    NASA Astrophysics Data System (ADS)

    Chen, Junbo; Zhou, Bo; Chen, Lu; Wang, Xinyu; Ding, Yiqun

    One of the most well-studied problems in data mining is computing the collection of frequent itemsets in large transactional databases. Since the introduction of the famous Apriori algorithm [14], many others have been proposed to find the frequent itemsets. Among such algorithms, the approach of mining closed itemsets has raised much interest in data mining community. The algorithms taking this approach include TITANIC [8], CLOSET+[6], DCI-Closed [4], FCI-Stream [3], GC-Tree [15], TGC-Tree [16] etc. Among these algorithms, FCI-Stream, GC-Tree and TGC-Tree are online algorithms work under sliding window environments. By the performance evaluation in [16], GC-Tree [15] is the fastest one. In this paper, an improved algorithm based on GC-Tree is proposed, the computational complexity of which is proved to be a linear combination of the average transaction size and the average closed itemset size. The algorithm is based on the essential theorem presented in Sect. 4.2. Empirically, the new algorithm is several orders of magnitude faster than the state of art algorithm, GC-Tree.

  16. Data Streams: An Overview and Scientific Applications

    NASA Astrophysics Data System (ADS)

    Aggarwal, Charu C.

    In recent years, advances in hardware technology have facilitated the ability to collect data continuously. Simple transactions of everyday life such as using a credit card, a phone, or browsing the web lead to automated data storage. Similarly, advances in information technology have lead to large flows of data across IP networks. In many cases, these large volumes of data can be mined for interesting and relevant information in a wide variety of applications. When the volume of the underlying data is very large, it leads to a number of computational and mining challenges: With increasing volume of the data, it is no longer possible to process the data efficiently by using multiple passes. Rather, one can process a data item at most once. This leads to constraints on the implementation of the underlying algorithms. Therefore, stream mining algorithms typically need to be designed so that the algorithms work with one pass of the data. In most cases, there is an inherent temporal component to the stream mining process. This is because the data may evolve over time. This behavior of data streams is referred to as temporal locality. Therefore, a straightforward adaptation of one-pass mining algorithms may not be an effective solution to the task. Stream mining algorithms need to be carefully designed with a clear focus on the evolution of the underlying data. Another important characteristic of data streams is that they are often mined in a distributed fashion. Furthermore, the individual processors may have limited processing and memory. Examples of such cases include sensor networks, in which it may be desirable to perform in-network processing of data stream with limited processing and memory [1, 2]. This chapter will provide an overview of the key challenges in stream mining algorithms which arise from the unique setup in which these problems are encountered. This chapter is organized as follows. In the next section, we will discuss the generic challenges that stream mining poses to a variety of data management and data mining problems. The next section also deals with several issues which arise in the context of data stream management. In Sect. 3, we discuss several mining algorithms on the data stream model. Section 4 discusses various scientific applications of data streams. Section 5 discusses the research directions and conclusions.

  17. Mining algorithm for association rules in big data based on Hadoop

    NASA Astrophysics Data System (ADS)

    Fu, Chunhua; Wang, Xiaojing; Zhang, Lijun; Qiao, Liying

    2018-04-01

    In order to solve the problem that the traditional association rules mining algorithm has been unable to meet the mining needs of large amount of data in the aspect of efficiency and scalability, take FP-Growth as an example, the algorithm is realized in the parallelization based on Hadoop framework and Map Reduce model. On the basis, it is improved using the transaction reduce method for further enhancement of the algorithm's mining efficiency. The experiment, which consists of verification of parallel mining results, comparison on efficiency between serials and parallel, variable relationship between mining time and node number and between mining time and data amount, is carried out in the mining results and efficiency by Hadoop clustering. Experiments show that the paralleled FP-Growth algorithm implemented is able to accurately mine frequent item sets, with a better performance and scalability. It can be better to meet the requirements of big data mining and efficiently mine frequent item sets and association rules from large dataset.

  18. Fast Algorithms for Mining Co-evolving Time Series

    DTIC Science & Technology

    2011-09-01

    Keogh et al., 2001, 2004] and (b) forecasting, like an autoregressive integrated moving average model ( ARIMA ) and related meth- ods [Box et al., 1994...computing hardware? We develop models to mine time series with missing values, to extract compact representation from time sequences, to segment the...sequences, and to do forecasting. For large scale data, we propose algorithms for learning time series models , in particular, including Linear Dynamical

  19. Physics Mining of Multi-Source Data Sets

    NASA Technical Reports Server (NTRS)

    Helly, John; Karimabadi, Homa; Sipes, Tamara

    2012-01-01

    Powerful new parallel data mining algorithms can produce diagnostic and prognostic numerical models and analyses from observational data. These techniques yield higher-resolution measures than ever before of environmental parameters by fusing synoptic imagery and time-series measurements. These techniques are general and relevant to observational data, including raster, vector, and scalar, and can be applied in all Earth- and environmental science domains. Because they can be highly automated and are parallel, they scale to large spatial domains and are well suited to change and gap detection. This makes it possible to analyze spatial and temporal gaps in information, and facilitates within-mission replanning to optimize the allocation of observational resources. The basis of the innovation is the extension of a recently developed set of algorithms packaged into MineTool to multi-variate time-series data. MineTool is unique in that it automates the various steps of the data mining process, thus making it amenable to autonomous analysis of large data sets. Unlike techniques such as Artificial Neural Nets, which yield a blackbox solution, MineTool's outcome is always an analytical model in parametric form that expresses the output in terms of the input variables. This has the advantage that the derived equation can then be used to gain insight into the physical relevance and relative importance of the parameters and coefficients in the model. This is referred to as physics-mining of data. The capabilities of MineTool are extended to include both supervised and unsupervised algorithms, handle multi-type data sets, and parallelize it.

  20. A Comparative Study of Frequent and Maximal Periodic Pattern Mining Algorithms in Spatiotemporal Databases

    NASA Astrophysics Data System (ADS)

    Obulesu, O.; Rama Mohan Reddy, A., Dr; Mahendra, M.

    2017-08-01

    Detecting regular and efficient cyclic models is the demanding activity for data analysts due to unstructured, vigorous and enormous raw information produced from web. Many existing approaches generate large candidate patterns in the occurrence of huge and complex databases. In this work, two novel algorithms are proposed and a comparative examination is performed by considering scalability and performance parameters. The first algorithm is, EFPMA (Extended Regular Model Detection Algorithm) used to find frequent sequential patterns from the spatiotemporal dataset and the second one is, ETMA (Enhanced Tree-based Mining Algorithm) for detecting effective cyclic models with symbolic database representation. EFPMA is an algorithm grows models from both ends (prefixes and suffixes) of detected patterns, which results in faster pattern growth because of less levels of database projection compared to existing approaches such as Prefixspan and SPADE. ETMA uses distinct notions to store and manage transactions data horizontally such as segment, sequence and individual symbols. ETMA exploits a partition-and-conquer method to find maximal patterns by using symbolic notations. Using this algorithm, we can mine cyclic models in full-series sequential patterns including subsection series also. ETMA reduces the memory consumption and makes use of the efficient symbolic operation. Furthermore, ETMA only records time-series instances dynamically, in terms of character, series and section approaches respectively. The extent of the pattern and proving efficiency of the reducing and retrieval techniques from synthetic and actual datasets is a really open & challenging mining problem. These techniques are useful in data streams, traffic risk analysis, medical diagnosis, DNA sequence Mining, Earthquake prediction applications. Extensive investigational outcomes illustrates that the algorithms outperforms well towards efficiency and scalability than ECLAT, STNR and MAFIA approaches.

  1. Using an improved association rules mining optimization algorithm in web-based mobile-learning system

    NASA Astrophysics Data System (ADS)

    Huang, Yin; Chen, Jianhua; Xiong, Shaojun

    2009-07-01

    Mobile-Learning (M-learning) makes many learners get the advantages of both traditional learning and E-learning. Currently, Web-based Mobile-Learning Systems have created many new ways and defined new relationships between educators and learners. Association rule mining is one of the most important fields in data mining and knowledge discovery in databases. Rules explosion is a serious problem which causes great concerns, as conventional mining algorithms often produce too many rules for decision makers to digest. Since Web-based Mobile-Learning System collects vast amounts of student profile data, data mining and knowledge discovery techniques can be applied to find interesting relationships between attributes of learners, assessments, the solution strategies adopted by learners and so on. Therefore ,this paper focus on a new data-mining algorithm, combined with the advantages of genetic algorithm and simulated annealing algorithm , called ARGSA(Association rules based on an improved Genetic Simulated Annealing Algorithm), to mine the association rules. This paper first takes advantage of the Parallel Genetic Algorithm and Simulated Algorithm designed specifically for discovering association rules. Moreover, the analysis and experiment are also made to show the proposed method is superior to the Apriori algorithm in this Mobile-Learning system.

  2. Developing image processing meta-algorithms with data mining of multiple metrics.

    PubMed

    Leung, Kelvin; Cunha, Alexandre; Toga, A W; Parker, D Stott

    2014-01-01

    People often use multiple metrics in image processing, but here we take a novel approach of mining the values of batteries of metrics on image processing results. We present a case for extending image processing methods to incorporate automated mining of multiple image metric values. Here by a metric we mean any image similarity or distance measure, and in this paper we consider intensity-based and statistical image measures and focus on registration as an image processing problem. We show how it is possible to develop meta-algorithms that evaluate different image processing results with a number of different metrics and mine the results in an automated fashion so as to select the best results. We show that the mining of multiple metrics offers a variety of potential benefits for many image processing problems, including improved robustness and validation.

  3. Developing Image Processing Meta-Algorithms with Data Mining of Multiple Metrics

    PubMed Central

    Cunha, Alexandre; Toga, A. W.; Parker, D. Stott

    2014-01-01

    People often use multiple metrics in image processing, but here we take a novel approach of mining the values of batteries of metrics on image processing results. We present a case for extending image processing methods to incorporate automated mining of multiple image metric values. Here by a metric we mean any image similarity or distance measure, and in this paper we consider intensity-based and statistical image measures and focus on registration as an image processing problem. We show how it is possible to develop meta-algorithms that evaluate different image processing results with a number of different metrics and mine the results in an automated fashion so as to select the best results. We show that the mining of multiple metrics offers a variety of potential benefits for many image processing problems, including improved robustness and validation. PMID:24653748

  4. Improve Data Mining and Knowledge Discovery Through the Use of MatLab

    NASA Technical Reports Server (NTRS)

    Shaykhian, Gholam Ali; Martin, Dawn (Elliott); Beil, Robert

    2011-01-01

    Data mining is widely used to mine business, engineering, and scientific data. Data mining uses pattern based queries, searches, or other analyses of one or more electronic databases/datasets in order to discover or locate a predictive pattern or anomaly indicative of system failure, criminal or terrorist activity, etc. There are various algorithms, techniques and methods used to mine data; including neural networks, genetic algorithms, decision trees, nearest neighbor method, rule induction association analysis, slice and dice, segmentation, and clustering. These algorithms, techniques and methods used to detect patterns in a dataset, have been used in the development of numerous open source and commercially available products and technology for data mining. Data mining is best realized when latent information in a large quantity of data stored is discovered. No one technique solves all data mining problems; challenges are to select algorithms or methods appropriate to strengthen data/text mining and trending within given datasets. In recent years, throughout industry, academia and government agencies, thousands of data systems have been designed and tailored to serve specific engineering and business needs. Many of these systems use databases with relational algebra and structured query language to categorize and retrieve data. In these systems, data analyses are limited and require prior explicit knowledge of metadata and database relations; lacking exploratory data mining and discoveries of latent information. This presentation introduces MatLab(R) (MATrix LABoratory), an engineering and scientific data analyses tool to perform data mining. MatLab was originally intended to perform purely numerical calculations (a glorified calculator). Now, in addition to having hundreds of mathematical functions, it is a programming language with hundreds built in standard functions and numerous available toolboxes. MatLab's ease of data processing, visualization and its enormous availability of built in functionalities and toolboxes make it suitable to perform numerical computations and simulations as well as a data mining tool. Engineers and scientists can take advantage of the readily available functions/toolboxes to gain wider insight in their perspective data mining experiments.

  5. Improve Data Mining and Knowledge Discovery through the use of MatLab

    NASA Technical Reports Server (NTRS)

    Shaykahian, Gholan Ali; Martin, Dawn Elliott; Beil, Robert

    2011-01-01

    Data mining is widely used to mine business, engineering, and scientific data. Data mining uses pattern based queries, searches, or other analyses of one or more electronic databases/datasets in order to discover or locate a predictive pattern or anomaly indicative of system failure, criminal or terrorist activity, etc. There are various algorithms, techniques and methods used to mine data; including neural networks, genetic algorithms, decision trees, nearest neighbor method, rule induction association analysis, slice and dice, segmentation, and clustering. These algorithms, techniques and methods used to detect patterns in a dataset, have been used in the development of numerous open source and commercially available products and technology for data mining. Data mining is best realized when latent information in a large quantity of data stored is discovered. No one technique solves all data mining problems; challenges are to select algorithms or methods appropriate to strengthen data/text mining and trending within given datasets. In recent years, throughout industry, academia and government agencies, thousands of data systems have been designed and tailored to serve specific engineering and business needs. Many of these systems use databases with relational algebra and structured query language to categorize and retrieve data. In these systems, data analyses are limited and require prior explicit knowledge of metadata and database relations; lacking exploratory data mining and discoveries of latent information. This presentation introduces MatLab(TradeMark)(MATrix LABoratory), an engineering and scientific data analyses tool to perform data mining. MatLab was originally intended to perform purely numerical calculations (a glorified calculator). Now, in addition to having hundreds of mathematical functions, it is a programming language with hundreds built in standard functions and numerous available toolboxes. MatLab's ease of data processing, visualization and its enormous availability of built in functionalities and toolboxes make it suitable to perform numerical computations and simulations as well as a data mining tool. Engineers and scientists can take advantage of the readily available functions/toolboxes to gain wider insight in their perspective data mining experiments.

  6. Data mining for multiagent rules, strategies, and fuzzy decision tree structure

    NASA Astrophysics Data System (ADS)

    Smith, James F., III; Rhyne, Robert D., II; Fisher, Kristin

    2002-03-01

    A fuzzy logic based resource manager (RM) has been developed that automatically allocates electronic attack resources in real-time over many dissimilar platforms. Two different data mining algorithms have been developed to determine rules, strategies, and fuzzy decision tree structure. The first data mining algorithm uses a genetic algorithm as a data mining function and is called from an electronic game. The game allows a human expert to play against the resource manager in a simulated battlespace with each of the defending platforms being exclusively directed by the fuzzy resource manager and the attacking platforms being controlled by the human expert or operating autonomously under their own logic. This approach automates the data mining problem. The game automatically creates a database reflecting the domain expert's knowledge. It calls a data mining function, a genetic algorithm, for data mining of the database as required and allows easy evaluation of the information mined in the second step. The criterion for re- optimization is discussed as well as experimental results. Then a second data mining algorithm that uses a genetic program as a data mining function is introduced to automatically discover fuzzy decision tree structures. Finally, a fuzzy decision tree generated through this process is discussed.

  7. Collaborative mining and transfer learning for relational data

    NASA Astrophysics Data System (ADS)

    Levchuk, Georgiy; Eslami, Mohammed

    2015-06-01

    Many of the real-world problems, - including human knowledge, communication, biological, and cyber network analysis, - deal with data entities for which the essential information is contained in the relations among those entities. Such data must be modeled and analyzed as graphs, with attributes on both objects and relations encode and differentiate their semantics. Traditional data mining algorithms were originally designed for analyzing discrete objects for which a set of features can be defined, and thus cannot be easily adapted to deal with graph data. This gave rise to the relational data mining field of research, of which graph pattern learning is a key sub-domain [11]. In this paper, we describe a model for learning graph patterns in collaborative distributed manner. Distributed pattern learning is challenging due to dependencies between the nodes and relations in the graph, and variability across graph instances. We present three algorithms that trade-off benefits of parallelization and data aggregation, compare their performance to centralized graph learning, and discuss individual benefits and weaknesses of each model. Presented algorithms are designed for linear speedup in distributed computing environments, and learn graph patterns that are both closer to ground truth and provide higher detection rates than centralized mining algorithm.

  8. Distributed Function Mining for Gene Expression Programming Based on Fast Reduction.

    PubMed

    Deng, Song; Yue, Dong; Yang, Le-chan; Fu, Xiong; Feng, Ya-zhou

    2016-01-01

    For high-dimensional and massive data sets, traditional centralized gene expression programming (GEP) or improved algorithms lead to increased run-time and decreased prediction accuracy. To solve this problem, this paper proposes a new improved algorithm called distributed function mining for gene expression programming based on fast reduction (DFMGEP-FR). In DFMGEP-FR, fast attribution reduction in binary search algorithms (FAR-BSA) is proposed to quickly find the optimal attribution set, and the function consistency replacement algorithm is given to solve integration of the local function model. Thorough comparative experiments for DFMGEP-FR, centralized GEP and the parallel gene expression programming algorithm based on simulated annealing (parallel GEPSA) are included in this paper. For the waveform, mushroom, connect-4 and musk datasets, the comparative results show that the average time-consumption of DFMGEP-FR drops by 89.09%%, 88.85%, 85.79% and 93.06%, respectively, in contrast to centralized GEP and by 12.5%, 8.42%, 9.62% and 13.75%, respectively, compared with parallel GEPSA. Six well-studied UCI test data sets demonstrate the efficiency and capability of our proposed DFMGEP-FR algorithm for distributed function mining.

  9. Data Mining Research with the LSST

    NASA Astrophysics Data System (ADS)

    Borne, Kirk D.; Strauss, M. A.; Tyson, J. A.

    2007-12-01

    The LSST catalog database will exceed 10 petabytes, comprising several hundred attributes for 5 billion galaxies, 10 billion stars, and over 1 billion variable sources (optical variables, transients, or moving objects), extracted from over 20,000 square degrees of deep imaging in 5 passbands with thorough time domain coverage: 1000 visits over the 10-year LSST survey lifetime. The opportunities are enormous for novel scientific discoveries within this rich time-domain ultra-deep multi-band survey database. Data Mining, Machine Learning, and Knowledge Discovery research opportunities with the LSST are now under study, with a potential for new collaborations to develop to contribute to these investigations. We will describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. We also give some illustrative examples of current scientific data mining research in astronomy, and point out where new research is needed. In particular, the data mining research community will need to address several issues in the coming years as we prepare for the LSST data deluge. The data mining research agenda includes: scalability (at petabytes scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; designing a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for exploration of petascale databases; visual data mining algorithms for visual exploration of the data; indexing of multi-attribute multi-dimensional astronomical databases (beyond RA-Dec spatial indexing) for rapid querying of petabyte databases; and more. Finally, we will identify opportunities for synergistic collaboration between the data mining research group and the LSST Data Management and Science Collaboration teams.

  10. Research on parallel algorithm for sequential pattern mining

    NASA Astrophysics Data System (ADS)

    Zhou, Lijuan; Qin, Bai; Wang, Yu; Hao, Zhongxiao

    2008-03-01

    Sequential pattern mining is the mining of frequent sequences related to time or other orders from the sequence database. Its initial motivation is to discover the laws of customer purchasing in a time section by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field has not been confined to the business database and has extended to new data sources such as Web and advanced science fields such as DNA analysis. The data of sequential pattern mining has characteristics as follows: mass data amount and distributed storage. Most existing sequential pattern mining algorithms haven't considered the above-mentioned characteristics synthetically. According to the traits mentioned above and combining the parallel theory, this paper puts forward a new distributed parallel algorithm SPP(Sequential Pattern Parallel). The algorithm abides by the principal of pattern reduction and utilizes the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent item sets applying frequent concept and search space partition theory and the second task is to structure frequent sequences using the depth-first search method at each processor. The algorithm only needs to access the database twice and doesn't generate the candidated sequences, which abates the access time and improves the mining efficiency. Based on the random data generation procedure and different information structure designed, this paper simulated the SPP algorithm in a concrete parallel environment and implemented the AprioriAll algorithm. The experiments demonstrate that compared with AprioriAll, the SPP algorithm had excellent speedup factor and efficiency.

  11. Handling Dynamic Weights in Weighted Frequent Pattern Mining

    NASA Astrophysics Data System (ADS)

    Ahmed, Chowdhury Farhan; Tanbeer, Syed Khairuzzaman; Jeong, Byeong-Soo; Lee, Young-Koo

    Even though weighted frequent pattern (WFP) mining is more effective than traditional frequent pattern mining because it can consider different semantic significances (weights) of items, existing WFP algorithms assume that each item has a fixed weight. But in real world scenarios, the weight (price or significance) of an item can vary with time. Reflecting these changes in item weight is necessary in several mining applications, such as retail market data analysis and web click stream analysis. In this paper, we introduce the concept of a dynamic weight for each item, and propose an algorithm, DWFPM (dynamic weighted frequent pattern mining), that makes use of this concept. Our algorithm can address situations where the weight (price or significance) of an item varies dynamically. It exploits a pattern growth mining technique to avoid the level-wise candidate set generation-and-test methodology. Furthermore, it requires only one database scan, so it is eligible for use in stream data mining. An extensive performance analysis shows that our algorithm is efficient and scalable for WFP mining using dynamic weights.

  12. Data Mining Methods for Recommender Systems

    NASA Astrophysics Data System (ADS)

    Amatriain, Xavier; Jaimes*, Alejandro; Oliver, Nuria; Pujol, Josep M.

    In this chapter, we give an overview of the main Data Mining techniques used in the context of Recommender Systems. We first describe common preprocessing methods such as sampling or dimensionality reduction. Next, we review the most important classification techniques, including Bayesian Networks and Support Vector Machines. We describe the k-means clustering algorithm and discuss several alternatives. We also present association rules and related algorithms for an efficient training process. In addition to introducing these techniques, we survey their uses in Recommender Systems and present cases where they have been successfully applied.

  13. Image Information Mining Utilizing Hierarchical Segmentation

    NASA Technical Reports Server (NTRS)

    Tilton, James C.; Marchisio, Giovanni; Koperski, Krzysztof; Datcu, Mihai

    2002-01-01

    The Hierarchical Segmentation (HSEG) algorithm is an approach for producing high quality, hierarchically related image segmentations. The VisiMine image information mining system utilizes clustering and segmentation algorithms for reducing visual information in multispectral images to a manageable size. The project discussed herein seeks to enhance the VisiMine system through incorporating hierarchical segmentations from HSEG into the VisiMine system.

  14. An Incremental High-Utility Mining Algorithm with Transaction Insertion

    PubMed Central

    Gan, Wensheng; Zhang, Binbin

    2015-01-01

    Association-rule mining is commonly used to discover useful and meaningful patterns from a very large database. It only considers the occurrence frequencies of items to reveal the relationships among itemsets. Traditional association-rule mining is, however, not suitable in real-world applications since the purchased items from a customer may have various factors, such as profit or quantity. High-utility mining was designed to solve the limitations of association-rule mining by considering both the quantity and profit measures. Most algorithms of high-utility mining are designed to handle the static database. Fewer researches handle the dynamic high-utility mining with transaction insertion, thus requiring the computations of database rescan and combination explosion of pattern-growth mechanism. In this paper, an efficient incremental algorithm with transaction insertion is designed to reduce computations without candidate generation based on the utility-list structures. The enumeration tree and the relationships between 2-itemsets are also adopted in the proposed algorithm to speed up the computations. Several experiments are conducted to show the performance of the proposed algorithm in terms of runtime, memory consumption, and number of generated patterns. PMID:25811038

  15. Data mining in soft computing framework: a survey.

    PubMed

    Mitra, S; Pal, S K; Mitra, P

    2002-01-01

    The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete/noisy data, mixed media information and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed media data, based on some preference criterion/objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included.

  16. Non-Convex Sparse and Low-Rank Based Robust Subspace Segmentation for Data Mining.

    PubMed

    Cheng, Wenlong; Zhao, Mingbo; Xiong, Naixue; Chui, Kwok Tai

    2017-07-15

    Parsimony, including sparsity and low-rank, has shown great importance for data mining in social networks, particularly in tasks such as segmentation and recognition. Traditionally, such modeling approaches rely on an iterative algorithm that minimizes an objective function with convex l ₁-norm or nuclear norm constraints. However, the obtained results by convex optimization are usually suboptimal to solutions of original sparse or low-rank problems. In this paper, a novel robust subspace segmentation algorithm has been proposed by integrating l p -norm and Schatten p -norm constraints. Our so-obtained affinity graph can better capture local geometrical structure and the global information of the data. As a consequence, our algorithm is more generative, discriminative and robust. An efficient linearized alternating direction method is derived to realize our model. Extensive segmentation experiments are conducted on public datasets. The proposed algorithm is revealed to be more effective and robust compared to five existing algorithms.

  17. Semisupervised GDTW kernel-based fuzzy c-means algorithm for mapping vegetation dynamics in mining region using normalized difference vegetation index time series

    NASA Astrophysics Data System (ADS)

    Jia, Duo; Wang, Cangjiao; Lei, Shaogang

    2018-01-01

    Mapping vegetation dynamic types in mining areas is significant for revealing the mechanisms of environmental damage and for guiding ecological construction. Dynamic types of vegetation can be identified by applying interannual normalized difference vegetation index (NDVI) time series. However, phase differences and time shifts in interannual time series decrease mapping accuracy in mining regions. To overcome these problems and to increase the accuracy of mapping vegetation dynamics, an interannual Landsat time series for optimum vegetation growing status was constructed first by using the enhanced spatial and temporal adaptive reflectance fusion model algorithm. We then proposed a Markov random field optimized semisupervised Gaussian dynamic time warping kernel-based fuzzy c-means (FCM) cluster algorithm for interannual NDVI time series to map dynamic vegetation types in mining regions. The proposed algorithm has been tested in the Shengli mining region and Shendong mining region, which are typical representatives of China's open-pit and underground mining regions, respectively. Experiments show that the proposed algorithm can solve the problems of phase differences and time shifts to achieve better performance when mapping vegetation dynamic types. The overall accuracies for the Shengli and Shendong mining regions were 93.32% and 89.60%, respectively, with improvements of 7.32% and 25.84% when compared with the original semisupervised FCM algorithm.

  18. Recommending Learning Activities in Social Network Using Data Mining Algorithms

    ERIC Educational Resources Information Center

    Mahnane, Lamia

    2017-01-01

    In this paper, we show how data mining algorithms (e.g. Apriori Algorithm (AP) and Collaborative Filtering (CF)) is useful in New Social Network (NSN-AP-CF). "NSN-AP-CF" processes the clusters based on different learning styles. Next, it analyzes the habits and the interests of the users through mining the frequent episodes by the…

  19. SPMBR: a scalable algorithm for mining sequential patterns based on bitmaps

    NASA Astrophysics Data System (ADS)

    Xu, Xiwei; Zhang, Changhai

    2013-12-01

    Now some sequential patterns mining algorithms generate too many candidate sequences, and increase the processing cost of support counting. Therefore, we present an effective and scalable algorithm called SPMBR (Sequential Patterns Mining based on Bitmap Representation) to solve the problem of mining the sequential patterns for large databases. Our method differs from previous related works of mining sequential patterns. The main difference is that the database of sequential patterns is represented by bitmaps, and a simplified bitmap structure is presented firstly. In this paper, First the algorithm generate candidate sequences by SE(Sequence Extension) and IE(Item Extension), and then obtain all frequent sequences by comparing the original bitmap and the extended item bitmap .This method could simplify the problem of mining the sequential patterns and avoid the high processing cost of support counting. Both theories and experiments indicate that the performance of SPMBR is predominant for large transaction databases, the required memory size for storing temporal data is much less during mining process, and all sequential patterns can be mined with feasibility.

  20. A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.

    ERIC Educational Resources Information Center

    Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald

    2002-01-01

    Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause at a Time (OCAT) algorithm which is based on mathematical logic; vector space model (VSM); and comparing the OCAT to the VSM. (Author/LRW)

  1. Prediction of pork quality parameters by applying fractals and data mining on MRI.

    PubMed

    Caballero, Daniel; Pérez-Palacios, Trinidad; Caro, Andrés; Amigo, José Manuel; Dahl, Anders B; ErsbØll, Bjarne K; Antequera, Teresa

    2017-09-01

    This work firstly investigates the use of MRI, fractal algorithms and data mining techniques to determine pork quality parameters non-destructively. The main objective was to evaluate the capability of fractal algorithms (Classical Fractal algorithm, CFA; Fractal Texture Algorithm, FTA and One Point Fractal Texture Algorithm, OPFTA) to analyse MRI in order to predict quality parameters of loin. In addition, the effect of the sequence acquisition of MRI (Gradient echo, GE; Spin echo, SE and Turbo 3D, T3D) and the predictive technique of data mining (Isotonic regression, IR and Multiple linear regression, MLR) were analysed. Both fractal algorithm, FTA and OPFTA are appropriate to analyse MRI of loins. The sequence acquisition, the fractal algorithm and the data mining technique seems to influence on the prediction results. For most physico-chemical parameters, prediction equations with moderate to excellent correlation coefficients were achieved by using the following combinations of acquisition sequences of MRI, fractal algorithms and data mining techniques: SE-FTA-MLR, SE-OPFTA-IR, GE-OPFTA-MLR, SE-OPFTA-MLR, with the last one offering the best prediction results. Thus, SE-OPFTA-MLR could be proposed as an alternative technique to determine physico-chemical traits of fresh and dry-cured loins in a non-destructive way with high accuracy. Copyright © 2017. Published by Elsevier Ltd.

  2. A Node Linkage Approach for Sequential Pattern Mining

    PubMed Central

    Navarro, Osvaldo; Cumplido, René; Villaseñor-Pineda, Luis; Feregrino-Uribe, Claudia; Carrasco-Ochoa, Jesús Ariel

    2014-01-01

    Sequential Pattern Mining is a widely addressed problem in data mining, with applications such as analyzing Web usage, examining purchase behavior, and text mining, among others. Nevertheless, with the dramatic increase in data volume, the current approaches prove inefficient when dealing with large input datasets, a large number of different symbols and low minimum supports. In this paper, we propose a new sequential pattern mining algorithm, which follows a pattern-growth scheme to discover sequential patterns. Unlike most pattern growth algorithms, our approach does not build a data structure to represent the input dataset, but instead accesses the required sequences through pseudo-projection databases, achieving better runtime and reducing memory requirements. Our algorithm traverses the search space in a depth-first fashion and only preserves in memory a pattern node linkage and the pseudo-projections required for the branch being explored at the time. Experimental results show that our new approach, the Node Linkage Depth-First Traversal algorithm (NLDFT), has better performance and scalability in comparison with state of the art algorithms. PMID:24933123

  3. False alarm reduction by the And-ing of multiple multivariate Gaussian classifiers

    NASA Astrophysics Data System (ADS)

    Dobeck, Gerald J.; Cobb, J. Tory

    2003-09-01

    The high-resolution sonar is one of the principal sensors used by the Navy to detect and classify sea mines in minehunting operations. For such sonar systems, substantial effort has been devoted to the development of automated detection and classification (D/C) algorithms. These have been spurred by several factors including (1) aids for operators to reduce work overload, (2) more optimal use of all available data, and (3) the introduction of unmanned minehunting systems. The environments where sea mines are typically laid (harbor areas, shipping lanes, and the littorals) give rise to many false alarms caused by natural, biologic, and man-made clutter. The objective of the automated D/C algorithms is to eliminate most of these false alarms while still maintaining a very high probability of mine detection and classification (PdPc). In recent years, the benefits of fusing the outputs of multiple D/C algorithms have been studied. We refer to this as Algorithm Fusion. The results have been remarkable, including reliable robustness to new environments. This paper describes a method for training several multivariate Gaussian classifiers such that their And-ing dramatically reduces false alarms while maintaining a high probability of classification. This training approach is referred to as the Focused- Training method. This work extends our 2001-2002 work where the Focused-Training method was used with three other types of classifiers: the Attractor-based K-Nearest Neighbor Neural Network (a type of radial-basis, probabilistic neural network), the Optimal Discrimination Filter Classifier (based linear discrimination theory), and the Quadratic Penalty Function Support Vector Machine (QPFSVM). Although our experience has been gained in the area of sea mine detection and classification, the principles described herein are general and can be applied to a wide range of pattern recognition and automatic target recognition (ATR) problems.

  4. Privacy Preserving Nearest Neighbor Search

    NASA Astrophysics Data System (ADS)

    Shaneck, Mark; Kim, Yongdae; Kumar, Vipin

    Data mining is frequently obstructed by privacy concerns. In many cases data is distributed, and bringing the data together in one place for analysis is not possible due to privacy laws (e.g. HIPAA) or policies. Privacy preserving data mining techniques have been developed to address this issue by providing mechanisms to mine the data while giving certain privacy guarantees. In this chapter we address the issue of privacy preserving nearest neighbor search, which forms the kernel of many data mining applications. To this end, we present a novel algorithm based on secure multiparty computation primitives to compute the nearest neighbors of records in horizontally distributed data. We show how this algorithm can be used in three important data mining algorithms, namely LOF outlier detection, SNN clustering, and kNN classification. We prove the security of these algorithms under the semi-honest adversarial model, and describe methods that can be used to optimize their performance. Keywords: Privacy Preserving Data Mining, Nearest Neighbor Search, Outlier Detection, Clustering, Classification, Secure Multiparty Computation

  5. Efficient frequent pattern mining algorithm based on node sets in cloud computing environment

    NASA Astrophysics Data System (ADS)

    Billa, V. N. Vinay Kumar; Lakshmanna, K.; Rajesh, K.; Reddy, M. Praveen Kumar; Nagaraja, G.; Sudheer, K.

    2017-11-01

    The ultimate goal of Data Mining is to determine the hidden information which is useful in making decisions using the large databases collected by an organization. This Data Mining involves many tasks that are to be performed during the process. Mining frequent itemsets is the one of the most important tasks in case of transactional databases. These transactional databases contain the data in very large scale where the mining of these databases involves the consumption of physical memory and time in proportion to the size of the database. A frequent pattern mining algorithm is said to be efficient only if it consumes less memory and time to mine the frequent itemsets from the given large database. Having these points in mind in this thesis we proposed a system which mines frequent itemsets in an optimized way in terms of memory and time by using cloud computing as an important factor to make the process parallel and the application is provided as a service. A complete framework which uses a proven efficient algorithm called FIN algorithm. FIN algorithm works on Nodesets and POC (pre-order coding) tree. In order to evaluate the performance of the system we conduct the experiments to compare the efficiency of the same algorithm applied in a standalone manner and in cloud computing environment on a real time data set which is traffic accidents data set. The results show that the memory consumption and execution time taken for the process in the proposed system is much lesser than those of standalone system.

  6. Block-suffix shifting: fast, simultaneous medical concept set identification in large medical record corpora.

    PubMed

    Liu, Ying; Lita, Lucian Vlad; Niculescu, Radu Stefan; Mitra, Prasenjit; Giles, C Lee

    2008-11-06

    Owing to new advances in computer hardware, large text databases have become more prevalent than ever.Automatically mining information from these databases proves to be a challenge due to slow pattern/string matching techniques. In this paper we present a new, fast multi-string pattern matching method based on the well known Aho-Chorasick algorithm. Advantages of our algorithm include:the ability to exploit the natural structure of text, the ability to perform significant character shifting, avoiding backtracking jumps that are not useful, efficiency in terms of matching time and avoiding the typical "sub-string" false positive errors.Our algorithm is applicable to many fields with free text, such as the health care domain and the scientific document field. In this paper, we apply the BSS algorithm to health care data and mine hundreds of thousands of medical concepts from a large Electronic Medical Record (EMR) corpora simultaneously and efficiently. Experimental results show the superiority of our algorithm when compared with the top of the line multi-string matching algorithms.

  7. Learning Behavior Characterization with Multi-Feature, Hierarchical Activity Sequences

    ERIC Educational Resources Information Center

    Ye, Cheng; Segedy, James R.; Kinnebrew, John S.; Biswas, Gautam

    2015-01-01

    This paper discusses Multi-Feature Hierarchical Sequential Pattern Mining, MFH-SPAM, a novel algorithm that efficiently extracts patterns from students' learning activity sequences. This algorithm extends an existing sequential pattern mining algorithm by dynamically selecting the level of specificity for hierarchically-defined features…

  8. Spectral methods to detect surface mines

    NASA Astrophysics Data System (ADS)

    Winter, Edwin M.; Schatten Silvious, Miranda

    2008-04-01

    Over the past five years, advances have been made in the spectral detection of surface mines under minefield detection programs at the U. S. Army RDECOM CERDEC Night Vision and Electronic Sensors Directorate (NVESD). The problem of detecting surface land mines ranges from the relatively simple, the detection of large anti-vehicle mines on bare soil, to the very difficult, the detection of anti-personnel mines in thick vegetation. While spatial and spectral approaches can be applied to the detection of surface mines, spatial-only detection requires many pixels-on-target such that the mine is actually imaged and shape-based features can be exploited. This method is unreliable in vegetated areas because only part of the mine may be exposed, while spectral detection is possible without the mine being resolved. At NVESD, hyperspectral and multi-spectral sensors throughout the reflection and thermal spectral regimes have been applied to the mine detection problem. Data has been collected on mines in forest and desert regions and algorithms have been developed both to detect the mines as anomalies and to detect the mines based on their spectral signature. In addition to the detection of individual mines, algorithms have been developed to exploit the similarities of mines in a minefield to improve their detection probability. In this paper, the types of spectral data collected over the past five years will be summarized along with the advances in algorithm development.

  9. Discriminative and informative features for biomolecular text mining with ensemble feature selection.

    PubMed

    Van Landeghem, Sofie; Abeel, Thomas; Saeys, Yvan; Van de Peer, Yves

    2010-09-15

    In the field of biomolecular text mining, black box behavior of machine learning systems currently limits understanding of the true nature of the predictions. However, feature selection (FS) is capable of identifying the most relevant features in any supervised learning setting, providing insight into the specific properties of the classification algorithm. This allows us to build more accurate classifiers while at the same time bridging the gap between the black box behavior and the end-user who has to interpret the results. We show that our FS methodology successfully discards a large fraction of machine-generated features, improving classification performance of state-of-the-art text mining algorithms. Furthermore, we illustrate how FS can be applied to gain understanding in the predictions of a framework for biomolecular event extraction from text. We include numerous examples of highly discriminative features that model either biological reality or common linguistic constructs. Finally, we discuss a number of insights from our FS analyses that will provide the opportunity to considerably improve upon current text mining tools. The FS algorithms and classifiers are available in Java-ML (http://java-ml.sf.net). The datasets are publicly available from the BioNLP'09 Shared Task web site (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/).

  10. Matched filter based detection of floating mines in IR spacetime

    NASA Astrophysics Data System (ADS)

    Borghgraef, Alexander; Lapierre, Fabian; Philips, Wilfried; Acheroy, Marc

    2009-09-01

    Ship-based automatic detection of small floating objects on an agitated sea surface remains a hard problem. Our main concern is the detection of floating mines, which proved a real threat to shipping in confined waterways during the first Gulf War, but applications include salvaging,search-and-rescue and perimeter or harbour defense. IR video was chosen for its day-and-night imaging capability, and its availability on military vessels. Detection is difficult because a rough sea is seen as a dynamic background of moving objects with size order, shape and temperature similar to those of the floating mine. We do find a determinant characteristic in the target's periodic motion, which differs from that of the propagating surface waves composing the background. The classical detection and tracking approaches give bad results when applied to this problem. While background detection algorithms assume a quasi-static background, the sea surface is actually very dynamic, causing this category of algorithms to fail. Kalman or particle filter algorithms on the other hand, which stress temporal coherence, suffer from tracking loss due to occlusions and the great noise level of the image. We propose an innovative approach. This approach uses the periodicity of the objects movement and thus its temporal coherence. The principle is to consider the video data as a spacetime volume similar to a hyperspectral data cube by replacing the spectral axis with a temporal axis. We can then apply algorithms developed for hyperspectral detection problems to the detection of small floating objects. We treat the detection problem using multilinear algebra, designing a number of finite impulse response filters (FIR) maximizing the target response. The algorithm was applied to test footage of practice mines in the infrared.

  11. pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts.

    PubMed

    Rani, Jyoti; Shah, A B Rauf; Ramachandran, Srinivasan

    2015-10-01

    The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature with more than 24 million citations. Data-mining of voluminous literature is a challenging task. Although several text-mining algorithms have been developed in recent years with focus on data visualization, they have limitations such as speed, are rigid and are not available in the open source. We have developed an R package, pubmed.mineR, wherein we have combined the advantages of existing algorithms, overcome their limitations, and offer user flexibility and link with other packages in Bioconductor and the Comprehensive R Network (CRAN) in order to expand the user capabilities for executing multifaceted approaches. Three case studies are presented, namely, 'Evolving role of diabetes educators', 'Cancer risk assessment' and 'Dynamic concepts on disease and comorbidity' to illustrate the use of pubmed.mineR. The package generally runs fast with small elapsed times in regular workstations even on large corpus sizes and with compute intensive functions. The pubmed.mineR is available at http://cran.rproject. org/web/packages/pubmed.mineR.

  12. Learner Typologies Development Using OIndex and Data Mining Based Clustering Techniques

    ERIC Educational Resources Information Center

    Luan, Jing

    2004-01-01

    This explorative data mining project used distance based clustering algorithm to study 3 indicators, called OIndex, of student behavioral data and stabilized at a 6-cluster scenario following an exhaustive explorative study of 4, 5, and 6 cluster scenarios produced by K-Means and TwoStep algorithms. Using principles in data mining, the study…

  13. Comparison analysis for classification algorithm in data mining and the study of model use

    NASA Astrophysics Data System (ADS)

    Chen, Junde; Zhang, Defu

    2018-04-01

    As a key technique in data mining, classification algorithm was received extensive attention. Through an experiment of classification algorithm in UCI data set, we gave a comparison analysis method for the different algorithms and the statistical test was used here. Than that, an adaptive diagnosis model for preventive electricity stealing and leakage was given as a specific case in the paper.

  14. Multi-Level Sequential Pattern Mining Based on Prime Encoding

    NASA Astrophysics Data System (ADS)

    Lianglei, Sun; Yun, Li; Jiang, Yin

    Encoding is not only to express the hierarchical relationship, but also to facilitate the identification of the relationship between different levels, which will directly affect the efficiency of the algorithm in the area of mining the multi-level sequential pattern. In this paper, we prove that one step of division operation can decide the parent-child relationship between different levels by using prime encoding and present PMSM algorithm and CROSS-PMSM algorithm which are based on prime encoding for mining multi-level sequential pattern and cross-level sequential pattern respectively. Experimental results show that the algorithm can effectively extract multi-level and cross-level sequential pattern from the sequence database.

  15. Application and Exploration of Big Data Mining in Clinical Medicine.

    PubMed

    Zhang, Yue; Guo, Shu-Li; Han, Li-Na; Li, Tie-Ling

    2016-03-20

    To review theories and technologies of big data mining and their application in clinical medicine. Literatures published in English or Chinese regarding theories and technologies of big data mining and the concrete applications of data mining technology in clinical medicine were obtained from PubMed and Chinese Hospital Knowledge Database from 1975 to 2015. Original articles regarding big data mining theory/technology and big data mining's application in the medical field were selected. This review characterized the basic theories and technologies of big data mining including fuzzy theory, rough set theory, cloud theory, Dempster-Shafer theory, artificial neural network, genetic algorithm, inductive learning theory, Bayesian network, decision tree, pattern recognition, high-performance computing, and statistical analysis. The application of big data mining in clinical medicine was analyzed in the fields of disease risk assessment, clinical decision support, prediction of disease development, guidance of rational use of drugs, medical management, and evidence-based medicine. Big data mining has the potential to play an important role in clinical medicine.

  16. Applying data mining techniques to improve diagnosis in neonatal jaundice.

    PubMed

    Ferreira, Duarte; Oliveira, Abílio; Freitas, Alberto

    2012-12-07

    Hyperbilirubinemia is emerging as an increasingly common problem in newborns due to a decreasing hospital length of stay after birth. Jaundice is the most common disease of the newborn and although being benign in most cases it can lead to severe neurological consequences if poorly evaluated. In different areas of medicine, data mining has contributed to improve the results obtained with other methodologies.Hence, the aim of this study was to improve the diagnosis of neonatal jaundice with the application of data mining techniques. This study followed the different phases of the Cross Industry Standard Process for Data Mining model as its methodology.This observational study was performed at the Obstetrics Department of a central hospital (Centro Hospitalar Tâmega e Sousa--EPE), from February to March of 2011. A total of 227 healthy newborn infants with 35 or more weeks of gestation were enrolled in the study. Over 70 variables were collected and analyzed. Also, transcutaneous bilirubin levels were measured from birth to hospital discharge with maximum time intervals of 8 hours between measurements, using a noninvasive bilirubinometer.Different attribute subsets were used to train and test classification models using algorithms included in Weka data mining software, such as decision trees (J48) and neural networks (multilayer perceptron). The accuracy results were compared with the traditional methods for prediction of hyperbilirubinemia. The application of different classification algorithms to the collected data allowed predicting subsequent hyperbilirubinemia with high accuracy. In particular, at 24 hours of life of newborns, the accuracy for the prediction of hyperbilirubinemia was 89%. The best results were obtained using the following algorithms: naive Bayes, multilayer perceptron and simple logistic. The findings of our study sustain that, new approaches, such as data mining, may support medical decision, contributing to improve diagnosis in neonatal jaundice.

  17. Big data mining analysis method based on cloud computing

    NASA Astrophysics Data System (ADS)

    Cai, Qing Qiu; Cui, Hong Gang; Tang, Hao

    2017-08-01

    Information explosion era, large data super-large, discrete and non-(semi) structured features have gone far beyond the traditional data management can carry the scope of the way. With the arrival of the cloud computing era, cloud computing provides a new technical way to analyze the massive data mining, which can effectively solve the problem that the traditional data mining method cannot adapt to massive data mining. This paper introduces the meaning and characteristics of cloud computing, analyzes the advantages of using cloud computing technology to realize data mining, designs the mining algorithm of association rules based on MapReduce parallel processing architecture, and carries out the experimental verification. The algorithm of parallel association rule mining based on cloud computing platform can greatly improve the execution speed of data mining.

  18. A parallel algorithm for finding the shortest exit paths in mines

    NASA Astrophysics Data System (ADS)

    Jastrzab, Tomasz; Buchcik, Agata

    2017-11-01

    In the paper we study the problem of finding the shortest exit path in an underground mine in case of emergency. Since emergency situations, such as underground fires, can put the miners' lives at risk, the ability to quickly determine the safest exit path is crucial. We propose a parallel algorithm capable of finding the shortest path between the safe exit point and any other point in the mine. The algorithm is also able to take into account the characteristics of individual miners, to make the path determination more reliable.

  19. Multiagent data warehousing and multiagent data mining for cerebrum/cerebellum modeling

    NASA Astrophysics Data System (ADS)

    Zhang, Wen-Ran

    2002-03-01

    An algorithm named Neighbor-Miner is outlined for multiagent data warehousing and multiagent data mining. The algorithm is defined in an evolving dynamic environment with autonomous or semiautonomous agents. Instead of mining frequent itemsets from customer transactions, the new algorithm discovers new agents and mining agent associations in first-order logic from agent attributes and actions. While the Apriori algorithm uses frequency as a priory threshold, the new algorithm uses agent similarity as priory knowledge. The concept of agent similarity leads to the notions of agent cuboid, orthogonal multiagent data warehousing (MADWH), and multiagent data mining (MADM). Based on agent similarities and action similarities, Neighbor-Miner is proposed and illustrated in a MADWH/MADM approach to cerebrum/cerebellum modeling. It is shown that (1) semiautonomous neurofuzzy agents can be identified for uniped locomotion and gymnastic training based on attribute relevance analysis; (2) new agents can be discovered and agent cuboids can be dynamically constructed in an orthogonal MADWH, which resembles an evolving cerebrum/cerebellum system; and (3) dynamic motion laws can be discovered as association rules in first order logic. Although examples in legged robot gymnastics are used to illustrate the basic ideas, the new approach is generally suitable for a broad category of data mining tasks where knowledge can be discovered collectively by a set of agents from a geographically or geometrically distributed but relevant environment, especially in scientific and engineering data environments.

  20. Mining of high utility-probability sequential patterns from uncertain databases

    PubMed Central

    Zhang, Binbin; Fournier-Viger, Philippe; Li, Ting

    2017-01-01

    High-utility sequential pattern mining (HUSPM) has become an important issue in the field of data mining. Several HUSPM algorithms have been designed to mine high-utility sequential patterns (HUPSPs). They have been applied in several real-life situations such as for consumer behavior analysis and event detection in sensor networks. Nonetheless, most studies on HUSPM have focused on mining HUPSPs in precise data. But in real-life, uncertainty is an important factor as data is collected using various types of sensors that are more or less accurate. Hence, data collected in a real-life database can be annotated with existing probabilities. This paper presents a novel pattern mining framework called high utility-probability sequential pattern mining (HUPSPM) for mining high utility-probability sequential patterns (HUPSPs) in uncertain sequence databases. A baseline algorithm with three optional pruning strategies is presented to mine HUPSPs. Moroever, to speed up the mining process, a projection mechanism is designed to create a database projection for each processed sequence, which is smaller than the original database. Thus, the number of unpromising candidates can be greatly reduced, as well as the execution time for mining HUPSPs. Substantial experiments both on real-life and synthetic datasets show that the designed algorithm performs well in terms of runtime, number of candidates, memory usage, and scalability for different minimum utility and minimum probability thresholds. PMID:28742847

  1. A systematic review of validated methods for identifying hypersensitivity reactions other than anaphylaxis (fever, rash, and lymphadenopathy), using administrative and claims data.

    PubMed

    Schneider, Gary; Kachroo, Sumesh; Jones, Natalie; Crean, Sheila; Rotella, Philip; Avetisyan, Ruzan; Reynolds, Matthew W

    2012-01-01

    The Food and Drug Administration's Mini-Sentinel pilot program aims to conduct active surveillance to refine safety signals that emerge for marketed medical products. A key facet of this surveillance is to develop and understand the validity of algorithms for identifying health outcomes of interest from administrative and claims data. This article summarizes the process and findings of the algorithm review of hypersensitivity reactions. PubMed and Iowa Drug Information Service searches were conducted to identify citations applicable to the hypersensitivity reactions of health outcomes of interest. Level 1 abstract reviews and Level 2 full-text reviews were conducted to find articles using administrative and claims data to identify hypersensitivity reactions and including validation estimates of the coding algorithms. We identified five studies that provided validated hypersensitivity-reaction algorithms. Algorithm positive predictive values (PPVs) for various definitions of hypersensitivity reactions ranged from 3% to 95%. PPVs were high (i.e. 90%-95%) when both exposures and diagnoses were very specific. PPV generally decreased when the definition of hypersensitivity was expanded, except in one study that used data mining methodology for algorithm development. The ability of coding algorithms to identify hypersensitivity reactions varied, with decreasing performance occurring with expanded outcome definitions. This examination of hypersensitivity-reaction coding algorithms provides an example of surveillance bias resulting from outcome definitions that include mild cases. Data mining may provide tools for algorithm development for hypersensitivity and other health outcomes. Research needs to be conducted on designing validation studies to test hypersensitivity-reaction algorithms and estimating their predictive power, sensitivity, and specificity. Copyright © 2012 John Wiley & Sons, Ltd.

  2. Buried landmine detection using multivariate normal clustering

    NASA Astrophysics Data System (ADS)

    Duston, Brian M.

    2001-10-01

    A Bayesian classification algorithm is presented for discriminating buried land mines from buried and surface clutter in Ground Penetrating Radar (GPR) signals. This algorithm is based on multivariate normal (MVN) clustering, where feature vectors are used to identify populations (clusters) of mines and clutter objects. The features are extracted from two-dimensional images created from ground penetrating radar scans. MVN clustering is used to determine the number of clusters in the data and to create probability density models for target and clutter populations, producing the MVN clustering classifier (MVNCC). The Bayesian Information Criteria (BIC) is used to evaluate each model to determine the number of clusters in the data. An extension of the MVNCC allows the model to adapt to local clutter distributions by treating each of the MVN cluster components as a Poisson process and adaptively estimating the intensity parameters. The algorithm is developed using data collected by the Mine Hunter/Killer Close-In Detector (MH/K CID) at prepared mine lanes. The Mine Hunter/Killer is a prototype mine detecting and neutralizing vehicle developed for the U.S. Army to clear roads of anti-tank mines.

  3. On-Board Mining in the Sensor Web

    NASA Astrophysics Data System (ADS)

    Tanner, S.; Conover, H.; Graves, S.; Ramachandran, R.; Rushing, J.

    2004-12-01

    On-board data mining can contribute to many research and engineering applications, including natural hazard detection and prediction, intelligent sensor control, and the generation of customized data products for direct distribution to users. The ability to mine sensor data in real time can also be a critical component of autonomous operations, supporting deep space missions, unmanned aerial and ground-based vehicles (UAVs, UGVs), and a wide range of sensor meshes, webs and grids. On-board processing is expected to play a significant role in the next generation of NASA, Homeland Security, Department of Defense and civilian programs, providing for greater flexibility and versatility in measurements of physical systems. In addition, the use of UAV and UGV systems is increasing in military, emergency response and industrial applications. As research into the autonomy of these vehicles progresses, especially in fleet or web configurations, the applicability of on-board data mining is expected to increase significantly. Data mining in real time on board sensor platforms presents unique challenges. Most notably, the data to be mined is a continuous stream, rather than a fixed store such as a database. This means that the data mining algorithms must be modified to make only a single pass through the data. In addition, the on-board environment requires real time processing with limited computing resources, thus the algorithms must use fixed and relatively small amounts of processing time and memory. The University of Alabama in Huntsville is developing an innovative processing framework for the on-board data and information environment. The Environment for On-Board Processing (EVE) and the Adaptive On-board Data Processing (AODP) projects serve as proofs-of-concept of advanced information systems for remote sensing platforms. The EVE real-time processing infrastructure will upload, schedule and control the execution of processing plans on board remote sensors. These plans provide capabilities for autonomous data mining, classification and feature extraction using both streaming and buffered data sources. A ground-based testbed provides a heterogeneous, embedded hardware and software environment representing both space-based and ground-based sensor platforms, including wireless sensor mesh architectures. The AODP project explores the EVE concepts in the world of sensor-networks, including ad-hoc networks of small sensor platforms.

  4. Multivariate Spatial Condition Mapping Using Subtractive Fuzzy Cluster Means

    PubMed Central

    Sabit, Hakilo; Al-Anbuky, Adnan

    2014-01-01

    Wireless sensor networks are usually deployed for monitoring given physical phenomena taking place in a specific space and over a specific duration of time. The spatio-temporal distribution of these phenomena often correlates to certain physical events. To appropriately characterise these events-phenomena relationships over a given space for a given time frame, we require continuous monitoring of the conditions. WSNs are perfectly suited for these tasks, due to their inherent robustness. This paper presents a subtractive fuzzy cluster means algorithm and its application in data stream mining for wireless sensor systems over a cloud-computing-like architecture, which we call sensor cloud data stream mining. Benchmarking on standard mining algorithms, the k-means and the FCM algorithms, we have demonstrated that the subtractive fuzzy cluster means model can perform high quality distributed data stream mining tasks comparable to centralised data stream mining. PMID:25313495

  5. Software tool for data mining and its applications

    NASA Astrophysics Data System (ADS)

    Yang, Jie; Ye, Chenzhou; Chen, Nianyi

    2002-03-01

    A software tool for data mining is introduced, which integrates pattern recognition (PCA, Fisher, clustering, hyperenvelop, regression), artificial intelligence (knowledge representation, decision trees), statistical learning (rough set, support vector machine), computational intelligence (neural network, genetic algorithm, fuzzy systems). It consists of nine function models: pattern recognition, decision trees, association rule, fuzzy rule, neural network, genetic algorithm, Hyper Envelop, support vector machine, visualization. The principle and knowledge representation of some function models of data mining are described. The software tool of data mining is realized by Visual C++ under Windows 2000. Nonmonotony in data mining is dealt with by concept hierarchy and layered mining. The software tool of data mining has satisfactorily applied in the prediction of regularities of the formation of ternary intermetallic compounds in alloy systems, and diagnosis of brain glioma.

  6. An Algorithm of Association Rule Mining for Microbial Energy Prospection

    PubMed Central

    Shaheen, Muhammad; Shahbaz, Muhammad

    2017-01-01

    The presence of hydrocarbons beneath earth’s surface produces some microbiological anomalies in soils and sediments. The detection of such microbial populations involves pure bio chemical processes which are specialized, expensive and time consuming. This paper proposes a new algorithm of context based association rule mining on non spatial data. The algorithm is a modified form of already developed algorithm which was for spatial database only. The algorithm is applied to mine context based association rules on microbial database to extract interesting and useful associations of microbial attributes with existence of hydrocarbon reserve. The surface and soil manifestations caused by the presence of hydrocarbon oxidizing microbes are selected from existing literature and stored in a shared database. The algorithm is applied on the said database to generate direct and indirect associations among the stored microbial indicators. These associations are then correlated with the probability of hydrocarbon’s existence. The numerical evaluation shows better accuracy for non-spatial data as compared to conventional algorithms at generating reliable and robust rules. PMID:28393846

  7. Exploring context and content links in social media: a latent space method.

    PubMed

    Qi, Guo-Jun; Aggarwal, Charu; Tian, Qi; Ji, Heng; Huang, Thomas S

    2012-05-01

    Social media networks contain both content and context-specific information. Most existing methods work with either of the two for the purpose of multimedia mining and retrieval. In reality, both content and context information are rich sources of information for mining, and the full power of mining and processing algorithms can be realized only with the use of a combination of the two. This paper proposes a new algorithm which mines both context and content links in social media networks to discover the underlying latent semantic space. This mapping of the multimedia objects into latent feature vectors enables the use of any off-the-shelf multimedia retrieval algorithms. Compared to the state-of-the-art latent methods in multimedia analysis, this algorithm effectively solves the problem of sparse context links by mining the geometric structure underlying the content links between multimedia objects. Specifically for multimedia annotation, we show that an effective algorithm can be developed to directly construct annotation models by simultaneously leveraging both context and content information based on latent structure between correlated semantic concepts. We conduct experiments on the Flickr data set, which contains user tags linked with images. We illustrate the advantages of our approach over the state-of-the-art multimedia retrieval techniques.

  8. Combined mine tremors source location and error evaluation in the Lubin Copper Mine (Poland)

    NASA Astrophysics Data System (ADS)

    Leśniak, Andrzej; Pszczoła, Grzegorz

    2008-08-01

    A modified method of mine tremors location used in Lubin Copper Mine is presented in the paper. In mines where an intensive exploration is carried out a high accuracy source location technique is usually required. The effect of the flatness of the geophones array, complex geological structure of the rock mass and intense exploitation make the location results ambiguous in such mines. In the present paper an effective method of source location and location's error evaluations are presented, combining data from two different arrays of geophones. The first consists of uniaxial geophones spaced in the whole mine area. The second is installed in one of the mining panels and consists of triaxial geophones. The usage of the data obtained from triaxial geophones allows to increase the hypocenter vertical coordinate precision. The presented two-step location procedure combines standard location methods: P-waves directions and P-waves arrival times. Using computer simulations the efficiency of the created algorithm was tested. The designed algorithm is fully non-linear and was tested on the multilayered rock mass model of the Lubin Copper Mine, showing a computational better efficiency than the traditional P-wave arrival times location algorithm. In this paper we present the complete procedure that effectively solves the non-linear location problems, i.e. the mine tremor location and measurement of the error propagation.

  9. On Interestingness Measures for Mining Statistically Significant and Novel Clinical Associations from EMRs

    PubMed Central

    Abar, Orhan; Charnigo, Richard J.; Rayapati, Abner

    2017-01-01

    Association rule mining has received significant attention from both the data mining and machine learning communities. While data mining researchers focus more on designing efficient algorithms to mine rules from large datasets, the learning community has explored applications of rule mining to classification. A major problem with rule mining algorithms is the explosion of rules even for moderate sized datasets making it very difficult for end users to identify both statistically significant and potentially novel rules that could lead to interesting new insights and hypotheses. Researchers have proposed many domain independent interestingness measures using which, one can rank the rules and potentially glean useful rules from the top ranked ones. However, these measures have not been fully explored for rule mining in clinical datasets owing to the relatively large sizes of the datasets often encountered in healthcare and also due to limited access to domain experts for review/analysis. In this paper, using an electronic medical record (EMR) dataset of diagnoses and medications from over three million patient visits to the University of Kentucky medical center and affiliated clinics, we conduct a thorough evaluation of dozens of interestingness measures proposed in data mining literature, including some new composite measures. Using cumulative relevance metrics from information retrieval, we compare these interestingness measures against human judgments obtained from a practicing psychiatrist for association rules involving the depressive disorders class as the consequent. Our results not only surface new interesting associations for depressive disorders but also indicate classes of interestingness measures that weight rule novelty and statistical strength in contrasting ways, offering new insights for end users in identifying interesting rules. PMID:28736771

  10. Implementation of Data Mining to Analyze Drug Cases Using C4.5 Decision Tree

    NASA Astrophysics Data System (ADS)

    Wahyuni, Sri

    2018-03-01

    Data mining was the process of finding useful information from a large set of databases. One of the existing techniques in data mining was classification. The method used was decision tree method and algorithm used was C4.5 algorithm. The decision tree method was a method that transformed a very large fact into a decision tree which was presenting the rules. Decision tree method was useful for exploring data, as well as finding a hidden relationship between a number of potential input variables with a target variable. The decision tree of the C4.5 algorithm was constructed with several stages including the selection of attributes as roots, created a branch for each value and divided the case into the branch. These stages would be repeated for each branch until all the cases on the branch had the same class. From the solution of the decision tree there would be some rules of a case. In this case the researcher classified the data of prisoners at Labuhan Deli prison to know the factors of detainees committing criminal acts of drugs. By applying this C4.5 algorithm, then the knowledge was obtained as information to minimize the criminal acts of drugs. From the findings of the research, it was found that the most influential factor of the detainee committed the criminal act of drugs was from the address variable.

  11. A novel artificial immune clonal selection classification and rule mining with swarm learning model

    NASA Astrophysics Data System (ADS)

    Al-Sheshtawi, Khaled A.; Abdul-Kader, Hatem M.; Elsisi, Ashraf B.

    2013-06-01

    Metaheuristic optimisation algorithms have become popular choice for solving complex problems. By integrating Artificial Immune clonal selection algorithm (CSA) and particle swarm optimisation (PSO) algorithm, a novel hybrid Clonal Selection Classification and Rule Mining with Swarm Learning Algorithm (CS2) is proposed. The main goal of the approach is to exploit and explore the parallel computation merit of Clonal Selection and the speed and self-organisation merits of Particle Swarm by sharing information between clonal selection population and particle swarm. Hence, we employed the advantages of PSO to improve the mutation mechanism of the artificial immune CSA and to mine classification rules within datasets. Consequently, our proposed algorithm required less training time and memory cells in comparison to other AIS algorithms. In this paper, classification rule mining has been modelled as a miltiobjective optimisation problem with predictive accuracy. The multiobjective approach is intended to allow the PSO algorithm to return an approximation to the accuracy and comprehensibility border, containing solutions that are spread across the border. We compared our proposed algorithm classification accuracy CS2 with five commonly used CSAs, namely: AIRS1, AIRS2, AIRS-Parallel, CLONALG, and CSCA using eight benchmark datasets. We also compared our proposed algorithm classification accuracy CS2 with other five methods, namely: Naïve Bayes, SVM, MLP, CART, and RFB. The results show that the proposed algorithm is comparable to the 10 studied algorithms. As a result, the hybridisation, built of CSA and PSO, can develop respective merit, compensate opponent defect, and make search-optimal effect and speed better.

  12. A software tool for determination of breast cancer treatment methods using data mining approach.

    PubMed

    Cakır, Abdülkadir; Demirel, Burçin

    2011-12-01

    In this work, breast cancer treatment methods are determined using data mining. For this purpose, software is developed to help to oncology doctor for the suggestion of application of the treatment methods about breast cancer patients. 462 breast cancer patient data, obtained from Ankara Oncology Hospital, are used to determine treatment methods for new patients. This dataset is processed with Weka data mining tool. Classification algorithms are applied one by one for this dataset and results are compared to find proper treatment method. Developed software program called as "Treatment Assistant" uses different algorithms (IB1, Multilayer Perception and Decision Table) to find out which one is giving better result for each attribute to predict and by using Java Net beans interface. Treatment methods are determined for the post surgical operation of breast cancer patients using this developed software tool. At modeling step of data mining process, different Weka algorithms are used for output attributes. For hormonotherapy output IB1, for tamoxifen and radiotherapy outputs Multilayer Perceptron and for the chemotherapy output decision table algorithm shows best accuracy performance compare to each other. In conclusion, this work shows that data mining approach can be a useful tool for medical applications particularly at the treatment decision step. Data mining helps to the doctor to decide in a short time.

  13. A novel approach for incremental uncertainty rule generation from databases with missing values handling: application to dynamic medical databases.

    PubMed

    Konias, Sokratis; Chouvarda, Ioanna; Vlahavas, Ioannis; Maglaveras, Nicos

    2005-09-01

    Current approaches for mining association rules usually assume that the mining is performed in a static database, where the problem of missing attribute values does not practically exist. However, these assumptions are not preserved in some medical databases, like in a home care system. In this paper, a novel uncertainty rule algorithm is illustrated, namely URG-2 (Uncertainty Rule Generator), which addresses the problem of mining dynamic databases containing missing values. This algorithm requires only one pass from the initial dataset in order to generate the item set, while new metrics corresponding to the notion of Support and Confidence are used. URG-2 was evaluated over two medical databases, introducing randomly multiple missing values for each record's attribute (rate: 5-20% by 5% increments) in the initial dataset. Compared with the classical approach (records with missing values are ignored), the proposed algorithm was more robust in mining rules from datasets containing missing values. In all cases, the difference in preserving the initial rules ranged between 30% and 60% in favour of URG-2. Moreover, due to its incremental nature, URG-2 saved over 90% of the time required for thorough re-mining. Thus, the proposed algorithm can offer a preferable solution for mining in dynamic relational databases.

  14. Statistically significant performance results of a mine detector and fusion algorithm from an x-band high-resolution SAR

    NASA Astrophysics Data System (ADS)

    Williams, Arnold C.; Pachowicz, Peter W.

    2004-09-01

    Current mine detection research indicates that no single sensor or single look from a sensor will detect mines/minefields in a real-time manner at a performance level suitable for a forward maneuver unit. Hence, the integrated development of detectors and fusion algorithms are of primary importance. A problem in this development process has been the evaluation of these algorithms with relatively small data sets, leading to anecdotal and frequently over trained results. These anecdotal results are often unreliable and conflicting among various sensors and algorithms. Consequently, the physical phenomena that ought to be exploited and the performance benefits of this exploitation are often ambiguous. The Army RDECOM CERDEC Night Vision Laboratory and Electron Sensors Directorate has collected large amounts of multisensor data such that statistically significant evaluations of detection and fusion algorithms can be obtained. Even with these large data sets care must be taken in algorithm design and data processing to achieve statistically significant performance results for combined detectors and fusion algorithms. This paper discusses statistically significant detection and combined multilook fusion results for the Ellipse Detector (ED) and the Piecewise Level Fusion Algorithm (PLFA). These statistically significant performance results are characterized by ROC curves that have been obtained through processing this multilook data for the high resolution SAR data of the Veridian X-Band radar. We discuss the implications of these results on mine detection and the importance of statistical significance, sample size, ground truth, and algorithm design in performance evaluation.

  15. Application and Exploration of Big Data Mining in Clinical Medicine

    PubMed Central

    Zhang, Yue; Guo, Shu-Li; Han, Li-Na; Li, Tie-Ling

    2016-01-01

    Objective: To review theories and technologies of big data mining and their application in clinical medicine. Data Sources: Literatures published in English or Chinese regarding theories and technologies of big data mining and the concrete applications of data mining technology in clinical medicine were obtained from PubMed and Chinese Hospital Knowledge Database from 1975 to 2015. Study Selection: Original articles regarding big data mining theory/technology and big data mining's application in the medical field were selected. Results: This review characterized the basic theories and technologies of big data mining including fuzzy theory, rough set theory, cloud theory, Dempster–Shafer theory, artificial neural network, genetic algorithm, inductive learning theory, Bayesian network, decision tree, pattern recognition, high-performance computing, and statistical analysis. The application of big data mining in clinical medicine was analyzed in the fields of disease risk assessment, clinical decision support, prediction of disease development, guidance of rational use of drugs, medical management, and evidence-based medicine. Conclusion: Big data mining has the potential to play an important role in clinical medicine. PMID:26960378

  16. hs-CRP is strongly associated with coronary heart disease (CHD): A data mining approach using decision tree algorithm.

    PubMed

    Tayefi, Maryam; Tajfard, Mohammad; Saffar, Sara; Hanachi, Parichehr; Amirabadizadeh, Ali Reza; Esmaeily, Habibollah; Taghipour, Ali; Ferns, Gordon A; Moohebati, Mohsen; Ghayour-Mobarhan, Majid

    2017-04-01

    Coronary heart disease (CHD) is an important public health problem globally. Algorithms incorporating the assessment of clinical biomarkers together with several established traditional risk factors can help clinicians to predict CHD and support clinical decision making with respect to interventions. Decision tree (DT) is a data mining model for extracting hidden knowledge from large databases. We aimed to establish a predictive model for coronary heart disease using a decision tree algorithm. Here we used a dataset of 2346 individuals including 1159 healthy participants and 1187 participant who had undergone coronary angiography (405 participants with negative angiography and 782 participants with positive angiography). We entered 10 variables of a total 12 variables into the DT algorithm (including age, sex, FBG, TG, hs-CRP, TC, HDL, LDL, SBP and DBP). Our model could identify the associated risk factors of CHD with sensitivity, specificity, accuracy of 96%, 87%, 94% and respectively. Serum hs-CRP levels was at top of the tree in our model, following by FBG, gender and age. Our model appears to be an accurate, specific and sensitive model for identifying the presence of CHD, but will require validation in prospective studies. Copyright © 2017 Elsevier B.V. All rights reserved.

  17. Mining Co-Location Patterns with Clustering Items from Spatial Data Sets

    NASA Astrophysics Data System (ADS)

    Zhou, G.; Li, Q.; Deng, G.; Yue, T.; Zhou, X.

    2018-05-01

    The explosive growth of spatial data and widespread use of spatial databases emphasize the need for the spatial data mining. Co-location patterns discovery is an important branch in spatial data mining. Spatial co-locations represent the subsets of features which are frequently located together in geographic space. However, the appearance of a spatial feature C is often not determined by a single spatial feature A or B but by the two spatial features A and B, that is to say where A and B appear together, C often appears. We note that this co-location pattern is different from the traditional co-location pattern. Thus, this paper presents a new concept called clustering terms, and this co-location pattern is called co-location patterns with clustering items. And the traditional algorithm cannot mine this co-location pattern, so we introduce the related concept in detail and propose a novel algorithm. This algorithm is extended by join-based approach proposed by Huang. Finally, we evaluate the performance of this algorithm.

  18. Modular Algorithm Testbed Suite (MATS): A Software Framework for Automatic Target Recognition

    DTIC Science & Technology

    2017-01-01

    004 OFFICE OF NAVAL RESEARCH ATTN JASON STACK MINE WARFARE & OCEAN ENGINEERING PROGRAMS CODE 32, SUITE 1092 875 N RANDOLPH ST ARLINGTON VA 22203 ONR...naval mine countermeasures (MCM) operations by automating a large portion of the data analysis. Successful long-term implementation of ATR requires a...Modular Algorithm Testbed Suite; MATS; Mine Countermeasures Operations U U U SAR 24 Derek R. Kolacinski (850) 230-7218 THIS PAGE INTENTIONALLY LEFT

  19. MINE: Module Identification in Networks

    PubMed Central

    2011-01-01

    Background Graphical models of network associations are useful for both visualizing and integrating multiple types of association data. Identifying modules, or groups of functionally related gene products, is an important challenge in analyzing biological networks. However, existing tools to identify modules are insufficient when applied to dense networks of experimentally derived interaction data. To address this problem, we have developed an agglomerative clustering method that is able to identify highly modular sets of gene products within highly interconnected molecular interaction networks. Results MINE outperforms MCODE, CFinder, NEMO, SPICi, and MCL in identifying non-exclusive, high modularity clusters when applied to the C. elegans protein-protein interaction network. The algorithm generally achieves superior geometric accuracy and modularity for annotated functional categories. In comparison with the most closely related algorithm, MCODE, the top clusters identified by MINE are consistently of higher density and MINE is less likely to designate overlapping modules as a single unit. MINE offers a high level of granularity with a small number of adjustable parameters, enabling users to fine-tune cluster results for input networks with differing topological properties. Conclusions MINE was created in response to the challenge of discovering high quality modules of gene products within highly interconnected biological networks. The algorithm allows a high degree of flexibility and user-customisation of results with few adjustable parameters. MINE outperforms several popular clustering algorithms in identifying modules with high modularity and obtains good overall recall and precision of functional annotations in protein-protein interaction networks from both S. cerevisiae and C. elegans. PMID:21605434

  20. Improved mine blast algorithm for optimal cost design of water distribution systems

    NASA Astrophysics Data System (ADS)

    Sadollah, Ali; Guen Yoo, Do; Kim, Joong Hoon

    2015-12-01

    The design of water distribution systems is a large class of combinatorial, nonlinear optimization problems with complex constraints such as conservation of mass and energy equations. Since feasible solutions are often extremely complex, traditional optimization techniques are insufficient. Recently, metaheuristic algorithms have been applied to this class of problems because they are highly efficient. In this article, a recently developed optimizer called the mine blast algorithm (MBA) is considered. The MBA is improved and coupled with the hydraulic simulator EPANET to find the optimal cost design for water distribution systems. The performance of the improved mine blast algorithm (IMBA) is demonstrated using the well-known Hanoi, New York tunnels and Balerma benchmark networks. Optimization results obtained using IMBA are compared to those using MBA and other optimizers in terms of their minimum construction costs and convergence rates. For the complex Balerma network, IMBA offers the cheapest network design compared to other optimization algorithms.

  1. Genetic Algorithm Calibration of Probabilistic Cellular Automata for Modeling Mining Permit Activity

    USGS Publications Warehouse

    Louis, S.J.; Raines, G.L.

    2003-01-01

    We use a genetic algorithm to calibrate a spatially and temporally resolved cellular automata to model mining activity on public land in Idaho and western Montana. The genetic algorithm searches through a space of transition rule parameters of a two dimensional cellular automata model to find rule parameters that fit observed mining activity data. Previous work by one of the authors in calibrating the cellular automaton took weeks - the genetic algorithm takes a day and produces rules leading to about the same (or better) fit to observed data. These preliminary results indicate that genetic algorithms are a viable tool in calibrating cellular automata for this application. Experience gained during the calibration of this cellular automata suggests that mineral resource information is a critical factor in the quality of the results. With automated calibration, further refinements of how the mineral-resource information is provided to the cellular automaton will probably improve our model.

  2. Mining subspace clusters from DNA microarray data using large itemset techniques.

    PubMed

    Chang, Ye-In; Chen, Jiun-Rung; Tsai, Yueh-Chi

    2009-05-01

    Mining subspace clusters from the DNA microarrays could help researchers identify those genes which commonly contribute to a disease, where a subspace cluster indicates a subset of genes whose expression levels are similar under a subset of conditions. Since in a DNA microarray, the number of genes is far larger than the number of conditions, those previous proposed algorithms which compute the maximum dimension sets (MDSs) for any two genes will take a long time to mine subspace clusters. In this article, we propose the Large Itemset-Based Clustering (LISC) algorithm for mining subspace clusters. Instead of constructing MDSs for any two genes, we construct only MDSs for any two conditions. Then, we transform the task of finding the maximal possible gene sets into the problem of mining large itemsets from the condition-pair MDSs. Since we are only interested in those subspace clusters with gene sets as large as possible, it is desirable to pay attention to those gene sets which have reasonable large support values in the condition-pair MDSs. From our simulation results, we show that the proposed algorithm needs shorter processing time than those previous proposed algorithms which need to construct gene-pair MDSs.

  3. Simple, Scalable, Script-based, Science Processor for Measurements - Data Mining Edition (S4PM-DME)

    NASA Astrophysics Data System (ADS)

    Pham, L. B.; Eng, E. K.; Lynnes, C. S.; Berrick, S. W.; Vollmer, B. E.

    2005-12-01

    The S4PM-DME is the Goddard Earth Sciences Distributed Active Archive Center's (GES DAAC) web-based data mining environment. The S4PM-DME replaces the Near-line Archive Data Mining (NADM) system with a better web environment and a richer set of production rules. S4PM-DME enables registered users to submit and execute custom data mining algorithms. The S4PM-DME system uses the GES DAAC developed Simple Scalable Script-based Science Processor for Measurements (S4PM) to automate tasks and perform the actual data processing. A web interface allows the user to access the S4PM-DME system. The user first develops personalized data mining algorithm on his/her home platform and then uploads them to the S4PM-DME system. Algorithms in C and FORTRAN languages are currently supported. The user developed algorithm is automatically audited for any potential security problems before it is installed within the S4PM-DME system and made available to the user. Once the algorithm has been installed the user can promote the algorithm to the "operational" environment. From here the user can search and order the data available in the GES DAAC archive for his/her science algorithm. The user can also set up a processing subscription. The subscription will automatically process new data as it becomes available in the GES DAAC archive. The generated mined data products are then made available for FTP pickup. The benefits of using S4PM-DME are 1) to decrease the downloading time it typically takes a user to transfer the GES DAAC data to his/her system thus off-load the heavy network traffic, 2) to free-up the load on their system, and last 3) to utilize the rich and abundance ocean, atmosphere data from the MODIS and AIRS instruments available from the GES DAAC.

  4. Commonality of drug-associated adverse events detected by 4 commonly used data mining algorithms.

    PubMed

    Sakaeda, Toshiyuki; Kadoyama, Kaori; Minami, Keiko; Okuno, Yasushi

    2014-01-01

    Data mining algorithms have been developed for the quantitative detection of drug-associated adverse events (signals) from a large database on spontaneously reported adverse events. In the present study, the commonality of signals detected by 4 commonly used data mining algorithms was examined. A total of 2,231,029 reports were retrieved from the public release of the US Food and Drug Administration Adverse Event Reporting System database between 2004 and 2009. The deletion of duplicated submissions and revision of arbitrary drug names resulted in a reduction in the number of reports to 1,644,220. Associations with adverse events were analyzed for 16 unrelated drugs, using the proportional reporting ratio (PRR), reporting odds ratio (ROR), information component (IC), and empirical Bayes geometric mean (EBGM). All EBGM-based signals were included in the PRR-based signals as well as IC- or ROR-based ones, and PRR- and IC-based signals were included in ROR-based ones. The PRR scores of PRR-based signals were significantly larger for 15 of 16 drugs when adverse events were also detected as signals by the EBGM method, as were the IC scores of IC-based signals for all drugs; however, no such effect was observed in the ROR scores of ROR-based signals. The EBGM method was the most conservative among the 4 methods examined, which suggested its better suitability for pharmacoepidemiological studies. Further examinations should be performed on the reproducibility of clinical observations, especially for EBGM-based signals.

  5. Depth data research of GIS based on clustering analysis algorithm

    NASA Astrophysics Data System (ADS)

    Xiong, Yan; Xu, Wenli

    2018-03-01

    The data of GIS have spatial distribution. Geographic data has both spatial characteristics and attribute characteristics, and also changes with time. Therefore, the amount of data is very large. Nowadays, many industries and departments in the society are using GIS. However, without proper data analysis and mining scheme, GIS will not exert its maximum effectiveness and will waste a lot of data. In this paper, we use the geographic information demand of a national security department as the experimental object, combining the characteristics of GIS data, taking into account the characteristics of time, space, attributes and so on, and using cluster analysis algorithm. We further study the mining scheme for depth data, and get the algorithm model. This algorithm can automatically classify sample data, and then carry out exploratory analysis. The research shows that the algorithm model and the information mining scheme can quickly find hidden depth information from the surface data of GIS, thus improving the efficiency of the security department. This algorithm can also be extended to other fields.

  6. Rare itemsets mining algorithm based on RP-Tree and spark framework

    NASA Astrophysics Data System (ADS)

    Liu, Sainan; Pan, Haoan

    2018-05-01

    For the issues of the rare itemsets mining in big data, this paper proposed a rare itemsets mining algorithm based on RP-Tree and Spark framework. Firstly, it arranged the data vertically according to the transaction identifier, in order to solve the defects of scan the entire data set, the vertical datasets are divided into frequent vertical datasets and rare vertical datasets. Then, it adopted the RP-Tree algorithm to construct the frequent pattern tree that contains rare items and generate rare 1-itemsets. After that, it calculated the support of the itemsets by scanning the two vertical data sets, finally, it used the iterative process to generate rare itemsets. The experimental show that the algorithm can effectively excavate rare itemsets and have great superiority in execution time.

  7. Vlsi implementation of flexible architecture for decision tree classification in data mining

    NASA Astrophysics Data System (ADS)

    Sharma, K. Venkatesh; Shewandagn, Behailu; Bhukya, Shankar Nayak

    2017-07-01

    The Data mining algorithms have become vital to researchers in science, engineering, medicine, business, search and security domains. In recent years, there has been a terrific raise in the size of the data being collected and analyzed. Classification is the main difficulty faced in data mining. In a number of the solutions developed for this problem, most accepted one is Decision Tree Classification (DTC) that gives high precision while handling very large amount of data. This paper presents VLSI implementation of flexible architecture for Decision Tree classification in data mining using c4.5 algorithm.

  8. Graph Mining Meets the Semantic Web

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lee, Sangkeun; Sukumar, Sreenivas R; Lim, Seung-Hwan

    The Resource Description Framework (RDF) and SPARQL Protocol and RDF Query Language (SPARQL) were introduced about a decade ago to enable flexible schema-free data interchange on the Semantic Web. Today, data scientists use the framework as a scalable graph representation for integrating, querying, exploring and analyzing data sets hosted at different sources. With increasing adoption, the need for graph mining capabilities for the Semantic Web has emerged. We address that need through implementation of three popular iterative Graph Mining algorithms (Triangle count, Connected component analysis, and PageRank). We implement these algorithms as SPARQL queries, wrapped within Python scripts. We evaluatemore » the performance of our implementation on 6 real world data sets and show graph mining algorithms (that have a linear-algebra formulation) can indeed be unleashed on data represented as RDF graphs using the SPARQL query interface.« less

  9. Quantum algorithm for association rules mining

    NASA Astrophysics Data System (ADS)

    Yu, Chao-Hua; Gao, Fei; Wang, Qing-Le; Wen, Qiao-Yan

    2016-10-01

    Association rules mining (ARM) is one of the most important problems in knowledge discovery and data mining. Given a transaction database that has a large number of transactions and items, the task of ARM is to acquire consumption habits of customers by discovering the relationships between itemsets (sets of items). In this paper, we address ARM in the quantum settings and propose a quantum algorithm for the key part of ARM, finding frequent itemsets from the candidate itemsets and acquiring their supports. Specifically, for the case in which there are Mf(k ) frequent k -itemsets in the Mc(k ) candidate k -itemsets (Mf(k )≤Mc(k ) ), our algorithm can efficiently mine these frequent k -itemsets and estimate their supports by using parallel amplitude estimation and amplitude amplification with complexity O (k/√{Mc(k )Mf(k ) } ɛ ) , where ɛ is the error for estimating the supports. Compared with the classical counterpart, i.e., the classical sampling-based algorithm, whose complexity is O (k/Mc(k ) ɛ2) , our quantum algorithm quadratically improves the dependence on both ɛ and Mc(k ) in the best case when Mf(k )≪Mc(k ) and on ɛ alone in the worst case when Mf(k )≈Mc(k ) .

  10. Performance analysis of a multispectral system for mine detection in the littoral zone

    NASA Astrophysics Data System (ADS)

    Hargrove, John T.; Louchard, Eric

    2004-09-01

    Science & Technology International (STI) has developed, under contract with the Office of Naval Research, a system of multispectral airborne sensors and processing algorithms capable of detecting mine-like objects in the surf zone. STI has used this system to detect mine-like objects in a littoral environment as part of blind tests at Kaneohe Marine Corps Base Hawaii, and Panama City, Florida. The airborne and ground subsystems are described. The detection algorithm is graphically illustrated. We report on the performance of the system configured to operate without a human in the loop. A subsurface (underwater bottom proud mine in the surf zone and moored mine in shallow water) mine detection capability is demonstrated in the surf zone, and in shallow water with wave spillage and foam. Our analysis demonstrates that this STI-developed multispectral airborne mine detection system provides a technical foundation for a viable mine counter-measures system for use prior to an amphibious assault.

  11. Graphics-based intelligent search and abstracting using Data Modeling

    NASA Astrophysics Data System (ADS)

    Jaenisch, Holger M.; Handley, James W.; Case, Carl T.; Songy, Claude G.

    2002-11-01

    This paper presents an autonomous text and context-mining algorithm that converts text documents into point clouds for visual search cues. This algorithm is applied to the task of data-mining a scriptural database comprised of the Old and New Testaments from the Bible and the Book of Mormon, Doctrine and Covenants, and the Pearl of Great Price. Results are generated which graphically show the scripture that represents the average concept of the database and the mining of the documents down to the verse level.

  12. Optimizing research in symptomatic uterine fibroids with development of a computable phenotype for use with electronic health records.

    PubMed

    Hoffman, Sarah R; Vines, Anissa I; Halladay, Jacqueline R; Pfaff, Emily; Schiff, Lauren; Westreich, Daniel; Sundaresan, Aditi; Johnson, La-Shell; Nicholson, Wanda K

    2018-06-01

    Women with symptomatic uterine fibroids can report a myriad of symptoms, including pain, bleeding, infertility, and psychosocial sequelae. Optimizing fibroid research requires the ability to enroll populations of women with image-confirmed symptomatic uterine fibroids. Our objective was to develop an electronic health record-based algorithm to identify women with symptomatic uterine fibroids for a comparative effectiveness study of medical or surgical treatments on quality-of-life measures. Using an iterative process and text-mining techniques, an effective computable phenotype algorithm, composed of demographics, and clinical and laboratory characteristics, was developed with reasonable performance. Such algorithms provide a feasible, efficient way to identify populations of women with symptomatic uterine fibroids for the conduct of large traditional or pragmatic trials and observational comparative effectiveness studies. Symptomatic uterine fibroids, due to menorrhagia, pelvic pain, bulk symptoms, or infertility, are a source of substantial morbidity for reproductive-age women. Comparing Treatment Options for Uterine Fibroids is a multisite registry study to compare the effectiveness of hormonal or surgical fibroid treatments on women's perceptions of their quality of life. Electronic health record-based algorithms are able to identify large numbers of women with fibroids, but additional work is needed to develop electronic health record algorithms that can identify women with symptomatic fibroids to optimize fibroid research. We sought to develop an efficient electronic health record-based algorithm that can identify women with symptomatic uterine fibroids in a large health care system for recruitment into large-scale observational and interventional research in fibroid management. We developed and assessed the accuracy of 3 algorithms to identify patients with symptomatic fibroids using an iterative approach. The data source was the Carolina Data Warehouse for Health, a repository for the health system's electronic health record data. In addition to International Classification of Diseases, Ninth Revision diagnosis and procedure codes and clinical characteristics, text data-mining software was used to derive information from imaging reports to confirm the presence of uterine fibroids. Results of each algorithm were compared with expert manual review to calculate the positive predictive values for each algorithm. Algorithm 1 was composed of the following criteria: (1) age 18-54 years; (2) either ≥1 International Classification of Diseases, Ninth Revision diagnosis codes for uterine fibroids or mention of fibroids using text-mined key words in imaging records or documents; and (3) no International Classification of Diseases, Ninth Revision or Current Procedural Terminology codes for hysterectomy and no reported history of hysterectomy. The positive predictive value was 47% (95% confidence interval 39-56%). Algorithm 2 required ≥2 International Classification of Diseases, Ninth Revision diagnosis codes for fibroids and positive text-mined key words and had a positive predictive value of 65% (95% confidence interval 50-79%). In algorithm 3, further refinements included ≥2 International Classification of Diseases, Ninth Revision diagnosis codes for fibroids on separate outpatient visit dates, the exclusion of women who had a positive pregnancy test within 3 months of their fibroid-related visit, and exclusion of incidentally detected fibroids during prenatal or emergency department visits. Algorithm 3 achieved a positive predictive value of 76% (95% confidence interval 71-81%). An electronic health record-based algorithm is capable of identifying cases of symptomatic uterine fibroids with moderate positive predictive value and may be an efficient approach for large-scale study recruitment. Copyright © 2018 Elsevier Inc. All rights reserved.

  13. Toward a Progress Indicator for Machine Learning Model Building and Data Mining Algorithm Execution: A Position Paper.

    PubMed

    Luo, Gang

    2017-12-01

    For user-friendliness, many software systems offer progress indicators for long-duration tasks. A typical progress indicator continuously estimates the remaining task execution time as well as the portion of the task that has been finished. Building a machine learning model often takes a long time, but no existing machine learning software supplies a non-trivial progress indicator. Similarly, running a data mining algorithm often takes a long time, but no existing data mining software provides a nontrivial progress indicator. In this article, we consider the problem of offering progress indicators for machine learning model building and data mining algorithm execution. We discuss the goals and challenges intrinsic to this problem. Then we describe an initial framework for implementing such progress indicators and two advanced, potential uses of them, with the goal of inspiring future research on this topic.

  14. Toward a Progress Indicator for Machine Learning Model Building and Data Mining Algorithm Execution: A Position Paper

    PubMed Central

    Luo, Gang

    2017-01-01

    For user-friendliness, many software systems offer progress indicators for long-duration tasks. A typical progress indicator continuously estimates the remaining task execution time as well as the portion of the task that has been finished. Building a machine learning model often takes a long time, but no existing machine learning software supplies a non-trivial progress indicator. Similarly, running a data mining algorithm often takes a long time, but no existing data mining software provides a nontrivial progress indicator. In this article, we consider the problem of offering progress indicators for machine learning model building and data mining algorithm execution. We discuss the goals and challenges intrinsic to this problem. Then we describe an initial framework for implementing such progress indicators and two advanced, potential uses of them, with the goal of inspiring future research on this topic. PMID:29177022

  15. RANWAR: rank-based weighted association rule mining from gene expression and methylation data.

    PubMed

    Mallik, Saurav; Mukhopadhyay, Anirban; Maulik, Ujjwal

    2015-01-01

    Ranking of association rules is currently an interesting topic in data mining and bioinformatics. The huge number of evolved rules of items (or, genes) by association rule mining (ARM) algorithms makes confusion to the decision maker. In this article, we propose a weighted rule-mining technique (say, RANWAR or rank-based weighted association rule-mining) to rank the rules using two novel rule-interestingness measures, viz., rank-based weighted condensed support (wcs) and weighted condensed confidence (wcc) measures to bypass the problem. These measures are basically depended on the rank of items (genes). Using the rank, we assign weight to each item. RANWAR generates much less number of frequent itemsets than the state-of-the-art association rule mining algorithms. Thus, it saves time of execution of the algorithm. We run RANWAR on gene expression and methylation datasets. The genes of the top rules are biologically validated by Gene Ontologies (GOs) and KEGG pathway analyses. Many top ranked rules extracted from RANWAR that hold poor ranks in traditional Apriori, are highly biologically significant to the related diseases. Finally, the top rules evolved from RANWAR, that are not in Apriori, are reported.

  16. Differentially Private Frequent Subgraph Mining

    PubMed Central

    Xu, Shengzhi; Xiong, Li; Cheng, Xiang; Xiao, Ke

    2016-01-01

    Mining frequent subgraphs from a collection of input graphs is an important topic in data mining research. However, if the input graphs contain sensitive information, releasing frequent subgraphs may pose considerable threats to individual's privacy. In this paper, we study the problem of frequent subgraph mining (FGM) under the rigorous differential privacy model. We introduce a novel differentially private FGM algorithm, which is referred to as DFG. In this algorithm, we first privately identify frequent subgraphs from input graphs, and then compute the noisy support of each identified frequent subgraph. In particular, to privately identify frequent subgraphs, we present a frequent subgraph identification approach which can improve the utility of frequent subgraph identifications through candidates pruning. Moreover, to compute the noisy support of each identified frequent subgraph, we devise a lattice-based noisy support derivation approach, where a series of methods has been proposed to improve the accuracy of the noisy supports. Through formal privacy analysis, we prove that our DFG algorithm satisfies ε-differential privacy. Extensive experimental results on real datasets show that the DFG algorithm can privately find frequent subgraphs with high data utility. PMID:27616876

  17. A gossip based information fusion protocol for distributed frequent itemset mining

    NASA Astrophysics Data System (ADS)

    Sohrabi, Mohammad Karim

    2018-07-01

    The computational complexity, huge memory space requirement, and time-consuming nature of frequent pattern mining process are the most important motivations for distribution and parallelization of this mining process. On the other hand, the emergence of distributed computational and operational environments, which causes the production and maintenance of data on different distributed data sources, makes the parallelization and distribution of the knowledge discovery process inevitable. In this paper, a gossip based distributed itemset mining (GDIM) algorithm is proposed to extract frequent itemsets, which are special types of frequent patterns, in a wireless sensor network environment. In this algorithm, local frequent itemsets of each sensor are extracted using a bit-wise horizontal approach (LHPM) from the nodes which are clustered using a leach-based protocol. Heads of clusters exploit a gossip based protocol in order to communicate each other to find the patterns which their global support is equal to or more than the specified support threshold. Experimental results show that the proposed algorithm outperforms the best existing gossip based algorithm in term of execution time.

  18. Java implementation of Class Association Rule algorithms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tamura, Makio

    2007-08-30

    Java implementation of three Class Association Rule mining algorithms, NETCAR, CARapriori, and clustering based rule mining. NETCAR algorithm is a novel algorithm developed by Makio Tamura. The algorithm is discussed in a paper: UCRL-JRNL-232466-DRAFT, and would be published in a peer review scientific journal. The software is used to extract combinations of genes relevant with a phenotype from a phylogenetic profile and a phenotype profile. The phylogenetic profiles is represented by a binary matrix and a phenotype profile is represented by a binary vector. The present application of this software will be in genome analysis, however, it could be appliedmore » more generally.« less

  19. Formulations and algorithms for problems on rock mass and support deformation during mining

    NASA Astrophysics Data System (ADS)

    Seryakov, VM

    2018-03-01

    The analysis of problem formulations to calculate stress-strain state of mine support and surrounding rocks mass in rock mechanics shows that such formulations incompletely describe the mechanical features of joint deformation in the rock mass–support system. The present paper proposes an algorithm to take into account the actual conditions of rock mass and support interaction and the algorithm implementation method to ensure efficient calculation of stresses in rocks and support.

  20. Real -time dispatching modelling for trucks with different capacities in open pit mines / Modelowanie w czasie rzeczywistym przewozów ciężarówek o różnej ładowności w kopalni odkrywkowej

    NASA Astrophysics Data System (ADS)

    Ahangaran, Daryoush Kaveh; Yasrebi, Amir Bijan; Wetherelt, Andy; Foster, Patrick

    2012-10-01

    Application of fully automated systems for truck dispatching plays a major role in decreasing the transportation costs which often represent the majority of costs spent on open pit mining. Consequently, the application of a truck dispatching system has become fundamentally important in most of the world's open pit mines. Recent experiences indicate that by decreasing a truck's travelling time and the associated waiting time of its associated shovel then due to the application of a truck dispatching system the rate of production will be considerably improved. Computer-based truck dispatching systems using algorithms, advanced and accurate software are examples of these innovations. Developing an algorithm of a computer- based program appropriated to a specific mine's conditions is considered as one of the most important activities in connection with computer-based dispatching in open pit mines. In this paper the changing trend of programming and dispatching control algorithms and automation conditions will be discussed. Furthermore, since the transportation fleet of most mines use trucks with different capacities, innovative methods, operational optimisation techniques and the best possible methods for developing the required algorithm for real-time dispatching are selected by conducting research on mathematical-based planning methods. Finally, a real-time dispatching model compatible with the requirement of trucks with different capacities is developed by using two techniques of flow networks and integer programming.

  1. Detecting and visualizing weak signatures in hyperspectral data

    NASA Astrophysics Data System (ADS)

    MacPherson, Duncan James

    This thesis evaluates existing techniques for detecting weak spectral signatures from remotely sensed hyperspectral data. Algorithms are presented that successfully detect hard-to-find 'mystery' signatures in unknown cluttered backgrounds. The term 'mystery' is used to describe a scenario where the spectral target and background endmembers are unknown. Sub-Pixel analysis and background suppression are used to find deeply embedded signatures which can be less than 10% of the total signal strength. Existing 'mystery target' detection algorithms are derived and compared. Several techniques are shown to be superior both visually and quantitatively. Detection performance is evaluated using confidence metrics that are developed. A multiple algorithm approach is shown to improve detection confidence significantly. Although the research focuses on remote sensing applications, the algorithms presented can be applied to a wide variety of diverse fields such as medicine, law enforcement, manufacturing, earth science, food production, and astrophysics. The algorithms are shown to be general and can be applied to both the reflective and emissive parts of the electromagnetic spectrum. The application scope is a broad one and the final results open new opportunities for many specific applications including: land mine detection, pollution and hazardous waste detection, crop abundance calculations, volcanic activity monitoring, detecting diseases in food, automobile or airplane target recognition, cancer detection, mining operations, extracting galactic gas emissions, etc.

  2. Anchor-Free Localization Method for Mobile Targets in Coal Mine Wireless Sensor Networks

    PubMed Central

    Pei, Zhongmin; Deng, Zhidong; Xu, Shuo; Xu, Xiao

    2009-01-01

    Severe natural conditions and complex terrain make it difficult to apply precise localization in underground mines. In this paper, an anchor-free localization method for mobile targets is proposed based on non-metric multi-dimensional scaling (Multi-dimensional Scaling: MDS) and rank sequence. Firstly, a coal mine wireless sensor network is constructed in underground mines based on the ZigBee technology. Then a non-metric MDS algorithm is imported to estimate the reference nodes’ location. Finally, an improved sequence-based localization algorithm is presented to complete precise localization for mobile targets. The proposed method is tested through simulations with 100 nodes, outdoor experiments with 15 ZigBee physical nodes, and the experiments in the mine gas explosion laboratory with 12 ZigBee nodes. Experimental results show that our method has better localization accuracy and is more robust in underground mines. PMID:22574048

  3. Anchor-free localization method for mobile targets in coal mine wireless sensor networks.

    PubMed

    Pei, Zhongmin; Deng, Zhidong; Xu, Shuo; Xu, Xiao

    2009-01-01

    Severe natural conditions and complex terrain make it difficult to apply precise localization in underground mines. In this paper, an anchor-free localization method for mobile targets is proposed based on non-metric multi-dimensional scaling (Multi-dimensional Scaling: MDS) and rank sequence. Firstly, a coal mine wireless sensor network is constructed in underground mines based on the ZigBee technology. Then a non-metric MDS algorithm is imported to estimate the reference nodes' location. Finally, an improved sequence-based localization algorithm is presented to complete precise localization for mobile targets. The proposed method is tested through simulations with 100 nodes, outdoor experiments with 15 ZigBee physical nodes, and the experiments in the mine gas explosion laboratory with 12 ZigBee nodes. Experimental results show that our method has better localization accuracy and is more robust in underground mines.

  4. Data Mining Methods Applied to Flight Operations Quality Assurance Data: A Comparison to Standard Statistical Methods

    NASA Technical Reports Server (NTRS)

    Stolzer, Alan J.; Halford, Carl

    2007-01-01

    In a previous study, multiple regression techniques were applied to Flight Operations Quality Assurance-derived data to develop parsimonious model(s) for fuel consumption on the Boeing 757 airplane. The present study examined several data mining algorithms, including neural networks, on the fuel consumption problem and compared them to the multiple regression results obtained earlier. Using regression methods, parsimonious models were obtained that explained approximately 85% of the variation in fuel flow. In general data mining methods were more effective in predicting fuel consumption. Classification and Regression Tree methods reported correlation coefficients of .91 to .92, and General Linear Models and Multilayer Perceptron neural networks reported correlation coefficients of about .99. These data mining models show great promise for use in further examining large FOQA databases for operational and safety improvements.

  5. Convalescing Cluster Configuration Using a Superlative Framework

    PubMed Central

    Sabitha, R.; Karthik, S.

    2015-01-01

    Competent data mining methods are vital to discover knowledge from databases which are built as a result of enormous growth of data. Various techniques of data mining are applied to obtain knowledge from these databases. Data clustering is one such descriptive data mining technique which guides in partitioning data objects into disjoint segments. K-means algorithm is a versatile algorithm among the various approaches used in data clustering. The algorithm and its diverse adaptation methods suffer certain problems in their performance. To overcome these issues a superlative algorithm has been proposed in this paper to perform data clustering. The specific feature of the proposed algorithm is discretizing the dataset, thereby improving the accuracy of clustering, and also adopting the binary search initialization method to generate cluster centroids. The generated centroids are fed as input to K-means approach which iteratively segments the data objects into respective clusters. The clustered results are measured for accuracy and validity. Experiments conducted by testing the approach on datasets from the UC Irvine Machine Learning Repository evidently show that the accuracy and validity measure is higher than the other two approaches, namely, simple K-means and Binary Search method. Thus, the proposed approach proves that discretization process will improve the efficacy of descriptive data mining tasks. PMID:26543895

  6. EAGLE: 'EAGLE'Is an' Algorithmic Graph Library for Exploration

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2015-01-16

    The Resource Description Framework (RDF) and SPARQL Protocol and RDF Query Language (SPARQL) were introduced about a decade ago to enable flexible schema-free data interchange on the Semantic Web. Today data scientists use the framework as a scalable graph representation for integrating, querying, exploring and analyzing data sets hosted at different sources. With increasing adoption, the need for graph mining capabilities for the Semantic Web has emerged. Today there is no tools to conduct "graph mining" on RDF standard data sets. We address that need through implementation of popular iterative Graph Mining algorithms (Triangle count, Connected component analysis, degree distribution,more » diversity degree, PageRank, etc.). We implement these algorithms as SPARQL queries, wrapped within Python scripts and call our software tool as EAGLE. In RDF style, EAGLE stands for "EAGLE 'Is an' algorithmic graph library for exploration. EAGLE is like 'MATLAB' for 'Linked Data.'« less

  7. Predicting the survival of diabetes using neural network

    NASA Astrophysics Data System (ADS)

    Mamuda, Mamman; Sathasivam, Saratha

    2017-08-01

    Data mining techniques at the present time are used in predicting diseases of health care industries. Neural Network is one among the prevailing method in data mining techniques of an intelligent field for predicting diseases in health care industries. This paper presents a study on the prediction of the survival of diabetes diseases using different learning algorithms from the supervised learning algorithms of neural network. Three learning algorithms are considered in this study: (i) The levenberg-marquardt learning algorithm (ii) The Bayesian regulation learning algorithm and (iii) The scaled conjugate gradient learning algorithm. The network is trained using the Pima Indian Diabetes Dataset with the help of MATLAB R2014(a) software. The performance of each algorithm is further discussed through regression analysis. The prediction accuracy of the best algorithm is further computed to validate the accurate prediction

  8. Sensor feature fusion for detecting buried objects

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Clark, G.A.; Sengupta, S.K.; Sherwood, R.J.

    1993-04-01

    Given multiple registered images of the earth`s surface from dual-band sensors, our system fuses information from the sensors to reduce the effects of clutter and improve the ability to detect buried or surface target sites. The sensor suite currently includes two sensors (5 micron and 10 micron wavelengths) and one ground penetrating radar (GPR) of the wide-band pulsed synthetic aperture type. We use a supervised teaming pattern recognition approach to detect metal and plastic land mines buried in soil. The overall process consists of four main parts: Preprocessing, feature extraction, feature selection, and classification. These parts are used in amore » two step process to classify a subimage. Thee first step, referred to as feature selection, determines the features of sub-images which result in the greatest separability among the classes. The second step, image labeling, uses the selected features and the decisions from a pattern classifier to label the regions in the image which are likely to correspond to buried mines. We extract features from the images, and use feature selection algorithms to select only the most important features according to their contribution to correct detections. This allows us to save computational complexity and determine which of the sensors add value to the detection system. The most important features from the various sensors are fused using supervised teaming pattern classifiers (including neural networks). We present results of experiments to detect buried land mines from real data, and evaluate the usefulness of fusing feature information from multiple sensor types, including dual-band infrared and ground penetrating radar. The novelty of the work lies mostly in the combination of the algorithms and their application to the very important and currently unsolved operational problem of detecting buried land mines from an airborne standoff platform.« less

  9. The Weather Forecast Using Data Mining Research Based on Cloud Computing.

    NASA Astrophysics Data System (ADS)

    Wang, ZhanJie; Mazharul Mujib, A. B. M.

    2017-10-01

    Weather forecasting has been an important application in meteorology and one of the most scientifically and technologically challenging problem around the world. In my study, we have analyzed the use of data mining techniques in forecasting weather. This paper proposes a modern method to develop a service oriented architecture for the weather information systems which forecast weather using these data mining techniques. This can be carried out by using Artificial Neural Network and Decision tree Algorithms and meteorological data collected in Specific time. Algorithm has presented the best results to generate classification rules for the mean weather variables. The results showed that these data mining techniques can be enough for weather forecasting.

  10. A New Pivoting and Iterative Text Detection Algorithm for Biomedical Images

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Xu, Songhua; Krauthammer, Prof. Michael

    2010-01-01

    There is interest to expand the reach of literature mining to include the analysis of biomedical images, which often contain a paper's key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is used to boost biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating the performance on a set of manuallymore » labeled random biomedical images, and compare the performance against other state-of-the-art text detection algorithms. We demonstrate that our projection histogram-based text detection approach is well suited for text detection in biomedical images, and that the iterative application of the algorithm boosts performance to an F score of .60. We provide a C++ implementation of our algorithm freely available for academic use.« less

  11. Privacy Preserving Sequential Pattern Mining in Data Stream

    NASA Astrophysics Data System (ADS)

    Huang, Qin-Hua

    The privacy preserving data mining technique researches have gained much attention in recent years. For data stream systems, wireless networks and mobile devices, the related stream data mining techniques research is still in its' early stage. In this paper, an data mining algorithm dealing with privacy preserving problem in data stream is presented.

  12. A multiobjective optimization model and an orthogonal design-based hybrid heuristic algorithm for regional urban mining management problems.

    PubMed

    Wu, Hao; Wan, Zhong

    2018-02-01

    In this paper, a multiobjective mixed-integer piecewise nonlinear programming model (MOMIPNLP) is built to formulate the management problem of urban mining system, where the decision variables are associated with buy-back pricing, choices of sites, transportation planning, and adjustment of production capacity. Different from the existing approaches, the social negative effect, generated from structural optimization of the recycling system, is minimized in our model, as well as the total recycling profit and utility from environmental improvement are jointly maximized. For solving the problem, the MOMIPNLP model is first transformed into an ordinary mixed-integer nonlinear programming model by variable substitution such that the piecewise feature of the model is removed. Then, based on technique of orthogonal design, a hybrid heuristic algorithm is developed to find an approximate Pareto-optimal solution, where genetic algorithm is used to optimize the structure of search neighborhood, and both local branching algorithm and relaxation-induced neighborhood search algorithm are employed to cut the searching branches and reduce the number of variables in each branch. Numerical experiments indicate that this algorithm spends less CPU (central processing unit) time in solving large-scale regional urban mining management problems, especially in comparison with the similar ones available in literature. By case study and sensitivity analysis, a number of practical managerial implications are revealed from the model. Since the metal stocks in society are reliable overground mineral sources, urban mining has been paid great attention as emerging strategic resources in an era of resource shortage. By mathematical modeling and development of efficient algorithms, this paper provides decision makers with useful suggestions on the optimal design of recycling system in urban mining. For example, this paper can answer how to encourage enterprises to join the recycling activities by government's support and subsidies, whether the existing recycling system can meet the developmental requirements or not, and what is a reasonable adjustment of production capacity.

  13. Exploration of the association rules mining technique for the signal detection of adverse drug events in spontaneous reporting systems.

    PubMed

    Wang, Chao; Guo, Xiao-Jing; Xu, Jin-Fang; Wu, Cheng; Sun, Ya-Lin; Ye, Xiao-Fei; Qian, Wei; Ma, Xiu-Qiang; Du, Wen-Min; He, Jia

    2012-01-01

    The detection of signals of adverse drug events (ADEs) has increased because of the use of data mining algorithms in spontaneous reporting systems (SRSs). However, different data mining algorithms have different traits and conditions for application. The objective of our study was to explore the application of association rule (AR) mining in ADE signal detection and to compare its performance with that of other algorithms. Monte Carlo simulation was applied to generate drug-ADE reports randomly according to the characteristics of SRS datasets. Thousand simulated datasets were mined by AR and other algorithms. On average, 108,337 reports were generated by the Monte Carlo simulation. Based on the predefined criterion that 10% of the drug-ADE combinations were true signals, with RR equaling to 10, 4.9, 1.5, and 1.2, AR detected, on average, 284 suspected associations with a minimum support of 3 and a minimum lift of 1.2. The area under the receiver operating characteristic (ROC) curve of the AR was 0.788, which was equivalent to that shown for other algorithms. Additionally, AR was applied to reports submitted to the Shanghai SRS in 2009. Five hundred seventy combinations were detected using AR from 24,297 SRS reports, and they were compared with recognized ADEs identified by clinical experts and various other sources. AR appears to be an effective method for ADE signal detection, both in simulated and real SRS datasets. The limitations of this method exposed in our study, i.e., a non-uniform thresholds setting and redundant rules, require further research.

  14. Detection of buried objects by fusing dual-band infrared images

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Clark, G.A.; Sengupta, S.K.; Sherwood, R.J.

    1993-11-01

    We have conducted experiments to demonstrate the enhanced detectability of buried land mines using sensor fusion techniques. Multiple sensors, including visible imagery, infrared imagery, and ground penetrating radar (GPR), have been used to acquire data on a number of buried mines and mine surrogates. Because the visible wavelength and GPR data are currently incomplete. This paper focuses on the fusion of two-band infrared images. We use feature-level fusion and supervised learning with the probabilistic neural network (PNN) to evaluate detection performance. The novelty of the work lies in the application of advanced target recognition algorithms, the fusion of dual-band infraredmore » images and evaluation of the techniques using two real data sets.« less

  15. Exploiting Sequential Patterns Found in Users' Solutions and Virtual Tutor Behavior to Improve Assistance in ITS

    ERIC Educational Resources Information Center

    Fournier-Viger, Philippe; Faghihi, Usef; Nkambou, Roger; Nguifo, Engelbert Mephu

    2010-01-01

    We propose to mine temporal patterns in Intelligent Tutoring Systems (ITSs) to uncover useful knowledge that can enhance their ability to provide assistance. To discover patterns, we suggest using a custom, sequential pattern-mining algorithm. Two ways of applying the algorithm to enhance an ITS's capabilities are addressed. The first is to…

  16. Efficiently hiding sensitive itemsets with transaction deletion based on genetic algorithms.

    PubMed

    Lin, Chun-Wei; Zhang, Binbin; Yang, Kuo-Tung; Hong, Tzung-Pei

    2014-01-01

    Data mining is used to mine meaningful and useful information or knowledge from a very large database. Some secure or private information can be discovered by data mining techniques, thus resulting in an inherent risk of threats to privacy. Privacy-preserving data mining (PPDM) has thus arisen in recent years to sanitize the original database for hiding sensitive information, which can be concerned as an NP-hard problem in sanitization process. In this paper, a compact prelarge GA-based (cpGA2DT) algorithm to delete transactions for hiding sensitive itemsets is thus proposed. It solves the limitations of the evolutionary process by adopting both the compact GA-based (cGA) mechanism and the prelarge concept. A flexible fitness function with three adjustable weights is thus designed to find the appropriate transactions to be deleted in order to hide sensitive itemsets with minimal side effects of hiding failure, missing cost, and artificial cost. Experiments are conducted to show the performance of the proposed cpGA2DT algorithm compared to the simple GA-based (sGA2DT) algorithm and the greedy approach in terms of execution time and three side effects.

  17. Data Mining and Optimization Tools for Developing Engine Parameters Tools

    NASA Technical Reports Server (NTRS)

    Dhawan, Atam P.

    1998-01-01

    This project was awarded for understanding the problem and developing a plan for Data Mining tools for use in designing and implementing an Engine Condition Monitoring System. Tricia Erhardt and I studied the problem domain for developing an Engine Condition Monitoring system using the sparse and non-standardized datasets to be available through a consortium at NASA Lewis Research Center. We visited NASA three times to discuss additional issues related to dataset which was not made available to us. We discussed and developed a general framework of data mining and optimization tools to extract useful information from sparse and non-standard datasets. These discussions lead to the training of Tricia Erhardt to develop Genetic Algorithm based search programs which were written in C++ and used to demonstrate the capability of GA algorithm in searching an optimal solution in noisy, datasets. From the study and discussion with NASA LeRC personnel, we then prepared a proposal, which is being submitted to NASA for future work for the development of data mining algorithms for engine conditional monitoring. The proposed set of algorithm uses wavelet processing for creating multi-resolution pyramid of tile data for GA based multi-resolution optimal search.

  18. MISAGA: An Algorithm for Mining Interesting Subgraphs in Attributed Graphs.

    PubMed

    He, Tiantian; Chan, Keith C C

    2018-05-01

    An attributed graph contains vertices that are associated with a set of attribute values. Mining clusters or communities, which are interesting subgraphs in the attributed graph is one of the most important tasks of graph analytics. Many problems can be defined as the mining of interesting subgraphs in attributed graphs. Algorithms that discover subgraphs based on predefined topologies cannot be used to tackle these problems. To discover interesting subgraphs in the attributed graph, we propose an algorithm called mining interesting subgraphs in attributed graph algorithm (MISAGA). MISAGA performs its tasks by first using a probabilistic measure to determine whether the strength of association between a pair of attribute values is strong enough to be interesting. Given the interesting pairs of attribute values, then the degree of association is computed for each pair of vertices using an information theoretic measure. Based on the edge structure and degree of association between each pair of vertices, MISAGA identifies interesting subgraphs by formulating it as a constrained optimization problem and solves it by identifying the optimal affiliation of subgraphs for the vertices in the attributed graph. MISAGA has been tested with several large-sized real graphs and is found to be potentially very useful for various applications.

  19. Analysis of data mining classification by comparison of C4.5 and ID algorithms

    NASA Astrophysics Data System (ADS)

    Sudrajat, R.; Irianingsih, I.; Krisnawan, D.

    2017-01-01

    The rapid development of information technology, triggered by the intensive use of information technology. For example, data mining widely used in investment. Many techniques that can be used assisting in investment, the method that used for classification is decision tree. Decision tree has a variety of algorithms, such as C4.5 and ID3. Both algorithms can generate different models for similar data sets and different accuracy. C4.5 and ID3 algorithms with discrete data provide accuracy are 87.16% and 99.83% and C4.5 algorithm with numerical data is 89.69%. C4.5 and ID3 algorithms with discrete data provides 520 and 598 customers and C4.5 algorithm with numerical data is 546 customers. From the analysis of the both algorithm it can classified quite well because error rate less than 15%.

  20. Application of data mining approaches to drug delivery.

    PubMed

    Ekins, Sean; Shimada, Jun; Chang, Cheng

    2006-11-30

    Computational approaches play a key role in all areas of the pharmaceutical industry from data mining, experimental and clinical data capture to pharmacoeconomics and adverse events monitoring. They will likely continue to be indispensable assets along with a growing library of software applications. This is primarily due to the increasingly massive amount of biology, chemistry and clinical data, which is now entering the public domain mainly as a result of NIH and commercially funded projects. We are therefore in need of new methods for mining this mountain of data in order to enable new hypothesis generation. The computational approaches include, but are not limited to, database compilation, quantitative structure activity relationships (QSAR), pharmacophores, network visualization models, decision trees, machine learning algorithms and multidimensional data visualization software that could be used to improve drug delivery after mining public and/or proprietary data. We will discuss some areas of unmet needs in the area of data mining for drug delivery that can be addressed with new software tools or databases of relevance to future pharmaceutical projects.

  1. Improving machine operation management efficiency via improving the vehicle park structure and using the production operation information database

    NASA Astrophysics Data System (ADS)

    Koptev, V. Yu

    2017-02-01

    The work represents the results of studying basic interconnected criteria of separate equipment units of the transport network machines fleet, depending on production and mining factors to improve the transport systems management. Justifying the selection of a control system necessitates employing new methodologies and models, augmented with stability and transport flow criteria, accounting for mining work development dynamics on mining sites. A necessary condition is the accounting of technical and operating parameters related to vehicle operation. Modern open pit mining dispatching systems must include such kinds of the information database. An algorithm forming a machine fleet is presented based on multi-variation task solution in connection with defining reasonable operating features of a machine working as a part of a complex. Proposals cited in the work may apply to mining machines (drilling equipment, excavators) and construction equipment (bulldozers, cranes, pile-drivers), city transport and other types of production activities using machine fleet.

  2. A study on eco-environmental vulnerability of mining cities: a case study of Panzhihua city of Sichuan province in China

    NASA Astrophysics Data System (ADS)

    Shao, Huaiyong; Xian, Wei; Yang, Wunian

    2009-07-01

    The large-scale and super-strength development of mineral resources in mining cities in long term has made great contributions to China's economic construction and development, but it has caused serious damage to the ecological environment even ecological imbalance at the same time because the neglect of the environmental impact even to the expense of the environment to some extent. In this study, according to the characteristics of mining cities, the scientific and practical eco-environmental vulnerability evaluation index system of mining cities had been established. Taking Panzhihua city of Sichuan province as an example, using remote sensing and GIS technology, applying various types of remote sensing image (TM, SPOT5, IKONOS) and Statistical data, the ecological environment evaluation data of mining cities was extracted effectively. For the non-linear relationship between the evaluation indexes and the degree of eco-environmental vulnerability in mining cities, this study innovative took the evaluation of eco-environmental vulnerability of the study area by using artificial neural network whose training used SCE-UA algorithm that well overcome the slow learning and difficult convergence of traditional neural network algorithm. The results of ecoenvironmental vulnerability evaluation of the study area were objective, reasonable and the credibility was high. The results showed that the area distribution of five eco-environmental vulnerability grade types was basically normal, and the overall ecological environment situation of Panzhihua city was in the middle level, the degree of eco-environmental vulnerability in the south was higher than the north, and mining activities were dominant factors to cause ecoenvironmental damage and eco-environmental Vulnerability. In this study, a comprehensive theory and technology system of regional eco-environmental vulnerability evaluation which included the establishment of eco-environmental vulnerability evaluation index system, processing of evaluation data and establishing of evaluation model. New ideas and methods had provided for eco-environmental vulnerability of mining cities.

  3. Astrophysical data mining with GPU. A case study: Genetic classification of globular clusters

    NASA Astrophysics Data System (ADS)

    Cavuoti, S.; Garofalo, M.; Brescia, M.; Paolillo, M.; Pescape', A.; Longo, G.; Ventre, G.

    2014-01-01

    We present a multi-purpose genetic algorithm, designed and implemented with GPGPU/CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource, http://dame.dsf.unina.it/beta_info.html), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200× in the training phase with respect to the CPU based version.

  4. Semi-automated based ground-truthing GUI for airborne imagery

    NASA Astrophysics Data System (ADS)

    Phan, Chung; Lydic, Rich; Moore, Tim; Trang, Anh; Agarwal, Sanjeev; Tiwari, Spandan

    2005-06-01

    Over the past several years, an enormous amount of airborne imagery consisting of various formats has been collected and will continue into the future to support airborne mine/minefield detection processes, improve algorithm development, and aid in imaging sensor development. The ground-truthing of imagery is a very essential part of the algorithm development process to help validate the detection performance of the sensor and improving algorithm techniques. The GUI (Graphical User Interface) called SemiTruth was developed using Matlab software incorporating signal processing, image processing, and statistics toolboxes to aid in ground-truthing imagery. The semi-automated ground-truthing GUI is made possible with the current data collection method, that is including UTM/GPS (Universal Transverse Mercator/Global Positioning System) coordinate measurements for the mine target and fiducial locations on the given minefield layout to support in identification of the targets on the raw imagery. This semi-automated ground-truthing effort has developed by the US Army RDECOM CERDEC Night Vision and Electronic Sensors Directorate (NVESD), Countermine Division, Airborne Application Branch with some support by the University of Missouri-Rolla.

  5. Differentiation of closely related isomers: application of data mining techniques in conjunction with variable wavelength infrared multiple photon dissociation mass spectrometry for identification of glucose-containing disaccharide ions.

    PubMed

    Stefan, Sarah E; Ehsan, Mohammad; Pearson, Wright L; Aksenov, Alexander; Boginski, Vladimir; Bendiak, Brad; Eyler, John R

    2011-11-15

    Data mining algorithms have been used to analyze the infrared multiple photon dissociation (IRMPD) patterns of gas-phase lithiated disaccharide isomers irradiated with either a line-tunable CO(2) laser or a free electron laser (FEL). The IR fragmentation patterns over the wavelength range of 9.2-10.6 μm have been shown in earlier work to correlate uniquely with the asymmetry at the anomeric carbon in each disaccharide. Application of data mining approaches for data analysis allowed unambiguous determination of the anomeric carbon configurations for each disaccharide isomer pair using fragmentation data at a single wavelength. In addition, the linkage positions were easily assigned. This combination of wavelength-selective IRMPD and data mining offers a powerful and convenient tool for differentiation of structurally closely related isomers, including those of gas-phase carbohydrate complexes.

  6. An Evaluation of Pixel-Based Methods for the Detection of Floating Objects on the Sea Surface

    NASA Astrophysics Data System (ADS)

    Borghgraef, Alexander; Barnich, Olivier; Lapierre, Fabian; Van Droogenbroeck, Marc; Philips, Wilfried; Acheroy, Marc

    2010-12-01

    Ship-based automatic detection of small floating objects on an agitated sea surface remains a hard problem. Our main concern is the detection of floating mines, which proved a real threat to shipping in confined waterways during the first Gulf War, but applications include salvaging, search-and-rescue operation, perimeter, or harbour defense. Detection in infrared (IR) is challenging because a rough sea is seen as a dynamic background of moving objects with size order, shape, and temperature similar to those of the floating mine. In this paper we have applied a selection of background subtraction algorithms to the problem, and we show that the recent algorithms such as ViBe and behaviour subtraction, which take into account spatial and temporal correlations within the dynamic scene, significantly outperform the more conventional parametric techniques, with only little prior assumptions about the physical properties of the scene.

  7. Use of data mining to predict significant factors and benefits of bilateral cochlear implantation.

    PubMed

    Ramos-Miguel, Angel; Perez-Zaballos, Teresa; Perez, Daniel; Falconb, Juan Carlos; Ramosb, Angel

    2015-11-01

    Data mining (DM) is a technique used to discover pattern and knowledge from a big amount of data. It uses artificial intelligence, automatic learning, statistics, databases, etc. In this study, DM was successfully used as a predictive tool to assess disyllabic speech test performance in bilateral implanted patients with a success rate above 90%. 60 bilateral sequentially implanted adult patients were included in the study. The DM algorithms developed found correlations between unilateral medical records and Audiological test results and bilateral performance by establishing relevant variables based on two DM techniques: the classifier and the estimation. The nearest neighbor algorithm was implemented in the first case, and the linear regression in the second. The results showed that patients with unilateral disyllabic test results below 70% benefited the most from a bilateral implantation. Finally, it was observed that its benefits decrease as the inter-implant time increases.

  8. Developing and Implementing the Data Mining Algorithms in RAVEN

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sen, Ramazan Sonat; Maljovec, Daniel Patrick; Alfonsi, Andrea

    The RAVEN code is becoming a comprehensive tool to perform probabilistic risk assessment, uncertainty quantification, and verification and validation. The RAVEN code is being developed to support many programs and to provide a set of methodologies and algorithms for advanced analysis. Scientific computer codes can generate enormous amounts of data. To post-process and analyze such data might, in some cases, take longer than the initial software runtime. Data mining algorithms/methods help in recognizing and understanding patterns in the data, and thus discover knowledge in databases. The methodologies used in the dynamic probabilistic risk assessment or in uncertainty and error quantificationmore » analysis couple system/physics codes with simulation controller codes, such as RAVEN. RAVEN introduces both deterministic and stochastic elements into the simulation while the system/physics code model the dynamics deterministically. A typical analysis is performed by sampling values of a set of parameter values. A major challenge in using dynamic probabilistic risk assessment or uncertainty and error quantification analysis for a complex system is to analyze the large number of scenarios generated. Data mining techniques are typically used to better organize and understand data, i.e. recognizing patterns in the data. This report focuses on development and implementation of Application Programming Interfaces (APIs) for different data mining algorithms, and the application of these algorithms to different databases.« less

  9. Function Clustering Self-Organization Maps (FCSOMs) for mining differentially expressed genes in Drosophila and its correlation with the growth medium.

    PubMed

    Liu, L L; Liu, M J; Ma, M

    2015-09-28

    The central task of this study was to mine the gene-to-medium relationship. Adequate knowledge of this relationship could potentially improve the accuracy of differentially expressed gene mining. One of the approaches to differentially expressed gene mining uses conventional clustering algorithms to identify the gene-to-medium relationship. Compared to conventional clustering algorithms, self-organization maps (SOMs) identify the nonlinear aspects of the gene-to-medium relationships by mapping the input space into another higher dimensional feature space. However, SOMs are not suitable for huge datasets consisting of millions of samples. Therefore, a new computational model, the Function Clustering Self-Organization Maps (FCSOMs), was developed. FCSOMs take advantage of the theory of granular computing as well as advanced statistical learning methodologies, and are built specifically for each information granule (a function cluster of genes), which are intelligently partitioned by the clustering algorithm provided by the DAVID_6.7 software platform. However, only the gene functions, and not their expression values, are considered in the fuzzy clustering algorithm of DAVID. Compared to the clustering algorithm of DAVID, these experimental results show a marked improvement in the accuracy of classification with the application of FCSOMs. FCSOMs can handle huge datasets and their complex classification problems, as each FCSOM (modeled for each function cluster) can be easily parallelized.

  10. Knowledge discovery through games and game theory

    NASA Astrophysics Data System (ADS)

    Smith, James F., III; Rhyne, Robert D.

    2001-03-01

    A fuzzy logic based expert system has been developed that automatically allocates electronic attack (EA) resources in real-time over many dissimilar platforms. The platforms can be very general, e.g., ships, planes, robots, land based facilities, etc. Potential foes the platforms deal with can also be general. The initial version of the algorithm was optimized using a genetic algorithm employing fitness functions constructed based on expertise. A new approach is being explored that involves embedding the resource manager in a electronic game environment. The game allows a human expert to play against the resource manager in a simulated battlespace with each of the defending platforms being exclusively directed by the fuzzy resource manager and the attacking platforms being controlled by the human expert or operating autonomously under their own logic. This approach automates the data mining problem. The game automatically creates a database reflecting the domain expert's knowledge, it calls a data mining function, a genetic algorithm, for data mining of the database as required. The game allows easy evaluation of the information mined in the second step. The measure of effectiveness (MOE) for re-optimization is discussed. The mined information is extremely valuable as shown through demanding scenarios.

  11. Health Terrain: Visualizing Large Scale Health Data

    DTIC Science & Technology

    2015-12-01

    Text mining ; Data mining . 16. SECURITY  CLASSIFICATION  OF: 17... text   mining  algorithms  to  construct  a  concept  space.  A   browser-­‐based  user  interface  is  developed  to...Public  health  data,  Notifiable  condition  detector,   Text   mining ,  Data   mining   4 of 29 Disease Patient Location Term

  12. Pattern Discovery and Change Detection of Online Music Query Streams

    NASA Astrophysics Data System (ADS)

    Li, Hua-Fu

    In this paper, an efficient stream mining algorithm, called FTP-stream (Frequent Temporal Pattern mining of streams), is proposed to find the frequent temporal patterns over melody sequence streams. In the framework of our proposed algorithm, an effective bit-sequence representation is used to reduce the time and memory needed to slide the windows. The FTP-stream algorithm can calculate the support threshold in only a single pass based on the concept of bit-sequence representation. It takes the advantage of "left" and "and" operations of the representation. Experiments show that the proposed algorithm only scans the music query stream once, and runs significant faster and consumes less memory than existing algorithms, such as SWFI-stream and Moment.

  13. Customizing FP-growth algorithm to parallel mining with Charm++ library

    NASA Astrophysics Data System (ADS)

    Puścian, Marek

    2017-08-01

    This paper presents a frequent item mining algorithm that was customized to handle growing data repositories. The proposed solution applies Master Slave scheme to frequent pattern growth technique. Efficient utilization of available computation units is achieved by dynamic reallocation of tasks. Conditional frequent trees are assigned to parallel workers basing on their workload. Proposed enhancements have been successfully implemented using Charm++ library. This paper discusses results of the performance of parallelized FP-growth algorithm against different datasets. The approach has been illustrated with many experiments and measurements performed using multiprocessor and multithreaded computer.

  14. Effective and efficient analysis of spatio-temporal data

    NASA Astrophysics Data System (ADS)

    Zhang, Zhongnan

    Spatio-temporal data mining, i.e., mining knowledge from large amount of spatio-temporal data, is a highly demanding field because huge amounts of spatio-temporal data have been collected in various applications, ranging from remote sensing, to geographical information systems (GIS), computer cartography, environmental assessment and planning, etc. The collection data far exceeded human's ability to analyze which make it crucial to develop analysis tools. Recent studies on data mining have extended to the scope of data mining from relational and transactional datasets to spatial and temporal datasets. Among the various forms of spatio-temporal data, remote sensing images play an important role, due to the growing wide-spreading of outer space satellites. In this dissertation, we proposed two approaches to analyze the remote sensing data. The first one is about applying association rules mining onto images processing. Each image was divided into a number of image blocks. We built a spatial relationship for these blocks during the dividing process. This made a large number of images into a spatio-temporal dataset since each image was shot in time-series. The second one implemented co-occurrence patterns discovery from these images. The generated patterns represent subsets of spatial features that are located together in space and time. A weather analysis is composed of individual analysis of several meteorological variables. These variables include temperature, pressure, dew point, wind, clouds, visibility and so on. Local-scale models provide detailed analysis and forecasts of meteorological phenomena ranging from a few kilometers to about 100 kilometers in size. When some of above meteorological variables have some special change tendency, some kind of severe weather will happen in most cases. Using the discovery of association rules, we found that some special meteorological variables' changing has tight relation with some severe weather situation that will happen very soon. This dissertation is composed of three parts: an introduction, some basic knowledges and relative works, and my own three contributions to the development of approaches for spatio-temporal data mining: DYSTAL algorithm, STARSI algorithm, and COSTCOP+ algorithm.

  15. Educational Data Mining Application for Estimating Students Performance in Weka Environment

    NASA Astrophysics Data System (ADS)

    Gowri, G. Shiyamala; Thulasiram, Ramasamy; Amit Baburao, Mahindra

    2017-11-01

    Educational data mining (EDM) is a multi-disciplinary research area that examines artificial intelligence, statistical modeling and data mining with the data generated from an educational institution. EDM utilizes computational ways to deal with explicate educational information keeping in mind the end goal to examine educational inquiries. To make a country stand unique among the other nations of the world, the education system has to undergo a major transition by redesigning its framework. The concealed patterns and data from various information repositories can be extracted by adopting the techniques of data mining. In order to summarize the performance of students with their credentials, we scrutinize the exploitation of data mining in the field of academics. Apriori algorithmic procedure is extensively applied to the database of students for a wider classification based on various categorizes. K-means procedure is applied to the same set of databases in order to accumulate them into a specific category. Apriori algorithm deals with mining the rules in order to extract patterns that are similar along with their associations in relation to various set of records. The records can be extracted from academic information repositories. The parameters used in this study gives more importance to psychological traits than academic features. The undesirable student conduct can be clearly witnessed if we make use of information mining frameworks. Thus, the algorithms efficiently prove to profile the students in any educational environment. The ultimate objective of the study is to suspect if a student is prone to violence or not.

  16. A primer to frequent itemset mining for bioinformatics

    PubMed Central

    Naulaerts, Stefan; Meysman, Pieter; Bittremieux, Wout; Vu, Trung Nghia; Vanden Berghe, Wim; Goethals, Bart

    2015-01-01

    Over the past two decades, pattern mining techniques have become an integral part of many bioinformatics solutions. Frequent itemset mining is a popular group of pattern mining techniques designed to identify elements that frequently co-occur. An archetypical example is the identification of products that often end up together in the same shopping basket in supermarket transactions. A number of algorithms have been developed to address variations of this computationally non-trivial problem. Frequent itemset mining techniques are able to efficiently capture the characteristics of (complex) data and succinctly summarize it. Owing to these and other interesting properties, these techniques have proven their value in biological data analysis. Nevertheless, information about the bioinformatics applications of these techniques remains scattered. In this primer, we introduce frequent itemset mining and their derived association rules for life scientists. We give an overview of various algorithms, and illustrate how they can be used in several real-life bioinformatics application domains. We end with a discussion of the future potential and open challenges for frequent itemset mining in the life sciences. PMID:24162173

  17. Boosting association rule mining in large datasets via Gibbs sampling.

    PubMed

    Qian, Guoqi; Rao, Calyampudi Radhakrishna; Sun, Xiaoying; Wu, Yuehua

    2016-05-03

    Current algorithms for association rule mining from transaction data are mostly deterministic and enumerative. They can be computationally intractable even for mining a dataset containing just a few hundred transaction items, if no action is taken to constrain the search space. In this paper, we develop a Gibbs-sampling-induced stochastic search procedure to randomly sample association rules from the itemset space, and perform rule mining from the reduced transaction dataset generated by the sample. Also a general rule importance measure is proposed to direct the stochastic search so that, as a result of the randomly generated association rules constituting an ergodic Markov chain, the overall most important rules in the itemset space can be uncovered from the reduced dataset with probability 1 in the limit. In the simulation study and a real genomic data example, we show how to boost association rule mining by an integrated use of the stochastic search and the Apriori algorithm.

  18. MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kannan, Ramakrishnan; Ballard, Grey; Park, Haesun

    Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A≈WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms thatmore » iteratively solves alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of size that spans from few hundreds of millions to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.« less

  19. MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization

    DOE PAGES

    Kannan, Ramakrishnan; Ballard, Grey; Park, Haesun

    2017-10-30

    Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A≈WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient parallel algorithms to solve the problem for big data sets. The main contribution of this work is a new, high-performance parallel computational framework for a broad class of NMF algorithms thatmore » iteratively solves alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). The framework is flexible and able to leverage a variety of NMF and NLS algorithms, including Multiplicative Update, Hierarchical Alternating Least Squares, and Block Principal Pivoting. Our implementation allows us to benchmark and compare different algorithms on massive dense and sparse data matrices of size that spans from few hundreds of millions to billions. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements. The code and the datasets used for conducting the experiments are available online.« less

  20. The Algorithm of Development the World Ocean Mining of the Industry During the Global Crisis

    NASA Astrophysics Data System (ADS)

    Nyrkov, Anatoliy; Budnik, Vladislav; Sokolov, Sergei; Chernyi, Sergei

    2016-08-01

    In the article reviewed extraction effect of hydrocarbons on the general country's developing, under the impact of economical, demographical and technological factors, as well as it's future role in the world energy balance. Also adduced facts which designate offshore and deep water production of unconventional and conventional hydrocarbons including mining of marine mineral resources as perspective area of development in the future, despite all the difficulties of this sector. In the article considered the state and prospects of the Russian continental shelf, in consideration of its geographical location and its all existing problems.

  1. Performance analysis of a multispectral framing camera for detecting mines in the littoral zone and beach zone

    NASA Astrophysics Data System (ADS)

    Louchard, Eric; Farm, Brian; Acker, Andrew

    2008-04-01

    BAE Systems Sensor Systems Identification & Surveillance (IS) has developed, under contract with the Office of Naval Research, a multispectral airborne sensor system and processing algorithms capable of detecting mine-like objects in the surf zone and land mines in the beach zone. BAE Systems has used this system in a blind test at a test range established by the Naval Surface Warfare Center - Panama City Division (NSWC-PCD) at Eglin Air Force Base. The airborne and ground subsystems used in this test are described, with graphical illustrations of the detection algorithms. We report on the performance of the system configured to operate with a human operator analyzing data on a ground station. A subsurface (underwater bottom proud mine in the surf zone and moored mine in shallow water) mine detection capability is demonstrated in the surf zone. Surface float detection and proud land mine detection capability is also demonstrated. Our analysis shows that this BAE Systems-developed multispectral airborne sensor provides a robust technical foundation for a viable system for mine counter-measures, and would be a valuable asset for use prior to an amphibious assault.

  2. Co-evolutionary data mining for fuzzy rules: automatic fitness function creation phase space, and experiments

    NASA Astrophysics Data System (ADS)

    Smith, James F., III; Blank, Joseph A.

    2003-03-01

    An approach is being explored that involves embedding a fuzzy logic based resource manager in an electronic game environment. Game agents can function under their own autonomous logic or human control. This approach automates the data mining problem. The game automatically creates a cleansed database reflecting the domain expert's knowledge, it calls a data mining function, a genetic algorithm, for data mining of the data base as required and allows easy evaluation of the information extracted. The co-evolutionary fitness functions, chromosomes and stopping criteria for ending the game are discussed. Genetic algorithm and genetic program based data mining procedures are discussed that automatically discover new fuzzy rules and strategies. The strategy tree concept and its relationship to co-evolutionary data mining are examined as well as the associated phase space representation of fuzzy concepts. The overlap of fuzzy concepts in phase space reduces the effective strategies available to adversaries. Co-evolutionary data mining alters the geometric properties of the overlap region known as the admissible region of phase space significantly enhancing the performance of the resource manager. Procedures for validation of the information data mined are discussed and significant experimental results provided.

  3. Implementation of a Flexible Tool for Automated Literature-Mining and Knowledgebase Development (DevToxMine)

    EPA Science Inventory

    Deriving novel relationships from the scientific literature is an important adjunct to datamining activities for complex datasets in genomics and high-throughput screening activities. Automated text-mining algorithms can be used to extract relevant content from the literature and...

  4. Protective and control relays as coal-mine power-supply ACS subsystem

    NASA Astrophysics Data System (ADS)

    Kostin, V. N.; Minakova, T. E.

    2017-10-01

    The paper presents instantaneous selective short-circuit protection for the cabling of the underground part of a coal mine and central control algorithms as a Coal-Mine Power-Supply ACS Subsystem. In order to improve the reliability of electricity supply and reduce the mining equipment down-time, a dual channel relay protection and central control system is proposed as a subsystem of the coal-mine power-supply automated control system (PS ACS).

  5. MoleculaRnetworks: an integrated graph theoretic and data mining tool to explore solvent organization in molecular simulation.

    PubMed

    Mooney, Barbara Logan; Corrales, L René; Clark, Aurora E

    2012-03-30

    This work discusses scripts for processing molecular simulations data written using the software package R: A Language and Environment for Statistical Computing. These scripts, named moleculaRnetworks, are intended for the geometric and solvent network analysis of aqueous solutes and can be extended to other H-bonded solvents. New algorithms, several of which are based on graph theory, that interrogate the solvent environment about a solute are presented and described. This includes a novel method for identifying the geometric shape adopted by the solvent in the immediate vicinity of the solute and an exploratory approach for describing H-bonding, both based on the PageRank algorithm of Google search fame. The moleculaRnetworks codes include a preprocessor, which distills simulation trajectories into physicochemical data arrays, and an interactive analysis script that enables statistical, trend, and correlation analysis, and other data mining. The goal of these scripts is to increase access to the wealth of structural and dynamical information that can be obtained from molecular simulations. Copyright © 2012 Wiley Periodicals, Inc.

  6. A Genetic Algorithm That Exchanges Neighboring Centers for Fuzzy c-Means Clustering

    ERIC Educational Resources Information Center

    Chahine, Firas Safwan

    2012-01-01

    Clustering algorithms are widely used in pattern recognition and data mining applications. Due to their computational efficiency, partitional clustering algorithms are better suited for applications with large datasets than hierarchical clustering algorithms. K-means is among the most popular partitional clustering algorithm, but has a major…

  7. Tree-based approach for exploring marine spatial patterns with raster datasets.

    PubMed

    Liao, Xiaohan; Xue, Cunjin; Su, Fenzhen

    2017-01-01

    From multiple raster datasets to spatial association patterns, the data-mining technique is divided into three subtasks, i.e., raster dataset pretreatment, mining algorithm design, and spatial pattern exploration from the mining results. Comparison with the former two subtasks reveals that the latter remains unresolved. Confronted with the interrelated marine environmental parameters, we propose a Tree-based Approach for eXploring Marine Spatial Patterns with multiple raster datasets called TAXMarSP, which includes two models. One is the Tree-based Cascading Organization Model (TCOM), and the other is the Spatial Neighborhood-based CAlculation Model (SNCAM). TCOM designs the "Spatial node→Pattern node" from top to bottom layers to store the table-formatted frequent patterns. Together with TCOM, SNCAM considers the spatial neighborhood contributions to calculate the pattern-matching degree between the specified marine parameters and the table-formatted frequent patterns and then explores the marine spatial patterns. Using the prevalent quantification Apriori algorithm and a real remote sensing dataset from January 1998 to December 2014, a successful application of TAXMarSP to marine spatial patterns in the Pacific Ocean is described, and the obtained marine spatial patterns present not only the well-known but also new patterns to Earth scientists.

  8. Evaluating data mining algorithms using molecular dynamics trajectories.

    PubMed

    Tatsis, Vasileios A; Tjortjis, Christos; Tzirakis, Panagiotis

    2013-01-01

    Molecular dynamics simulations provide a sample of a molecule's conformational space. Experiments on the mus time scale, resulting in large amounts of data, are nowadays routine. Data mining techniques such as classification provide a way to analyse such data. In this work, we evaluate and compare several classification algorithms using three data sets which resulted from computer simulations, of a potential enzyme mimetic biomolecule. We evaluated 65 classifiers available in the well-known data mining toolkit Weka, using 'classification' errors to assess algorithmic performance. Results suggest that: (i) 'meta' classifiers perform better than the other groups, when applied to molecular dynamics data sets; (ii) Random Forest and Rotation Forest are the best classifiers for all three data sets; and (iii) classification via clustering yields the highest classification error. Our findings are consistent with bibliographic evidence, suggesting a 'roadmap' for dealing with such data.

  9. Spatial-temporal travel pattern mining using massive taxi trajectory data

    NASA Astrophysics Data System (ADS)

    Zheng, Linjiang; Xia, Dong; Zhao, Xin; Tan, Longyou; Li, Hang; Chen, Li; Liu, Weining

    2018-07-01

    Deep understanding of residents' travel patterns would provide helpful insights into the mechanisms of many socioeconomic phenomena. With the rapid development of location-aware computing technologies, researchers have easy access to large quantities of travel data. As an important data source, taxi trajectory data are featured by their high quality, good continuity and wide distribution, making it suitable for travel pattern mining. In this paper, we use taxi trajectory data to study spatial-temporal characterization of urban residents' travel patterns from two aspects: attractive areas and hot paths. Firstly, a framework of trajectory preprocessing, including data cleaning and extracting the taxi passenger pick-up/drop-off points, is presented to reduce the noise and redundancy in raw trajectory data. Then, a grid density based clustering algorithm is proposed to discover travel attractive areas in different periods of a day. On this basis, we put forward a spatial-temporal trajectory clustering method to discover hot paths among travel attractive areas. Compared with previous algorithms, which only consider the spatial constraint between trajectories, temporal constraint is also considered in our method. Through the experiments, we discuss how to determine the optimal parameters of the two clustering algorithms and verify the effectiveness of the algorithms using real data. Furthermore, we analyze spatial-temporal characterization of Chongqing residents' travel pattern.

  10. Electro-Optic Identification (EOID) Research Program

    DTIC Science & Technology

    2002-09-30

    The goal of this research is to provide computer-assisted identification of underwater mines in electro - optic imagery. Identification algorithms will...greatly reduce the time and risk to reacquire mine-like-objects for positive classification and identification. The objectives are to collect electro ... optic data under a wide range of operating and environmental conditions and develop precise algorithms that can provide accurate target recognition on this data for all possible conditions.

  11. Attribute index and uniform design based multiobjective association rule mining with evolutionary algorithm.

    PubMed

    Zhang, Jie; Wang, Yuping; Feng, Junhong

    2013-01-01

    In association rule mining, evaluating an association rule needs to repeatedly scan database to compare the whole database with the antecedent, consequent of a rule and the whole rule. In order to decrease the number of comparisons and time consuming, we present an attribute index strategy. It only needs to scan database once to create the attribute index of each attribute. Then all metrics values to evaluate an association rule do not need to scan database any further, but acquire data only by means of the attribute indices. The paper visualizes association rule mining as a multiobjective problem rather than a single objective one. In order to make the acquired solutions scatter uniformly toward the Pareto frontier in the objective space, elitism policy and uniform design are introduced. The paper presents the algorithm of attribute index and uniform design based multiobjective association rule mining with evolutionary algorithm, abbreviated as IUARMMEA. It does not require the user-specified minimum support and minimum confidence anymore, but uses a simple attribute index. It uses a well-designed real encoding so as to extend its application scope. Experiments performed on several databases demonstrate that the proposed algorithm has excellent performance, and it can significantly reduce the number of comparisons and time consumption.

  12. Attribute Index and Uniform Design Based Multiobjective Association Rule Mining with Evolutionary Algorithm

    PubMed Central

    Wang, Yuping; Feng, Junhong

    2013-01-01

    In association rule mining, evaluating an association rule needs to repeatedly scan database to compare the whole database with the antecedent, consequent of a rule and the whole rule. In order to decrease the number of comparisons and time consuming, we present an attribute index strategy. It only needs to scan database once to create the attribute index of each attribute. Then all metrics values to evaluate an association rule do not need to scan database any further, but acquire data only by means of the attribute indices. The paper visualizes association rule mining as a multiobjective problem rather than a single objective one. In order to make the acquired solutions scatter uniformly toward the Pareto frontier in the objective space, elitism policy and uniform design are introduced. The paper presents the algorithm of attribute index and uniform design based multiobjective association rule mining with evolutionary algorithm, abbreviated as IUARMMEA. It does not require the user-specified minimum support and minimum confidence anymore, but uses a simple attribute index. It uses a well-designed real encoding so as to extend its application scope. Experiments performed on several databases demonstrate that the proposed algorithm has excellent performance, and it can significantly reduce the number of comparisons and time consumption. PMID:23766683

  13. Using text-mining techniques in electronic patient records to identify ADRs from medicine use.

    PubMed

    Warrer, Pernille; Hansen, Ebba Holme; Juhl-Jensen, Lars; Aagaard, Lise

    2012-05-01

    This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time from simple free text searching of outpatient visit notes and inpatient discharge summaries to more advanced techniques involving natural language processing (NLP) of inpatient discharge summaries. Performance appeared to increase with the use of NLP, although many ADRs were still missed. Due to differences in study design and populations, various types of ADRs were identified and thus we could not make comparisons across studies. The review underscores the feasibility and potential of text mining to investigate narrative documents in EPRs for ADRs. However, more empirical studies are needed to evaluate whether text mining of EPRs can be used systematically to collect new information about ADRs. © 2011 The Authors. British Journal of Clinical Pharmacology © 2011 The British Pharmacological Society.

  14. Using text-mining techniques in electronic patient records to identify ADRs from medicine use

    PubMed Central

    Warrer, Pernille; Hansen, Ebba Holme; Juhl-Jensen, Lars; Aagaard, Lise

    2012-01-01

    This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time from simple free text searching of outpatient visit notes and inpatient discharge summaries to more advanced techniques involving natural language processing (NLP) of inpatient discharge summaries. Performance appeared to increase with the use of NLP, although many ADRs were still missed. Due to differences in study design and populations, various types of ADRs were identified and thus we could not make comparisons across studies. The review underscores the feasibility and potential of text mining to investigate narrative documents in EPRs for ADRs. However, more empirical studies are needed to evaluate whether text mining of EPRs can be used systematically to collect new information about ADRs. PMID:22122057

  15. A Contextualized, Differential Sequence Mining Method to Derive Students' Learning Behavior Patterns

    ERIC Educational Resources Information Center

    Kinnebrew, John S.; Loretz, Kirk M.; Biswas, Gautam

    2013-01-01

    Computer-based learning environments can produce a wealth of data on student learning interactions. This paper presents an exploratory data mining methodology for assessing and comparing students' learning behaviors from these interaction traces. The core algorithm employs a novel combination of sequence mining techniques to identify deferentially…

  16. An AK-LDMeans algorithm based on image clustering

    NASA Astrophysics Data System (ADS)

    Chen, Huimin; Li, Xingwei; Zhang, Yongbin; Chen, Nan

    2018-03-01

    Clustering is an effective analytical technique for handling unmarked data for value mining. Its ultimate goal is to mark unclassified data quickly and correctly. We use the roadmap for the current image processing as the experimental background. In this paper, we propose an AK-LDMeans algorithm to automatically lock the K value by designing the Kcost fold line, and then use the long-distance high-density method to select the clustering centers to further replace the traditional initial clustering center selection method, which further improves the efficiency and accuracy of the traditional K-Means Algorithm. And the experimental results are compared with the current clustering algorithm and the results are obtained. The algorithm can provide effective reference value in the fields of image processing, machine vision and data mining.

  17. Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression

    PubMed Central

    Dipnall, Joanna F.

    2016-01-01

    Background Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. Methods The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009–2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. Results After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers were selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin with Mexican American/Hispanic group (p = 0.016), and current smokers (p<0.001). Conclusion The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin. PMID:26848571

  18. Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression.

    PubMed

    Dipnall, Joanna F; Pasco, Julie A; Berk, Michael; Williams, Lana J; Dodd, Seetal; Jacka, Felice N; Meyer, Denny

    2016-01-01

    Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study. The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009-2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators. After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers were selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin with Mexican American/Hispanic group (p = 0.016), and current smokers (p<0.001). The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.

  19. A new pivoting and iterative text detection algorithm for biomedical images.

    PubMed

    Xu, Songhua; Krauthammer, Michael

    2010-12-01

    There is interest to expand the reach of literature mining to include the analysis of biomedical images, which often contain a paper's key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is used to boost biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating the performance on a set of manually labeled random biomedical images, and compare the performance against other state-of-the-art text detection algorithms. We demonstrate that our projection histogram-based text detection approach is well suited for text detection in biomedical images, and that the iterative application of the algorithm boosts performance to an F score of .60. We provide a C++ implementation of our algorithm freely available for academic use. Copyright © 2010 Elsevier Inc. All rights reserved.

  20. Creating ensembles of oblique decision trees with evolutionary algorithms and sampling

    DOEpatents

    Cantu-Paz, Erick [Oakland, CA; Kamath, Chandrika [Tracy, CA

    2006-06-13

    A decision tree system that is part of a parallel object-oriented pattern recognition system, which in turn is part of an object oriented data mining system. A decision tree process includes the step of reading the data. If necessary, the data is sorted. A potential split of the data is evaluated according to some criterion. An initial split of the data is determined. The final split of the data is determined using evolutionary algorithms and statistical sampling techniques. The data is split. Multiple decision trees are combined in ensembles.

  1. Empirical evaluation of interest-level criteria

    NASA Astrophysics Data System (ADS)

    Sahar, Sigal; Mansour, Yishay

    1999-02-01

    Efficient association rule mining algorithms already exist, however, as the size of databases increases, the number of patterns mined by the algorithms increases to such an extent that their manual evaluation becomes impractical. Automatic evaluation methods are, therefore, required in order to sift through the initial list of rules, which the datamining algorithm outputs. These evaluation methods, or criteria, rank the association rules mined from the dataset. We empirically examined several such statistical criteria: new criteria, as well as previously known ones. The empirical evaluation was conducted using several databases, including a large real-life dataset, acquired from an order-by-phone grocery store, a dataset composed from www proxy logs, and several datasets from the UCI repository. We were interested in discovering whether the ranking performed by the various criteria is similar or easily distinguishable. Our evaluation detected, when significant differences exist, three patterns of behavior in the eight criteria we examined. There is an obvious dilemma in determining how many association rules to choose (in accordance with support and confidence parameters). The tradeoff is between having stringent parameters and, therefore, few rules, or lenient parameters and, thus, a multitude of rules. In many cases, our empirical evaluation revealed that most of the rules found by the comparably strict parameters ranked highly according to the interestingness criteria, when using lax parameters (producing significantly more association rules). Finally, we discuss the association rules that ranked highest, explain why these results are sound, and how they direct future research.

  2. Mining the National Career Assessment Examination Result Using Clustering Algorithm

    NASA Astrophysics Data System (ADS)

    Pagudpud, M. V.; Palaoag, T. T.; Padirayon, L. M.

    2018-03-01

    Education is an essential process today which elicits authorities to discover and establish innovative strategies for educational improvement. This study applied data mining using clustering technique for knowledge extraction from the National Career Assessment Examination (NCAE) result in the Division of Quirino. The NCAE is an examination given to all grade 9 students in the Philippines to assess their aptitudes in the different domains. Clustering the students is helpful in identifying students’ learning considerations. With the use of the RapidMiner tool, clustering algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means, k-medoid, expectation maximization clustering, and support vector clustering algorithms were analyzed. The silhouette indexes of the said clustering algorithms were compared, and the result showed that the k-means algorithm with k = 3 and silhouette index equal to 0.196 is the most appropriate clustering algorithm to group the students. Three groups were formed having 477 students in the determined group (cluster 0), 310 proficient students (cluster 1) and 396 developing students (cluster 2). The data mining technique used in this study is essential in extracting useful information from the NCAE result to better understand the abilities of students which in turn is a good basis for adopting teaching strategies.

  3. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Gray, A.

    2014-04-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors. This is likely of particular interest to the radio astronomy community given, for example, that survey projects contain groups dedicated to this topic. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.

  4. Scalable Machine Learning for Massive Astronomical Datasets

    NASA Astrophysics Data System (ADS)

    Ball, Nicholas M.; Astronomy Data Centre, Canadian

    2014-01-01

    We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors, and the local outlier factor. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.

  5. Application-Specific Graph Sampling for Frequent Subgraph Mining and Community Detection

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Purohit, Sumit; Choudhury, Sutanay; Holder, Lawrence B.

    Graph mining is an important data analysis methodology, but struggles as the input graph size increases. The scalability and usability challenges posed by such large graphs make it imperative to sample the input graph and reduce its size. The critical challenge in sampling is to identify the appropriate algorithm to insure the resulting analysis does not suffer heavily from the data reduction. Predicting the expected performance degradation for a given graph and sampling algorithm is also useful. In this paper, we present different sampling approaches for graph mining applications such as Frequent Subgrpah Mining (FSM), and Community Detection (CD). Wemore » explore graph metrics such as PageRank, Triangles, and Diversity to sample a graph and conclude that for heterogeneous graphs Triangles and Diversity perform better than degree based metrics. We also present two new sampling variations for targeted graph mining applications. We present empirical results to show that knowledge of the target application, along with input graph properties can be used to select the best sampling algorithm. We also conclude that performance degradation is an abrupt, rather than gradual phenomena, as the sample size decreases. We present the empirical results to show that the performance degradation follows a logistic function.« less

  6. A Segment-Based Trajectory Similarity Measure in the Urban Transportation Systems.

    PubMed

    Mao, Yingchi; Zhong, Haishi; Xiao, Xianjian; Li, Xiaofang

    2017-03-06

    With the rapid spread of built-in GPS handheld smart devices, the trajectory data from GPS sensors has grown explosively. Trajectory data has spatio-temporal characteristics and rich information. Using trajectory data processing techniques can mine the patterns of human activities and the moving patterns of vehicles in the intelligent transportation systems. A trajectory similarity measure is one of the most important issues in trajectory data mining (clustering, classification, frequent pattern mining, etc.). Unfortunately, the main similarity measure algorithms with the trajectory data have been found to be inaccurate, highly sensitive of sampling methods, and have low robustness for the noise data. To solve the above problems, three distances and their corresponding computation methods are proposed in this paper. The point-segment distance can decrease the sensitivity of the point sampling methods. The prediction distance optimizes the temporal distance with the features of trajectory data. The segment-segment distance introduces the trajectory shape factor into the similarity measurement to improve the accuracy. The three kinds of distance are integrated with the traditional dynamic time warping algorithm (DTW) algorithm to propose a new segment-based dynamic time warping algorithm (SDTW). The experimental results show that the SDTW algorithm can exhibit about 57%, 86%, and 31% better accuracy than the longest common subsequence algorithm (LCSS), and edit distance on real sequence algorithm (EDR) , and DTW, respectively, and that the sensitivity to the noise data is lower than that those algorithms.

  7. Empirical study of seven data mining algorithms on different characteristics of datasets for biomedical classification applications.

    PubMed

    Zhang, Yiyan; Xin, Yi; Li, Qin; Ma, Jianshe; Li, Shuai; Lv, Xiaodan; Lv, Weiqi

    2017-11-02

    Various kinds of data mining algorithms are continuously raised with the development of related disciplines. The applicable scopes and their performances of these algorithms are different. Hence, finding a suitable algorithm for a dataset is becoming an important emphasis for biomedical researchers to solve practical problems promptly. In this paper, seven kinds of sophisticated active algorithms, namely, C4.5, support vector machine, AdaBoost, k-nearest neighbor, naïve Bayes, random forest, and logistic regression, were selected as the research objects. The seven algorithms were applied to the 12 top-click UCI public datasets with the task of classification, and their performances were compared through induction and analysis. The sample size, number of attributes, number of missing values, and the sample size of each class, correlation coefficients between variables, class entropy of task variable, and the ratio of the sample size of the largest class to the least class were calculated to character the 12 research datasets. The two ensemble algorithms reach high accuracy of classification on most datasets. Moreover, random forest performs better than AdaBoost on the unbalanced dataset of the multi-class task. Simple algorithms, such as the naïve Bayes and logistic regression model are suitable for a small dataset with high correlation between the task and other non-task attribute variables. K-nearest neighbor and C4.5 decision tree algorithms perform well on binary- and multi-class task datasets. Support vector machine is more adept on the balanced small dataset of the binary-class task. No algorithm can maintain the best performance in all datasets. The applicability of the seven data mining algorithms on the datasets with different characteristics was summarized to provide a reference for biomedical researchers or beginners in different fields.

  8. A Recommendation Algorithm for Automating Corollary Order Generation

    PubMed Central

    Klann, Jeffrey; Schadow, Gunther; McCoy, JM

    2009-01-01

    Manual development and maintenance of decision support content is time-consuming and expensive. We explore recommendation algorithms, e-commerce data-mining tools that use collective order history to suggest purchases, to assist with this. In particular, previous work shows corollary order suggestions are amenable to automated data-mining techniques. Here, an item-based collaborative filtering algorithm augmented with association rule interestingness measures mined suggestions from 866,445 orders made in an inpatient hospital in 2007, generating 584 potential corollary orders. Our expert physician panel evaluated the top 92 and agreed 75.3% were clinically meaningful. Also, at least one felt 47.9% would be directly relevant in guideline development. This automated generation of a rough-cut of corollary orders confirms prior indications about automated tools in building decision support content. It is an important step toward computerized augmentation to decision support development, which could increase development efficiency and content quality while automatically capturing local standards. PMID:20351875

  9. A recommendation algorithm for automating corollary order generation.

    PubMed

    Klann, Jeffrey; Schadow, Gunther; McCoy, J M

    2009-11-14

    Manual development and maintenance of decision support content is time-consuming and expensive. We explore recommendation algorithms, e-commerce data-mining tools that use collective order history to suggest purchases, to assist with this. In particular, previous work shows corollary order suggestions are amenable to automated data-mining techniques. Here, an item-based collaborative filtering algorithm augmented with association rule interestingness measures mined suggestions from 866,445 orders made in an inpatient hospital in 2007, generating 584 potential corollary orders. Our expert physician panel evaluated the top 92 and agreed 75.3% were clinically meaningful. Also, at least one felt 47.9% would be directly relevant in guideline development. This automated generation of a rough-cut of corollary orders confirms prior indications about automated tools in building decision support content. It is an important step toward computerized augmentation to decision support development, which could increase development efficiency and content quality while automatically capturing local standards.

  10. COBRA ATD minefield detection model initial performance analysis

    NASA Astrophysics Data System (ADS)

    Holmes, V. Todd; Kenton, Arthur C.; Hilton, Russell J.; Witherspoon, Ned H.; Holloway, John H., Jr.

    2000-08-01

    A statistical performance analysis of the USMC Coastal Battlefield Reconnaissance and Analysis (COBRA) Minefield Detection (MFD) Model has been performed in support of the COBRA ATD Program under execution by the Naval Surface Warfare Center/Dahlgren Division/Coastal Systems Station . This analysis uses the Veridian ERIM International MFD model from the COBRA Sensor Performance Evaluation and Computational Tools for Research Analysis modeling toolbox and a collection of multispectral mine detection algorithm response distributions for mines and minelike clutter objects. These mine detection response distributions were generated form actual COBRA ATD test missions over littoral zone minefields. This analysis serves to validate both the utility and effectiveness of the COBRA MFD Model as a predictive MFD performance too. COBRA ATD minefield detection model algorithm performance results based on a simulate baseline minefield detection scenario are presented, as well as result of a MFD model algorithm parametric sensitivity study.

  11. Advances in Machine Learning and Data Mining for Astronomy

    NASA Astrophysics Data System (ADS)

    Way, Michael J.; Scargle, Jeffrey D.; Ali, Kamal M.; Srivastava, Ashok N.

    2012-03-01

    Advances in Machine Learning and Data Mining for Astronomy documents numerous successful collaborations among computer scientists, statisticians, and astronomers who illustrate the application of state-of-the-art machine learning and data mining techniques in astronomy. Due to the massive amount and complexity of data in most scientific disciplines, the material discussed in this text transcends traditional boundaries between various areas in the sciences and computer science. The book's introductory part provides context to issues in the astronomical sciences that are also important to health, social, and physical sciences, particularly probabilistic and statistical aspects of classification and cluster analysis. The next part describes a number of astrophysics case studies that leverage a range of machine learning and data mining technologies. In the last part, developers of algorithms and practitioners of machine learning and data mining show how these tools and techniques are used in astronomical applications. With contributions from leading astronomers and computer scientists, this book is a practical guide to many of the most important developments in machine learning, data mining, and statistics. It explores how these advances can solve current and future problems in astronomy and looks at how they could lead to the creation of entirely new algorithms within the data mining community.

  12. Identifying Learning Behaviors by Contextualizing Differential Sequence Mining with Action Features and Performance Evolution

    ERIC Educational Resources Information Center

    Kinnebrew, John S.; Biswas, Gautam

    2012-01-01

    Our learning-by-teaching environment, Betty's Brain, captures a wealth of data on students' learning interactions as they teach a virtual agent. This paper extends an exploratory data mining methodology for assessing and comparing students' learning behaviors from these interaction traces. The core algorithm employs sequence mining techniques to…

  13. Data Mining and Privacy of Social Network Sites' Users: Implications of the Data Mining Problem.

    PubMed

    Al-Saggaf, Yeslam; Islam, Md Zahidul

    2015-08-01

    This paper explores the potential of data mining as a technique that could be used by malicious data miners to threaten the privacy of social network sites (SNS) users. It applies a data mining algorithm to a real dataset to provide empirically-based evidence of the ease with which characteristics about the SNS users can be discovered and used in a way that could invade their privacy. One major contribution of this article is the use of the decision forest data mining algorithm (SysFor) to the context of SNS, which does not only build a decision tree but rather a forest allowing the exploration of more logic rules from a dataset. One logic rule that SysFor built in this study, for example, revealed that anyone having a profile picture showing just the face or a picture showing a family is less likely to be lonely. Another contribution of this article is the discussion of the implications of the data mining problem for governments, businesses, developers and the SNS users themselves.

  14. Adaptive semantic tag mining from heterogeneous clinical research texts.

    PubMed

    Hao, T; Weng, C

    2015-01-01

    To develop an adaptive approach to mine frequent semantic tags (FSTs) from heterogeneous clinical research texts. We develop a "plug-n-play" framework that integrates replaceable unsupervised kernel algorithms with formatting, functional, and utility wrappers for FST mining. Temporal information identification and semantic equivalence detection were two example functional wrappers. We first compared this approach's recall and efficiency for mining FSTs from ClinicalTrials.gov to that of a recently published tag-mining algorithm. Then we assessed this approach's adaptability to two other types of clinical research texts: clinical data requests and clinical trial protocols, by comparing the prevalence trends of FSTs across three texts. Our approach increased the average recall and speed by 12.8% and 47.02% respectively upon the baseline when mining FSTs from ClinicalTrials.gov, and maintained an overlap in relevant FSTs with the base- line ranging between 76.9% and 100% for varying FST frequency thresholds. The FSTs saturated when the data size reached 200 documents. Consistent trends in the prevalence of FST were observed across the three texts as the data size or frequency threshold changed. This paper contributes an adaptive tag-mining framework that is scalable and adaptable without sacrificing its recall. This component-based architectural design can be potentially generalizable to improve the adaptability of other clinical text mining methods.

  15. Air Pollution Monitoring and Mining Based on Sensor Grid in London

    PubMed Central

    Ma, Yajie; Richards, Mark; Ghanem, Moustafa; Guo, Yike; Hassard, John

    2008-01-01

    In this paper, we present a distributed infrastructure based on wireless sensors network and Grid computing technology for air pollution monitoring and mining, which aims to develop low-cost and ubiquitous sensor networks to collect real-time, large scale and comprehensive environmental data from road traffic emissions for air pollution monitoring in urban environment. The main informatics challenges in respect to constructing the high-throughput sensor Grid are discussed in this paper. We present a two-layer network framework, a P2P e-Science Grid architecture, and the distributed data mining algorithm as the solutions to address the challenges. We simulated the system in TinyOS to examine the operation of each sensor as well as the networking performance. We also present the distributed data mining result to examine the effectiveness of the algorithm. PMID:27879895

  16. Air Pollution Monitoring and Mining Based on Sensor Grid in London.

    PubMed

    Ma, Yajie; Richards, Mark; Ghanem, Moustafa; Guo, Yike; Hassard, John

    2008-06-01

    In this paper, we present a distributed infrastructure based on wireless sensors network and Grid computing technology for air pollution monitoring and mining, which aims to develop low-cost and ubiquitous sensor networks to collect real-time, large scale and comprehensive environmental data from road traffic emissions for air pollution monitoring in urban environment. The main informatics challenges in respect to constructing the high-throughput sensor Grid are discussed in this paper. We present a twolayer network framework, a P2P e-Science Grid architecture, and the distributed data mining algorithm as the solutions to address the challenges. We simulated the system in TinyOS to examine the operation of each sensor as well as the networking performance. We also present the distributed data mining result to examine the effectiveness of the algorithm.

  17. Algorithms and data structures for automated change detection and classification of sidescan sonar imagery

    NASA Astrophysics Data System (ADS)

    Gendron, Marlin Lee

    During Mine Warfare (MIW) operations, MIW analysts perform change detection by visually comparing historical sidescan sonar imagery (SSI) collected by a sidescan sonar with recently collected SSI in an attempt to identify objects (which might be explosive mines) placed at sea since the last time the area was surveyed. This dissertation presents a data structure and three algorithms, developed by the author, that are part of an automated change detection and classification (ACDC) system. MIW analysts at the Naval Oceanographic Office, to reduce the amount of time to perform change detection, are currently using ACDC. The dissertation introductory chapter gives background information on change detection, ACDC, and describes how SSI is produced from raw sonar data. Chapter 2 presents the author's Geospatial Bitmap (GB) data structure, which is capable of storing information geographically and is utilized by the three algorithms. This chapter shows that a GB data structure used in a polygon-smoothing algorithm ran between 1.3--48.4x faster than a sparse matrix data structure. Chapter 3 describes the GB clustering algorithm, which is the author's repeatable, order-independent method for clustering. Results from tests performed in this chapter show that the time to cluster a set of points is not affected by the distribution or the order of the points. In Chapter 4, the author presents his real-time computer-aided detection (CAD) algorithm that automatically detects mine-like objects on the seafloor in SSI. The author ran his GB-based CAD algorithm on real SSI data, and results of these tests indicate that his real-time CAD algorithm performs comparably to or better than other non-real-time CAD algorithms. The author presents his computer-aided search (CAS) algorithm in Chapter 5. CAS helps MIW analysts locate mine-like features that are geospatially close to previously detected features. A comparison between the CAS and a great circle distance algorithm shows that the CAS performs geospatial searching 1.75x faster on large data sets. Finally, the concluding chapter of this dissertation gives important details on how the completed ACDC system will function, and discusses the author's future research to develop additional algorithms and data structures for ACDC.

  18. Data Mining Web Services for Science Data Repositories

    NASA Astrophysics Data System (ADS)

    Graves, S.; Ramachandran, R.; Keiser, K.; Maskey, M.; Lynnes, C.; Pham, L.

    2006-12-01

    The maturation of web services standards and technologies sets the stage for a distributed "Service-Oriented Architecture" (SOA) for NASA's next generation science data processing. This architecture will allow members of the scientific community to create and combine persistent distributed data processing services and make them available to other users over the Internet. NASA has initiated a project to create a suite of specialized data mining web services designed specifically for science data. The project leverages the Algorithm Development and Mining (ADaM) toolkit as its basis. The ADaM toolkit is a robust, mature and freely available science data mining toolkit that is being used by several research organizations and educational institutions worldwide. These mining services will give the scientific community a powerful and versatile data mining capability that can be used to create higher order products such as thematic maps from current and future NASA satellite data records with methods that are not currently available. The package of mining and related services are being developed using Web Services standards so that community-based measurement processing systems can access and interoperate with them. These standards-based services allow users different options for utilizing them, from direct remote invocation by a client application to deployment of a Business Process Execution Language (BPEL) solutions package where a complex data mining workflow is exposed to others as a single service. The ability to deploy and operate these services at a data archive allows the data mining algorithms to be run where the data are stored, a more efficient scenario than moving large amounts of data over the network. This will be demonstrated in a scenario in which a user uses a remote Web-Service-enabled clustering algorithm to create cloud masks from satellite imagery at the Goddard Earth Sciences Data and Information Services Center (GES DISC).

  19. Comparative performance between compressed and uncompressed airborne imagery

    NASA Astrophysics Data System (ADS)

    Phan, Chung; Rupp, Ronald; Agarwal, Sanjeev; Trang, Anh; Nair, Sumesh

    2008-04-01

    The US Army's RDECOM CERDEC Night Vision and Electronic Sensors Directorate (NVESD), Countermine Division is evaluating the compressibility of airborne multi-spectral imagery for mine and minefield detection application. Of particular interest is to assess the highest image data compression rate that can be afforded without the loss of image quality for war fighters in the loop and performance of near real time mine detection algorithm. The JPEG-2000 compression standard is used to perform data compression. Both lossless and lossy compressions are considered. A multi-spectral anomaly detector such as RX (Reed & Xiaoli), which is widely used as a core algorithm baseline in airborne mine and minefield detection on different mine types, minefields, and terrains to identify potential individual targets, is used to compare the mine detection performance. This paper presents the compression scheme and compares detection performance results between compressed and uncompressed imagery for various level of compressions. The compression efficiency is evaluated and its dependence upon different backgrounds and other factors are documented and presented using multi-spectral data.

  20. Metaheuristic Optimization and its Applications in Earth Sciences

    NASA Astrophysics Data System (ADS)

    Yang, Xin-She

    2010-05-01

    A common but challenging task in modelling geophysical and geological processes is to handle massive data and to minimize certain objectives. This can essentially be considered as an optimization problem, and thus many new efficient metaheuristic optimization algorithms can be used. In this paper, we will introduce some modern metaheuristic optimization algorithms such as genetic algorithms, harmony search, firefly algorithm, particle swarm optimization and simulated annealing. We will also discuss how these algorithms can be applied to various applications in earth sciences, including nonlinear least-squares, support vector machine, Kriging, inverse finite element analysis, and data-mining. We will present a few examples to show how different problems can be reformulated as optimization. Finally, we will make some recommendations for choosing various algorithms to suit various problems. References 1) D. H. Wolpert and W. G. Macready, No free lunch theorems for optimization, IEEE Trans. Evolutionary Computation, Vol. 1, 67-82 (1997). 2) X. S. Yang, Nature-Inspired Metaheuristic Algorithms, Luniver Press, (2008). 3) X. S. Yang, Mathematical Modelling for Earth Sciences, Dunedin Academic Press, (2008).

  1. Predicting CD4 count changes among patients on antiretroviral treatment: Application of data mining techniques.

    PubMed

    Kebede, Mihiretu; Zegeye, Desalegn Tigabu; Zeleke, Berihun Megabiaw

    2017-12-01

    To monitor the progress of therapy and disease progression, periodic CD4 counts are required throughout the course of HIV/AIDS care and support. The demand for CD4 count measurement is increasing as ART programs expand over the last decade. This study aimed to predict CD4 count changes and to identify the predictors of CD4 count changes among patients on ART. A cross-sectional study was conducted at the University of Gondar Hospital from 3,104 adult patients on ART with CD4 counts measured at least twice (baseline and most recent). Data were retrieved from the HIV care clinic electronic database and patients` charts. Descriptive data were analyzed by SPSS version 20. Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology was followed to undertake the study. WEKA version 3.8 was used to conduct a predictive data mining. Before building the predictive data mining models, information gain values and correlation-based Feature Selection methods were used for attribute selection. Variables were ranked according to their relevance based on their information gain values. J48, Neural Network, and Random Forest algorithms were experimented to assess model accuracies. The median duration of ART was 191.5 weeks. The mean CD4 count change was 243 (SD 191.14) cells per microliter. Overall, 2427 (78.2%) patients had their CD4 counts increased by at least 100 cells per microliter, while 4% had a decline from the baseline CD4 value. Baseline variables including age, educational status, CD8 count, ART regimen, and hemoglobin levels predicted CD4 count changes with predictive accuracies of J48, Neural Network, and Random Forest being 87.1%, 83.5%, and 99.8%, respectively. Random Forest algorithm had a superior performance accuracy level than both J48 and Artificial Neural Network. The precision, sensitivity and recall values of Random Forest were also more than 99%. Nearly accurate prediction results were obtained using Random Forest algorithm. This algorithm could be used in a low-resource setting to build a web-based prediction model for CD4 count changes. Copyright © 2017 Elsevier B.V. All rights reserved.

  2. QuadBase2: web server for multiplexed guanine quadruplex mining and visualization

    PubMed Central

    Dhapola, Parashar; Chowdhury, Shantanu

    2016-01-01

    DNA guanine quadruplexes or G4s are non-canonical DNA secondary structures which affect genomic processes like replication, transcription and recombination. G4s are computationally identified by specific nucleotide motifs which are also called putative G4 (PG4) motifs. Despite the general relevance of these structures, there is currently no tool available that can allow batch queries and genome-wide analysis of these motifs in a user-friendly interface. QuadBase2 (quadbase.igib.res.in) presents a completely reinvented web server version of previously published QuadBase database. QuadBase2 enables users to mine PG4 motifs in up to 178 eukaryotes through the EuQuad module. This module interfaces with Ensembl Compara database, to allow users mine PG4 motifs in the orthologues of genes of interest across eukaryotes. PG4 motifs can be mined across genes and their promoter sequences in 1719 prokaryotes through ProQuad module. This module includes a feature that allows genome-wide mining of PG4 motifs and their visualization as circular histograms. TetraplexFinder, the module for mining PG4 motifs in user-provided sequences is now capable of handling up to 20 MB of data. QuadBase2 is a comprehensive PG4 motif mining tool that further expands the configurations and algorithms for mining PG4 motifs in a user-friendly way. PMID:27185890

  3. Computational intelligence techniques for biological data mining: An overview

    NASA Astrophysics Data System (ADS)

    Faye, Ibrahima; Iqbal, Muhammad Javed; Said, Abas Md; Samir, Brahim Belhaouari

    2014-10-01

    Computational techniques have been successfully utilized for a highly accurate analysis and modeling of multifaceted and raw biological data gathered from various genome sequencing projects. These techniques are proving much more effective to overcome the limitations of the traditional in-vitro experiments on the constantly increasing sequence data. However, most critical problems that caught the attention of the researchers may include, but not limited to these: accurate structure and function prediction of unknown proteins, protein subcellular localization prediction, finding protein-protein interactions, protein fold recognition, analysis of microarray gene expression data, etc. To solve these problems, various classification and clustering techniques using machine learning have been extensively used in the published literature. These techniques include neural network algorithms, genetic algorithms, fuzzy ARTMAP, K-Means, K-NN, SVM, Rough set classifiers, decision tree and HMM based algorithms. Major difficulties in applying the above algorithms include the limitations found in the previous feature encoding and selection methods while extracting the best features, increasing classification accuracy and decreasing the running time overheads of the learning algorithms. The application of this research would be potentially useful in the drug design and in the diagnosis of some diseases. This paper presents a concise overview of the well-known protein classification techniques.

  4. Data mining: sophisticated forms of managed care modeling through artificial intelligence.

    PubMed

    Borok, L S

    1997-01-01

    Data mining is a recent development in computer science that combines artificial intelligence algorithms and relational databases to discover patterns automatically, without the use of traditional statistical methods. Work with data mining tools in health care is in a developmental stage that holds great promise, given the combination of demographic and diagnostic information.

  5. Computer-aided visual assessment in mine planning and design

    Treesearch

    Michael Hatfield; A. J. LeRoy Balzer; Roger E. Nelson

    1979-01-01

    A computer modeling technique is described for evaluating the visual impact of a proposed surface mine located within the viewshed of a national park. A computer algorithm analyzes digitized USGS baseline topography and identifies areas subject to surface disturbance visible from the park. Preliminary mine and reclamation plan information is used to describe how the...

  6. Applying data mining techniques to determine important parameters in chronic kidney disease and the relations of these parameters to each other.

    PubMed

    Tahmasebian, Shahram; Ghazisaeedi, Marjan; Langarizadeh, Mostafa; Mokhtaran, Mehrshad; Mahdavi-Mazdeh, Mitra; Javadian, Parisa

    2017-01-01

    Introduction: Chronic kidney disease (CKD) includes a wide range of pathophysiological processes which will be observed along with abnormal function of kidneys and progressive decrease in glomerular filtration rate (GFR). According to the definition decreasing GFR must have been present for at least three months. CKD will eventually result in end-stage kidney disease. In this process different factors play role and finding the relations between effective parameters in this regard can help to prevent or slow progression of this disease. There are always a lot of data being collected from the patients' medical records. This huge array of data can be considered a valuable source for analyzing, exploring and discovering information. Objectives: Using the data mining techniques, the present study tries to specify the effective parameters and also aims to determine their relations with each other in Iranian patients with CKD. Material and Methods: The study population includes 31996 patients with CKD. First, all of the data is registered in the database. Then data mining tools were used to find the hidden rules and relationships between parameters in collected data. Results: After data cleaning based on CRISP-DM (Cross Industry Standard Process for Data Mining) methodology and running mining algorithms on the data in the database the relationships between the effective parameters was specified. Conclusion: This study was done using the data mining method pertaining to the effective factors on patients with CKD.

  7. Applying data mining techniques to determine important parameters in chronic kidney disease and the relations of these parameters to each other

    PubMed Central

    Tahmasebian, Shahram; Ghazisaeedi, Marjan; Langarizadeh, Mostafa; Mokhtaran, Mehrshad; Mahdavi-Mazdeh, Mitra; Javadian, Parisa

    2017-01-01

    Introduction: Chronic kidney disease (CKD) includes a wide range of pathophysiological processes which will be observed along with abnormal function of kidneys and progressive decrease in glomerular filtration rate (GFR). According to the definition decreasing GFR must have been present for at least three months. CKD will eventually result in end-stage kidney disease. In this process different factors play role and finding the relations between effective parameters in this regard can help to prevent or slow progression of this disease. There are always a lot of data being collected from the patients’ medical records. This huge array of data can be considered a valuable source for analyzing, exploring and discovering information. Objectives: Using the data mining techniques, the present study tries to specify the effective parameters and also aims to determine their relations with each other in Iranian patients with CKD. Material and Methods: The study population includes 31996 patients with CKD. First, all of the data is registered in the database. Then data mining tools were used to find the hidden rules and relationships between parameters in collected data. Results: After data cleaning based on CRISP-DM (Cross Industry Standard Process for Data Mining) methodology and running mining algorithms on the data in the database the relationships between the effective parameters was specified. Conclusion: This study was done using the data mining method pertaining to the effective factors on patients with CKD. PMID:28497080

  8. Automatic detection of referral patients due to retinal pathologies through data mining.

    PubMed

    Quellec, Gwenolé; Lamard, Mathieu; Erginay, Ali; Chabouis, Agnès; Massin, Pascale; Cochener, Béatrice; Cazuguel, Guy

    2016-04-01

    With the increased prevalence of retinal pathologies, automating the detection of these pathologies is becoming more and more relevant. In the past few years, many algorithms have been developed for the automated detection of a specific pathology, typically diabetic retinopathy, using eye fundus photography. No matter how good these algorithms are, we believe many clinicians would not use automatic detection tools focusing on a single pathology and ignoring any other pathology present in the patient's retinas. To solve this issue, an algorithm for characterizing the appearance of abnormal retinas, as well as the appearance of the normal ones, is presented. This algorithm does not focus on individual images: it considers examination records consisting of multiple photographs of each retina, together with contextual information about the patient. Specifically, it relies on data mining in order to learn diagnosis rules from characterizations of fundus examination records. The main novelty is that the content of examination records (images and context) is characterized at multiple levels of spatial and lexical granularity: 1) spatial flexibility is ensured by an adaptive decomposition of composite retinal images into a cascade of regions, 2) lexical granularity is ensured by an adaptive decomposition of the feature space into a cascade of visual words. This multigranular representation allows for great flexibility in automatically characterizing normality and abnormality: it is possible to generate diagnosis rules whose precision and generalization ability can be traded off depending on data availability. A variation on usual data mining algorithms, originally designed to mine static data, is proposed so that contextual and visual data at adaptive granularity levels can be mined. This framework was evaluated in e-ophtha, a dataset of 25,702 examination records from the OPHDIAT screening network, as well as in the publicly-available Messidor dataset. It was successfully applied to the detection of patients that should be referred to an ophthalmologist and also to the specific detection of several pathologies. Copyright © 2016 Elsevier B.V. All rights reserved.

  9. Near-line Archive Data Mining at the Goddard Distributed Active Archive Center

    NASA Astrophysics Data System (ADS)

    Pham, L.; Mack, R.; Eng, E.; Lynnes, C.

    2002-12-01

    NASA's Earth Observing System (EOS) is generating immense volumes of data, in some cases too much to provide to users with data-intensive needs. As an alternative to moving the data to the user and his/her research algorithms, we are providing a means to move the algorithms to the data. The Near-line Archive Data Mining (NADM) system is the Goddard Earth Sciences Distributed Active Archive Center's (GES DAAC) web data mining portal to the EOS Data and Information System (EOSDIS) data pool, a 50-TB online disk cache. The NADM web portal enables registered users to submit and execute data mining algorithm codes on the data in the EOSDIS data pool. A web interface allows the user to access the NADM system. The users first develops personalized data mining code on their home platform and then uploads them to the NADM system. The C, FORTRAN and IDL languages are currently supported. The user developed code is automatically audited for any potential security problems before it is installed within the NADM system and made available to the user. Once the code has been installed the user is provided a test environment where he/she can test the execution of the software against data sets of the user's choosing. When the user is satisfied with the results, he/she can promote their code to the "operational" environment. From here the user can interactively run his/her code on the data available in the EOSDIS data pool. The user can also set up a processing subscription. The subscription will automatically process new data as it becomes available in the EOSDIS data pool. The generated mined data products are then made available for FTP pickup. The NADM system uses the GES DAAC-developed Simple Scalable Script-based Science Processor (S4P) to automate tasks and perform the actual data processing. Users will also have the option of selecting a DAAC-provided data mining algorithm and using it to process the data of their choice.

  10. Moment tensor inversion with three-dimensional sensor configuration of mining induced seismicity (Kiruna mine, Sweden)

    NASA Astrophysics Data System (ADS)

    Ma, Ju; Dineva, Savka; Cesca, Simone; Heimann, Sebastian

    2018-06-01

    Mining induced seismicity is an undesired consequence of mining operations, which poses significant hazard to miners and infrastructures and requires an accurate analysis of the rupture process. Seismic moment tensors of mining-induced events help to understand the nature of mining-induced seismicity by providing information about the relationship between the mining, stress redistribution and instabilities in the rock mass. In this work, we adapt and test a waveform-based inversion method on high frequency data recorded by a dense underground seismic system in one of the largest underground mines in the world (Kiruna mine, Sweden). A stable algorithm for moment tensor inversion for comparatively small mining induced earthquakes, resolving both the double-couple and full moment tensor with high frequency data, is very challenging. Moreover, the application to underground mining system requires accounting for the 3-D geometry of the monitoring system. We construct a Green's function database using a homogeneous velocity model, but assuming a 3-D distribution of potential sources and receivers. We first perform a set of moment tensor inversions using synthetic data to test the effects of different factors on moment tensor inversion stability and source parameters accuracy, including the network spatial coverage, the number of sensors and the signal-to-noise ratio. The influence of the accuracy of the input source parameters on the inversion results is also tested. Those tests show that an accurate selection of the inversion parameters allows resolving the moment tensor also in the presence of realistic seismic noise conditions. Finally, the moment tensor inversion methodology is applied to eight events chosen from mining block #33/34 at Kiruna mine. Source parameters including scalar moment, magnitude, double-couple, compensated linear vector dipole and isotropic contributions as well as the strike, dip and rake configurations of the double-couple term were obtained. The orientations of the nodal planes of the double-couple component in most cases vary from NNW to NNE with a dip along the ore body or in the opposite direction.

  11. Moment Tensor Inversion with 3D sensor configuration of Mining Induced Seismicity (Kiruna mine, Sweden)

    NASA Astrophysics Data System (ADS)

    Ma, Ju; Dineva, Savka; Cesca, Simone; Heimann, Sebastian

    2018-03-01

    Mining induced seismicity is an undesired consequence of mining operations, which poses significant hazard to miners and infrastructures and requires an accurate analysis of the rupture process. Seismic moment tensors of mining-induced events help to understand the nature of mining-induced seismicity by providing information about the relationship between the mining, stress redistribution and instabilities in the rock mass. In this work, we adapt and test a waveform-based inversion method on high frequency data recorded by a dense underground seismic system in one of the largest underground mines in the world (Kiruna mine, Sweden). Stable algorithm for moment tensor inversion for comparatively small mining induced earthquakes, resolving both the double couple and full moment tensor with high frequency data is very challenging. Moreover, the application to underground mining system requires accounting for the 3D geometry of the monitoring system. We construct a Green's function database using a homogeneous velocity model, but assuming a 3D distribution of potential sources and receivers. We first perform a set of moment tensor inversions using synthetic data to test the effects of different factors on moment tensor inversion stability and source parameters accuracy, including the network spatial coverage, the number of sensors and the signal-to-noise ratio. The influence of the accuracy of the input source parameters on the inversion results is also tested. Those tests show that an accurate selection of the inversion parameters allows resolving the moment tensor also in presence of realistic seismic noise conditions. Finally, the moment tensor inversion methodology is applied to 8 events chosen from mining block #33/34 at Kiruna mine. Source parameters including scalar moment, magnitude, double couple, compensated linear vector dipole and isotropic contributions as well as the strike, dip, rake configurations of the double couple term were obtained. The orientations of the nodal planes of the double-couple component in most cases vary from NNW to NNE with a dip along the ore body or in the opposite direction.

  12. State-Estimation Algorithm Based on Computer Vision

    NASA Technical Reports Server (NTRS)

    Bayard, David; Brugarolas, Paul

    2007-01-01

    An algorithm and software to implement the algorithm are being developed as means to estimate the state (that is, the position and velocity) of an autonomous vehicle, relative to a visible nearby target object, to provide guidance for maneuvering the vehicle. In the original intended application, the autonomous vehicle would be a spacecraft and the nearby object would be a small astronomical body (typically, a comet or asteroid) to be explored by the spacecraft. The algorithm could also be used on Earth in analogous applications -- for example, for guiding underwater robots near such objects of interest as sunken ships, mineral deposits, or submerged mines. It is assumed that the robot would be equipped with a vision system that would include one or more electronic cameras, image-digitizing circuitry, and an imagedata- processing computer that would generate feature-recognition data products.

  13. A New Pivoting and Iterative Text Detection Algorithm for Biomedical Images

    PubMed Central

    Xu, Songhua; Krauthammer, Michael

    2010-01-01

    There is interest to expand the reach of literature mining to include the analysis of biomedical images, which often contain a paper’s key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is used to boost biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating the performance on a set of manually labeled random biomedical images, and compare the performance against other state-of-the-art text detection algorithms. In this paper, we demonstrate that a projection histogram-based text detection approach is well suited for text detection in biomedical images, with a performance of F score of .60. The approach performs better than comparable approaches for text detection. Further, we show that the iterative application of the algorithm is boosting overall detection performance. A C++ implementation of our algorithm is freely available through email request for academic use. PMID:20887803

  14. A framework for interval-valued information system

    NASA Astrophysics Data System (ADS)

    Yin, Yunfei; Gong, Guanghong; Han, Liang

    2012-09-01

    Interval-valued information system is used to transform the conventional dataset into the interval-valued form. To conduct the interval-valued data mining, we conduct two investigations: (1) construct the interval-valued information system, and (2) conduct the interval-valued knowledge discovery. In constructing the interval-valued information system, we first make the paired attributes in the database discovered, and then, make them stored in the neighbour locations in a common database and regard them as 'one' new field. In conducting the interval-valued knowledge discovery, we utilise some related priori knowledge and regard the priori knowledge as the control objectives; and design an approximate closed-loop control mining system. On the implemented experimental platform (prototype), we conduct the corresponding experiments and compare the proposed algorithms with several typical algorithms, such as the Apriori algorithm, the FP-growth algorithm and the CLOSE+ algorithm. The experimental results show that the interval-valued information system method is more effective than the conventional algorithms in discovering interval-valued patterns.

  15. Assessing semantic similarity of texts - Methods and algorithms

    NASA Astrophysics Data System (ADS)

    Rozeva, Anna; Zerkova, Silvia

    2017-12-01

    Assessing the semantic similarity of texts is an important part of different text-related applications like educational systems, information retrieval, text summarization, etc. This task is performed by sophisticated analysis, which implements text-mining techniques. Text mining involves several pre-processing steps, which provide for obtaining structured representative model of the documents in a corpus by means of extracting and selecting the features, characterizing their content. Generally the model is vector-based and enables further analysis with knowledge discovery approaches. Algorithms and measures are used for assessing texts at syntactical and semantic level. An important text-mining method and similarity measure is latent semantic analysis (LSA). It provides for reducing the dimensionality of the document vector space and better capturing the text semantics. The mathematical background of LSA for deriving the meaning of the words in a given text by exploring their co-occurrence is examined. The algorithm for obtaining the vector representation of words and their corresponding latent concepts in a reduced multidimensional space as well as similarity calculation are presented.

  16. Virtual Observatories, Data Mining, and Astroinformatics

    NASA Astrophysics Data System (ADS)

    Borne, Kirk

    The historical, current, and future trends in knowledge discovery from data in astronomy are presented here. The story begins with a brief history of data gathering and data organization. A description of the development ofnew information science technologies for astronomical discovery is then presented. Among these are e-Science and the virtual observatory, with its data discovery, access, display, and integration protocols; astroinformatics and data mining for exploratory data analysis, information extraction, and knowledge discovery from distributed data collections; new sky surveys' databases, including rich multivariate observational parameter sets for large numbers of objects; and the emerging discipline of data-oriented astronomical research, called astroinformatics. Astroinformatics is described as the fourth paradigm of astronomical research, following the three traditional research methodologies: observation, theory, and computation/modeling. Astroinformatics research areas include machine learning, data mining, visualization, statistics, semantic science, and scientific data management.Each of these areas is now an active research discipline, with significantscience-enabling applications in astronomy. Research challenges and sample research scenarios are presented in these areas, in addition to sample algorithms for data-oriented research. These information science technologies enable scientific knowledge discovery from the increasingly large and complex data collections in astronomy. The education and training of the modern astronomy student must consequently include skill development in these areas, whose practitioners have traditionally been limited to applied mathematicians, computer scientists, and statisticians. Modern astronomical researchers must cross these traditional discipline boundaries, thereby borrowing the best of breed methodologies from multiple disciplines. In the era of large sky surveys and numerous large telescopes, the potential for astronomical discovery is equally large, and so the data-oriented research methods, algorithms, and techniques that are presented here will enable the greatest discovery potential from the ever-growing data and information resources in astronomy.

  17. Fault Tolerant Frequent Pattern Mining

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shohdy, Sameh; Vishnu, Abhinav; Agrawal, Gagan

    FP-Growth algorithm is a Frequent Pattern Mining (FPM) algorithm that has been extensively used to study correlations and patterns in large scale datasets. While several researchers have designed distributed memory FP-Growth algorithms, it is pivotal to consider fault tolerant FP-Growth, which can address the increasing fault rates in large scale systems. In this work, we propose a novel parallel, algorithm-level fault-tolerant FP-Growth algorithm. We leverage algorithmic properties and MPI advanced features to guarantee an O(1) space complexity, achieved by using the dataset memory space itself for checkpointing. We also propose a recovery algorithm that can use in-memory and disk-based checkpointing,more » though in many cases the recovery can be completed without any disk access, and incurring no memory overhead for checkpointing. We evaluate our FT algorithm on a large scale InfiniBand cluster with several large datasets using up to 2K cores. Our evaluation demonstrates excellent efficiency for checkpointing and recovery in comparison to the disk-based approach. We have also observed 20x average speed-up in comparison to Spark, establishing that a well designed algorithm can easily outperform a solution based on a general fault-tolerant programming model.« less

  18. Recognition physical activities with optimal number of wearable sensors using data mining algorithms and deep belief network.

    PubMed

    Al-Fatlawi, Ali H; Fatlawi, Hayder K; Sai Ho Ling

    2017-07-01

    Daily physical activities monitoring is benefiting the health care field in several ways, in particular with the development of the wearable sensors. This paper adopts effective ways to calculate the optimal number of the necessary sensors and to build a reliable and a high accuracy monitoring system. Three data mining algorithms, namely Decision Tree, Random Forest and PART Algorithm, have been applied for the sensors selection process. Furthermore, the deep belief network (DBN) has been investigated to recognise 33 physical activities effectively. The results indicated that the proposed method is reliable with an overall accuracy of 96.52% and the number of sensors is minimised from nine to six sensors.

  19. A fuzzy hill-climbing algorithm for the development of a compact associative classifier

    NASA Astrophysics Data System (ADS)

    Mitra, Soumyaroop; Lam, Sarah S.

    2012-02-01

    Classification, a data mining technique, has widespread applications including medical diagnosis, targeted marketing, and others. Knowledge discovery from databases in the form of association rules is one of the important data mining tasks. An integrated approach, classification based on association rules, has drawn the attention of the data mining community over the last decade. While attention has been mainly focused on increasing classifier accuracies, not much efforts have been devoted towards building interpretable and less complex models. This paper discusses the development of a compact associative classification model using a hill-climbing approach and fuzzy sets. The proposed methodology builds the rule-base by selecting rules which contribute towards increasing training accuracy, thus balancing classification accuracy with the number of classification association rules. The results indicated that the proposed associative classification model can achieve competitive accuracies on benchmark datasets with continuous attributes and lend better interpretability, when compared with other rule-based systems.

  20. Automated Assessment of Patients' Self-Narratives for Posttraumatic Stress Disorder Screening Using Natural Language Processing and Text Mining.

    PubMed

    He, Qiwei; Veldkamp, Bernard P; Glas, Cees A W; de Vries, Theo

    2017-03-01

    Patients' narratives about traumatic experiences and symptoms are useful in clinical screening and diagnostic procedures. In this study, we presented an automated assessment system to screen patients for posttraumatic stress disorder via a natural language processing and text-mining approach. Four machine-learning algorithms-including decision tree, naive Bayes, support vector machine, and an alternative classification approach called the product score model-were used in combination with n-gram representation models to identify patterns between verbal features in self-narratives and psychiatric diagnoses. With our sample, the product score model with unigrams attained the highest prediction accuracy when compared with practitioners' diagnoses. The addition of multigrams contributed most to balancing the metrics of sensitivity and specificity. This article also demonstrates that text mining is a promising approach for analyzing patients' self-expression behavior, thus helping clinicians identify potential patients from an early stage.

  1. Littoral Assessment of Mine Burial Signatures (LAMBS) buried land mine/background spectral signature analyses

    USGS Publications Warehouse

    Kenton, A.C.; Geci, D.M.; Ray, K.J.; Thomas, C.M.; Salisbury, J.W.; Mars, J.C.; Crowley, J.K.; Witherspoon, N.H.; Holloway, J.H.; Harmon R.S.Broach J.T.Holloway, Jr. J.H.

    2004-01-01

    The objective of the Office of Naval Research (ONR) Rapid Overt Reconnaissance (ROR) program and the Airborne Littoral Reconnaissance Technologies (ALRT) project's LAMBS effort is to determine if electro-optical spectral discriminants exist that are useful for the detection of land mines in littoral regions. Statistically significant buried mine overburden and background signature data were collected over a wide spectral range (0.35 to 14 ??m) to identify robust spectral features that might serve as discriminants for new airborne sensor concepts. LAMBS has expanded previously collected databases to littoral areas - primarily dry and wet sandy soils - where tidal, surf, and wind conditions can severely modify spectral signatures. At AeroSense 2003, we reported completion of three buried mine collections at an inland bay, Atlantic and Gulf of Mexico beach sites.1 We now report LAMBS spectral database analyses results using metrics which characterize the detection performance of general types of spectral detection algorithms. These metrics include mean contrast, spectral signal-to-clutter, covariance, information content, and spectral matched filter analyses. Detection performance of the buried land mines was analyzed with regard to burial age, background type, and environmental conditions. These analyses considered features observed due to particle size differences, surface roughness, surface moisture, and compositional differences.

  2. On the classification techniques in data mining for microarray data classification

    NASA Astrophysics Data System (ADS)

    Aydadenta, Husna; Adiwijaya

    2018-03-01

    Cancer is one of the deadly diseases, according to data from WHO by 2015 there are 8.8 million more deaths caused by cancer, and this will increase every year if not resolved earlier. Microarray data has become one of the most popular cancer-identification studies in the field of health, since microarray data can be used to look at levels of gene expression in certain cell samples that serve to analyze thousands of genes simultaneously. By using data mining technique, we can classify the sample of microarray data thus it can be identified with cancer or not. In this paper we will discuss some research using some data mining techniques using microarray data, such as Support Vector Machine (SVM), Artificial Neural Network (ANN), Naive Bayes, k-Nearest Neighbor (kNN), and C4.5, and simulation of Random Forest algorithm with technique of reduction dimension using Relief. The result of this paper show performance measure (accuracy) from classification algorithm (SVM, ANN, Naive Bayes, kNN, C4.5, and Random Forets).The results in this paper show the accuracy of Random Forest algorithm higher than other classification algorithms (Support Vector Machine (SVM), Artificial Neural Network (ANN), Naive Bayes, k-Nearest Neighbor (kNN), and C4.5). It is hoped that this paper can provide some information about the speed, accuracy, performance and computational cost generated from each Data Mining Classification Technique based on microarray data.

  3. Cooperative organic mine avoidance path planning

    NASA Astrophysics Data System (ADS)

    McCubbin, Christopher B.; Piatko, Christine D.; Peterson, Adam V.; Donnald, Creighton R.; Cohen, David

    2005-06-01

    The JHU/APL Path Planning team has developed path planning techniques to look for paths that balance the utility and risk associated with different routes through a minefield. Extending on previous years' efforts, we investigated real-world Naval mine avoidance requirements and developed a tactical decision aid (TDA) that satisfies those requirements. APL has developed new mine path planning techniques using graph based and genetic algorithms which quickly produce near-minimum risk paths for complicated fitness functions incorporating risk, path length, ship kinematics, and naval doctrine. The TDA user interface, a Java Swing application that obtains data via Corba interfaces to path planning databases, allows the operator to explore a fusion of historic and in situ mine field data, control the path planner, and display the planning results. To provide a context for the minefield data, the user interface also renders data from the Digital Nautical Chart database, a database created by the National Geospatial-Intelligence Agency containing charts of the world's ports and coastal regions. This TDA has been developed in conjunction with the COMID (Cooperative Organic Mine Defense) system. This paper presents a description of the algorithms, architecture, and application produced.

  4. A prototype system based on visual interactive SDM called VGC

    NASA Astrophysics Data System (ADS)

    Jia, Zelu; Liu, Yaolin; Liu, Yanfang

    2009-10-01

    In many application domains, data is collected and referenced by its geo-spatial location. Spatial data mining, or the discovery of interesting patterns in such databases, is an important capability in the development of database systems. Spatial data mining recently emerges from a number of real applications, such as real-estate marketing, urban planning, weather forecasting, medical image analysis, road traffic accident analysis, etc. It demands for efficient solutions for many new, expensive, and complicated problems. For spatial data mining of large data sets to be effective, it is also important to include humans in the data exploration process and combine their flexibility, creativity, and general knowledge with the enormous storage capacity and computational power of today's computers. Visual spatial data mining applies human visual perception to the exploration of large data sets. Presenting data in an interactive, graphical form often fosters new insights, encouraging the information and validation of new hypotheses to the end of better problem-solving and gaining deeper domain knowledge. In this paper a visual interactive spatial data mining prototype system (visual geo-classify) based on VC++6.0 and MapObject2.0 are designed and developed, the basic algorithms of the spatial data mining is used decision tree and Bayesian networks, and data classify are used training and learning and the integration of the two to realize. The result indicates it's a practical and extensible visual interactive spatial data mining tool.

  5. Mining User Dwell Time for Personalized Web Search Re-Ranking

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Xu, Songhua; Jiang, Hao; Lau, Francis

    We propose a personalized re-ranking algorithm through mining user dwell times derived from a user's previously online reading or browsing activities. We acquire document level user dwell times via a customized web browser, from which we then infer conceptword level user dwell times in order to understand a user's personal interest. According to the estimated concept word level user dwell times, our algorithm can estimate a user's potential dwell time over a new document, based on which personalized webpage re-ranking can be carried out. We compare the rankings produced by our algorithm with rankings generated by popular commercial search enginesmore » and a recently proposed personalized ranking algorithm. The results clearly show the superiority of our method. In this paper, we propose a new personalized webpage ranking algorithmthrough mining dwell times of a user. We introduce a quantitative model to derive concept word level user dwell times from the observed document level user dwell times. Once we have inferred a user's interest over the set of concept words the user has encountered in previous readings, we can then predict the user's potential dwell time over a new document. Such predicted user dwell time allows us to carry out personalized webpage re-ranking. To explore the effectiveness of our algorithm, we measured the performance of our algorithm under two conditions - one with a relatively limited amount of user dwell time data and the other with a doubled amount. Both evaluation cases put our algorithm for generating personalized webpage rankings to satisfy a user's personal preference ahead of those by Google, Yahoo!, and Bing, as well as a recent personalized webpage ranking algorithm.« less

  6. Behavior Correlates of Post-Stroke Disability Using Data Mining and Infographics.

    PubMed

    Yoon, Sunmoo; Gutierrez, Jose

    Disability is a potential risk for stroke survivors. This study aims to identify disability risk factors associated with stroke and their relative importance and relationships from a national behavioral risk factor dataset. Data of post-stroke individuals in the U.S (n=19,603) including 397 variables were extracted from a publically available national dataset and analyzed. Data mining algorithms including C4.5 and linear regression with M5s methods were applied to build association models for post-stroke disability using Weka software. The relative importance and relationship of 70 variables associated with disability were presented in infographics for clinicians to understand easily. Fifty-five percent of post-stroke patients experience disability. Exercise, employment and satisfaction of life were relatively important factors associated with disability among stroke patients. Modifiable behavior factors strongly associated with disability include exercise (OR: 0.46, P<0.01) and good rest (OR 0.37, P<0.01). Data mining is promising to discover factors associated with post-stroke disability from a large population dataset. The findings can be potentially valuable for establishing the priorities for clinicians and researchers and for stroke patient education. The methods may generalize to other health conditions.

  7. Behavior Correlates of Post-Stroke Disability Using Data Mining and Infographics

    PubMed Central

    Yoon, Sunmoo; Gutierrez, Jose

    2015-01-01

    Purpose Disability is a potential risk for stroke survivors. This study aims to identify disability risk factors associated with stroke and their relative importance and relationships from a national behavioral risk factor dataset. Methods Data of post-stroke individuals in the U.S (n=19,603) including 397 variables were extracted from a publically available national dataset and analyzed. Data mining algorithms including C4.5 and linear regression with M5s methods were applied to build association models for post-stroke disability using Weka software. The relative importance and relationship of 70 variables associated with disability were presented in infographics for clinicians to understand easily. Results Fifty-five percent of post-stroke patients experience disability. Exercise, employment and satisfaction of life were relatively important factors associated with disability among stroke patients. Modifiable behavior factors strongly associated with disability include exercise (OR: 0.46, P<0.01) and good rest (OR 0.37, P<0.01). Conclusions Data mining is promising to discover factors associated with post-stroke disability from a large population dataset. The findings can be potentially valuable for establishing the priorities for clinicians and researchers and for stroke patient education. The methods may generalize to other health conditions. PMID:26835413

  8. Visual cues for data mining

    NASA Astrophysics Data System (ADS)

    Rogowitz, Bernice E.; Rabenhorst, David A.; Gerth, John A.; Kalin, Edward B.

    1996-04-01

    This paper describes a set of visual techniques, based on principles of human perception and cognition, which can help users analyze and develop intuitions about tabular data. Collections of tabular data are widely available, including, for example, multivariate time series data, customer satisfaction data, stock market performance data, multivariate profiles of companies and individuals, and scientific measurements. In our approach, we show how visual cues can help users perform a number of data mining tasks, including identifying correlations and interaction effects, finding clusters and understanding the semantics of cluster membership, identifying anomalies and outliers, and discovering multivariate relationships among variables. These cues are derived from psychological studies on perceptual organization, visual search, perceptual scaling, and color perception. These visual techniques are presented as a complement to the statistical and algorithmic methods more commonly associated with these tasks, and provide an interactive interface for the human analyst.

  9. Comparative data mining analysis for information retrieval of MODIS images: monitoring lake turbidity changes at Lake Okeechobee, Florida

    NASA Astrophysics Data System (ADS)

    Chang, Ni-Bin; Daranpob, Ammarin; Yang, Y. Jeffrey; Jin, Kang-Ren

    2009-09-01

    In the remote sensing field, a frequently recurring question is: Which computational intelligence or data mining algorithms are most suitable for the retrieval of essential information given that most natural systems exhibit very high non-linearity. Among potential candidates might be empirical regression, neural network model, support vector machine, genetic algorithm/genetic programming, analytical equation, etc. This paper compares three types of data mining techniques, including multiple non-linear regression, artificial neural networks, and genetic programming, for estimating multi-temporal turbidity changes following hurricane events at Lake Okeechobee, Florida. This retrospective analysis aims to identify how the major hurricanes impacted the water quality management in 2003-2004. The Moderate Resolution Imaging Spectroradiometer (MODIS) Terra 8-day composite imageries were used to retrieve the spatial patterns of turbidity distributions for comparison against the visual patterns discernible in the in-situ observations. By evaluating four statistical parameters, the genetic programming model was finally selected as the most suitable data mining tool for classification in which the MODIS band 1 image and wind speed were recognized as the major determinants by the model. The multi-temporal turbidity maps generated before and after the major hurricane events in 2003-2004 showed that turbidity levels were substantially higher after hurricane episodes. The spatial patterns of turbidity confirm that sediment-laden water travels to the shore where it reduces the intensity of the light necessary to submerged plants for photosynthesis. This reduction results in substantial loss of biomass during the post-hurricane period.

  10. Automatic crack detection method for loaded coal in vibration failure process

    PubMed Central

    Li, Chengwu

    2017-01-01

    In the coal mining process, the destabilization of loaded coal mass is a prerequisite for coal and rock dynamic disaster, and surface cracks of the coal and rock mass are important indicators, reflecting the current state of the coal body. The detection of surface cracks in the coal body plays an important role in coal mine safety monitoring. In this paper, a method for detecting the surface cracks of loaded coal by a vibration failure process is proposed based on the characteristics of the surface cracks of coal and support vector machine (SVM). A large number of cracked images are obtained by establishing a vibration-induced failure test system and industrial camera. Histogram equalization and a hysteresis threshold algorithm were used to reduce the noise and emphasize the crack; then, 600 images and regions, including cracks and non-cracks, were manually labelled. In the crack feature extraction stage, eight features of the cracks are extracted to distinguish cracks from other objects. Finally, a crack identification model with an accuracy over 95% was trained by inputting the labelled sample images into the SVM classifier. The experimental results show that the proposed algorithm has a higher accuracy than the conventional algorithm and can effectively identify cracks on the surface of the coal and rock mass automatically. PMID:28973032

  11. Automatic crack detection method for loaded coal in vibration failure process.

    PubMed

    Li, Chengwu; Ai, Dihao

    2017-01-01

    In the coal mining process, the destabilization of loaded coal mass is a prerequisite for coal and rock dynamic disaster, and surface cracks of the coal and rock mass are important indicators, reflecting the current state of the coal body. The detection of surface cracks in the coal body plays an important role in coal mine safety monitoring. In this paper, a method for detecting the surface cracks of loaded coal by a vibration failure process is proposed based on the characteristics of the surface cracks of coal and support vector machine (SVM). A large number of cracked images are obtained by establishing a vibration-induced failure test system and industrial camera. Histogram equalization and a hysteresis threshold algorithm were used to reduce the noise and emphasize the crack; then, 600 images and regions, including cracks and non-cracks, were manually labelled. In the crack feature extraction stage, eight features of the cracks are extracted to distinguish cracks from other objects. Finally, a crack identification model with an accuracy over 95% was trained by inputting the labelled sample images into the SVM classifier. The experimental results show that the proposed algorithm has a higher accuracy than the conventional algorithm and can effectively identify cracks on the surface of the coal and rock mass automatically.

  12. Sparse generalized linear model with L0 approximation for feature selection and prediction with big omics data.

    PubMed

    Liu, Zhenqiu; Sun, Fengzhu; McGovern, Dermot P

    2017-01-01

    Feature selection and prediction are the most important tasks for big data mining. The common strategies for feature selection in big data mining are L 1 , SCAD and MC+. However, none of the existing algorithms optimizes L 0 , which penalizes the number of nonzero features directly. In this paper, we develop a novel sparse generalized linear model (GLM) with L 0 approximation for feature selection and prediction with big omics data. The proposed approach approximate the L 0 optimization directly. Even though the original L 0 problem is non-convex, the problem is approximated by sequential convex optimizations with the proposed algorithm. The proposed method is easy to implement with only several lines of code. Novel adaptive ridge algorithms ( L 0 ADRIDGE) for L 0 penalized GLM with ultra high dimensional big data are developed. The proposed approach outperforms the other cutting edge regularization methods including SCAD and MC+ in simulations. When it is applied to integrated analysis of mRNA, microRNA, and methylation data from TCGA ovarian cancer, multilevel gene signatures associated with suboptimal debulking are identified simultaneously. The biological significance and potential clinical importance of those genes are further explored. The developed Software L 0 ADRIDGE in MATLAB is available at https://github.com/liuzqx/L0adridge.

  13. Highly scalable and robust rule learner: performance evaluation and comparison.

    PubMed

    Kurgan, Lukasz A; Cios, Krzysztof J; Dick, Scott

    2006-02-01

    Business intelligence and bioinformatics applications increasingly require the mining of datasets consisting of millions of data points, or crafting real-time enterprise-level decision support systems for large corporations and drug companies. In all cases, there needs to be an underlying data mining system, and this mining system must be highly scalable. To this end, we describe a new rule learner called DataSqueezer. The learner belongs to the family of inductive supervised rule extraction algorithms. DataSqueezer is a simple, greedy, rule builder that generates a set of production rules from labeled input data. In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules generated by the algorithm are compact, comprehensible, and have accuracy comparable to rules generated by other state-of-the-art rule extraction algorithms. The main advantages of DataSqueezer are very high efficiency, and missing data resistance. DataSqueezer exhibits log-linear asymptotic complexity with the number of training examples, and it is faster than other state-of-the-art rule learners. The learner is also robust to large quantities of missing data, as verified by extensive experimental comparison with the other learners. DataSqueezer is thus well suited to modern data mining and business intelligence tasks, which commonly involve huge datasets with a large fraction of missing data.

  14. Long-range prediction of Indian summer monsoon rainfall using data mining and statistical approaches

    NASA Astrophysics Data System (ADS)

    H, Vathsala; Koolagudi, Shashidhar G.

    2017-10-01

    This paper presents a hybrid model to better predict Indian summer monsoon rainfall. The algorithm considers suitable techniques for processing dense datasets. The proposed three-step algorithm comprises closed itemset generation-based association rule mining for feature selection, cluster membership for dimensionality reduction, and simple logistic function for prediction. The application of predicting rainfall into flood, excess, normal, deficit, and drought based on 36 predictors consisting of land and ocean variables is presented. Results show good accuracy in the considered study period of 37years (1969-2005).

  15. Design of Automatic Extraction Algorithm of Knowledge Points for MOOCs

    PubMed Central

    Chen, Haijian; Han, Dongmei; Zhao, Lina

    2015-01-01

    In recent years, Massive Open Online Courses (MOOCs) are very popular among college students and have a powerful impact on academic institutions. In the MOOCs environment, knowledge discovery and knowledge sharing are very important, which currently are often achieved by ontology techniques. In building ontology, automatic extraction technology is crucial. Because the general methods of text mining algorithm do not have obvious effect on online course, we designed automatic extracting course knowledge points (AECKP) algorithm for online course. It includes document classification, Chinese word segmentation, and POS tagging for each document. Vector Space Model (VSM) is used to calculate similarity and design the weight to optimize the TF-IDF algorithm output values, and the higher scores will be selected as knowledge points. Course documents of “C programming language” are selected for the experiment in this study. The results show that the proposed approach can achieve satisfactory accuracy rate and recall rate. PMID:26448738

  16. Toward detecting deception in intelligent systems

    NASA Astrophysics Data System (ADS)

    Santos, Eugene, Jr.; Johnson, Gregory, Jr.

    2004-08-01

    Contemporary decision makers often must choose a course of action using knowledge from several sources. Knowledge may be provided from many diverse sources including electronic sources such as knowledge-based diagnostic or decision support systems or through data mining techniques. As the decision maker becomes more dependent on these electronic information sources, detecting deceptive information from these sources becomes vital to making a correct, or at least more informed, decision. This applies to unintentional disinformation as well as intentional misinformation. Our ongoing research focuses on employing models of deception and deception detection from the fields of psychology and cognitive science to these systems as well as implementing deception detection algorithms for probabilistic intelligent systems. The deception detection algorithms are used to detect, classify and correct attempts at deception. Algorithms for detecting unexpected information rely upon a prediction algorithm from the collaborative filtering domain to predict agent responses in a multi-agent system.

  17. Evaluation of Aster Images for Characterization and Mapping of Amethyst Mining Residues

    NASA Astrophysics Data System (ADS)

    Markoski, P. R.; Rolim, S. B. A.

    2012-07-01

    The objective of this work was to evaluate the potential of Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), subsystems VNIR (Visible and Near Infrared) and SWIR (Short Wave Infrared) images, for discrimination and mapping of amethyst mining residues (basalt) in the Ametista do Sul Region, Rio Grande do Sul State, Brazil. This region provides the most part of amethyst mining of the World. The basalt is extracted during the mining process and deposited outside the mine. As a result, mounts of residues (basalt) rise up. These mounts are many times smaller than ASTER pixel size (VNIR - 15 meters and SWIR - 30 meters). Thus, the pixel composition becomes a mixing of various materials, hampering its identification and mapping. Trying to solve this problem, multispectral algorithm Maximum Likelihood (MaxVer) and the hyperspectral technique SAM (Spectral Angle Mapper) were used in this work. Images from ASTER subsystems VNIR and SWIR were used to perform the classifications. SAM technique produced better results than MaxVer algorithm. The main error found by the techniques was the mixing between "shadow" and "mining residues/basalt" classes. With the SAM technique the confusion decreased because it employed the basalt spectral curve as a reference, while the multispectral techniques employed pixels groups that could have spectral mixture with other targets. The results showed that in tropical terrains as the study area, ASTER data can be efficacious for the characterization of mining residues.

  18. Analysis of Mining Terrain Deformation Characteristics with Deformation Information System

    NASA Astrophysics Data System (ADS)

    Blachowski, Jan; Milczarek, Wojciech; Grzempowski, Piotr

    2014-05-01

    Mapping and prediction of mining related deformations of the earth surface is an important measure for minimising threat to surface infrastructure, human population, the environment and safety of the mining operation itself arising from underground extraction of useful minerals. The number of methods and techniques used for monitoring and analysis of mining terrain deformations is wide and increasing with the development of geographical information technologies. These include for example: terrestrial geodetic measurements, global positioning systems, remote sensing, spatial interpolation, finite element method modelling, GIS based modelling, geological modelling, empirical modelling using the Knothe theory, artificial neural networks, fuzzy logic calculations and other. The aim of this paper is to introduce the concept of an integrated Deformation Information System (DIS) developed in geographic information systems environment for analysis and modelling of various spatial data related to mining activity and demonstrate its applications for mapping and visualising, as well as identifying possible mining terrain deformation areas with various spatial modelling methods. The DIS concept is based on connected modules that include: the spatial database - the core of the system, the spatial data collection module formed by: terrestrial, satellite and remote sensing measurements of the ground changes, the spatial data mining module for data discovery and extraction, the geological modelling module, the spatial data modeling module with data processing algorithms for spatio-temporal analysis and mapping of mining deformations and their characteristics (e.g. deformation parameters: tilt, curvature and horizontal strain), the multivariate spatial data classification module and the visualization module allowing two-dimensional interactive and static mapping and three-dimensional visualizations of mining ground characteristics. The Systems's functionality has been presented on the case study of a coal mining region in SW Poland where it has been applied to study characteristics and map mining induced ground deformations in a city in the last two decades of underground coal extraction and in the first decade after the end of mining. The mining subsidence area and its deformation parameters (tilt and curvature) have been calculated and the latter classified and mapped according to the Polish regulations. In addition possible areas of ground deformation have been indicated based on multivariate spatial data analysis of geological and mining operation characteristics with the geographically weighted regression method.

  19. Biomedical text mining and its applications in cancer research.

    PubMed

    Zhu, Fei; Patumcharoenpol, Preecha; Zhang, Cheng; Yang, Yang; Chan, Jonathan; Meechai, Asawin; Vongsangnak, Wanwipa; Shen, Bairong

    2013-04-01

    Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over 100years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer has led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing how much these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and we also provide some resources for cancer text mining. With the development of systems biology, researchers tend to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work of this field; (ii) help researchers to choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research. Copyright © 2012 Elsevier Inc. All rights reserved.

  20. Imitating manual curation of text-mined facts in biomedicine.

    PubMed

    Rodriguez-Esteban, Raul; Iossifov, Ivan; Rzhetsky, Andrey

    2006-09-08

    Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted) of individual facts--to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine.

  1. A New Metaheuristic Algorithm for Long-Term Open-Pit Production Planning / Nowy meta-heurystyczny algorytm wspomagający długoterminowe planowanie produkcji w kopalni odkrywkowej

    NASA Astrophysics Data System (ADS)

    Sattarvand, Javad; Niemann-Delius, Christian

    2013-03-01

    Paper describes a new metaheuristic algorithm which has been developed based on the Ant Colony Optimisation (ACO) and its efficiency have been discussed. To apply the ACO process on mine planning problem, a series of variables are considered for each block as the pheromone trails that represent the desirability of the block for being the deepest point of the mine in that column for the given mining period. During implementation several mine schedules are constructed in each iteration. Then the pheromone values of all blocks are reduced to a certain percentage and additionally the pheromone value of those blocks that are used in defining the constructed schedules are increased according to the quality of the generated solutions. By repeated iterations, the pheromone values of those blocks that define the shape of the optimum solution are increased whereas those of the others have been significantly evaporated.

  2. Mining disease genes using integrated protein-protein interaction and gene-gene co-regulation information.

    PubMed

    Li, Jin; Wang, Limei; Guo, Maozu; Zhang, Ruijie; Dai, Qiguo; Liu, Xiaoyan; Wang, Chunyu; Teng, Zhixia; Xuan, Ping; Zhang, Mingming

    2015-01-01

    In humans, despite the rapid increase in disease-associated gene discovery, a large proportion of disease-associated genes are still unknown. Many network-based approaches have been used to prioritize disease genes. Many networks, such as the protein-protein interaction (PPI), KEGG, and gene co-expression networks, have been used. Expression quantitative trait loci (eQTLs) have been successfully applied for the determination of genes associated with several diseases. In this study, we constructed an eQTL-based gene-gene co-regulation network (GGCRN) and used it to mine for disease genes. We adopted the random walk with restart (RWR) algorithm to mine for genes associated with Alzheimer disease. Compared to the Human Protein Reference Database (HPRD) PPI network alone, the integrated HPRD PPI and GGCRN networks provided faster convergence and revealed new disease-related genes. Therefore, using the RWR algorithm for integrated PPI and GGCRN is an effective method for disease-associated gene mining.

  3. Kernel Methods for Mining Instance Data in Ontologies

    NASA Astrophysics Data System (ADS)

    Bloehdorn, Stephan; Sure, York

    The amount of ontologies and meta data available on the Web is constantly growing. The successful application of machine learning techniques for learning of ontologies from textual data, i.e. mining for the Semantic Web, contributes to this trend. However, no principal approaches exist so far for mining from the Semantic Web. We investigate how machine learning algorithms can be made amenable for directly taking advantage of the rich knowledge expressed in ontologies and associated instance data. Kernel methods have been successfully employed in various learning tasks and provide a clean framework for interfacing between non-vectorial data and machine learning algorithms. In this spirit, we express the problem of mining instances in ontologies as the problem of defining valid corresponding kernels. We present a principled framework for designing such kernels by means of decomposing the kernel computation into specialized kernels for selected characteristics of an ontology which can be flexibly assembled and tuned. Initial experiments on real world Semantic Web data enjoy promising results and show the usefulness of our approach.

  4. Optimization of C4.5 algorithm-based particle swarm optimization for breast cancer diagnosis

    NASA Astrophysics Data System (ADS)

    Muslim, M. A.; Rukmana, S. H.; Sugiharti, E.; Prasetiyo, B.; Alimah, S.

    2018-03-01

    Data mining has become a basic methodology for computational applications in the field of medical domains. Data mining can be applied in the health field such as for diagnosis of breast cancer, heart disease, diabetes and others. Breast cancer is most common in women, with more than one million cases and nearly 600,000 deaths occurring worldwide each year. The most effective way to reduce breast cancer deaths was by early diagnosis. This study aims to determine the level of breast cancer diagnosis. This research data uses Wisconsin Breast Cancer dataset (WBC) from UCI machine learning. The method used in this research is the algorithm C4.5 and Particle Swarm Optimization (PSO) as a feature option and to optimize the algorithm. C4.5. Ten-fold cross-validation is used as a validation method and a confusion matrix. The result of this research is C4.5 algorithm. The particle swarm optimization C4.5 algorithm has increased by 0.88%.

  5. HPC-NMF: A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kannan, Ramakrishnan; Sukumar, Sreenivas R.; Ballard, Grey M.

    NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets. We propose a high-performance distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems formore » $$\\WW$$ and $$\\HH$$. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementation, our algorithm is also flexible: It performs well for both dense and sparse matrices, and allows the user to choose any one of the multiple algorithms for solving the updates to low rank factors $$\\WW$$ and $$\\HH$$ within the alternating iterations.« less

  6. Ask and Ye Shall Receive? Automated Text Mining of Michigan Capital Facility Finance Bond Election Proposals to Identify Which Topics Are Associated with Bond Passage and Voter Turnout

    ERIC Educational Resources Information Center

    Bowers, Alex J.; Chen, Jingjing

    2015-01-01

    The purpose of this study is to bring together recent innovations in the research literature around school district capital facility finance, municipal bond elections, statistical models of conditional time-varying outcomes, and data mining algorithms for automated text mining of election ballot proposals to examine the factors that influence the…

  7. A construction scheme of web page comment information extraction system based on frequent subtree mining

    NASA Astrophysics Data System (ADS)

    Zhang, Xiaowen; Chen, Bingfeng

    2017-08-01

    Based on the frequent sub-tree mining algorithm, this paper proposes a construction scheme of web page comment information extraction system based on frequent subtree mining, referred to as FSM system. The entire system architecture and the various modules to do a brief introduction, and then the core of the system to do a detailed description, and finally give the system prototype.

  8. Modeling Spatial Dependencies and Semantic Concepts in Data Mining

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vatsavai, Raju

    Data mining is the process of discovering new patterns and relationships in large datasets. However, several studies have shown that general data mining techniques often fail to extract meaningful patterns and relationships from the spatial data owing to the violation of fundamental geospatial principles. In this tutorial, we introduce basic principles behind explicit modeling of spatial and semantic concepts in data mining. In particular, we focus on modeling these concepts in the widely used classification, clustering, and prediction algorithms. Classification is the process of learning a structure or model (from user given inputs) and applying the known model to themore » new data. Clustering is the process of discovering groups and structures in the data that are ``similar,'' without applying any known structures in the data. Prediction is the process of finding a function that models (explains) the data with least error. One common assumption among all these methods is that the data is independent and identically distributed. Such assumptions do not hold well in spatial data, where spatial dependency and spatial heterogeneity are a norm. In addition, spatial semantics are often ignored by the data mining algorithms. In this tutorial we cover recent advances in explicitly modeling of spatial dependencies and semantic concepts in data mining.« less

  9. Fast Katz and Commuters: Efficient Estimation of Social Relatedness in Large Networks

    NASA Astrophysics Data System (ADS)

    Esfandiar, Pooya; Bonchi, Francesco; Gleich, David F.; Greif, Chen; Lakshmanan, Laks V. S.; On, Byung-Won

    Motivated by social network data mining problems such as link prediction and collaborative filtering, significant research effort has been devoted to computing topological measures including the Katz score and the commute time. Existing approaches typically approximate all pairwise relationships simultaneously. In this paper, we are interested in computing: the score for a single pair of nodes, and the top-k nodes with the best scores from a given source node. For the pairwise problem, we apply an iterative algorithm that computes upper and lower bounds for the measures we seek. This algorithm exploits a relationship between the Lanczos process and a quadrature rule. For the top-k problem, we propose an algorithm that only accesses a small portion of the graph and is related to techniques used in personalized PageRank computing. To test the scalability and accuracy of our algorithms we experiment with three real-world networks and find that these algorithms run in milliseconds to seconds without any preprocessing.

  10. Fast katz and commuters : efficient estimation of social relatedness in large networks.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    On, Byung-Won; Lakshmanan, Laks V. S.; Greif, Chen

    Motivated by social network data mining problems such as link prediction and collaborative filtering, significant research effort has been devoted to computing topological measures including the Katz score and the commute time. Existing approaches typically approximate all pairwise relationships simultaneously. In this paper, we are interested in computing: the score for a single pair of nodes, and the top-k nodes with the best scores from a given source node. For the pairwise problem, we apply an iterative algorithm that computes upper and lower bounds for the measures we seek. This algorithm exploits a relationship between the Lanczos process and amore » quadrature rule. For the top-k problem, we propose an algorithm that only accesses a small portion of the graph and is related to techniques used in personalized PageRank computing. To test the scalability and accuracy of our algorithms we experiment with three real-world networks and find that these algorithms run in milliseconds to seconds without any preprocessing.« less

  11. Fast Adapting Ensemble: A New Algorithm for Mining Data Streams with Concept Drift

    PubMed Central

    Ortíz Díaz, Agustín; Ramos-Jiménez, Gonzalo; Frías Blanco, Isvani; Caballero Mota, Yailé; Morales-Bueno, Rafael

    2015-01-01

    The treatment of large data streams in the presence of concept drifts is one of the main challenges in the field of data mining, particularly when the algorithms have to deal with concepts that disappear and then reappear. This paper presents a new algorithm, called Fast Adapting Ensemble (FAE), which adapts very quickly to both abrupt and gradual concept drifts, and has been specifically designed to deal with recurring concepts. FAE processes the learning examples in blocks of the same size, but it does not have to wait for the batch to be complete in order to adapt its base classification mechanism. FAE incorporates a drift detector to improve the handling of abrupt concept drifts and stores a set of inactive classifiers that represent old concepts, which are activated very quickly when these concepts reappear. We compare our new algorithm with various well-known learning algorithms, taking into account, common benchmark datasets. The experiments show promising results from the proposed algorithm (regarding accuracy and runtime), handling different types of concept drifts. PMID:25879051

  12. GPU Accelerated Browser for Neuroimaging Genomics.

    PubMed

    Zigon, Bob; Li, Huang; Yao, Xiaohui; Fang, Shiaofen; Hasan, Mohammad Al; Yan, Jingwen; Moore, Jason H; Saykin, Andrew J; Shen, Li

    2018-04-25

    Neuroimaging genomics is an emerging field that provides exciting opportunities to understand the genetic basis of brain structure and function. The unprecedented scale and complexity of the imaging and genomics data, however, have presented critical computational bottlenecks. In this work we present our initial efforts towards building an interactive visual exploratory system for mining big data in neuroimaging genomics. A GPU accelerated browsing tool for neuroimaging genomics is created that implements the ANOVA algorithm for single nucleotide polymorphism (SNP) based analysis and the VEGAS algorithm for gene-based analysis, and executes them at interactive rates. The ANOVA algorithm is 110 times faster than the 4-core OpenMP version, while the VEGAS algorithm is 375 times faster than its 4-core OpenMP counter part. This approach lays a solid foundation for researchers to address the challenges of mining large-scale imaging genomics datasets via interactive visual exploration.

  13. Data Mining and Optimization Tools for Developing Engine Parameters Tools

    NASA Technical Reports Server (NTRS)

    Dhawan, Atam P.

    1998-01-01

    This project was awarded for understanding the problem and developing a plan for Data Mining tools for use in designing and implementing an Engine Condition Monitoring System. From the total budget of $5,000, Tricia and I studied the problem domain for developing ail Engine Condition Monitoring system using the sparse and non-standardized datasets to be available through a consortium at NASA Lewis Research Center. We visited NASA three times to discuss additional issues related to dataset which was not made available to us. We discussed and developed a general framework of data mining and optimization tools to extract useful information from sparse and non-standard datasets. These discussions lead to the training of Tricia Erhardt to develop Genetic Algorithm based search programs which were written in C++ and used to demonstrate the capability of GA algorithm in searching an optimal solution in noisy datasets. From the study and discussion with NASA LERC personnel, we then prepared a proposal, which is being submitted to NASA for future work for the development of data mining algorithms for engine conditional monitoring. The proposed set of algorithm uses wavelet processing for creating multi-resolution pyramid of the data for GA based multi-resolution optimal search. Wavelet processing is proposed to create a coarse resolution representation of data providing two advantages in GA based search: 1. We will have less data to begin with to make search sub-spaces. 2. It will have robustness against the noise because at every level of wavelet based decomposition, we will be decomposing the signal into low pass and high pass filters.

  14. Data Mining at NASA: From Theory to Applications

    NASA Technical Reports Server (NTRS)

    Srivastava, Ashok N.

    2009-01-01

    This slide presentation demonstrates the data mining/machine learning capabilities of NASA Ames and Intelligent Data Understanding (IDU) group. This will encompass the work done recently in the group by various group members. The IDU group develops novel algorithms to detect, classify, and predict events in large data streams for scientific and engineering systems. This presentation for Knowledge Discovery and Data Mining 2009 is to demonstrate the data mining/machine learning capabilities of NASA Ames and IDU group. This will encompass the work done re cently in the group by various group members.

  15. A Data Mining Approach to Identify Sexuality Patterns in a Brazilian University Population.

    PubMed

    Waleska Simões, Priscyla; Cesconetto, Samuel; Toniazzo de Abreu, Larissa Letieli; Côrtes de Mattos Garcia, Merisandra; Cassettari Junior, José Márcio; Comunello, Eros; Bisognin Ceretta, Luciane; Aparecida Manenti, Sandra

    2015-01-01

    This paper presents the profile and experience of sexuality generated from a data mining classification task. We used a database about sexuality and gender violence performed on a university population in southern Brazil. The data mining task identified two relationships between the variables, which enabled the distinction of subgroups that better detail the profile and experience of sexuality. The identification of the relationships between the variables define behavioral models and factors of risk that will help define the algorithms being implemented in the data mining classification task.

  16. Prediction of healthy blood with data mining classification by using Decision Tree, Naive Baysian and SVM approaches

    NASA Astrophysics Data System (ADS)

    Khalilinezhad, Mahdieh; Minaei, Behrooz; Vernazza, Gianni; Dellepiane, Silvana

    2015-03-01

    Data mining (DM) is the process of discovery knowledge from large databases. Applications of data mining in Blood Transfusion Organizations could be useful for improving the performance of blood donation service. The aim of this research is the prediction of healthiness of blood donors in Blood Transfusion Organization (BTO). For this goal, three famous algorithms such as Decision Tree C4.5, Naïve Bayesian classifier, and Support Vector Machine have been chosen and applied to a real database made of 11006 donors. Seven fields such as sex, age, job, education, marital status, type of donor, results of blood tests (doctors' comments and lab results about healthy or unhealthy blood donors) have been selected as input to these algorithms. The results of the three algorithms have been compared and an error cost analysis has been performed. According to this research and the obtained results, the best algorithm with low error cost and high accuracy is SVM. This research helps BTO to realize a model from blood donors in each area in order to predict the healthy blood or unhealthy blood of donors. This research could be useful if used in parallel with laboratory tests to better separate unhealthy blood.

  17. Urinary metabolic profiling of asymptomatic acute intermittent porphyria using a rule-mining-based algorithm.

    PubMed

    Luck, Margaux; Schmitt, Caroline; Talbi, Neila; Gouya, Laurent; Caradeuc, Cédric; Puy, Hervé; Bertho, Gildas; Pallet, Nicolas

    2018-01-01

    Metabolomic profiling combines Nuclear Magnetic Resonance spectroscopy with supervised statistical analysis that might allow to better understanding the mechanisms of a disease. In this study, the urinary metabolic profiling of individuals with porphyrias was performed to predict different types of disease, and to propose new pathophysiological hypotheses. Urine 1 H-NMR spectra of 73 patients with asymptomatic acute intermittent porphyria (aAIP) and familial or sporadic porphyria cutanea tarda (f/sPCT) were compared using a supervised rule-mining algorithm. NMR spectrum buckets bins, corresponding to rules, were extracted and a logistic regression was trained. Our rule-mining algorithm generated results were consistent with those obtained using partial least square discriminant analysis (PLS-DA) and the predictive performance of the model was significant. Buckets that were identified by the algorithm corresponded to metabolites involved in glycolysis and energy-conversion pathways, notably acetate, citrate, and pyruvate, which were found in higher concentrations in the urines of aAIP compared with PCT patients. Metabolic profiling did not discriminate sPCT from fPCT patients. These results suggest that metabolic reprogramming occurs in aAIP individuals, even in the absence of overt symptoms, and supports the relationship that occur between heme synthesis and mitochondrial energetic metabolism.

  18. Web mining in soft computing framework: relevance, state of the art and future directions.

    PubMed

    Pal, S K; Talwar, V; Mitra, P

    2002-01-01

    The paper summarizes the different characteristics of Web data, the basic components of Web mining and its different types, and the current state of the art. The reason for considering Web mining, a separate field from data mining, is explained. The limitations of some of the existing Web mining methods and tools are enunciated, and the significance of soft computing (comprising fuzzy logic (FL), artificial neural networks (ANNs), genetic algorithms (GAs), and rough sets (RSs) are highlighted. A survey of the existing literature on "soft Web mining" is provided along with the commercially available systems. The prospective areas of Web mining where the application of soft computing needs immediate attention are outlined with justification. Scope for future research in developing "soft Web mining" systems is explained. An extensive bibliography is also provided.

  19. A Local Scalable Distributed Expectation Maximization Algorithm for Large Peer-to-Peer Networks

    NASA Technical Reports Server (NTRS)

    Bhaduri, Kanishka; Srivastava, Ashok N.

    2009-01-01

    This paper offers a local distributed algorithm for expectation maximization in large peer-to-peer environments. The algorithm can be used for a variety of well-known data mining tasks in a distributed environment such as clustering, anomaly detection, target tracking to name a few. This technology is crucial for many emerging peer-to-peer applications for bioinformatics, astronomy, social networking, sensor networks and web mining. Centralizing all or some of the data for building global models is impractical in such peer-to-peer environments because of the large number of data sources, the asynchronous nature of the peer-to-peer networks, and dynamic nature of the data/network. The distributed algorithm we have developed in this paper is provably-correct i.e. it converges to the same result compared to a similar centralized algorithm and can automatically adapt to changes to the data and the network. We show that the communication overhead of the algorithm is very low due to its local nature. This monitoring algorithm is then used as a feedback loop to sample data from the network and rebuild the model when it is outdated. We present thorough experimental results to verify our theoretical claims.

  20. Modeling N Cycling during Succession after Forest Disturbance: an Analysis of N Mining and Retention Hypothesis

    NASA Astrophysics Data System (ADS)

    Zhou, Z.; Ollinger, S. V.; Ouimette, A.; Lovett, G. M.; Fuss, C. B.; Goodale, C. L.

    2017-12-01

    Dissolved inorganic nitrogen losses at the Hubbard Brook Experimental Forest (HBEF), New Hampshire, USA, have declined in recent decades, a pattern that counters expectations based on prevailing theory. An unbalanced ecosystem nitrogen (N) budget implies there is a missing component for N sink. Hypotheses to explain this discrepancy include increasing rates of denitrification and accumulation of N in mineral soil pools following N mining by plants. Here, we conducted a modeling analysis fused with field measurements of N cycling, specifically examining the hypothesis relevant to N mining and retention in mineral soils. We included simplified representations of both mechanisms, N mining and retention, in a revised ecosystem process model, PnET-SOM, to evaluate the dynamics of N cycling during succession after forest disturbance at the HBEF. The predicted N mining during the early succession was regulated by a metric representing a potential demand of extra soil N for large wood growth. The accumulation of nitrate in mineral soil pools was a function of the net aboveground biomass accumulation and soil N availability and parameterized based on field 15N tracer incubation data. The predicted patterns of forest N dynamics were consistent with observations. The addition of the new algorithms also improved the predicted DIN export in stream water with an R squared of 0.35 (P<0.01) aganist observations. Predicted mining processes had an average rate of 7.4 kgNha-1yr-1 and Predicted rates of N retention processes were 5.2 kgNha-1yr-1, both of which were in line with estimates only based on field data. The predicted trend of low DIN export could continue for another 70 years to pay back the mined N in mineral soils. Predicted ecosystem N balance showed that N gas loss could account for 14-46% of the total N deposition, the soil mining about 103% during the early succession, and soil retention about 35% at the current forest stage at the HBEF.

  1. Enhancements for a Dynamic Data Warehousing and Mining System for Large-Scale Human Social Cultural Behavioral (HSBC) Data

    DTIC Science & Technology

    2016-09-26

    Intelligent Automation Incorporated Enhancements for a Dynamic Data Warehousing and Mining ...Enhancements for a Dynamic Data Warehousing and Mining System for N00014-16-P-3014 Large-Scale Human Social Cultural Behavioral (HSBC) Data 5b. GRANT NUMBER...Representative Media Gallery View. We perform Scraawl’s NER algorithm to the text associated with YouTube post, which classifies the named entities into

  2. Numerical linear algebra in data mining

    NASA Astrophysics Data System (ADS)

    Eldén, Lars

    Ideas and algorithms from numerical linear algebra are important in several areas of data mining. We give an overview of linear algebra methods in text mining (information retrieval), pattern recognition (classification of handwritten digits), and PageRank computations for web search engines. The emphasis is on rank reduction as a method of extracting information from a data matrix, low-rank approximation of matrices using the singular value decomposition and clustering, and on eigenvalue methods for network analysis.

  3. MINEs: Open access databases of computationally predicted enzyme promiscuity products for untargeted metabolomics

    DOE PAGES

    Jeffryes, James G.; Colastani, Ricardo L.; Elbadawi-Sidhu, Mona; ...

    2015-08-28

    Metabolomics have proven difficult to execute in an untargeted and generalizable manner. Liquid chromatography–mass spectrometry (LC–MS) has made it possible to gather data on thousands of cellular metabolites. However, matching metabolites to their spectral features continues to be a bottleneck, meaning that much of the collected information remains uninterpreted and that new metabolites are seldom discovered in untargeted studies. These challenges require new approaches that consider compounds beyond those available in curated biochemistry databases. Here we present Metabolic In silico Network Expansions (MINEs), an extension of known metabolite databases to include molecules that have not been observed, but are likelymore » to occur based on known metabolites and common biochemical reactions. We utilize an algorithm called the Biochemical Network Integrated Computational Explorer (BNICE) and expert-curated reaction rules based on the Enzyme Commission classification system to propose the novel chemical structures and reactions that comprise MINE databases. Starting from the Kyoto Encyclopedia of Genes and Genomes (KEGG) COMPOUND database, the MINE contains over 571,000 compounds, of which 93% are not present in the PubChem database. However, these MINE compounds have on average higher structural similarity to natural products than compounds from KEGG or PubChem. MINE databases were able to propose annotations for 98.6% of a set of 667 MassBank spectra, 14% more than KEGG alone and equivalent to PubChem while returning far fewer candidates per spectra than PubChem (46 vs. 1715 median candidates). Application of MINEs to LC–MS accurate mass data enabled the identity of an unknown peak to be confidently predicted. MINE databases are freely accessible for non-commercial use via user-friendly web-tools at http://minedatabase.mcs.anl.gov and developer-friendly APIs. MINEs improve metabolomics peak identification as compared to general chemical databases whose results include irrelevant synthetic compounds. MINEs complement and expand on previous in silico generated compound databases that focus on human metabolism. We are actively developing the database; future versions of this resource will incorporate transformation rules for spontaneous chemical reactions and more advanced filtering and prioritization of candidate structures.« less

  4. Web Mining: Machine Learning for Web Applications.

    ERIC Educational Resources Information Center

    Chen, Hsinchun; Chau, Michael

    2004-01-01

    Presents an overview of machine learning research and reviews methods used for evaluating machine learning systems. Ways that machine-learning algorithms were used in traditional information retrieval systems in the "pre-Web" era are described, and the field of Web mining and how machine learning has been used in different Web mining…

  5. Comparative Data Mining Analysis for Information Retrieval of MODIS Images: Monitoring Lake Turbidity Changes at Lake Okeechobee, Florida

    EPA Science Inventory

    In the remote sensing field, a frequently recurring question is: Which computational intelligence or data mining algorithms are most suitable for the retrieval of essential information given that most natural systems exhibit very high non-linearity. Among potential candidates mig...

  6. NEW IMPROVEMENTS TO MFIRE TO ENHANCE FIRE MODELING CAPABILITIES.

    PubMed

    Zhou, L; Smith, A C; Yuan, L

    2016-06-01

    NIOSH's mine fire simulation program, MFIRE, is widely accepted as a standard for assessing and predicting the impact of a fire on the mine ventilation system and the spread of fire contaminants in coal and metal/nonmetal mines, which has been used by U.S. and international companies to simulate fires for planning and response purposes. MFIRE is a dynamic, transient-state, mine ventilation network simulation program that performs normal planning calculations. It can also be used to analyze ventilation networks under thermal and mechanical influence such as changes in ventilation parameters, external influences such as changes in temperature, and internal influences such as a fire. The program output can be used to analyze the effects of these influences on the ventilation system. Since its original development by Michigan Technological University for the Bureau of Mines in the 1970s, several updates have been released over the years. In 2012, NIOSH completed a major redesign and restructuring of the program with the release of MFIRE 3.0. MFIRE's outdated FORTRAN programming language was replaced with an object-oriented C++ language and packaged into a dynamic link library (DLL). However, the MFIRE 3.0 release made no attempt to change or improve the fire modeling algorithms inherited from its previous version, MFIRE 2.20. This paper reports on improvements that have been made to the fire modeling capabilities of MFIRE 3.0 since its release. These improvements include the addition of fire source models of the t-squared fire and heat release rate curve data file, the addition of a moving fire source for conveyor belt fire simulations, improvement of the fire location algorithm, and the identification and prediction of smoke rollback phenomena. All the improvements discussed in this paper will be termed as MFIRE 3.1 and released by NIOSH in the near future.

  7. Dynamic association rules for gene expression data analysis.

    PubMed

    Chen, Shu-Chuan; Tsai, Tsung-Hsien; Chung, Cheng-Han; Li, Wen-Hsiung

    2015-10-14

    The purpose of gene expression analysis is to look for the association between regulation of gene expression levels and phenotypic variations. This association based on gene expression profile has been used to determine whether the induction/repression of genes correspond to phenotypic variations including cell regulations, clinical diagnoses and drug development. Statistical analyses on microarray data have been developed to resolve gene selection issue. However, these methods do not inform us of causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm) which helps ones to efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical way, based on constructing a one-sided confidence interval and hypothesis testing, to determine if an association rule is meaningful. Based on the proposed statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to analyze four microarray datasets and one Next Generation Sequencing (NGS) dataset: the Mice Apo A1 dataset, the whole genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of Leukemia patients, Microarray Quality Control (MAQC) data set and the RNA-seq dataset of a mouse genomic imprinting study. A comparison of the proposed method with the t-test on the expression profiling of the bone marrow of Leukemia patients was conducted. We developed a statistical way, based on the concept of confidence interval, to determine the minimum support and minimum confidence for mining association relationships among items. With the minimum support and minimum confidence, one can find significant rules in one single step. The DAR algorithm was then developed for gene expression data analysis. Four gene expression datasets showed that the proposed DAR algorithm not only was able to identify a set of differentially expressed genes that largely agreed with that of other methods, but also provided an efficient and accurate way to find influential genes of a disease. In the paper, the well-established association rule mining technique from marketing has been successfully modified to determine the minimum support and minimum confidence based on the concept of confidence interval and hypothesis testing. It can be applied to gene expression data to mine significant association rules between gene regulation and phenotype. The proposed DAR algorithm provides an efficient way to find influential genes that underlie the phenotypic variance.

  8. Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning

    PubMed Central

    Xu, Shengzhi; Cheng, Xiang; Li, Zhengyi; Xiong, Li

    2016-01-01

    In this paper, we study the problem of mining frequent sequences under the rigorous differential privacy model. We explore the possibility of designing a differentially private frequent sequence mining (FSM) algorithm which can achieve both high data utility and a high degree of privacy. We found, in differentially private FSM, the amount of required noise is proportionate to the number of candidate sequences. If we could effectively reduce the number of unpromising candidate sequences, the utility and privacy tradeoff can be significantly improved. To this end, by leveraging a sampling-based candidate pruning technique, we propose a novel differentially private FSM algorithm, which is referred to as PFS2. The core of our algorithm is to utilize sample databases to further prune the candidate sequences generated based on the downward closure property. In particular, we use the noisy local support of candidate sequences in the sample databases to estimate which sequences are potentially frequent. To improve the accuracy of such private estimations, a sequence shrinking method is proposed to enforce the length constraint on the sample databases. Moreover, to decrease the probability of misestimating frequent sequences as infrequent, a threshold relaxation method is proposed to relax the user-specified threshold for the sample databases. Through formal privacy analysis, we show that our PFS2 algorithm is ε-differentially private. Extensive experiments on real datasets illustrate that our PFS2 algorithm can privately find frequent sequences with high accuracy. PMID:26973430

  9. Using Geostationary Communications Satellites as a Sensor: Telemetry Search Algorithms

    NASA Astrophysics Data System (ADS)

    Cahoy, K.; Carlton, A.; Lohmeyer, W. Q.

    2014-12-01

    For decades, operators and manufacturers have collected large amounts of telemetry from geostationary (GEO) communications satellites to monitor system health and performance, yet this data is rarely mined for scientific purposes. The goal of this work is to mine data archives acquired from commercial operators using new algorithms that can detect when a space weather (or non-space weather) event of interest has occurred or is in progress. We have developed algorithms to statistically analyze power amplifier current and temperature telemetry and identify deviations from nominal operations or other trends of interest. We then examine space weather data to see what role, if any, it might have played. We also closely examine both long and short periods of time before an anomaly to determine whether or not the anomaly could have been predicted.

  10. Littoral assessment of mine burial signatures (LAMBS): buried landmine/background spectral-signature analyses

    NASA Astrophysics Data System (ADS)

    Kenton, Arthur C.; Geci, Duane M.; Ray, Kristofer J.; Thomas, Clayton M.; Salisbury, John W.; Mars, John C.; Crowley, James K.; Witherspoon, Ned H.; Holloway, John H., Jr.

    2004-09-01

    The objective of the Office of Naval Research (ONR) Rapid Overt Reconnaissance (ROR) program and the Airborne Littoral Reconnaissance Technologies (ALRT) project's LAMBS effort is to determine if electro-optical spectral discriminants exist that are useful for the detection of land mines in littoral regions. Statistically significant buried mine overburden and background signature data were collected over a wide spectral range (0.35 to 14 μm) to identify robust spectral features that might serve as discriminants for new airborne sensor concepts. LAMBS has expanded previously collected databases to littoral areas - primarily dry and wet sandy soils - where tidal, surf, and wind conditions can severely modify spectral signatures. At AeroSense 2003, we reported completion of three buried mine collections at an inland bay, Atlantic and Gulf of Mexico beach sites. We now report LAMBS spectral database analyses results using metrics which characterize the detection performance of general types of spectral detection algorithms. These metrics include mean contrast, spectral signal-to-clutter, covariance, information content, and spectral matched filter analyses. Detection performance of the buried land mines was analyzed with regard to burial age, background type, and environmental conditions. These analyses considered features observed due to particle size differences, surface roughness, surface moisture, and compositional differences.

  11. Microbial genotype-phenotype mapping by class association rule mining.

    PubMed

    Tamura, Makio; D'haeseleer, Patrik

    2008-07-01

    Microbial phenotypes are typically due to the concerted action of multiple gene functions, yet the presence of each gene may have only a weak correlation with the observed phenotype. Hence, it may be more appropriate to examine co-occurrence between sets of genes and a phenotype (multiple-to-one) instead of pairwise relations between a single gene and the phenotype. Here, we propose an efficient class association rule mining algorithm, netCAR, in order to extract sets of COGs (clusters of orthologous groups of proteins) associated with a phenotype from COG phylogenetic profiles and a phenotype profile. netCAR takes into account the phylogenetic co-occurrence graph between COGs to restrict hypothesis space, and uses mutual information to evaluate the biconditional relation. We examined the mining capability of pairwise and multiple-to-one association by using netCAR to extract COGs relevant to six microbial phenotypes (aerobic, anaerobic, facultative, endospore, motility and Gram negative) from 11,969 unique COG profiles across 155 prokaryotic organisms. With the same level of false discovery rate, multiple-to-one association can extract about 10 times more relevant COGs than one-to-one association. We also reveal various topologies of association networks among COGs (modules) from extracted multiple-to-one correlation rules relevant with the six phenotypes; including a well-connected network for motility, a star-shaped network for aerobic and intermediate topologies for the other phenotypes. netCAR outperforms a standard CAR mining algorithm, CARapriori, while requiring several orders of magnitude less computational time for extracting 3-COG sets. Source code of the Java implementation is available as Supplementary Material at the Bioinformatics online website, or upon request to the author. Supplementary data are available at Bioinformatics online.

  12. Emerging trends in geospatial artificial intelligence (geoAI): potential applications for environmental epidemiology.

    PubMed

    VoPham, Trang; Hart, Jaime E; Laden, Francine; Chiang, Yao-Yi

    2018-04-17

    Geospatial artificial intelligence (geoAI) is an emerging scientific discipline that combines innovations in spatial science, artificial intelligence methods in machine learning (e.g., deep learning), data mining, and high-performance computing to extract knowledge from spatial big data. In environmental epidemiology, exposure modeling is a commonly used approach to conduct exposure assessment to determine the distribution of exposures in study populations. geoAI technologies provide important advantages for exposure modeling in environmental epidemiology, including the ability to incorporate large amounts of big spatial and temporal data in a variety of formats; computational efficiency; flexibility in algorithms and workflows to accommodate relevant characteristics of spatial (environmental) processes including spatial nonstationarity; and scalability to model other environmental exposures across different geographic areas. The objectives of this commentary are to provide an overview of key concepts surrounding the evolving and interdisciplinary field of geoAI including spatial data science, machine learning, deep learning, and data mining; recent geoAI applications in research; and potential future directions for geoAI in environmental epidemiology.

  13. MINEs: open access databases of computationally predicted enzyme promiscuity products for untargeted metabolomics.

    PubMed

    Jeffryes, James G; Colastani, Ricardo L; Elbadawi-Sidhu, Mona; Kind, Tobias; Niehaus, Thomas D; Broadbelt, Linda J; Hanson, Andrew D; Fiehn, Oliver; Tyo, Keith E J; Henry, Christopher S

    2015-01-01

    In spite of its great promise, metabolomics has proven difficult to execute in an untargeted and generalizable manner. Liquid chromatography-mass spectrometry (LC-MS) has made it possible to gather data on thousands of cellular metabolites. However, matching metabolites to their spectral features continues to be a bottleneck, meaning that much of the collected information remains uninterpreted and that new metabolites are seldom discovered in untargeted studies. These challenges require new approaches that consider compounds beyond those available in curated biochemistry databases. Here we present Metabolic In silico Network Expansions (MINEs), an extension of known metabolite databases to include molecules that have not been observed, but are likely to occur based on known metabolites and common biochemical reactions. We utilize an algorithm called the Biochemical Network Integrated Computational Explorer (BNICE) and expert-curated reaction rules based on the Enzyme Commission classification system to propose the novel chemical structures and reactions that comprise MINE databases. Starting from the Kyoto Encyclopedia of Genes and Genomes (KEGG) COMPOUND database, the MINE contains over 571,000 compounds, of which 93% are not present in the PubChem database. However, these MINE compounds have on average higher structural similarity to natural products than compounds from KEGG or PubChem. MINE databases were able to propose annotations for 98.6% of a set of 667 MassBank spectra, 14% more than KEGG alone and equivalent to PubChem while returning far fewer candidates per spectra than PubChem (46 vs. 1715 median candidates). Application of MINEs to LC-MS accurate mass data enabled the identity of an unknown peak to be confidently predicted. MINE databases are freely accessible for non-commercial use via user-friendly web-tools at http://minedatabase.mcs.anl.gov and developer-friendly APIs. MINEs improve metabolomics peak identification as compared to general chemical databases whose results include irrelevant synthetic compounds. Furthermore, MINEs complement and expand on previous in silico generated compound databases that focus on human metabolism. We are actively developing the database; future versions of this resource will incorporate transformation rules for spontaneous chemical reactions and more advanced filtering and prioritization of candidate structures. Graphical abstractMINE database construction and access methods. The process of constructing a MINE database from the curated source databases is depicted on the left. The methods for accessing the database are shown on the right.

  14. Intelligent Scheduling for Underground Mobile Mining Equipment.

    PubMed

    Song, Zhen; Schunnesson, Håkan; Rinne, Mikael; Sturgul, John

    2015-01-01

    Many studies have been carried out and many commercial software applications have been developed to improve the performances of surface mining operations, especially for the loader-trucks cycle of surface mining. However, there have been quite few studies aiming to improve the mining process of underground mines. In underground mines, mobile mining equipment is mostly scheduled instinctively, without theoretical support for these decisions. Furthermore, in case of unexpected events, it is hard for miners to rapidly find solutions to reschedule and to adapt the changes. This investigation first introduces the motivation, the technical background, and then the objective of the study. A decision support instrument (i.e. schedule optimizer for mobile mining equipment) is proposed and described to address this issue. The method and related algorithms which are used in this instrument are presented and discussed. The proposed method was tested by using a real case of Kittilä mine located in Finland. The result suggests that the proposed method can considerably improve the working efficiency and reduce the working time of the underground mine.

  15. A review of contrast pattern based data mining

    NASA Astrophysics Data System (ADS)

    Zhu, Shiwei; Ju, Meilong; Yu, Junfeng; Cai, Binlei; Wang, Aiping

    2015-07-01

    Contrast pattern based data mining is concerned with the mining of patterns and models that contrast two or more datasets. Contrast patterns can describe similarities or differences between the datasets. They represent strong contrast knowledge and have been shown to be very successful for constructing accurate and robust clusters and classifiers. The increasing use of contrast pattern data mining has initiated a great deal of research and development attempts in the field of data mining. A comprehensive revision on the existing contrast pattern based data mining research is given in this paper. They are generally categorized into background and representation, definitions and mining algorithms, contrast pattern based classification, clustering, and other applications, the research trends in future. The primary of this paper is to server as a glossary for interested researchers to have an overall picture on the current contrast based data mining development and identify their potential research direction to future investigation.

  16. Systematic discovery of mutation-specific synthetic lethals by mining pan-cancer human primary tumor data

    NASA Astrophysics Data System (ADS)

    Sinha, Subarna; Thomas, Daniel; Chan, Steven; Gao, Yang; Brunen, Diede; Torabi, Damoun; Reinisch, Andreas; Hernandez, David; Chan, Andy; Rankin, Erinn B.; Bernards, Rene; Majeti, Ravindra; Dill, David L.

    2017-05-01

    Two genes are synthetically lethal (SL) when defects in both are lethal to a cell but a single defect is non-lethal. SL partners of cancer mutations are of great interest as pharmacological targets; however, identifying them by cell line-based methods is challenging. Here we develop MiSL (Mining Synthetic Lethals), an algorithm that mines pan-cancer human primary tumour data to identify mutation-specific SL partners for specific cancers. We apply MiSL to 12 different cancers and predict 145,891 SL partners for 3,120 mutations, including known mutation-specific SL partners. Comparisons with functional screens show that MiSL predictions are enriched for SLs in multiple cancers. We extensively validate a SL interaction identified by MiSL between the IDH1 mutation and ACACA in leukaemia using gene targeting and patient-derived xenografts. Furthermore, we apply MiSL to pinpoint genetic biomarkers for drug sensitivity. These results demonstrate that MiSL can accelerate precision oncology by identifying mutation-specific targets and biomarkers.

  17. Negative and Positive Association Rules Mining from Text Using Frequent and Infrequent Itemsets

    PubMed Central

    Mahmood, Sajid; Shahbaz, Muhammad; Guergachi, Aziz

    2014-01-01

    Association rule mining research typically focuses on positive association rules (PARs), generated from frequently occurring itemsets. However, in recent years, there has been a significant research focused on finding interesting infrequent itemsets leading to the discovery of negative association rules (NARs). The discovery of infrequent itemsets is far more difficult than their counterparts, that is, frequent itemsets. These problems include infrequent itemsets discovery and generation of accurate NARs, and their huge number as compared with positive association rules. In medical science, for example, one is interested in factors which can either adjudicate the presence of a disease or write-off of its possibility. The vivid positive symptoms are often obvious; however, negative symptoms are subtler and more difficult to recognize and diagnose. In this paper, we propose an algorithm for discovering positive and negative association rules among frequent and infrequent itemsets. We identify associations among medications, symptoms, and laboratory results using state-of-the-art data mining technology. PMID:24955429

  18. Development of Multiscale Biological Image Data Analysis: Review of 2006 International Workshop on Multiscale Biological Imaging, Data Mining and Informatics, Santa Barbara, USA (BII06)

    PubMed Central

    Auer, Manfred; Peng, Hanchuan; Singh, Ambuj

    2007-01-01

    The 2006 International Workshop on Multiscale Biological Imaging, Data Mining and Informatics was held at Santa Barbara, on Sept 7–8, 2006. Based on the presentations at the workshop, we selected and compiled this collection of research articles related to novel algorithms and enabling techniques for bio- and biomedical image analysis, mining, visualization, and biology applications. PMID:17634090

  19. Application of data mining techniques to explore predictors of HCC in Egyptian patients with HCV-related chronic liver disease.

    PubMed

    Omran, Dalia Abd El Hamid; Awad, AbuBakr Hussein; Mabrouk, Mahasen Abd El Rahman; Soliman, Ahmad Fouad; Aziz, Ashraf Omar Abdel

    2015-01-01

    Hepatocellular carcinoma (HCC) is the second most common malignancy in Egypt. Data mining is a method of predictive analysis which can explore tremendous volumes of information to discover hidden patterns and relationships. Our aim here was to develop a non-invasive algorithm for prediction of HCC. Such an algorithm should be economical, reliable, easy to apply and acceptable by domain experts. This cross-sectional study enrolled 315 patients with hepatitis C virus (HCV) related chronic liver disease (CLD); 135 HCC, 116 cirrhotic patients without HCC and 64 patients with chronic hepatitis C. Using data mining analysis, we constructed a decision tree learning algorithm to predict HCC. The decision tree algorithm was able to predict HCC with recall (sensitivity) of 83.5% and precession (specificity) of 83.3% using only routine data. The correctly classified instances were 259 (82.2%), and the incorrectly classified instances were 56 (17.8%). Out of 29 attributes, serum alpha fetoprotein (AFP), with an optimal cutoff value of ≥50.3 ng/ml was selected as the best predictor of HCC. To a lesser extent, male sex, presence of cirrhosis, AST>64U/L, and ascites were variables associated with HCC. Data mining analysis allows discovery of hidden patterns and enables the development of models to predict HCC, utilizing routine data as an alternative to CT and liver biopsy. This study has highlighted a new cutoff for AFP (≥50.3 ng/ml). Presence of a score of >2 risk variables (out of 5) can successfully predict HCC with a sensitivity of 96% and specificity of 82%.

  20. Real Alerts and Artifact Classification in Archived Multi-signal Vital Sign Monitoring Data—Implications for Mining Big Data — Implications for Mining Big Data

    PubMed Central

    Hravnak, Marilyn; Chen, Lujie; Dubrawski, Artur; Bose, Eliezer; Clermont, Gilles; Pinsky, Michael R.

    2015-01-01

    PURPOSE Huge hospital information system databases can be mined for knowledge discovery and decision support, but artifact in stored non-invasive vital sign (VS) high-frequency data streams limits its use. We used machine-learning (ML) algorithms trained on expert-labeled VS data streams to automatically classify VS alerts as real or artifact, thereby “cleaning” such data for future modeling. METHODS 634 admissions to a step-down unit had recorded continuous noninvasive VS monitoring data (heart rate [HR], respiratory rate [RR], peripheral arterial oxygen saturation [SpO2] at 1/20Hz., and noninvasive oscillometric blood pressure [BP]) Time data were across stability thresholds defined VS event epochs. Data were divided Block 1 as the ML training/cross-validation set and Block 2 the test set. Expert clinicians annotated Block 1 events as perceived real or artifact. After feature extraction, ML algorithms were trained to create and validate models automatically classifying events as real or artifact. The models were then tested on Block 2. RESULTS Block 1 yielded 812 VS events, with 214 (26%) judged by experts as artifact (RR 43%, SpO2 40%, BP 15%, HR 2%). ML algorithms applied to the Block 1 training/cross-validation set (10-fold cross-validation) gave area under the curve (AUC) scores of 0.97 RR, 0.91 BP and 0.76 SpO2. Performance when applied to Block 2 test data was AUC 0.94 RR, 0.84 BP and 0.72 SpO2). CONCLUSIONS ML-defined algorithms applied to archived multi-signal continuous VS monitoring data allowed accurate automated classification of VS alerts as real or artifact, and could support data mining for future model building. PMID:26438655

  1. Novel methods for detecting buried explosive devices

    NASA Astrophysics Data System (ADS)

    Kercel, Stephen W.; Burlage, Robert S.; Patek, David R.; Smith, Cyrus M.; Hibbs, Andrew D.; Rayner, Timothy J.

    1997-07-01

    Oak Ridge National Laboratory and Quantum Magnetics, Inc. are exploring novel landmine detection technologies. Technologies considered here include bioreporter bacteria, swept acoustic resonance, nuclear quadrupole resonance (NQR), and semiotic data fusion. Bioreporter bacteria look promising for third-world humanitarian applications; they are inexpensive, and deployment does not require high-tech methods. Swept acoustic resonance may be a useful adjunct to magnetometers in humanitarian demining. For military demining, NQR is a promising method for detecting explosive substances; of 50,000 substances that have been tested, one has an NQR signature that can be mistaken for RDX or TNT. For both military and commercial demining, sensor fusion entails two daunting tasks, identifying fusible features in both present-day and emerging technologies, and devising a fusion algorithm that runs in real-time on cheap hardware. Preliminary research in these areas is encouraging. A bioreporter bacterium for TNT detection is under development. Investigation has just started in swept acoustic resonance as an approach to a cheap mine detector for humanitarian use. Real-time wavelet processing appears to be a key to extending NQR bomb detection into mine detection, including TNT-based mines. Recent discoveries in semiotics may be the breakthrough that will lead to a robust fused detection scheme.

  2. Frequent Itemset Hiding Algorithm Using Frequent Pattern Tree Approach

    ERIC Educational Resources Information Center

    Alnatsheh, Rami

    2012-01-01

    A problem that has been the focus of much recent research in privacy preserving data-mining is the frequent itemset hiding (FIH) problem. Identifying itemsets that appear together frequently in customer transactions is a common task in association rule mining. Organizations that share data with business partners may consider some of the frequent…

  3. Detection and Evaluation of Cheating on College Exams Using Supervised Classification

    ERIC Educational Resources Information Center

    Cavalcanti, Elmano Ramalho; Pires, Carlos Eduardo; Cavalcanti, Elmano Pontes; Pires, Vládia Freire

    2012-01-01

    Text mining has been used for various purposes, such as document classification and extraction of domain-specific information from text. In this paper we present a study in which text mining methodology and algorithms were properly employed for academic dishonesty (cheating) detection and evaluation on open-ended college exams, based on document…

  4. Content-Based Indexing and Teaching Focus Mining for Lecture Videos

    ERIC Educational Resources Information Center

    Lin, Yu-Tzu; Yen, Bai-Jang; Chang, Chia-Hu; Lee, Greg C.; Lin, Yu-Chih

    2010-01-01

    Purpose: The purpose of this paper is to propose an indexing and teaching focus mining system for lecture videos recorded in an unconstrained environment. Design/methodology/approach: By applying the proposed algorithms in this paper, the slide structure can be reconstructed by extracting slide images from the video. Instead of applying…

  5. What Satisfies Students?: Mining Student-Opinion Data with Regression and Decision Tree Analysis

    ERIC Educational Resources Information Center

    Thomas, Emily H.; Galambos, Nora

    2004-01-01

    To investigate how students' characteristics and experiences affect satisfaction, this study uses regression and decision tree analysis with the CHAID algorithm to analyze student-opinion data. A data mining approach identifies the specific aspects of students' university experience that most influence three measures of general satisfaction. The…

  6. Using optical flow for the detection of floating mines in IR image sequences

    NASA Astrophysics Data System (ADS)

    Borghgraef, Alexander; Acheroy, Marc

    2006-09-01

    In the first Gulf War, unmoored floating mines proved to be a real hazard for shipping traffic. An automated system capable of detecting these and other free-floating small objects, using readily available sensors such as infra-red cameras, would prove to be a valuable mine-warfare asset, and could double as a collision avoidance mechanism, and a search-and-rescue aid. The noisy background provided by the sea surface, and occlusion by waves make it difficult to detect small floating objects using only algorithms based upon the intensity, size or shape of the target. This leads us to look at the sequence of images for temporal detection characteristics. The target's apparent motion is such a determinant, given the contrast between the bobbing motion of the floating object and the strong horizontal component present in the propagation of the wavefronts. We have applied the Proesmans optical flow algorithm to IR video footage of practice mines, in order to extract the motion characteristic and a threshold on the vertical motion characteristic is then imposed to detect the floating targets.

  7. Mining Hesitation Information by Vague Association Rules

    NASA Astrophysics Data System (ADS)

    Lu, An; Ng, Wilfred

    In many online shopping applications, such as Amazon and eBay, traditional Association Rule (AR) mining has limitations as it only deals with the items that are sold but ignores the items that are almost sold (for example, those items that are put into the basket but not checked out). We say that those almost sold items carry hesitation information, since customers are hesitating to buy them. The hesitation information of items is valuable knowledge for the design of good selling strategies. However, there is no conceptual model that is able to capture different statuses of hesitation information. Herein, we apply and extend vague set theory in the context of AR mining. We define the concepts of attractiveness and hesitation of an item, which represent the overall information of a customer's intent on an item. Based on the two concepts, we propose the notion of Vague Association Rules (VARs). We devise an efficient algorithm to mine the VARs. Our experiments show that our algorithm is efficient and the VARs capture more specific and richer information than do the traditional ARs.

  8. Service-based analysis of biological pathways

    PubMed Central

    Zheng, George; Bouguettaya, Athman

    2009-01-01

    Background Computer-based pathway discovery is concerned with two important objectives: pathway identification and analysis. Conventional mining and modeling approaches aimed at pathway discovery are often effective at achieving either objective, but not both. Such limitations can be effectively tackled leveraging a Web service-based modeling and mining approach. Results Inspired by molecular recognitions and drug discovery processes, we developed a Web service mining tool, named PathExplorer, to discover potentially interesting biological pathways linking service models of biological processes. The tool uses an innovative approach to identify useful pathways based on graph-based hints and service-based simulation verifying user's hypotheses. Conclusion Web service modeling of biological processes allows the easy access and invocation of these processes on the Web. Web service mining techniques described in this paper enable the discovery of biological pathways linking these process service models. Algorithms presented in this paper for automatically highlighting interesting subgraph within an identified pathway network enable the user to formulate hypothesis, which can be tested out using our simulation algorithm that are also described in this paper. PMID:19796403

  9. Real-time implementation of a multispectral mine target detection algorithm

    NASA Astrophysics Data System (ADS)

    Samson, Joseph W.; Witter, Lester J.; Kenton, Arthur C.; Holloway, John H., Jr.

    2003-09-01

    Spatial-spectral anomaly detection (the "RX Algorithm") has been exploited on the USMC's Coastal Battlefield Reconnaissance and Analysis (COBRA) Advanced Technology Demonstration (ATD) and several associated technology base studies, and has been found to be a useful method for the automated detection of surface-emplaced antitank land mines in airborne multispectral imagery. RX is a complex image processing algorithm that involves the direct spatial convolution of a target/background mask template over each multispectral image, coupled with a spatially variant background spectral covariance matrix estimation and inversion. The RX throughput on the ATD was about 38X real time using a single Sun UltraSparc system. A goal to demonstrate RX in real-time was begun in FY01. We now report the development and demonstration of a Field Programmable Gate Array (FPGA) solution that achieves a real-time implementation of the RX algorithm at video rates using COBRA ATD data. The approach uses an Annapolis Microsystems Firebird PMC card containing a Xilinx XCV2000E FPGA with over 2,500,000 logic gates and 18MBytes of memory. A prototype system was configured using a Tek Microsystems VME board with dual-PowerPC G4 processors and two PMC slots. The RX algorithm was translated from its C programming implementation into the VHDL language and synthesized into gates that were loaded into the FPGA. The VHDL/synthesizer approach allows key RX parameters to be quickly changed and a new implementation automatically generated. Reprogramming the FPGA is done rapidly and in-circuit. Implementation of the RX algorithm in a single FPGA is a major first step toward achieving real-time land mine detection.

  10. Biclustering Protein Complex Interactions with a Biclique FindingAlgorithm

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ding, Chris; Zhang, Anne Ya; Holbrook, Stephen

    2006-12-01

    Biclustering has many applications in text mining, web clickstream mining, and bioinformatics. When data entries are binary, the tightest biclusters become bicliques. We propose a flexible and highly efficient algorithm to compute bicliques. We first generalize the Motzkin-Straus formalism for computing the maximal clique from L{sub 1} constraint to L{sub p} constraint, which enables us to provide a generalized Motzkin-Straus formalism for computing maximal-edge bicliques. By adjusting parameters, the algorithm can favor biclusters with more rows less columns, or vice verse, thus increasing the flexibility of the targeted biclusters. We then propose an algorithm to solve the generalized Motzkin-Straus optimizationmore » problem. The algorithm is provably convergent and has a computational complexity of O(|E|) where |E| is the number of edges. It relies on a matrix vector multiplication and runs efficiently on most current computer architectures. Using this algorithm, we bicluster the yeast protein complex interaction network. We find that biclustering protein complexes at the protein level does not clearly reflect the functional linkage among protein complexes in many cases, while biclustering at protein domain level can reveal many underlying linkages. We show several new biologically significant results.« less

  11. PRESEE: An MDL/MML Algorithm to Time-Series Stream Segmenting

    PubMed Central

    Jiang, Yexi; Tang, Mingjie; Yuan, Changan; Tang, Changjie

    2013-01-01

    Time-series stream is one of the most common data types in data mining field. It is prevalent in fields such as stock market, ecology, and medical care. Segmentation is a key step to accelerate the processing speed of time-series stream mining. Previous algorithms for segmenting mainly focused on the issue of ameliorating precision instead of paying much attention to the efficiency. Moreover, the performance of these algorithms depends heavily on parameters, which are hard for the users to set. In this paper, we propose PRESEE (parameter-free, real-time, and scalable time-series stream segmenting algorithm), which greatly improves the efficiency of time-series stream segmenting. PRESEE is based on both MDL (minimum description length) and MML (minimum message length) methods, which could segment the data automatically. To evaluate the performance of PRESEE, we conduct several experiments on time-series streams of different types and compare it with the state-of-art algorithm. The empirical results show that PRESEE is very efficient for real-time stream datasets by improving segmenting speed nearly ten times. The novelty of this algorithm is further demonstrated by the application of PRESEE in segmenting real-time stream datasets from ChinaFLUX sensor networks data stream. PMID:23956693

  12. PRESEE: an MDL/MML algorithm to time-series stream segmenting.

    PubMed

    Xu, Kaikuo; Jiang, Yexi; Tang, Mingjie; Yuan, Changan; Tang, Changjie

    2013-01-01

    Time-series stream is one of the most common data types in data mining field. It is prevalent in fields such as stock market, ecology, and medical care. Segmentation is a key step to accelerate the processing speed of time-series stream mining. Previous algorithms for segmenting mainly focused on the issue of ameliorating precision instead of paying much attention to the efficiency. Moreover, the performance of these algorithms depends heavily on parameters, which are hard for the users to set. In this paper, we propose PRESEE (parameter-free, real-time, and scalable time-series stream segmenting algorithm), which greatly improves the efficiency of time-series stream segmenting. PRESEE is based on both MDL (minimum description length) and MML (minimum message length) methods, which could segment the data automatically. To evaluate the performance of PRESEE, we conduct several experiments on time-series streams of different types and compare it with the state-of-art algorithm. The empirical results show that PRESEE is very efficient for real-time stream datasets by improving segmenting speed nearly ten times. The novelty of this algorithm is further demonstrated by the application of PRESEE in segmenting real-time stream datasets from ChinaFLUX sensor networks data stream.

  13. An algorithm of opinion leaders mining based on signed network

    NASA Astrophysics Data System (ADS)

    Cao, Linlin; Zheng, Mingchun; Zhang, Yuanyuan; Zhang, Fuming

    2018-04-01

    With the rapid development of mobile Internet, user gradually become the leader of social media, the abruptly rise of new media has changed the traditional information's dissemination pattern and regularity. There is new era significance of opinion leaders, gatekeepers in the classical theory of mass communication, and it has further expansion and extension to a certain extent. In the existing mining of opinion leaders, it is mainly from the research of network structure and user behavior without considering an important attribute: whether the user has a real impact. In this paper, we take the symbolic network as the research tool, by giving symbol which correspondingly represents support or oppose to the link about point of view relationship between users and combining traditional algorithms of mining with symbolism which can describe the change of view between users, we will get the opinion leader who has real impact on users, then the result is more accurate and effective.

  14. The association rules search of Indonesian university graduate’s data using FP-growth algorithm

    NASA Astrophysics Data System (ADS)

    Faza, S.; Rahmat, R. F.; Nababan, E. B.; Arisandi, D.; Effendi, S.

    2018-02-01

    The attribute varieties in university graduates data have caused frustrations to the institution in finding the combinations of attributes that often emerge and have high integration between attributes. Association rules mining is a data mining technique to determine the integration of the data or the way of a data set affects another set of data. By way of explanation, there are possibilities in finding the integration of data on a large scale. Frequent Pattern-Growth (FP-Growth) algorithm is one of the association rules mining technique to determine a frequent itemset in an FP-Tree data set. From the research on the search of university graduate’s association rules, it can be concluded that the most common attributes that have high integration between them are in the combination of State-owned High School outside Medan, regular university entrance exam, GPA of 3.00 to 3.49 and over 4-year-long study duration.

  15. An Adaptive Sensor Mining Framework for Pervasive Computing Applications

    NASA Astrophysics Data System (ADS)

    Rashidi, Parisa; Cook, Diane J.

    Analyzing sensor data in pervasive computing applications brings unique challenges to the KDD community. The challenge is heightened when the underlying data source is dynamic and the patterns change. We introduce a new adaptive mining framework that detects patterns in sensor data, and more importantly, adapts to the changes in the underlying model. In our framework, the frequent and periodic patterns of data are first discovered by the Frequent and Periodic Pattern Miner (FPPM) algorithm; and then any changes in the discovered patterns over the lifetime of the system are discovered by the Pattern Adaptation Miner (PAM) algorithm, in order to adapt to the changing environment. This framework also captures vital context information present in pervasive computing applications, such as the startup triggers and temporal information. In this paper, we present a description of our mining framework and validate the approach using data collected in the CASAS smart home testbed.

  16. Combined data mining/NIR spectroscopy for purity assessment of lime juice

    NASA Astrophysics Data System (ADS)

    Shafiee, Sahameh; Minaei, Saeid

    2018-06-01

    This paper reports the data mining study on the NIR spectrum of lime juice samples to determine their purity (natural or synthetic). NIR spectra for 72 pure and synthetic lime juice samples were recorded in reflectance mode. Sample outliers were removed using PCA analysis. Different data mining techniques for feature selection (Genetic Algorithm (GA)) and classification (including the radial basis function (RBF) network, Support Vector Machine (SVM), and Random Forest (RF) tree) were employed. Based on the results, SVM proved to be the most accurate classifier as it achieved the highest accuracy (97%) using the raw spectrum information. The classifier accuracy dropped to 93% when selected feature vector by GA search method was applied as classifier input. It can be concluded that some relevant features which produce good performance with the SVM classifier are removed by feature selection. Also, reduced spectra using PCA do not show acceptable performance (total accuracy of 66% by RBFNN), which indicates that dimensional reduction methods such as PCA do not always lead to more accurate results. These findings demonstrate the potential of data mining combination with near-infrared spectroscopy for monitoring lime juice quality in terms of natural or synthetic nature.

  17. Collaborative mining of graph patterns from multiple sources

    NASA Astrophysics Data System (ADS)

    Levchuk, Georgiy; Colonna-Romanoa, John

    2016-05-01

    Intelligence analysts require automated tools to mine multi-source data, including answering queries, learning patterns of life, and discovering malicious or anomalous activities. Graph mining algorithms have recently attracted significant attention in intelligence community, because the text-derived knowledge can be efficiently represented as graphs of entities and relationships. However, graph mining models are limited to use-cases involving collocated data, and often make restrictive assumptions about the types of patterns that need to be discovered, the relationships between individual sources, and availability of accurate data segmentation. In this paper we present a model to learn the graph patterns from multiple relational data sources, when each source might have only a fragment (or subgraph) of the knowledge that needs to be discovered, and segmentation of data into training or testing instances is not available. Our model is based on distributed collaborative graph learning, and is effective in situations when the data is kept locally and cannot be moved to a centralized location. Our experiments show that proposed collaborative learning achieves learning quality better than aggregated centralized graph learning, and has learning time comparable to traditional distributed learning in which a knowledge of data segmentation is needed.

  18. Rapid overt airborne reconnaissance (ROAR) for mines and obstacles in very shallow water, surf zone, and beach

    NASA Astrophysics Data System (ADS)

    Moran, Steven E.; Austin, William L.; Murray, James T.; Roddier, Nicolas A.; Bridges, Robert; Vercillo, Richard; Stettner, Roger; Phillips, Dave; Bisbee, Al; Witherspoon, Ned H.

    2003-09-01

    Under the Office of Naval Research's Organic Mine Countermeasures Future Naval Capabilities (OMCM FNC) program, Lite Cycles, Inc. is developing an innovative and highly compact airborne active sensor system for mine and obstacle detection in very shallow water (VSW), through the surf-zone (SZ) and onto the beach. The system uses an innovative LCI proprietary integrated scanner, detector, and telescope (ISDT) receiver architecture. The ISD tightly couples all receiver components and LIDAR electronics to achieve the system compaction required for tactical UAVintegration while providing a large aperture. It also includes an advanced compact multifunction laser transmitter; an industry-first high-resolution, compact 3-D camera, a scanning function for wide area search, and temporally displaced multiple looks on the fly over the ocean surface for clutter reduction. Additionally, the laser will provide time-multiplexed multi-color output to perform day/night multispectral imaging for beach surveillance. New processing algorithms for mine detection in the very challenging surf-zone clutter environment are under development, which offer the potential for significant processing gains in comparison to the legacy approaches. This paper reviews the legacy system approaches, describes the mission challenges, and provides an overview of the ROAR system architecture.

  19. Identification of Shearer Cutting Patterns Using Vibration Signals Based on a Least Squares Support Vector Machine with an Improved Fruit Fly Optimization Algorithm

    PubMed Central

    Si, Lei; Wang, Zhongbin; Liu, Xinhua; Tan, Chao; Liu, Ze; Xu, Jing

    2016-01-01

    Shearers play an important role in fully mechanized coal mining face and accurately identifying their cutting pattern is very helpful for improving the automation level of shearers and ensuring the safety of coal mining. The least squares support vector machine (LSSVM) has been proven to offer strong potential in prediction and classification issues, particularly by employing an appropriate meta-heuristic algorithm to determine the values of its two parameters. However, these meta-heuristic algorithms have the drawbacks of being hard to understand and reaching the global optimal solution slowly. In this paper, an improved fly optimization algorithm (IFOA) to optimize the parameters of LSSVM was presented and the LSSVM coupled with IFOA (IFOA-LSSVM) was used to identify the shearer cutting pattern. The vibration acceleration signals of five cutting patterns were collected and the special state features were extracted based on the ensemble empirical mode decomposition (EEMD) and the kernel function. Some examples on the IFOA-LSSVM model were further presented and the results were compared with LSSVM, PSO-LSSVM, GA-LSSVM and FOA-LSSVM models in detail. The comparison results indicate that the proposed approach was feasible, efficient and outperformed the others. Finally, an industrial application example at the coal mining face was demonstrated to specify the effect of the proposed system. PMID:26771615

  20. Applying the Precaution Adoption Process Model to the Acceptance of Mine Safety and Health Technologies

    PubMed Central

    2018-01-01

    Mineworkers are continually introduced to protective technologies on the job. Yet, their perceptions toward the technologies are often not addressed until they are actively trying to use them, which may halt safe technology adoption and associated work practices. This study explored management and worker perspectives toward three technologies to forecast adoption and behavioral intention on the job. Interviews and focus groups were conducted with 21 mineworkers and 19 mine managers to determine the adoption process stage algorithm for workers and managers, including perceived barriers to using new safety and health technologies. Differences between workers and managers were revealed in terms of readiness, perceptions, and initial trust in using technologies. Workers, whether they had or had not used a particular technology, still had negative perceptions toward its use in the initial introduction and integration at their mine site, indicating a lengthy time period needed for full adoption. The key finding from these results is that a carefully considered and extended introduction of technology for workers in Stage 3 (undecided to act) is most important to promote progression to Stage 5 (decided to act) and to avoid Stage 4 (decided not to act). In response, organizational management may need to account for workers’ particular stage algorithm, using the Precaution Adoption Process Model, to understand how to tailor messages about protective technologies, administer skill-based trainings and interventions that raise awareness and knowledge, and ultimately encourage safe adoption of associated work practices. PMID:29862314

  1. Applying the Precaution Adoption Process Model to the Acceptance of Mine Safety and Health Technologies.

    PubMed

    Haas, Emily J

    2018-03-01

    Mineworkers are continually introduced to protective technologies on the job. Yet, their perceptions toward the technologies are often not addressed until they are actively trying to use them, which may halt safe technology adoption and associated work practices. This study explored management and worker perspectives toward three technologies to forecast adoption and behavioral intention on the job. Interviews and focus groups were conducted with 21 mineworkers and 19 mine managers to determine the adoption process stage algorithm for workers and managers, including perceived barriers to using new safety and health technologies. Differences between workers and managers were revealed in terms of readiness, perceptions, and initial trust in using technologies. Workers, whether they had or had not used a particular technology, still had negative perceptions toward its use in the initial introduction and integration at their mine site, indicating a lengthy time period needed for full adoption. The key finding from these results is that a carefully considered and extended introduction of technology for workers in Stage 3 (undecided to act) is most important to promote progression to Stage 5 (decided to act) and to avoid Stage 4 (decided not to act). In response, organizational management may need to account for workers' particular stage algorithm, using the Precaution Adoption Process Model, to understand how to tailor messages about protective technologies, administer skill-based trainings and interventions that raise awareness and knowledge, and ultimately encourage safe adoption of associated work practices.

  2. Comparing digital data processing techniques for surface mine and reclamation monitoring

    NASA Technical Reports Server (NTRS)

    Witt, R. G.; Bly, B. G.; Campbell, W. J.; Bloemer, H. H. L.; Brumfield, J. O.

    1982-01-01

    The results of three techniques used for processing Landsat digital data are compared for their utility in delineating areas of surface mining and subsequent reclamation. An unsupervised clustering algorithm (ISOCLS), a maximum-likelihood classifier (CLASFY), and a hybrid approach utilizing canonical analysis (ISOCLS/KLTRANS/ISOCLS) were compared by means of a detailed accuracy assessment with aerial photography at NASA's Goddard Space Flight Center. Results show that the hybrid approach was superior to the traditional techniques in distinguishing strip mined and reclaimed areas.

  3. Post-launch validation of Multispectral Thermal Imager (MTI) data and algorithms

    NASA Astrophysics Data System (ADS)

    Garrett, Alfred J.; Kurzeja, Robert J.; O'Steen, B. L.; Parker, Matthew J.; Pendergast, Malcolm M.; Villa-Aleman, Eliel

    1999-10-01

    Sandia National Laboratories (SNL), Los Alamos National Laboratory (LANL) and the Savannah River Technology Center (SRTC) have developed a diverse group of algorithms for processing and analyzing the data that will be collected by the Multispectral Thermal Imager (MTI) after launch late in 1999. Each of these algorithms must be verified by comparison to independent surface and atmospheric measurements. SRTC has selected 13 sites in the continental U.S. for ground truth data collections. These sites include a high altitude cold water target (Crater Lake), cooling lakes and towers in the warm, humid southeastern U.S., Department of Energy (DOE) climate research sites, the NASA Stennis satellite Validation and Verification (V&V) target array, waste sites at the Savannah River Site, mining sites in the Four Corners area and dry lake beds in Nevada. SRTC has established mutually beneficial relationships with the organizations that manage these sites to make use of their operating and research data and to install additional instrumentation needed for MTI algorithm V&V.

  4. Optoelectronic instrumentation enhancement using data mining feedback for a 3D measurement system

    NASA Astrophysics Data System (ADS)

    Flores-Fuentes, Wendy; Sergiyenko, Oleg; Gonzalez-Navarro, Félix F.; Rivas-López, Moisés; Hernandez-Balbuena, Daniel; Rodríguez-Quiñonez, Julio C.; Tyrsa, Vera; Lindner, Lars

    2016-12-01

    3D measurement by a cyber-physical system based on optoelectronic scanning instrumentation has been enhanced by outliers and regression data mining feedback. The prototype has applications in (1) industrial manufacturing systems that include: robotic machinery, embedded vision, and motion control, (2) health care systems for measurement scanning, and (3) infrastructure by providing structural health monitoring. This paper presents new research performed in data processing of a 3D measurement vision sensing database. Outliers from multivariate data have been detected and removal to improve artificial intelligence regression algorithm results. Physical measurement error regression data has been used for 3D measurements error correction. Concluding, that the joint of physical phenomena, measurement and computation is an effectiveness action for feedback loops in the control of industrial, medical and civil tasks.

  5. Exploring Characterizations of Learning Object Repositories Using Data Mining Techniques

    NASA Astrophysics Data System (ADS)

    Segura, Alejandra; Vidal, Christian; Menendez, Victor; Zapata, Alfredo; Prieto, Manuel

    Learning object repositories provide a platform for the sharing of Web-based educational resources. As these repositories evolve independently, it is difficult for users to have a clear picture of the kind of contents they give access to. Metadata can be used to automatically extract a characterization of these resources by using machine learning techniques. This paper presents an exploratory study carried out in the contents of four public repositories that uses clustering and association rule mining algorithms to extract characterizations of repository contents. The results of the analysis include potential relationships between different attributes of learning objects that may be useful to gain an understanding of the kind of resources available and eventually develop search mechanisms that consider repository descriptions as a criteria in federated search.

  6. Mining Research on Vibration Signal Association Rules of Quayside Container Crane Hoisting Motor Based on Apriori Algorithm

    NASA Astrophysics Data System (ADS)

    Yang, Chencheng; Tang, Gang; Hu, Xiong

    2017-07-01

    Shore-hoisting motor in the daily work will produce a large number of vibration signal data,in order to analyze the correlation among the data and discover the fault and potential safety hazard of the motor, the data are discretized first, and then Apriori algorithm are used to mine the strong association rules among the data. The results show that the relationship between day 1 and day 16 is the most closely related, which can guide the staff to analyze the work of these two days of motor to find and solve the problem of fault and safety.

  7. Data mining in bioinformatics using Weka.

    PubMed

    Frank, Eibe; Hall, Mark; Trigg, Len; Holmes, Geoffrey; Witten, Ian H

    2004-10-12

    The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection-common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. http://www.cs.waikato.ac.nz/ml/weka.

  8. Intelligent On-Board Processing in the Sensor Web

    NASA Astrophysics Data System (ADS)

    Tanner, S.

    2005-12-01

    Most existing sensing systems are designed as passive, independent observers. They are rarely aware of the phenomena they observe, and are even less likely to be aware of what other sensors are observing within the same environment. Increasingly, intelligent processing of sensor data is taking place in real-time, using computing resources on-board the sensor or the platform itself. One can imagine a sensor network consisting of intelligent and autonomous space-borne, airborne, and ground-based sensors. These sensors will act independently of one another, yet each will be capable of both publishing and receiving sensor information, observations, and alerts among other sensors in the network. Furthermore, these sensors will be capable of acting upon this information, perhaps altering acquisition properties of their instruments, changing the location of their platform, or updating processing strategies for their own observations to provide responsive information or additional alerts. Such autonomous and intelligent sensor networking capabilities provide significant benefits for collections of heterogeneous sensors within any environment. They are crucial for multi-sensor observations and surveillance, where real-time communication with external components and users may be inhibited, and the environment may be hostile. In all environments, mission automation and communication capabilities among disparate sensors will enable quicker response to interesting, rare, or unexpected events. Additionally, an intelligent network of heterogeneous sensors provides the advantage that all of the sensors can benefit from the unique capabilities of each sensor in the network. The University of Alabama in Huntsville (UAH) is developing a unique approach to data processing, integration and mining through the use of the Adaptive On-Board Data Processing (AODP) framework. AODP is a key foundation technology for autonomous internetworking capabilities to support situational awareness by sensors and their on-board processes. The two primary research areas for this project are (1) the on-board processing and communications framework itself, and (2) data mining algorithms targeted to the needs and constraints of the on-board environment. The team is leveraging its experience in on-board processing, data mining, custom data processing, and sensor network design. Several unique UAH-developed technologies are employed in the AODP project, including EVE, an EnVironmEnt for on-board processing, and the data mining tools included in the Algorithm Development and Mining (ADaM) toolkit.

  9. Statistical algorithms improve accuracy of gene fusion detection

    PubMed Central

    Hsieh, Gillian; Bierman, Rob; Szabo, Linda; Lee, Alex Gia; Freeman, Donald E.; Watson, Nathaniel; Sweet-Cordero, E. Alejandro

    2017-01-01

    Abstract Gene fusions are known to play critical roles in tumor pathogenesis. Yet, sensitive and specific algorithms to detect gene fusions in cancer do not currently exist. In this paper, we present a new statistical algorithm, MACHETE (Mismatched Alignment CHimEra Tracking Engine), which achieves highly sensitive and specific detection of gene fusions from RNA-Seq data, including the highest Positive Predictive Value (PPV) compared to the current state-of-the-art, as assessed in simulated data. We show that the best performing published algorithms either find large numbers of fusions in negative control data or suffer from low sensitivity detecting known driving fusions in gold standard settings, such as EWSR1-FLI1. As proof of principle that MACHETE discovers novel gene fusions with high accuracy in vivo, we mined public data to discover and subsequently PCR validate novel gene fusions missed by other algorithms in the ovarian cancer cell line OVCAR3. These results highlight the gains in accuracy achieved by introducing statistical models into fusion detection, and pave the way for unbiased discovery of potentially driving and druggable gene fusions in primary tumors. PMID:28541529

  10. Systematic Review of Data Mining Applications in Patient-Centered Mobile-Based Information Systems.

    PubMed

    Fallah, Mina; Niakan Kalhori, Sharareh R

    2017-10-01

    Smartphones represent a promising technology for patient-centered healthcare. It is claimed that data mining techniques have improved mobile apps to address patients' needs at subgroup and individual levels. This study reviewed the current literature regarding data mining applications in patient-centered mobile-based information systems. We systematically searched PubMed, Scopus, and Web of Science for original studies reported from 2014 to 2016. After screening 226 records at the title/abstract level, the full texts of 92 relevant papers were retrieved and checked against inclusion criteria. Finally, 30 papers were included in this study and reviewed. Data mining techniques have been reported in development of mobile health apps for three main purposes: data analysis for follow-up and monitoring, early diagnosis and detection for screening purpose, classification/prediction of outcomes, and risk calculation (n = 27); data collection (n = 3); and provision of recommendations (n = 2). The most accurate and frequently applied data mining method was support vector machine; however, decision tree has shown superior performance to enhance mobile apps applied for patients' self-management. Embedded data-mining-based feature in mobile apps, such as case detection, prediction/classification, risk estimation, or collection of patient data, particularly during self-management, would save, apply, and analyze patient data during and after care. More intelligent methods, such as artificial neural networks, fuzzy logic, and genetic algorithms, and even the hybrid methods may result in more patients-centered recommendations, providing education, guidance, alerts, and awareness of personalized output.

  11. Image/Time Series Mining Algorithms: Applications to Developmental Biology, Document Processing and Data Streams

    ERIC Educational Resources Information Center

    Tataw, Oben Moses

    2013-01-01

    Interdisciplinary research in computer science requires the development of computational techniques for practical application in different domains. This usually requires careful integration of different areas of technical expertise. This dissertation presents image and time series analysis algorithms, with practical interdisciplinary applications…

  12. Application of decision tree model for the ground subsidence hazard mapping near abandoned underground coal mines.

    PubMed

    Lee, Saro; Park, Inhye

    2013-09-30

    Subsidence of ground caused by underground mines poses hazards to human life and property. This study analyzed the hazard to ground subsidence using factors that can affect ground subsidence and a decision tree approach in a geographic information system (GIS). The study area was Taebaek, Gangwon-do, Korea, where many abandoned underground coal mines exist. Spatial data, topography, geology, and various ground-engineering data for the subsidence area were collected and compiled in a database for mapping ground-subsidence hazard (GSH). The subsidence area was randomly split 50/50 for training and validation of the models. A data-mining classification technique was applied to the GSH mapping, and decision trees were constructed using the chi-squared automatic interaction detector (CHAID) and the quick, unbiased, and efficient statistical tree (QUEST) algorithms. The frequency ratio model was also applied to the GSH mapping for comparing with probabilistic model. The resulting GSH maps were validated using area-under-the-curve (AUC) analysis with the subsidence area data that had not been used for training the model. The highest accuracy was achieved by the decision tree model using CHAID algorithm (94.01%) comparing with QUEST algorithms (90.37%) and frequency ratio model (86.70%). These accuracies are higher than previously reported results for decision tree. Decision tree methods can therefore be used efficiently for GSH analysis and might be widely used for prediction of various spatial events. Copyright © 2013. Published by Elsevier Ltd.

  13. Data Mining Tools Make Flights Safer, More Efficient

    NASA Technical Reports Server (NTRS)

    2014-01-01

    A small data mining team at Ames Research Center developed a set of algorithms ideal for combing through flight data to find anomalies. Dallas-based Southwest Airlines Co. signed a Space Act Agreement with Ames in 2011 to access the tools, helping the company refine its safety practices, improve its safety reviews, and increase flight efficiencies.

  14. What Satisfies Students? Mining Student-Opinion Data with Regression and Decision-Tree Analysis. AIR 2002 Forum Paper.

    ERIC Educational Resources Information Center

    Thomas, Emily H.; Galambos, Nora

    To investigate how students' characteristics and experiences affect satisfaction, this study used regression and decision-tree analysis with the CHAID algorithm to analyze student opinion data from a sample of 1,783 college students. A data-mining approach identifies the specific aspects of students' university experience that most influence three…

  15. Effect of Temporal Relationships in Associative Rule Mining for Web Log Data

    PubMed Central

    Mohd Khairudin, Nazli; Mustapha, Aida

    2014-01-01

    The advent of web-based applications and services has created such diverse and voluminous web log data stored in web servers, proxy servers, client machines, or organizational databases. This paper attempts to investigate the effect of temporal attribute in relational rule mining for web log data. We incorporated the characteristics of time in the rule mining process and analysed the effect of various temporal parameters. The rules generated from temporal relational rule mining are then compared against the rules generated from the classical rule mining approach such as the Apriori and FP-Growth algorithms. The results showed that by incorporating the temporal attribute via time, the number of rules generated is subsequently smaller but is comparable in terms of quality. PMID:24587757

  16. Electromagnetic Induction Spectroscopy for the Detection of Subsurface Targets

    DTIC Science & Technology

    2012-12-01

    curves of the proposed method and that of Fails et al.. For the kNN ROC curve, k = 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81...et al. [6] and Ramachandran et al. [7] both demonstrated success in detecting mines using the k-nearest-neighbor ( kNN ) algorithm based on the EMI...error is also included in the feature vector. The kNN labels an unknown target based on the closest targets in a training set. Collins et al. [2] and

  17. Automated detection of hospital outbreaks: A systematic review of methods.

    PubMed

    Leclère, Brice; Buckeridge, David L; Boëlle, Pierre-Yves; Astagneau, Pascal; Lepelletier, Didier

    2017-01-01

    Several automated algorithms for epidemiological surveillance in hospitals have been proposed. However, the usefulness of these methods to detect nosocomial outbreaks remains unclear. The goal of this review was to describe outbreak detection algorithms that have been tested within hospitals, consider how they were evaluated, and synthesize their results. We developed a search query using keywords associated with hospital outbreak detection and searched the MEDLINE database. To ensure the highest sensitivity, no limitations were initially imposed on publication languages and dates, although we subsequently excluded studies published before 2000. Every study that described a method to detect outbreaks within hospitals was included, without any exclusion based on study design. Additional studies were identified through citations in retrieved studies. Twenty-nine studies were included. The detection algorithms were grouped into 5 categories: simple thresholds (n = 6), statistical process control (n = 12), scan statistics (n = 6), traditional statistical models (n = 6), and data mining methods (n = 4). The evaluation of the algorithms was often solely descriptive (n = 15), but more complex epidemiological criteria were also investigated (n = 10). The performance measures varied widely between studies: e.g., the sensitivity of an algorithm in a real world setting could vary between 17 and 100%. Even if outbreak detection algorithms are useful complementary tools for traditional surveillance, the heterogeneity in results among published studies does not support quantitative synthesis of their performance. A standardized framework should be followed when evaluating outbreak detection methods to allow comparison of algorithms across studies and synthesis of results.

  18. Computational Search for Strong Topological Insulators: An Exercise in Data Mining and Electronic Structure

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Klintenberg, M.; Haraldsen, Jason T.; Balatsky, Alexander V.

    In this paper, we report a data-mining investigation for the search of topological insulators by examining individual electronic structures for over 60,000 materials. Using a data-mining algorithm, we survey changes in band inversion with and without spin-orbit coupling by screening the calculated electronic band structure for a small gap and a change concavity at high-symmetry points. Overall, we were able to identify a number of topological candidates with varying structures and composition. Lastly, our overall goal is expand the realm of predictive theory into the determination of new and exotic complex materials through the data mining of electronic structure.

  19. Computational Search for Strong Topological Insulators: An Exercise in Data Mining and Electronic Structure

    DOE PAGES

    Klintenberg, M.; Haraldsen, Jason T.; Balatsky, Alexander V.

    2014-06-19

    In this paper, we report a data-mining investigation for the search of topological insulators by examining individual electronic structures for over 60,000 materials. Using a data-mining algorithm, we survey changes in band inversion with and without spin-orbit coupling by screening the calculated electronic band structure for a small gap and a change concavity at high-symmetry points. Overall, we were able to identify a number of topological candidates with varying structures and composition. Lastly, our overall goal is expand the realm of predictive theory into the determination of new and exotic complex materials through the data mining of electronic structure.

  20. An Energy-Efficient Underground Localization System Based on Heterogeneous Wireless Networks

    PubMed Central

    Yuan, Yazhou; Chen, Cailian; Guan, Xinping; Yang, Qiuling

    2015-01-01

    A precision positioning system with energy efficiency is of great necessity for guaranteeing personnel safety in underground mines. The location information of the miners' should be transmitted to the control center timely and reliably; therefore, a heterogeneous network with the backbone based on high speed Industrial Ethernet is deployed. Since the mobile wireless nodes are working in an irregular tunnel, a specific wireless propagation model cannot fit all situations. In this paper, an underground localization system is designed to enable the adaptation to kinds of harsh tunnel environments, but also to reduce the energy consumption and thus prolong the lifetime of the network. Three key techniques are developed and implemented to improve the system performance, including a step counting algorithm with accelerometers, a power control algorithm and an adaptive packets scheduling scheme. The simulation study and experimental results show the effectiveness of the proposed algorithms and the implementation. PMID:26016918

  1. On-line estimation of nonlinear physical systems

    USGS Publications Warehouse

    Christakos, G.

    1988-01-01

    Recursive algorithms for estimating states of nonlinear physical systems are presented. Orthogonality properties are rediscovered and the associated polynomials are used to linearize state and observation models of the underlying random processes. This requires some key hypotheses regarding the structure of these processes, which may then take account of a wide range of applications. The latter include streamflow forecasting, flood estimation, environmental protection, earthquake engineering, and mine planning. The proposed estimation algorithm may be compared favorably to Taylor series-type filters, nonlinear filters which approximate the probability density by Edgeworth or Gram-Charlier series, as well as to conventional statistical linearization-type estimators. Moreover, the method has several advantages over nonrecursive estimators like disjunctive kriging. To link theory with practice, some numerical results for a simulated system are presented, in which responses from the proposed and extended Kalman algorithms are compared. ?? 1988 International Association for Mathematical Geology.

  2. Decision tree methods: applications for classification and prediction.

    PubMed

    Song, Yan-Yan; Lu, Ying

    2015-04-25

    Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that construct an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently deal with large, complicated datasets without imposing a complicated parametric structure. When the sample size is large enough, study data can be divided into training and validation datasets. Using the training dataset to build a decision tree model and a validation dataset to decide on the appropriate tree size needed to achieve the optimal final model. This paper introduces frequently used algorithms used to develop decision trees (including CART, C4.5, CHAID, and QUEST) and describes the SPSS and SAS programs that can be used to visualize tree structure.

  3. Intelligent Scheduling for Underground Mobile Mining Equipment

    PubMed Central

    Song, Zhen; Schunnesson, Håkan; Rinne, Mikael; Sturgul, John

    2015-01-01

    Many studies have been carried out and many commercial software applications have been developed to improve the performances of surface mining operations, especially for the loader-trucks cycle of surface mining. However, there have been quite few studies aiming to improve the mining process of underground mines. In underground mines, mobile mining equipment is mostly scheduled instinctively, without theoretical support for these decisions. Furthermore, in case of unexpected events, it is hard for miners to rapidly find solutions to reschedule and to adapt the changes. This investigation first introduces the motivation, the technical background, and then the objective of the study. A decision support instrument (i.e. schedule optimizer for mobile mining equipment) is proposed and described to address this issue. The method and related algorithms which are used in this instrument are presented and discussed. The proposed method was tested by using a real case of Kittilä mine located in Finland. The result suggests that the proposed method can considerably improve the working efficiency and reduce the working time of the underground mine. PMID:26098934

  4. A data mining methodology for predicting early stage Parkinson’s disease using non-invasive, high-dimensional gait sensor data

    PubMed Central

    Tucker, Conrad; Han, Yixiang; Nembhard, Harriet Black; Lewis, Mechelle; Lee, Wang-Chien; Sterling, Nicholas W; Huang, Xuemei

    2017-01-01

    Parkinson’s disease (PD) is the second most common neurological disorder after Alzheimer’s disease. Key clinical features of PD are motor-related and are typically assessed by healthcare providers based on qualitative visual inspection of a patient’s movement/gait/posture. More advanced diagnostic techniques such as computed tomography scans that measure brain function, can be cost prohibitive and may expose patients to radiation and other harmful effects. To mitigate these challenges, and open a pathway to remote patient-physician assessment, the authors of this work propose a data mining driven methodology that uses low cost, non-invasive sensors to model and predict the presence (or lack therefore) of PD movement abnormalities and model clinical subtypes. The study presented here evaluates the discriminative ability of non-invasive hardware and data mining algorithms to classify PD cases and controls. A 10-fold cross validation approach is used to compare several data mining algorithms in order to determine that which provides the most consistent results when varying the subject gait data. Next, the predictive accuracy of the data mining model is quantified by testing it against unseen data captured from a test pool of subjects. The proposed methodology demonstrates the feasibility of using non-invasive, low cost, hardware and data mining models to monitor the progression of gait features outside of the traditional healthcare facility, which may ultimately lead to earlier diagnosis of emerging neurological diseases. PMID:29541376

  5. A hybrid monkey search algorithm for clustering analysis.

    PubMed

    Chen, Xin; Zhou, Yongquan; Luo, Qifang

    2014-01-01

    Clustering is a popular data analysis and data mining technique. The k-means clustering algorithm is one of the most commonly used methods. However, it highly depends on the initial solution and is easy to fall into local optimum solution. In view of the disadvantages of the k-means method, this paper proposed a hybrid monkey algorithm based on search operator of artificial bee colony algorithm for clustering analysis and experiment on synthetic and real life datasets to show that the algorithm has a good performance than that of the basic monkey algorithm for clustering analysis.

  6. An novel frequent probability pattern mining algorithm based on circuit simulation method in uncertain biological networks.

    PubMed

    He, Jieyue; Wang, Chunyan; Qiu, Kunpu; Zhong, Wei

    2014-01-01

    Motif mining has always been a hot research topic in bioinformatics. Most of current research on biological networks focuses on exact motif mining. However, due to the inevitable experimental error and noisy data, biological network data represented as the probability model could better reflect the authenticity and biological significance, therefore, it is more biological meaningful to discover probability motif in uncertain biological networks. One of the key steps in probability motif mining is frequent pattern discovery which is usually based on the possible world model having a relatively high computational complexity. In this paper, we present a novel method for detecting frequent probability patterns based on circuit simulation in the uncertain biological networks. First, the partition based efficient search is applied to the non-tree like subgraph mining where the probability of occurrence in random networks is small. Then, an algorithm of probability isomorphic based on circuit simulation is proposed. The probability isomorphic combines the analysis of circuit topology structure with related physical properties of voltage in order to evaluate the probability isomorphism between probability subgraphs. The circuit simulation based probability isomorphic can avoid using traditional possible world model. Finally, based on the algorithm of probability subgraph isomorphism, two-step hierarchical clustering method is used to cluster subgraphs, and discover frequent probability patterns from the clusters. The experiment results on data sets of the Protein-Protein Interaction (PPI) networks and the transcriptional regulatory networks of E. coli and S. cerevisiae show that the proposed method can efficiently discover the frequent probability subgraphs. The discovered subgraphs in our study contain all probability motifs reported in the experiments published in other related papers. The algorithm of probability graph isomorphism evaluation based on circuit simulation method excludes most of subgraphs which are not probability isomorphism and reduces the search space of the probability isomorphism subgraphs using the mismatch values in the node voltage set. It is an innovative way to find the frequent probability patterns, which can be efficiently applied to probability motif discovery problems in the further studies.

  7. An novel frequent probability pattern mining algorithm based on circuit simulation method in uncertain biological networks

    PubMed Central

    2014-01-01

    Background Motif mining has always been a hot research topic in bioinformatics. Most of current research on biological networks focuses on exact motif mining. However, due to the inevitable experimental error and noisy data, biological network data represented as the probability model could better reflect the authenticity and biological significance, therefore, it is more biological meaningful to discover probability motif in uncertain biological networks. One of the key steps in probability motif mining is frequent pattern discovery which is usually based on the possible world model having a relatively high computational complexity. Methods In this paper, we present a novel method for detecting frequent probability patterns based on circuit simulation in the uncertain biological networks. First, the partition based efficient search is applied to the non-tree like subgraph mining where the probability of occurrence in random networks is small. Then, an algorithm of probability isomorphic based on circuit simulation is proposed. The probability isomorphic combines the analysis of circuit topology structure with related physical properties of voltage in order to evaluate the probability isomorphism between probability subgraphs. The circuit simulation based probability isomorphic can avoid using traditional possible world model. Finally, based on the algorithm of probability subgraph isomorphism, two-step hierarchical clustering method is used to cluster subgraphs, and discover frequent probability patterns from the clusters. Results The experiment results on data sets of the Protein-Protein Interaction (PPI) networks and the transcriptional regulatory networks of E. coli and S. cerevisiae show that the proposed method can efficiently discover the frequent probability subgraphs. The discovered subgraphs in our study contain all probability motifs reported in the experiments published in other related papers. Conclusions The algorithm of probability graph isomorphism evaluation based on circuit simulation method excludes most of subgraphs which are not probability isomorphism and reduces the search space of the probability isomorphism subgraphs using the mismatch values in the node voltage set. It is an innovative way to find the frequent probability patterns, which can be efficiently applied to probability motif discovery problems in the further studies. PMID:25350277

  8. A systematic review of data mining and machine learning for air pollution epidemiology.

    PubMed

    Bellinger, Colin; Mohomed Jabbar, Mohomed Shazan; Zaïane, Osmar; Osornio-Vargas, Alvaro

    2017-11-28

    Data measuring airborne pollutants, public health and environmental factors are increasingly being stored and merged. These big datasets offer great potential, but also challenge traditional epidemiological methods. This has motivated the exploration of alternative methods to make predictions, find patterns and extract information. To this end, data mining and machine learning algorithms are increasingly being applied to air pollution epidemiology. We conducted a systematic literature review on the application of data mining and machine learning methods in air pollution epidemiology. We carried out our search process in PubMed, the MEDLINE database and Google Scholar. Research articles applying data mining and machine learning methods to air pollution epidemiology were queried and reviewed. Our search queries resulted in 400 research articles. Our fine-grained analysis employed our inclusion/exclusion criteria to reduce the results to 47 articles, which we separate into three primary areas of interest: 1) source apportionment; 2) forecasting/prediction of air pollution/quality or exposure; and 3) generating hypotheses. Early applications had a preference for artificial neural networks. In more recent work, decision trees, support vector machines, k-means clustering and the APRIORI algorithm have been widely applied. Our survey shows that the majority of the research has been conducted in Europe, China and the USA, and that data mining is becoming an increasingly common tool in environmental health. For potential new directions, we have identified that deep learning and geo-spacial pattern mining are two burgeoning areas of data mining that have good potential for future applications in air pollution epidemiology. We carried out a systematic review identifying the current trends, challenges and new directions to explore in the application of data mining methods to air pollution epidemiology. This work shows that data mining is increasingly being applied in air pollution epidemiology. The potential to support air pollution epidemiology continues to grow with advancements in data mining related to temporal and geo-spacial mining, and deep learning. This is further supported by new sensors and storage mediums that enable larger, better quality data. This suggests that many more fruitful applications can be expected in the future.

  9. Estimating gene function with least squares nonnegative matrix factorization.

    PubMed

    Wang, Guoli; Ochs, Michael F

    2007-01-01

    Nonnegative matrix factorization is a machine learning algorithm that has extracted information from data in a number of fields, including imaging and spectral analysis, text mining, and microarray data analysis. One limitation with the method for linking genes through microarray data in order to estimate gene function is the high variance observed in transcription levels between different genes. Least squares nonnegative matrix factorization uses estimates of the uncertainties on the mRNA levels for each gene in each condition, to guide the algorithm to a local minimum in normalized chi2, rather than a Euclidean distance or divergence between the reconstructed data and the data itself. Herein, application of this method to microarray data is demonstrated in order to predict gene function.

  10. Novel E-Field Sensor for Projectile Detection

    DTIC Science & Technology

    2012-10-22

    aircrafts. They used an array of three plate induction sensors and a simple algorithm to deter mine the direction of the planes [9]. In more recent...publications [10, 11, 12] researchers present increasingly more advanced algorithms and sensors. The techniques developed thus far have not received...the electric field pulse is being detected by a group of sensors in array with known distances between the sensors, so triangulation algorithms could

  11. Solving LP Relaxations of Large-Scale Precedence Constrained Problems

    NASA Astrophysics Data System (ADS)

    Bienstock, Daniel; Zuckerberg, Mark

    We describe new algorithms for solving linear programming relaxations of very large precedence constrained production scheduling problems. We present theory that motivates a new set of algorithmic ideas that can be employed on a wide range of problems; on data sets arising in the mining industry our algorithms prove effective on problems with many millions of variables and constraints, obtaining provably optimal solutions in a few minutes of computation.

  12. Applied Graph-Mining Algorithms to Study Biomolecular Interaction Networks

    PubMed Central

    2014-01-01

    Protein-protein interaction (PPI) networks carry vital information on the organization of molecular interactions in cellular systems. The identification of functionally relevant modules in PPI networks is one of the most important applications of biological network analysis. Computational analysis is becoming an indispensable tool to understand large-scale biomolecular interaction networks. Several types of computational methods have been developed and employed for the analysis of PPI networks. Of these computational methods, graph comparison and module detection are the two most commonly used strategies. This review summarizes current literature on graph kernel and graph alignment methods for graph comparison strategies, as well as module detection approaches including seed-and-extend, hierarchical clustering, optimization-based, probabilistic, and frequent subgraph methods. Herein, we provide a comprehensive review of the major algorithms employed under each theme, including our recently published frequent subgraph method, for detecting functional modules commonly shared across multiple cancer PPI networks. PMID:24800226

  13. Design of Compressed Sensing Algorithm for Coal Mine IoT Moving Measurement Data Based on a Multi-Hop Network and Total Variation.

    PubMed

    Wang, Gang; Zhao, Zhikai; Ning, Yongjie

    2018-05-28

    As the application of a coal mine Internet of Things (IoT), mobile measurement devices, such as intelligent mine lamps, cause moving measurement data to be increased. How to transmit these large amounts of mobile measurement data effectively has become an urgent problem. This paper presents a compressed sensing algorithm for the large amount of coal mine IoT moving measurement data based on a multi-hop network and total variation. By taking gas data in mobile measurement data as an example, two network models for the transmission of gas data flow, namely single-hop and multi-hop transmission modes, are investigated in depth, and a gas data compressed sensing collection model is built based on a multi-hop network. To utilize the sparse characteristics of gas data, the concept of total variation is introduced and a high-efficiency gas data compression and reconstruction method based on Total Variation Sparsity based on Multi-Hop (TVS-MH) is proposed. According to the simulation results, by using the proposed method, the moving measurement data flow from an underground distributed mobile network can be acquired and transmitted efficiently.

  14. Detection of bulk explosives using the GPR only portion of the HSTAMIDS system

    NASA Astrophysics Data System (ADS)

    Tabony, Joshua; Carlson, Douglas O.; Duvoisin, Herbert A., III; Torres-Rosario, Juan

    2010-04-01

    The legacy AN/PSS-14 (Army-Navy Portable Special Search-14) Handheld Mine Detecting Set (also called HSTAMIDS for Handheld Standoff Mine Detection System) has proven itself over the last 7 years as the state-of-the-art in land mine detection, both for the US Army and for Humanitarian Demining groups. Its dual GPR (Ground Penetrating Radar) and MD (Metal Detection) sensor has provided receiver operating characteristic curves (probability of detection or Pd versus false alarm rate or FAR) that routinely set the mark for such devices. Since its inception and type-classification in 2003 as the US (United States) Army standard, the desire for use of the AN/PSS-14 against alternate threats - such as bulk explosives - has recently become paramount. To this end, L-3 CyTerra has developed and tested bulk explosive detection and discrimination algorithms using only the Stepped Frequency Continuous Wave (SFCW) Ground Penetrating Radar (GPR) portion of the system, versus the fused version that is used to optimally detect land mines. Performance of the new bulk explosive algorithm against representative zero-metal bulk explosive target and clutter emplacements is depicted, with the utility to the operator also described.

  15. Mining dynamic noteworthy functions in software execution sequences.

    PubMed

    Zhang, Bing; Huang, Guoyan; Wang, Yuqian; He, Haitao; Ren, Jiadong

    2017-01-01

    As the quality of crucial entities can directly affect that of software, their identification and protection become an important premise for effective software development, management, maintenance and testing, which thus contribute to improving the software quality and its attack-defending ability. Most analysis and evaluation on important entities like codes-based static structure analysis are on the destruction of the actual software running. In this paper, from the perspective of software execution process, we proposed an approach to mine dynamic noteworthy functions (DNFM)in software execution sequences. First, according to software decompiling and tracking stack changes, the execution traces composed of a series of function addresses were acquired. Then these traces were modeled as execution sequences and then simplified so as to get simplified sequences (SFS), followed by the extraction of patterns through pattern extraction (PE) algorithm from SFS. After that, evaluating indicators inner-importance and inter-importance were designed to measure the noteworthiness of functions in DNFM algorithm. Finally, these functions were sorted by their noteworthiness. Comparison and contrast were conducted on the experiment results from two traditional complex network-based node mining methods, namely PageRank and DegreeRank. The results show that the DNFM method can mine noteworthy functions in software effectively and precisely.

  16. Autonomous underwater vehicle adaptive path planning for target classification

    NASA Astrophysics Data System (ADS)

    Edwards, Joseph R.; Schmidt, Henrik

    2002-11-01

    Autonomous underwater vehicles (AUVs) are being rapidly developed to carry sensors into the sea in ways that have previously not been possible. The full use of the vehicles, however, is still not near realization due to lack of the true vehicle autonomy that is promised in the label (AUV). AUVs today primarily attempt to follow as closely as possible a preplanned trajectory. The key to increasing the autonomy of the AUV is to provide the vehicle with a means to make decisions based on its sensor receptions. The current work examines the use of active sonar returns from mine-like objects (MLOs) as a basis for sensor-based adaptive path planning, where the path planning objective is to discriminate between real mines and rocks. Once a target is detected in the mine hunting phase, the mine classification phase is initialized with a derivative cost function to emphasize signal differences and enhance classification capability. The AUV moves adaptively to minimize the cost function. The algorithm is verified using at-sea data derived from the joint MIT/SACLANTCEN GOATS experiments and advanced acoustic simulation using SEALAB. The mission oriented operating system (MOOS) real-time simulator is then used to test the onboard implementation of the algorithm.

  17. An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences.

    PubMed

    Ye, Kai; Kosters, Walter A; Ijzerman, Adriaan P

    2007-03-15

    Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http//www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/) are used to directly identify frequent patterns from unaligned biological sequences without an attempt to align them. Here we propose a new algorithm with more efficiency and more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets. In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets. The source code for pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.

  18. Careflow Mining Techniques to Explore Type 2 Diabetes Evolution.

    PubMed

    Dagliati, Arianna; Tibollo, Valentina; Cogni, Giulia; Chiovato, Luca; Bellazzi, Riccardo; Sacchi, Lucia

    2018-03-01

    In this work we describe the application of a careflow mining algorithm to detect the most frequent patterns of care in a type 2 diabetes patients cohort. The applied method enriches the detected patterns with clinical data to define temporal phenotypes across the studied population. Novel phenotypes are discovered from heterogeneous data of 424 Italian patients, and compared in terms of metabolic control and complications. Results show that careflow mining can help to summarize the complex evolution of the disease into meaningful patterns, which are also significant from a clinical point of view.

  19. The Mine Locomotive Wireless Network Strategy Based on Successive Interference Cancellation

    PubMed Central

    Wu, Liaoyuan; Han, Jianghong; Wei, Xing; Shi, Lei; Ding, Xu

    2015-01-01

    We consider a wireless network strategy based on successive interference cancellation (SIC) for mine locomotives. We firstly build the original mathematical model for the strategy which is a non-convex model. Then, we examine this model intensively, and figure out that there are certain regulations embedded in it. Based on these findings, we are able to reformulate the model into a new form and design a simple algorithm which can assign each locomotive with a proper transmitting scheme during the whole schedule procedure. Simulation results show that the outcomes obtained through this algorithm are improved by around 50% compared with those that do not apply the SIC technique. PMID:26569240

  20. Understanding Human Motion Skill with Peak Timing Synergy

    NASA Astrophysics Data System (ADS)

    Ueno, Ken; Furukawa, Koichi

    The careful observation of motion phenomena is important in understanding the skillful human motion. However, this is a difficult task due to the complexities in timing when dealing with the skilful control of anatomical structures. To investigate the dexterity of human motion, we decided to concentrate on timing with respect to motion, and we have proposed a method to extract the peak timing synergy from multivariate motion data. The peak timing synergy is defined as a frequent ordered graph with time stamps, which has nodes consisting of turning points in motion waveforms. A proposed algorithm, PRESTO automatically extracts the peak timing synergy. PRESTO comprises the following 3 processes: (1) detecting peak sequences with polygonal approximation; (2) generating peak-event sequences; and (3) finding frequent peak-event sequences using a sequential pattern mining method, generalized sequential patterns (GSP). Here, we measured right arm motion during the task of cello bowing and prepared a data set of the right shoulder and arm motion. We successfully extracted the peak timing synergy on cello bowing data set using the PRESTO algorithm, which consisted of common skills among cellists and personal skill differences. To evaluate the sequential pattern mining algorithm GSP in PRESTO, we compared the peak timing synergy by using GSP algorithm and the one by using filtering by reciprocal voting (FRV) algorithm as a non time-series method. We found that the support is 95 - 100% in GSP, while 83 - 96% in FRV and that the results by GSP are better than the one by FRV in the reproducibility of human motion. Therefore we show that sequential pattern mining approach is more effective to extract the peak timing synergy than non-time series analysis approach.

  1. Machine learning for prediction of 30-day mortality after ST elevation myocardial infraction: An Acute Coronary Syndrome Israeli Survey data mining study.

    PubMed

    Shouval, Roni; Hadanny, Amir; Shlomo, Nir; Iakobishvili, Zaza; Unger, Ron; Zahger, Doron; Alcalai, Ronny; Atar, Shaul; Gottlieb, Shmuel; Matetzky, Shlomi; Goldenberg, Ilan; Beigel, Roy

    2017-11-01

    Risk scores for prediction of mortality 30-days following a ST-segment elevation myocardial infarction (STEMI) have been developed using a conventional statistical approach. To evaluate an array of machine learning (ML) algorithms for prediction of mortality at 30-days in STEMI patients and to compare these to the conventional validated risk scores. This was a retrospective, supervised learning, data mining study. Out of a cohort of 13,422 patients from the Acute Coronary Syndrome Israeli Survey (ACSIS) registry, 2782 patients fulfilled inclusion criteria and 54 variables were considered. Prediction models for overall mortality 30days after STEMI were developed using 6 ML algorithms. Models were compared to each other and to the Global Registry of Acute Coronary Events (GRACE) and Thrombolysis In Myocardial Infarction (TIMI) scores. Depending on the algorithm, using all available variables, prediction models' performance measured in an area under the receiver operating characteristic curve (AUC) ranged from 0.64 to 0.91. The best models performed similarly to the Global Registry of Acute Coronary Events (GRACE) score (0.87 SD 0.06) and outperformed the Thrombolysis In Myocardial Infarction (TIMI) score (0.82 SD 0.06, p<0.05). Performance of most algorithms plateaued when introduced with 15 variables. Among the top predictors were creatinine, Killip class on admission, blood pressure, glucose level, and age. We present a data mining approach for prediction of mortality post-ST-segment elevation myocardial infarction. The algorithms selected showed competence in prediction across an increasing number of variables. ML may be used for outcome prediction in complex cardiology settings. Copyright © 2017 Elsevier Ireland Ltd. All rights reserved.

  2. Establishing Reliable miRNA-Cancer Association Network Based on Text-Mining Method

    PubMed Central

    Yang, Zhaowan; Fang, Ming; Zhang, Libin; Zhou, Yanhong

    2014-01-01

    Associating microRNAs (miRNAs) with cancers is an important step of understanding the mechanisms of cancer pathogenesis and finding novel biomarkers for cancer therapies. In this study, we constructed a miRNA-cancer association network (miCancerna) based on more than 1,000 miRNA-cancer associations detected from millions of abstracts with the text-mining method, including 226 miRNA families and 20 common cancers. We further prioritized cancer-related miRNAs at the network level with the random-walk algorithm, achieving a relatively higher performance than previous miRNA disease networks. Finally, we examined the top 5 candidate miRNAs for each kind of cancer and found that 71% of them are confirmed experimentally. miCancerna would be an alternative resource for the cancer-related miRNA identification. PMID:24895499

  3. Adaptive optimization as a design and management methodology for coal-mining enterprise in uncertain and volatile market environment - the conceptual framework

    NASA Astrophysics Data System (ADS)

    Mikhalchenko, V. V.; Rubanik, Yu T.

    2016-10-01

    The work is devoted to the problem of cost-effective adaptation of coal mines to the volatile and uncertain market conditions. Conceptually it can be achieved through alignment of the dynamic characteristics of the coal mining system and power spectrum of market demand for coal product. In practical terms, this ensures the viability and competitiveness of coal mines. Transformation of dynamic characteristics is to be done by changing the structure of production system as well as corporate, logistics and management processes. The proposed methods and algorithms of control are aimed at the development of the theoretical foundations of adaptive optimization as basic methodology for coal mine enterprise management in conditions of high variability and uncertainty of economic and natural environment. Implementation of the proposed methodology requires a revision of the basic principles of open coal mining enterprises design.

  4. Characterization of gut microbiota profiles in coronary artery disease patients using data mining analysis of terminal restriction fragment length polymorphism: gut microbiota could be a diagnostic marker of coronary artery disease.

    PubMed

    Emoto, Takuo; Yamashita, Tomoya; Kobayashi, Toshio; Sasaki, Naoto; Hirota, Yushi; Hayashi, Tomohiro; So, Anna; Kasahara, Kazuyuki; Yodoi, Keiko; Matsumoto, Takuya; Mizoguchi, Taiji; Ogawa, Wataru; Hirata, Ken-Ichi

    2017-01-01

    The association between atherosclerosis and gut microbiota has been attracting increased attention. We previously demonstrated a possible link between gut microbiota and coronary artery disease. Our aim of this study was to clarify the gut microbiota profiles in coronary artery disease patients using data mining analysis of terminal restriction fragment length polymorphism (T-RFLP). This study included 39 coronary artery disease (CAD) patients and 30 age- and sex- matched no-CAD controls (Ctrls) with coronary risk factors. Bacterial DNA was extracted from their fecal samples and analyzed by T-RFLP and data mining analysis using the classification and regression algorithm. Five additional CAD patients were newly recruited to confirm the reliability of this analysis. Data mining analysis could divide the composition of gut microbiota into 2 characteristic nodes. The CAD group was classified into 4 CAD pattern nodes (35/39 = 90 %), while the Ctrl group was classified into 3 Ctrl pattern nodes (28/30 = 93 %). Five additional CAD samples were applied to the same dividing model, which could validate the accuracy to predict the risk of CAD by data mining analysis. We could demonstrate that operational taxonomic unit 853 (OTU853), OTU657, and OTU990 were determined important both by the data mining method and by the usual statistical comparison. We classified the gut microbiota profiles in coronary artery disease patients using data mining analysis of T-RFLP data and demonstrated the possibility that gut microbiota is a diagnostic marker of suffering from CAD.

  5. How Analysts Cognitively “Connect the Dots”

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bradel, Lauren; Self, Jessica S.; Endert, Alexander

    2013-06-04

    As analysts attempt to make sense of a collection of documents, such as intelligence analysis reports, they may wish to “connect the dots” between pieces of information that may initially seem unrelated. This process of synthesizing information between information requires users to make connections between pairs of documents, creating a conceptual story. We conducted a user study to analyze the process by which users connect pairs of documents and how they spatially arrange information. Users created conceptual stories that connected the dots using organizational strategies that ranged in complexity. We propose taxonomies for cognitive connections and physical structures used whenmore » trying to “connect the dots” between two documents. We compared the user-created stories with a data-mining algorithm that constructs chains of documents using co-occurrence metrics. Using the insight gained into the storytelling process, we offer design considerations for the existing data mining algorithm and corresponding tools to combine the power of data mining and the complex cognitive processing of analysts.« less

  6. High-Performance Signal Detection for Adverse Drug Events using MapReduce Paradigm.

    PubMed

    Fan, Kai; Sun, Xingzhi; Tao, Ying; Xu, Linhao; Wang, Chen; Mao, Xianling; Peng, Bo; Pan, Yue

    2010-11-13

    Post-marketing pharmacovigilance is important for public health, as many Adverse Drug Events (ADEs) are unknown when those drugs were approved for marketing. However, due to the large number of reported drugs and drug combinations, detecting ADE signals by mining these reports is becoming a challenging task in terms of computational complexity. Recently, a parallel programming model, MapReduce has been introduced by Google to support large-scale data intensive applications. In this study, we proposed a MapReduce-based algorithm, for common ADE detection approach, Proportional Reporting Ratio (PRR), and tested it in mining spontaneous ADE reports from FDA. The purpose is to investigate the possibility of using MapReduce principle to speed up biomedical data mining tasks using this pharmacovigilance case as one specific example. The results demonstrated that MapReduce programming model could improve the performance of common signal detection algorithm for pharmacovigilance in a distributed computation environment at approximately liner speedup rates.

  7. A New Approach for Mining Order-Preserving Submatrices Based on All Common Subsequences.

    PubMed

    Xue, Yun; Liao, Zhengling; Li, Meihang; Luo, Jie; Kuang, Qiuhua; Hu, Xiaohui; Li, Tiechen

    2015-01-01

    Order-preserving submatrices (OPSMs) have been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems, as an important unsupervised learning model. Unfortunately, most existing methods are heuristic algorithms which are unable to reveal OPSMs entirely in NP-complete problem. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method to discover all OPSMs based on frequent sequential pattern mining. First, an existing algorithm was adjusted to disclose all common subsequence (ACS) between every two row sequences, and therefore all deep OPSMs will not be missed. Then, an improved data structure for prefix tree was used to store and traverse ACS, and Apriori principle was employed to efficiently mine the frequent sequential pattern. Finally, experiments were implemented on gene and synthetic datasets. Results demonstrated the effectiveness and efficiency of this method.

  8. Statistically significant relational data mining :

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Berry, Jonathan W.; Leung, Vitus Joseph; Phillips, Cynthia Ann

    This report summarizes the work performed under the project (3z(BStatitically significant relational data mining.(3y (BThe goal of the project was to add more statistical rigor to the fairly ad hoc area of data mining on graphs. Our goal was to develop better algorithms and better ways to evaluate algorithm quality. We concetrated on algorithms for community detection, approximate pattern matching, and graph similarity measures. Approximate pattern matching involves finding an instance of a relatively small pattern, expressed with tolerance, in a large graph of data observed with uncertainty. This report gathers the abstracts and references for the eight refereed publicationsmore » that have appeared as part of this work. We then archive three pieces of research that have not yet been published. The first is theoretical and experimental evidence that a popular statistical measure for comparison of community assignments favors over-resolved communities over approximations to a ground truth. The second are statistically motivated methods for measuring the quality of an approximate match of a small pattern in a large graph. The third is a new probabilistic random graph model. Statisticians favor these models for graph analysis. The new local structure graph model overcomes some of the issues with popular models such as exponential random graph models and latent variable models.« less

  9. Process Mining for Individualized Behavior Modeling Using Wireless Tracking in Nursing Homes

    PubMed Central

    Fernández-Llatas, Carlos; Benedi, José-Miguel; García-Gómez, Juan M.; Traver, Vicente

    2013-01-01

    The analysis of human behavior patterns is increasingly used for several research fields. The individualized modeling of behavior using classical techniques requires too much time and resources to be effective. A possible solution would be the use of pattern recognition techniques to automatically infer models to allow experts to understand individual behavior. However, traditional pattern recognition algorithms infer models that are not readily understood by human experts. This limits the capacity to benefit from the inferred models. Process mining technologies can infer models as workflows, specifically designed to be understood by experts, enabling them to detect specific behavior patterns in users. In this paper, the eMotiva process mining algorithms are presented. These algorithms filter, infer and visualize workflows. The workflows are inferred from the samples produced by an indoor location system that stores the location of a resident in a nursing home. The visualization tool is able to compare and highlight behavior patterns in order to facilitate expert understanding of human behavior. This tool was tested with nine real users that were monitored for a 25-week period. The results achieved suggest that the behavior of users is continuously evolving and changing and that this change can be measured, allowing for behavioral change detection. PMID:24225907

  10. Building Higher-Order Markov Chain Models with EXCEL

    ERIC Educational Resources Information Center

    Ching, Wai-Ki; Fung, Eric S.; Ng, Michael K.

    2004-01-01

    Categorical data sequences occur in many applications such as forecasting, data mining and bioinformatics. In this note, we present higher-order Markov chain models for modelling categorical data sequences with an efficient algorithm for solving the model parameters. The algorithm can be implemented easily in a Microsoft EXCEL worksheet. We give a…

  11. Techniques of Acceleration for Association Rule Induction with Pseudo Artificial Life Algorithm

    NASA Astrophysics Data System (ADS)

    Kanakubo, Masaaki; Hagiwara, Masafumi

    Frequent patterns mining is one of the important problems in data mining. Generally, the number of potential rules grows rapidly as the size of database increases. It is therefore hard for a user to extract the association rules. To avoid such a difficulty, we propose a new method for association rule induction with pseudo artificial life approach. The proposed method is to decide whether there exists an item set which contains N or more items in two transactions. If it exists, a series of item sets which are contained in the part of transactions will be recorded. The iteration of this step contributes to the extraction of association rules. It is not necessary to calculate the huge number of candidate rules. In the evaluation test, we compared the extracted association rules using our method with the rules using other algorithms like Apriori algorithm. As a result of the evaluation using huge retail market basket data, our method is approximately 10 and 20 times faster than the Apriori algorithm and many its variants.

  12. Machine learning for a Toolkit for Image Mining

    NASA Technical Reports Server (NTRS)

    Delanoy, Richard L.

    1995-01-01

    A prototype user environment is described that enables a user with very limited computer skills to collaborate with a computer algorithm to develop search tools (agents) that can be used for image analysis, creating metadata for tagging images, searching for images in an image database on the basis of image content, or as a component of computer vision algorithms. Agents are learned in an ongoing, two-way dialogue between the user and the algorithm. The user points to mistakes made in classification. The algorithm, in response, attempts to discover which image attributes are discriminating between objects of interest and clutter. It then builds a candidate agent and applies it to an input image, producing an 'interest' image highlighting features that are consistent with the set of objects and clutter indicated by the user. The dialogue repeats until the user is satisfied. The prototype environment, called the Toolkit for Image Mining (TIM) is currently capable of learning spectral and textural patterns. Learning exhibits rapid convergence to reasonable levels of performance and, when thoroughly trained, Fo appears to be competitive in discrimination accuracy with other classification techniques.

  13. [Vegetation spatial and temporal dynamic characteristics based on NDVI time series trajectories in grassland opencast coal mining].

    PubMed

    Jia, Duo; Wang, Cang Jiao; Mu, Shou Guo; Zhao, Hua

    2017-06-18

    The spatiotemporal dynamic patterns of vegetation in mining area are still unclear. This study utilized time series trajectory segmentation algorithm to fit Landsat NDVI time series which generated from fusion images at the most prosperous period of growth based on ESTARFM algorithm. Combining with the shape features of the fitted trajectory, this paper extracted five vegetation dynamic patterns including pre-disturbance type, continuous disturbance type, stabilization after disturbance type, stabilization between disturbance and recovery type, and recovery after disturbance type. The result indicated that recovery after disturbance type was the dominant vegetation change pattern among the five types of vegetation dynamic pattern, which accounted for 55.2% of the total number of pixels. The follows were stabilization after disturbance type and continuous disturbance type, accounting for 25.6% and 11.0%, respectively. The pre-disturbance type and stabilization between disturbance and recovery type accounted for 3.5% and 4.7%, respectively. Vegetation disturbance mainly occurred from 2004 to 2009 in Shengli mining area. The onset time of stable state was 2008 and the spatial locations mainlydistributed in open-pit stope and waste dump. The reco-very state mainly started since the year of 2008 and 2010, while the areas were small and mainly distributed at the periphery of open-pit stope and waste dump. Duration of disturbance was mainly 1 year. The duration of stable period usually sustained 7 years. The duration of recovery state of the type of stabilization between disturbances continued 2 to 5 years, while the type of recovery after disturbance often sustained 8 years.

  14. Electric and magnetic galvanic distortion decomposition of tensor CSAMT data. Application to data from the Buchans Mine (Newfoundland, Canada)

    NASA Astrophysics Data System (ADS)

    Garcia, Xavier; Boerner, David; Pedersen, Laust B.

    2003-09-01

    We have developed a Marquardt-Levenberg inversion algorithm incorporating the effects of near-surface galvanic distortion into the electromagnetic (EM) response of a layered earth model. Different tests on synthetic model responses suggest that for the grounded source method, the magnetic distortion does not vanish for low frequencies. Including this effect is important, although to date it has been neglected. We have inverted 10 stations of controlled-source audio-magnetotellurics (CSAMT) data recorded near the Buchans Mine, Newfoundland, Canada. The Buchans Mine was one of the richest massive sulphide deposits in the world, and is situated in a highly resistive volcanogenic environment, substantially modified by thrust faulting. Preliminary work in the area demonstrated that the EM fields observed at adjacent stations show large differences due to the existence of mineralized fracture zones and variable overburden thickness. Our inversion results suggest a three-layered model that is appropriate for the Buchans Mine. The resistivity model correlates with the seismic reflection interpretation that documents the existence of two thrust packages. The distortion parameters obtained from the inversion concur with the synthetic studies that galvanic magnetic distortion is required to interpret the Buchans data since the magnetic component of the galvanic distortion does not vanish at low frequency.

  15. [Analysis of on medication rules for Qi-deficiency and blood-stasis syndrome of chronic heart failure based on data mining technology].

    PubMed

    Wang, Qian; Yao, Geng-Zhen; Pan, Guang-Ming; Huang, Jing-Yi; An, Yi-Pei; Zou, Xu

    2017-01-01

    To analyze the medication features and the regularity of prescriptions of traditional Chinese medicine in treating patients with Qi-deficiency and blood-stasis syndrome of chronic heart failure based on modern literature. In this article, CNKI Chinese academic journal database, Wanfang Chinese academic journal database and VIP Chinese periodical database were all searched from January 2000 to December 2015 for the relevant literature on traditional Chinese medicine treatment for Qi-deficiency and blood-stasis syndrome of chronic heart failure. Then a normalized database was established for further data mining and analysis. Subsequently, the medication features and the regularity of prescriptions were mined by using traditional Chinese medicine inheritance support system(V2.5), association rules, improved mutual information algorithm, complex system entropy clustering and other mining methods. Finally, a total of 171 articles were included, involving 171 prescriptions, 140 kinds of herbs, with a total frequency of 1 772 for the herbs. As a result, 19 core prescriptions and 7 new prescriptions were mined. The most frequently used herbs included Huangqi(Astragali Radix), Danshen(Salviae Miltiorrhizae Radix et Rhizoma), Fuling(Poria), Renshen(Ginseng Radix et Rhizoma), Tinglizi(Semen Lepidii), Baizhu(Atractylodis Macrocephalae Rhizoma), and Guizhi(Cinnamomum Ramulus). The core prescriptions were composed of Huangqi(Astragali Radix), Danshen(Salviae Miltiorrhizae Radix et Rhizoma) and Fuling(Poria), etc. The high frequent herbs and core prescriptions not only highlight the medication features of Qi-invigorating and blood-circulating therapy, but also reflect the regularity of prescriptions of blood-circulating, Yang-warming, and urination-promoting therapy based on syndrome differentiation. Moreover, the mining of the new prescriptions provide new reference and inspiration for clinical treatment of various accompanying symptoms of chronic heart failure. In conclusion, this article provides new reference for traditional Chinese medicine in the treatment of chronic heart failure. Copyright© by the Chinese Pharmaceutical Association.

  16. Automated detection of hospital outbreaks: A systematic review of methods

    PubMed Central

    Buckeridge, David L.; Lepelletier, Didier

    2017-01-01

    Objectives Several automated algorithms for epidemiological surveillance in hospitals have been proposed. However, the usefulness of these methods to detect nosocomial outbreaks remains unclear. The goal of this review was to describe outbreak detection algorithms that have been tested within hospitals, consider how they were evaluated, and synthesize their results. Methods We developed a search query using keywords associated with hospital outbreak detection and searched the MEDLINE database. To ensure the highest sensitivity, no limitations were initially imposed on publication languages and dates, although we subsequently excluded studies published before 2000. Every study that described a method to detect outbreaks within hospitals was included, without any exclusion based on study design. Additional studies were identified through citations in retrieved studies. Results Twenty-nine studies were included. The detection algorithms were grouped into 5 categories: simple thresholds (n = 6), statistical process control (n = 12), scan statistics (n = 6), traditional statistical models (n = 6), and data mining methods (n = 4). The evaluation of the algorithms was often solely descriptive (n = 15), but more complex epidemiological criteria were also investigated (n = 10). The performance measures varied widely between studies: e.g., the sensitivity of an algorithm in a real world setting could vary between 17 and 100%. Conclusion Even if outbreak detection algorithms are useful complementary tools for traditional surveillance, the heterogeneity in results among published studies does not support quantitative synthesis of their performance. A standardized framework should be followed when evaluating outbreak detection methods to allow comparison of algorithms across studies and synthesis of results. PMID:28441422

  17. Method for Location of An External Dump in Surface Mining Using the A-Star Algorithm

    NASA Astrophysics Data System (ADS)

    Zajączkowski, Maciej; Kasztelewicz, Zbigniew; Sikora, Mateusz

    2014-10-01

    The construction of a surface mine always involves the necessity of accessing deposits through the removal of the residual overburden above. In the beginning phase of exploitation, the masses of overburden are located outside the perimeters of the excavation site, on the external dump, until the moment of internal dumping. In the case of lignite surface mines, these dumps can cover a ground surface of several dozen to a few thousand hectares. This results from a high concentration of lignite extraction, counted in millions of Mg per year, and the relatively large depth of its residual deposits. Determining the best place for the location of an external dump requires a detailed analysis of existing options, followed by a choice of the most favorable one. This article, using the case study of an open-cast lignite mine, presents the selection method for an external dump location based on graph theory and the A-star algorithm. This algorithm, based on the spatial distribution of individual intersections on the graph, seeks specified graph states, continually expanding them with additional elementary fields until the required surface area for the external dump - defined by the lowest value of the occupied site - is achieved. To do this, it is necessary to accurately identify the factors affecting the choice of dump location. On such a basis, it is then possible to specify the target function, which reflects the individual costs of dump construction on a given site. This is discussed further in chapter 3. The area of potential dump location has been divided into elementary fields, each represented by a corresponding geometrical locus. Ascribed to this locus, in addition to its geodesic coordinates, are the appropriate attributes reflecting the degree of development of its elementary field. These tasks can be carried out automatically thanks to the integration of the method with the system of geospatial data management for the given area. The collection of loci, together with geodesic coordinates, constitutes the points on the graph used during exploration. This is done using the A-star algorithm, which uses a heuristic function, allowing it to identify the optimal solution; therefore, the collection of elementary fields, which occupy the potential construction area of a dump, characterized by the lowest value representing the cost of occupation and dumping of overburden in the area. The precision of the boundary, generated by the algorithm, is dependent on the established size of the elementary field, and should be refined each time by the designer of the surface mine. This article presents the application of the above method of dump location using the example of "Tomisławice," a lignite surface mine owned by PAK KWB Konin S. A. The method made it possible to identify the most favorable dump location on the northeast side of the initial pit, within 2 kilometers of its surrounding area (discussed further in chapter 3). This method is universal in nature and, after certain modifications, can be implemented for other surface mines as well.

  18. A review of reverse vaccinology approaches for the development of vaccines against ticks and tick borne diseases.

    PubMed

    Lew-Tabor, A E; Rodriguez Valle, M

    2016-06-01

    The field of reverse vaccinology developed as an outcome of the genome sequence revolution. Following the introduction of live vaccinations in the western world by Edward Jenner in 1798 and the coining of the phrase 'vaccine', in 1881 Pasteur developed a rational design for vaccines. Pasteur proposed that in order to make a vaccine that one should 'isolate, inactivate and inject the microorganism' and these basic rules of vaccinology were largely followed for the next 100 years leading to the elimination of several highly infectious diseases. However, new technologies were needed to conquer many pathogens which could not be eliminated using these traditional technologies. Thus increasingly, computers were used to mine genome sequences to rationally design recombinant vaccines. Several vaccines for bacterial and viral diseases (i.e. meningococcus and HIV) have been developed, however the on-going challenge for parasite vaccines has been due to their comparatively larger genomes. Understanding the immune response is important in reverse vaccinology studies as this knowledge will influence how the genome mining is to be conducted. Vaccine candidates for anaplasmosis, cowdriosis, theileriosis, leishmaniasis, malaria, schistosomiasis, and the cattle tick have been identified using reverse vaccinology approaches. Some challenges for parasite vaccine development include the ability to address antigenic variability as well the understanding of the complex interplay between antibody, mucosal and/or T cell immune responses. To understand the complex parasite interactions with the livestock host, there is the limitation where algorithms for epitope mining using the human genome cannot directly be adapted for bovine, for example the prediction of peptide binding to major histocompatibility complex motifs. As the number of genomes for both hosts and parasites increase, the development of new algorithms for pan-genomic mining will continue to impact the future of parasite and ricketsial (and other tick borne pathogens) disease vaccine development. Copyright © 2015 Elsevier GmbH. All rights reserved.

  19. Modified Bat Algorithm for Feature Selection with the Wisconsin Diagnosis Breast Cancer (WDBC) Dataset

    PubMed

    Jeyasingh, Suganthi; Veluchamy, Malathi

    2017-05-01

    Early diagnosis of breast cancer is essential to save lives of patients. Usually, medical datasets include a large variety of data that can lead to confusion during diagnosis. The Knowledge Discovery on Database (KDD) process helps to improve efficiency. It requires elimination of inappropriate and repeated data from the dataset before final diagnosis. This can be done using any of the feature selection algorithms available in data mining. Feature selection is considered as a vital step to increase the classification accuracy. This paper proposes a Modified Bat Algorithm (MBA) for feature selection to eliminate irrelevant features from an original dataset. The Bat algorithm was modified using simple random sampling to select the random instances from the dataset. Ranking was with the global best features to recognize the predominant features available in the dataset. The selected features are used to train a Random Forest (RF) classification algorithm. The MBA feature selection algorithm enhanced the classification accuracy of RF in identifying the occurrence of breast cancer. The Wisconsin Diagnosis Breast Cancer Dataset (WDBC) was used for estimating the performance analysis of the proposed MBA feature selection algorithm. The proposed algorithm achieved better performance in terms of Kappa statistic, Mathew’s Correlation Coefficient, Precision, F-measure, Recall, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE). Creative Commons Attribution License

  20. Coal and Open-pit surface mining impacts on American Lands (COAL)

    NASA Astrophysics Data System (ADS)

    Brown, T. A.; McGibbney, L. J.

    2017-12-01

    Mining is known to cause environmental degradation, but software tools to identify its impacts are lacking. However, remote sensing, spectral reflectance, and geographic data are readily available, and high-performance cloud computing resources exist for scientific research. Coal and Open-pit surface mining impacts on American Lands (COAL) provides a suite of algorithms and documentation to leverage these data and resources to identify evidence of mining and correlate it with environmental impacts over time.COAL was originally developed as a 2016 - 2017 senior capstone collaboration between scientists at the NASA Jet Propulsion Laboratory (JPL) and computer science students at Oregon State University (OSU). The COAL team implemented a free and open-source software library called "pycoal" in the Python programming language which facilitated a case study of the effects of coal mining on water resources. Evidence of acid mine drainage associated with an open-pit coal mine in New Mexico was derived by correlating imaging spectrometer data from the JPL Airborne Visible/InfraRed Imaging Spectrometer - Next Generation (AVIRIS-NG), spectral reflectance data published by the USGS Spectroscopy Laboratory in the USGS Digital Spectral Library 06, and GIS hydrography data published by the USGS National Geospatial Program in The National Map. This case study indicated that the spectral and geospatial algorithms developed by COAL can be used successfully to analyze the environmental impacts of mining activities.Continued development of COAL has been promoted by a Startup allocation award of high-performance computing resources from the Extreme Science and Engineering Discovery Environment (XSEDE). These resources allow the team to undertake further benchmarking, evaluation, and experimentation using multiple XSEDE resources. The opportunity to use computational infrastructure of this caliber will further enable the development of a science gateway to continue foundational COAL research.This work documents the original design and development of COAL and provides insight into continuing research efforts which have potential applications beyond the project to environmental data science and other fields.

  1. Information mining in remote sensing imagery

    NASA Astrophysics Data System (ADS)

    Li, Jiang

    The volume of remotely sensed imagery continues to grow at an enormous rate due to the advances in sensor technology, and our capability for collecting and storing images has greatly outpaced our ability to analyze and retrieve information from the images. This motivates us to develop image information mining techniques, which is very much an interdisciplinary endeavor drawing upon expertise in image processing, databases, information retrieval, machine learning, and software design. This dissertation proposes and implements an extensive remote sensing image information mining (ReSIM) system prototype for mining useful information implicitly stored in remote sensing imagery. The system consists of three modules: image processing subsystem, database subsystem, and visualization and graphical user interface (GUI) subsystem. Land cover and land use (LCLU) information corresponding to spectral characteristics is identified by supervised classification based on support vector machines (SVM) with automatic model selection, while textural features that characterize spatial information are extracted using Gabor wavelet coefficients. Within LCLU categories, textural features are clustered using an optimized k-means clustering approach to acquire search efficient space. The clusters are stored in an object-oriented database (OODB) with associated images indexed in an image database (IDB). A k-nearest neighbor search is performed using a query-by-example (QBE) approach. Furthermore, an automatic parametric contour tracing algorithm and an O(n) time piecewise linear polygonal approximation (PLPA) algorithm are developed for shape information mining of interesting objects within the image. A fuzzy object-oriented database based on the fuzzy object-oriented data (FOOD) model is developed to handle the fuzziness and uncertainty. Three specific applications are presented: integrated land cover and texture pattern mining, shape information mining for change detection of lakes, and fuzzy normalized difference vegetation index (NDVI) pattern mining. The study results show the effectiveness of the proposed system prototype and the potentials for other applications in remote sensing.

  2. Analysis of miRNA expression profile based on SVM algorithm

    NASA Astrophysics Data System (ADS)

    Ting-ting, Dai; Chang-ji, Shan; Yan-shou, Dong; Yi-duo, Bian

    2018-05-01

    Based on mirna expression spectrum data set, a new data mining algorithm - tSVM - KNN (t statistic with support vector machine - k nearest neighbor) is proposed. the idea of the algorithm is: firstly, the feature selection of the data set is carried out by the unified measurement method; Secondly, SVM - KNN algorithm, which combines support vector machine (SVM) and k - nearest neighbor (k - nearest neighbor) is used as classifier. Simulation results show that SVM - KNN algorithm has better classification ability than SVM and KNN alone. Tsvm - KNN algorithm only needs 5 mirnas to obtain 96.08 % classification accuracy in terms of the number of mirna " tags" and recognition accuracy. compared with similar algorithms, tsvm - KNN algorithm has obvious advantages.

  3. An improved clustering algorithm based on reverse learning in intelligent transportation

    NASA Astrophysics Data System (ADS)

    Qiu, Guoqing; Kou, Qianqian; Niu, Ting

    2017-05-01

    With the development of artificial intelligence and data mining technology, big data has gradually entered people's field of vision. In the process of dealing with large data, clustering is an important processing method. By introducing the reverse learning method in the clustering process of PAM clustering algorithm, to further improve the limitations of one-time clustering in unsupervised clustering learning, and increase the diversity of clustering clusters, so as to improve the quality of clustering. The algorithm analysis and experimental results show that the algorithm is feasible.

  4. Data Reduction Algorithm Using Nonnegative Matrix Factorization with Nonlinear Constraints

    NASA Astrophysics Data System (ADS)

    Sembiring, Pasukat

    2017-12-01

    Processing ofdata with very large dimensions has been a hot topic in recent decades. Various techniques have been proposed in order to execute the desired information or structure. Non- Negative Matrix Factorization (NMF) based on non-negatives data has become one of the popular methods for shrinking dimensions. The main strength of this method is non-negative object, the object model by a combination of some basic non-negative parts, so as to provide a physical interpretation of the object construction. The NMF is a dimension reduction method thathasbeen used widely for numerous applications including computer vision,text mining, pattern recognitions,and bioinformatics. Mathematical formulation for NMF did not appear as a convex optimization problem and various types of algorithms have been proposed to solve the problem. The Framework of Alternative Nonnegative Least Square(ANLS) are the coordinates of the block formulation approaches that have been proven reliable theoretically and empirically efficient. This paper proposes a new algorithm to solve NMF problem based on the framework of ANLS.This algorithm inherits the convergenceproperty of the ANLS framework to nonlinear constraints NMF formulations.

  5. Text Mining in Cancer Gene and Pathway Prioritization

    PubMed Central

    Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter

    2014-01-01

    Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes. PMID:25392685

  6. Text mining in cancer gene and pathway prioritization.

    PubMed

    Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter

    2014-01-01

    Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes.

  7. Utilization of volume correlation filters for underwater mine identification in LIDAR imagery

    NASA Astrophysics Data System (ADS)

    Walls, Bradley

    2008-04-01

    Underwater mine identification persists as a critical technology pursued aggressively by the Navy for fleet protection. As such, new and improved techniques must continue to be developed in order to provide measurable increases in mine identification performance and noticeable reductions in false alarm rates. In this paper we show how recent advances in the Volume Correlation Filter (VCF) developed for ground based LIDAR systems can be adapted to identify targets in underwater LIDAR imagery. Current automated target recognition (ATR) algorithms for underwater mine identification employ spatial based three-dimensional (3D) shape fitting of models to LIDAR data to identify common mine shapes consisting of the box, cylinder, hemisphere, truncated cone, wedge, and annulus. VCFs provide a promising alternative to these spatial techniques by correlating 3D models against the 3D rendered LIDAR data.

  8. MONGKIE: an integrated tool for network analysis and visualization for multi-omics data.

    PubMed

    Jang, Yeongjun; Yu, Namhee; Seo, Jihae; Kim, Sun; Lee, Sanghyuk

    2016-03-18

    Network-based integrative analysis is a powerful technique for extracting biological insights from multilayered omics data such as somatic mutations, copy number variations, and gene expression data. However, integrated analysis of multi-omics data is quite complicated and can hardly be done in an automated way. Thus, a powerful interactive visual mining tool supporting diverse analysis algorithms for identification of driver genes and regulatory modules is much needed. Here, we present a software platform that integrates network visualization with omics data analysis tools seamlessly. The visualization unit supports various options for displaying multi-omics data as well as unique network models for describing sophisticated biological networks such as complex biomolecular reactions. In addition, we implemented diverse in-house algorithms for network analysis including network clustering and over-representation analysis. Novel functions include facile definition and optimized visualization of subgroups, comparison of a series of data sets in an identical network by data-to-visual mapping and subsequent overlaying function, and management of custom interaction networks. Utility of MONGKIE for network-based visual data mining of multi-omics data was demonstrated by analysis of the TCGA glioblastoma data. MONGKIE was developed in Java based on the NetBeans plugin architecture, thus being OS-independent with intrinsic support of module extension by third-party developers. We believe that MONGKIE would be a valuable addition to network analysis software by supporting many unique features and visualization options, especially for analysing multi-omics data sets in cancer and other diseases. .

  9. A Comparison of different learning models used in Data Mining for Medical Data

    NASA Astrophysics Data System (ADS)

    Srimani, P. K.; Koti, Manjula Sanjay

    2011-12-01

    The present study aims at investigating the different Data mining learning models for different medical data sets and to give practical guidelines to select the most appropriate algorithm for a specific medical data set. In practical situations, it is absolutely necessary to take decisions with regard to the appropriate models and parameters for diagnosis and prediction problems. Learning models and algorithms are widely implemented for rule extraction and the prediction of system behavior. In this paper, some of the well-known Machine Learning(ML) systems are investigated for different methods and are tested on five medical data sets. The practical criteria for evaluating different learning models are presented and the potential benefits of the proposed methodology for diagnosis and learning are suggested.

  10. Information pricing based on trusted system

    NASA Astrophysics Data System (ADS)

    Liu, Zehua; Zhang, Nan; Han, Hongfeng

    2018-05-01

    Personal information has become a valuable commodity in today's society. So our goal aims to develop a price point and a pricing system to be realistic. First of all, we improve the existing BLP system to prevent cascading incidents, design a 7-layer model. Through the cost of encryption in each layer, we develop PI price points. Besides, we use association rules mining algorithms in data mining algorithms to calculate the importance of information in order to optimize informational hierarchies of different attribute types when located within a multi-level trusted system. Finally, we use normal distribution model to predict encryption level distribution for users in different classes and then calculate information prices through a linear programming model with the help of encryption level distribution above.

  11. A comparative assessment of GIS-based data mining models and a novel ensemble model in groundwater well potential mapping

    NASA Astrophysics Data System (ADS)

    Naghibi, Seyed Amir; Moghaddam, Davood Davoodi; Kalantar, Bahareh; Pradhan, Biswajeet; Kisi, Ozgur

    2017-05-01

    In recent years, application of ensemble models has been increased tremendously in various types of natural hazard assessment such as landslides and floods. However, application of this kind of robust models in groundwater potential mapping is relatively new. This study applied four data mining algorithms including AdaBoost, Bagging, generalized additive model (GAM), and Naive Bayes (NB) models to map groundwater potential. Then, a novel frequency ratio data mining ensemble model (FREM) was introduced and evaluated. For this purpose, eleven groundwater conditioning factors (GCFs), including altitude, slope aspect, slope angle, plan curvature, stream power index (SPI), river density, distance from rivers, topographic wetness index (TWI), land use, normalized difference vegetation index (NDVI), and lithology were mapped. About 281 well locations with high potential were selected. Wells were randomly partitioned into two classes for training the models (70% or 197) and validating them (30% or 84). AdaBoost, Bagging, GAM, and NB algorithms were employed to get groundwater potential maps (GPMs). The GPMs were categorized into potential classes using natural break method of classification scheme. In the next stage, frequency ratio (FR) value was calculated for the output of the four aforementioned models and were summed, and finally a GPM was produced using FREM. For validating the models, area under receiver operating characteristics (ROC) curve was calculated. The ROC curve for prediction dataset was 94.8, 93.5, 92.6, 92.0, and 84.4% for FREM, Bagging, AdaBoost, GAM, and NB models, respectively. The results indicated that FREM had the best performance among all the models. The better performance of the FREM model could be related to reduction of over fitting and possible errors. Other models such as AdaBoost, Bagging, GAM, and NB also produced acceptable performance in groundwater modelling. The GPMs produced in the current study may facilitate groundwater exploitation by determining high and very high groundwater potential zones.

  12. An unsupervised classification approach for analysis of Landsat data to monitor land reclamation in Belmont county, Ohio

    NASA Technical Reports Server (NTRS)

    Brumfield, J. O.; Bloemer, H. H. L.; Campbell, W. J.

    1981-01-01

    Two unsupervised classification procedures for analyzing Landsat data used to monitor land reclamation in a surface mining area in east central Ohio are compared for agreement with data collected from the corresponding locations on the ground. One procedure is based on a traditional unsupervised-clustering/maximum-likelihood algorithm sequence that assumes spectral groupings in the Landsat data in n-dimensional space; the other is based on a nontraditional unsupervised-clustering/canonical-transformation/clustering algorithm sequence that not only assumes spectral groupings in n-dimensional space but also includes an additional feature-extraction technique. It is found that the nontraditional procedure provides an appreciable improvement in spectral groupings and apparently increases the level of accuracy in the classification of land cover categories.

  13. Privacy Preserving Technique for Euclidean Distance Based Mining Algorithms Using a Wavelet Related Transform

    NASA Astrophysics Data System (ADS)

    Kadampur, Mohammad Ali; D. v. L. N., Somayajulu

    Privacy preserving data mining is an art of knowledge discovery without revealing the sensitive data of the data set. In this paper a data transformation technique using wavelets is presented for privacy preserving data mining. Wavelets use well known energy compaction approach during data transformation and only the high energy coefficients are published to the public domain instead of the actual data proper. It is found that the transformed data preserves the Eucleadian distances and the method can be used in privacy preserving clustering. Wavelets offer the inherent improved time complexity.

  14. SOMA: A Proposed Framework for Trend Mining in Large UK Diabetic Retinopathy Temporal Databases

    NASA Astrophysics Data System (ADS)

    Somaraki, Vassiliki; Harding, Simon; Broadbent, Deborah; Coenen, Frans

    In this paper, we present SOMA, a new trend mining framework; and Aretaeus, the associated trend mining algorithm. The proposed framework is able to detect different kinds of trends within longitudinal datasets. The prototype trends are defined mathematically so that they can be mapped onto the temporal patterns. Trends are defined and generated in terms of the frequency of occurrence of pattern changes over time. To evaluate the proposed framework the process was applied to a large collection of medical records, forming part of the diabetic retinopathy screening programme at the Royal Liverpool University Hospital.

  15. Supporting Solar Physics Research via Data Mining

    NASA Astrophysics Data System (ADS)

    Angryk, Rafal; Banda, J.; Schuh, M.; Ganesan Pillai, K.; Tosun, H.; Martens, P.

    2012-05-01

    In this talk we will briefly introduce three pillars of data mining (i.e. frequent patterns discovery, classification, and clustering), and discuss some possible applications of known data mining techniques which can directly benefit solar physics research. In particular, we plan to demonstrate applicability of frequent patterns discovery methods for the verification of hypotheses about co-occurrence (in space and time) of filaments and sigmoids. We will also show how classification/machine learning algorithms can be utilized to verify human-created software modules to discover individual types of solar phenomena. Finally, we will discuss applicability of clustering techniques to image data processing.

  16. An Ensemble Approach in Converging Contents of LMS and KMS

    ERIC Educational Resources Information Center

    Sabitha, A. Sai; Mehrotra, Deepti; Bansal, Abhay

    2017-01-01

    Currently the challenges in e-Learning are converging the learning content from various sources and managing them within e-learning practices. Data mining learning algorithms can be used and the contents can be converged based on the Metadata of the objects. Ensemble methods use multiple learning algorithms and it can be used to converge the…

  17. A Bayesian Scoring Technique for Mining Predictive and Non-Spurious Rules

    PubMed Central

    Batal, Iyad; Cooper, Gregory; Hauskrecht, Milos

    2015-01-01

    Rule mining is an important class of data mining methods for discovering interesting patterns in data. The success of a rule mining method heavily depends on the evaluation function that is used to assess the quality of the rules. In this work, we propose a new rule evaluation score - the Predictive and Non-Spurious Rules (PNSR) score. This score relies on Bayesian inference to evaluate the quality of the rules and considers the structure of the rules to filter out spurious rules. We present an efficient algorithm for finding rules with high PNSR scores. The experiments demonstrate that our method is able to cover and explain the data with a much smaller rule set than existing methods. PMID:25938136

  18. A Bayesian Scoring Technique for Mining Predictive and Non-Spurious Rules.

    PubMed

    Batal, Iyad; Cooper, Gregory; Hauskrecht, Milos

    Rule mining is an important class of data mining methods for discovering interesting patterns in data. The success of a rule mining method heavily depends on the evaluation function that is used to assess the quality of the rules. In this work, we propose a new rule evaluation score - the Predictive and Non-Spurious Rules (PNSR) score. This score relies on Bayesian inference to evaluate the quality of the rules and considers the structure of the rules to filter out spurious rules. We present an efficient algorithm for finding rules with high PNSR scores. The experiments demonstrate that our method is able to cover and explain the data with a much smaller rule set than existing methods.

  19. Constraint-based Data Mining

    NASA Astrophysics Data System (ADS)

    Boulicaut, Jean-Francois; Jeudy, Baptiste

    Knowledge Discovery in Databases (KDD) is a complex interactive process. The promising theoretical framework of inductive databases considers this is essentially a querying process. It is enabled by a query language which can deal either with raw data or patterns which hold in the data. Mining patterns turns to be the so-called inductive query evaluation process for which constraint-based Data Mining techniques have to be designed. An inductive query specifies declaratively the desired constraints and algorithms are used to compute the patterns satisfying the constraints in the data. We survey important results of this active research domain. This chapter emphasizes a real breakthrough for hard problems concerning local pattern mining under various constraints and it points out the current directions of research as well.

  20. Clustering box office movie with Partition Around Medoids (PAM) Algorithm based on Text Mining of Indonesian subtitle

    NASA Astrophysics Data System (ADS)

    Alfarizy, A. D.; Indahwati; Sartono, B.

    2017-03-01

    Indonesia is the largest Hollywood movie industry target market in Southeast Asia in 2015. Hollywood movies distributed in Indonesia targeted people in all range of ages including children. Low awareness of guiding children while watching movies make them could watch any rated films even the unsuitable ones for their ages. Even after being translated into Bahasa and passed the censorship phase, words that uncomfortable for children to watch still exist. The purpose of this research is to cluster box office Hollywood movies based on Indonesian subtitle, revenue, IMDb user rating and genres as one of the reference for adults to choose right movies for their children to watch. Text mining is used to extract words from the subtitles and count the frequency for three group of words (bad words, sexual words and terror words), while Partition Around Medoids (PAM) Algorithm with Gower similarity coefficient as proximity matrix is used as clustering method. We clustered 624 movies from 2006 until first half of 2016 from IMDb. Cluster with highest silhouette coefficient value (0.36) is the one with 5 clusters. Animation, Adventure and Comedy movies with high revenue like in cluster 5 is recommended for children to watch, while Comedy movies with high revenue like in cluster 4 should be avoided to watch.

  1. Robust High-dimensional Bioinformatics Data Streams Mining by ODR-ioVFDT

    PubMed Central

    Wang, Dantong; Fong, Simon; Wong, Raymond K.; Mohammed, Sabah; Fiaidhi, Jinan; Wong, Kelvin K. L.

    2017-01-01

    Outlier detection in bioinformatics data streaming mining has received significant attention by research communities in recent years. The problems of how to distinguish noise from an exception and deciding whether to discard it or to devise an extra decision path for accommodating it are causing dilemma. In this paper, we propose a novel algorithm called ODR with incrementally Optimized Very Fast Decision Tree (ODR-ioVFDT) for taking care of outliers in the progress of continuous data learning. By using an adaptive interquartile-range based identification method, a tolerance threshold is set. It is then used to judge if a data of exceptional value should be included for training or otherwise. This is different from the traditional outlier detection/removal approaches which are two separate steps in processing through the data. The proposed algorithm is tested using datasets of five bioinformatics scenarios and comparing the performance of our model and other ones without ODR. The results show that ODR-ioVFDT has better performance in classification accuracy, kappa statistics, and time consumption. The ODR-ioVFDT applied onto bioinformatics streaming data processing for detecting and quantifying the information of life phenomena, states, characters, variables and components of the organism can help to diagnose and treat disease more effectively. PMID:28230161

  2. minepath.org: a free interactive pathway analysis web server.

    PubMed

    Koumakis, Lefteris; Roussos, Panos; Potamias, George

    2017-07-03

    ( www.minepath.org ) is a web-based platform that elaborates on, and radically extends the identification of differentially expressed sub-paths in molecular pathways. Besides the network topology, the underlying MinePath algorithmic processes exploit exact gene-gene molecular relationships (e.g. activation, inhibition) and are able to identify differentially expressed pathway parts. Each pathway is decomposed into all its constituent sub-paths, which in turn are matched with corresponding gene expression profiles. The highly ranked, and phenotype inclined sub-paths are kept. Apart from the pathway analysis algorithm, the fundamental innovation of the MinePath web-server concerns its advanced visualization and interactive capabilities. To our knowledge, this is the first pathway analysis server that introduces and offers visualization of the underlying and active pathway regulatory mechanisms instead of genes. Other features include live interaction, immediate visualization of functional sub-paths per phenotype and dynamic linked annotations for the engaged genes and molecular relations. The user can download not only the results but also the corresponding web viewer framework of the performed analysis. This feature provides the flexibility to immediately publish results without publishing source/expression data, and get all the functionality of a web based pathway analysis viewer. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  3. Literature Mining of Pathogenesis-Related Proteins in Human Pathogens for Database Annotation

    DTIC Science & Technology

    2009-10-01

    person shall be subject to any penalty for failing to comply with a collection of information if it does not display a currently valid OMB control...submission and for literature mining result display with automatically tagged abstracts. I. Literature data sets for machine learning algorithm training...mass spectrometry) proteomics data from Burkholderia strains. • Task1 ( M13 -15): Preliminary analysis of the Burkholderia proteomic space

  4. Mining dynamic noteworthy functions in software execution sequences

    PubMed Central

    Huang, Guoyan; Wang, Yuqian; He, Haitao; Ren, Jiadong

    2017-01-01

    As the quality of crucial entities can directly affect that of software, their identification and protection become an important premise for effective software development, management, maintenance and testing, which thus contribute to improving the software quality and its attack-defending ability. Most analysis and evaluation on important entities like codes-based static structure analysis are on the destruction of the actual software running. In this paper, from the perspective of software execution process, we proposed an approach to mine dynamic noteworthy functions (DNFM)in software execution sequences. First, according to software decompiling and tracking stack changes, the execution traces composed of a series of function addresses were acquired. Then these traces were modeled as execution sequences and then simplified so as to get simplified sequences (SFS), followed by the extraction of patterns through pattern extraction (PE) algorithm from SFS. After that, evaluating indicators inner-importance and inter-importance were designed to measure the noteworthiness of functions in DNFM algorithm. Finally, these functions were sorted by their noteworthiness. Comparison and contrast were conducted on the experiment results from two traditional complex network-based node mining methods, namely PageRank and DegreeRank. The results show that the DNFM method can mine noteworthy functions in software effectively and precisely. PMID:28278276

  5. Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO).

    PubMed

    Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier

    2017-08-01

    A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type and organism) from over 1.3million GEO records. We examined the quality of well supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithm increases due of the decreasing of dimensionality of the unique values of these elements (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.

  6. Hybrid coexpression link similarity graph clustering for mining biological modules from multiple gene expression datasets.

    PubMed

    Salem, Saeed; Ozcaglar, Cagri

    2014-01-01

    Advances in genomic technologies have enabled the accumulation of vast amount of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression data, which suffers from spurious coexpression. We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on Human gene expression datasets show that the reported modules are functionally homogeneous as evident by their enrichment with biological process GO terms and KEGG pathways.

  7. Large Scale Frequent Pattern Mining using MPI One-Sided Model

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vishnu, Abhinav; Agarwal, Khushbu

    In this paper, we propose a work-stealing runtime --- Library for Work Stealing LibWS --- using MPI one-sided model for designing scalable FP-Growth --- {\\em de facto} frequent pattern mining algorithm --- on large scale systems. LibWS provides locality efficient and highly scalable work-stealing techniques for load balancing on a variety of data distributions. We also propose a novel communication algorithm for FP-growth data exchange phase, which reduces the communication complexity from state-of-the-art O(p) to O(f + p/f) for p processes and f frequent attributed-ids. FP-Growth is implemented using LibWS and evaluated on several work distributions and support counts. Anmore » experimental evaluation of the FP-Growth on LibWS using 4096 processes on an InfiniBand Cluster demonstrates excellent efficiency for several work distributions (87\\% efficiency for Power-law and 91% for Poisson). The proposed distributed FP-Tree merging algorithm provides 38x communication speedup on 4096 cores.« less

  8. Text Mining Metal-Organic Framework Papers.

    PubMed

    Park, Sanghoon; Kim, Baekjun; Choi, Sihoon; Boyd, Peter G; Smit, Berend; Kim, Jihan

    2018-02-26

    We have developed a simple text mining algorithm that allows us to identify surface area and pore volumes of metal-organic frameworks (MOFs) using manuscript html files as inputs. The algorithm searches for common units (e.g., m 2 /g, cm 3 /g) associated with these two quantities to facilitate the search. From the sample set data of over 200 MOFs, the algorithm managed to identify 90% and 88.8% of the correct surface area and pore volume values. Further application to a test set of randomly chosen MOF html files yielded 73.2% and 85.1% accuracies for the two respective quantities. Most of the errors stem from unorthodox sentence structures that made it difficult to identify the correct data as well as bolded notations of MOFs (e.g., 1a) that made it difficult identify its real name. These types of tools will become useful when it comes to discovering structure-property relationships among MOFs as well as collecting a large set of data for references.

  9. Text grouping in patent analysis using adaptive K-means clustering algorithm

    NASA Astrophysics Data System (ADS)

    Shanie, Tiara; Suprijadi, Jadi; Zulhanif

    2017-03-01

    Patents are one of the Intellectual Property. Analyzing patent is one requirement in knowing well the development of technology in each country and in the world now. This study uses the patent document coming from the Espacenet server about Green Tea. Patent documents related to the technology in the field of tea is still widespread, so it will be difficult for users to information retrieval (IR). Therefore, it is necessary efforts to categorize documents in a specific group of related terms contained therein. This study uses titles patent text data with the proposed Green Tea in Statistical Text Mining methods consists of two phases: data preparation and data analysis stage. The data preparation phase uses Text Mining methods and data analysis stage is done by statistics. Statistical analysis in this study using a cluster analysis algorithm, the Adaptive K-Means Clustering Algorithm. Results from this study showed that based on the maximum value Silhouette, generate 87 clusters associated fifteen terms therein that can be utilized in the process of information retrieval needs.

  10. Text mining applied to electronic cardiovascular procedure reports to identify patients with trileaflet aortic stenosis and coronary artery disease.

    PubMed

    Small, Aeron M; Kiss, Daniel H; Zlatsin, Yevgeny; Birtwell, David L; Williams, Heather; Guerraty, Marie A; Han, Yuchi; Anwaruddin, Saif; Holmes, John H; Chirinos, Julio A; Wilensky, Robert L; Giri, Jay; Rader, Daniel J

    2017-08-01

    Interrogation of the electronic health record (EHR) using billing codes as a surrogate for diagnoses of interest has been widely used for clinical research. However, the accuracy of this methodology is variable, as it reflects billing codes rather than severity of disease, and depends on the disease and the accuracy of the coding practitioner. Systematic application of text mining to the EHR has had variable success for the detection of cardiovascular phenotypes. We hypothesize that the application of text mining algorithms to cardiovascular procedure reports may be a superior method to identify patients with cardiovascular conditions of interest. We adapted the Oracle product Endeca, which utilizes text mining to identify terms of interest from a NoSQL-like database, for purposes of searching cardiovascular procedure reports and termed the tool "PennSeek". We imported 282,569 echocardiography reports representing 81,164 individuals and 27,205 cardiac catheterization reports representing 14,567 individuals from non-searchable databases into PennSeek. We then applied clinical criteria to these reports in PennSeek to identify patients with trileaflet aortic stenosis (TAS) and coronary artery disease (CAD). Accuracy of patient identification by text mining through PennSeek was compared with ICD-9 billing codes. Text mining identified 7115 patients with TAS and 9247 patients with CAD. ICD-9 codes identified 8272 patients with TAS and 6913 patients with CAD. 4346 patients with AS and 6024 patients with CAD were identified by both approaches. A randomly selected sample of 200-250 patients uniquely identified by text mining was compared with 200-250 patients uniquely identified by billing codes for both diseases. We demonstrate that text mining was superior, with a positive predictive value (PPV) of 0.95 compared to 0.53 by ICD-9 for TAS, and a PPV of 0.97 compared to 0.86 for CAD. These results highlight the superiority of text mining algorithms applied to electronic cardiovascular procedure reports in the identification of phenotypes of interest for cardiovascular research. Copyright © 2017. Published by Elsevier Inc.

  11. Application of XGBoost algorithm in hourly PM2.5 concentration prediction

    NASA Astrophysics Data System (ADS)

    Pan, Bingyue

    2018-02-01

    In view of prediction techniques of hourly PM2.5 concentration in China, this paper applied the XGBoost(Extreme Gradient Boosting) algorithm to predict hourly PM2.5 concentration. The monitoring data of air quality in Tianjin city was analyzed by using XGBoost algorithm. The prediction performance of the XGBoost method is evaluated by comparing observed and predicted PM2.5 concentration using three measures of forecast accuracy. The XGBoost method is also compared with the random forest algorithm, multiple linear regression, decision tree regression and support vector machines for regression models using computational results. The results demonstrate that the XGBoost algorithm outperforms other data mining methods.

  12. Inferring Intra-Community Microbial Interaction Patterns from Metagenomic Datasets Using Associative Rule Mining Techniques

    PubMed Central

    Mande, Sharmila S.

    2016-01-01

    The nature of inter-microbial metabolic interactions defines the stability of microbial communities residing in any ecological niche. Deciphering these interaction patterns is crucial for understanding the mode/mechanism(s) through which an individual microbial community transitions from one state to another (e.g. from a healthy to a diseased state). Statistical correlation techniques have been traditionally employed for mining microbial interaction patterns from taxonomic abundance data corresponding to a given microbial community. In spite of their efficiency, these correlation techniques can capture only 'pair-wise interactions'. Moreover, their emphasis on statistical significance can potentially result in missing out on several interactions that are relevant from a biological standpoint. This study explores the applicability of one of the earliest association rule mining algorithm i.e. the 'Apriori algorithm' for deriving 'microbial association rules' from the taxonomic profile of given microbial community. The classical Apriori approach derives association rules by analysing patterns of co-occurrence/co-exclusion between various '(subsets of) features/items' across various samples. Using real-world microbiome data, the efficiency/utility of this rule mining approach in deciphering multiple (biologically meaningful) association patterns between 'subsets/subgroups' of microbes (constituting microbiome samples) is demonstrated. As an example, association rules derived from publicly available gut microbiome datasets indicate an association between a group of microbes (Faecalibacterium, Dorea, and Blautia) that are known to have mutualistic metabolic associations among themselves. Application of the rule mining approach on gut microbiomes (sourced from the Human Microbiome Project) further indicated similar microbial association patterns in gut microbiomes irrespective of the gender of the subjects. A Linux implementation of the Association Rule Mining (ARM) software (customised for deriving 'microbial association rules' from microbiome data) is freely available for download from the following link: http://metagenomics.atc.tcs.com/arm. PMID:27124399

  13. Predicting missing values in a home care database using an adaptive uncertainty rule method.

    PubMed

    Konias, S; Gogou, G; Bamidis, P D; Vlahavas, I; Maglaveras, N

    2005-01-01

    Contemporary literature illustrates an abundance of adaptive algorithms for mining association rules. However, most literature is unable to deal with the peculiarities, such as missing values and dynamic data creation, that are frequently encountered in fields like medicine. This paper proposes an uncertainty rule method that uses an adaptive threshold for filling missing values in newly added records. A new approach for mining uncertainty rules and filling missing values is proposed, which is in turn particularly suitable for dynamic databases, like the ones used in home care systems. In this study, a new data mining method named FiMV (Filling Missing Values) is illustrated based on the mined uncertainty rules. Uncertainty rules have quite a similar structure to association rules and are extracted by an algorithm proposed in previous work, namely AURG (Adaptive Uncertainty Rule Generation). The main target was to implement an appropriate method for recovering missing values in a dynamic database, where new records are continuously added, without needing to specify any kind of thresholds beforehand. The method was applied to a home care monitoring system database. Randomly, multiple missing values for each record's attributes (rate 5-20% by 5% increments) were introduced in the initial dataset. FiMV demonstrated 100% completion rates with over 90% success in each case, while usual approaches, where all records with missing values are ignored or thresholds are required, experienced significantly reduced completion and success rates. It is concluded that the proposed method is appropriate for the data-cleaning step of the Knowledge Discovery process in databases. The latter, containing much significance for the output efficiency of any data mining technique, can improve the quality of the mined information.

  14. Efficient discovery of risk patterns in medical data.

    PubMed

    Li, Jiuyong; Fu, Ada Wai-chee; Fahey, Paul

    2009-01-01

    This paper studies a problem of efficiently discovering risk patterns in medical data. Risk patterns are defined by a statistical metric, relative risk, which has been widely used in epidemiological research. To avoid fruitless search in the complete exploration of risk patterns, we define optimal risk pattern set to exclude superfluous patterns, i.e. complicated patterns with lower relative risk than their corresponding simpler form patterns. We prove that mining optimal risk pattern sets conforms an anti-monotone property that supports an efficient mining algorithm. We propose an efficient algorithm for mining optimal risk pattern sets based on this property. We also propose a hierarchical structure to present discovered patterns for the easy perusal by domain experts. The proposed approach is compared with two well-known rule discovery methods, decision tree and association rule mining approaches on benchmark data sets and applied to a real world application. The proposed method discovers more and better quality risk patterns than a decision tree approach. The decision tree method is not designed for such applications and is inadequate for pattern exploring. The proposed method does not discover a large number of uninteresting superfluous patterns as an association mining approach does. The proposed method is more efficient than an association rule mining method. A real world case study shows that the method reveals some interesting risk patterns to medical practitioners. The proposed method is an efficient approach to explore risk patterns. It quickly identifies cohorts of patients that are vulnerable to a risk outcome from a large data set. The proposed method is useful for exploratory study on large medical data to generate and refine hypotheses. The method is also useful for designing medical surveillance systems.

  15. Inferring Intra-Community Microbial Interaction Patterns from Metagenomic Datasets Using Associative Rule Mining Techniques.

    PubMed

    Tandon, Disha; Haque, Mohammed Monzoorul; Mande, Sharmila S

    2016-01-01

    The nature of inter-microbial metabolic interactions defines the stability of microbial communities residing in any ecological niche. Deciphering these interaction patterns is crucial for understanding the mode/mechanism(s) through which an individual microbial community transitions from one state to another (e.g. from a healthy to a diseased state). Statistical correlation techniques have been traditionally employed for mining microbial interaction patterns from taxonomic abundance data corresponding to a given microbial community. In spite of their efficiency, these correlation techniques can capture only 'pair-wise interactions'. Moreover, their emphasis on statistical significance can potentially result in missing out on several interactions that are relevant from a biological standpoint. This study explores the applicability of one of the earliest association rule mining algorithm i.e. the 'Apriori algorithm' for deriving 'microbial association rules' from the taxonomic profile of given microbial community. The classical Apriori approach derives association rules by analysing patterns of co-occurrence/co-exclusion between various '(subsets of) features/items' across various samples. Using real-world microbiome data, the efficiency/utility of this rule mining approach in deciphering multiple (biologically meaningful) association patterns between 'subsets/subgroups' of microbes (constituting microbiome samples) is demonstrated. As an example, association rules derived from publicly available gut microbiome datasets indicate an association between a group of microbes (Faecalibacterium, Dorea, and Blautia) that are known to have mutualistic metabolic associations among themselves. Application of the rule mining approach on gut microbiomes (sourced from the Human Microbiome Project) further indicated similar microbial association patterns in gut microbiomes irrespective of the gender of the subjects. A Linux implementation of the Association Rule Mining (ARM) software (customised for deriving 'microbial association rules' from microbiome data) is freely available for download from the following link: http://metagenomics.atc.tcs.com/arm.

  16. Unsupervised classification of multivariate geostatistical data: Two algorithms

    NASA Astrophysics Data System (ADS)

    Romary, Thomas; Ors, Fabien; Rivoirard, Jacques; Deraisme, Jacques

    2015-12-01

    With the increasing development of remote sensing platforms and the evolution of sampling facilities in mining and oil industry, spatial datasets are becoming increasingly large, inform a growing number of variables and cover wider and wider areas. Therefore, it is often necessary to split the domain of study to account for radically different behaviors of the natural phenomenon over the domain and to simplify the subsequent modeling step. The definition of these areas can be seen as a problem of unsupervised classification, or clustering, where we try to divide the domain into homogeneous domains with respect to the values taken by the variables in hand. The application of classical clustering methods, designed for independent observations, does not ensure the spatial coherence of the resulting classes. Image segmentation methods, based on e.g. Markov random fields, are not adapted to irregularly sampled data. Other existing approaches, based on mixtures of Gaussian random functions estimated via the expectation-maximization algorithm, are limited to reasonable sample sizes and a small number of variables. In this work, we propose two algorithms based on adaptations of classical algorithms to multivariate geostatistical data. Both algorithms are model free and can handle large volumes of multivariate, irregularly spaced data. The first one proceeds by agglomerative hierarchical clustering. The spatial coherence is ensured by a proximity condition imposed for two clusters to merge. This proximity condition relies on a graph organizing the data in the coordinates space. The hierarchical algorithm can then be seen as a graph-partitioning algorithm. Following this interpretation, a spatial version of the spectral clustering algorithm is also proposed. The performances of both algorithms are assessed on toy examples and a mining dataset.

  17. VRLane: a desktop virtual safety management program for underground coal mine

    NASA Astrophysics Data System (ADS)

    Li, Mei; Chen, Jingzhu; Xiong, Wei; Zhang, Pengpeng; Wu, Daozheng

    2008-10-01

    VR technologies, which generate immersive, interactive, and three-dimensional (3D) environments, are seldom applied to coal mine safety work management. In this paper, a new method that combined the VR technologies with underground mine safety management system was explored. A desktop virtual safety management program for underground coal mine, called VRLane, was developed. The paper mainly concerned about the current research advance in VR, system design, key techniques and system application. Two important techniques were introduced in the paper. Firstly, an algorithm was designed and implemented, with which the 3D laneway models and equipment models can be built on the basis of the latest mine 2D drawings automatically, whereas common VR programs established 3D environment by using 3DS Max or the other 3D modeling software packages with which laneway models were built manually and laboriously. Secondly, VRLane realized system integration with underground industrial automation. VRLane not only described a realistic 3D laneway environment, but also described the status of the coal mining, with functions of displaying the run states and related parameters of equipment, per-alarming the abnormal mining events, and animating mine cars, mine workers, or long-wall shearers. The system, with advantages of cheap, dynamic, easy to maintenance, provided a useful tool for safety production management in coal mine.

  18. Integration of Landsat-based disturbance maps in the Landscape Change Monitoring System (LCMS)

    NASA Astrophysics Data System (ADS)

    Healey, S. P.; Cohen, W. B.; Eidenshink, J. C.; Hernandez, A. J.; Huang, C.; Kennedy, R. E.; Moisen, G. G.; Schroeder, T. A.; Stehman, S.; Steinwand, D.; Vogelmann, J. E.; Woodcock, C.; Yang, L.; Yang, Z.; Zhu, Z.

    2013-12-01

    Land cover change can have a profound effect upon an area's natural resources and its role in biogeochemical and hydrological cycles. Many land cover changes processes are sensitive to climate, including: fire; storm damage, and insect activity. Monitoring of both past and ongoing land cover change is critical, particularly as we try to understand the impact of a changing climate on the natural systems we manage. The Landsat series of satellites, which initially launched in 1972, has allowed land observation at spatial and spectral resolutions appropriate for identification of many types of land cover change. Over the years, and particularly since the opening of the Landsat archive in 2008, many approaches have been developed to meet individual monitoring needs. Algorithms vary by the cover type targeted, the rate of change sought, and the period between observations. The Landscape Change Monitoring System (LCMS) is envisioned as a sustained, inter-agency monitoring program that brings together and operationally provides the best available land cover change maps over the United States. Expanding upon the successful USGS/Forest Service Monitoring Trends in Burn Severity project, LCMS is designed to serve a variety of research and management communities. The LCMS Science Team is currently assessing the relative strengths of a variety of leading change detection approaches, primarily emphasizing Landsat observations. Using standardized image pre-processing methods, maps produced by these algorithms have been compared at intensive validation sites across the country. Additionally, LCMS has taken steps toward a data-mining framework, in which ensembles of algorithm outputs are used with non-parametric models to create integrated predictions of change across a variety of scenarios and change dynamics. We present initial findings from the LCMS Science Team, including validation results from individual algorithms and assessment of initial 'integrated' products from the data-mining framework. It is anticipated that these results will directly impact land change information that will in the future be routinely available across the country through LCMS. With a baseline observation period of more than 40 years and a national scope, this data should shed light upon how trends in disturbance may be linked to climatic changes.

  19. Distributed data mining on grids: services, tools, and applications.

    PubMed

    Cannataro, Mario; Congiusta, Antonio; Pugliese, Andrea; Talia, Domenico; Trunfio, Paolo

    2004-12-01

    Data mining algorithms are widely used today for the analysis of large corporate and scientific datasets stored in databases and data archives. Industry, science, and commerce fields often need to analyze very large datasets maintained over geographically distributed sites by using the computational power of distributed and parallel systems. The grid can play a significant role in providing an effective computational support for distributed knowledge discovery applications. For the development of data mining applications on grids we designed a system called Knowledge Grid. This paper describes the Knowledge Grid framework and presents the toolset provided by the Knowledge Grid for implementing distributed knowledge discovery. The paper discusses how to design and implement data mining applications by using the Knowledge Grid tools starting from searching grid resources, composing software and data components, and executing the resulting data mining process on a grid. Some performance results are also discussed.

  20. Using remote sensing imagery to monitoring sea surface pollution cause by abandoned gold-copper mine

    NASA Astrophysics Data System (ADS)

    Kao, H. M.; Ren, H.; Lee, Y. T.

    2010-08-01

    The Chinkuashih Benshen mine was the largest gold-copper mine in Taiwan before the owner had abandoned the mine in 1987. However, even the mine had been closed, the mineral still interacts with rain and underground water and flowed into the sea. The polluted sea surface had appeared yellow, green and even white color, and the pollutants had carried by the coast current. In this study, we used the optical satellite images to monitoring the sea surface. Several image processing algorithms are employed especial the subpixel technique and linear mixture model to estimate the concentration of pollutants. The change detection approach is also applied to track them. We also conduct the chemical analysis of the polluted water to provide the ground truth validation. By the correlation analysis between the satellite observation and the ground truth chemical analysis, an effective approach to monitoring water pollution could be established.

  1. Diamond Eye: a distributed architecture for image data mining

    NASA Astrophysics Data System (ADS)

    Burl, Michael C.; Fowlkes, Charless; Roden, Joe; Stechert, Andre; Mukhtar, Saleem

    1999-02-01

    Diamond Eye is a distributed software architecture, which enables users (scientists) to analyze large image collections by interacting with one or more custom data mining servers via a Java applet interface. Each server is coupled with an object-oriented database and a computational engine, such as a network of high-performance workstations. The database provides persistent storage and supports querying of the 'mined' information. The computational engine provides parallel execution of expensive image processing, object recognition, and query-by-content operations. Key benefits of the Diamond Eye architecture are: (1) the design promotes trial evaluation of advanced data mining and machine learning techniques by potential new users (all that is required is to point a web browser to the appropriate URL), (2) software infrastructure that is common across a range of science mining applications is factored out and reused, and (3) the system facilitates closer collaborations between algorithm developers and domain experts.

  2. Land mine detection using multispectral image fusion

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Clark, G.A.; Sengupta, S.K.; Aimonetti, W.D.

    1995-03-29

    Our system fuses information contained in registered images from multiple sensors to reduce the effects of clutter and improve the ability to detect surface and buried land mines. The sensor suite currently consists of a camera that acquires images in six bands (400nm, 500nm, 600nm, 700nm, 800nm and 900nm). Past research has shown that it is extremely difficult to distinguish land mines from background clutter in images obtained from a single sensor. It is hypothesized, however, that information fused from a suite of various sensors is likely to provide better detection reliability, because the suite of sensors detects a varietymore » of physical properties that are more separable in feature space. The materials surrounding the mines can include natural materials (soil, rocks, foliage, water, etc.) and some artifacts. We use a supervised learning pattern recognition approach to detecting the metal and plastic land mines. The overall process consists of four main parts: Preprocessing, feature extraction, feature selection, and classification. These parts are used in a two step process to classify a subimage. We extract features from the images, and use feature selection algorithms to select only the most important features according to their contribution to correct detections. This allows us to save computational complexity and determine which of the spectral bands add value to the detection system. The most important features from the various sensors are fused using a supervised learning pattern classifier (the probabilistic neural network). We present results of experiments to detect land mines from real data collected from an airborne platform, and evaluate the usefulness of fusing feature information from multiple spectral bands.« less

  3. An Approach to Realizing Process Control for Underground Mining Operations of Mobile Machines

    PubMed Central

    Song, Zhen; Schunnesson, Håkan; Rinne, Mikael; Sturgul, John

    2015-01-01

    The excavation and production in underground mines are complicated processes which consist of many different operations. The process of underground mining is considerably constrained by the geometry and geology of the mine. The various mining operations are normally performed in series at each working face. The delay of a single operation will lead to a domino effect, thus delay the starting time for the next process and the completion time of the entire process. This paper presents a new approach to the process control for underground mining operations, e.g. drilling, bolting, mucking. This approach can estimate the working time and its probability for each operation more efficiently and objectively by improving the existing PERT (Program Evaluation and Review Technique) and CPM (Critical Path Method). If the delay of the critical operation (which is on a critical path) inevitably affects the productivity of mined ore, the approach can rapidly assign mucking machines new jobs to increase this amount at a maximum level by using a new mucking algorithm under external constraints. PMID:26062092

  4. An Approach to Realizing Process Control for Underground Mining Operations of Mobile Machines.

    PubMed

    Song, Zhen; Schunnesson, Håkan; Rinne, Mikael; Sturgul, John

    2015-01-01

    The excavation and production in underground mines are complicated processes which consist of many different operations. The process of underground mining is considerably constrained by the geometry and geology of the mine. The various mining operations are normally performed in series at each working face. The delay of a single operation will lead to a domino effect, thus delay the starting time for the next process and the completion time of the entire process. This paper presents a new approach to the process control for underground mining operations, e.g. drilling, bolting, mucking. This approach can estimate the working time and its probability for each operation more efficiently and objectively by improving the existing PERT (Program Evaluation and Review Technique) and CPM (Critical Path Method). If the delay of the critical operation (which is on a critical path) inevitably affects the productivity of mined ore, the approach can rapidly assign mucking machines new jobs to increase this amount at a maximum level by using a new mucking algorithm under external constraints.

  5. A Comparative Study of Classification and Regression Algorithms for Modelling Students' Academic Performance

    ERIC Educational Resources Information Center

    Strecht, Pedro; Cruz, Luís; Soares, Carlos; Mendes-Moreira, João; Abreu, Rui

    2015-01-01

    Predicting the success or failure of a student in a course or program is a problem that has recently been addressed using data mining techniques. In this paper we evaluate some of the most popular classification and regression algorithms on this problem. We address two problems: prediction of approval/failure and prediction of grade. The former is…

  6. Clustering Educational Digital Library Usage Data: A Comparison of Latent Class Analysis and K-Means Algorithms

    ERIC Educational Resources Information Center

    Xu, Beijie; Recker, Mimi; Qi, Xiaojun; Flann, Nicholas; Ye, Lei

    2013-01-01

    This article examines clustering as an educational data mining method. In particular, two clustering algorithms, the widely used K-means and the model-based Latent Class Analysis, are compared, using usage data from an educational digital library service, the Instructional Architect (IA.usu.edu). Using a multi-faceted approach and multiple data…

  7. Assessing Class-Wide Consistency and Randomness in Responses to True or False Questions Administered Online

    ERIC Educational Resources Information Center

    Pawl, Andrew; Teodorescu, Raluca E.; Peterson, Joseph D.

    2013-01-01

    We have developed simple data-mining algorithms to assess the consistency and the randomness of student responses to problems consisting of multiple true or false statements. In this paper we describe the algorithms and use them to analyze data from introductory physics courses. We investigate statements that emerge as outliers because the class…

  8. Data Mining for Anomaly Detection

    NASA Technical Reports Server (NTRS)

    Biswas, Gautam; Mack, Daniel; Mylaraswamy, Dinkar; Bharadwaj, Raj

    2013-01-01

    The Vehicle Integrated Prognostics Reasoner (VIPR) program describes methods for enhanced diagnostics as well as a prognostic extension to current state of art Aircraft Diagnostic and Maintenance System (ADMS). VIPR introduced a new anomaly detection function for discovering previously undetected and undocumented situations, where there are clear deviations from nominal behavior. Once a baseline (nominal model of operations) is established, the detection and analysis is split between on-aircraft outlier generation and off-aircraft expert analysis to characterize and classify events that may not have been anticipated by individual system providers. Offline expert analysis is supported by data curation and data mining algorithms that can be applied in the contexts of supervised learning methods and unsupervised learning. In this report, we discuss efficient methods to implement the Kolmogorov complexity measure using compression algorithms, and run a systematic empirical analysis to determine the best compression measure. Our experiments established that the combination of the DZIP compression algorithm and CiDM distance measure provides the best results for capturing relevant properties of time series data encountered in aircraft operations. This combination was used as the basis for developing an unsupervised learning algorithm to define "nominal" flight segments using historical flight segments.

  9. Literature mining, gene-set enrichment and pathway analysis for target identification in Behçet's disease.

    PubMed

    Wilson, Paul; Larminie, Christopher; Smith, Rona

    2016-01-01

    To use literature mining to catalogue Behçet's associated genes, and advanced computational methods to improve the understanding of the pathways and signalling mechanisms that lead to the typical clinical characteristics of Behçet's patients. To extend this technique to identify potential treatment targets for further experimental validation. Text mining methods combined with gene enrichment tools, pathway analysis and causal analysis algorithms. This approach identified 247 human genes associated with Behçet's disease and the resulting disease map, comprising 644 nodes and 19220 edges, captured important details of the relationships between these genes and their associated pathways, as described in diverse data repositories. Pathway analysis has identified how Behçet's associated genes are likely to participate in innate and adaptive immune responses. Causal analysis algorithms have identified a number of potential therapeutic strategies for further investigation. Computational methods have captured pertinent features of the prominent disease characteristics presented in Behçet's disease and have highlighted NOD2, ICOS and IL18 signalling as potential therapeutic strategies.

  10. Reducing side effects of hiding sensitive itemsets in privacy preserving data mining.

    PubMed

    Lin, Chun-Wei; Hong, Tzung-Pei; Hsu, Hung-Chuan

    2014-01-01

    Data mining is traditionally adopted to retrieve and analyze knowledge from large amounts of data. Private or confidential data may be sanitized or suppressed before it is shared or published in public. Privacy preserving data mining (PPDM) has thus become an important issue in recent years. The most general way of PPDM is to sanitize the database to hide the sensitive information. In this paper, a novel hiding-missing-artificial utility (HMAU) algorithm is proposed to hide sensitive itemsets through transaction deletion. The transaction with the maximal ratio of sensitive to nonsensitive one is thus selected to be entirely deleted. Three side effects of hiding failures, missing itemsets, and artificial itemsets are considered to evaluate whether the transactions are required to be deleted for hiding sensitive itemsets. Three weights are also assigned as the importance to three factors, which can be set according to the requirement of users. Experiments are then conducted to show the performance of the proposed algorithm in execution time, number of deleted transactions, and number of side effects.

  11. Reducing Side Effects of Hiding Sensitive Itemsets in Privacy Preserving Data Mining

    PubMed Central

    Lin, Chun-Wei; Hong, Tzung-Pei; Hsu, Hung-Chuan

    2014-01-01

    Data mining is traditionally adopted to retrieve and analyze knowledge from large amounts of data. Private or confidential data may be sanitized or suppressed before it is shared or published in public. Privacy preserving data mining (PPDM) has thus become an important issue in recent years. The most general way of PPDM is to sanitize the database to hide the sensitive information. In this paper, a novel hiding-missing-artificial utility (HMAU) algorithm is proposed to hide sensitive itemsets through transaction deletion. The transaction with the maximal ratio of sensitive to nonsensitive one is thus selected to be entirely deleted. Three side effects of hiding failures, missing itemsets, and artificial itemsets are considered to evaluate whether the transactions are required to be deleted for hiding sensitive itemsets. Three weights are also assigned as the importance to three factors, which can be set according to the requirement of users. Experiments are then conducted to show the performance of the proposed algorithm in execution time, number of deleted transactions, and number of side effects. PMID:24982932

  12. Activity Recognition for Personal Time Management

    NASA Astrophysics Data System (ADS)

    Prekopcsák, Zoltán; Soha, Sugárka; Henk, Tamás; Gáspár-Papanek, Csaba

    We describe an accelerometer based activity recognition system for mobile phones with a special focus on personal time management. We compare several data mining algorithms for the automatic recognition task in the case of single user and multiuser scenario, and improve accuracy with heuristics and advanced data mining methods. The results show that daily activities can be recognized with high accuracy and the integration with the RescueTime software can give good insights for personal time management.

  13. Evolutionary Data Mining Approach to Creating Digital Logic

    DTIC Science & Technology

    2010-01-01

    To deal with this problem a genetic program (GP) based data mining ( DM ) procedure has been invented (Smith 2005). A genetic program is an algorithm...that can operate on the variables. When a GP was used as a DM function in the past to automatically create fuzzy decision trees, the Report...rules represents an approach to the determining the effect of linguistic imprecision, i.e., the inability of experts to provide crisp rules. The

  14. Knowledge-guided mutation in classification rules for autism treatment efficacy.

    PubMed

    Engle, Kelley; Rada, Roy

    2017-03-01

    Data mining methods in biomedical research might benefit by combining genetic algorithms with domain-specific knowledge. The objective of this research is to show how the evolution of treatment rules for autism might be guided. The semantic distance between two concepts in the taxonomy is measured by the number of relationships separating the concepts in the taxonomy. The hypothesis is that replacing a concept in a treatment rule will change the accuracy of the rule in direct proportion to the semantic distance between the concepts. The method uses a patient database and autism taxonomies. Treatment rules are developed with an algorithm that exploits the taxonomies. The results support the hypothesis. This research should both advance the understanding of autism data mining in particular and of knowledge-guided evolutionary search in biomedicine in general.

  15. Effective search for stable segregation configurations at grain boundaries with data-mining techniques

    NASA Astrophysics Data System (ADS)

    Kiyohara, Shin; Mizoguchi, Teruyasu

    2018-03-01

    Grain boundary segregation of dopants plays a crucial role in materials properties. To investigate the dopant segregation behavior at the grain boundary, an enormous number of combinations have to be considered in the segregation of multiple dopants at the complex grain boundary structures. Here, two data mining techniques, the random-forests regression and the genetic algorithm, were applied to determine stable segregation sites at grain boundaries efficiently. Using the random-forests method, a predictive model was constructed from 2% of the segregation configurations and it has been shown that this model could determine the stable segregation configurations. Furthermore, the genetic algorithm also successfully determined the most stable segregation configuration with great efficiency. We demonstrate that these approaches are quite effective to investigate the dopant segregation behaviors at grain boundaries.

  16. Reducing The Risk Of Fires In Conveyor Transport

    NASA Astrophysics Data System (ADS)

    Cheremushkina, M. S.; Poddubniy, D. A.

    2017-01-01

    The paper deals with the actual problem of increasing the safety of operation of belt conveyors in mines. Was developed the control algorithm that meets the technical requirements of the mine belt conveyors, reduces the risk of fires of conveyors belt, and enables energy and resource savings taking into account random sort of traffic. The most effective method of decision such tasks is the construction of control systems with the use of variable speed drives for asynchronous motors. Was designed the mathematical model of the system "variable speed multiengine drive - conveyor - control system of conveyors", that takes into account the dynamic processes occurring in the elements of the transport system, provides an assessment of the energy efficiency of application the developed algorithms, which allows to reduce the dynamic overload in the belt to (15-20)%.

  17. Using random forest for the risk assessment of coal-floor water inrush in Panjiayao Coal Mine, northern China

    NASA Astrophysics Data System (ADS)

    Zhao, Dekang; Wu, Qiang; Cui, Fangpeng; Xu, Hua; Zeng, Yifan; Cao, Yufei; Du, Yuanze

    2018-04-01

    Coal-floor water-inrush incidents account for a large proportion of coal mine disasters in northern China, and accurate risk assessment is crucial for safe coal production. A novel and promising assessment model for water inrush is proposed based on random forest (RF), which is a powerful intelligent machine-learning algorithm. RF has considerable advantages, including high classification accuracy and the capability to evaluate the importance of variables; in particularly, it is robust in dealing with the complicated and non-linear problems inherent in risk assessment. In this study, the proposed model is applied to Panjiayao Coal Mine, northern China. Eight factors were selected as evaluation indices according to systematic analysis of the geological conditions and a field survey of the study area. Risk assessment maps were generated based on RF, and the probabilistic neural network (PNN) model was also used for risk assessment as a comparison. The results demonstrate that the two methods are consistent in the risk assessment of water inrush at the mine, and RF shows a better performance compared to PNN with an overall accuracy higher by 6.67%. It is concluded that RF is more practicable to assess the water-inrush risk than PNN. The presented method will be helpful in avoiding water inrush and also can be extended to various engineering applications.

  18. Stopping Antidepressants and Anxiolytics as Major Concerns Reported in Online Health Communities: A Text Mining Approach.

    PubMed

    Abbe, Adeline; Falissard, Bruno

    2017-10-23

    Internet is a particularly dynamic way to quickly capture the perceptions of a population in real time. Complementary to traditional face-to-face communication, online social networks help patients to improve self-esteem and self-help. The aim of this study was to use text mining on material from an online forum exploring patients' concerns about treatment (antidepressants and anxiolytics). Concerns about treatment were collected from discussion titles in patients' online community related to antidepressants and anxiolytics. To examine the content of these titles automatically, we used text mining methods, such as word frequency in a document-term matrix and co-occurrence of words using a network analysis. It was thus possible to identify topics discussed on the forum. The forum included 2415 discussions on antidepressants and anxiolytics over a period of 3 years. After a preprocessing step, the text mining algorithm identified the 99 most frequently occurring words in titles, among which were escitalopram, withdrawal, antidepressant, venlafaxine, paroxetine, and effect. Patients' concerns were related to antidepressant withdrawal, the need to share experience about symptoms, effects, and questions on weight gain with some drugs. Patients' expression on the Internet is a potential additional resource in addressing patients' concerns about treatment. Patient profiles are close to that of patients treated in psychiatry. ©Adeline Abbe, Bruno Falissard. Originally published in JMIR Mental Health (http://mental.jmir.org), 23.10.2017.

  19. Issues of data governance associated with data mining in medical research: experiences from an empirical study.

    PubMed

    Nahar, Jesmin; Imam, Tasadduq; Tickle, Kevin S; Garcia-Alonso, Debora

    2013-01-01

    This chapter is a review of data mining techniques used in medical research. It will cover the existing applications of these techniques in the identification of diseases, and also present the authors' research experiences in medical disease diagnosis and analysis. A computational diagnosis approach can have a significant impact on accurate diagnosis and result in time and cost effective solutions. The chapter will begin with an overview of computational intelligence concepts, followed by details on different classification algorithms. Use of association learning, a well recognised data mining procedure, will also be discussed. Many of the datasets considered in existing medical data mining research are imbalanced, and the chapter focuses on this issue as well. Lastly, the chapter outlines the need of data governance in this research domain.

  20. Attributed community mining using joint general non-negative matrix factorization with graph Laplacian

    NASA Astrophysics Data System (ADS)

    Chen, Zigang; Li, Lixiang; Peng, Haipeng; Liu, Yuhong; Yang, Yixian

    2018-04-01

    Community mining for complex social networks with link and attribute information plays an important role according to different application needs. In this paper, based on our proposed general non-negative matrix factorization (GNMF) algorithm without dimension matching constraints in our previous work, we propose the joint GNMF with graph Laplacian (LJGNMF) to implement community mining of complex social networks with link and attribute information according to different application needs. Theoretical derivation result shows that the proposed LJGNMF is fully compatible with previous methods of integrating traditional NMF and symmetric NMF. In addition, experimental results show that the proposed LJGNMF can meet the needs of different community minings by adjusting its parameters, and the effect is better than traditional NMF in the community vertices attributes entropy.

  1. Breast Imaging in the Era of Big Data: Structured Reporting and Data Mining.

    PubMed

    Margolies, Laurie R; Pandey, Gaurav; Horowitz, Eliot R; Mendelson, David S

    2016-02-01

    The purpose of this article is to describe structured reporting and the development of large databases for use in data mining in breast imaging. The results of millions of breast imaging examinations are reported with structured tools based on the BI-RADS lexicon. Much of these data are stored in accessible media. Robust computing power creates great opportunity for data scientists and breast imagers to collaborate to improve breast cancer detection and optimize screening algorithms. Data mining can create knowledge, but the questions asked and their complexity require extremely powerful and agile databases. New data technologies can facilitate outcomes research and precision medicine.

  2. K-Nearest Neighbor Algorithm Optimization in Text Categorization

    NASA Astrophysics Data System (ADS)

    Chen, Shufeng

    2018-01-01

    K-Nearest Neighbor (KNN) classification algorithm is one of the simplest methods of data mining. It has been widely used in classification, regression and pattern recognition. The traditional KNN method has some shortcomings such as large amount of sample computation and strong dependence on the sample library capacity. In this paper, a method of representative sample optimization based on CURE algorithm is proposed. On the basis of this, presenting a quick algorithm QKNN (Quick k-nearest neighbor) to find the nearest k neighbor samples, which greatly reduces the similarity calculation. The experimental results show that this algorithm can effectively reduce the number of samples and speed up the search for the k nearest neighbor samples to improve the performance of the algorithm.

  3. The improved Apriori algorithm based on matrix pruning and weight analysis

    NASA Astrophysics Data System (ADS)

    Lang, Zhenhong

    2018-04-01

    This paper uses the matrix compression algorithm and weight analysis algorithm for reference and proposes an improved matrix pruning and weight analysis Apriori algorithm. After the transactional database is scanned for only once, the algorithm will construct the boolean transaction matrix. Through the calculation of one figure in the rows and columns of the matrix, the infrequent item set is pruned, and a new candidate item set is formed. Then, the item's weight and the transaction's weight as well as the weight support for items are calculated, thus the frequent item sets are gained. The experimental result shows that the improved Apriori algorithm not only reduces the number of repeated scans of the database, but also improves the efficiency of data correlation mining.

  4. Zebra Crossing Spotter: Automatic Population of Spatial Databases for Increased Safety of Blind Travelers

    PubMed Central

    Ahmetovic, Dragan; Manduchi, Roberto; Coughlan, James M.; Mascetti, Sergio

    2016-01-01

    In this paper we propose a computer vision-based technique that mines existing spatial image databases for discovery of zebra crosswalks in urban settings. Knowing the location of crosswalks is critical for a blind person planning a trip that includes street crossing. By augmenting existing spatial databases (such as Google Maps or OpenStreetMap) with this information, a blind traveler may make more informed routing decisions, resulting in greater safety during independent travel. Our algorithm first searches for zebra crosswalks in satellite images; all candidates thus found are validated against spatially registered Google Street View images. This cascaded approach enables fast and reliable discovery and localization of zebra crosswalks in large image datasets. While fully automatic, our algorithm could also be complemented by a final crowdsourcing validation stage for increased accuracy. PMID:26824080

  5. Aggregated Indexing of Biomedical Time Series Data

    PubMed Central

    Woodbridge, Jonathan; Mortazavi, Bobak; Sarrafzadeh, Majid; Bui, Alex A.T.

    2016-01-01

    Remote and wearable medical sensing has the potential to create very large and high dimensional datasets. Medical time series databases must be able to efficiently store, index, and mine these datasets to enable medical professionals to effectively analyze data collected from their patients. Conventional high dimensional indexing methods are a two stage process. First, a superset of the true matches is efficiently extracted from the database. Second, supersets are pruned by comparing each of their objects to the query object and rejecting any objects falling outside a predetermined radius. This pruning stage heavily dominates the computational complexity of most conventional search algorithms. Therefore, indexing algorithms can be significantly improved by reducing the amount of pruning. This paper presents an online algorithm to aggregate biomedical times series data to significantly reduce the search space (index size) without compromising the quality of search results. This algorithm is built on the observation that biomedical time series signals are composed of cyclical and often similar patterns. This algorithm takes in a stream of segments and groups them to highly concentrated collections. Locality Sensitive Hashing (LSH) is used to reduce the overall complexity of the algorithm, allowing it to run online. The output of this aggregation is used to populate an index. The proposed algorithm yields logarithmic growth of the index (with respect to the total number of objects) while keeping sensitivity and specificity simultaneously above 98%. Both memory and runtime complexities of time series search are improved when using aggregated indexes. In addition, data mining tasks, such as clustering, exhibit runtimes that are orders of magnitudes faster when run on aggregated indexes. PMID:27617298

  6. DTFP-Growth: Dynamic Threshold-Based FP-Growth Rule Mining Algorithm Through Integrating Gene Expression, Methylation, and Protein-Protein Interaction Profiles.

    PubMed

    Mallik, Saurav; Bhadra, Tapas; Mukherji, Ayan; Mallik, Saurav; Bhadra, Tapas; Mukherji, Ayan; Mallik, Saurav; Bhadra, Tapas; Mukherji, Ayan

    2018-04-01

    Association rule mining is an important technique for identifying interesting relationships between gene pairs in a biological data set. Earlier methods basically work for a single biological data set, and, in maximum cases, a single minimum support cutoff can be applied globally, i.e., across all genesets/itemsets. To overcome this limitation, in this paper, we propose dynamic threshold-based FP-growth rule mining algorithm that integrates gene expression, methylation and protein-protein interaction profiles based on weighted shortest distance to find the novel associations among different pairs of genes in multi-view data sets. For this purpose, we introduce three new thresholds, namely, Distance-based Variable/Dynamic Supports (DVS), Distance-based Variable Confidences (DVC), and Distance-based Variable Lifts (DVL) for each rule by integrating co-expression, co-methylation, and protein-protein interactions existed in the multi-omics data set. We develop the proposed algorithm utilizing these three novel multiple threshold measures. In the proposed algorithm, the values of , , and are computed for each rule separately, and subsequently it is verified whether the support, confidence, and lift of each evolved rule are greater than or equal to the corresponding individual , , and values, respectively, or not. If all these three conditions for a rule are found to be true, the rule is treated as a resultant rule. One of the major advantages of the proposed method compared with other related state-of-the-art methods is that it considers both the quantitative and interactive significance among all pairwise genes belonging to each rule. Moreover, the proposed method generates fewer rules, takes less running time, and provides greater biological significance for the resultant top-ranking rules compared to previous methods.

  7. Real alerts and artifact classification in archived multi-signal vital sign monitoring data: implications for mining big data.

    PubMed

    Hravnak, Marilyn; Chen, Lujie; Dubrawski, Artur; Bose, Eliezer; Clermont, Gilles; Pinsky, Michael R

    2016-12-01

    Huge hospital information system databases can be mined for knowledge discovery and decision support, but artifact in stored non-invasive vital sign (VS) high-frequency data streams limits its use. We used machine-learning (ML) algorithms trained on expert-labeled VS data streams to automatically classify VS alerts as real or artifact, thereby "cleaning" such data for future modeling. 634 admissions to a step-down unit had recorded continuous noninvasive VS monitoring data [heart rate (HR), respiratory rate (RR), peripheral arterial oxygen saturation (SpO 2 ) at 1/20 Hz, and noninvasive oscillometric blood pressure (BP)]. Time data were across stability thresholds defined VS event epochs. Data were divided Block 1 as the ML training/cross-validation set and Block 2 the test set. Expert clinicians annotated Block 1 events as perceived real or artifact. After feature extraction, ML algorithms were trained to create and validate models automatically classifying events as real or artifact. The models were then tested on Block 2. Block 1 yielded 812 VS events, with 214 (26 %) judged by experts as artifact (RR 43 %, SpO 2 40 %, BP 15 %, HR 2 %). ML algorithms applied to the Block 1 training/cross-validation set (tenfold cross-validation) gave area under the curve (AUC) scores of 0.97 RR, 0.91 BP and 0.76 SpO 2 . Performance when applied to Block 2 test data was AUC 0.94 RR, 0.84 BP and 0.72 SpO 2 . ML-defined algorithms applied to archived multi-signal continuous VS monitoring data allowed accurate automated classification of VS alerts as real or artifact, and could support data mining for future model building.

  8. Mining biological information from 3D short time-series gene expression data: the OPTricluster algorithm.

    PubMed

    Tchagang, Alain B; Phan, Sieu; Famili, Fazel; Shearer, Heather; Fobert, Pierre; Huang, Yi; Zou, Jitao; Huang, Daiqing; Cutler, Adrian; Liu, Ziying; Pan, Youlian

    2012-04-04

    Nowadays, it is possible to collect expression levels of a set of genes from a set of biological samples during a series of time points. Such data have three dimensions: gene-sample-time (GST). Thus they are called 3D microarray gene expression data. To take advantage of the 3D data collected, and to fully understand the biological knowledge hidden in the GST data, novel subspace clustering algorithms have to be developed to effectively address the biological problem in the corresponding space. We developed a subspace clustering algorithm called Order Preserving Triclustering (OPTricluster), for 3D short time-series data mining. OPTricluster is able to identify 3D clusters with coherent evolution from a given 3D dataset using a combinatorial approach on the sample dimension, and the order preserving (OP) concept on the time dimension. The fusion of the two methodologies allows one to study similarities and differences between samples in terms of their temporal expression profile. OPTricluster has been successfully applied to four case studies: immune response in mice infected by malaria (Plasmodium chabaudi), systemic acquired resistance in Arabidopsis thaliana, similarities and differences between inner and outer cotyledon in Brassica napus during seed development, and to Brassica napus whole seed development. These studies showed that OPTricluster is robust to noise and is able to detect the similarities and differences between biological samples. Our analysis showed that OPTricluster generally outperforms other well known clustering algorithms such as the TRICLUSTER, gTRICLUSTER and K-means; it is robust to noise and can effectively mine the biological knowledge hidden in the 3D short time-series gene expression data.

  9. Efficient hiding of confidential high-utility itemsets with minimal side effects

    NASA Astrophysics Data System (ADS)

    Lin, Jerry Chun-Wei; Hong, Tzung-Pei; Fournier-Viger, Philippe; Liu, Qiankun; Wong, Jia-Wei; Zhan, Justin

    2017-11-01

    Privacy preserving data mining (PPDM) is an emerging research problem that has become critical in the last decades. PPDM consists of hiding sensitive information to ensure that it cannot be discovered by data mining algorithms. Several PPDM algorithms have been developed. Most of them are designed for hiding sensitive frequent itemsets or association rules. Hiding sensitive information in a database can have several side effects such as hiding other non-sensitive information and introducing redundant information. Finding the set of itemsets or transactions to be sanitised that minimises side effects is an NP-hard problem. In this paper, a genetic algorithm (GA) using transaction deletion is designed to hide sensitive high-utility itemsets for PPUM. A flexible fitness function with three adjustable weights is used to evaluate the goodness of each chromosome for hiding sensitive high-utility itemsets. To speed up the evolution process, the pre-large concept is adopted in the designed algorithm. It reduces the number of database scans required for verifying the goodness of an evaluated chromosome. Substantial experiments are conducted to compare the performance of the designed GA approach (with/without the pre-large concept), with a GA-based approach relying on transaction insertion and a non-evolutionary algorithm, in terms of execution time, side effects, database integrity and utility integrity. Results demonstrate that the proposed algorithm hides sensitive high-utility itemsets with fewer side effects than previous studies, while preserving high database and utility integrity.

  10. Real-time intelligent decision making with data mining

    NASA Astrophysics Data System (ADS)

    Gupta, Deepak P.; Gopalakrishnan, Bhaskaran

    2004-03-01

    Database mining, widely known as knowledge discovery and data mining (KDD), has attracted lot of attention in recent years. With the rapid growth of databases in commercial, industrial, administrative and other applications, it is necessary and interesting to extract knowledge automatically from huge amount of data. Almost all the organizations are generating data and information at an unprecedented rate and they need to get some useful information from this data. Data mining is the extraction of non-trivial, previously unknown and potentially useful patterns, trends, dependence and correlation known as association rules among data values in large databases. In last ten to fifteen years, data mining spread out from one company to the other to help them understand more about customers' aspect of quality and response and also distinguish the customers they want from those they do not. A credit-card company found that customers who complete their applications in pencil rather than pen are more likely to default. There is a program that identifies callers by purchase history. The bigger the spender, the quicker the call will be answered. If you feel your call is being answered in the order in which it was received, think again. Many algorithms assume that data is static in nature and mine the rules and relations in that data. But for a dynamic database e.g. in most of the manufacturing industries, the rules and relations thus developed among the variables/items no longer hold true. A simple approach may be to mine the associations among the variables after every fixed period of time. But again, how much the length of this period should be, is a question to be answered. The next problem with the static data mining is that some of the relationships that might be of interest from one period to the other may be lost after a new set of data is used. To reflect the effect of new data set and current status of the association rules where some of the strong rules might become weak and vice versa, there is a need to develop an efficient algorithm to adapt to the current patterns and associations. Some work has been done in developing the association rules for incremental database but to the best of the author"s knowledge no work has been done to do the same for periodic cause and effect analysis for online association rules in manufacturing industries. The present research attempts to answer these questions and develop an algorithm that can display the association rules online, find the periodic patterns in the data and detect the root cause of the problem.

  11. Development of Test Rig for Robotization of Mining Technological Processes - Oversized Rock Breaking Process Case

    NASA Astrophysics Data System (ADS)

    Pawel, Stefaniak; Jacek, Wodecki; Jakubiak, Janusz; Zimroz, Radoslaw

    2017-12-01

    Production chain (PCh) in underground copper ore mine consists of several subprocesses. From our perspective implementation of so called ZEPA approach (Zero Entry Production Area) might be very interesting [16]. In practice, it leads to automation/robotization of subprocesses in production area. In this paper was investigated a specific part of PCh i.e. a place when cyclic transport by LHDs is replaced with continuous transport by conveying system. Such place is called dumping point. The objective of dumping points with screen is primary classification of the material (into coarse and fine material) and breaking oversized rocks with hydraulic hammer. Current challenges for the underground mining include e.g. safety improvement as well as production optimization related to bottlenecks, stoppages and operational efficiency of the machines. As a first step, remote control of the hydraulic hammer has been introduced, which not only transferred the operator to safe workplace, but also allowed for more comfortable work environment and control over multiple technical objects by a single person. Today literature analysis shows that current mining industry around the world is oriented to automation and robotization of mining processes and reveals technological readiness for 4th industrial revolution. The paper is focused on preliminary analysis of possibilities for the use of the robotic system to rock-breaking process. Prototype test rig has been proposed and experimental works have been carried out. Automatic algorithms for detection of oversized rocks, crushing them as well as sweeping and loosening of material have been formulated. Obviously many simplifications have been assumed. Some near future works have been proposed.

  12. A Testbed Demonstration of an Intelligent Archive in a Knowledge Building System

    NASA Technical Reports Server (NTRS)

    Ramapriyan, Hampapuram; Isaac, David; Morse, Steve; Yang, Wenli; Bonnlander, Brian; McConaughy, Gail; Di, Liping; Danks, David

    2005-01-01

    The last decade's influx of raw data and derived geophysical parameters from several Earth observing satellites to NASA data centers has created a data-rich environment for Earth science research and applications. While advances in hardware and information management have made it possible to archive petabytes of data and distribute terabytes of data daily to a broad community of users, further progress is necessary in the transformation of data into information, and information into knowledge that can be used in particular applications in order to realize the full potential of these valuable datasets. In examining what is needed to enable this progress in the data provider environment that exists today and is expected to evolve in the next several years, we arrived at the concept of an Intelligent Archive in context of a Knowledge Building System (IA/KBS). Our prior work and associated papers investigated usage scenarios, required capabilities, system architecture, data volume issues, and supporting technologies. We identified six key capabilities of an IA/KBS: Virtual Product Generation, Significant Event Detection, Automated Data Quality Assessment, Large-Scale Data Mining, Dynamic Feedback Loop, and Data Discovery and Efficient Requesting. Among these capabilities, large-scale data mining is perceived by many in the community to be an area of technical risk. One of the main reasons for this is that standard data mining research and algorithms operate on datasets that are several orders of magnitude smaller than the actual sizes of datasets maintained by realistic earth science data archives. Therefore, we defined a test-bed activity to implement a large-scale data mining algorithm in a pseudo-operational scale environment and to examine any issues involved. The application chosen for applying the data mining algorithm is wildfire prediction over the continental U.S. This paper reports a number of observations based on our experience with this test-bed. While proof-of-concept for data mining scalability and utility has been a major goal for the research reported here, it was not the only one. The other five capabilities of an WKBS named above have been considered as well, and an assessment of the implications of our experience for these other areas will also be presented. The lessons learned through the testbed effort and presented in this paper will benefit technologists, scientists, and system operators as they consider introducing IA/KBS capabilities into production systems.

  13. Illustration Watermarking for Digital Images: An Investigation of Hierarchical Signal Inheritances for Nested Object-based Embedding

    DTIC Science & Technology

    2007-02-23

    approach for signal-level watermark inheritance. 15. SUBJECT TERMS EOARD, Steganography , Image Fusion, Data Mining, Image ...in watermarking algorithms , a program interface and protocol has been de - veloped, which allows control of the embedding and retrieval processes by the...watermarks in an image . Watermarking algorithm (DLL) Watermarking editor (Delphi) - User marks all objects: ci - class information oi - object instance

  14. Mining (except Oil and Gas) Sector (NAICS 212)

    EPA Pesticide Factsheets

    EPA Regulatory and enforcement information for the mining sector, including metal mining & nonmetallic mineral mining and quarrying. Includes information about asbestos, coal mining, mountaintop mining, Clean Water Act section 404, and abandoned mine lands

  15. Reconstruction of the inner structure of small scale mining waste dumps by combining GPR and ERTdata.

    NASA Astrophysics Data System (ADS)

    Kniess, Rudolf; Martin, Tina

    2015-04-01

    Two abandoned small waste dumps in the west of the Harz mountains (Germany) were analysed using ground penetrating radar (GPR) and electrical resistivity tomography (ERT). Aim of the project (ROBEHA, funded by the German Federal Ministry of Education and Research (033R105)) is the assessment of the recycling potential of the mining residues taking into account environmental risks of reworking the dump site. One task of the geophysical prospection is the investigation of the inner structure of the mining dump. This is important for the estimation of the approximate volume of potentially reusable mining deposits within the waste dump. The two investigated dump sites are different in age and therefore differ in their structure. The older residues (< 1930) consist of ore processing waste from density separation (stamp mill sand). The younger dump site descends from comprises slag dump waste. The layer of fine grained residues at the first dump site is less than 6 m thick and the slag layer is less than 2 m thick. Both sites are partially overlain by forest or grassland vegetation and characterized by topographical irregularities. Due to the inhomogeneity of the sites we applied electrical resistivity tomography (ERT) and ground penetrating radar (GPR) for detailed investigation. Using ERT we could distinguish various layers within the mining dumps. The resistivities of the dumped material differ from the bedrock resistivities at both sites. The GPR measurements show near surface layer boundaries down to 3 - 4 m. In consecutive campaigns 100 MHz and 200 MHz antennas were used. The GPR results (layer boundaries) were included into the ERT inversion algorithm to enable more precise and stable resistivity models. This needs some special preprocessing steps. The 3D-Position of every electrode from ERT measurement and the GPR antenna position on the surface require an accuracy of less than 1cm. At some points, the layer boundaries and radar wave velocities can be calibrated with borehole stratigraphic data from a mineralogical drilling campaign. This is important for a precise time-depth conversion of reflectors from GPR measurement. This reflectors were taken from radargram and have been adopted as resistivity boundary in the start model of the geoelectric inversion algorithm.

  16. Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads

    DOE PAGES

    Rosen, Gail L.; Polikar, Robi; Caseiro, Diamantino A.; ...

    2011-01-01

    High-throughput sequencing technologies enable metagenome profiling, simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data includes sequence fragments (“reads”) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but, composition-based methods, have not been adapted. We develop a detection technique that can discriminate between “known” and “unknown” taxa, which can be used with composition-based methods, as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for theirmore » ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an “unknown” class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially for the species-level and ultrashort reads. Finally, we evaluate theperformance of several algorithms on a real acid mine drainage dataset.« less

  17. Data Mining for ISHM of Liquid Rocket Propulsion Status Update

    NASA Technical Reports Server (NTRS)

    Srivastava, Ashok; Schwabacher, Mark; Oza, Nijunj; Martin, Rodney; Watson, Richard; Matthews, Bryan

    2006-01-01

    This document consists of presentation slides that review the current status of data mining to support the work with the Integrated Systems Health Management (ISHM) for the systems associated with Liquid Rocket Propulsion. The aim of this project is to have test stand data from Rocketdyne to design algorithms that will aid in the early detection of impending failures during operation. These methods will be extended and improved for future platforms (i.e., CEV/CLV).

  18. Application of Composite Indices for Improving Joint Detection Capabilities of Instrumented Roof Bolt Drills in Underground Mining and Construction

    NASA Astrophysics Data System (ADS)

    Liu, Wenpeng; Rostami, Jamal; Elsworth, Derek; Ray, Asok

    2018-03-01

    Roof bolts are the dominant method of ground support in mining and tunneling applications, and the concept of using drilling parameters from the bolter for ground characterization has been studied for a few decades. This refers to the use of drilling data to identify geological features in the ground including joints and voids, as well as rock classification. Rock mass properties, including distribution of joints/voids and strengths of rock layers, are critical factors for proper design of ground support to avoid instability. The goal of this research was to improve the capability and sensitivity of joint detection programs based on the updated pattern recognition algorithms in sensing joints with smaller than 3.175 mm (0.125 in.) aperture while reducing the number of false alarms, and discriminating rock layers with different strengths. A set of concrete blocks with different strengths were used to simulate various rock layers, where the gap between the blocks would represent the joints in laboratory tests. Data obtained from drilling through these blocks were analyzed to improve the reliability and precision of joint detection systems. While drilling parameters can be used to detect the gaps, due to low accuracy of the results, new composite indices have been introduced and used in the analysis to improve the detection rates. This paper briefly discusses ongoing research on joint detection by using drilling parameters collected from a roof bolter in a controlled environment. The performances of the new algorithms for joint detection are also examined by comparing their ability to identify existing joints and reducing false alarms.

  19. Automatable algorithms to identify nonmedical opioid use using electronic data: a systematic review.

    PubMed

    Canan, Chelsea; Polinski, Jennifer M; Alexander, G Caleb; Kowal, Mary K; Brennan, Troyen A; Shrank, William H

    2017-11-01

    Improved methods to identify nonmedical opioid use can help direct health care resources to individuals who need them. Automated algorithms that use large databases of electronic health care claims or records for surveillance are a potential means to achieve this goal. In this systematic review, we reviewed the utility, attempts at validation, and application of such algorithms to detect nonmedical opioid use. We searched PubMed and Embase for articles describing automatable algorithms that used electronic health care claims or records to identify patients or prescribers with likely nonmedical opioid use. We assessed algorithm development, validation, and performance characteristics and the settings where they were applied. Study variability precluded a meta-analysis. Of 15 included algorithms, 10 targeted patients, 2 targeted providers, 2 targeted both, and 1 identified medications with high abuse potential. Most patient-focused algorithms (67%) used prescription drug claims and/or medical claims, with diagnosis codes of substance abuse and/or dependence as the reference standard. Eleven algorithms were developed via regression modeling. Four used natural language processing, data mining, audit analysis, or factor analysis. Automated algorithms can facilitate population-level surveillance. However, there is no true gold standard for determining nonmedical opioid use. Users must recognize the implications of identifying false positives and, conversely, false negatives. Few algorithms have been applied in real-world settings. Automated algorithms may facilitate identification of patients and/or providers most likely to need more intensive screening and/or intervention for nonmedical opioid use. Additional implementation research in real-world settings would clarify their utility. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  20. Change detection over Sokolov open-pit mining area, Czech Republic, using multi-temporal HyMAP data (2009-2010)

    NASA Astrophysics Data System (ADS)

    Adar, S.; Notesco, G.; Brook, A.; Livne, I.; Rojik, P.; Kopacková, V.; Zelenkova, K.; Misurec, J.; Bourguignon, A.; Chevrel, S.; Ehrler, C.; Fisher, C.; Hanus, J.; Shkolnisky, Y.; Ben Dor, E.

    2011-11-01

    Two HyMap images acquired over the same lignite open-pit mining site in Sokolov, Czech Republic, during the summers of 2009 and 2010 (12 months apart), were investigated in this study. The site selected for this research is one of three test sites (the others being in South Africa and Kyrgyzstan) within the framework of the EO-MINERS FP7 Project (http://www.eo-miners.eu). The goal of EO-MINERS is to "integrate new and existing Earth Observation tools to improve best practice in mining activities and to reduce the mining related environmental and societal footprint". Accordingly, the main objective of the current study was to develop hyperspectral-based means for the detection of small spectral changes and to relate these changes to possible degradation or reclamation indicators of the area under investigation. To ensure significant detection of small spectral changes, the temporal domain was investigated along with careful generation of reflectance information. Thus, intensive spectroradiometric ground measurements were carried out to ensure calibration and validation aspects during both overflights. The performance of these corrections was assessed using the Quality Indicators setup developed under a different FP7 project-EUFAR (http://www.eufar.net), which helped select the highest quality data for further work. This approach allows direct distinction of the real information from noise. The reflectance images were used as input for the application of spectral-based change-detection algorithms and indices to account for small and reliable changes. The related algorithms were then developed and applied on a pixel-by-pixel basis to map spectral changes over the space of a year. Using field spectroscopy and ground truth measurements on both overpass dates, it was possible to explain the results and allocate spatial kinetic processes of the environmental changes during the time elapsed between the flights. It was found, for instance, that significant spectral changes are capable of revealing mineral processes, vegetation status and soil formation long before these are apparent to the naked eye. Further study is being conducted under the above initiative to extend this approach to other mining areas worldwide and to improve the robustness of the developed algorithm.

  1. Application of a Genetic Algorithm and Multi Agent System to Explore Emergent Patterns of Social Rationality and a Distress-Based Model for Deceit in the Workplace

    DTIC Science & Technology

    2008-06-01

    postponed the fulfillment of her own Masters Degree by at least 18 months so that I would have the opportunity to earn mine. She is smart , lovely...GENETIC ALGORITHM AND MULTI AGENT SYSTEM TO EXPLORE EMERGENT PATTERNS OF SOCIAL RATIONALITY AND A DISTRESS-BASED MODEL FOR DECEIT IN THE WORKPLACE...of a Genetic Algorithm and Mutli Agent System to Explore Emergent Patterns of Social Rationality and a Distress-Based Model for Deceit in the

  2. Leveraging Python Interoperability Tools to Improve Sapphire's Usability

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gezahegne, A; Love, N S

    2007-12-10

    The Sapphire project at the Center for Applied Scientific Computing (CASC) develops and applies an extensive set of data mining algorithms for the analysis of large data sets. Sapphire's algorithms are currently available as a set of C++ libraries. However many users prefer higher level scripting languages such as Python for their ease of use and flexibility. In this report, we evaluate four interoperability tools for the purpose of wrapping Sapphire's core functionality with Python. Exposing Sapphire's functionality through a Python interface would increase its usability and connect its algorithms to existing Python tools.

  3. Combined mining: discovering informative knowledge in complex data.

    PubMed

    Cao, Longbing; Zhang, Huaifeng; Zhao, Yanchang; Luo, Dan; Zhang, Chengqi

    2011-06-01

    Enterprise data mining applications often involve complex data such as multiple large heterogeneous data sources, user preferences, and business impact. In such situations, a single method or one-step mining is often limited in discovering informative knowledge. It would also be very time and space consuming, if not impossible, to join relevant large data sources for mining patterns consisting of multiple aspects of information. It is crucial to develop effective approaches for mining patterns combining necessary information from multiple relevant business lines, catering for real business settings and decision-making actions rather than just providing a single line of patterns. The recent years have seen increasing efforts on mining more informative patterns, e.g., integrating frequent pattern mining with classifications to generate frequent pattern-based classifiers. Rather than presenting a specific algorithm, this paper builds on our existing works and proposes combined mining as a general approach to mining for informative patterns combining components from either multiple data sets or multiple features or by multiple methods on demand. We summarize general frameworks, paradigms, and basic processes for multifeature combined mining, multisource combined mining, and multimethod combined mining. Novel types of combined patterns, such as incremental cluster patterns, can result from such frameworks, which cannot be directly produced by the existing methods. A set of real-world case studies has been conducted to test the frameworks, with some of them briefed in this paper. They identify combined patterns for informing government debt prevention and improving government service objectives, which show the flexibility and instantiation capability of combined mining in discovering informative knowledge in complex data.

  4. Prediction of toxic metals concentration using artificial intelligence techniques

    NASA Astrophysics Data System (ADS)

    Gholami, R.; Kamkar-Rouhani, A.; Doulati Ardejani, F.; Maleki, Sh.

    2011-12-01

    Groundwater and soil pollution are noted to be the worst environmental problem related to the mining industry because of the pyrite oxidation, and hence acid mine drainage generation, release and transport of the toxic metals. The aim of this paper is to predict the concentration of Ni and Fe using a robust algorithm named support vector machine (SVM). Comparison of the obtained results of SVM with those of the back-propagation neural network (BPNN) indicates that the SVM can be regarded as a proper algorithm for the prediction of toxic metals concentration due to its relative high correlation coefficient and the associated running time. As a matter of fact, the SVM method has provided a better prediction of the toxic metals Fe and Ni and resulted the running time faster compared with that of the BPNN.

  5. Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data

    PubMed Central

    Király, András; Abonyi, János

    2014-01-01

    During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is the limited applicability for very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and freely available for researchers. PMID:24616651

  6. Imaging and detection of mines from acoustic measurements

    NASA Astrophysics Data System (ADS)

    Witten, Alan J.; DiMarzio, Charles A.; Li, Wen; McKnight, Stephen W.

    1999-08-01

    A laboratory-scale acoustic experiment is described where a buried target, a hockey puck cut in half, is shallowly buried in a sand box. To avoid the need for source and receiver coupling to the host sand, an acoustic wave is generated in the subsurface by a pulsed laser suspended above the air-sand interface. Similarly, an airborne microphone is suspended above this interface and moved in unison with the laser. After some pre-processing of the data, reflections for the target, although weak, could clearly be identified. While the existence and location of the target can be determined by inspection of the data, its unique shape can not. Since target discrimination is important in mine detection, a 3D imaging algorithm was applied to the acquired acoustic data. This algorithm yielded a reconstructed image where the shape of the target was resolved.

  7. A novel association rule mining approach using TID intermediate itemset.

    PubMed

    Aqra, Iyad; Herawan, Tutut; Abdul Ghani, Norjihan; Akhunzada, Adnan; Ali, Akhtar; Bin Razali, Ramdan; Ilahi, Manzoor; Raymond Choo, Kim-Kwang

    2018-01-01

    Designing an efficient association rule mining (ARM) algorithm for multilevel knowledge-based transactional databases that is appropriate for real-world deployments is of paramount concern. However, dynamic decision making that needs to modify the threshold either to minimize or maximize the output knowledge certainly necessitates the extant state-of-the-art algorithms to rescan the entire database. Subsequently, the process incurs heavy computation cost and is not feasible for real-time applications. The paper addresses efficiently the problem of threshold dynamic updation for a given purpose. The paper contributes by presenting a novel ARM approach that creates an intermediate itemset and applies a threshold to extract categorical frequent itemsets with diverse threshold values. Thus, improving the overall efficiency as we no longer needs to scan the whole database. After the entire itemset is built, we are able to obtain real support without the need of rebuilding the itemset (e.g. Itemset list is intersected to obtain the actual support). Moreover, the algorithm supports to extract many frequent itemsets according to a pre-determined minimum support with an independent purpose. Additionally, the experimental results of our proposed approach demonstrate the capability to be deployed in any mining system in a fully parallel mode; consequently, increasing the efficiency of the real-time association rules discovery process. The proposed approach outperforms the extant state-of-the-art and shows promising results that reduce computation cost, increase accuracy, and produce all possible itemsets.

  8. A novel association rule mining approach using TID intermediate itemset

    PubMed Central

    Ali, Akhtar; Bin Razali, Ramdan; Ilahi, Manzoor; Raymond Choo, Kim-Kwang

    2018-01-01

    Designing an efficient association rule mining (ARM) algorithm for multilevel knowledge-based transactional databases that is appropriate for real-world deployments is of paramount concern. However, dynamic decision making that needs to modify the threshold either to minimize or maximize the output knowledge certainly necessitates the extant state-of-the-art algorithms to rescan the entire database. Subsequently, the process incurs heavy computation cost and is not feasible for real-time applications. The paper addresses efficiently the problem of threshold dynamic updation for a given purpose. The paper contributes by presenting a novel ARM approach that creates an intermediate itemset and applies a threshold to extract categorical frequent itemsets with diverse threshold values. Thus, improving the overall efficiency as we no longer needs to scan the whole database. After the entire itemset is built, we are able to obtain real support without the need of rebuilding the itemset (e.g. Itemset list is intersected to obtain the actual support). Moreover, the algorithm supports to extract many frequent itemsets according to a pre-determined minimum support with an independent purpose. Additionally, the experimental results of our proposed approach demonstrate the capability to be deployed in any mining system in a fully parallel mode; consequently, increasing the efficiency of the real-time association rules discovery process. The proposed approach outperforms the extant state-of-the-art and shows promising results that reduce computation cost, increase accuracy, and produce all possible itemsets. PMID:29351287

  9. Managing the Big Data Avalanche in Astronomy - Data Mining the Galaxy Zoo Classification Database

    NASA Astrophysics Data System (ADS)

    Borne, Kirk D.

    2014-01-01

    We will summarize a variety of data mining experiments that have been applied to the Galaxy Zoo database of galaxy classifications, which were provided by the volunteer citizen scientists. The goal of these exercises is to learn new and improved classification rules for diverse populations of galaxies, which can then be applied to much larger sky surveys of the future, such as the LSST (Large Synoptic Sky Survey), which is proposed to obtain detailed photometric data for approximately 20 billion galaxies. The massive Big Data that astronomy projects will generate in the future demand greater application of data mining and data science algorithms, as well as greater training of astronomy students in the skills of data mining and data science. The project described here has involved several graduate and undergraduate research assistants at George Mason University.

  10. Development of a data-mining algorithm to identify ages at reproductive milestones in electronic medical records.

    PubMed

    Malinowski, Jennifer; Farber-Eger, Eric; Crawford, Dana C

    2014-01-01

    Electronic medical records (EMRs) are becoming more widely implemented following directives from the federal government and incentives for supplemental reimbursements for Medicare and Medicaid claims. Replete with rich phenotypic data, EMRs offer a unique opportunity for clinicians and researchers to identify potential research cohorts and perform epidemiologic studies. Notable limitations to the traditional epidemiologic study include cost, time to complete the study, and limited ancestral diversity; EMR-based epidemiologic studies offer an alternative. The Epidemiologic Architecture for Genes Linked to Environment (EAGLE) Study, as part of the Population Architecture using Genomics and Epidemiology (PAGE) I Study, has genotyped more than 15,000 patients of diverse ancestry in BioVU, the Vanderbilt University Medical Center's biorepository linked to the EMR (EAGLE BioVU). We report here the development and performance of data-mining techniques used to identify the age at menarche (AM) and age at menopause (AAM), important milestones in the reproductive lifespan, in women from EAGLE BioVU for genetic association studies. In addition, we demonstrate the ability to discriminate age at naturally-occurring menopause (ANM) from medically-induced menopause. Unusual timing of these events may indicate underlying pathologies and increased risk for some complex diseases and cancer; however, they are not consistently recorded in the EMR. Our algorithm offers a mechanism by which to extract these data for clinical and research goals.

  11. Prediction of carbonate rock type from NMR responses using data mining techniques

    NASA Astrophysics Data System (ADS)

    Gonçalves, Eduardo Corrêa; da Silva, Pablo Nascimento; Silveira, Carla Semiramis; Carneiro, Giovanna; Domingues, Ana Beatriz; Moss, Adam; Pritchard, Tim; Plastino, Alexandre; Azeredo, Rodrigo Bagueira de Vasconcellos

    2017-05-01

    Recent studies have indicated that the accurate identification of carbonate rock types in a reservoir can be employed as a preliminary step to enhance the effectiveness of petrophysical property modeling. Furthermore, rock typing activity has been shown to be of key importance in several steps of formation evaluation, such as the study of sedimentary series, reservoir zonation and well-to-well correlation. In this paper, a methodology based exclusively on the analysis of 1H-NMR (Nuclear Magnetic Resonance) relaxation responses - using data mining algorithms - is evaluated to perform the automatic classification of carbonate samples according to their rock type. We analyze the effectiveness of six different classification algorithms (k-NN, Naïve Bayes, C4.5, Random Forest, SMO and Multilayer Perceptron) and two data preprocessing strategies (discretization and feature selection). The dataset used in this evaluation is formed by 78 1H-NMR T2 distributions of fully brine-saturated rock samples from six different rock type classes. The experiments reveal that the combination of preprocessing strategies with classification algorithms is able to achieve a prediction accuracy of 97.4%.

  12. The use of intelligent database systems in acute pancreatitis--a systematic review.

    PubMed

    van den Heever, Marc; Mittal, Anubhav; Haydock, Matthew; Windsor, John

    2014-01-01

    Acute pancreatitis (AP) is a complex disease with multiple aetiological factors, wide ranging severity, and multiple challenges to effective triage and management. Databases, data mining and machine learning algorithms (MLAs), including artificial neural networks (ANNs), may assist by storing and interpreting data from multiple sources, potentially improving clinical decision-making. 1) Identify database technologies used to store AP data, 2) collate and categorise variables stored in AP databases, 3) identify the MLA technologies, including ANNs, used to analyse AP data, and 4) identify clinical and non-clinical benefits and obstacles in establishing a national or international AP database. Comprehensive systematic search of online reference databases. The predetermined inclusion criteria were all papers discussing 1) databases, 2) data mining or 3) MLAs, pertaining to AP, independently assessed by two reviewers with conflicts resolved by a third author. Forty-three papers were included. Three data mining technologies and five ANN methodologies were reported in the literature. There were 187 collected variables identified. ANNs increase accuracy of severity prediction, one study showed ANNs had a sensitivity of 0.89 and specificity of 0.96 six hours after admission--compare APACHE II (cutoff score ≥8) with 0.80 and 0.85 respectively. Problems with databases were incomplete data, lack of clinical data, diagnostic reliability and missing clinical data. This is the first systematic review examining the use of databases, MLAs and ANNs in the management of AP. The clinical benefits these technologies have over current systems and other advantages to adopting them are identified. Copyright © 2013 IAP and EPC. Published by Elsevier B.V. All rights reserved.

  13. Abandoned Uranium Mine (AUM) Trust Mine Points, Navajo Nation, 2016, US EPA Region 9

    EPA Pesticide Factsheets

    This GIS dataset contains point features that represent mines included in the Navajo Environmental Response Trust. This mine category also includes Priority mines. USEPA and NNEPA prioritized mines based on gamma radiation levels, proximity to homes and potential for water contamination identified in the preliminary assessments. Attributes include mine names, reclaimed status, links to US EPA AUM reports, and the region in which the mine is located. This dataset contains 19 features.

  14. Gene Prioritization of Resistant Rice Gene against Xanthomas oryzae pv. oryzae by Using Text Mining Technologies

    PubMed Central

    Xia, Jingbo; Zhang, Xing; Yuan, Daojun; Chen, Lingling; Webster, Jonathan; Fang, Alex Chengyu

    2013-01-01

    To effectively assess the possibility of the unknown rice protein resistant to Xanthomonas oryzae pv. oryzae, a hybrid strategy is proposed to enhance gene prioritization by combining text mining technologies with a sequence-based approach. The text mining technique of term frequency inverse document frequency is used to measure the importance of distinguished terms which reflect biomedical activity in rice before candidate genes are screened and vital terms are produced. Afterwards, a built-in classifier under the chaos games representation algorithm is used to sieve the best possible candidate gene. Our experiment results show that the combination of these two methods achieves enhanced gene prioritization. PMID:24371834

  15. Gene prioritization of resistant rice gene against Xanthomas oryzae pv. oryzae by using text mining technologies.

    PubMed

    Xia, Jingbo; Zhang, Xing; Yuan, Daojun; Chen, Lingling; Webster, Jonathan; Fang, Alex Chengyu

    2013-01-01

    To effectively assess the possibility of the unknown rice protein resistant to Xanthomonas oryzae pv. oryzae, a hybrid strategy is proposed to enhance gene prioritization by combining text mining technologies with a sequence-based approach. The text mining technique of term frequency inverse document frequency is used to measure the importance of distinguished terms which reflect biomedical activity in rice before candidate genes are screened and vital terms are produced. Afterwards, a built-in classifier under the chaos games representation algorithm is used to sieve the best possible candidate gene. Our experiment results show that the combination of these two methods achieves enhanced gene prioritization.

  16. Mining and Querying Multimedia Data

    DTIC Science & Technology

    2011-09-29

    able to capture more subtle spatial variations such as repetitiveness. Local feature descriptors such as SIFT [74] and SURF [12] have also been widely...empirically set to s = 90%, r = 50%, K = 20, where small variations lead to little perturbation of the output. The pseudo-code of the algorithm is...by constructing a three-layer graph based on clustering outputs, and executing a slight variation of random walk with restart algorithm. It provided

  17. Diagnostic support for selected neuromuscular diseases using answer-pattern recognition and data mining techniques: a proof of concept multicenter prospective trial.

    PubMed

    Grigull, Lorenz; Lechner, Werner; Petri, Susanne; Kollewe, Katja; Dengler, Reinhard; Mehmecke, Sandra; Schumacher, Ulrike; Lücke, Thomas; Schneider-Gold, Christiane; Köhler, Cornelia; Güttsches, Anne-Katrin; Kortum, Xiaowei; Klawonn, Frank

    2016-03-08

    Diagnosis of neuromuscular diseases in primary care is often challenging. Rare diseases such as Pompe disease are easily overlooked by the general practitioner. We therefore aimed to develop a diagnostic support tool using patient-oriented questions and combined data mining algorithms recognizing answer patterns in individuals with selected neuromuscular diseases. A multicenter prospective study for the proof of concept was conducted thereafter. First, 16 interviews with patients were conducted focusing on their pre-diagnostic observations and experiences. From these interviews, we developed a questionnaire with 46 items. Then, patients with diagnosed neuromuscular diseases as well as patients without such a disease answered the questionnaire to establish a database for data mining. For proof of concept, initially only six diagnoses were chosen (myotonic dystrophy and myotonia (MdMy), Pompe disease (MP), amyotrophic lateral sclerosis (ALS), polyneuropathy (PNP), spinal muscular atrophy (SMA), other neuromuscular diseases, and no neuromuscular disease (NND). A prospective study was performed to validate the automated malleable system, which included six different classification methods combined in a fusion algorithm proposing a final diagnosis. Finally, new diagnoses were incorporated into the system. In total, questionnaires from 210 individuals were used to train the system. 89.5 % correct diagnoses were achieved during cross-validation. The sensitivity of the system was 93-97 % for individuals with MP, with MdMy and without neuromuscular diseases, but only 69 % in SMA and 81 % in ALS patients. In the prospective trial, 57/64 (89 %) diagnoses were predicted correctly by the computerized system. All questions, or rather all answers, increased the diagnostic accuracy of the system, with the best results reached by the fusion of different classifier methods. Receiver operating curve (ROC) and p-value analyses confirmed the results. A questionnaire-based diagnostic support tool using data mining methods exhibited good results in predicting selected neuromuscular diseases. Due to the variety of neuromuscular diseases, additional studies are required to measure beneficial effects in the clinical setting.

  18. Applying Data Mining Techniques to Extract Hidden Patterns about Breast Cancer Survival in an Iranian Cohort Study.

    PubMed

    Khalkhali, Hamid Reza; Lotfnezhad Afshar, Hadi; Esnaashari, Omid; Jabbari, Nasrollah

    2016-01-01

    Breast cancer survival has been analyzed by many standard data mining algorithms. A group of these algorithms belonged to the decision tree category. Ability of the decision tree algorithms in terms of visualizing and formulating of hidden patterns among study variables were main reasons to apply an algorithm from the decision tree category in the current study that has not studied already. The classification and regression trees (CART) was applied to a breast cancer database contained information on 569 patients in 2007-2010. The measurement of Gini impurity used for categorical target variables was utilized. The classification error that is a function of tree size was measured by 10-fold cross-validation experiments. The performance of created model was evaluated by the criteria as accuracy, sensitivity and specificity. The CART model produced a decision tree with 17 nodes, 9 of which were associated with a set of rules. The rules were meaningful clinically. They showed in the if-then format that Stage was the most important variable for predicting breast cancer survival. The scores of accuracy, sensitivity and specificity were: 80.3%, 93.5% and 53%, respectively. The current study model as the first one created by the CART was able to extract useful hidden rules from a relatively small size dataset.

  19. Implementation of pulse-coupled neural networks in a CNAPS environment.

    PubMed

    Kinser, J M; Lindblad, T

    1999-01-01

    Pulse coupled neural networks (PCNN's) are biologically inspired algorithms very well suited for image/signal preprocessing. While several analog implementations are proposed we suggest a digital implementation in an existing environment, the connected network of adapted processors system (CNAPS). The reason for this is two fold. First, CNAPS is a commercially available chip which has been used for several neural-network implementations. Second, the PCNN is, in almost all applications, a very efficient component of a system requiring subsequent and additional processing. This may include gating, Fourier transforms, neural classifiers, data mining, etc, with or without feedback to the PCNN.

  20. A novel feature extraction approach for microarray data based on multi-algorithm fusion

    PubMed Central

    Jiang, Zhu; Xu, Rong

    2015-01-01

    Feature extraction is one of the most important and effective method to reduce dimension in data mining, with emerging of high dimensional data such as microarray gene expression data. Feature extraction for gene selection, mainly serves two purposes. One is to identify certain disease-related genes. The other is to find a compact set of discriminative genes to build a pattern classifier with reduced complexity and improved generalization capabilities. Depending on the purpose of gene selection, two types of feature extraction algorithms including ranking-based feature extraction and set-based feature extraction are employed in microarray gene expression data analysis. In ranking-based feature extraction, features are evaluated on an individual basis, without considering inter-relationship between features in general, while set-based feature extraction evaluates features based on their role in a feature set by taking into account dependency between features. Just as learning methods, feature extraction has a problem in its generalization ability, which is robustness. However, the issue of robustness is often overlooked in feature extraction. In order to improve the accuracy and robustness of feature extraction for microarray data, a novel approach based on multi-algorithm fusion is proposed. By fusing different types of feature extraction algorithms to select the feature from the samples set, the proposed approach is able to improve feature extraction performance. The new approach is tested against gene expression dataset including Colon cancer data, CNS data, DLBCL data, and Leukemia data. The testing results show that the performance of this algorithm is better than existing solutions. PMID:25780277

  1. A novel feature extraction approach for microarray data based on multi-algorithm fusion.

    PubMed

    Jiang, Zhu; Xu, Rong

    2015-01-01

    Feature extraction is one of the most important and effective method to reduce dimension in data mining, with emerging of high dimensional data such as microarray gene expression data. Feature extraction for gene selection, mainly serves two purposes. One is to identify certain disease-related genes. The other is to find a compact set of discriminative genes to build a pattern classifier with reduced complexity and improved generalization capabilities. Depending on the purpose of gene selection, two types of feature extraction algorithms including ranking-based feature extraction and set-based feature extraction are employed in microarray gene expression data analysis. In ranking-based feature extraction, features are evaluated on an individual basis, without considering inter-relationship between features in general, while set-based feature extraction evaluates features based on their role in a feature set by taking into account dependency between features. Just as learning methods, feature extraction has a problem in its generalization ability, which is robustness. However, the issue of robustness is often overlooked in feature extraction. In order to improve the accuracy and robustness of feature extraction for microarray data, a novel approach based on multi-algorithm fusion is proposed. By fusing different types of feature extraction algorithms to select the feature from the samples set, the proposed approach is able to improve feature extraction performance. The new approach is tested against gene expression dataset including Colon cancer data, CNS data, DLBCL data, and Leukemia data. The testing results show that the performance of this algorithm is better than existing solutions.

  2. Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity

    PubMed Central

    Jia, Yi; Huan, Jun; Buhr, Vincent; Zhang, Jintao; Carayannopoulos, Leonidas N

    2009-01-01

    Background Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the "twilight-" or "midnight-" zones where pair-wise sequence identities to known sequences fall below 25% and sequence-based functional annotations often fail. Results Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biologic motivation of our study is to recognize common structure patterns in "immunoevasins", proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. Conclusion We present a theoretic framework, offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrices as studied here, and devise an efficient algorithm to identify approximate matched frequent subgraphs. By doing so, we significantly expanded the analytical power of sophisticated data mining algorithms in dealing with large volume of complicated and noisy protein structure data. And without loss of generality, choice of appropriate compatibility matrices allows our method to be easily employed in domains where subgraph labels have some uncertainty. PMID:19208148

  3. @Note: a workbench for biomedical text mining.

    PubMed

    Lourenço, Anália; Carreira, Rafael; Carneiro, Sónia; Maia, Paulo; Glez-Peña, Daniel; Fdez-Riverola, Florentino; Ferreira, Eugénio C; Rocha, Isabel; Rocha, Miguel

    2009-08-01

    Biomedical Text Mining (BioTM) is providing valuable approaches to the automated curation of scientific literature. However, most efforts have addressed the benchmarking of new algorithms rather than user operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial to solve real-world problems and promote further research. We present @Note, a platform for BioTM that aims at the effective translation of the advances between three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are the ability to process abstracts and full-texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows to correct annotations and a Text Mining Module supporting dataset preparation and algorithm evaluation. @Note improves the interoperability, modularity and flexibility when integrating in-home and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasizing the principles of transparency and simplicity of use. Although it is still on-going, it has already allowed the development of applications that are currently being used.

  4. Detecting Diseases in Medical Prescriptions Using Data Mining Tools and Combining Techniques.

    PubMed

    Teimouri, Mehdi; Farzadfar, Farshad; Soudi Alamdari, Mahsa; Hashemi-Meshkini, Amir; Adibi Alamdari, Parisa; Rezaei-Darzi, Ehsan; Varmaghani, Mehdi; Zeynalabedini, Aysan

    2016-01-01

    Data about the prevalence of communicable and non-communicable diseases, as one of the most important categories of epidemiological data, is used for interpreting health status of communities. This study aims to calculate the prevalence of outpatient diseases through the characterization of outpatient prescriptions. The data used in this study is collected from 1412 prescriptions for various types of diseases from which we have focused on the identification of ten diseases. In this study, data mining tools are used to identify diseases for which prescriptions are written. In order to evaluate the performances of these methods, we compare the results with Naïve method. Then, combining methods are used to improve the results. Results showed that Support Vector Machine, with an accuracy of 95.32%, shows better performance than the other methods. The result of Naive method, with an accuracy of 67.71%, is 20% worse than Nearest Neighbor method which has the lowest level of accuracy among the other classification algorithms. The results indicate that the implementation of data mining algorithms resulted in a good performance in characterization of outpatient diseases. These results can help to choose appropriate methods for the classification of prescriptions in larger scales.

  5. Dynamic properties of hot-wire anemometric measurement circuits in the aspect of measurements in mine conditions / Właściwości dynamiczne termoanemometrycznych układów pomiarowych w aspekcie pomiarów w warunkach kopalnianych

    NASA Astrophysics Data System (ADS)

    Jamróz, Paweł; Ligęza, Paweł; Socha, Katarzyna

    2012-12-01

    The use of measurement apparatus under conditions which differ significantly from those under which the apparatus was adjusted carries the risk of altering the previously determined measurement characteristics. This is of special concern in the case of apparatus which is sensitive to external measurement conditions. Advanced measurement systems are equipped with algorithms which allow the negative effect of unstable environmental conditions on their static characteristics to be compensated for. Meanwhile, the problem of altered dynamic properties of such systems is often neglected. This paper presents a model study in which the effect of variable operational conditions on dynamic response of hot-wire anemometric measurement system in the case of simulated mine flows was investigated. A mathematical model of measurement system able to compensate the negative effect of changes in flow velocity and configuration of measurement apparatus itself on its dynamic properties was developed and investigated. Based on conducted experiments, we have developed an automatic regulation algorithm enabling the transmission band of measurement apparatus to be optimized for measurement conditions prevailing in mine environment.

  6. Knowledge Driven Image Mining with Mixture Density Mercer Kernels

    NASA Technical Reports Server (NTRS)

    Srivastava, Ashok N.; Oza, Nikunj

    2004-01-01

    This paper presents a new methodology for automatic knowledge driven image mining based on the theory of Mercer Kernels; which are highly nonlinear symmetric positive definite mappings from the original image space to a very high, possibly infinite dimensional feature space. In that high dimensional feature space, linear clustering, prediction, and classification algorithms can be applied and the results can be mapped back down to the original image space. Thus, highly nonlinear structure in the image can be recovered through the use of well-known linear mathematics in the feature space. This process has a number of advantages over traditional methods in that it allows for nonlinear interactions to be modelled with only a marginal increase in computational costs. In this paper, we present the theory of Mercer Kernels, describe its use in image mining, discuss a new method to generate Mercer Kernels directly from data, and compare the results with existing algorithms on data from the MODIS (Moderate Resolution Spectral Radiometer) instrument taken over the Arctic region. We also discuss the potential application of these methods on the Intelligent Archive, a NASA initiative for developing a tagged image data warehouse for the Earth Sciences.

  7. Knowledge Driven Image Mining with Mixture Density Mercer Kernals

    NASA Technical Reports Server (NTRS)

    Srivastava, Ashok N.; Oza, Nikunj

    2004-01-01

    This paper presents a new methodology for automatic knowledge driven image mining based on the theory of Mercer Kernels, which are highly nonlinear symmetric positive definite mappings from the original image space to a very high, possibly infinite dimensional feature space. In that high dimensional feature space, linear clustering, prediction, and classification algorithms can be applied and the results can be mapped back down to the original image space. Thus, highly nonlinear structure in the image can be recovered through the use of well-known linear mathematics in the feature space. This process has a number of advantages over traditional methods in that it allows for nonlinear interactions to be modelled with only a marginal increase in computational costs. In this paper we present the theory of Mercer Kernels; describe its use in image mining, discuss a new method to generate Mercer Kernels directly from data, and compare the results with existing algorithms on data from the MODIS (Moderate Resolution Spectral Radiometer) instrument taken over the Arctic region. We also discuss the potential application of these methods on the Intelligent Archive, a NASA initiative for developing a tagged image data warehouse for the Earth Sciences.

  8. Development and Testing of Data Mining Algorithms for Earth Observation

    NASA Technical Reports Server (NTRS)

    Glymour, Clark

    2005-01-01

    The new algorithms developed under this project included a principled procedure for classification of objects, events or circumstances according to a target variable when a very large number of potential predictor variables is available but the number of cases that can be used for training a classifier is relatively small. These "high dimensional" problems require finding a minimal set of variables -called the Markov Blanket-- sufficient for predicting the value of the target variable. An algorithm, the Markov Blanket Fan Search, was developed, implemented and tested on both simulated and real data in conjunction with a graphical model classifier, which was also implemented. Another algorithm developed and implemented in TETRAD IV for time series elaborated on work by C. Granger and N. Swanson, which in turn exploited some of our earlier work. The algorithms in question learn a linear time series model from data. Given such a time series, the simultaneous residual covariances, after factoring out time dependencies, may provide information about causal processes that occur more rapidly than the time series representation allow, so called simultaneous or contemporaneous causal processes. Working with A. Monetta, a graduate student from Italy, we produced the correct statistics for estimating the contemporaneous causal structure from time series data using the TETRAD IV suite of algorithms. Two economists, David Bessler and Kevin Hoover, have independently published applications using TETRAD style algorithms to the same purpose. These implementations and algorithmic developments were separately used in two kinds of studies of climate data: Short time series of geographically proximate climate variables predicting agricultural effects in California, and longer duration climate measurements of temperature teleconnections.

  9. Mining Peripheral Arterial Disease Cases from Narrative Clinical Notes Using Natural Language Processing

    PubMed Central

    Afzal, Naveed; Sohn, Sunghwan; Abram, Sara; Scott, Christopher G.; Chaudhry, Rajeev; Liu, Hongfang; Kullo, Iftikhar J.; Arruda-Olson, Adelaide M.

    2016-01-01

    Objective Lower extremity peripheral arterial disease (PAD) is highly prevalent and affects millions of individuals worldwide. We developed a natural language processing (NLP) system for automated ascertainment of PAD cases from clinical narrative notes and compared the performance of the NLP algorithm to billing code algorithms, using ankle-brachial index (ABI) test results as the gold standard. Methods We compared the performance of the NLP algorithm to 1) results of gold standard ABI; 2) previously validated algorithms based on relevant ICD-9 diagnostic codes (simple model) and 3) a combination of ICD-9 codes with procedural codes (full model). A dataset of 1,569 PAD patients and controls was randomly divided into training (n= 935) and testing (n= 634) subsets. Results We iteratively refined the NLP algorithm in the training set including narrative note sections, note types and service types, to maximize its accuracy. In the testing dataset, when compared with both simple and full models, the NLP algorithm had better accuracy (NLP: 91.8%, full model: 81.8%, simple model: 83%, P<.001), PPV (NLP: 92.9%, full model: 74.3%, simple model: 79.9%, P<.001), and specificity (NLP: 92.5%, full model: 64.2%, simple model: 75.9%, P<.001). Conclusions A knowledge-driven NLP algorithm for automatic ascertainment of PAD cases from clinical notes had greater accuracy than billing code algorithms. Our findings highlight the potential of NLP tools for rapid and efficient ascertainment of PAD cases from electronic health records to facilitate clinical investigation and eventually improve care by clinical decision support. PMID:28189359

  10. GESearch: An Interactive GUI Tool for Identifying Gene Expression Signature.

    PubMed

    Ye, Ning; Yin, Hengfu; Liu, Jingjing; Dai, Xiaogang; Yin, Tongming

    2015-01-01

    The huge amount of gene expression data generated by microarray and next-generation sequencing technologies present challenges to exploit their biological meanings. When searching for the coexpression genes, the data mining process is largely affected by selection of algorithms. Thus, it is highly desirable to provide multiple options of algorithms in the user-friendly analytical toolkit to explore the gene expression signatures. For this purpose, we developed GESearch, an interactive graphical user interface (GUI) toolkit, which is written in MATLAB and supports a variety of gene expression data files. This analytical toolkit provides four models, including the mean, the regression, the delegate, and the ensemble models, to identify the coexpression genes, and enables the users to filter data and to select gene expression patterns by browsing the display window or by importing knowledge-based genes. Subsequently, the utility of this analytical toolkit is demonstrated by analyzing two sets of real-life microarray datasets from cell-cycle experiments. Overall, we have developed an interactive GUI toolkit that allows for choosing multiple algorithms for analyzing the gene expression signatures.

  11. Evaluation of a Cubature Kalman Filtering-Based Phase Unwrapping Method for Differential Interferograms with High Noise in Coal Mining Areas

    PubMed Central

    Liu, Wanli; Bian, Zhengfu; Liu, Zhenguo; Zhang, Qiuzhao

    2015-01-01

    Differential interferometric synthetic aperture radar has been shown to be effective for monitoring subsidence in coal mining areas. Phase unwrapping can have a dramatic influence on the monitoring result. In this paper, a filtering-based phase unwrapping algorithm in combination with path-following is introduced to unwrap differential interferograms with high noise in mining areas. It can perform simultaneous noise filtering and phase unwrapping so that the pre-filtering steps can be omitted, thus usually retaining more details and improving the detectable deformation. For the method, the nonlinear measurement model of phase unwrapping is processed using a simplified Cubature Kalman filtering, which is an effective and efficient tool used in many nonlinear fields. Three case studies are designed to evaluate the performance of the method. In Case 1, two tests are designed to evaluate the performance of the method under different factors including the number of multi-looks and path-guiding indexes. The result demonstrates that the unwrapped results are sensitive to the number of multi-looks and that the Fisher Distance is the most suitable path-guiding index for our study. Two case studies are then designed to evaluate the feasibility of the proposed phase unwrapping method based on Cubature Kalman filtering. The results indicate that, compared with the popular Minimum Cost Flow method, the Cubature Kalman filtering-based phase unwrapping can achieve promising results without pre-filtering and is an appropriate method for coal mining areas with high noise. PMID:26153776

  12. Navy/Marine Corps innovative science and technology developments for future enhanced mine detection capabilities

    NASA Astrophysics Data System (ADS)

    Holloway, John H., Jr.; Witherspoon, Ned H.; Miller, Richard E.; Davis, Kenn S.; Suiter, Harold R.; Hilton, Russell J.

    2000-08-01

    JMDT is a Navy/Marine Corps 6.2 Exploratory Development program that is closely coordinated with the 6.4 COBRA acquisition program. The objective of the program is to develop innovative science and technology to enhance future mine detection capabilities. The objective of the program is to develop innovative science and technology to enhance future mine detection capabilities. Prior to transition to acquisition, the COBRA ATD was extremely successful in demonstrating a passive airborne multispectral video sensor system operating in the tactical Pioneer unmanned aerial vehicle (UAV), combined with an integrated ground station subsystem to detect and locate minefields from surf zone to inland areas. JMDT is investigating advanced technology solutions for future enhancements in mine field detection capability beyond the current COBRA ATD demonstrated capabilities. JMDT has recently been delivered next- generation, innovative hardware which was specified by the Coastal System Station and developed under contract. This hardware includes an agile-tuning multispectral, polarimetric, digital video camera and advanced multi wavelength laser illumination technologies to extend the same sorts of multispectral detections from a UAV into the night and over shallow water and other difficult littoral regions. One of these illumination devices is an ultra- compact, highly-efficient near-IR laser diode array. The other is a multi-wavelength range-gateable laser. Additionally, in conjunction with this new technology, algorithm enhancements are being developed in JMDT for future naval capabilities which will outperform the already impressive record of automatic detection of minefields demonstrated by the COBAR ATD.

  13. Developmental GPR mine detection technology known as Balanced Bridge

    NASA Astrophysics Data System (ADS)

    Sherbondy, Kelly D.; Lang, David A.

    1995-06-01

    The Balanced Bridge (BB) detection concept was developed just after the end of WWII. It has been researched for many years since then but it has never truly overcome the following inherent problems: sensitivity to antenna height and tilt variations, detectability of flush mines, sensitivity to soil moisture content, high false alarms, and most importantly, the inability to detect small anti-personnel (AP) mines. Even with all of these shortcomings, the BB sensor technology is still one of the most promising electrmagnetic mine detection systems. This paper will address a new BB detector and its preliminary field performance compared to earlier BB research. The new BB detector has superior capabilities compared to earlier BB efforts involving single frequency or single octave excitation because the new BB operates over a multi-octave bandwidth. The new BB detector also incorporates audio and visual presentations of digitally processed signals where earlier versions only had an audible announcement derived from a simple thresholding algorithm. New BB designs addressing previous BB deficiencies will also be discussed. Design changes include using a broadband printed circuit board antenna, RF transmit and receive components, and a digital signal processor. This new BB detector will be tested at an Advanced Technology Demonstration (ATD) evaluation in FY95. The ATD exit criteria will be discussed and compared to recent field testing of the new BB detector. Preliminary results with the new BB system have demonstrated encouraging results which will be incorporated in this paper.

  14. Weighted Association Rule Mining for Item Groups with Different Properties and Risk Assessment for Networked Systems

    NASA Astrophysics Data System (ADS)

    Kim, Jungja; Ceong, Heetaek; Won, Yonggwan

    In market-basket analysis, weighted association rule (WAR) discovery can mine the rules that include more beneficial information by reflecting item importance for special products. In the point-of-sale database, each transaction is composed of items with similar properties, and item weights are pre-defined and fixed by a factor such as the profit. However, when items are divided into more than one group and the item importance must be measured independently for each group, traditional weighted association rule discovery cannot be used. To solve this problem, we propose a new weighted association rule mining methodology. The items should be first divided into subgroups according to their properties, and the item importance, i.e. item weight, is defined or calculated only with the items included in the subgroup. Then, transaction weight is measured by appropriately summing the item weights from each subgroup, and the weighted support is computed as the fraction of the transaction weights that contains the candidate items relative to the weight of all transactions. As an example, our proposed methodology is applied to assess the vulnerability to threats of computer systems that provide networked services. Our algorithm provides both quantitative risk-level values and qualitative risk rules for the security assessment of networked computer systems using WAR discovery. Also, it can be widely used for new applications with many data sets in which the data items are distinctly separated.

  15. Analysis of design characteristics of a V-type support using an advanced engineering environment

    NASA Astrophysics Data System (ADS)

    Gwiazda, A.; Banaś, W.; Sękala, A.; Cwikla, G.; Topolska, S.; Foit, K.; Monica, Z.

    2017-08-01

    Modern mining support, for the entire period of their use, is the important part of the mining complex, which includes all the devices in the excavation during his normal use. Therefore, during the design of the support, it is an important task to choose the shape and to select the dimensions of a support as well as its strength characteristics. According to the rules, the design process of a support must take into account, inter alia, the type and the dimensions of the expected means of transport, the number and size of pipelines, and the type of additional equipment used excavation area. The support design must ensure the functionality of the excavation process and job security, while maintaining the economic viability of the entire project. Among others it should ensure the selection of a support for specific natural conditions. It is also important to take into consideration the economic characteristics of the project. The article presents an algorithm of integrative approach and its formalized description in the form of integration the areas of different construction characteristics optimization of a V-type mining support. The paper includes the example of its application for developing the construction of this support. In the paper is also described the results of the characteristics analysis and changings that were introduced afterwards. The support models are prepared in the computer environment of the CAD class (Siemens NX PLM). Also the analyses were conducted in this design, graphical environment.

  16. Mining disease state converters for medical intervention of diseases.

    PubMed

    Dong, Guozhu; Duan, Lei; Tang, Changjie

    2010-02-01

    In applications such as gene therapy and drug design, a key goal is to convert the disease state of diseased objects from an undesirable state into a desirable one. Such conversions may be achieved by changing the values of some attributes of the objects. For example, in gene therapy one may convert cancerous cells to normal ones by changing some genes' expression level from low to high or from high to low. In this paper, we define the disease state conversion problem as the discovery of disease state converters; a disease state converter is a small set of attribute value changes that may change an object's disease state from undesirable into desirable. We consider two variants of this problem: personalized disease state converter mining mines disease state converters for a given individual patient with a given disease, and universal disease state converter mining mines disease state converters for all samples with a given disease. We propose a DSCMiner algorithm to discover small and highly effective disease state converters. Since real-life medical experiments on living diseased instances are expensive and time consuming, we use classifiers trained from the datasets of given diseases to evaluate the quality of discovered converter sets. The effectiveness of a disease state converter is measured by the percentage of objects that are successfully converted from undesirable state into desirable state as deemed by state-of-the-art classifiers. We use experiments to evaluate the effectiveness of our algorithm and to show its effectiveness. We also discuss possible research directions for extensions and improvements. We note that the disease state conversion problem also has applications in customer retention, criminal rehabilitation, and company turn-around, where the goal is to convert class membership of objects whose class is an undesirable class.

  17. Hybrid coexpression link similarity graph clustering for mining biological modules from multiple gene expression datasets

    PubMed Central

    2014-01-01

    Background Advances in genomic technologies have enabled the accumulation of vast amount of genomic data, including gene expression data for multiple species under various biological and environmental conditions. Integration of these gene expression datasets is a promising strategy to alleviate the challenges of protein functional annotation and biological module discovery based on a single gene expression data, which suffers from spurious coexpression. Results We propose a joint mining algorithm that constructs a weighted hybrid similarity graph whose nodes are the coexpression links. The weight of an edge between two coexpression links in this hybrid graph is a linear combination of the topological similarities and co-appearance similarities of the corresponding two coexpression links. Clustering the weighted hybrid similarity graph yields recurrent coexpression link clusters (modules). Experimental results on Human gene expression datasets show that the reported modules are functionally homogeneous as evident by their enrichment with biological process GO terms and KEGG pathways. PMID:25221624

  18. Privacy Is Become with, Data Perturbation

    NASA Astrophysics Data System (ADS)

    Singh, Er. Niranjan; Singhai, Niky

    2011-06-01

    Privacy is becoming an increasingly important issue in many data mining applications that deal with health care, security, finance, behavior and other types of sensitive data. Is particularly becoming important in counterterrorism and homeland security-related applications. We touch upon several techniques of masking the data, namely random distortion, including the uniform and Gaussian noise, applied to the data in order to protect it. These perturbation schemes are equivalent to additive perturbation after the logarithmic Transformation. Due to the large volume of research in deriving private information from the additive noise perturbed data, the security of these perturbation schemes is questionable Many artificial intelligence and statistical methods exist for data analysis interpretation, Identifying and measuring the interestingness of patterns and rules discovered, or to be discovered is essential for the evaluation of the mined knowledge and the KDD process as a whole. While some concrete measurements exist, assessing the interestingness of discovered knowledge is still an important research issue. As the tool for the algorithm implementations we chose the language of choice in industrial world MATLAB.

  19. Application of machine learning algorithms for clinical predictive modeling: a data-mining approach in SCT.

    PubMed

    Shouval, R; Bondi, O; Mishan, H; Shimoni, A; Unger, R; Nagler, A

    2014-03-01

    Data collected from hematopoietic SCT (HSCT) centers are becoming more abundant and complex owing to the formation of organized registries and incorporation of biological data. Typically, conventional statistical methods are used for the development of outcome prediction models and risk scores. However, these analyses carry inherent properties limiting their ability to cope with large data sets with multiple variables and samples. Machine learning (ML), a field stemming from artificial intelligence, is part of a wider approach for data analysis termed data mining (DM). It enables prediction in complex data scenarios, familiar to practitioners and researchers. Technological and commercial applications are all around us, gradually entering clinical research. In the following review, we would like to expose hematologists and stem cell transplanters to the concepts, clinical applications, strengths and limitations of such methods and discuss current research in HSCT. The aim of this review is to encourage utilization of the ML and DM techniques in the field of HSCT, including prediction of transplantation outcome and donor selection.

  20. Systematic drug repositioning through mining adverse event data in ClinicalTrials.gov.

    PubMed

    Su, Eric Wen; Sanger, Todd M

    2017-01-01

    Drug repositioning (i.e., drug repurposing) is the process of discovering new uses for marketed drugs. Historically, such discoveries were serendipitous. However, the rapid growth in electronic clinical data and text mining tools makes it feasible to systematically identify drugs with the potential to be repurposed. Described here is a novel method of drug repositioning by mining ClinicalTrials.gov. The text mining tools I2E (Linguamatics) and PolyAnalyst (Megaputer) were utilized. An I2E query extracts "Serious Adverse Events" (SAE) data from randomized trials in ClinicalTrials.gov. Through a statistical algorithm, a PolyAnalyst workflow ranks the drugs where the treatment arm has fewer predefined SAEs than the control arm, indicating that potentially the drug is reducing the level of SAE. Hypotheses could then be generated for the new use of these drugs based on the predefined SAE that is indicative of disease (for example, cancer).

  1. A Comparative Study to Predict Student’s Performance Using Educational Data Mining Techniques

    NASA Astrophysics Data System (ADS)

    Uswatun Khasanah, Annisa; Harwati

    2017-06-01

    Student’s performance prediction is essential to be conducted for a university to prevent student fail. Number of student drop out is one of parameter that can be used to measure student performance and one important point that must be evaluated in Indonesia university accreditation. Data Mining has been widely used to predict student’s performance, and data mining that applied in this field usually called as Educational Data Mining. This study conducted Feature Selection to select high influence attributes with student performance in Department of Industrial Engineering Universitas Islam Indonesia. Then, two popular classification algorithm, Bayesian Network and Decision Tree, were implemented and compared to know the best prediction result. The outcome showed that student’s attendance and GPA in the first semester were in the top rank from all Feature Selection methods, and Bayesian Network is outperforming Decision Tree since it has higher accuracy rate.

  2. Mining knowledge in noisy audio data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Czyzewski, A.

    1996-12-31

    This paper demonstrates a KDD method applied to audio data analysis, particularly, it presents possibilities which result from replacing traditional methods of analysis and acoustic signal processing by KDD algorithms when restoring audio recordings affected by strong noise.

  3. Automatic channel trimming for control systems: A concept

    NASA Technical Reports Server (NTRS)

    Vandervoort, R. J.; Sykes, H. A.

    1977-01-01

    Set of bias signals added to channel inputs automatically normalize differences between channels. Algorithm and second feedback loop compute trim biases. Concept could be applied to regulators and multichannel servosystems for remote manipulators in undersea mining.

  4. Effects of overweight vehicles on NYSDOT infrastructure.

    DOT National Transportation Integrated Search

    2015-09-01

    This report develops a methodology for estimating the effects of different categories of overweight : trucks on NYSDOT pavements and bridges. A data mining algorithm is used to categorize truck : data collected at several Weigh-In-Motion stations aro...

  5. Detecting Plastic PFM-1 Butterfly Mines Using Thermal Infrared Sensing

    NASA Astrophysics Data System (ADS)

    Baur, J.; de Smet, T.; Nikulin, A.

    2017-12-01

    Remnant plastic-composite landmines, such as the mass-produced PFM-1, represent an ongoing humanitarian threat aggravated by high costs associated with traditional demining efforts. These particular unexploded ordnance (UXO) devices pose a challenge to conventional geophysical detection methods, due their plastic-body design and small size. Additionally, the PFM-1s represent a particularly heinous UXO, due to their low mass ( 25 lb) trigger limit and "butterfly" wing design, earning them the reputation of a "toy mine" - disproportionally impacting children across post-conflict areas. We developed a detection algorithm based on data acquired by a thermal infrared camera mounted to a commercial UAV to detect time-variable temperature difference between the PFM-1 and the surrounding environment. We present results of a field study focused on thermal detection and identification of the PFM-1 anti-personnel landmines from a remotely operated unmanned aerial vehicle (UAV). We conducted a series of field detection experiments meant to simulate the mountainous terrains where PFM-1 mines were historically deployed and remain in place. In our tests, 18 inert PFM-1 mines along with the aluminum KSF-1 casing were randomly dispersed to mimic an ellipsoidal minefield of 8-10 x 18-20 m dimensions in a de-vegetated rubble yard at Chenango Valley State Park (New York State). We collected multiple thermal infrared imagery datasets focused on these model minefields with the FLIR Vue Pro R attached to the 3DR Solo UAV flying at approximately at 2 m. We identified different environmental variables to constrain the optimal time of day and daily temperature variations to reveal presence of these plastic UXOs. We show that in the early-morning hours when thermal inertia is greatest, the PFM-1 mines can be detected based on their differential thermal inertia. Because the mines have statistically different temperatures than background and a characteristic shape, we were able to train a supervised learning algorithm to automate detection of the mines over large areas. We anticipate that following further development, this remote sensing method can aid in significantly reducing the cost and time associated with landmine remediation in post-conflict nations.

  6. Information Gain Based Dimensionality Selection for Classifying Text Documents

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dumidu Wijayasekara; Milos Manic; Miles McQueen

    2013-06-01

    Selecting the optimal dimensions for various knowledge extraction applications is an essential component of data mining. Dimensionality selection techniques are utilized in classification applications to increase the classification accuracy and reduce the computational complexity. In text classification, where the dimensionality of the dataset is extremely high, dimensionality selection is even more important. This paper presents a novel, genetic algorithm based methodology, for dimensionality selection in text mining applications that utilizes information gain. The presented methodology uses information gain of each dimension to change the mutation probability of chromosomes dynamically. Since the information gain is calculated a priori, the computational complexitymore » is not affected. The presented method was tested on a specific text classification problem and compared with conventional genetic algorithm based dimensionality selection. The results show an improvement of 3% in the true positives and 1.6% in the true negatives over conventional dimensionality selection methods.« less

  7. Comparative Analysis of Document level Text Classification Algorithms using R

    NASA Astrophysics Data System (ADS)

    Syamala, Maganti; Nalini, N. J., Dr; Maguluri, Lakshamanaphaneendra; Ragupathy, R., Dr.

    2017-08-01

    From the past few decades there has been tremendous volumes of data available in Internet either in structured or unstructured form. Also, there is an exponential growth of information on Internet, so there is an emergent need of text classifiers. Text mining is an interdisciplinary field which draws attention on information retrieval, data mining, machine learning, statistics and computational linguistics. And to handle this situation, a wide range of supervised learning algorithms has been introduced. Among all these K-Nearest Neighbor(KNN) is efficient and simplest classifier in text classification family. But KNN suffers from imbalanced class distribution and noisy term features. So, to cope up with this challenge we use document based centroid dimensionality reduction(CentroidDR) using R Programming. By combining these two text classification techniques, KNN and Centroid classifiers, we propose a scalable and effective flat classifier, called MCenKNN which works well substantially better than CenKNN.

  8. A study of fuzzy logic ensemble system performance on face recognition problem

    NASA Astrophysics Data System (ADS)

    Polyakova, A.; Lipinskiy, L.

    2017-02-01

    Some problems are difficult to solve by using a single intelligent information technology (IIT). The ensemble of the various data mining (DM) techniques is a set of models which are able to solve the problem by itself, but the combination of which allows increasing the efficiency of the system as a whole. Using the IIT ensembles can improve the reliability and efficiency of the final decision, since it emphasizes on the diversity of its components. The new method of the intellectual informational technology ensemble design is considered in this paper. It is based on the fuzzy logic and is designed to solve the classification and regression problems. The ensemble consists of several data mining algorithms: artificial neural network, support vector machine and decision trees. These algorithms and their ensemble have been tested by solving the face recognition problems. Principal components analysis (PCA) is used for feature selection.

  9. Sentiment analysis enhancement with target variable in Kumar’s Algorithm

    NASA Astrophysics Data System (ADS)

    Arman, A. A.; Kawi, A. B.; Hurriyati, R.

    2016-04-01

    Sentiment analysis (also known as opinion mining) refers to the use of text analysis and computational linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely applied to reviews discussion that is being talked in social media for many purposes, ranging from marketing, customer service, or public opinion of public policy. One of the popular algorithm for Sentiment Analysis implementation is Kumar algorithm that developed by Kumar and Sebastian. Kumar algorithm can identify the sentiment score of the statement, sentence or tweet, but cannot determine the relationship of the object or target related to the sentiment being analysed. This research proposed solution for that challenge by adding additional component that represent object or target to the existing algorithm (Kumar algorithm). The result of this research is a modified algorithm that can give sentiment score based on a given object or target.

  10. Accurate Grid-based Clustering Algorithm with Diagonal Grid Searching and Merging

    NASA Astrophysics Data System (ADS)

    Liu, Feng; Ye, Chengcheng; Zhu, Erzhou

    2017-09-01

    Due to the advent of big data, data mining technology has attracted more and more attentions. As an important data analysis method, grid clustering algorithm is fast but with relatively lower accuracy. This paper presents an improved clustering algorithm combined with grid and density parameters. The algorithm first divides the data space into the valid meshes and invalid meshes through grid parameters. Secondly, from the starting point located at the first point of the diagonal of the grids, the algorithm takes the direction of “horizontal right, vertical down” to merge the valid meshes. Furthermore, by the boundary grid processing, the invalid grids are searched and merged when the adjacent left, above, and diagonal-direction grids are all the valid ones. By doing this, the accuracy of clustering is improved. The experimental results have shown that the proposed algorithm is accuracy and relatively faster when compared with some popularly used algorithms.

  11. A Semi-supervised Heat Kernel Pagerank MBO Algorithm for Data Classification

    DTIC Science & Technology

    2016-07-01

    financial predictions, etc. and is finding growing use in text mining studies. In this paper, we present an efficient algorithm for classification of high...video data, set of images, hyperspectral data, medical data, text data, etc. Moreover, the framework provides a way to analyze data whose different...also be incorporated. For text classification, one can use tfidf (term frequency inverse document frequency) to form feature vectors for each document

  12. DCL System Using Deep Learning Approaches for Land-Based or Ship-Based Real-Time Recognition and Localization of Marine Mammals

    DTIC Science & Technology

    2013-09-30

    method has been successfully implemented to automatically detect and recognize pulse trains from minke whales ( songs ) and sperm whales (Physeter...workshops, conferences and data challenges 2. Enhancements of the ASR algorithm for frequency-modulated sounds: Right Whale Study 3...Enhancements of the ASR algorithm for pulse trains: Minke Whale Study 4. Mining Big Data Sound Archives using High Performance Computing software and hardware

  13. Measuring Constraint-Set Utility for Partitional Clustering Algorithms

    NASA Technical Reports Server (NTRS)

    Davidson, Ian; Wagstaff, Kiri L.; Basu, Sugato

    2006-01-01

    Clustering with constraints is an active area of machine learning and data mining research. Previous empirical work has convincingly shown that adding constraints to clustering improves the performance of a variety of algorithms. However, in most of these experiments, results are averaged over different randomly chosen constraint sets from a given set of labels, thereby masking interesting properties of individual sets. We demonstrate that constraint sets vary significantly in how useful they are for constrained clustering; some constraint sets can actually decrease algorithm performance. We create two quantitative measures, informativeness and coherence, that can be used to identify useful constraint sets. We show that these measures can also help explain differences in performance for four particular constrained clustering algorithms.

  14. A Comprehensive Workflow of Mass Spectrometry-Based Untargeted Metabolomics in Cancer Metabolic Biomarker Discovery Using Human Plasma and Urine

    PubMed Central

    Zou, Wei; She, Jianwen; Tolstikov, Vladimir V.

    2013-01-01

    Current available biomarkers lack sensitivity and/or specificity for early detection of cancer. To address this challenge, a robust and complete workflow for metabolic profiling and data mining is described in details. Three independent and complementary analytical techniques for metabolic profiling are applied: hydrophilic interaction liquid chromatography (HILIC–LC), reversed-phase liquid chromatography (RP–LC), and gas chromatography (GC). All three techniques are coupled to a mass spectrometer (MS) in the full scan acquisition mode, and both unsupervised and supervised methods are used for data mining. The univariate and multivariate feature selection are used to determine subsets of potentially discriminative predictors. These predictors are further identified by obtaining accurate masses and isotopic ratios using selected ion monitoring (SIM) and data-dependent MS/MS and/or accurate mass MSn ion tree scans utilizing high resolution MS. A list combining all of the identified potential biomarkers generated from different platforms and algorithms is used for pathway analysis. Such a workflow combining comprehensive metabolic profiling and advanced data mining techniques may provide a powerful approach for metabolic pathway analysis and biomarker discovery in cancer research. Two case studies with previous published data are adapted and included in the context to elucidate the application of the workflow. PMID:24958150

  15. Prediction of customer behaviour analysis using classification algorithms

    NASA Astrophysics Data System (ADS)

    Raju, Siva Subramanian; Dhandayudam, Prabha

    2018-04-01

    Customer Relationship management plays a crucial role in analyzing of customer behavior patterns and their values with an enterprise. Analyzing of customer data can be efficient performed using various data mining techniques, with the goal of developing business strategies and to enhance the business. In this paper, three classification models (NB, J48, and MLPNN) are studied and evaluated for our experimental purpose. The performance measures of the three classifications are compared using three different parameters (accuracy, sensitivity, specificity) and experimental results expose J48 algorithm has better accuracy with compare to NB and MLPNN algorithm.

  16. A Novel Center Star Multiple Sequence Alignment Algorithm Based on Affine Gap Penalty and K-Band

    NASA Astrophysics Data System (ADS)

    Zou, Quan; Shan, Xiao; Jiang, Yi

    Multiple sequence alignment is one of the most important topics in computational biology, but it cannot deal with the large data so far. As the development of copy-number variant(CNV) and Single Nucleotide Polymorphisms(SNP) research, many researchers want to align numbers of similar sequences for detecting CNV and SNP. In this paper, we propose a novel multiple sequence alignment algorithm based on affine gap penalty and k-band. It can align more quickly and accurately, that will be helpful for mining CNV and SNP. Experiments prove the performance of our algorithm.

  17. Mineral Mapping Using AVIRIS Data at Ray Mine, AZ

    NASA Technical Reports Server (NTRS)

    McCubbin, Ian; Lang, Harold; Green, Robert O.; Roberts, Dar

    1998-01-01

    Imaging Spectroscopy enables the identification and mapping of surface mineralogy over large areas. This study focused on assessing the utility of Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data for environmental impact analysis over the Environmental Protection Agency's (EPA) high priority Superfund site Ray Mine, AZ. Using the Spectral Angle Mapper (SAM) algorithm to analyze AVIRIS data makes it possible to map surface materials that are indicative of acid generating minerals. The improved performance of the AVIRIS sensor since 1996 provides data with sufficient signal to noise ratio to characterize up to 8 image endmembers. Specifically we employed SAM to map minerals associated with mine generated acid waste, namely jarositc, goethite, and hematite, in the presence of a complex mineralogical background.

  18. Using ADOPT Algorithm and Operational Data to Discover Precursors to Aviation Adverse Events

    NASA Technical Reports Server (NTRS)

    Janakiraman, Vijay; Matthews, Bryan; Oza, Nikunj

    2018-01-01

    The US National Airspace System (NAS) is making its transition to the NextGen system and assuring safety is one of the top priorities in NextGen. At present, safety is managed reactively (correct after occurrence of an unsafe event). While this strategy works for current operations, it may soon become ineffective for future airspace designs and high density operations. There is a need for proactive management of safety risks by identifying hidden and "unknown" risks and evaluating the impacts on future operations. To this end, NASA Ames has developed data mining algorithms that finds anomalies and precursors (high-risk states) to safety issues in the NAS. In this paper, we describe a recently developed algorithm called ADOPT that analyzes large volumes of data and automatically identifies precursors from real world data. Precursors help in detecting safety risks early so that the operator can mitigate the risk in time. In addition, precursors also help identify causal factors and help predict the safety incident. The ADOPT algorithm scales well to large data sets and to multidimensional time series, reduce analyst time significantly, quantify multiple safety risks giving a holistic view of safety among other benefits. This paper details the algorithm and includes several case studies to demonstrate its application to discover the "known" and "unknown" safety precursors in aviation operation.

  19. Structural health monitoring feature design by genetic programming

    NASA Astrophysics Data System (ADS)

    Harvey, Dustin Y.; Todd, Michael D.

    2014-09-01

    Structural health monitoring (SHM) systems provide real-time damage and performance information for civil, aerospace, and other high-capital or life-safety critical structures. Conventional data processing involves pre-processing and extraction of low-dimensional features from in situ time series measurements. The features are then input to a statistical pattern recognition algorithm to perform the relevant classification or regression task necessary to facilitate decisions by the SHM system. Traditional design of signal processing and feature extraction algorithms can be an expensive and time-consuming process requiring extensive system knowledge and domain expertise. Genetic programming, a heuristic program search method from evolutionary computation, was recently adapted by the authors to perform automated, data-driven design of signal processing and feature extraction algorithms for statistical pattern recognition applications. The proposed method, called Autofead, is particularly suitable to handle the challenges inherent in algorithm design for SHM problems where the manifestation of damage in structural response measurements is often unclear or unknown. Autofead mines a training database of response measurements to discover information-rich features specific to the problem at hand. This study provides experimental validation on three SHM applications including ultrasonic damage detection, bearing damage classification for rotating machinery, and vibration-based structural health monitoring. Performance comparisons with common feature choices for each problem area are provided demonstrating the versatility of Autofead to produce significant algorithm improvements on a wide range of problems.

  20. Analyzing Large Gene Expression and Methylation Data Profiles Using StatBicRM: Statistical Biclustering-Based Rule Mining

    PubMed Central

    Maulik, Ujjwal; Mallik, Saurav; Mukhopadhyay, Anirban; Bandyopadhyay, Sanghamitra

    2015-01-01

    Microarray and beadchip are two most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy has been utilized to eliminate the insignificant/low-significant/redundant genes in such way that significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. Corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time, and can work on big dataset. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using David database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy, and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers is also starting with the same post-discretized data-matrix. Finally, we have also included the integrated analysis of gene expression and methylation for determining epigenetic effect (viz., effect of methylation) on gene expression level. PMID:25830807

  1. Analyzing large gene expression and methylation data profiles using StatBicRM: statistical biclustering-based rule mining.

    PubMed

    Maulik, Ujjwal; Mallik, Saurav; Mukhopadhyay, Anirban; Bandyopadhyay, Sanghamitra

    2015-01-01

    Microarray and beadchip are two most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy has been utilized to eliminate the insignificant/low-significant/redundant genes in such way that significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. Corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time, and can work on big dataset. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using David database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy, and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers is also starting with the same post-discretized data-matrix. Finally, we have also included the integrated analysis of gene expression and methylation for determining epigenetic effect (viz., effect of methylation) on gene expression level.

  2. Using Data Mining to Detect Health Care Fraud and Abuse: A Review of Literature

    PubMed Central

    Joudaki, Hossein; Rashidian, Arash; Minaei-Bidgoli, Behrouz; Mahmoodi, Mahmood; Geraili, Bijan; Nasiri, Mahdi; Arab, Mohammad

    2015-01-01

    Inappropriate payments by insurance organizations or third party payers occur because of errors, abuse and fraud. The scale of this problem is large enough to make it a priority issue for health systems. Traditional methods of detecting health care fraud and abuse are time-consuming and inefficient. Combining automated methods and statistical knowledge lead to the emergence of a new interdisciplinary branch of science that is named Knowledge Discovery from Databases (KDD). Data mining is a core of the KDD process. Data mining can help third-party payers such as health insurance organizations to extract useful information from thousands of claims and identify a smaller subset of the claims or claimants for further assessment. We reviewed studies that performed data mining techniques for detecting health care fraud and abuse, using supervised and unsupervised data mining approaches. Most available studies have focused on algorithmic data mining without an emphasis on or application to fraud detection efforts in the context of health service provision or health insurance policy. More studies are needed to connect sound and evidence-based diagnosis and treatment approaches toward fraudulent or abusive behaviors. Ultimately, based on available studies, we recommend seven general steps to data mining of health care claims. PMID:25560347

  3. Comparison of rule induction, decision trees and formal concept analysis approaches for classification

    NASA Astrophysics Data System (ADS)

    Kotelnikov, E. V.; Milov, V. R.

    2018-05-01

    Rule-based learning algorithms have higher transparency and easiness to interpret in comparison with neural networks and deep learning algorithms. These properties make it possible to effectively use such algorithms to solve descriptive tasks of data mining. The choice of an algorithm depends also on its ability to solve predictive tasks. The article compares the quality of the solution of the problems with binary and multiclass classification based on the experiments with six datasets from the UCI Machine Learning Repository. The authors investigate three algorithms: Ripper (rule induction), C4.5 (decision trees), In-Close (formal concept analysis). The results of the experiments show that In-Close demonstrates the best quality of classification in comparison with Ripper and C4.5, however the latter two generate more compact rule sets.

  4. Fast Ss-Ilm a Computationally Efficient Algorithm to Discover Socially Important Locations

    NASA Astrophysics Data System (ADS)

    Dokuz, A. S.; Celik, M.

    2017-11-01

    Socially important locations are places which are frequently visited by social media users in their social media lifetime. Discovering socially important locations provide several valuable information about user behaviours on social media networking sites. However, discovering socially important locations are challenging due to data volume and dimensions, spatial and temporal calculations, location sparseness in social media datasets, and inefficiency of current algorithms. In the literature, several studies are conducted to discover important locations, however, the proposed approaches do not work in computationally efficient manner. In this study, we propose Fast SS-ILM algorithm by modifying the algorithm of SS-ILM to mine socially important locations efficiently. Experimental results show that proposed Fast SS-ILM algorithm decreases execution time of socially important locations discovery process up to 20 %.

  5. Evaluation of airborne thermal infrared imagery for locating mine drainage sites in the Lower Kettle Creek and Cooks Run Basins, Pennsylvania, USA

    USGS Publications Warehouse

    Sams, James I.; Veloski, Garret

    2003-01-01

    High-resolution airborne thermal infrared (TIR) imagery data were collected over 90.6 km2 (35 mi2) of remote and rugged terrain in the Kettle Creek and Cooks Run Basins, tributaries of the West Branch of the Susquehanna River in north-central Pennsylvania. The purpose of this investigation was to evaluate the effectiveness of TIR for identifying sources of acid mine drainage (AMD) associated with abandoned coal mines. Coal mining from the late 1800s resulted in many AMD sources from abandoned mines in the area. However, very little detailed mine information was available, particularly on the source locations of AMD sites. Potential AMD sources were extracted from airborne TIR data employing custom image processing algorithms and GIS data analysis. Based on field reconnaissance of 103 TIR anomalies, 53 sites (51%) were classified as AMD. The AMD sources had low pH (<4) and elevated concentrations of iron and aluminum. Of the 53 sites, approximately 26 sites could be correlated with sites previously documented as AMD. The other 27 mine discharges identified in the TIR data were previously undocumented. This paper presents a summary of the procedures used to process the TIR data and extract potential mine drainage sites, methods used for field reconnaissance and verification of TIR data, and a brief summary of water-quality data.

  6. Space assets for demining assistance

    NASA Astrophysics Data System (ADS)

    Kruijff, Michiel; Eriksson, Daniel; Bouvet, Thomas; Griffiths, Alexander; Craig, Matthew; Sahli, Hichem; González-Rosón, Fernando Valcarce; Willekens, Philippe; Ginati, Amnon

    2013-02-01

    Populations emerging from armed conflicts often remain threatened by landmines and explosive remnants of war. The international mine action community is concerned with the relief of this threat. The Space Assets for Demining Assistance (SADA) undertaking is a set of activities that aim at developing new services to improve the socio-economic impact of mine action activities, primarily focused on the release of land thought to be contaminated, a process described as land release. SADA was originally initiated by the International Astronautical Federation (IAF). It has been implemented under the Integrated Applications Promotion (IAP) program of the European Space Agency (ESA). Land release in mine action is the process whereby the demining community identifies, surveys and prioritizes suspected hazardous areas for more detailed investigation, which eventually results in the clearance of landmines and other explosives, thereby releasing land to the local population. SADA has a broad scope, covering activities, such as planning (risk and impact analysis, prioritization, and resource management), field operations and reporting. SADA services are developed in two phases: feasibility studies followed by demonstration projects. Three parallel feasibility studies have been performed. They aimed at defining an integrated set of space enabled services to support the land release process in mine action, and at analyzing their added value, viability and sustainability. The needs of the mine action sector have been assessed and the potential contribution of space assets has been identified. Support services have been formulated. To test their fieldability, proofs of concept involving mine action end users in various operational field settings have been performed by each of the study teams. The economic viability has also been assessed. Whenever relevant and cost-effective, SADA aims at integrating Earth observation data, GNSS navigation and SatCom technologies with existing mine action tools and procedures, as well as with novel aerial survey technologies. Such conformity with existing user processes, as well as available budgets and appropriateness of technology based solutions given the field level operational setting are important conditions for success. The studies have demonstrated that Earth observation data, satellite navigation solutions and in some cases, satellite communication, indeed can provide added value to mine action activities if properly tailored based on close user interaction and provided through a suitable channel. Such added value for example includes easy and sustained access to Earth observation data for general purpose mapping, land use assessment for post-release progress reporting, and multi-source data fusion algorithms to help quantify risks and socio-economic impact for prioritization and planning purposes. The environment and boundaries of a hazardous area can also be better specified to support the land release process including detailed survey and clearance operations. Satellite communication can help to provide relevant data to remote locations, but is not regarded as strongly user driven. Finally, satellite navigation can support more precise non-technical surveys, as well as aerial observation with small planes or hand-launched UAV's. To ensure the activity is genuinely user driven, the Geneva International Center for Humanitarian Demining (GICHD) plays an important role as ESA's external advisor. ESA is furthermore supported by a representative field operator, the Swiss Foundation of Mine Action (FSD), providing ESA with a direct connection to the field level end users. Specifically FSD has provided a shared user needs baseline to the three study teams. To ensure solutions meet with end user requirements, the study teams themselves include mine action representatives and have interacted closely with their pre-existing and newly established contacts within the mine action community.

  7. The Increase of Power Efficiency of Underground Coal Mining by the Forecasting of Electric Power Consumption

    NASA Astrophysics Data System (ADS)

    Efremenko, Vladimir; Belyaevsky, Roman; Skrebneva, Evgeniya

    2017-11-01

    In article the analysis of electric power consumption and problems of power saving on coal mines are considered. Nowadays the share of conditionally constant costs of electric power for providing safe working conditions underground on coal mines is big. Therefore, the power efficiency of underground coal mining depends on electric power expense of the main technological processes and size of conditionally constant costs. The important direction of increase of power efficiency of coal mining is forecasting of a power consumption and monitoring of electric power expense. One of the main approaches to reducing of electric power costs is increase in accuracy of the enterprise demand in the wholesale electric power market. It is offered to use artificial neural networks to forecasting of day-ahead power consumption with hourly breakdown. At the same time use of neural and indistinct (hybrid) systems on the principles of fuzzy logic, neural networks and genetic algorithms is more preferable. This model allows to do exact short-term forecasts at a small array of input data. A set of the input parameters characterizing mining-and-geological and technological features of the enterprise is offered.

  8. 76 FR 25277 - Lowering Miners' Exposure to Respirable Coal Mine Dust, Including Continuous Personal Dust Monitors

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-05-04

    ... 1219-AB64 Lowering Miners' Exposure to Respirable Coal Mine Dust, Including Continuous Personal Dust... to Respirable Coal Mine Dust, Including Continuous Personal Dust Monitors. This extension gives... Miners' Exposure to Respirable Coal Mine Dust, Including Continuous Personal Dust Monitors. In response...

  9. An implementation of differential evolution algorithm for inversion of geoelectrical data

    NASA Astrophysics Data System (ADS)

    Balkaya, Çağlayan

    2013-11-01

    Differential evolution (DE), a population-based evolutionary algorithm (EA) has been implemented to invert self-potential (SP) and vertical electrical sounding (VES) data sets. The algorithm uses three operators including mutation, crossover and selection similar to genetic algorithm (GA). Mutation is the most important operator for the success of DE. Three commonly used mutation strategies including DE/best/1 (strategy 1), DE/rand/1 (strategy 2) and DE/rand-to-best/1 (strategy 3) were applied together with a binomial type crossover. Evolution cycle of DE was realized without boundary constraints. For the test studies performed with SP data, in addition to both noise-free and noisy synthetic data sets two field data sets observed over the sulfide ore body in the Malachite mine (Colorado) and over the ore bodies in the Neem-Ka Thana cooper belt (India) were considered. VES test studies were carried out using synthetically produced resistivity data representing a three-layered earth model and a field data set example from Gökçeada (Turkey), which displays a seawater infiltration problem. Mutation strategies mentioned above were also extensively tested on both synthetic and field data sets in consideration. Of these, strategy 1 was found to be the most effective strategy for the parameter estimation by providing less computational cost together with a good accuracy. The solutions obtained by DE for the synthetic cases of SP were quite consistent with particle swarm optimization (PSO) which is a more widely used population-based optimization algorithm than DE in geophysics. Estimated parameters of SP and VES data were also compared with those obtained from Metropolis-Hastings (M-H) sampling algorithm based on simulated annealing (SA) without cooling to clarify uncertainties in the solutions. Comparison to the M-H algorithm shows that DE performs a fast approximate posterior sampling for the case of low-dimensional inverse geophysical problems.

  10. Astroinformatics, data mining and the future of astronomical research

    NASA Astrophysics Data System (ADS)

    Brescia, Massimo; Longo, Giuseppe

    2013-08-01

    Astronomy, as many other scientific disciplines, is facing a true data deluge which is bound to change both the praxis and the methodology of every day research work. The emerging field of astroinformatics, while on the one end appears crucial to face the technological challenges, on the other is opening new exciting perspectives for new astronomical discoveries through the implementation of advanced data mining procedures. The complexity of astronomical data and the variety of scientific problems, however, call for innovative algorithms and methods as well as for an extreme usage of ICT technologies.

  11. Numerical simulation study on the distribution law of smoke flow velocity in horizontal tunnel fire

    NASA Astrophysics Data System (ADS)

    Liu, Yejiao; Tian, Zhichao; Xue, Junhua; Wang, Wencai

    2018-02-01

    According to the fluid similarity theory, the simulation experiment system of mining tunnel fire is established. The grid division of experimental model roadway is carried on by GAMBIT software. By setting the boundary and initial conditions of smoke flow during fire period in FLUENT software, using RNG k-Ɛ two-equation turbulence model, energy equation and SIMPLE algorithm, the steady state numerical simulation of smoke flow velocity in mining tunnel is done to obtain the distribution law of smoke flow velocity in tunnel during fire period.

  12. Analysing Customer Opinions with Text Mining Algorithms

    NASA Astrophysics Data System (ADS)

    Consoli, Domenico

    2009-08-01

    Knowing what the customer thinks of a particular product/service helps top management to introduce improvements in processes and products, thus differentiating the company from their competitors and gain competitive advantages. The customers, with their preferences, determine the success or failure of a company. In order to know opinions of the customers we can use technologies available from the web 2.0 (blog, wiki, forums, chat, social networking, social commerce). From these web sites, useful information must be extracted, for strategic purposes, using techniques of sentiment analysis or opinion mining.

  13. Trace element content of gossans at four mines in the West Shasta massive sulfide district.

    USGS Publications Warehouse

    Sanzolone, R.F.; Domenico, J.A.

    1985-01-01

    Paired analyses of the spongy whole-rock gossan and its botryoidal crust ("chipped rock rind') show little differences, whereas duplicate samples of each at individual sites show such extreme differences as to preclude the use of the data in areal mapping. Gossans from disseminated sulphides have lower and less variable trace-element contents than gossans from massive sulphides, due in part to dilution by rock silicates. Computer reduction of the data by a regionalizing algorithm enables determination of pattern differences among the four mines.-G.J.N.

  14. Semantic similarity analysis of protein data: assessment with biological features and issues.

    PubMed

    Guzzi, Pietro H; Mina, Marco; Guerra, Concettina; Cannataro, Mario

    2012-09-01

    The integration of proteomics data with biological knowledge is a recent trend in bioinformatics. A lot of biological information is available and is spread on different sources and encoded in different ontologies (e.g. Gene Ontology). Annotating existing protein data with biological information may enable the use (and the development) of algorithms that use biological ontologies as framework to mine annotated data. Recently many methodologies and algorithms that use ontologies to extract knowledge from data, as well as to analyse ontologies themselves have been proposed and applied to other fields. Conversely, the use of such annotations for the analysis of protein data is a relatively novel research area that is currently becoming more and more central in research. Existing approaches span from the definition of the similarity among genes and proteins on the basis of the annotating terms, to the definition of novel algorithms that use such similarities for mining protein data on a proteome-wide scale. This work, after the definition of main concept of such analysis, presents a systematic discussion and comparison of main approaches. Finally, remaining challenges, as well as possible future directions of research are presented.

  15. A non-linear data mining parameter selection algorithm for continuous variables

    PubMed Central

    Razavi, Marianne; Brady, Sean

    2017-01-01

    In this article, we propose a new data mining algorithm, by which one can both capture the non-linearity in data and also find the best subset model. To produce an enhanced subset of the original variables, a preferred selection method should have the potential of adding a supplementary level of regression analysis that would capture complex relationships in the data via mathematical transformation of the predictors and exploration of synergistic effects of combined variables. The method that we present here has the potential to produce an optimal subset of variables, rendering the overall process of model selection more efficient. This algorithm introduces interpretable parameters by transforming the original inputs and also a faithful fit to the data. The core objective of this paper is to introduce a new estimation technique for the classical least square regression framework. This new automatic variable transformation and model selection method could offer an optimal and stable model that minimizes the mean square error and variability, while combining all possible subset selection methodology with the inclusion variable transformations and interactions. Moreover, this method controls multicollinearity, leading to an optimal set of explanatory variables. PMID:29131829

  16. Issues of Exploitation of Induction Motors in the Course of Underground Mining Operations

    NASA Astrophysics Data System (ADS)

    Gumula, Stanisław; Hudy, Wiktor; Piaskowska-Silarska, Malgorzata; Pytel, Krzysztof

    2017-09-01

    Mining industry is one of the most important customers of electric motors. The most commonly used in the contemporary mining industry is alternating current machines used for processing electrical energy into mechanical energy. The operating problems and the influence of qualitative interference acting on the inputs of individual regulators to field-oriented system in the course of underground mining operations has been presented in the publication. The object of controlling the speed is a slip-ring induction motor. Settings of regulators were calculated using an evolutionary algorithm. Examination of system dynamics was performed by a computer with the use of the MATLAB / Simulink software. According to analyzes, large distortion of input signals of regulators adversely affects the rotational speed that pursued by the control system, which may cause a large vibration of the whole system and, consequently, its much faster destruction. Designed system is characterized by a significantly better resistance to interference. The system is stable with the properly selected settings of regulators, which is particularly important during the operation of machinery used in underground mining.

  17. Minehunting sonar system research and development

    NASA Astrophysics Data System (ADS)

    Ferguson, Brian

    2002-05-01

    Sea mines have the potential to threaten the freedom of the seas by disrupting maritime trade and restricting the freedom of maneuver of navies. The acoustic detection, localization, and classification of sea mines involves a sequence of operations starting with the transmission of a sonar pulse and ending with an operator interpreting the information on a sonar display. A recent improvement to the process stems from the application of neural networks to the computed aided detection of sea mines. The advent of ultrawideband sonar transducers together with pulse compression techniques offers a thousandfold increase in the bandwidth-time product of conventional minehunting sonar transmissions enabling stealth mines to be detected at longer ranges. These wideband signals also enable mines to be imaged at safe standoff distances by applying tomographic image reconstruction techniques. The coupling of wideband transducer technology with synthetic aperture processing enhances the resolution of side scan sonars in both the cross-track and along-track directions. The principles on which conventional and advanced minehunting sonars are based are reviewed and the results of applying novel sonar signal processing algorithms to high-frequency sonar data collected in Australian waters are presented.

  18. Static versus dynamic sampling for data mining

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    John, G.H.; Langley, P.

    1996-12-31

    As data warehouses grow to the point where one hundred gigabytes is considered small, the computational efficiency of data-mining algorithms on large databases becomes increasingly important. Using a sample from the database can speed up the datamining process, but this is only acceptable if it does not reduce the quality of the mined knowledge. To this end, we introduce the {open_quotes}Probably Close Enough{close_quotes} criterion to describe the desired properties of a sample. Sampling usually refers to the use of static statistical tests to decide whether a sample is sufficiently similar to the large database, in the absence of any knowledgemore » of the tools the data miner intends to use. We discuss dynamic sampling methods, which take into account the mining tool being used and can thus give better samples. We describe dynamic schemes that observe a mining tool`s performance on training samples of increasing size and use these results to determine when a sample is sufficiently large. We evaluate these sampling methods on data from the UCI repository and conclude that dynamic sampling is preferable.« less

  19. Quantifying Forest and Coastal Disturbance from Industrial Mining Using Satellite Time Series Analysis Under Very Cloudy Conditions

    NASA Astrophysics Data System (ADS)

    Alonzo, M.; Van Den Hoek, J.; Ahmed, N.

    2015-12-01

    The open-pit Grasberg mine, located in the highlands of Western Papua, Indonesia, and operated by PT Freeport Indonesia (PT-FI), is among the world's largest in terms of copper and gold production. Over the last 27 years, PT-FI has used the Ajkwa River to transport an estimated 1.3 billion tons of tailings from the mine into the so-called Ajkwa Deposition Area (ADA). The ADA is the product of aggradation and lateral expansion of the Ajkwa River into the surrounding lowland rainforest and mangroves, which include species important to the livelihoods of indigenous Papuans. Mine tailings that do not settle in the ADA disperse into the Arafura Sea where they increase levels of suspended particulate matter (SPM) and associated concentrations of dissolved copper. Despite the mine's large-scale operations, ecological impact of mine tailings deposition on the forest and estuarial ecosystems have received minimal formal study. While ground-based inquiries are nearly impossible due to access restrictions, assessment via satellite remote sensing is promising but hindered by extreme cloud cover. In this study, we characterize ridgeline-to-coast environmental impacts along the Ajkwa River, from the Grasberg mine to the Arafura Sea between 1987 and 2014. We use "all available" Landsat TM and ETM+ images collected over this time period to both track pixel-level vegetation disturbance and monitor changes in coastal SPM levels. Existing temporal segmentation algorithms are unable to assess both acute and protracted trajectories of vegetation change due to pervasive cloud cover. In response, we employ robust, piecewise linear regression on noisy vegetation index (NDVI) data in a manner that is relatively insensitive to atmospheric contamination. Using this disturbance detection technique we constructed land cover histories for every pixel, based on 199 image dates, to differentiate processes of vegetation decline, disturbance, and regrowth. Using annual reports from PT-FI, we show that the changing extent and spatial patterns of riparian vegetation disturbance directly correlate with yearly tailings production rates. While the rate of vegetation disturbance decreased after 1998, SPM levels along the Arafura coast increased, suggesting the failure of PT-FI to fully confine tailings to the ADA.

  20. Mining peripheral arterial disease cases from narrative clinical notes using natural language processing.

    PubMed

    Afzal, Naveed; Sohn, Sunghwan; Abram, Sara; Scott, Christopher G; Chaudhry, Rajeev; Liu, Hongfang; Kullo, Iftikhar J; Arruda-Olson, Adelaide M

    2017-06-01

    Lower extremity peripheral arterial disease (PAD) is highly prevalent and affects millions of individuals worldwide. We developed a natural language processing (NLP) system for automated ascertainment of PAD cases from clinical narrative notes and compared the performance of the NLP algorithm with billing code algorithms, using ankle-brachial index test results as the gold standard. We compared the performance of the NLP algorithm to (1) results of gold standard ankle-brachial index; (2) previously validated algorithms based on relevant International Classification of Diseases, Ninth Revision diagnostic codes (simple model); and (3) a combination of International Classification of Diseases, Ninth Revision codes with procedural codes (full model). A dataset of 1569 patients with PAD and controls was randomly divided into training (n = 935) and testing (n = 634) subsets. We iteratively refined the NLP algorithm in the training set including narrative note sections, note types, and service types, to maximize its accuracy. In the testing dataset, when compared with both simple and full models, the NLP algorithm had better accuracy (NLP, 91.8%; full model, 81.8%; simple model, 83%; P < .001), positive predictive value (NLP, 92.9%; full model, 74.3%; simple model, 79.9%; P < .001), and specificity (NLP, 92.5%; full model, 64.2%; simple model, 75.9%; P < .001). A knowledge-driven NLP algorithm for automatic ascertainment of PAD cases from clinical notes had greater accuracy than billing code algorithms. Our findings highlight the potential of NLP tools for rapid and efficient ascertainment of PAD cases from electronic health records to facilitate clinical investigation and eventually improve care by clinical decision support. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.

Top