Multi-INT Complex Event Processing using Approximate, Incremental Graph Pattern Search
2012-06-01
graph pattern search and SPARQL queries. Total execution time for 10 executions each of 5 random pattern searches in synthetic data sets ... [Figure: Initial Performance Comparisons; x-axis: RDF triples (1000, 10000, 100000); y-axis: Time (secs); series: Graph pattern algorithm, SPARQL queries]
Querying graphs in protein-protein interactions networks using feedback vertex set.
Blin, Guillaume; Sikora, Florian; Vialette, Stéphane
2010-01-01
Recent techniques are rapidly increasing our knowledge of interactions between proteins. Interpreting this new information depends on our ability to retrieve known substructures in the data, the Protein-Protein Interaction (PPI) networks. From an algorithmic point of view, this is a hard task since it often leads to NP-hard problems. To overcome this difficulty, many authors have provided tools for querying patterns with a restricted topology, i.e., paths or trees in PPI networks. Such restrictions lead to the development of fixed-parameter tractable (FPT) algorithms, which can be practicable for restricted sizes of queries. Unfortunately, Graph Homomorphism is a W[1]-hard problem, and hence no FPT algorithm can be found when patterns are in the shape of general graphs. However, Dost et al. gave an algorithm (which is not implemented) to query graphs with a bounded treewidth in PPI networks (the treewidth of the query being involved in the time complexity). In this paper, we propose another algorithm for querying patterns in the shape of graphs, also based on dynamic programming and the color-coding technique. To transform graph queries into trees without loss of information, we use a feedback vertex set coupled with a node duplication mechanism. Hence, our algorithm is FPT for querying graphs with a bounded size of their feedback vertex set. It gives an alternative to the treewidth parameter, which can be better or worse for a given query. We provide a Python implementation, which allows us to validate our algorithm on real data. In particular, we retrieve some human queries in the shape of graphs in the fly PPI network.
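The color-coding technique mentioned in the abstract can be illustrated on the simplest restricted topology, a path query. The sketch below is a generic textbook-style version and not the authors' feedback-vertex-set algorithm; the function names, the adjacency-dictionary input format, and the default number of trials are assumptions made for illustration.

```python
import random

def has_colorful_path(adj, colors, k):
    """Dynamic programming step: is there a path on k vertices whose colors are all distinct?"""
    # reach[v] holds the color sets of colorful paths that end at v.
    reach = {v: {frozenset([colors[v]])} for v in adj}
    for _ in range(k - 1):
        nxt = {v: set() for v in adj}
        for u in adj:
            for used in reach[u]:
                for v in adj[u]:
                    # Extending only with an unused color keeps the path simple.
                    if colors[v] not in used:
                        nxt[v].add(used | {colors[v]})
        reach = nxt
    return any(reach[v] for v in adj)

def has_simple_path(adj, k, trials=200, seed=0):
    """Randomized test for a simple path on k vertices (one-sided error: may miss a path)."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Assign each vertex one of k colors uniformly at random.
        colors = {v: rng.randrange(k) for v in adj}
        if has_colorful_path(adj, colors, k):
            return True
    return False

# Hypothetical toy network: a chain of four proteins.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(has_simple_path(chain, 4))
```

A "yes" answer is always correct because a colorful path cannot repeat a vertex; a "no" answer only becomes reliable as the number of random colorings grows.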
Durand, Patrick; Labarre, Laurent; Meil, Alain; Divo, Jean-Louis; Vandenbrouck, Yves; Viari, Alain; Wojcik, Jérôme
2006-01-17
A large variety of biological data can be represented by graphs. These graphs can be constructed from heterogeneous data coming from genomic and post-genomic technologies, but there is still a need for tools aimed at exploring and analysing such graphs. This paper describes GenoLink, a software platform for the graphical querying and exploration of graphs. GenoLink provides a generic framework for representing and querying data graphs. This framework provides a graph data structure, a graph query engine that retrieves sub-graphs from the entire data graph, and several graphical interfaces to express such queries and to further explore their results. A query consists of a graph pattern with constraints attached to the vertices and edges. A query result is the set of all sub-graphs of the entire data graph that are isomorphic to the pattern and satisfy the constraints. The graph data structure does not rely upon any particular data model but can dynamically accommodate any user-supplied data model. However, for genomic and post-genomic applications, we provide a default data model and several parsers for the most popular data sources. GenoLink does not require any programming skill since all operations on graphs and the analysis of the results can be carried out graphically through several dedicated graphical interfaces. GenoLink is a generic and interactive tool allowing biologists to graphically explore various sources of information. GenoLink is distributed either as a standalone application or as a component of the Genostar/Iogma platform. Both distributions are free for academic research and teaching purposes and can be requested at academy@genostar.com. A commercial licence form for for-profit companies can be obtained at info@genostar.com. See also http://www.genostar.org.
Durand, Patrick; Labarre, Laurent; Meil, Alain; Divo, Jean-Louis; Vandenbrouck, Yves; Viari, Alain; Wojcik, Jérôme
2006-01-01
Background: A large variety of biological data can be represented by graphs. These graphs can be constructed from heterogeneous data coming from genomic and post-genomic technologies, but there is still a need for tools aimed at exploring and analysing such graphs. This paper describes GenoLink, a software platform for the graphical querying and exploration of graphs. Results: GenoLink provides a generic framework for representing and querying data graphs. This framework provides a graph data structure, a graph query engine that retrieves sub-graphs from the entire data graph, and several graphical interfaces to express such queries and to further explore their results. A query consists of a graph pattern with constraints attached to the vertices and edges. A query result is the set of all sub-graphs of the entire data graph that are isomorphic to the pattern and satisfy the constraints. The graph data structure does not rely upon any particular data model but can dynamically accommodate any user-supplied data model. However, for genomic and post-genomic applications, we provide a default data model and several parsers for the most popular data sources. GenoLink does not require any programming skill since all operations on graphs and the analysis of the results can be carried out graphically through several dedicated graphical interfaces. Conclusion: GenoLink is a generic and interactive tool allowing biologists to graphically explore various sources of information. GenoLink is distributed either as a standalone application or as a component of the Genostar/Iogma platform. Both distributions are free for academic research and teaching purposes and can be requested at academy@genostar.com. A commercial licence form for for-profit companies can be obtained at info@genostar.com. See also http://www.genostar.org. PMID:16417636
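The query model described in this abstract (a graph pattern whose vertices carry constraints, answered by all matching sub-graphs) can be sketched with a small backtracking search. This is not GenoLink's engine; the function name, the data layout, and the gene/protein example are hypothetical. The sketch only requires that every pattern edge be present in the data graph, which is weaker than induced sub-graph isomorphism.

```python
def match_pattern(data_nodes, data_edges, pattern_nodes, pattern_edges):
    """Return every injective assignment of pattern variables to data nodes.

    data_nodes:    {node_id: attribute dict}
    data_edges:    set of directed (source, target) pairs
    pattern_nodes: {variable: predicate over an attribute dict}
    pattern_edges: list of directed (variable, variable) pairs
    """
    order = list(pattern_nodes)
    results = []

    def consistent(assign):
        # Check every pattern edge whose two endpoints are already bound.
        return all((assign[a], assign[b]) in data_edges
                   for a, b in pattern_edges if a in assign and b in assign)

    def extend(i, assign):
        if i == len(order):
            results.append(dict(assign))
            return
        var = order[i]
        for node, attrs in data_nodes.items():
            # Skip nodes already used or failing the vertex constraint.
            if node in assign.values() or not pattern_nodes[var](attrs):
                continue
            assign[var] = node
            if consistent(assign):
                extend(i + 1, assign)
            del assign[var]

    extend(0, {})
    return results

# Hypothetical data graph: one gene and two proteins.
nodes = {"g1": {"type": "gene"}, "p1": {"type": "protein"}, "p2": {"type": "protein"}}
edges = {("g1", "p1"), ("p1", "p2")}
pattern = {"x": lambda a: a["type"] == "gene", "y": lambda a: a["type"] == "protein"}
print(match_pattern(nodes, edges, pattern, [("x", "y")]))
```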
VISAGE: Interactive Visual Graph Querying.
Pienta, Robert; Navathe, Shamkant; Tamersoy, Acar; Tong, Hanghang; Endert, Alex; Chau, Duen Horng
2016-06-01
Extracting useful patterns from large network datasets has become a fundamental challenge in many domains. We present VISAGE, an interactive visual graph querying approach that empowers users to construct expressive queries, without writing complex code (e.g., finding money laundering rings of bankers and business owners). Our contributions are as follows: (1) we introduce graph autocomplete, an interactive approach that guides users to construct and refine queries, preventing over-specification; (2) VISAGE guides the construction of graph queries using a data-driven approach, enabling users to specify queries with varying levels of specificity, from concrete and detailed (e.g., query by example), to abstract (e.g., with "wildcard" nodes of any types), to purely structural matching; (3) a twelve-participant, within-subject user study demonstrates VISAGE's ease of use and the ability to construct graph queries significantly faster than using a conventional query language; (4) VISAGE works on real graphs with over 468K edges, achieving sub-second response times for common queries.
VISAGE: Interactive Visual Graph Querying
Pienta, Robert; Navathe, Shamkant; Tamersoy, Acar; Tong, Hanghang; Endert, Alex; Chau, Duen Horng
2017-01-01
Extracting useful patterns from large network datasets has become a fundamental challenge in many domains. We present VISAGE, an interactive visual graph querying approach that empowers users to construct expressive queries, without writing complex code (e.g., finding money laundering rings of bankers and business owners). Our contributions are as follows: (1) we introduce graph autocomplete, an interactive approach that guides users to construct and refine queries, preventing over-specification; (2) VISAGE guides the construction of graph queries using a data-driven approach, enabling users to specify queries with varying levels of specificity, from concrete and detailed (e.g., query by example), to abstract (e.g., with “wildcard” nodes of any types), to purely structural matching; (3) a twelve-participant, within-subject user study demonstrates VISAGE’s ease of use and the ability to construct graph queries significantly faster than using a conventional query language; (4) VISAGE works on real graphs with over 468K edges, achieving sub-second response times for common queries. PMID:28553670
Percolator: Scalable Pattern Discovery in Dynamic Graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Choudhury, Sutanay; Purohit, Sumit; Lin, Peng
We demonstrate Percolator, a distributed system for graph pattern discovery in dynamic graphs. In contrast to conventional mining systems, Percolator advocates efficient pattern mining schemes that (1) support pattern detection with keywords; (2) integrate incremental and parallel pattern mining; and (3) support analytical queries such as trend analysis. The core idea of Percolator is to dynamically decide and verify a small fraction of patterns and their instances that must be inspected in response to buffered updates in dynamic graphs, with a total mining cost independent of graph size. We demonstrate a) the feasibility of incremental pattern mining by walking through each component of Percolator, b) the efficiency and scalability of Percolator over the sheer size of real-world dynamic graphs, and c) how the user-friendly GUI of Percolator interacts with users to support keyword-based queries that detect, browse and inspect trending patterns. We also demonstrate two use cases of Percolator, in social media trend analysis and academic collaboration analysis, respectively.
Evaluation of Graph Pattern Matching Workloads in Graph Analysis Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hong, Seokyong; Lee, Sangkeun; Lim, Seung-Hwan
2016-01-01
Graph analysis has emerged as a powerful method for data scientists to represent, integrate, query, and explore heterogeneous data sources. As a result, graph data management and mining became a popular area of research, and led to the development of a plethora of systems in recent years. Unfortunately, the number of emerging graph analysis systems and the wide range of applications, coupled with a lack of apples-to-apples comparisons, make it difficult to understand the trade-offs between different systems and the graph operations for which they are designed. A fair comparison of these systems is a challenging task for the following reasons: multiple data models, non-standardized serialization formats, various query interfaces to users, and diverse environments they operate in. To address these key challenges, in this paper we present a new benchmark suite by extending the Lehigh University Benchmark (LUBM) to cover the most common capabilities of various graph analysis systems. We provide the design process of the benchmark, which generalizes the workflow for data scientists to conduct the desired graph analysis on different graph analysis systems. Equipped with this extended benchmark suite, we present a performance comparison for nine subgraph pattern retrieval operations over six graph analysis systems, namely NetworkX, Neo4j, Jena, Titan, GraphX, and uRiKA. Through the proposed benchmark suite, this study reveals both quantitative and qualitative findings in (1) implications in loading data into each system; (2) challenges in describing graph patterns for each query interface; and (3) different sensitivity of each system to query selectivity. We envision that this study will pave the road for: (i) data scientists to select the suitable graph analysis systems, and (ii) data management system designers to advance graph analysis systems.
A distributed query execution engine of big attributed graphs.
Batarfi, Omar; Elshawi, Radwa; Fayoumi, Ayman; Barnawi, Ahmed; Sakr, Sherif
2016-01-01
A graph is a popular data model that has become pervasively used for modeling structural relationships between objects. In practice, in many real-world graphs, the graph vertices and edges need to be associated with descriptive attributes. Such graphs are referred to as attributed graphs. G-SPARQL has been proposed as an expressive language, with a centralized execution engine, for querying attributed graphs. G-SPARQL supports various types of graph querying operations including reachability, pattern matching and shortest path, where any G-SPARQL query may include value-based predicates on the descriptive information (attributes) of the graph edges/vertices in addition to the structural predicates. In general, a main limitation of centralized systems is that their vertical scalability is always restricted by the physical limits of computer systems. This article describes the design, implementation, and performance evaluation of DG-SPARQL, a distributed, hybrid and adaptive parallel execution engine for G-SPARQL queries. In this engine, the topology of the graph is distributed over the main memory of the underlying nodes while the graph data are maintained in a relational store which is replicated on the disk of each of the underlying nodes. DG-SPARQL evaluates parts of the query plan via SQL queries which are pushed to the underlying relational stores while other parts of the query plan, as necessary, are evaluated via indexless memory-based graph traversal algorithms. Our experimental evaluation shows the efficiency and the scalability of DG-SPARQL on querying massive attributed graph datasets in addition to its ability to outperform Apache Giraph, a popular distributed graph processing system, by orders of magnitude.
Massive Scale Cyber Traffic Analysis: A Driver for Graph Database Research
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joslyn, Cliff A.; Choudhury, S.; Haglin, David J.
2013-06-19
We describe the significance and prominence of network traffic analysis (TA) as a graph- and network-theoretical domain for advancing research in graph database systems. TA involves observing and analyzing the connections between clients, servers, hosts, and actors within IP networks, both at particular times and as extended over times. Towards that end, NetFlow (or more generically, IPFLOW) data are available from routers and servers which summarize coherent groups of IP packets flowing through the network. IPFLOW databases are routinely interrogated statistically and visualized for suspicious patterns. But the ability to cast IPFLOW data as a massive graph and query it interactively, in order to, e.g., identify connectivity patterns, is less well advanced, due to a number of factors including scaling, and their hybrid nature combining graph connectivity and quantitative attributes. In this paper, we outline requirements and opportunities for graph-structured IPFLOW analytics based on our experience with real IPFLOW databases. Specifically, we describe real use cases from the security domain, cast them as graph patterns, show how to express them in two graph-oriented query languages SPARQL and Datalog, and use these examples to motivate a new class of "hybrid" graph-relational systems.
VIGOR: Interactive Visual Exploration of Graph Query Results.
Pienta, Robert; Hohman, Fred; Endert, Alex; Tamersoy, Acar; Roundy, Kevin; Gates, Chris; Navathe, Shamkant; Chau, Duen Horng
2018-01-01
Finding patterns in graphs has become a vital challenge in many domains, from biological systems and network security to finance (e.g., finding money laundering rings of bankers and business owners). While there is significant interest in graph databases and querying techniques, less research has focused on helping analysts make sense of underlying patterns within a group of subgraph results. Visualizing graph query results is challenging, requiring effective summarization of a large number of subgraphs, each having potentially shared node-values, rich node features, and flexible structure across queries. We present VIGOR, a novel interactive visual analytics system for exploring and making sense of query results. VIGOR uses multiple coordinated views, leveraging different data representations and organizations to streamline analysts' sensemaking process. VIGOR contributes: (1) an exemplar-based interaction technique, where an analyst starts with a specific result and relaxes constraints to find other similar results or starts with only the structure (i.e., without node value constraints), and adds constraints to narrow in on specific results; and (2) a novel feature-aware subgraph result summarization. Through a collaboration with Symantec, we demonstrate how VIGOR helps tackle real-world problems through the discovery of security blindspots in a cybersecurity dataset with over 11,000 incidents. We also evaluate VIGOR with a within-subjects study, demonstrating VIGOR's ease of use over a leading graph database management system, and its ability to help analysts understand their results at higher speed and make fewer errors.
Predicting and Detecting Emerging Cyberattack Patterns Using StreamWorks
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chin, George; Choudhury, Sutanay; Feo, John T.
2014-06-30
The number and sophistication of cyberattacks on industries and governments have dramatically grown in recent years. To counter this movement, new advanced tools and techniques are needed to detect cyberattacks in their early stages such that defensive actions may be taken to avert or mitigate potential damage. From a cybersecurity analysis perspective, detecting cyberattacks may be cast as a problem of identifying patterns in computer network traffic. Logically and intuitively, these patterns may take on the form of a directed graph that conveys how an attack or intrusion propagates through the computers of a network. Such cyberattack graphs could provide cybersecurity analysts with powerful conceptual representations that are natural to express and analyze. We have been researching and developing graph-centric approaches and algorithms for dynamic cyberattack detection. The advanced dynamic graph algorithms we are developing will be packaged into a streaming network analysis framework known as StreamWorks. With StreamWorks, a scientist or analyst may detect and identify precursor events and patterns as they emerge in complex networks. This analysis framework is intended to be used in a dynamic environment where network data is streamed in and is appended to a large-scale dynamic graph. Specific graphical query patterns are decomposed and collected into a graph query library. The individual decomposed subpatterns in the library are continuously and efficiently matched against the dynamic graph as it evolves to identify and detect early, partial subgraph patterns. The scalable emerging subgraph pattern algorithms will match on both structural and semantic network properties.
A Semantic Graph Query Language
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kaplan, I L
2006-10-16
Semantic graphs can be used to organize large amounts of information from a number of sources into one unified structure. A semantic query language provides a foundation for extracting information from the semantic graph. The graph query language described here provides a simple, powerful method for querying semantic graphs.
A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Choudhury, Sutanay; Holder, Larry; Chin, George
2015-05-27
Cyber security is one of the most significant technical challenges in current times. Detecting adversarial activities and preventing theft of intellectual property and customer data are high priorities for corporations and government agencies around the world. Cyber defenders need to analyze massive-scale, high-resolution network flows to identify, categorize, and mitigate attacks involving networks spanning institutional and national boundaries. Many of the cyber attacks can be described as subgraph patterns, with prominent examples being insider infiltrations (path queries), denial of service (parallel paths) and malicious spreads (tree queries). This motivates us to explore subgraph matching on streaming graphs in a continuous setting. The novelty of our work lies in using the subgraph distributional statistics collected from the streaming graph to determine the query processing strategy. We introduce a "Lazy Search" algorithm where the search strategy is decided on a vertex-to-vertex basis depending on the likelihood of a match in the vertex neighborhood. We also propose a metric named "Relative Selectivity" that is used to select between different query processing strategies. Our experiments performed on real online news, network traffic stream and a synthetic social network benchmark demonstrate 10-100x speedups over non-incremental, selectivity agnostic approaches.
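To make the idea of continuous matching on a streaming graph concrete, the sketch below incrementally matches a path query (a fixed sequence of edge labels) as edges arrive. It is an eager baseline written for illustration, not the "Lazy Search" algorithm, and it does not use selectivity statistics; the class name, the edge labels, and the assumption that edges arrive in the order the path is traversed are all hypothetical.

```python
class PathQuery:
    """Continuously match a path whose edges carry the given label sequence."""

    def __init__(self, labels):
        self.labels = list(labels)
        # partial[(i, v)]: vertex paths that matched the first i labels and end at v.
        self.partial = {}

    def add_edge(self, u, v, label):
        """Feed one streamed edge; return the full matches completed by it."""
        new_partials = []
        for i, want in enumerate(self.labels):
            if want != label:
                continue
            if i == 0:
                # The edge can start a fresh match.
                new_partials.append((1, [u, v]))
            else:
                # The edge can extend stored matches that end at its source.
                for path in self.partial.get((i, u), ()):
                    if v not in path:
                        new_partials.append((i + 1, path + [v]))
        complete = []
        for matched, path in new_partials:
            if matched == len(self.labels):
                complete.append(path)
            else:
                self.partial.setdefault((matched, path[-1]), []).append(path)
        return complete

# Hypothetical infiltration pattern: login, then copy, then upload.
query = PathQuery(["login", "copy", "upload"])
for edge in [("a", "b", "login"), ("b", "c", "copy"), ("c", "d", "upload")]:
    print(edge, query.add_edge(*edge))
```

Only the partial matches touched by the new edge are examined, which is the property that distinguishes incremental processing from re-running the query on the whole graph after every update.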
EmptyHeaded: A Relational Engine for Graph Processing
Aberger, Christopher R.; Tu, Susan; Olukotun, Kunle; Ré, Christopher
2016-01-01
There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded’s design is a new class of join algorithms that satisfy strong theoretical guarantees but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction multiple data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3× worse performance on SSSP. PMID:28077912
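The abstract does not name the join algorithms it builds on, so the following is only a rough illustration of the general idea of evaluating a graph pattern query as a multiway join through set intersection, using the triangle pattern as the example. It is not EmptyHeaded's engine and uses no SIMD; the function name and edge-list input are assumptions.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    # Orient every edge from the smaller to the larger endpoint so that
    # each triangle a < b < c is discovered exactly once, from edge (a, b).
    higher = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        a, b = (u, v) if u < v else (v, u)
        higher.setdefault(a, set()).add(b)
        higher.setdefault(b, set())
    # For each oriented edge (a, b), the third vertex must be a higher
    # neighbor of both endpoints: a set intersection.
    return sum(len(higher[a] & higher[b]) for a in higher for b in higher[a])

print(count_triangles([(1, 2), (2, 3), (1, 3), (3, 4)]))
```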
NASA Astrophysics Data System (ADS)
Kase, Sue E.; Vanni, Michelle; Knight, Joanne A.; Su, Yu; Yan, Xifeng
2016-05-01
Within operational environments decisions must be made quickly based on the information available. Identifying an appropriate knowledge base and accurately formulating a search query are critical tasks for decision-making effectiveness in dynamic situations. The spreading of graph data management tools to access large graph databases is a rapidly emerging research area of potential benefit to the intelligence community. A graph representation provides a natural way of modeling data in a wide variety of domains. Graph structures use nodes, edges, and properties to represent and store data. This research investigates the advantages of information search by graph query initiated by the analyst and interactively refined within the contextual dimensions of the answer space toward a solution. The paper introduces SLQ, a user-friendly graph querying system enabling the visual formulation of schemaless and structureless graph queries. SLQ is demonstrated with an intelligence analyst information search scenario focused on identifying individuals responsible for manufacturing a mosquito-hosted deadly virus. The scenario highlights the interactive construction of graph queries without prior training in complex query languages or graph databases, intuitive navigation through the problem space, and visualization of results in graphical format.
Stracuzzi, David John; Brost, Randolph C.; Phillips, Cynthia A.; ...
2015-09-26
Geospatial semantic graphs provide a robust foundation for representing and analyzing remote sensor data. In particular, they support a variety of pattern search operations that capture the spatial and temporal relationships among the objects and events in the data. However, in the presence of large data corpora, even a carefully constructed search query may return a large number of unintended matches. This work considers the problem of calculating a quality score for each match to the query, given that the underlying data are uncertain. As a result, we present a preliminary evaluation of three methods for determining both match quality scores and associated uncertainty bounds, illustrated in the context of an example based on overhead imagery data.
Efficient Synthesis of Graph Methods: a Dynamically Scheduled Architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Minutoli, Marco; Castellana, Vito G.; Tumeo, Antonino
RDF databases naturally map to a graph representation and employ languages, such as SPARQL, that implement queries as graph pattern matching routines. Graph methods exhibit irregular behavior: they present unpredictable, fine-grained data accesses and are synchronization intensive. Graph data structures expose large amounts of dynamic parallelism but are difficult to partition without generating load imbalance. In this paper, we present a novel architecture to improve the synthesis of graph methods. Our design addresses the issues of these algorithms with two components: a Dynamic Task Scheduler (DTS), which reduces load imbalance and maximizes resource utilization, and a Hierarchical Memory Interface controller (HMI), which provides support for concurrent memory operations on multi-ported/multi-banked shared memories. We evaluate our approach by generating the accelerators for a set of SPARQL queries from the Lehigh University Benchmark (LUBM). We first analyze the load imbalance of these queries, showing that execution time among tasks can differ by orders of magnitude. We then synthesize the queries and compare the performance of the resulting accelerators against the current state of the art. Experimental results show that our solution provides a speedup over the serial implementation close to the theoretical maximum and a speedup of up to 3.45 over a baseline parallel implementation. We conclude our study by exploring the design space to achieve maximum memory channel utilization. The best design used at least three of the four memory channels for more than 90% of the execution time.
Reactome graph database: Efficient access to complex pathway data
Korninger, Florian; Viteri, Guilherme; Marin-Garcia, Pablo; Ping, Peipei; Wu, Guanming; Stein, Lincoln; D’Eustachio, Peter
2018-01-01
Reactome is a free, open-source, open-data, curated and peer-reviewed knowledgebase of biomolecular pathways. One of its main priorities is to provide easy and efficient access to its high quality curated data. At present, biological pathway databases typically store their contents in relational databases. This limits access efficiency because there are performance issues associated with queries traversing highly interconnected data. The same data in a graph database can be queried more efficiently. Here we present the rationale behind the adoption of a graph database (Neo4j) as well as the new ContentService (REST API) that provides access to these data. The Neo4j graph database and its query language, Cypher, provide efficient access to the complex Reactome data model, facilitating easy traversal and knowledge discovery. The adoption of this technology greatly improved query efficiency, reducing the average query time by 93%. The web service built on top of the graph database provides programmatic access to Reactome data by object oriented queries, but also supports more complex queries that take advantage of the new underlying graph-based data storage. By adopting graph database technology we are providing a high performance pathway data resource to the community. The Reactome graph database use case shows the power of NoSQL database engines for complex biological data types. PMID:29377902
Reactome graph database: Efficient access to complex pathway data.
Fabregat, Antonio; Korninger, Florian; Viteri, Guilherme; Sidiropoulos, Konstantinos; Marin-Garcia, Pablo; Ping, Peipei; Wu, Guanming; Stein, Lincoln; D'Eustachio, Peter; Hermjakob, Henning
2018-01-01
Reactome is a free, open-source, open-data, curated and peer-reviewed knowledgebase of biomolecular pathways. One of its main priorities is to provide easy and efficient access to its high quality curated data. At present, biological pathway databases typically store their contents in relational databases. This limits access efficiency because there are performance issues associated with queries traversing highly interconnected data. The same data in a graph database can be queried more efficiently. Here we present the rationale behind the adoption of a graph database (Neo4j) as well as the new ContentService (REST API) that provides access to these data. The Neo4j graph database and its query language, Cypher, provide efficient access to the complex Reactome data model, facilitating easy traversal and knowledge discovery. The adoption of this technology greatly improved query efficiency, reducing the average query time by 93%. The web service built on top of the graph database provides programmatic access to Reactome data by object oriented queries, but also supports more complex queries that take advantage of the new underlying graph-based data storage. By adopting graph database technology we are providing a high performance pathway data resource to the community. The Reactome graph database use case shows the power of NoSQL database engines for complex biological data types.
Collaborative mining of graph patterns from multiple sources
NASA Astrophysics Data System (ADS)
Levchuk, Georgiy; Colonna-Romanoa, John
2016-05-01
Intelligence analysts require automated tools to mine multi-source data, including answering queries, learning patterns of life, and discovering malicious or anomalous activities. Graph mining algorithms have recently attracted significant attention in the intelligence community, because text-derived knowledge can be efficiently represented as graphs of entities and relationships. However, graph mining models are limited to use cases involving collocated data, and often make restrictive assumptions about the types of patterns that need to be discovered, the relationships between individual sources, and the availability of accurate data segmentation. In this paper we present a model to learn graph patterns from multiple relational data sources, when each source might have only a fragment (or subgraph) of the knowledge that needs to be discovered, and segmentation of data into training or testing instances is not available. Our model is based on distributed collaborative graph learning, and is effective in situations when the data is kept locally and cannot be moved to a centralized location. Our experiments show that the proposed collaborative learning achieves learning quality better than aggregated centralized graph learning, and has learning time comparable to traditional distributed learning in which knowledge of the data segmentation is needed.
GraQL: A Query Language for High-Performance Attributed Graph Databases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chavarría-Miranda, Daniel; Castellana, Vito G.; Morari, Alessandro
Graph databases have gained increasing interest in the last few years due to the emergence of data sources which are not easily analyzable in traditional relational models or for which a graph data model is the natural representation. In order to understand the design and implementation choices for an attributed graph database backend and query language, we have started to design our infrastructure for attributed graph databases. In this paper, we describe the design considerations of our in-memory attributed graph database system with a particular focus on the data definition and query language components.
SPARQL Query Re-writing Using Partonomy Based Transformation Rules
NASA Astrophysics Data System (ADS)
Jain, Prateek; Yeh, Peter Z.; Verma, Kunal; Henson, Cory A.; Sheth, Amit P.
Often the information present in a spatial knowledge base is represented at a different level of granularity and abstraction than the query constraints. For querying ontologies containing spatial information, the precise relationships between spatial entities have to be specified in the basic graph pattern of a SPARQL query, which can result in long and complex queries. We present a novel approach to help users intuitively write SPARQL queries to query spatial data, rather than relying on knowledge of the ontology structure. Our framework re-writes queries, using transformation rules to exploit part-whole relations between geographical entities to address the mismatches between query constraints and knowledge base. Our experiments were performed on completely third-party datasets and queries. Evaluations were performed on the Geonames dataset using questions from the National Geographic Bee serialized into SPARQL, and on the British Administrative Geography Ontology using questions from a popular trivia website. These experiments demonstrate high precision in retrieval of results and ease in writing queries.
Fast Inbound Top-K Query for Random Walk with Restart.
Zhang, Chao; Jiang, Shan; Chen, Yucheng; Sun, Yidan; Han, Jiawei
2015-09-01
Random walk with restart (RWR) is widely recognized as one of the most important node proximity measures for graphs, as it captures the holistic graph structure and is robust to noise in the graph. In this paper, we study a novel query based on the RWR measure, called the inbound top-k (Ink) query. Given a query node q and a number k, the Ink query aims at retrieving the k nodes in the graph that have the largest weighted RWR scores to q. Ink queries can be highly useful for various applications such as traffic scheduling, disease treatment, and targeted advertising. Nevertheless, none of the existing RWR computation techniques can accurately and efficiently process the Ink query in large graphs. We propose two algorithms, namely Squeeze and Ripple, both of which can accurately answer the Ink query in a fast and incremental manner. To identify the top-k nodes, Squeeze iteratively performs matrix-vector multiplication and estimates the lower and upper bounds for all the nodes in the graph. Ripple employs a more aggressive strategy by only estimating the RWR scores for the nodes falling in the vicinity of q; the nodes outside the vicinity do not need to be evaluated because their RWR scores are propagated from the boundary of the vicinity and are thus upper bounded. Ripple incrementally expands the vicinity until the top-k result set can be obtained. Our extensive experiments on real-life graph data sets show that Ink queries can retrieve interesting results, and the proposed algorithms are orders of magnitude faster than state-of-the-art methods.
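To make the Ink query concrete, the following is a minimal brute-force sketch, not the Squeeze or Ripple algorithm from the abstract. It assumes a small directed graph stored as an adjacency dictionary, a restart probability of 0.15, and optional per-node weights; the function names `rwr_scores` and `inbound_top_k` and the toy graph are hypothetical. It runs one power iteration per candidate source node, which is exactly the cost the paper's bounding techniques are designed to avoid.

```python
def rwr_scores(adj, source, restart=0.15, iters=100):
    """RWR proximity from `source` to every node, by power iteration.

    `adj` maps each node to a list of out-neighbours; every neighbour
    must also appear as a key. Dangling nodes send their mass back to
    the source (an assumption of this sketch).
    """
    scores = {n: 0.0 for n in adj}
    scores[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for u, out in adj.items():
            walk_mass = (1.0 - restart) * scores[u]
            if not out:
                nxt[source] += walk_mass
                continue
            share = walk_mass / len(out)
            for v in out:
                nxt[v] += share
        nxt[source] += restart  # restart step: jump back to the source
        scores = nxt
    return scores


def inbound_top_k(adj, q, k, weights=None):
    """Naive inbound top-k: rank nodes u by weight(u) * RWR score of q seen from u."""
    weights = weights or {}
    ranked = []
    for u in adj:
        if u == q:
            continue
        ranked.append((weights.get(u, 1.0) * rwr_scores(adj, u)[q], u))
    # Highest score first; ties broken by node name for determinism.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [u for _, u in ranked[:k]]


# Toy chain c -> b -> a -> q, with q linking back to a.
toy = {"a": ["q"], "b": ["a"], "c": ["b"], "q": ["a"]}
print(inbound_top_k(toy, "q", 2))
```

On the toy chain, nodes closer to q along the edge direction rank higher, which matches the intuition that the Ink query finds the nodes "most likely to reach" q.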
A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Choudhury, Sutanay; Holder, Larry; Chin, George
2015-02-02
Cyber security is one of the most significant technical challenges in current times. Detecting adversarial activities and preventing theft of intellectual property and customer data are high priorities for corporations and government agencies around the world. Cyber defenders need to analyze massive-scale, high-resolution network flows to identify, categorize, and mitigate attacks involving networks spanning institutional and national boundaries. Many of the cyber attacks can be described as subgraph patterns, with prominent examples being insider infiltrations (path queries), denial of service (parallel paths) and malicious spreads (tree queries). This motivates us to explore subgraph matching on streaming graphs in a continuous setting. The novelty of our work lies in using the subgraph distributional statistics collected from the streaming graph to determine the query processing strategy. We introduce a “Lazy Search” algorithm where the search strategy is decided on a vertex-to-vertex basis depending on the likelihood of a match in the vertex neighborhood. We also propose a metric named “Relative Selectivity” that is used to select between different query processing strategies. Our experiments performed on real online news, network traffic stream and a synthetic social network benchmark demonstrate 10-100x speedups over selectivity agnostic approaches.
Graph Mining Meets the Semantic Web
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lee, Sangkeun; Sukumar, Sreenivas R; Lim, Seung-Hwan
The Resource Description Framework (RDF) and SPARQL Protocol and RDF Query Language (SPARQL) were introduced about a decade ago to enable flexible schema-free data interchange on the Semantic Web. Today, data scientists use the framework as a scalable graph representation for integrating, querying, exploring and analyzing data sets hosted at different sources. With increasing adoption, the need for graph mining capabilities for the Semantic Web has emerged. We address that need through implementation of three popular iterative Graph Mining algorithms (Triangle count, Connected component analysis, and PageRank). We implement these algorithms as SPARQL queries, wrapped within Python scripts. We evaluate the performance of our implementation on 6 real world data sets and show graph mining algorithms (that have a linear-algebra formulation) can indeed be unleashed on data represented as RDF graphs using the SPARQL query interface.
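As an illustration of what triangle counting over RDF-style data involves, here is a minimal sketch in plain Python rather than SPARQL; it is not the authors' implementation. It assumes the data is available as (subject, predicate, object) tuples, ignores predicates and edge direction, and uses a hypothetical function name and toy triples.

```python
from itertools import combinations


def count_triangles(triples):
    """Count triangles in the undirected graph induced by RDF-style triples."""
    neighbours = {}
    for s, _predicate, o in triples:
        if s == o:
            continue  # self-loops cannot be part of a triangle
        neighbours.setdefault(s, set()).add(o)
        neighbours.setdefault(o, set()).add(s)

    closed_wedges = 0
    for u, adjacent in neighbours.items():
        # Every pair of neighbours of u that is itself connected closes a triangle.
        for v, w in combinations(adjacent, 2):
            if w in neighbours[v]:
                closed_wedges += 1
    # Each triangle is found once from each of its three corners.
    return closed_wedges // 3


toy_triples = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:knows", "ex:c"),
    ("ex:c", "ex:knows", "ex:a"),
    ("ex:c", "ex:knows", "ex:d"),
]
print(count_triangles(toy_triples))
```

In a SPARQL setting the same computation would be expressed as a three-way self-join over triple patterns; the neighbour-pair check above plays the role of that join.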
Labeling RDF Graphs for Linear Time and Space Querying
NASA Astrophysics Data System (ADS)
Furche, Tim; Weinzierl, Antonius; Bry, François
Indices and data structures for web querying have mostly considered tree-shaped data, reflecting the view of XML documents as tree-shaped. However, for RDF (and when querying ID/IDREF constraints in XML), data is indisputably graph-shaped. In this chapter, we first study existing indexing and labeling schemes for RDF and other graph data, with a focus on support for efficient adjacency and reachability queries. For XML, labeling schemes are an important part of the widespread adoption of XML, in particular for mapping XML to existing (relational) database technology. However, the existing indexing and labeling schemes for RDF (and graph data in general) sacrifice one of the most attractive properties of XML labeling schemes, the constant time (and per-node space) test for adjacency (child) and reachability (descendant). In the second part, we introduce the first labeling scheme for RDF data that retains this property and thus achieves linear time and space processing of acyclic RDF queries on a significantly larger class of graphs than previous approaches (which are mostly limited to tree-shaped data). Finally, we show how this labeling scheme can be applied to (acyclic) SPARQL queries to obtain an evaluation algorithm with time and space complexity linear in the number of resources in the queried RDF graph.
An alternative database approach for management of SNOMED CT and improved patient data queries.
Campbell, W Scott; Pedersen, Jay; McClay, James C; Rao, Praveen; Bastola, Dhundy; Campbell, James R
2015-10-01
SNOMED CT is the international lingua franca of terminologies for human health. Based in Description Logics (DL), the terminology enables data queries that incorporate inferences between data elements, as well as, those relationships that are explicitly stated. However, the ontologic and polyhierarchical nature of the SNOMED CT concept model make it difficult to implement in its entirety within electronic health record systems that largely employ object oriented or relational database architectures. The result is a reduction of data richness, limitations of query capability and increased systems overhead. The hypothesis of this research was that a graph database (graph DB) architecture using SNOMED CT as the basis for the data model and subsequently modeling patient data upon the semantic core of SNOMED CT could exploit the full value of the terminology to enrich and support advanced data querying capability of patient data sets. The hypothesis was tested by instantiating a graph DB with the fully classified SNOMED CT concept model. The graph DB instance was tested for integrity by calculating the transitive closure table for the SNOMED CT hierarchy and comparing the results with transitive closure tables created using current, validated methods. The graph DB was then populated with 461,171 anonymized patient record fragments and over 2.1 million associated SNOMED CT clinical findings. Queries, including concept negation and disjunction, were then run against the graph database and an enterprise Oracle relational database (RDBMS) of the same patient data sets. The graph DB was then populated with laboratory data encoded using LOINC, as well as, medication data encoded with RxNorm and complex queries performed using LOINC, RxNorm and SNOMED CT to identify uniquely described patient populations. A graph database instance was successfully created for two international releases of SNOMED CT and two US SNOMED CT editions. 
Transitive closure tables and descriptive statistics generated using the graph database were identical to those using validated methods. Patient queries produced identical patient count results to the Oracle RDBMS with comparable times. Database queries involving defining attributes of SNOMED CT concepts were possible with the graph DB. The same queries could not be directly performed with the Oracle RDBMS representation of the patient data and required the creation and use of external terminology services. Further, queries of undefined depth were successful in identifying unknown relationships between patient cohorts. The results of this study supported the hypothesis that a patient database built upon and around the semantic model of SNOMED CT was possible. The model supported queries that leveraged all aspects of the SNOMED CT logical model to produce clinically relevant query results. Logical disjunction and negation queries were possible using the data model, as well as, queries that extended beyond the structural IS_A hierarchy of SNOMED CT to include queries that employed defining attribute-values of SNOMED CT concepts as search parameters. As medical terminologies, such as SNOMED CT, continue to expand, they will become more complex and model consistency will be more difficult to assure. Simultaneously, consumers of data will increasingly demand improvements to query functionality to accommodate additional granularity of clinical concepts without sacrificing speed. This new line of research provides an alternative approach to instantiating and querying patient data represented using advanced computable clinical terminologies. Copyright © 2015 Elsevier Inc. All rights reserved.
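The integrity check described above relies on a transitive closure of the IS_A hierarchy. The following is a minimal sketch of that idea only, assuming the hierarchy is acyclic and given as a dictionary from each concept to its direct parents; the concept labels are made-up placeholders, not real SNOMED CT identifiers, and the function name is hypothetical.

```python
def transitive_closure(is_a):
    """Map every concept to the set of all its ancestors.

    `is_a` maps a concept to its direct parents. A concept may have several
    parents (polyhierarchy); the hierarchy is assumed to be acyclic.
    """
    closure = {}

    def ancestors(concept):
        if concept in closure:
            return closure[concept]
        found = set()
        for parent in is_a.get(concept, ()):
            found.add(parent)
            found |= ancestors(parent)
        closure[concept] = found
        return found

    for concept in is_a:
        ancestors(concept)
    return closure


# Placeholder labels for illustration only.
toy_is_a = {
    "viral_pneumonia": ["pneumonia", "viral_infection"],
    "pneumonia": ["lung_disease"],
    "viral_infection": ["infection"],
    "lung_disease": ["disease"],
    "infection": ["disease"],
}
closure = transitive_closure(toy_is_a)
print(sorted(closure["viral_pneumonia"]))
```

With the closure in hand, a subsumption query such as "all patients with any kind of infection" reduces to a set-membership test against each finding's ancestor set.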
A Coding Method for Efficient Subgraph Querying on Vertex- and Edge-Labeled Graphs
Zhu, Lei; Song, Qinbao; Guo, Yuchen; Du, Lei; Zhu, Xiaoyan; Wang, Guangtao
2014-01-01
Labeled graphs are widely used to model complex data in many domains, so subgraph querying has been attracting more and more attention from researchers around the world. Unfortunately, subgraph querying is very time consuming since it involves subgraph isomorphism testing that is known to be an NP-complete problem. In this paper, we propose a novel coding method for subgraph querying that is based on Laplacian spectrum and the number of walks. Our method follows the filtering-and-verification framework and works well on graph databases with frequent updates. We also propose novel two-step filtering conditions that can filter out most false positives and prove that the two-step filtering conditions satisfy the no-false-negative requirement (no dismissal in answers). Extensive experiments on both real and synthetic graphs show that, compared with six existing counterpart methods, our method can effectively improve the efficiency of subgraph querying. PMID:24853266
Parasol: An Architecture for Cross-Cloud Federated Graph Querying
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lieberman, Michael; Choudhury, Sutanay; Hughes, Marisa
2014-06-22
Large scale data fusion of multiple datasets can often provide insights that examining datasets individually cannot. However, when these datasets reside in different data centers and cannot be collocated due to technical, administrative, or policy barriers, a unique set of problems arise that hamper querying and data fusion. To address these problems, a system and architecture named Parasol is presented that enables federated queries over graph databases residing in multiple clouds. Parasol’s design is flexible and requires only minimal assumptions for participant clouds. Query optimization techniques are also described that are compatible with Parasol’s lightweight architecture. Experiments on a prototype implementation of Parasol indicate its suitability for cross-cloud federated graph queries.
HodDB: Design and Analysis of a Query Processor for Brick.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fierro, Gabriel; Culler, David
Brick is a recently proposed metadata schema and ontology for describing building components and the relationships between them. It represents buildings as directed labeled graphs using the RDF data model. Using the SPARQL query language, building-agnostic applications query a Brick graph to discover the set of resources and relationships they require to operate. Latency-sensitive applications, such as user interfaces, demand response and model-predictive control, require fast queries, conventionally less than 100ms. We benchmark a set of popular open-source and commercial SPARQL databases against three real Brick models using seven application queries and find that none of them meet this performance target. This lack of performance can be attributed to design decisions that optimize for queries over large graphs consisting of billions of triples, but give poor spatial locality and join performance on the small dense graphs typical of Brick. We present the design and evaluation of HodDB, an RDF/SPARQL database for Brick built over a node-based index structure. HodDB performs Brick queries 3-700x faster than leading SPARQL databases and consistently meets the 100ms threshold, enabling the portability of important latency-sensitive building applications.
A Random Walk Approach to Query Informative Constraints for Clustering.
Abin, Ahmad Ali
2017-08-09
This paper presents a random walk approach to the problem of querying informative constraints for clustering. The proposed method is based on the properties of the commute time, that is, the expected time taken for a random walk to travel between two nodes and return, on the adjacency graph of the data. Commute time has the useful property that the more short paths connect two given nodes in a graph, the more similar those nodes are. Since computing the commute time takes the Laplacian eigenspectrum into account, we use this property in a recursive fashion to query informative constraints for clustering. At each recursion, the proposed method constructs the adjacency graph of the data and utilizes the spectral properties of the commute time matrix to bipartition the adjacency graph. Thereafter, the proposed method uses the commute time distance on the graph to query informative constraints between partitions. This process iterates for each partition until the stop condition becomes true. Experiments on real-world data show the efficiency of the proposed method for constraint selection.
EAGLE: EAGLE 'Is an' Algorithmic Graph Library for Exploration
DOE Office of Scientific and Technical Information (OSTI.GOV)
2015-01-16
The Resource Description Framework (RDF) and SPARQL Protocol and RDF Query Language (SPARQL) were introduced about a decade ago to enable flexible schema-free data interchange on the Semantic Web. Today data scientists use the framework as a scalable graph representation for integrating, querying, exploring and analyzing data sets hosted at different sources. With increasing adoption, the need for graph mining capabilities for the Semantic Web has emerged. Today there are no tools to conduct "graph mining" on RDF standard data sets. We address that need through implementation of popular iterative Graph Mining algorithms (Triangle count, Connected component analysis, degree distribution, diversity degree, PageRank, etc.). We implement these algorithms as SPARQL queries, wrapped within Python scripts, and call our software tool EAGLE. In RDF style, EAGLE stands for "EAGLE 'Is an' algorithmic graph library for exploration." EAGLE is like 'MATLAB' for 'Linked Data.'
SING: Subgraph search In Non-homogeneous Graphs
2010-01-01
Background: Finding the subgraphs of a graph database that are isomorphic to a given query graph has practical applications in several fields, from cheminformatics to image understanding. Since subgraph isomorphism is a computationally hard problem, indexing techniques have been intensively exploited to speed up the process. Such systems filter out those graphs which cannot contain the query, and apply a subgraph isomorphism algorithm to each residual candidate graph. The applicability of such systems is limited to databases of small graphs, because their filtering power degrades on large graphs. Results: In this paper, SING (Subgraph search In Non-homogeneous Graphs), a novel indexing system able to cope with large graphs, is presented. The method uses the notion of feature, which can be a small subgraph, subtree or path. Each graph in the database is annotated with the set of all its features. The key point is to make use of feature locality information. This idea is used to both improve the filtering performance and speed up the subgraph isomorphism task. Conclusions: Extensive tests on chemical compounds, biological networks and synthetic graphs show that the proposed system outperforms the most popular systems in query time over databases of medium and large graphs. Other specific tests show that the proposed system is effective for single large graphs. PMID:20170516
Processing SPARQL queries with regular expressions in RDF databases
2011-01-01
Background: As the Resource Description Framework (RDF) data model is widely used for modeling and sharing a lot of online bioinformatics resources such as Uniprot (dev.isb-sib.ch/projects/uniprot-rdf) or Bio2RDF (bio2rdf.org), SPARQL - a W3C recommendation query for RDF databases - has become an important query language for querying the bioinformatics knowledge bases. Moreover, due to the diversity of users’ requests for extracting information from the RDF data as well as the lack of users’ knowledge about the exact value of each fact in the RDF databases, it is desirable to use the SPARQL query with regular expression patterns for querying the RDF data. To the best of our knowledge, there is currently no work that efficiently supports regular expression processing in SPARQL over RDF databases. Most of the existing techniques for processing regular expressions are designed for querying a text corpus, or only for supporting the matching over the paths in an RDF graph. Results: In this paper, we propose a novel framework for supporting regular expression processing in SPARQL query. Our contributions can be summarized as follows. 1) We propose an efficient framework for processing SPARQL queries with regular expression patterns in RDF databases. 2) We propose a cost model in order to adapt the proposed framework in the existing query optimizers. 3) We build a prototype for the proposed framework in C++ and conduct extensive experiments demonstrating the efficiency and effectiveness of our technique. Conclusions: Experiments with a full-blown RDF engine show that our framework outperforms the existing ones by up to two orders of magnitude in processing SPARQL queries with regular expression patterns. PMID:21489225
Processing SPARQL queries with regular expressions in RDF databases.
Lee, Jinsoo; Pham, Minh-Duc; Lee, Jihwan; Han, Wook-Shin; Cho, Hune; Yu, Hwanjo; Lee, Jeong-Hoon
2011-03-29
As the Resource Description Framework (RDF) data model is widely used for modeling and sharing a lot of online bioinformatics resources such as Uniprot (dev.isb-sib.ch/projects/uniprot-rdf) or Bio2RDF (bio2rdf.org), SPARQL - a W3C recommendation query for RDF databases - has become an important query language for querying the bioinformatics knowledge bases. Moreover, due to the diversity of users' requests for extracting information from the RDF data as well as the lack of users' knowledge about the exact value of each fact in the RDF databases, it is desirable to use the SPARQL query with regular expression patterns for querying the RDF data. To the best of our knowledge, there is currently no work that efficiently supports regular expression processing in SPARQL over RDF databases. Most of the existing techniques for processing regular expressions are designed for querying a text corpus, or only for supporting the matching over the paths in an RDF graph. In this paper, we propose a novel framework for supporting regular expression processing in SPARQL query. Our contributions can be summarized as follows. 1) We propose an efficient framework for processing SPARQL queries with regular expression patterns in RDF databases. 2) We propose a cost model in order to adapt the proposed framework in the existing query optimizers. 3) We build a prototype for the proposed framework in C++ and conduct extensive experiments demonstrating the efficiency and effectiveness of our technique. Experiments with a full-blown RDF engine show that our framework outperforms the existing ones by up to two orders of magnitude in processing SPARQL queries with regular expression patterns.
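To show the kind of query the abstract has in mind, here is a minimal sketch of a regular-expression filter over RDF-style triples. It is a naive full scan in Python, not the proposed framework or its C++ prototype; the function name, predicate names, and toy triples are hypothetical. It uses search semantics (a match anywhere in the literal unless the pattern is anchored), which is how this sketch assumes a SPARQL regex filter behaves.

```python
import re


def subjects_matching(triples, predicate, pattern, ignore_case=False):
    """Return subjects whose `predicate` literal matches `pattern`.

    Scans every triple, which is the baseline cost that an index-based
    approach would try to avoid.
    """
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(pattern, flags)
    return [s for s, p, o in triples if p == predicate and rx.search(o)]


# Made-up records for illustration only.
toy_triples = [
    ("ex:P1", "ex:label", "Serine/threonine-protein kinase"),
    ("ex:P2", "ex:label", "Tyrosine-protein kinase receptor"),
    ("ex:P3", "ex:label", "Hemoglobin subunit alpha"),
    ("ex:P1", "ex:organism", "Homo sapiens"),
]
print(subjects_matching(toy_triples, "ex:label", r"kinase$"))
```

The anchored pattern `kinase$` selects only the first record, whereas the unanchored pattern `kinase` would also select the second.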
Browsing schematics: Query-filtered graphs with context nodes
NASA Technical Reports Server (NTRS)
Ciccarelli, Eugene C.; Nardi, Bonnie A.
1988-01-01
The early results of a research project to create tools for building interfaces to intelligent systems on the NASA Space Station are reported. One such tool is the Schematic Browser which helps users engaged in engineering problem solving find and select schematics from among a large set. Users query for schematics with certain components, and the Schematic Browser presents a graph whose nodes represent the schematics with those components. The query greatly reduces the number of choices presented to the user, filtering the graph to a manageable size. Users can reformulate and refine the query serially until they locate the schematics of interest. To help users maintain orientation as they navigate a large body of data, the graph also includes nodes that are not matches but provide global and local context for the matching nodes. Context nodes include landmarks, ancestors, siblings, children and previous matches.
G-Hash: Towards Fast Kernel-based Similarity Search in Large Graph Databases.
Wang, Xiaohong; Smalter, Aaron; Huan, Jun; Lushington, Gerald H
2009-01-01
Structured data including sets, sequences, trees and graphs, pose significant challenges to fundamental aspects of data management such as efficient storage, indexing, and similarity search. With the fast accumulation of graph databases, similarity search in graph databases has emerged as an important research topic. Graph similarity search has applications in a wide range of domains including cheminformatics, bioinformatics, sensor network management, social network management, and XML documents, among others. Most of the current graph indexing methods focus on subgraph query processing, i.e. determining the set of database graphs that contains the query graph and hence do not directly support similarity search. In data mining and machine learning, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models for supervised learning, graph kernel functions have (i) high computational complexity and (ii) non-trivial difficulty to be indexed in a graph database. Our objective is to bridge graph kernel function and similarity search in graph databases by proposing (i) a novel kernel-based similarity measurement and (ii) an efficient indexing structure for graph data management. Our method of similarity measurement builds upon local features extracted from each node and their neighboring nodes in graphs. A hash table is utilized to support efficient storage and fast search of the extracted local features. Using the hash table, a graph kernel function is defined to capture the intrinsic similarity of graphs and for fast similarity query processing. We have implemented our method, which we have named G-hash, and have demonstrated its utility on large chemical graph databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification.
Most importantly, the new similarity measurement and the index structure is scalable to large database with smaller indexing size, faster indexing construction time, and faster query processing time as compared to state-of-the-art indexing methods such as C-tree, gIndex, and GraphGrep.
Query optimization for graph analytics on linked data using SPARQL
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hong, Seokyong; Lee, Sangkeun; Lim, Seung-Hwan
2015-07-01
Triplestores that support query languages such as SPARQL are emerging as the preferred and scalable solution to represent data and meta-data as massive heterogeneous graphs using Semantic Web standards. With increasing adoption, the desire to conduct graph-theoretic mining and exploratory analysis has also increased. Addressing that desire, this paper presents a solution that is the marriage of Graph Theory and the Semantic Web. We present software that can analyze Linked Data using graph operations such as counting triangles, finding eccentricity, testing connectedness, and computing PageRank directly on triple stores via the SPARQL interface. We describe the process of optimizing performance of the SPARQL-based implementation of such popular graph algorithms by reducing the space-overhead, simplifying iterative complexity and removing redundant computations by understanding query plans. Our optimized approach shows significant performance gains on triplestores hosted on stand-alone workstations as well as hardware-optimized scalable supercomputers such as the Cray XMT.
Property Graph vs RDF Triple Store: A Comparison on Glycan Substructure Search
Alocci, Davide; Mariethoz, Julien; Horlacher, Oliver; Bolleman, Jerven T.; Campbell, Matthew P.; Lisacek, Frederique
2015-01-01
Resource description framework (RDF) and Property Graph databases are emerging technologies that are used for storing graph-structured data. We compare these technologies through a molecular biology use case: glycan substructure search. Glycans are branched tree-like molecules composed of building blocks linked together by chemical bonds. The molecular structure of a glycan can be encoded into a directed acyclic graph where each node represents a building block and each edge serves as a chemical linkage between two building blocks. In this context, Graph databases are possible software solutions for storing glycan structures and Graph query languages, such as SPARQL and Cypher, can be used to perform a substructure search. Glycan substructure searching is an important feature for querying structure and experimental glycan databases and retrieving biologically meaningful data. This applies for example to identifying a region of the glycan recognised by a glycan binding protein (GBP). In this study, 19,404 glycan structures were selected from GlycomeDB (www.glycome-db.org) and modelled for storage in an RDF triple store and a Property Graph. We then performed two different sets of searches and compared the query response times and the results from both technologies to assess performance and accuracy. The two implementations produced the same results, but interestingly we noted a difference in the query response times. Qualitative measures such as portability were also used to define further criteria for choosing the technology adapted to solving glycan substructure search and other comparable issues. PMID:26656740
Use of Graph Database for the Integration of Heterogeneous Biological Data.
Yoon, Byoung-Ha; Kim, Seon-Kyu; Kim, Seon-Young
2017-03-01
Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and then infer relationships by multiple-join statements. Recently, a new type of database, called the graph-based database, was developed to natively represent various kinds of complex relationships, and it is widely used among computer science communities and IT industries. Here, we demonstrate the feasibility of using a graph-based database for complex biological relationships by comparing the performance between MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and finally constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, we found that Neo4j outperformed MySQL in all cases. While Neo4j exhibited a very fast response for various queries, MySQL exhibited latent or unfinished responses for complex queries with multiple-join statements. These results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data.
Use of Graph Database for the Integration of Heterogeneous Biological Data
Yoon, Byoung-Ha; Kim, Seon-Kyu
2017-01-01
Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and then infer relationships by multiple-join statements. Recently, a new type of database, called the graph-based database, was developed to natively represent various kinds of complex relationships, and it is widely used among computer science communities and IT industries. Here, we demonstrate the feasibility of using a graph-based database for complex biological relationships by comparing the performance between MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and finally constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, we found that Neo4j outperformed MySQL in all cases. While Neo4j exhibited a very fast response for various queries, MySQL exhibited latent or unfinished responses for complex queries with multiple-join statements. These results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data. PMID:28416946
SPARQLGraph: a web-based platform for graphically querying biological Semantic Web databases.
Schweiger, Dominik; Trajanoski, Zlatko; Pabinger, Stephan
2014-08-15
Semantic Web has established itself as a framework for using and sharing data across applications and database boundaries. Here, we present a web-based platform for querying biological Semantic Web databases in a graphical way. SPARQLGraph offers an intuitive drag & drop query builder, which converts the visual graph into a query and executes it on a public endpoint. The tool integrates several publicly available Semantic Web databases, including the databases of the just recently released EBI RDF platform. Furthermore, it provides several predefined template queries for answering biological questions. Users can easily create and save new query graphs, which can also be shared with other researchers. This new graphical way of creating queries for biological Semantic Web databases considerably facilitates usability as it removes the requirement of knowing specific query languages and database structures. The system is freely available at http://sparqlgraph.i-med.ac.at.
Bandyopadhyay, Deepak; Huan, Jun; Prins, Jan; Snoeyink, Jack; Wang, Wei; Tropsha, Alexander
2009-11-01
Protein function prediction is one of the central problems in computational biology. We present a novel automated protein structure-based function prediction method using libraries of local residue packing patterns that are common to most proteins in a known functional family. Critical to this approach is the representation of a protein structure as a graph where residue vertices (residue name used as a vertex label) are connected by geometrical proximity edges. The approach employs two steps. First, it uses a fast subgraph mining algorithm to find all occurrences of family-specific labeled subgraphs for all well characterized protein structural and functional families. Second, it queries a new structure for occurrences of a set of motifs characteristic of a known family, using a graph index to speed up Ullmann's subgraph isomorphism algorithm. The confidence of function inference from structure depends on the number of family-specific motifs found in the query structure compared with their distribution in a large non-redundant database of proteins. This method can assign a new structure to a specific functional family in cases where sequence alignments, sequence patterns, structural superposition and active site templates fail to provide accurate annotation.
Mining Tasks from the Web Anchor Text Graph: MSR Notebook Paper for the TREC 2015 Tasks Track
2015-11-20
Mining Tasks from the Web Anchor Text Graph: MSR Notebook Paper for the TREC 2015 Tasks Track Paul N. Bennett Microsoft Research Redmond, USA pauben...anchor text graph has proven useful in the general realm of query reformulation [2], we sought to quantify the value of extracting key phrases from...anchor text in the broader setting of the task understanding track. Given a query, our approach considers a simple method for identifying a relevant
A novel adaptive Cuckoo search for optimal query plan generation.
Gomathi, Ramalingam; Sharmila, Dhandapani
2014-01-01
The daily emergence of new web pages has driven the development of semantic web technology. A World Wide Web Consortium (W3C) standard for storing semantic web data is the resource description framework (RDF). To reduce the execution time of queries over large RDF graphs, evolving metaheuristic algorithms have become an alternative to traditional query optimization methods. This paper focuses on the problem of query optimization of semantic web data. An efficient algorithm called adaptive Cuckoo search (ACS) for querying and generating an optimal query plan for large RDF graphs is designed in this research. Experiments were conducted on different datasets with varying numbers of predicates. The experimental results show that the proposed approach yields significant improvements in query execution time. The extent to which the algorithm is efficient is tested and the results are documented.
NASA Astrophysics Data System (ADS)
Arenas, Marcelo; Gutierrez, Claudio; Pérez, Jorge
The Resource Description Framework (RDF) is the standard data model for representing information about World Wide Web resources. In January 2008, the W3C released its recommendation for querying RDF data, a query language called SPARQL. In this chapter, we give a detailed description of the semantics of this language. We start by focusing on the definition of a formal semantics for the core part of SPARQL, and then move to the definition for the entire language, including all the features in the specification of SPARQL by the W3C such as blank nodes in graph patterns and bag semantics for solutions.
G-Bean: an ontology-graph based web tool for biomedical literature retrieval
2014-01-01
Background Currently, most people use NCBI's PubMed to search the MEDLINE database, an important bibliographical information source for life science and biomedical information. However, PubMed has some drawbacks that make it difficult to find relevant publications pertaining to users' individual intentions, especially for non-expert users. To ameliorate the disadvantages of PubMed, we developed G-Bean, a graph based biomedical search engine, to search biomedical articles in MEDLINE database more efficiently. Methods G-Bean addresses PubMed's limitations with three innovations: (1) Parallel document index creation: a multithreaded index creation strategy is employed to generate the document index for G-Bean in parallel; (2) Ontology-graph based query expansion: an ontology graph is constructed by merging four major UMLS (Version 2013AA) vocabularies, MeSH, SNOMEDCT, CSP and AOD, to cover all concepts in National Library of Medicine (NLM) database; a Personalized PageRank algorithm is used to compute concept relevance in this ontology graph and the Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme is used to re-rank the concepts. The top 500 ranked concepts are selected for expanding the initial query to retrieve more accurate and relevant information; (3) Retrieval and re-ranking of documents based on user's search intention: after the user selects any article from the existing search results, G-Bean analyzes user's selections to determine his/her true search intention and then uses more relevant and more specific terms to retrieve additional related articles. The new articles are presented to the user in the order of their relevance to the already selected articles. Results Performance evaluation with 106 OHSUMED benchmark queries shows that G-Bean returns more relevant results than PubMed does when using these queries to search the MEDLINE database. 
PubMed could not even return any search result for some OHSUMED queries because it failed to form the appropriate Boolean query statement automatically from the natural language query strings. G-Bean is available at http://bioinformatics.clemson.edu/G-Bean/index.php. Conclusions G-Bean addresses PubMed's limitations with ontology-graph based query expansion, automatic document indexing, and user search intention discovery. It shows significant advantages in finding relevant articles from the MEDLINE database to meet the information need of the user. PMID:25474588
G-Bean: an ontology-graph based web tool for biomedical literature retrieval.
Wang, James Z; Zhang, Yuanyuan; Dong, Liang; Li, Lin; Srimani, Pradip K; Yu, Philip S
2014-01-01
Currently, most people use NCBI's PubMed to search the MEDLINE database, an important bibliographical information source for life science and biomedical information. However, PubMed has some drawbacks that make it difficult to find relevant publications pertaining to users' individual intentions, especially for non-expert users. To ameliorate the disadvantages of PubMed, we developed G-Bean, a graph based biomedical search engine, to search biomedical articles in MEDLINE database more efficiently. G-Bean addresses PubMed's limitations with three innovations: (1) Parallel document index creation: a multithreaded index creation strategy is employed to generate the document index for G-Bean in parallel; (2) Ontology-graph based query expansion: an ontology graph is constructed by merging four major UMLS (Version 2013AA) vocabularies, MeSH, SNOMEDCT, CSP and AOD, to cover all concepts in National Library of Medicine (NLM) database; a Personalized PageRank algorithm is used to compute concept relevance in this ontology graph and the Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme is used to re-rank the concepts. The top 500 ranked concepts are selected for expanding the initial query to retrieve more accurate and relevant information; (3) Retrieval and re-ranking of documents based on user's search intention: after the user selects any article from the existing search results, G-Bean analyzes user's selections to determine his/her true search intention and then uses more relevant and more specific terms to retrieve additional related articles. The new articles are presented to the user in the order of their relevance to the already selected articles. Performance evaluation with 106 OHSUMED benchmark queries shows that G-Bean returns more relevant results than PubMed does when using these queries to search the MEDLINE database. 
PubMed could not even return any search result for some OHSUMED queries because it failed to form the appropriate Boolean query statement automatically from the natural language query strings. G-Bean is available at http://bioinformatics.clemson.edu/G-Bean/index.php. G-Bean addresses PubMed's limitations with ontology-graph based query expansion, automatic document indexing, and user search intention discovery. It shows significant advantages in finding relevant articles from the MEDLINE database to meet the information need of the user.
Learning of Multimodal Representations With Random Walks on the Click Graph.
Wu, Fei; Lu, Xinyan; Song, Jun; Yan, Shuicheng; Zhang, Zhongfei Mark; Rui, Yong; Zhuang, Yueting
2016-02-01
In multimedia information retrieval, most classic approaches tend to represent different modalities of media in the same feature space. With the click data collected from the users' searching behavior, existing approaches take either one-to-one paired data (text-image pairs) or ranking examples (text-query-image and/or image-query-text ranking lists) as training examples, which do not make full use of the click data, particularly the implicit connections among the data objects. In this paper, we treat the click data as a large click graph, in which vertices are images/text queries and edges indicate the clicks between an image and a query. We consider learning a multimodal representation from the perspective of encoding the explicit/implicit relevance relationship between the vertices in the click graph. By minimizing both the truncated random walk loss as well as the distance between the learned representation of vertices and their corresponding deep neural network output, the proposed model which is named multimodal random walk neural network (MRW-NN) can be applied to not only learn robust representation of the existing multimodal data in the click graph, but also deal with the unseen queries and images to support cross-modal retrieval. We evaluate the latent representation learned by MRW-NN on a public large-scale click log data set Clickture and further show that MRW-NN achieves much better cross-modal retrieval performance on the unseen queries/images than the other state-of-the-art methods.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Madduri, Kamesh; Wu, Kesheng
The Resource Description Framework (RDF) is a popular data model for representing linked data sets arising from the web, as well as large scientific data repositories such as UniProt. RDF data intrinsically represents a labeled and directed multi-graph. SPARQL is a query language for RDF that expresses subgraph pattern-finding queries on this implicit multigraph in a SQL-like syntax. SPARQL queries generate complex intermediate join queries; to compute these joins efficiently, we propose a new strategy based on bitmap indexes. We store the RDF data in column-oriented structures as compressed bitmaps along with two dictionaries. This paper makes three new contributions. (i) We present an efficient parallel strategy for parsing the raw RDF data, building dictionaries of unique entities, and creating compressed bitmap indexes of the data. (ii) We utilize the constructed bitmap indexes to efficiently answer SPARQL queries, simplifying the join evaluations. (iii) To quantify the performance impact of using bitmap indexes, we compare our approach to the state-of-the-art triple-store RDF-3X. We find that our bitmap index-based approach to answering queries is up to an order of magnitude faster for a variety of SPARQL queries, on gigascale RDF data sets.
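The idea of answering a join with bitmap operations can be illustrated with a minimal sketch. It assumes a single dictionary, uncompressed bitmaps held in Python integers, and invented example triples; the class and method names are hypothetical and this is not the authors' implementation, which uses compressed bitmaps and two dictionaries.

```python
from collections import defaultdict

class BitmapIndex:
    """Toy index: one subject bitmap per (predicate, object) pair. Hypothetical design."""

    def __init__(self, triples):
        self.ids = {}                        # dictionary: subject term -> integer id
        self.terms = []                      # reverse dictionary: id -> subject term
        self.po_subjects = defaultdict(int)  # (predicate, object) -> bitmap of subject ids
        for s, p, o in triples:
            self.po_subjects[(p, o)] |= 1 << self._encode(s)

    def _encode(self, term):
        if term not in self.ids:
            self.ids[term] = len(self.terms)
            self.terms.append(term)
        return self.ids[term]

    def subjects(self, *patterns):
        """Subjects matching every (predicate, object) pattern; each join is one bitwise AND."""
        bitmap = -1  # all bits set
        for p, o in patterns:
            bitmap &= self.po_subjects.get((p, o), 0)
        return [t for i, t in enumerate(self.terms) if (bitmap >> i) & 1]

# Invented example data, for illustration only.
triples = [
    ("P1", "type", "Protein"),
    ("P2", "type", "Protein"),
    ("P1", "organism", "Human"),
    ("P3", "organism", "Human"),
]
idx = BitmapIndex(triples)
human_proteins = idx.subjects(("type", "Protein"), ("organism", "Human"))
```

Here the two triple patterns share a subject variable, so the join reduces to intersecting two bitmaps rather than materializing intermediate result tables.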
Enhancing SAMOS Data Access in DOMS via a Neo4j Property Graph Database.
NASA Astrophysics Data System (ADS)
Stallard, A. P.; Smith, S. R.; Elya, J. L.
2016-12-01
The Shipboard Automated Meteorological and Oceanographic System (SAMOS) initiative provides routine access to high-quality marine meteorological and near-surface oceanographic observations from research vessels. The Distributed Oceanographic Match-Up Service (DOMS) under development is a centralized service that allows researchers to easily match in situ and satellite oceanographic data from distributed sources to facilitate satellite calibration, validation, and retrieval algorithm development. The service currently uses Apache Solr as a backend search engine on each node in the distributed network. While Solr is a high-performance solution that facilitates creation and maintenance of indexed data, it is limited in the sense that its schema is fixed. The property graph model escapes this limitation by creating relationships between data objects. The authors will present the development of the SAMOS Neo4j property graph database including new search possibilities that take advantage of the property graph model, performance comparisons with Apache Solr, and a vision for graph databases as a storage tool for oceanographic data. The integration of the SAMOS Neo4j graph into DOMS will also be described. Currently, Neo4j contains spatial and temporal records from SAMOS which are modeled into a time tree and r-tree using Graph Aware and Spatial plugin tools for Neo4j. These extensions provide callable Java procedures within CYPHER (Neo4j's query language) that generate in-graph structures. Once generated, these structures can be queried using procedures from these libraries, or directly via CYPHER statements. Neo4j excels at performing relationship and path-based queries, which challenge relational-SQL databases because they require memory intensive joins due to the limitation of their design. Consider a user who wants to find records over several years, but only for specific months. 
If a traditional database only stores timestamps, this type of query would be complex and likely prohibitively slow. Using the time tree model, one can specify a path from the root to the data which restricts resolutions to certain timeframes (e.g., months). This query can be executed without joins, unions, or other compute-intensive operations, putting Neo4j at a computational advantage to the SQL database alternative.
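The time tree idea described above can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the Neo4j or GraphAware API: the tree is a nested mapping (year, then month), and the function names and sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def build_time_tree(records):
    """Root -> year -> month -> list of payloads (a two-level time tree)."""
    tree = defaultdict(lambda: defaultdict(list))
    for timestamp, payload in records:
        tree[timestamp.year][timestamp.month].append(payload)
    return tree

def records_in_months(tree, years, months):
    """Follow only the root-to-month paths of interest; no scan over all timestamps."""
    return [
        payload
        for year in years
        for month in months
        for payload in tree.get(year, {}).get(month, [])
    ]

# Invented sample observations, for illustration only.
records = [
    (datetime(2014, 1, 5), "a"),
    (datetime(2014, 7, 1), "b"),
    (datetime(2015, 1, 9), "c"),
    (datetime(2016, 2, 2), "d"),
]
tree = build_time_tree(records)
januaries = records_in_months(tree, range(2014, 2017), [1])
```

The query for "every January from 2014 to 2016" touches only three branches of the tree, which mirrors the path restriction the abstract describes.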
Enabling Graph Appliance for Genome Assembly
DOE Office of Scientific and Technical Information (OSTI.GOV)
Singh, Rina; Graves, Jeffrey A; Lee, Sangkeun
2015-01-01
In recent years, there has been a huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotides, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers developed have used de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to store and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multi-threaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray's Urika-GD system; Urika-GD is a high performance graph appliance with a large shared memory and massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing de Bruijn graphs as RDF graphs and propose an iterative querying approach for finding Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real-world Ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.
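For readers unfamiliar with de Bruijn graph assembly, the following is a minimal in-memory sketch, assuming error-free reads and a graph that admits a single Eulerian path. It uses Hierholzer's algorithm in plain Python rather than the iterative SPARQL approach of the paper; the function names and the toy reads are hypothetical.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Each distinct k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return sorted((kmer[:-1], kmer[1:]) for kmer in kmers)

def eulerian_path(edges):
    """Hierholzer's algorithm; assumes a non-empty edge list with an Eulerian path."""
    out = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start where out-degree exceeds in-degree by one, else at any node with an edge.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out[node]:
            stack.append(out[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(reads, k):
    """Spell the sequence along the Eulerian path: first node plus last base of each next node."""
    path = eulerian_path(de_bruijn_edges(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

# Toy reads invented for illustration; they overlap to cover one short sequence.
contig = assemble(["ATGCG", "GCGTA", "CGTAC"], k=4)
```

Real data contain repeats and sequencing errors, so the graph generally has many possible Eulerian paths and needs cleaning steps that this sketch omits.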
Composing Data Parallel Code for a SPARQL Graph Engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castellana, Vito G.; Tumeo, Antonino; Villa, Oreste
Big data analytics processes large amounts of data to extract knowledge from them. Semantic databases are big data applications that adopt the Resource Description Framework (RDF) to structure metadata through a graph-based representation. The graph-based representation provides several benefits, such as the possibility to perform in-memory processing with large amounts of parallelism. SPARQL is a language used to perform queries on RDF-structured data through graph matching. In this paper we present a tool that automatically translates SPARQL queries to parallel graph crawling and graph matching operations. The tool also supports complex SPARQL constructs, which require more than basic graph matching for their implementation. The tool generates parallel code annotated with OpenMP pragmas for x86 Shared-memory Multiprocessors (SMPs). With respect to commercial database systems such as Virtuoso, our approach reduces memory occupation due to join operations and provides higher performance. We show the scaling of the automatically generated graph-matching code on a 48-core SMP.
Menopause and big data: Word Adjacency Graph modeling of menopause-related ChaCha data.
Carpenter, Janet S; Groves, Doyle; Chen, Chen X; Otte, Julie L; Miller, Wendy R
2017-07-01
To detect and visualize salient queries about menopause using Big Data from ChaCha. We used Word Adjacency Graph (WAG) modeling to detect clusters and visualize the range of menopause-related topics and their mutual proximity. The subset of relevant queries was fully modeled. We split each query into token words (ie, meaningful words and phrases) and removed stopwords (ie, not meaningful functional words). The remaining words were considered in sequence to build summary tables of words and two and three-word phrases. Phrases occurring at least 10 times were used to build a network graph model that was iteratively refined by observing and removing clusters of unrelated content. We identified two menopause-related subsets of queries by searching for questions containing menopause and menopause-related terms (eg, climacteric, hot flashes, night sweats, hormone replacement). The first contained 263,363 queries from individuals aged 13 and older and the second contained 5,892 queries from women aged 40 to 62 years. In the first set, we identified 12 topic clusters: 6 relevant to menopause and 6 less relevant. In the second set, we identified 15 topic clusters: 11 relevant to menopause and 4 less relevant. Queries about hormones were pervasive within both WAG models. Many of the queries reflected low literacy levels and/or feelings of embarrassment. We modeled menopause-related queries posed by ChaCha users between 2009 and 2012. ChaCha data may be used on its own or in combination with other Big Data sources to identify patient-driven educational needs and create patient-centered interventions.
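The tokenizing, stopword removal, and phrase-counting steps described above can be sketched as follows. This is a minimal illustration restricted to two-word phrases; the stopword list, the example queries, and the threshold of 2 are invented for the toy data (the study used a threshold of 10 on the real queries), and the function name is hypothetical.

```python
from collections import Counter

# Illustrative stopword list, not the one used in the study.
STOPWORDS = {"is", "a", "the", "of", "what", "are", "for", "do", "i", "have"}

def word_adjacency_edges(queries, min_count=2):
    """Count adjacent word pairs after stopword removal; keep pairs seen at least min_count times."""
    pair_counts = Counter()
    for query in queries:
        tokens = [w for w in query.lower().split() if w not in STOPWORDS]
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Invented queries, for illustration only.
queries = [
    "what are hot flashes",
    "do hot flashes stop",
    "is hormone replacement safe",
    "hormone replacement for hot flashes",
]
edges = word_adjacency_edges(queries)
```

Each surviving pair is a weighted edge between two word vertices; clusters in the resulting graph correspond to candidate topics.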
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets
Plaku, Erion; Kavraki, Lydia E.
2009-01-01
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
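The definition of the knn graph given above can be made concrete with a brute-force, single-machine sketch. It assumes Euclidean distance and a small point set; it does not reflect the distributed message-passing framework of the paper, and the function name is hypothetical.

```python
import math

def knn_graph(points, k):
    """Map each point index to the indices of its k closest points (brute force)."""
    graph = {}
    for i, p in enumerate(points):
        # Distance from point i to every other point, excluding itself.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Invented sample points, for illustration only.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = knn_graph(points, k=2)
```

This costs a quadratic number of distance evaluations, which is the cost that motivates distributing the computation for large high-dimensional sets.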
Building Scalable Knowledge Graphs for Earth Science
NASA Technical Reports Server (NTRS)
Ramachandran, Rahul; Maskey, Manil; Gatlin, Patrick; Zhang, Jia; Duan, Xiaoyi; Miller, J. J.; Bugbee, Kaylin; Christopher, Sundar; Freitag, Brian
2017-01-01
Knowledge Graphs link key entities in a specific domain with other entities via relationships. From these relationships, researchers can query knowledge graphs for probabilistic recommendations to infer new knowledge. Scientific papers are an untapped resource which knowledge graphs could leverage to accelerate research discovery. Goal: Develop an end-to-end (semi) automated methodology for constructing Knowledge Graphs for Earth Science.
Real-time community detection in full social networks on a laptop
Chamberlain, Benjamin Paul; Levy-Kramer, Josh; Humby, Clive
2018-01-01
For a broad range of research and practical applications it is important to understand the allegiances, communities and structure of key players in society. One promising direction towards extracting this information is to exploit the rich relational data in digital social networks (the social graph). As global social networks (e.g., Facebook and Twitter) are very large, most approaches make use of distributed computing systems for this purpose. Distributing graph processing requires solving many difficult engineering problems, which has led some researchers to look at single-machine solutions that are faster and easier to maintain. In this article, we present an approach for analyzing full social networks on a standard laptop, allowing for interactive exploration of the communities in the locality of a set of user specified query vertices. The key idea is that the aggregate actions of large numbers of users can be compressed into a data structure that encapsulates the edge weights between vertices in a derived graph. Local communities can be constructed by selecting vertices that are connected to the query vertices with high edge weights in the derived graph. This compression is robust to noise and allows for interactive queries of local communities in real-time, which we define to be less than the average human reaction time of 0.25s. We achieve single-machine real-time performance by compressing the neighborhood of each vertex using minhash signatures and facilitate rapid queries through Locality Sensitive Hashing. These techniques reduce query times from hours using industrial desktop machines operating on the full graph to milliseconds on standard laptops. Our method allows exploration of strongly associated regions (i.e., communities) of large graphs in real-time on a laptop. 
It has been deployed in software that is actively used by social network analysts and offers another channel for media owners to monetize their data, helping them to continue to provide free services that are valued by billions of people globally. PMID:29342158
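The combination of minhash signatures and Locality Sensitive Hashing mentioned above can be sketched as follows. This is a generic textbook construction, assuming seeded BLAKE2b hashes in place of a hash family, 32 hashes split into 8 bands, and invented neighborhoods; all function names are hypothetical and none of this is taken from the authors' software.

```python
import hashlib
from collections import defaultdict

def _hash(seed, item):
    """64-bit hash of an item under a seed (stands in for a family of hash functions)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(neighbors, num_hashes=32):
    """Compress a vertex neighborhood into num_hashes minima."""
    return [min(_hash(seed, n) for n in neighbors) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity of the sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def build_lsh_index(signatures, bands=8):
    """Bucket vertices by each band of their signature."""
    index = defaultdict(set)
    for vertex, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            index[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(vertex)
    return index

def candidates(vertex, signatures, index, bands=8):
    """Vertices sharing at least one band bucket with the query vertex."""
    sig = signatures[vertex]
    rows = len(sig) // bands
    found = set()
    for b in range(bands):
        found |= index.get((b, tuple(sig[b * rows:(b + 1) * rows])), set())
    return found - {vertex}

# Invented neighborhoods, for illustration only.
neighborhoods = {
    "u1": {"a", "b", "c", "d"},
    "u2": {"a", "b", "c", "d"},  # same neighborhood as u1
    "u3": {"w", "x", "y", "z"},  # disjoint neighborhood
}
signatures = {v: minhash_signature(n) for v, n in neighborhoods.items()}
index = build_lsh_index(signatures)
similar_to_u1 = candidates("u1", signatures, index)
```

A query only compares the signature against vertices in matching buckets, so the cost depends on bucket sizes rather than on the size of the full graph.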
Balaur, Irina; Saqi, Mansoor; Barat, Ana; Lysenko, Artem; Mazein, Alexander; Rawlings, Christopher J; Ruskin, Heather J; Auffray, Charles
2017-10-01
The development of colorectal cancer (CRC), the third most common cancer type, has been associated with deregulations of cellular mechanisms stimulated by both genetic and epigenetic events. StatEpigen is a manually curated and annotated database, containing information on interdependencies between genetic and epigenetic signals, and specialized currently for CRC research. Although StatEpigen provides a well-developed graphical user interface for information retrieval, advanced queries involving associations between multiple concepts can benefit from more detailed graph representation of the integrated data. This can be achieved by using a graph database (NoSQL) approach. Data were extracted from StatEpigen and imported to our newly developed EpiGeNet, a graph database for storage and querying of conditional relationships between molecular (genetic and epigenetic) events observed at different stages of colorectal oncogenesis. We illustrate the enhanced capability of EpiGeNet for exploration of different queries related to colorectal tumor progression; specifically, we demonstrate the query process for (i) stage-specific molecular events, (ii) most frequently observed genetic and epigenetic interdependencies in colon adenoma, and (iii) paths connecting key genes reported in CRC and associated events. The EpiGeNet framework offers improved capability for management and visualization of data on molecular events specific to CRC initiation and progression.
Dynamic Querying of Mass-Storage RDF Data with Rule-Based Entailment Regimes
NASA Astrophysics Data System (ADS)
Ianni, Giovambattista; Krennwallner, Thomas; Martello, Alessandra; Polleres, Axel
RDF Schema (RDFS) as a lightweight ontology language is gaining popularity and, consequently, tools for scalable RDFS inference and querying are needed. SPARQL has recently become a W3C standard for querying RDF data, but it mostly provides means for querying simple RDF graphs only, whereas querying with respect to RDFS or other entailment regimes is left outside the current specification. In this paper, we show that SPARQL faces certain unwanted ramifications when querying ontologies in conjunction with RDF datasets that comprise multiple named graphs, and we provide an extension for SPARQL that remedies these effects. Moreover, since RDFS inference has a close relationship with logic rules, we generalize our approach to select a custom ruleset for specifying inferences to be taken into account in a SPARQL query. We show that our extensions are technically feasible by providing benchmark results for RDFS querying in our prototype system GiaBATA, which uses Datalog coupled with a persistent Relational Database as a back-end for implementing SPARQL with dynamic rule-based inference. By employing different optimization techniques like magic set rewriting, our system remains competitive with state-of-the-art RDFS querying systems.
DOGMA: A Disk-Oriented Graph Matching Algorithm for RDF Databases
NASA Astrophysics Data System (ADS)
Bröcheler, Matthias; Pugliese, Andrea; Subrahmanian, V. S.
RDF is an increasingly important paradigm for the representation of information on the Web. As RDF databases increase in size to approach tens of millions of triples, and as sophisticated graph matching queries expressible in languages like SPARQL become increasingly important, scalability becomes an issue. To date, there is no graph-based indexing method for RDF data where the index was designed in a way that makes it disk-resident. There is therefore a growing need for indexes that can operate efficiently when the index itself resides on disk. In this paper, we first propose the DOGMA index for fast subgraph matching on disk and then develop a basic algorithm to answer queries over this index. This algorithm is then significantly sped up via an optimized algorithm that uses efficient (but correct) pruning strategies when combined with two different extensions of the index. We have implemented a preliminary system and tested it against four existing RDF database systems developed by others. Our experiments show that our algorithm performs very well compared to these systems, with orders of magnitude improvements for complex graph queries.
Top-k similar graph matching using TraM in biological networks.
Amin, Mohammad Shafkat; Finley, Russell L; Jamil, Hasan M
2012-01-01
Many emerging database applications entail sophisticated graph-based query manipulation, predominantly evident in large-scale scientific applications. To access the information embedded in graphs, efficient graph matching tools and algorithms have become of prime importance. Although the prohibitively expensive time complexity associated with exact subgraph isomorphism techniques has limited their efficacy in the application domain, approximate yet efficient graph matching techniques have received much attention due to their pragmatic applicability. Since public domain databases are noisy and incomplete in nature, inexact graph matching techniques have proven to be more promising in terms of inferring knowledge from numerous structural data repositories. In this paper, we propose a novel technique called TraM for approximate graph matching that off-loads a significant amount of its processing onto the database, making the approach viable for large graphs. Moreover, the vector space embedding of the graphs and efficient filtration of the search space enable computation of approximate graph similarity at a throw-away cost. We annotate nodes of the query graphs by means of their global topological properties and compare them with neighborhood-biased segments of the data graph for proper matches. We have conducted experiments on several real data sets, and have demonstrated the effectiveness and efficiency of the proposed method.
Linked data and provenance in biological data webs.
Zhao, Jun; Miles, Alistair; Klyne, Graham; Shotton, David
2009-03-01
The Web is now being used as a platform for publishing and linking life science data. The Web's linking architecture can be exploited to join heterogeneous data from multiple sources. However, as data are frequently being updated in a decentralized environment, provenance information becomes critical to providing reliable and trustworthy services to scientists. This article presents design patterns for representing and querying provenance information relating to mapping links between heterogeneous data from sources in the domain of functional genomics. We illustrate the use of named resource description framework (RDF) graphs at different levels of granularity to make provenance assertions about linked data, and demonstrate that these assertions are sufficient to support requirements including data currency, integrity, evidential support and historical queries.
Framework for Querying and Analysis of Evolving Graphs
ERIC Educational Resources Information Center
Moffitt, Vera Zaychik
2017-01-01
Graph representations underlie many modern computer applications, capturing the structure of such diverse networks as the Internet, personal associations, roads, sensors, and metabolic pathways. While the static structure of graphs is a well-explored field, a new emphasis is being placed on understanding and representing the way these networks…
Guhlin, Joseph; Silverstein, Kevin A T; Zhou, Peng; Tiffin, Peter; Young, Nevin D
2017-08-10
Rapid generation of omics data in recent years has resulted in vast amounts of disconnected datasets without systemic integration and knowledge building, while individual groups have made customized, annotated datasets available on the web with few ways to link them to in-lab datasets. With so many research groups generating their own data, the ability to relate it to the larger genomic and comparative genomic context is becoming increasingly crucial to make full use of the data. The Omics Database Generator (ODG) allows users to create customized databases that utilize published genomics data integrated with experimental data which can be queried using a flexible graph database. When provided with omics and experimental data, ODG will create a comparative, multi-dimensional graph database. ODG can import definitions and annotations from other sources such as InterProScan, the Gene Ontology, ENZYME, UniPathway, and others. This annotation data can be especially useful for studying new or understudied species for which transcripts have only been predicted, and rapidly give additional layers of annotation to predicted genes. In better studied species, ODG can perform syntenic annotation translations or rapidly identify characteristics of a set of genes or nucleotide locations, such as hits from an association study. ODG provides a web-based user-interface for configuring the data import and for querying the database. Queries can also be run from the command-line and the database can be queried directly through programming language hooks available for most languages. ODG supports most common genomic formats as well as a generic, easy-to-use tab-separated value format for user-provided annotations. ODG is a user-friendly database generation and query tool that adapts to the supplied data to produce a comparative genomic database or multi-layered annotation database.
ODG provides rapid comparative genomic annotation and is therefore particularly useful for non-model or understudied species. For species for which more data are available, ODG can be used to conduct complex multi-omics, pattern-matching queries.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Coram, Jamie L.; Morrow, James D.; Perkins, David Nikolaus
2015-09-01
This document describes the PANTHER R&D Application, a proof-of-concept user interface application developed under the PANTHER Grand Challenge LDRD. The purpose of the application is to explore interaction models for graph analytics, drive algorithmic improvements from an end-user point of view, and support demonstration of PANTHER technologies to potential customers. The R&D Application implements a graph-centric interaction model that exposes analysts to the algorithms contained within the GeoGraphy graph analytics library. Users define geospatial-temporal semantic graph queries by constructing search templates based on nodes, edges, and the constraints among them. Users then analyze the results of the queries using both geo-spatial and temporal visualizations. Development of this application has made user experience an explicit driver for project and algorithmic level decisions that will affect how analysts one day make use of PANTHER technologies.
Temporal Representation in Semantic Graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Levandoski, J J; Abdulla, G M
2007-08-07
A wide range of knowledge discovery and analysis applications, ranging from business to biological, make use of semantic graphs when modeling relationships and concepts. Most of the semantic graphs used in these applications are assumed to be static pieces of information, meaning that the temporal evolution of concepts and relationships is not taken into account. Guided by the need for more advanced semantic graph queries involving temporal concepts, this paper surveys the existing work involving temporal representations in semantic graphs.
FTree query construction for virtual screening: a statistical analysis
NASA Astrophysics Data System (ADS)
Gerlach, Christof; Broughton, Howard; Zaliani, Andrea
2008-02-01
FTrees (FT) is a known chemoinformatic tool able to condense molecular descriptions into a graph object and to search for actives in large databases using graph similarity. The query graph is classically derived from a known active molecule, or a set of actives, for which a similar compound has to be found. Recently, FT similarity has been extended to fragment space, widening its capabilities. If a user were able to build a knowledge-based FT query from information other than a known active structure, the similarity search could be combined with other, normally separate, fields like de-novo design or pharmacophore searches. With this aim in mind, we performed a comprehensive analysis of several databases in terms of FT description and provide a basic statistical analysis of the FT spaces so far at hand. Vendors' catalogue collections and MDDR as a source of potential or known "actives", respectively, have been used. With the results reported herein, a set of ranges, mean values and standard deviations for several query parameters are presented in order to set a reference guide for the users. Applications on how to use this information in FT query building are also provided, using a newly built 3D-pharmacophore from 57 5HT-1F agonists and a published one which was used for virtual screening for tRNA-guanine transglycosylase (TGT) inhibitors.
Graph-Based Semantic Web Service Composition for Healthcare Data Integration.
Arch-Int, Ngamnij; Arch-Int, Somjit; Sonsilphong, Suphachoke; Wanchai, Paweena
2017-01-01
Within the numerous and heterogeneous web services offered through different sources, automatic web services composition is the most convenient method for building complex business processes that permit invocation of multiple existing atomic services. The current solutions in functional web services composition lack autonomous queries of semantic matches within the parameters of web services, which are necessary in the composition of large-scale related services. In this paper, we propose a graph-based Semantic Web Services composition system consisting of two subsystems: management time and run time. The management-time subsystem is responsible for dependency graph preparation in which a dependency graph of related services is generated automatically according to the proposed semantic matchmaking rules. The run-time subsystem is responsible for discovering the potential web services and nonredundant web services composition of a user's query using a graph-based searching algorithm. The proposed approach was applied to healthcare data integration in different health organizations and was evaluated according to two aspects: execution time measurement and correctness measurement. PMID:29065602
Cyclone: java-based querying and computing with Pathway/Genome databases.
Le Fèvre, François; Smidtas, Serge; Schächter, Vincent
2007-05-15
Cyclone aims at facilitating the use of BioCyc, a collection of Pathway/Genome Databases (PGDBs). Cyclone provides a fully extensible Java Object API to analyze and visualize these data. Cyclone can read and write PGDBs, and can write its own data in the CycloneML format. This format is automatically generated from the BioCyc ontology by Cyclone itself, ensuring continued compatibility. Cyclone objects can also be stored in a relational database, CycloneDB. Queries can be written in SQL, and in an intuitive and concise object-oriented query language, Hibernate Query Language (HQL). In addition, Cyclone interfaces easily with Java software including the Eclipse IDE for HQL editing, the Jung API for graph algorithms or Cytoscape for graph visualization. Cyclone is freely available under an open source license at: http://sourceforge.net/projects/nemo-cyclone. For download and installation instructions, tutorials, use cases and examples, see http://nemo-cyclone.sourceforge.net.
Application of kernel functions for accurate similarity search in large chemical databases.
Wang, Xiaohong; Huan, Jun; Smalter, Aaron; Lushington, Gerald H
2010-04-29
Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening among others. It is widely believed that structure based methods provide an efficient way to do the query. Recently various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases due to the high computational complexity and the difficulties in indexing similarity search for large databases. To bridge graph kernel function and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed in our team, to measure similarity of graph represented chemicals. In our method, we utilize a hash table to support new graph kernel function definition, efficient storage and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases with smaller indexing size, and faster query processing time as compared to state-of-the-art indexing methods such as Daylight fingerprints, C-tree and GraphGrep. Efficient similarity query processing for large chemical databases is challenging since we need to balance running time efficiency and similarity search accuracy. Our previous similarity search method, G-hash, provides a new way to perform similarity search in chemical databases. Experimental study validates the utility of G-hash in chemical databases.
Jeong, Hyundoo; Yoon, Byung-Jun
2017-03-14
Network querying algorithms provide computational means to identify conserved network modules in large-scale biological networks that are similar to known functional modules, such as pathways or molecular complexes. Two main challenges for network querying algorithms are the high computational complexity of detecting potential isomorphism between the query and the target graphs and ensuring the biological significance of the query results. In this paper, we propose SEQUOIA, a novel network querying algorithm that effectively addresses these issues by utilizing a context-sensitive random walk (CSRW) model for network comparison and minimizing the network conductance of potential matches in the target network. The CSRW model, inspired by the pair hidden Markov model (pair-HMM) that has been widely used for sequence comparison and alignment, can accurately assess the node-to-node correspondence between different graphs by accounting for node insertions and deletions. The proposed algorithm identifies high-scoring network regions based on the CSRW scores, which are subsequently extended by maximally reducing the network conductance of the identified subnetworks. Performance assessment based on real PPI networks and known molecular complexes show that SEQUOIA outperforms existing methods and clearly enhances the biological significance of the query results. The source code and datasets can be downloaded from http://www.ece.tamu.edu/~bjyoon/SEQUOIA .
DOE Office of Scientific and Technical Information (OSTI.GOV)
Park, Yubin; Shankar, Mallikarjun; Park, Byung H.
Designing a database system for both efficient data management and data services has been one of the enduring challenges in the healthcare domain. In many healthcare systems, data services and data management are often viewed as two orthogonal tasks; data services refer to retrieval and analytic queries such as search, joins, statistical data extraction, and simple data mining algorithms, while data management refers to building error-tolerant and non-redundant database systems. The gap between service and management has resulted in rigid database systems and schemas that do not support effective analytics. We compose a rich graph structure from an abstracted healthcare RDBMS to illustrate how we can fill this gap in practice. We show how a healthcare graph can be automatically constructed from a normalized relational database using the proposed 3NF Equivalent Graph (3EG) transformation. We discuss a set of real world graph queries such as finding self-referrals, shared providers, and collaborative filtering, and evaluate their performance over a relational database and its 3EG-transformed graph. Experimental results show that the graph representation serves as multiple de-normalized tables, thus reducing complexity in a database and enhancing data accessibility of users. Based on this finding, we propose an ensemble framework of databases for healthcare applications.
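As a loose illustration of the idea in this abstract (foreign-key rows becoming graph edges, over which a query such as "find self-referrals" is a simple edge test), here is a minimal Python sketch. It is not the 3EG transformation itself, whose details the abstract does not give; the table layout, column names, and the reading of "self-referral" as an edge from a provider to itself are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical normalized table: one row per referral, with two foreign keys
# into a providers table. All names and values are invented for illustration.
referrals = [
    {"id": 1, "from_provider": "p1", "to_provider": "p2"},
    {"id": 2, "from_provider": "p2", "to_provider": "p2"},
    {"id": 3, "from_provider": "p3", "to_provider": "p1"},
]


def rows_to_graph(rows):
    """Turn each foreign-key pair into a directed edge of an adjacency map."""
    graph = defaultdict(set)
    for row in rows:
        graph[row["from_provider"]].add(row["to_provider"])
    return graph


def self_referrals(graph):
    """Return providers that have an edge pointing back to themselves."""
    return sorted(node for node, targets in graph.items() if node in targets)


graph = rows_to_graph(referrals)
print(self_referrals(graph))
```

On the relational side the same question would need a self-join or a comparison of two foreign-key columns; on the graph side it is a membership test per node, which is the kind of simplification the abstract attributes to the graph representation.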
MIMO: an efficient tool for molecular interaction maps overlap
2013-01-01
Background: Molecular pathways represent an ensemble of interactions occurring among molecules within the cell and between cells. The identification of similarities between molecular pathways across organisms and functions has a critical role in understanding complex biological processes. For the inference of such novel information, the comparison of molecular pathways requires accounting for imperfect matches (flexibility) and efficiently handling complex network topologies. To date, these characteristics are only partially available in tools designed to compare molecular interaction maps. Results: Our approach MIMO (Molecular Interaction Maps Overlap) addresses the first problem by allowing the introduction of gaps and mismatches between query and template pathways and permits, when necessary, supervised queries incorporating a priori biological information. It then addresses the second issue by relying directly on the rich graph topology described in the Systems Biology Markup Language (SBML) standard, and uses multidigraphs to efficiently handle multiple queries on biological graph databases. The algorithm has been successfully used here to highlight the contact point between various human pathways in the Reactome database. Conclusions: MIMO offers a flexible and efficient graph-matching tool for comparing complex biological pathways. PMID:23672344
BIM-GIS Integrated Geospatial Information Model Using Semantic Web and RDF Graphs
NASA Astrophysics Data System (ADS)
Hor, A.-H.; Jadidi, A.; Sohn, G.
2016-06-01
In recent years, 3D virtual indoor/outdoor urban modelling has become a key spatial information framework for many civil and engineering applications such as evacuation planning, emergency and facility management. To accomplish such sophisticated decision tasks, there is a large demand for building multi-scale and multi-sourced 3D urban models. Currently, Building Information Model (BIM) and Geographical Information Systems (GIS) are broadly used as the modelling sources. However, sharing data and exchanging information between the two modelling domains is still a huge challenge, while the syntactic or semantic approaches do not fully support exchanging the rich semantic and geometric information of BIM into GIS or vice versa. This paper proposes a novel approach for integrating BIM and GIS using semantic web technologies and Resource Description Framework (RDF) graphs. The novelty of the proposed solution comes from the benefits of integrating BIM and GIS technologies into one unified model, the so-called Integrated Geospatial Information Model (IGIM). The proposed approach consists of three main modules: construction of BIM-RDF and GIS-RDF graphs, integration of the two RDF graphs, and querying of information through the IGIM-RDF graph using SPARQL. The IGIM generates queries from both the BIM and GIS RDF graphs, resulting in a semantically integrated model with entities representing both BIM classes and GIS feature objects with respect to the target-client application. The linkage between BIM-RDF and GIS-RDF is achieved through SPARQL endpoints and defined by a query using a set of datasets and entity classes with complementary properties, relationships and geometries. To validate the proposed approach and its performance, a case study was also tested using the IGIM system design.
Constructing a Graph Database for Semantic Literature-Based Discovery.
Hristovski, Dimitar; Kastrin, Andrej; Dinevski, Dejan; Rindflesch, Thomas C
2015-01-01
Literature-based discovery (LBD) generates discoveries, or hypotheses, by combining what is already known in the literature. Potential discoveries have the form of relations between biomedical concepts; for example, a drug may be determined to treat a disease other than the one for which it was intended. LBD views the knowledge in a domain as a network; a set of concepts along with the relations between them. As a starting point, we used SemMedDB, a database of semantic relations between biomedical concepts extracted with SemRep from Medline. SemMedDB is distributed as a MySQL relational database, which has some problems when dealing with network data. We transformed and uploaded SemMedDB into the Neo4j graph database, and implemented the basic LBD discovery algorithms with the Cypher query language. We conclude that storing the data needed for semantic LBD is more natural in a graph database. Also, implementing LBD discovery algorithms is conceptually simpler with a graph query language when compared with standard SQL.
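The abstract says the basic LBD algorithms were written in Cypher but does not show them. As a hedged sketch of one common LBD formulation, often called open discovery (start concept A, intermediate concept B, candidate concept C that is not already directly linked to A), the following Python example works over an in-memory list of relations. The relations are invented for illustration and are not taken from SemMedDB, and the function is not the authors' implementation.

```python
# Hypothetical semantic relations (subject, predicate, object).
relations = [
    ("drug_x", "INHIBITS", "gene_g"),
    ("gene_g", "CAUSES", "disease_d"),
    ("drug_x", "TREATS", "disease_e"),
]


def open_discovery(start, relations):
    """Map each candidate C to the intermediates B linking start -> B -> C.

    A candidate is kept only if the start concept has no direct relation to it,
    since a direct relation is already known rather than a new hypothesis.
    """
    neighbours = {}
    for subj, _pred, obj in relations:
        neighbours.setdefault(subj, set()).add(obj)

    direct = neighbours.get(start, set())
    candidates = {}
    for b in direct:
        for c in neighbours.get(b, set()):
            if c != start and c not in direct:
                candidates.setdefault(c, set()).add(b)
    return candidates


print(open_discovery("drug_x", relations))
```

In a graph query language the same two-hop pattern with a "no direct edge" condition can be stated as a single path pattern, which is consistent with the abstract's conclusion that such algorithms are conceptually simpler there than in standard SQL.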
SPARK: Adapting Keyword Query to Semantic Search
NASA Astrophysics Data System (ADS)
Zhou, Qi; Wang, Chong; Xiong, Miao; Wang, Haofen; Yu, Yong
Semantic search promises to provide more accurate results than present-day keyword search. However, progress with semantic search has been delayed due to the complexity of its query languages. In this paper, we explore a novel approach of adapting keywords to querying the semantic web: the approach automatically translates keyword queries into formal logic queries so that end users can use familiar keywords to perform semantic search. A prototype system named 'SPARK' has been implemented in light of this approach. Given a keyword query, SPARK outputs a ranked list of SPARQL queries as the translation result. The translation in SPARK consists of three major steps: term mapping, query graph construction and query ranking. Specifically, a probabilistic query ranking model is proposed to select the most likely SPARQL query. In the experiment, SPARK achieved an encouraging translation result.
In-context query reformulation for failing SPARQL queries
NASA Astrophysics Data System (ADS)
Viswanathan, Amar; Michaelis, James R.; Cassidy, Taylor; de Mel, Geeth; Hendler, James
2017-05-01
Knowledge bases for decision support systems are growing increasingly complex, through continued advances in data ingest and management approaches. However, humans do not possess the cognitive capabilities to retain a bird's-eye view of such knowledge bases, and may end up issuing unsatisfiable queries to such systems. This work focuses on the implementation of a query reformulation approach for graph-based knowledge bases, specifically designed to support the Resource Description Framework (RDF). The reformulation approach presented is instance- and schema-aware. Thus, in contrast to relaxation techniques found in the state-of-the-art, the presented approach produces in-context query reformulation.
Graph-Based Weakly-Supervised Methods for Information Extraction & Integration
ERIC Educational Resources Information Center
Talukdar, Partha Pratim
2010-01-01
The variety and complexity of potentially-related data resources available for querying--webpages, databases, data warehouses--has been growing ever more rapidly. There is a growing need to pose integrative queries "across" multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse…
Visual Exploratory Search of Relationship Graphs on Smartphones
Ouyang, Jianquan; Zheng, Hao; Kong, Fanbin; Liu, Tianming
2013-01-01
This paper presents a novel framework for Visual Exploratory Search of Relationship Graphs on Smartphones (VESRGS) that is composed of three major components: inference and representation of semantic relationship graphs on the Web via meta-search, visual exploratory search of relationship graphs through both querying and browsing strategies, and human-computer interactions via the multi-touch interface and mobile Internet on smartphones. In comparison with traditional lookup search methodologies, the proposed VESRGS system is characterized by the following perceived advantages. 1) It infers rich semantic relationships between the querying keywords and other related concepts from large-scale meta-search results from Google, Yahoo! and Bing search engines, and represents semantic relationships via graphs; 2) the exploratory search approach empowers users to naturally and effectively explore and discover knowledge in a rich information world of interlinked relationship graphs in a personalized fashion; 3) it effectively takes advantage of smartphones’ user-friendly interfaces, ubiquitous Internet connection and portability. Our extensive experimental results have demonstrated that the VESRGS framework can significantly improve the users’ capability of seeking the relationship information most relevant to their own specific needs. We envision that the VESRGS framework can be a starting point for future exploration of novel, effective search strategies in the mobile Internet era. PMID:24223936
High-performance analysis of filtered semantic graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Buluc, Aydin; Fox, Armando; Gilbert, John R.
2012-01-01
High performance is a crucial consideration when executing a complex analytic query on a massive semantic graph. In a semantic graph, vertices and edges carry "attributes" of various types. Analytic queries on semantic graphs typically depend on the values of these attributes; thus, the computation must either view the graph through a filter that passes only those individual vertices and edges of interest, or else must first materialize a subgraph or subgraphs consisting of only the vertices and edges of interest. The filtered approach is superior due to its generality, ease of use, and memory efficiency, but may carry a performance cost. In the Knowledge Discovery Toolbox (KDT), a Python library for parallel graph computations, the user writes filters in a high-level language, but those filters result in relatively low performance due to the bottleneck of having to call into the Python interpreter for each edge. In this work, we use the Selective Embedded JIT Specialization (SEJITS) approach to automatically translate filters defined by programmers into a lower-level efficiency language, bypassing the upcall into Python. We evaluate our approach by comparing it with the high-performance C++/MPI Combinatorial BLAS engine, and show that the productivity gained by using a high-level filtering language comes without sacrificing performance.
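The two strategies contrasted in this abstract, filtering edges during traversal versus first materializing the filtered subgraph, can be sketched in plain Python as follows. This is an illustration only: it does not use KDT, SEJITS, or Combinatorial BLAS, and the edge list, attribute names, and helper functions are hypothetical.

```python
from collections import deque

# Hypothetical attributed edges: (source, target, attributes).
edges = [
    ("a", "b", {"type": "follows"}),
    ("b", "c", {"type": "retweets"}),
    ("a", "d", {"type": "retweets"}),
    ("d", "e", {"type": "retweets"}),
]


def bfs_filtered(edges, start, keep):
    """Breadth-first search that calls the filter on each edge it considers."""
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for src, dst, attrs in edges:
            if src == node and dst not in visited and keep(attrs):
                visited.add(dst)
                queue.append(dst)
    return visited


def materialize(edges, keep):
    """Build the filtered edge list once, before any traversal."""
    return [edge for edge in edges if keep(edge[2])]


def is_retweet(attrs):
    return attrs["type"] == "retweets"


def keep_all(_attrs):
    return True


on_the_fly = bfs_filtered(edges, "a", is_retweet)
materialized = bfs_filtered(materialize(edges, is_retweet), "a", keep_all)
print(sorted(on_the_fly), sorted(materialized))
```

Both calls reach the same vertices. The on-the-fly version avoids building a second edge list but invokes the filter once per edge examined, which is the per-edge interpreter cost the abstract identifies as the bottleneck.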
High-Performance Data Analytics Beyond the Relational and Graph Data Models with GEMS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castellana, Vito G.; Minutoli, Marco; Bhatt, Shreyansh
Graphs represent an increasingly popular data model for data-analytics, since they can naturally represent relationships and interactions between entities. Relational databases and their pure table-based data model are not well suited to storing and processing sparse data. Consequently, graph databases have gained interest in the last few years and the Resource Description Framework (RDF) became the standard data model for graph data. Nevertheless, while RDF is well suited to analyze the relationships between the entities, it is not efficient in representing their attributes and properties. In this work we propose the adoption of a new hybrid data model, based on attributed graphs, that aims at overcoming the limitations of the pure relational and graph data models. We present how we have re-designed the GEMS data-analytics framework to fully take advantage of the proposed hybrid data model. To improve analysts' productivity, in addition to a C++ API for applications development, we adopt GraQL as input query language. We validate our approach by implementing a set of queries on net-flow data and we compare our framework's performance against Neo4j. Experimental results show significant performance improvement over Neo4j, up to several orders of magnitude when increasing the size of the input data.
NASA Astrophysics Data System (ADS)
Hornung, Thomas; Simon, Kai; Lausen, Georg
Combining information from different Web sources often results in a tedious and repetitive process; e.g., even simple information requests might require iterating over the result list of one Web query and using each single result as input for a subsequent query. One approach to such chained queries is data-centric mashups, which allow the data flow to be modeled visually as a graph, where the nodes represent the data sources and the edges the data flow.
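The chained-query pattern described here (iterate over the results of one query and feed each result into the next) can be sketched as a small Python function. The two lookup functions stand in for Web sources and are entirely hypothetical; a real mashup would call remote services instead.

```python
def chain(first_query, second_query, request):
    """Run first_query, then run second_query once per result and collect all output."""
    results = []
    for item in first_query(request):
        results.extend(second_query(item))
    return results


# Hypothetical stand-ins for two Web sources (in-memory lookups, no network access).
def find_authors(topic):
    return {"graphs": ["alice", "bob"]}.get(topic, [])


def find_papers(author):
    return {"alice": ["p1", "p2"], "bob": ["p3"]}.get(author, [])


print(chain(find_authors, find_papers, "graphs"))
```

Viewed as a data-flow graph, `find_authors` and `find_papers` are the nodes and `chain` is the edge that carries each result of the first into the second.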
EvoGraph: On-The-Fly Efficient Mining of Evolving Graphs on GPU
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sengupta, Dipanjan; Song, Shuaiwen
With the prevalence of the World Wide Web and social networks, there has been a growing interest in high performance analytics for constantly-evolving dynamic graphs. Modern GPUs provide a massive amount of parallelism for efficient graph processing, but challenges remain due to their lack of support for the near real-time streaming nature of dynamic graphs. Specifically, due to the current high volume and velocity of graph data combined with the complexity of user queries, traditional processing methods that first store the updates and then repeatedly run static graph analytics on a sequence of versions or snapshots are deemed undesirable and computationally infeasible on GPU. We present EvoGraph, a highly efficient and scalable GPU-based dynamic graph analytics framework.
Hewitt, Robin; Gobbi, Alberto; Lee, Man-Ling
2005-01-01
Relational databases are the current standard for storing and retrieving data in the pharmaceutical and biotech industries. However, retrieving data from a relational database requires specialized knowledge of the database schema and of the SQL query language. At Anadys, we have developed an easy-to-use system for searching and reporting data in a relational database to support our drug discovery project teams. This system is fast and flexible and allows users to access all data without having to write SQL queries. This paper presents the hierarchical, graph-based metadata representation and SQL-construction methods that, together, are the basis of this system's capabilities.
GBA manager: an online tool for querying low-complexity regions in proteins.
Bandyopadhyay, Nirmalya; Kahveci, Tamer
2010-01-01
We developed GBA Manager, an online software tool that facilitates the Graph-Based Algorithm (GBA) we proposed in our earlier work. GBA identifies the low-complexity regions (LCR) of protein sequences. GBA exploits a similarity matrix, such as BLOSUM62, to compute the complexity of the subsequences of the input protein sequence. It uses a graph-based algorithm to accurately compute the regions that have low complexities. GBA Manager is a user-friendly web service that enables online querying of protein sequences using GBA. In addition to the querying capabilities of the existing GBA algorithm, GBA Manager computes the p-values of the LCR identified. The p-value gives an estimate of the possibility that the region appears by chance. GBA Manager presents the output in three different understandable formats. GBA Manager is freely accessible at http://bioinformatics.cise.ufl.edu/GBA/GBA.htm .
Zhang, Guo-Qiang; Luo, Lingyun; Ogbuji, Chime; Joslyn, Cliff; Mejino, Jose; Sahoo, Satya S
2012-01-01
The interaction of multiple types of relationships among anatomical classes in the Foundational Model of Anatomy (FMA) can provide inferred information valuable for quality assurance. This paper introduces a method called Motif Checking (MOCH) to study the effects of such multi-relation type interactions for detecting logical inconsistencies as well as other anomalies represented by the motifs. MOCH represents patterns of multi-type interaction as small labeled (with multiple types of edges) sub-graph motifs, whose nodes represent class variables, and labeled edges represent relational types. By representing FMA as an RDF graph and motifs as SPARQL queries, fragments of FMA are automatically obtained as auditing candidates. Leveraging the scalability and reconfigurability of Semantic Web Technology, we performed exhaustive analyses of a variety of labeled sub-graph motifs. The quality assurance feature of MOCH comes from the distinct use of a subset of the edges of the graph motifs as constraints for disjointness, whereby bringing in rule-based flavor to the approach as well. With possible disjointness implied by antonyms, we performed manual inspection of the resulting FMA fragments and tracked down sources of abnormal inferred conclusions (logical inconsistencies), which are amendable for programmatic revision of the FMA. Our results demonstrate that MOCH provides a unique source of valuable information for quality assurance. Since our approach is general, it is applicable to any ontological system with an OWL representation.
Zhang, Guo-Qiang; Luo, Lingyun; Ogbuji, Chime; Joslyn, Cliff; Mejino, Jose; Sahoo, Satya S
2012-01-01
The interaction of multiple types of relationships among anatomical classes in the Foundational Model of Anatomy (FMA) can provide inferred information valuable for quality assurance. This paper introduces a method called Motif Checking (MOCH) to study the effects of such multi-relation type interactions for detecting logical inconsistencies as well as other anomalies represented by the motifs. MOCH represents patterns of multi-type interaction as small labeled (with multiple types of edges) sub-graph motifs, whose nodes represent class variables, and labeled edges represent relational types. By representing FMA as an RDF graph and motifs as SPARQL queries, fragments of FMA are automatically obtained as auditing candidates. Leveraging the scalability and reconfigurability of Semantic Web Technology, we performed exhaustive analyses of a variety of labeled sub-graph motifs. The quality assurance feature of MOCH comes from the distinct use of a subset of the edges of the graph motifs as constraints for disjointness, thereby bringing a rule-based flavor to the approach as well. With possible disjointness implied by antonyms, we performed manual inspection of the resulting FMA fragments and tracked down sources of abnormal inferred conclusions (logical inconsistencies), which are amenable to programmatic revision of the FMA. Our results demonstrate that MOCH provides a unique source of valuable information for quality assurance. Since our approach is general, it is applicable to any ontological system with an OWL representation. PMID:23304382
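As a rough illustration of the motif idea in this abstract, the sketch below matches a two-edge motif over a handful of triples and keeps the matches whose two targets are declared disjoint. The triples, the relation names, the disjointness set, and the `find_motif_violations` helper are all hypothetical; the paper expresses motifs as SPARQL queries over an RDF version of FMA, which this plain-Python stand-in only imitates.

```python
from itertools import product

# Hypothetical toy triples (subject, relation, object); not taken from FMA.
TRIPLES = {
    ("LeftLung", "part_of", "Thorax"),
    ("LeftLung", "is_a", "Lung"),
    ("Cell", "part_of", "PhysicalEntity"),
    ("Cell", "is_a", "ImmaterialEntity"),
}

# Hypothetical disjointness constraints (e.g., suggested by antonyms).
DISJOINT = {frozenset({"PhysicalEntity", "ImmaterialEntity"})}


def find_motif_violations(triples, disjoint, rel_a, rel_b):
    """Match the motif (?x rel_a ?y), (?x rel_b ?z) and keep the matches
    in which ?y and ?z are declared disjoint."""
    edges_a = [(s, o) for s, r, o in triples if r == rel_a]
    edges_b = [(s, o) for s, r, o in triples if r == rel_b]
    hits = []
    for (x1, y), (x2, z) in product(edges_a, edges_b):
        # Both edges must start at the same class variable ?x.
        if x1 == x2 and frozenset({y, z}) in disjoint:
            hits.append((x1, y, z))
    return sorted(hits)


violations = find_motif_violations(TRIPLES, DISJOINT, "part_of", "is_a")
print(violations)
```

Each returned tuple is only an auditing candidate; in the workflow the abstract describes, such fragments are then inspected manually.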
VPipe: Virtual Pipelining for Scheduling of DAG Stream Query Plans
NASA Astrophysics Data System (ADS)
Wang, Song; Gupta, Chetan; Mehta, Abhay
There are data streams all around us that can be harnessed for tremendous business and personal advantage. For an enterprise-level stream processing system such as CHAOS [1] (Continuous, Heterogeneous Analytic Over Streams), handling of complex query plans with resource constraints is challenging. While several scheduling strategies exist for stream processing, efficient scheduling of complex DAG query plans is still largely unsolved. In this paper, we propose a novel execution scheme for scheduling complex directed acyclic graph (DAG) query plans with meta-data enriched stream tuples. Our solution, called Virtual Pipelined Chain (or VPipe Chain for short), effectively extends the "Chain" pipelining scheduling approach to complex DAG query plans.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brost, Randolph C.; McLendon, William Clarence,
2013-01-01
Modeling geospatial information with semantic graphs enables search for sites of interest based on relationships between features, without requiring strong a priori models of feature shape or other intrinsic properties. Geospatial semantic graphs can be constructed from raw sensor data with suitable preprocessing to obtain a discretized representation. This report describes initial work toward extending geospatial semantic graphs to include temporal information, and initial results applying semantic graph techniques to SAR image data. We describe an efficient graph structure that includes geospatial and temporal information, which is designed to support simultaneous spatial and temporal search queries. We also report a preliminary implementation of feature recognition, semantic graph modeling, and graph search based on input SAR data. The report concludes with lessons learned and suggestions for future improvements.
Accelerating semantic graph databases on commodity clusters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morari, Alessandro; Castellana, Vito G.; Haglin, David J.
We are developing a full software system for accelerating semantic graph databases on commodity clusters that scales to hundreds of nodes while maintaining constant query throughput. Our framework comprises a SPARQL-to-C++ compiler, a library of parallel graph methods, and a custom multithreaded runtime layer, which provides a Partitioned Global Address Space (PGAS) programming model with fork/join parallelism and automatic load balancing over commodity clusters. We present preliminary results for the compiler and for the runtime.
An asynchronous traversal engine for graph-based rich metadata management
Dai, Dong; Carns, Philip; Ross, Robert B.; ...
2016-06-23
Rich metadata in high-performance computing (HPC) systems contains extended information about users, jobs, data files, and their relationships. Property graphs are a promising data model to represent heterogeneous rich metadata flexibly. Specifically, a property graph can use vertices to represent different entities and edges to record the relationships between vertices with unique annotations. The high-volume HPC use case, with millions of entities and relationships, naturally requires an out-of-core distributed property graph database, which must support live updates (to ingest production information in real time), low-latency point queries (for frequent metadata operations such as permission checking), and large-scale traversals (for provenance data mining). Among these needs, large-scale property graph traversals are particularly challenging for distributed graph storage systems. Most existing graph systems implement a "level synchronous" breadth-first search algorithm that relies on global synchronization in each traversal step. This performs well in many problem domains; but a rich metadata management system is characterized by imbalanced graphs, long traversal lengths, and concurrent workloads, each of which has the potential to introduce or exacerbate stragglers (i.e., abnormally slow steps or servers in a graph traversal) that lead to low overall throughput for synchronous traversal algorithms. Previous research indicated that the straggler problem can be mitigated by using asynchronous traversal algorithms, and many graph-processing frameworks have successfully demonstrated this approach. Such systems require the graph to be loaded into a separate batch-processing framework instead of being iteratively accessed, however. In this work, we investigate a general asynchronous graph traversal engine that can operate atop a rich metadata graph in its native format.
We outline a traversal-aware query language and key optimizations (traversal-affiliate caching and execution merging) necessary for efficient performance. We further explore the effect of different graph partitioning strategies on the traversal performance for both synchronous and asynchronous traversal engines. Our experiments show that the asynchronous graph traversal engine is more efficient than its synchronous counterpart in the case of HPC rich metadata processing, where more servers are involved and larger traversals are needed. Furthermore, the asynchronous traversal engine is more adaptive to different graph partitioning strategies.
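To make the "level synchronous" baseline mentioned in this abstract concrete, here is a minimal single-process sketch of breadth-first search that expands one whole frontier per step. The graph, the vertex names, and the `level_synchronous_bfs` function are hypothetical and are not taken from the paper; the global barrier of a distributed engine is only imitated by the outer loop.

```python
def level_synchronous_bfs(adjacency, source):
    """Return {vertex: level}, expanding one whole frontier per step."""
    levels = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # In a distributed engine, every server would have to finish this
        # loop before any server may start the next level (global barrier).
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in levels:
                    levels[neighbor] = depth
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return levels


# Hypothetical rich-metadata graph: a user runs jobs, jobs touch files.
GRAPH = {
    "user:alice": ["job:1", "job:2"],
    "job:1": ["file:a"],
    "job:2": ["file:a", "file:b"],
    "file:a": [],
    "file:b": [],
}

print(level_synchronous_bfs(GRAPH, "user:alice"))
```

Because no vertex of level k+1 is touched until all of level k is done, one slow server at any level delays the whole traversal, which is the straggler effect the abstract attributes to synchronous engines.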
In-Memory Graph Databases for Web-Scale Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castellana, Vito G.; Morari, Alessandro; Weaver, Jesse R.
RDF databases have emerged as one of the most relevant ways of organizing, integrating, and managing exponentially growing, often heterogeneous, and not rigidly structured data for a variety of scientific and commercial fields. In this paper we discuss the solutions integrated in GEMS (Graph database Engine for Multithreaded Systems), a software framework for implementing RDF databases on commodity, distributed-memory high-performance clusters. Unlike the majority of current RDF databases, GEMS has been designed from the ground up to primarily employ graph-based methods. This is reflected in all the layers of its stack. The GEMS framework is composed of: a SPARQL-to-C++ compiler, a library of data structures and related methods to access and modify them, and a custom runtime providing lightweight software multithreading, network message aggregation, and a partitioned global address space. We provide an overview of the framework, detailing its components and how they have been closely designed and customized to address issues of graph methods applied to large-scale datasets on clusters. We discuss in detail the principles that enable automatic translation of the queries (expressed in SPARQL, the query language of choice for RDF databases) to graph methods, and identify differences with respect to other RDF databases.
A Model of Knowledge Based Information Retrieval with Hierarchical Concept Graph.
ERIC Educational Resources Information Center
Kim, Young Whan; Kim, Jin H.
1990-01-01
Proposes a model of knowledge-based information retrieval (KBIR) that is based on a hierarchical concept graph (HCG) which shows relationships between index terms and constitutes a hierarchical thesaurus as a knowledge base. Conceptual distance between a query and an object is discussed and the use of Boolean operators is described. (25…
One Shot Detection with Laplacian Object and Fast Matrix Cosine Similarity.
Biswas, Sujoy Kumar; Milanfar, Peyman
2016-03-01
One shot, generic object detection involves searching for a single query object in a larger target image. Relevant approaches have benefited from features that typically model the local similarity patterns. In this paper, we combine local similarity (encoded by local descriptors) with a global context (i.e., a graph structure) of pairwise affinities among the local descriptors, embedding the query descriptors into a low dimensional but discriminatory subspace. Unlike principal components that preserve global structure of feature space, we actually seek a linear approximation to the Laplacian eigenmap that permits us a locality preserving embedding of high dimensional region descriptors. Our second contribution is an accelerated but exact computation of matrix cosine similarity as the decision rule for detection, obviating the computationally expensive sliding window search. We leverage the power of Fourier transform combined with integral image to achieve superior runtime efficiency that allows us to test multiple hypotheses (for pose estimation) within a reasonably short time. Our approach to one shot detection is training-free, and experiments on the standard data sets confirm the efficacy of our model. Besides, low computation cost of the proposed (codebook-free) object detector facilitates rather straightforward query detection in large data sets including movie videos.
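The decision rule named in this abstract, matrix cosine similarity, is commonly defined as the Frobenius inner product of two equally sized feature matrices divided by the product of their Frobenius norms. The sketch below computes it naively under that assumed definition; the function name and the sample matrices are hypothetical, and it does not reproduce the Fourier-transform and integral-image acceleration the authors describe.

```python
import math


def matrix_cosine_similarity(a, b):
    """Frobenius inner product of a and b over the product of their norms."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    inner = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    norm_a = math.sqrt(sum(x * x for row in a for x in row))
    norm_b = math.sqrt(sum(y * y for row in b for y in row))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("similarity is undefined for an all-zero matrix")
    return inner / (norm_a * norm_b)


# Hypothetical feature matrices (rows could be region descriptors).
QUERY = [[1.0, 2.0], [3.0, 4.0]]
SCALED = [[2.0, 4.0], [6.0, 8.0]]        # same direction as QUERY
ORTHOGONAL = [[2.0, -1.0], [4.0, -3.0]]  # zero inner product with QUERY

print(matrix_cosine_similarity(QUERY, SCALED))
print(matrix_cosine_similarity(QUERY, ORTHOGONAL))
```

The value depends only on the direction of the stacked features, not on their overall scale, so a uniformly rescaled copy of the query still scores close to 1.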
Analyzing and Synthesizing Phylogenies Using Tree Alignment Graphs
Smith, Stephen A.; Brown, Joseph W.; Hinchliff, Cody E.
2013-01-01
Phylogenetic trees are used to analyze and visualize evolution. However, trees can be imperfect datatypes when summarizing multiple trees. This is especially problematic when accommodating biological phenomena such as horizontal gene transfer, incomplete lineage sorting, and hybridization, as well as topological conflict between datasets. Additionally, researchers may want to combine information from sets of trees that have partially overlapping taxon sets. To address the problem of analyzing sets of trees with conflicting relationships and partially overlapping taxon sets, we introduce methods for aligning, synthesizing and analyzing rooted phylogenetic trees within a graph, called a tree alignment graph (TAG). The TAG can be queried and analyzed to explore uncertainty and conflict. It can also be synthesized to construct trees, presenting an alternative to supertree approaches. We demonstrate these methods with two empirical datasets. In order to explore uncertainty, we constructed a TAG of the bootstrap trees from the Angiosperm Tree of Life project. Analysis of the resulting graph demonstrates that areas of the dataset that are unresolved in majority-rule consensus tree analyses can be understood in more detail within the context of a graph structure, using measures incorporating node degree and adjacency support. As an exercise in synthesis (i.e., summarization of a TAG constructed from the alignment trees), we also construct a TAG consisting of the taxonomy and source trees from a recent comprehensive bird study. We synthesized this graph into a tree that can be reconstructed in a repeatable fashion and where the underlying source information can be updated. The methods presented here are tractable for large scale analyses and serve as a basis for an alternative to consensus tree and supertree methods. Furthermore, the exploration of these graphs can expose structures and patterns within the dataset that are otherwise difficult to observe. PMID:24086118
Dogrusoz, U; Erson, E Z; Giral, E; Demir, E; Babur, O; Cetintas, A; Colak, R
2006-02-01
Patikaweb provides a Web interface for retrieving and analyzing biological pathways in the Patika database, which contains data integrated from various prominent public pathway databases. It features a user-friendly interface, dynamic visualization and automated layout, advanced graph-theoretic queries for extracting biologically important phenomena, local persistence capability and exporting facilities to various pathway exchange formats.
Toward An Unstructured Mesh Database
NASA Astrophysics Data System (ADS)
Rezaei Mahdiraji, Alireza; Baumann, Peter
2014-05-01
Unstructured meshes are used in several application domains, such as earth sciences (e.g., seismology), medicine, oceanography, climate modeling, and GIS, as approximate representations of physical objects. Meshes subdivide a domain into smaller geometric elements (called cells) which are glued together by incidence relationships. The subdivision of a domain allows computational manipulation of complicated physical structures. For instance, seismologists model earthquakes using elastic wave propagation solvers on hexahedral meshes. The hexahedral mesh contains several hundred million grid points and millions of hexahedral cells. Each vertex node in the hexahedral mesh stores a multitude of data fields. To run simulations on such meshes, one needs to iterate over all the cells, iterate over the cells incident to a given cell, retrieve coordinates of cells, assign data values to cells, etc. Although meshes are used in many application domains, to the best of our knowledge no database vendor supports unstructured mesh features. Currently, the main tools for querying and manipulating unstructured meshes are mesh libraries, e.g., CGAL and GrAL. Mesh libraries are dedicated libraries which include mesh algorithms and can be run on mesh representations. The libraries do not scale with dataset size, do not have a declarative query language, and need deep C++ knowledge for query implementations. Furthermore, due to high coupling between the implementations and the input file structure, the implementations are less reusable and costly to maintain. A dedicated mesh database offers the following advantages: 1) declarative querying, 2) ease of maintenance, 3) hiding mesh storage structure from applications, and 4) transparent query optimization. To design a mesh database, the first challenge is to define a suitable generic data model for unstructured meshes.
We proposed the ImG-Complexes data model as a generic topological mesh data model which extends the incidence graph model to multi-incidence relationships. We equip the ImG model with sets of optional and application-specific constraints which can be used to check the validity of meshes for a specific class of objects, such as manifold, pseudo-manifold, and simplicial manifold. We conducted experiments to measure the performance of the graph database solution in processing mesh queries and compared it with the GrAL mesh library and the PostgreSQL database on synthetic and real mesh datasets. The experiments show that each system performs well on specific types of mesh queries, e.g., graph databases perform well on global path-intensive queries. In the future, we will investigate database operations for the ImG model and design a mesh query language.
Unapparent Information Revelation: Text Mining for Counterterrorism
NASA Astrophysics Data System (ADS)
Srihari, Rohini K.
Unapparent information revelation (UIR) is a special case of text mining that focuses on detecting possible links between concepts across multiple text documents by generating an evidence trail explaining the connection. A traditional search involving, for example, two or more person names will attempt to find documents mentioning both these individuals. This research focuses on a different interpretation of such a query: what is the best evidence trail across documents that explains a connection between these individuals? For example, all may be good golfers. A generalization of this task involves query terms representing general concepts (e.g., indictment, foreign policy). Previous approaches to this problem have focused on graph mining involving hyperlinked documents, and link analysis exploiting named entities. A new robust framework is presented, based on (i) generating concept chain graphs, a hybrid content representation, (ii) performing graph matching to select candidate subgraphs, and (iii) subsequently using graphical models to validate hypotheses using ranked evidence trails. We adapt the DUC data set for cross-document summarization to evaluate evidence trails generated by this approach.
A Visual Analytics Paradigm Enabling Trillion-Edge Graph Exploration
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wong, Pak C.; Haglin, David J.; Gillen, David S.
We present a visual analytics paradigm and a system prototype for exploring web-scale graphs. A web-scale graph is described as a graph with ~one trillion edges and ~50 billion vertices. While there is an aggressive R&D effort in processing and exploring web-scale graphs among internet vendors such as Facebook and Google, visualizing a graph of that scale still remains an underexplored R&D area. The paper describes a nontraditional peek-and-filter strategy that facilitates the exploration of a graph database of unprecedented size for visualization and analytics. We demonstrate that our system prototype can 1) preprocess a graph with ~25 billion edges in less than two hours and 2) support database query and visualization on the processed graph database afterward. Based on our computational performance results, we argue that we most likely will achieve the one trillion edge mark (a computational performance improvement of 40 times) for graph visual analytics in the near future.
Scenario driven data modelling: a method for integrating diverse sources of data and data streams
Brettin, Thomas S.; Cottingham, Robert W.; Griffith, Shelton D.; Quest, Daniel J.
2015-09-08
A system and method of integrating diverse sources of data and data streams is presented. The method can include selecting a scenario based on a topic, creating a multi-relational directed graph based on the scenario, identifying and converting resources in accordance with the scenario and updating the multi-directed graph based on the resources, identifying data feeds in accordance with the scenario and updating the multi-directed graph based on the data feeds, identifying analytical routines in accordance with the scenario and updating the multi-directed graph using the analytical routines and identifying data outputs in accordance with the scenario and defining queries to produce the data outputs from the multi-directed graph.
NASA Astrophysics Data System (ADS)
Kuznetsov, Valentin; Riley, Daniel; Afaq, Anzar; Sekhri, Vijay; Guo, Yuyi; Lueking, Lee
2010-04-01
The CMS experiment has implemented a flexible and powerful system enabling users to find data within the CMS physics data catalog. The Dataset Bookkeeping Service (DBS) comprises a database and the services used to store and access metadata related to CMS physics data. To this, we have added a generalized query system in addition to the existing web and programmatic interfaces to the DBS. This query system is based on a query language that hides the complexity of the underlying database structure by discovering the join conditions between database tables. This provides a way of querying the system that is simple and straightforward for CMS data managers and physicists to use without requiring knowledge of the database tables or keys. The DBS Query Language uses the ANTLR tool to build the input query parser and tokenizer, followed by a query builder that uses a graph representation of the DBS schema to construct the SQL query sent to the underlying database. We will describe the design of the query system, provide details of the language components, and give an overview of how this component fits into the overall data discovery system architecture.
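One way to picture how a query builder can discover join conditions from a graph representation of a schema is a shortest-path search over tables. The sketch below is a guess at the general technique, not the DBS implementation: the table names, join conditions, and the `find_join_path` and `build_sql` helpers are all hypothetical.

```python
from collections import deque

# Hypothetical schema graph: an edge between two tables carries the
# join condition that links them. These are not the real DBS tables.
SCHEMA_EDGES = {
    ("dataset", "block"): "block.dataset_id = dataset.id",
    ("block", "file"): "file.block_id = block.id",
    ("file", "run"): "run.file_id = file.id",
}


def build_adjacency(edges):
    """Turn the edge dictionary into an undirected adjacency list."""
    adjacency = {}
    for (left, right), condition in edges.items():
        adjacency.setdefault(left, []).append((right, condition))
        adjacency.setdefault(right, []).append((left, condition))
    return adjacency


def find_join_path(edges, start, goal):
    """Breadth-first search for the shortest chain of joins.

    Returns a list of (table, join_condition) steps, or None if the
    two tables are not connected in the schema graph."""
    adjacency = build_adjacency(edges)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for neighbor, condition in adjacency.get(table, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(neighbor, condition)]))
    return None


def build_sql(edges, select_column, start, goal, where):
    """Assemble a SELECT statement whose joins follow the discovered path."""
    path = find_join_path(edges, start, goal)
    if path is None:
        raise ValueError(f"no join path from {start} to {goal}")
    parts = [f"SELECT {select_column} FROM {start}"]
    parts += [f"JOIN {table} ON {condition}" for table, condition in path]
    parts.append(f"WHERE {where}")
    return " ".join(parts)


# The filter value is left as a bind placeholder rather than inlined.
print(build_sql(SCHEMA_EDGES, "file.name", "dataset", "file", "dataset.name = ?"))
```

The user names only the column wanted and the filter; the intermediate `block` table is filled in by the search, which is the kind of hiding of table structure the abstract describes.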
Simultaneously Discovering and Localizing Common Objects in Wild Images.
Wang, Zhenzhen; Yuan, Junsong
2018-09-01
Motivated by the recent success of supervised and weakly supervised common object discovery, in this paper, we go one step further to tackle common object discovery in a fully unsupervised way. Generally, object co-localization aims at simultaneously localizing objects of the same class across a group of images. Traditional object localization/detection usually trains specific object detectors which require bounding box annotations of object instances, or at least image-level labels to indicate the presence/absence of objects in an image. Given a collection of images without any annotations, our proposed fully unsupervised method simultaneously discovers images that contain common objects and localizes common objects in the corresponding images. Without needing to know the total number of common objects, we formulate this unsupervised object discovery as a sub-graph mining problem on a weighted graph of object proposals, where nodes correspond to object proposals, and edges represent the similarities between neighbouring proposals. The positive images and common objects are jointly discovered by finding sub-graphs of strongly connected nodes, with each sub-graph capturing one object pattern. The optimization problem can be efficiently solved by our proposed maximal-flow-based algorithm. Instead of assuming that each image contains only one common object, our proposed solution can better address wild images where each image may contain multiple common objects or even no common object. Moreover, our proposed method can be easily tailored to the task of image retrieval in which the nodes correspond to the similarity between query and reference images. Extensive experiments on PASCAL VOC 2007 and Object Discovery data sets demonstrate that even without any supervision, our approach can discover/localize common objects of various classes in the presence of scale, view point, appearance variation, and partial occlusions.
We also conduct broad experiments on image retrieval benchmarks, Holidays and Oxford5k data sets, to show that our proposed method, which considers both the similarity between query and reference images and also similarities among reference images, can help to improve the retrieval results significantly.
Big Data Analytics with Datalog Queries on Spark.
Shkapsky, Alexander; Yang, Mohan; Interlandi, Matteo; Chiu, Hsuan; Condie, Tyson; Zaniolo, Carlo
2016-01-01
There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics.
Big Data Analytics with Datalog Queries on Spark
Shkapsky, Alexander; Yang, Mohan; Interlandi, Matteo; Chiu, Hsuan; Condie, Tyson; Zaniolo, Carlo
2017-01-01
There is great interest in exploiting the opportunity provided by cloud computing platforms for large-scale analytics. Among these platforms, Apache Spark is growing in popularity for machine learning and graph analytics. Developing efficient complex analytics in Spark requires deep understanding of both the algorithm at hand and the Spark API or subsystem APIs (e.g., Spark SQL, GraphX). Our BigDatalog system addresses the problem by providing concise declarative specification of complex queries amenable to efficient evaluation. Towards this goal, we propose compilation and optimization techniques that tackle the important problem of efficiently supporting recursion in Spark. We perform an experimental comparison with other state-of-the-art large-scale Datalog systems and verify the efficacy of our techniques and effectiveness of Spark in supporting Datalog-based analytics. PMID:28626296
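Recursion is the part of Datalog that this abstract singles out as hard to support efficiently. As a small, single-machine illustration, the sketch below evaluates the classic transitive-closure program with textbook semi-naive evaluation, where each round joins only the facts derived in the previous round. It is not BigDatalog's Spark implementation; the `transitive_closure` function and the sample arcs are hypothetical.

```python
def transitive_closure(arcs):
    """Semi-naive evaluation of the Datalog program

        tc(X, Y) <- arc(X, Y).
        tc(X, Y) <- tc(X, Z), arc(Z, Y).
    """
    # Index arc by its first argument so the join on Z is a lookup.
    successors = {}
    for src, dst in arcs:
        successors.setdefault(src, set()).add(dst)

    closure = set(arcs)  # all tc facts found so far
    delta = set(arcs)    # tc facts that were new in the previous round
    while delta:
        derived = {
            (x, y)
            for x, z in delta
            for y in successors.get(z, ())
        }
        delta = derived - closure  # keep only facts not seen before
        closure |= delta
    return closure


# Hypothetical input relation: a simple chain a -> b -> c -> d.
ARCS = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(transitive_closure(ARCS)))
```

The loop stops at the fixpoint, when a round derives nothing new, so cyclic inputs terminate as well.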
Perkins, David Nikolaus; Brost, Randolph; Ray, Lawrence P.
2017-08-08
Various technologies for facilitating analysis of large remote sensing and geolocation datasets to identify features of interest are described herein. A search query can be submitted to a computing system that executes searches over a geospatial temporal semantic (GTS) graph to identify features of interest. The GTS graph comprises nodes corresponding to objects described in the remote sensing and geolocation datasets, and edges that indicate geospatial or temporal relationships between pairs of nodes in the nodes. Trajectory information is encoded in the GTS graph by the inclusion of movable nodes to facilitate searches for features of interest in the datasets relative to moving objects such as vehicles.
NOUS: Construction and Querying of Dynamic Knowledge Graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Choudhury, Sutanay; Agarwal, Khushbu; Purohit, Sumit
The ability to construct domain-specific knowledge graphs (KG) and perform question-answering or hypothesis generation is a transformative capability. Despite their value, automated construction of knowledge graphs remains an expensive technical challenge that is beyond the reach of most enterprises and academic institutions. We propose an end-to-end framework for developing custom knowledge graph driven analytics for arbitrary application domains. The uniqueness of our system lies in A) its combination of curated KGs along with knowledge extracted from unstructured text, B) its support for advanced trending and explanatory questions on a dynamic KG, and C) its ability to answer queries where the answer is embedded across multiple data sources.
Improving integrative searching of systems chemical biology data using semantic annotation.
Chen, Bin; Ding, Ying; Wild, David J
2012-03-08
Systems chemical biology and chemogenomics are considered critical, integrative disciplines in modern biomedical research, but require data mining of large, integrated, heterogeneous datasets from chemistry and biology. We previously developed an RDF-based resource called Chem2Bio2RDF that enabled querying of such data using the SPARQL query language. Whilst this work has proved useful in its own right as one of the first major resources in these disciplines, its utility could be greatly improved by the application of an ontology for annotation of the nodes and edges in the RDF graph, enabling a much richer range of semantic queries to be issued. We developed a generalized chemogenomics and systems chemical biology OWL ontology called Chem2Bio2OWL that describes the semantics of chemical compounds, drugs, protein targets, pathways, genes, diseases and side-effects, and the relationships between them. The ontology also includes data provenance. We used it to annotate our Chem2Bio2RDF dataset, making it a rich semantic resource. Through a series of scientific case studies we demonstrate how this (i) simplifies the process of building SPARQL queries, (ii) enables useful new kinds of queries on the data and (iii) makes possible intelligent reasoning and semantic graph mining in chemogenomics and systems chemical biology. Chem2Bio2OWL is available at http://chem2bio2rdf.org/owl. The document is available at http://chem2bio2owl.wikispaces.com.
Distributed Sensing and Processing Adaptive Collaboration Environment (D-SPACE)
2014-07-01
to the query graph, or subgraph permutations with the same mismatch cost (often the case for homogeneous and/or symmetrical data/query). To avoid...decisions are generated in a bottom-up manner using the metric of entropy at the cluster level (Figure 9c). Using the definition of belief messages...for a cluster and a set of data nodes in this cluster , we compute the entropy for forward and backward messages as (,) = −∑ (
A Query Integrator and Manager for the Query Web
Brinkley, James F.; Detwiler, Landon T.
2012-01-01
We introduce two concepts: the Query Web as a layer of interconnected queries over the document web and the semantic web, and a Query Web Integrator and Manager (QI) that enables the Query Web to evolve. QI permits users to write, save and reuse queries over any web accessible source, including other queries saved in other installations of QI. The saved queries may be in any language (e.g. SPARQL, XQuery); the only condition for interconnection is that the queries return their results in some form of XML. This condition allows queries to chain off each other, and to be written in whatever language is appropriate for the task. We illustrate the potential use of QI for several biomedical use cases, including ontology view generation using a combination of graph-based and logical approaches, value set generation for clinical data management, image annotation using terminology obtained from an ontology web service, ontology-driven brain imaging data integration, small-scale clinical data integration, and wider-scale clinical data integration. Such use cases illustrate the current range of applications of QI and lead us to speculate about the potential evolution from smaller groups of interconnected queries into a larger query network that layers over the document and semantic web. The resulting Query Web could greatly aid researchers and others who now have to manually navigate through multiple information sources in order to answer specific questions. PMID:22531831
SPARQLog: SPARQL with Rules and Quantification
NASA Astrophysics Data System (ADS)
Bry, François; Furche, Tim; Marnette, Bruno; Ley, Clemens; Linse, Benedikt; Poppe, Olga
SPARQL has become the gold-standard for RDF query languages. Nevertheless, we believe there is further room for improving RDF query languages. In this chapter, we investigate the addition of rules and quantifier alternation to SPARQL. That extension, called SPARQLog, extends previous RDF query languages by arbitrary quantifier alternation: blank nodes may occur in the scope of all, some, or none of the universal variables of a rule. In addition, SPARQLog is aware of important RDF features such as the distinction between blank nodes, literals and IRIs or the RDFS vocabulary. The semantics of SPARQLog is closed (every answer is an RDF graph), but lifts RDF's restrictions on literal and blank node occurrences for intermediary data. We show how to define a sound and complete operational semantics that can be implemented using existing logic programming techniques. While SPARQLog is Turing complete, we identify a decidable (in fact, polynomial time) fragment SwARQLog ensuring polynomial data-complexity inspired from the notion of super-weak acyclicity in data exchange. Furthermore, we prove that SPARQLog with no universal quantifiers in the scope of existential ones (∀ ∃ fragment) is equivalent to full SPARQLog in presence of graph projection. Thus, the convenience of arbitrary quantifier alternation comes, in fact, for free. These results, though here presented in the context of RDF querying, apply similarly also in the more general setting of data exchange.
Guidelines for a graph-theoretic implementation of structural equation modeling
Grace, James B.; Schoolmaster, Donald R.; Guntenspergen, Glenn R.; Little, Amanda M.; Mitchell, Brian R.; Miller, Kathryn M.; Schweiger, E. William
2012-01-01
Structural equation modeling (SEM) is increasingly being chosen by researchers as a framework for gaining scientific insights from the quantitative analyses of data. New ideas and methods emerging from the study of causality, influences from the field of graphical modeling, and advances in statistics are expanding the rigor, capability, and even purpose of SEM. Guidelines for implementing the expanded capabilities of SEM are currently lacking. In this paper we describe new developments in SEM that we believe constitute a third-generation of the methodology. Most characteristic of this new approach is the generalization of the structural equation model as a causal graph. In this generalization, analyses are based on graph theoretic principles rather than analyses of matrices. Also, new devices such as metamodels and causal diagrams, as well as an increased emphasis on queries and probabilistic reasoning, are now included. Estimation under a graph theory framework permits the use of Bayesian or likelihood methods. The guidelines presented start from a declaration of the goals of the analysis. We then discuss how theory frames the modeling process, requirements for causal interpretation, model specification choices, selection of estimation method, model evaluation options, and use of queries, both to summarize retrospective results and for prospective analyses. The illustrative example presented involves monitoring data from wetlands on Mount Desert Island, home of Acadia National Park. Our presentation walks through the decision process involved in developing and evaluating models, as well as drawing inferences from the resulting prediction equations. In addition to evaluating hypotheses about the connections between human activities and biotic responses, we illustrate how the structural equation (SE) model can be queried to understand how interventions might take advantage of an environmental threshold to limit Typha invasions. 
The guidelines presented provide for an updated definition of the SEM process that subsumes the historical matrix approach under a graph-theory implementation. The implementation is also designed to permit complex specifications and to be compatible with various estimation methods. Finally, they are meant to foster the use of probabilistic reasoning in both retrospective and prospective considerations of the quantitative implications of the results.
Graph-Based Object Class Discovery
NASA Astrophysics Data System (ADS)
Xia, Shengping; Hancock, Edwin R.
We are interested in the problem of discovering the set of object classes present in a database of images using a weakly supervised graph-based framework. Rather than making use of the "Bag-of-Features (BoF)" approach widely used in current work on object recognition, we represent each image by a graph using a group of selected local invariant features. Using local feature matching and iterative Procrustes alignment, we perform graph matching and compute a similarity measure. Borrowing the idea of query expansion, we develop a similarity propagation based graph clustering (SPGC) method. Using this method, class-specific clusters of the graphs can be obtained. Such a cluster can generally be represented by a higher-level graph model whose vertices are the clustered graphs and whose edge weights are determined by the pairwise similarity measure. Experiments are performed on a dataset in which the number of images increases from 1 to 50K and the number of objects increases from 1 to over 500. Some objects have been discovered with total recall and a precision of 1 in a single cluster.
Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks.
Balaur, Irina; Mazein, Alexander; Saqi, Mansoor; Lysenko, Artem; Rawlings, Christopher J; Auffray, Charles
2017-04-01
The goal of this work is to offer a computational framework for exploring data from the Recon2 human metabolic reconstruction model. Advanced user access features have been developed using the Neo4j graph database technology and this paper describes key features such as efficient management of the network data, examples of the network querying for addressing particular tasks, and how query results are converted back to the Systems Biology Markup Language (SBML) standard format. The Neo4j-based metabolic framework facilitates exploration of highly connected and comprehensive human metabolic data and identification of metabolic subnetworks of interest. A Java-based parser component has been developed to convert query results (available in the JSON format) into SBML and SIF formats in order to facilitate further results exploration, enhancement or network sharing. The Neo4j-based metabolic framework is freely available from: https://diseaseknowledgebase.etriks.org/metabolic/browser/ . The java code files developed for this work are available from the following url: https://github.com/ibalaur/MetabolicFramework . ibalaur@eisbm.org. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks
Mazein, Alexander; Saqi, Mansoor; Lysenko, Artem; Rawlings, Christopher J.; Auffray, Charles
2017-01-01
Abstract Summary: The goal of this work is to offer a computational framework for exploring data from the Recon2 human metabolic reconstruction model. Advanced user access features have been developed using the Neo4j graph database technology and this paper describes key features such as efficient management of the network data, examples of the network querying for addressing particular tasks, and how query results are converted back to the Systems Biology Markup Language (SBML) standard format. The Neo4j-based metabolic framework facilitates exploration of highly connected and comprehensive human metabolic data and identification of metabolic subnetworks of interest. A Java-based parser component has been developed to convert query results (available in the JSON format) into SBML and SIF formats in order to facilitate further results exploration, enhancement or network sharing. Availability and Implementation: The Neo4j-based metabolic framework is freely available from: https://diseaseknowledgebase.etriks.org/metabolic/browser/. The java code files developed for this work are available from the following url: https://github.com/ibalaur/MetabolicFramework. Contact: ibalaur@eisbm.org Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27993779
Learning context-sensitive shape similarity by graph transduction.
Bai, Xiang; Yang, Xingwei; Latecki, Longin Jan; Liu, Wenyu; Tu, Zhuowen
2010-05-01
Shape similarity and shape retrieval are very important topics in computer vision. The recent progress in this domain has been mostly driven by designing smart shape descriptors for providing better similarity measure between pairs of shapes. In this paper, we provide a new perspective to this problem by considering the existing shapes as a group, and study their similarity measures to the query shape in a graph structure. Our method is general and can be built on top of any existing shape similarity measure. For a given similarity measure, a new similarity is learned through graph transduction. The new similarity is learned iteratively so that the neighbors of a given shape influence its final similarity to the query. The basic idea here is related to PageRank ranking, which forms a foundation of Google Web search. The presented experimental results demonstrate that the proposed approach yields significant improvements over the state-of-art shape matching algorithms. We obtained a retrieval rate of 91.61 percent on the MPEG-7 data set, which is the highest ever reported in the literature. Moreover, the learned similarity by the proposed method also achieves promising improvements on both shape classification and shape clustering.
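The iterative idea described above, where neighbors of a shape influence its similarity to the query, can be sketched as a label-propagation loop over a row-normalised affinity matrix with the query clamped to 1. This is a minimal sketch under that assumption and not necessarily the authors' exact procedure; the function name `propagate_similarity`, the iteration count, and the affinity values are all hypothetical.

```python
def propagate_similarity(affinity, query, iterations=5):
    """Propagate similarity to the query shape through a shape graph.

    affinity   -- square list of lists of non-negative pairwise similarities
                  (each row is assumed to have a positive sum)
    query      -- index of the query shape
    iterations -- number of propagation steps (assumed hyperparameter)
    """
    n = len(affinity)
    # Row-normalise affinities into transition probabilities.
    P = []
    for row in affinity:
        total = sum(row)
        P.append([w / total for w in row])

    f = [0.0] * n
    f[query] = 1.0                      # the query is the only labelled node
    for _ in range(iterations):
        # Each shape takes the weighted average of its neighbours' scores.
        f = [sum(P[i][j] * f[j] for j in range(n)) for i in range(n)]
        f[query] = 1.0                  # clamp the query after every step
    return f


# Hypothetical affinities for four shapes; shape 0 is the query.
affinity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.8, 0.1],
    [0.1, 0.8, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
]
scores = propagate_similarity(affinity, query=0)
```

With only the query clamped, iterating to convergence on a connected graph pushes every score toward 1, so in a sketch like this the number of iterations is a meaningful choice and not just a convergence tolerance.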
Differences in Reporting the Ragweed Pollen Season Using Google Trends across 15 Countries.
Bousquet, Jean; Agache, Ioana; Berger, Uwe; Bergmann, Karl-Christian; Besancenot, Jean-Pierre; Bousquet, Philippe J; Casale, Tom; d'Amato, Gennaro; Kaidashev, Igor; Khaitov, Musa; Mösges, Ralph; Nekam, Kristof; Onorato, Gabrielle L; Plavec, Davor; Sheikh, Aziz; Thibaudon, Michel; Vautard, Robert; Zidarn, Mihaela
2018-05-09
Google Trends (GT) searches trends of specific queries in Google, which potentially reflect the real-life epidemiology of allergic rhinitis. We compared GT terms related to ragweed pollen allergy in American and European Union countries with a known ragweed pollen season. Our aim was to assess seasonality and the terms needed to perform the GT searches and to compare these during the spring and summer pollen seasons. We examined GT queries from January 1, 2011, to January 4, 2017. We included 15 countries with a known ragweed pollen season and used the standard 5-year GT graphs. We used the GT translation for all countries and the untranslated native terms for each country. The results of "pollen," "ragweed," and "allergy" searches differed between countries, but "ragweed" was clearly identified in 12 of the 15 countries. There was considerable heterogeneity of findings when the GT translation was used. For Croatia, Hungary, Romania, Serbia, and Slovenia, the GT translation was inappropriate. The country patterns of "pollen," "hay fever," and "allergy" differed in 8 of the 11 countries with identified "ragweed" queries during the spring and the summer, indicating that the perception of tree and grass pollen allergy differs from that of ragweed pollen. To investigate ragweed pollen allergy using GT, the term "ragweed" as a plant is required, and the translation of "ragweed" into the native language is needed. © 2018 S. Karger AG, Basel.
Magnostics: Image-Based Search of Interesting Matrix Views for Guided Network Exploration.
Behrisch, Michael; Bach, Benjamin; Hund, Michael; Delz, Michael; Von Ruden, Laura; Fekete, Jean-Daniel; Schreck, Tobias
2017-01-01
In this work we address the problem of retrieving potentially interesting matrix views to support the exploration of networks. We introduce Matrix Diagnostics (or Magnostics), following in spirit related approaches for rating and ranking other visualization techniques, such as Scagnostics for scatter plots. Our approach ranks matrix views according to the appearance of specific visual patterns, such as blocks and lines, indicating the existence of topological motifs in the data, such as clusters, bi-graphs, or central nodes. Magnostics can be used to analyze, query, or search for visually similar matrices in large collections, or to assess the quality of matrix reordering algorithms. While many feature descriptors for image analysis exist, there is no evidence of how they perform for detecting patterns in matrices. In order to make an informed choice of feature descriptors for matrix diagnostics, we evaluate 30 feature descriptors (27 existing ones and three new descriptors that we designed specifically for Magnostics) with respect to four criteria: pattern response, pattern variability, pattern sensibility, and pattern discrimination. We conclude with an informed set of six descriptors as most appropriate for Magnostics and demonstrate their application in two scenarios: exploring a large collection of matrices and analyzing temporal networks.
Analyzing locomotion synthesis with feature-based motion graphs.
Mahmudi, Mentar; Kallmann, Marcelo
2013-05-01
We propose feature-based motion graphs for realistic locomotion synthesis among obstacles. Among several advantages, feature-based motion graphs achieve improved results in search queries, eliminate the need of postprocessing for foot skating removal, and reduce the computational requirements in comparison to traditional motion graphs. Our contributions are threefold. First, we show that choosing transitions based on relevant features significantly reduces graph construction time and leads to improved search performances. Second, we employ a fast channel search method that confines the motion graph search to a free channel with guaranteed clearance among obstacles, achieving faster and improved results that avoid expensive collision checking. Lastly, we present a motion deformation model based on Inverse Kinematics applied over the transitions of a solution branch. Each transition is assigned a continuous deformation range that does not exceed the original transition cost threshold specified by the user for the graph construction. The obtained deformation improves the reachability of the feature-based motion graph and in turn also reduces the time spent during search. The results obtained by the proposed methods are evaluated and quantified, and they demonstrate significant improvements in comparison to traditional motion graph techniques.
A Natural Language Interface Concordant with a Knowledge Base.
Han, Yong-Jin; Park, Seong-Bae; Park, Se-Young
2016-01-01
The discordance between expressions interpretable by a natural language interface (NLI) system and those answerable by a knowledge base is a critical problem in the field of NLIs. In order to solve this discordance problem, this paper proposes a method to translate natural language questions into formal queries that can be generated from a graph-based knowledge base. The proposed method considers a subgraph of a knowledge base as a formal query. Thus, all formal queries corresponding to a concept or a predicate in the knowledge base can be generated prior to query time and all possible natural language expressions corresponding to each formal query can also be collected in advance. A natural language expression has a one-to-one mapping with a formal query. Hence, a natural language question is translated into a formal query by matching the question with the most appropriate natural language expression. If the confidence of this matching is not sufficiently high the proposed method rejects the question and does not answer it. Multipredicate queries are processed by regarding them as a set of collected expressions. The experimental results show that the proposed method thoroughly handles answerable questions from the knowledge base and rejects unanswerable ones effectively.
RelFinder: Revealing Relationships in RDF Knowledge Bases
NASA Astrophysics Data System (ADS)
Heim, Philipp; Hellmann, Sebastian; Lehmann, Jens; Lohmann, Steffen; Stegemann, Timo
The Semantic Web has recently seen a rise of large knowledge bases (such as DBpedia) that are freely accessible via SPARQL endpoints. The structured representation of the contained information opens up new possibilities in the way it can be accessed and queried. In this paper, we present an approach that extracts a graph covering relationships between two objects of interest. We show an interactive visualization of this graph that supports the systematic analysis of the found relationships by providing highlighting, previewing, and filtering features.
NASA Astrophysics Data System (ADS)
Arenas, Marcelo; Gutierrez, Claudio; Pérez, Jorge
The goal of this paper is to give an overview of the basics of the theory of RDF databases. We provide a formal definition of RDF that includes the features that distinguish this model from other graph data models. We then move into the fundamental issue of querying RDF data. We start by considering the RDF query language SPARQL, which has been a W3C Recommendation since January 2008. We provide an algebraic syntax and a compositional semantics for this language, study the complexity of the evaluation problem for different fragments of SPARQL, and consider the problem of optimizing the evaluation of SPARQL queries, showing that a natural fragment of this language has some good properties in this respect. We furthermore study the expressive power of SPARQL, by comparing it with some well-known query languages such as relational algebra. We conclude by considering the issue of querying RDF data in the presence of RDFS vocabulary. In particular, we present a recently proposed extension of SPARQL with navigational capabilities.
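To make the notion of evaluating a SPARQL basic graph pattern concrete, the following sketch matches a conjunction of triple patterns against a small set of triples by backtracking over variable bindings. It is a toy illustration of the semantics only, assuming that variables are strings starting with "?"; the function name `match_pattern` and the sample data are hypothetical, and no real SPARQL engine or optimisation is involved.

```python
def match_pattern(triples, patterns, binding=None):
    """Yield every variable binding that satisfies all triple patterns.

    triples  -- iterable of (subject, predicate, object) string tuples
    patterns -- list of triple patterns; terms starting with '?' are variables
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)            # all patterns matched: emit a solution
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check consistency with an earlier binding.
                if extended.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:        # constants must match exactly
                ok = False
                break
        if ok:
            yield from match_pattern(triples, rest, extended)


# Hypothetical data and the pattern { ?x knows ?y . ?y type Person }.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "type", "Person"),
    ("bob", "type", "Person"),
]
query = [("?x", "knows", "?y"), ("?y", "type", "Person")]
results = list(match_pattern(triples, query))
```

Each recursive call scans all triples, so the cost grows quickly with the number of patterns; that blow-up is one reason the complexity and optimisation questions mentioned in the abstract matter.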
BioFed: federated query processing over life sciences linked open data.
Hasnain, Ali; Mehmood, Qaiser; Sana E Zainab, Syeda; Saleem, Muhammad; Warren, Claude; Zehra, Durre; Decker, Stefan; Rebholz-Schuhmann, Dietrich
2017-03-15
Biomedical data, e.g. from knowledge bases and ontologies, is increasingly made available following open linked data principles, at best as RDF triple data. This is a necessary step towards unified access to biological data sets, but this still requires solutions to query multiple endpoints for their heterogeneous data to eventually retrieve all the meaningful information. Suggested solutions are based on query federation approaches, which require the submission of SPARQL queries to endpoints. Due to the size and complexity of available data, these solutions have to be optimised for efficient retrieval times and for users in life sciences research. Last but not least, over time, the reliability of data resources in terms of access and quality has to be monitored. Our solution (BioFed) federates data over 130 SPARQL endpoints in life sciences and tailors query submission according to the provenance information. BioFed has been evaluated against the state of the art solution FedX and forms an important benchmark for the life science domain. The efficient cataloguing approach of the federated query processing system 'BioFed', the triple pattern wise source selection and the semantic source normalisation form the core of our solution. It gathers and integrates data from newly identified public endpoints for federated access. Basic provenance information is linked to the retrieved data. Last but not least, BioFed makes use of the latest SPARQL standard (i.e., 1.1) to leverage the full benefits for query federation. The evaluation is based on 10 simple and 10 complex queries, which address data in 10 major and very popular data sources (e.g., DrugBank, Sider). BioFed is a solution for a single-point-of-access for a large number of SPARQL endpoints providing life science data. It facilitates efficient query generation for data access and provides basic provenance information in combination with the retrieved data. 
BioFed fully supports SPARQL 1.1 and gives access to the endpoint's availability based on the EndpointData graph. Our evaluation of BioFed against FedX is based on 20 heterogeneous federated SPARQL queries and shows competitive execution performance in comparison to FedX, which can be attributed to the provision of provenance information for the source selection. Developing and testing federated query engines for life sciences data is still a challenging task. According to our findings, it is advantageous to optimise the source selection. The cataloguing of SPARQL endpoints, including type and property indexing, leads to efficient querying of data resources over the Web of Data. This could even be further improved through the use of ontologies, e.g., for abstract normalisation of query terms.
EasyKSORD: A Platform of Keyword Search Over Relational Databases
NASA Astrophysics Data System (ADS)
Peng, Zhaohui; Li, Jing; Wang, Shan
Keyword Search Over Relational Databases (KSORD) enables casual users to use keyword queries (a set of keywords) to search relational databases just like searching the Web, without any knowledge of the database schema or any need of writing SQL queries. Based on our previous work, we design and implement a novel KSORD platform named EasyKSORD for users and system administrators to use and manage different KSORD systems in a novel and simple manner. EasyKSORD supports advanced queries, efficient data-graph-based search engines, multiform result presentations, and system logging and analysis. Through EasyKSORD, users can search relational databases easily and read search results conveniently, and system administrators can easily monitor and analyze the operations of KSORD and manage KSORD systems much better.
Time series patterns and language support in DBMS
NASA Astrophysics Data System (ADS)
Telnarova, Zdenka
2017-07-01
This contribution focuses on the Time Series pattern type as a semantically rich representation of data. Some examples of implementing this pattern type in traditional Database Management Systems are briefly presented. There are many approaches to manipulating and querying patterns. The crucial issue is a systematic approach to pattern management and a specific pattern query language that takes the semantics of patterns into consideration. The query language SQL-TS for manipulating patterns is demonstrated on Time Series data.
Equivalence of Szegedy's and coined quantum walks
NASA Astrophysics Data System (ADS)
Wong, Thomas G.
2017-09-01
Szegedy's quantum walk is a quantization of a classical random walk or Markov chain, where the walk occurs on the edges of the bipartite double cover of the original graph. To search, one can simply quantize a Markov chain with absorbing vertices. Recently, Santos proposed two alternative search algorithms that instead utilize the sign-flip oracle in Grover's algorithm rather than absorbing vertices. In this paper, we show that these two algorithms are exactly equivalent to two algorithms involving coined quantum walks, which are walks on the vertices of the original graph with an internal degree of freedom. The first scheme is equivalent to a coined quantum walk with one walk step per query of Grover's oracle, and the second is equivalent to a coined quantum walk with two walk steps per query of Grover's oracle. These equivalences lie outside the previously known equivalence of Szegedy's quantum walk with absorbing vertices and the coined quantum walk with the negative identity operator as the coin for marked vertices, whose precise relationships we also investigate.
GenomeGraphs: integrated genomic data visualization with R.
Durinck, Steffen; Bullard, James; Spellman, Paul T; Dudoit, Sandrine
2009-01-06
Biological studies involve a growing number of distinct high-throughput experiments to characterize samples of interest. There is a lack of methods to visualize these different genomic datasets in a versatile manner. In addition, genomic data analysis requires integrated visualization of experimental data along with constantly changing genomic annotation and statistical analyses. We developed GenomeGraphs, as an add-on software package for the statistical programming environment R, to facilitate integrated visualization of genomic datasets. GenomeGraphs uses the biomaRt package to perform on-line annotation queries to Ensembl and translates these to gene/transcript structures in viewports of the grid graphics package. This allows genomic annotation to be plotted together with experimental data. GenomeGraphs can also be used to plot custom annotation tracks in combination with different experimental data types together in one plot using the same genomic coordinate system. GenomeGraphs is a flexible and extensible software package which can be used to visualize a multitude of genomic datasets within the statistical programming environment R.
Streaming data analytics via message passing with application to graph algorithms
Plimpton, Steven J.; Shead, Tim
2014-05-06
The need to process streaming data, which arrives continuously at high-volume in real-time, arises in a variety of contexts including data produced by experiments, collections of environmental or network sensors, and running simulations. Streaming data can also be formulated as queries or transactions which operate on a large dynamic data store, e.g. a distributed database. We describe a lightweight, portable framework named PHISH which enables a set of independent processes to compute on a stream of data in a distributed-memory parallel manner. Datums are routed between processes in patterns defined by the application. PHISH can run on top of either message-passing via MPI or sockets via ZMQ. The former means streaming computations can be run on any parallel machine which supports MPI; the latter allows them to run on a heterogeneous, geographically dispersed network of machines. We illustrate how PHISH can support streaming MapReduce operations, and describe streaming versions of three algorithms for large, sparse graph analytics: triangle enumeration, subgraph isomorphism matching, and connected component finding. Lastly, we also provide benchmark timings for MPI versus socket performance of several kernel operations useful in streaming algorithms.
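One of the streaming graph tasks named above, connected component finding, can be illustrated on a single process with an incremental union-find structure that absorbs edges one at a time. This is only a sketch of the streaming idea in plain Python; it does not use PHISH, MPI, or ZMQ, and the class name `StreamingComponents` and the sample edge stream are assumptions for illustration.

```python
class StreamingComponents:
    """Maintain connected components while edges arrive one by one."""

    def __init__(self):
        self.parent = {}   # vertex -> parent in the union-find forest
        self.count = 0     # current number of components

    def _find(self, v):
        if v not in self.parent:       # first time this vertex is seen
            self.parent[v] = v
            self.count += 1
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:  # path compression
            self.parent[v], v = root, self.parent[v]
        return root

    def add_edge(self, u, v):
        """Process one datum of the stream: an undirected edge (u, v)."""
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self.parent[ru] = rv       # merge the two components
            self.count -= 1

    def connected(self, u, v):
        # Note: querying an unseen vertex registers it as its own component.
        return self._find(u) == self._find(v)


# Hypothetical edge stream.
cc = StreamingComponents()
for edge in [(1, 2), (3, 4), (2, 3), (5, 6)]:
    cc.add_edge(*edge)
```

Each edge is handled once and then discarded, which matches the one-pass constraint of a stream; a distributed version would additionally need to route vertices to the processes that own them.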
Chaotic Traversal (CHAT): Very Large Graphs Traversal Using Chaotic Dynamics
NASA Astrophysics Data System (ADS)
Changaival, Boonyarit; Rosalie, Martin; Danoy, Grégoire; Lavangnananda, Kittichai; Bouvry, Pascal
2017-12-01
Graph traversal algorithms find applications in various fields such as routing problems, natural language processing, and even database querying. The exploration can be considered a first stepping stone toward knowledge extraction from the graph, which is now a popular topic. Classical solutions such as Breadth First Search (BFS) and Depth First Search (DFS) require huge amounts of memory for exploring very large graphs. In this research, we present a novel memoryless graph traversal algorithm, Chaotic Traversal (CHAT), which integrates chaotic dynamics to traverse large unknown graphs via the Lozi map and the Rössler system. To compare the effects of various dynamics on our algorithm, we present an original way to explore a parameter space using a bifurcation diagram with respect to the topological structure of attractors. The resulting algorithm is efficient and non-resource-demanding, and is therefore very suitable for partial traversal of very large and/or unknown environment graphs. CHAT's performance using the Lozi map is shown to be superior to that of the commonly known Random Walk in terms of the number of nodes visited (coverage percentage) and computation time, where the environment is unknown and memory usage is restricted.
A Graph Approach to Mining Biological Patterns in the Binding Interfaces.
Cheng, Wen; Yan, Changhui
2017-01-01
Protein-RNA interactions play important roles in the biological systems. Searching for regular patterns in the Protein-RNA binding interfaces is important for understanding how protein and RNA recognize each other and bind to form a complex. Herein, we present a graph-mining method for discovering biological patterns in the protein-RNA interfaces. We represented known protein-RNA interfaces using graphs and then discovered graph patterns enriched in the interfaces. Comparison of the discovered graph patterns with UniProt annotations showed that the graph patterns had a significant overlap with residue sites that had been proven crucial for the RNA binding by experimental methods. Using 200 patterns as input features, a support vector machine method was able to classify protein surface patches into RNA-binding sites and non-RNA-binding sites with 84.0% accuracy and 88.9% precision. We built a simple scoring function that calculated the total number of the graph patterns that occurred in a protein-RNA interface. That scoring function was able to discriminate near-native protein-RNA complexes from docking decoys with a performance comparable with that of a state-of-the-art complex scoring function. Our work also revealed possible patterns that might be important for binding affinity.
A Gene Ontology Tutorial in Python.
Vesztrocy, Alex Warwick; Dessimoz, Christophe
2017-01-01
This chapter is a tutorial on using Gene Ontology resources in the Python programming language. This entails querying the Gene Ontology graph, retrieving Gene Ontology annotations, performing gene enrichment analyses, and computing basic semantic similarity between GO terms. An interactive version of the tutorial, including solutions, is available at http://gohandbook.org .
Modeling Spatial Relationships within a Fuzzy Framework.
ERIC Educational Resources Information Center
Petry, Frederick E.; Cobb, Maria A.
1998-01-01
Presents a model for representing and storing binary topological and directional relationships between 2-dimensional objects that is used to provide a basis for fuzzy querying capabilities. A data structure called an abstract spatial graph (ASG) is defined for the binary relationships that maintains all necessary information regarding topology and…
Insertion algorithms for network model database management systems
NASA Astrophysics Data System (ADS)
Mamadolimov, Abdurashid; Khikmat, Saburov
2017-12-01
The network model is a database model conceived as a flexible way of representing objects and their relationships. Its distinguishing feature is that the schema, viewed as a graph in which object types are nodes and relationship types are arcs, forms a partial order. When a database is large and a query comparison is expensive, the efficiency requirement for management algorithms is to minimize the number of query comparisons. We consider the update operation for network model database management systems. We develop a new sequential algorithm for the update operation. We also suggest a distributed version of the algorithm.
Query-Adaptive Reciprocal Hash Tables for Nearest Neighbor Search.
Liu, Xianglong; Deng, Cheng; Lang, Bo; Tao, Dacheng; Li, Xuelong
2016-02-01
Recent years have witnessed the success of binary hashing techniques in approximate nearest neighbor search. In practice, multiple hash tables are usually built using hashing to cover more desired results in the hit buckets of each table. However, little work has studied a unified approach to constructing multiple informative hash tables using any type of hashing algorithm. Meanwhile, multiple table search also lacks a generic query-adaptive and fine-grained ranking scheme that can alleviate the binary quantization loss suffered in standard hashing techniques. To solve the above problems, in this paper, we first regard the table construction as a selection problem over a set of candidate hash functions. With the graph representation of the function set, we propose an efficient solution that sequentially applies the normalized dominant set to find the most informative and independent hash functions for each table. To further reduce the redundancy between tables, we explore the reciprocal hash tables in a boosting manner, where the hash function graph is updated with high weights emphasized on the misclassified neighbor pairs of previous hash tables. To refine the ranking of the retrieved buckets within a certain Hamming radius from the query, we propose a query-adaptive bitwise weighting scheme to enable fine-grained bucket ranking in each hash table, exploiting the discriminative power of its hash functions and their complement for nearest neighbor search. Moreover, we integrate this scheme into the multiple table search using a fast, yet reciprocal table lookup algorithm within the adaptive weighted Hamming radius. In this paper, both the construction method and the query-adaptive search method are general and compatible with different types of hashing algorithms using different feature spaces and/or parameter settings. Our extensive experiments on several large-scale benchmarks demonstrate that the proposed techniques can significantly outperform both the naive construction methods and the state-of-the-art hashing algorithms.
Big Data and Dysmenorrhea: What Questions Do Women and Men Ask About Menstrual Pain?
Chen, Chen X; Groves, Doyle; Miller, Wendy R; Carpenter, Janet S
2018-04-30
Menstrual pain is highly prevalent among women of reproductive age. As the general public increasingly obtains health information online, Big Data from online platforms provide novel sources to understand the public's perspectives and information needs about menstrual pain. The study's purpose was to describe salient queries about dysmenorrhea using Big Data from a question and answer platform. We performed text-mining of 1.9 billion queries from ChaCha, a United States-based question and answer platform. Dysmenorrhea-related queries were identified by using keyword searching. Each relevant query was split into token words (i.e., meaningful words or phrases) and stop words (i.e., not meaningful functional words). Word Adjacency Graph (WAG) modeling was used to detect clusters of queries and visualize the range of dysmenorrhea-related topics. We constructed two WAG models, one from queries by women of reproductive age and one from queries by men. Salient themes were identified through inspecting clusters of WAG models. We identified two subsets of queries: Subset 1 contained 507,327 queries from women aged 13-50 years. Subset 2 contained 113,888 queries from men aged 13 or above. WAG modeling revealed topic clusters for each subset. Between female and male subsets, topic clusters overlapped on dysmenorrhea symptoms and management. Among female queries, there were distinctive topics on approaching menstrual pain at school and menstrual pain-related conditions, while among male queries, there was a distinctive cluster of queries on menstrual pain from males' perspectives. Big Data mining of the ChaCha® question and answer service revealed a series of information needs among women and men on menstrual pain. Findings may be useful in structuring the content and informing the delivery platform for educational interventions.
Toward a Data Scalable Solution for Facilitating Discovery of Science Resources
DOE Office of Scientific and Technical Information (OSTI.GOV)
Weaver, Jesse R.; Castellana, Vito G.; Morari, Alessandro
Science is increasingly motivated by the need to process larger quantities of data. It is facing severe challenges in data collection, management, and processing, so much so that the computational demands of “data scaling” are competing with, and in many fields surpassing, the traditional objective of decreasing processing time. Example domains with large datasets include astronomy, biology, genomics, climate/weather, and material sciences. This paper presents a real-world use case in which we wish to answer queries provided by domain scientists in order to facilitate discovery of relevant science resources. The problem is that the metadata for these science resources is very large and is growing quickly, rapidly increasing the need for a data scaling solution. We propose a system – SGEM – designed for answering graph-based queries over large datasets on cluster architectures, and we report performance results for queries on the current RDESC dataset of nearly 1.4 billion triples, and on the well-known BSBM SPARQL query benchmark.
HBVPathDB: a database of HBV infection-related molecular interaction network.
Zhang, Yi; Bo, Xiao-Chen; Yang, Jing; Wang, Sheng-Qi
2005-03-21
To describe molecule or gene interactions between hepatitis B viruses (HBV) and host, for understanding how the virus's and host's genes and molecules are networked to form a biological system and for perceiving the mechanism of HBV infection. The knowledge of HBV infection-related reactions was organized into various kinds of pathways with carefully drawn graphs in HBVPathDB. Pathway information is stored with a relational database management system (DBMS), which is currently the most efficient way to manage large amounts of data, and querying is implemented with the powerful Structured Query Language (SQL). The search engine is written in Personal Home Page (PHP) with SQL embedded, and a web retrieval interface for searching is developed with Hypertext Markup Language (HTML). We present the first version of HBVPathDB, which is an HBV infection-related molecular interaction network database composed of 306 pathways with 1 050 molecules involved. With carefully drawn graphs, pathway information stored in HBVPathDB can be browsed in an intuitive way. We develop an easy-to-use interface for flexible access to the details of the database. Convenient software is implemented to query and browse the pathway information of HBVPathDB. Four search page layout options (category search, gene search, description search, unitized search) are supported by the search engine of the database. The database is freely available at http://www.bio-inf.net/HBVPathDB/HBV/. HBVPathDB already contains a considerable amount of HBV infection-related pathway information, which is suitable for in-depth analysis of the molecular interaction network of virus and host. HBVPathDB integrates pathway datasets with convenient software for query, browsing and visualization, which gives users more opportunity to identify regulatory key molecules as potential drug targets and to explore the possible mechanism of HBV infection based on gene expression datasets.
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
NASA Astrophysics Data System (ADS)
Farhan Husain, Mohammad; Doshi, Pankil; Khan, Latifur; Thuraisingham, Bhavani
Handling huge amounts of data scalably has been a matter of concern for a long time. The same is true for semantic web data. Current semantic web frameworks lack this ability. In this paper, we describe a framework that we built using Hadoop to store and retrieve large numbers of RDF triples. We describe our schema to store RDF data in the Hadoop Distributed File System. We also present our algorithms to answer a SPARQL query. We make use of Hadoop's MapReduce framework to actually answer the queries. Our results reveal that we can store huge amounts of semantic web data in Hadoop clusters built mostly from cheap commodity-class hardware and still answer queries fast enough. We conclude that ours is a scalable framework, able to handle large amounts of RDF data efficiently.
Ma, Ling; Liu, Xiabi; Gao, Yan; Zhao, Yanfeng; Zhao, Xinming; Zhou, Chunwu
2017-02-01
This paper proposes a new method of content based medical image retrieval through considering fused, context-sensitive similarity. Firstly, we fuse the semantic and visual similarities between the query image and each image in the database as their pairwise similarities. Then, we construct a weighted graph whose nodes represent the images and edges measure their pairwise similarities. By using the shortest path algorithm over the weighted graph, we obtain a new similarity measure, context-sensitive similarity measure, between the query image and each database image to complete the retrieval process. Actually, we use the fused pairwise similarity to narrow down the semantic gap for obtaining a more accurate pairwise similarity measure, and spread it on the intrinsic data manifold to achieve the context-sensitive similarity for a better retrieval performance. The proposed method has been evaluated on the retrieval of the Common CT Imaging Signs of Lung Diseases (CISLs) and achieved not only better retrieval results but also the satisfactory computation efficiency. Copyright © 2017 Elsevier Inc. All rights reserved.
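The shortest-path step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: pairwise similarities are assumed to lie in (0, 1], each edge cost is taken to be `-log(similarity)` so that a shortest path corresponds to the chain of images with the largest product of similarities, and the function name `context_similarity` is hypothetical. The search itself is ordinary Dijkstra.

```python
import heapq
import math


def context_similarity(sim, query):
    """Return a context-sensitive similarity from `query` to every reachable node.

    sim[u][v] is the (fused) pairwise similarity between images u and v,
    assumed to lie in (0, 1]. Edge cost is -log(similarity), an assumption
    of this sketch, so path cost adds up while similarity multiplies.
    """
    dist = {query: 0.0}
    heap = [(0.0, query)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, s in sim[u].items():
            nd = d - math.log(s)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    # Convert shortest-path cost back to a similarity in (0, 1].
    return {v: math.exp(-d) for v, d in dist.items()}


if __name__ == "__main__":
    sim = {
        "q": {"a": 0.9, "b": 0.2},
        "a": {"q": 0.9, "b": 0.8},
        "b": {"q": 0.2, "a": 0.8},
    }
    print(context_similarity(sim, "q"))
```

In the toy graph, image `b` has a direct similarity of only 0.2 to the query, but the path through `a` raises its context-sensitive score to 0.9 × 0.8 = 0.72, which is the kind of effect the abstract attributes to spreading similarity along the data manifold.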
Combining computational models, semantic annotations and simulation experiments in a graph database
Henkel, Ron; Wolkenhauer, Olaf; Waltemath, Dagmar
2015-01-01
Model repositories such as the BioModels Database, the CellML Model Repository or JWS Online are frequently accessed to retrieve computational models of biological systems. However, their storage concepts support only restricted types of queries and not all data inside the repositories can be retrieved. In this article we present a storage concept that meets this challenge. It grounds on a graph database, reflects the models’ structure, incorporates semantic annotations and simulation descriptions and ultimately connects different types of model-related data. The connections between heterogeneous model-related data and bio-ontologies enable efficient search via biological facts and grant access to new model features. The introduced concept notably improves the access of computational models and associated simulations in a model repository. This has positive effects on tasks such as model search, retrieval, ranking, matching and filtering. Furthermore, our work for the first time enables CellML- and Systems Biology Markup Language-encoded models to be effectively maintained in one database. We show how these models can be linked via annotations and queried. Database URL: https://sems.uni-rostock.de/projects/masymos/ PMID:25754863
Graphing evolutionary pattern and process: a history of techniques in archaeology and paleobiology.
Lyman, R Lee
2009-02-01
Graphs displaying evolutionary patterns are common in paleontology and in United States archaeology. Both disciplines subscribed to a transformational theory of evolution and graphed evolution as a sequence of archetypes in the late nineteenth and early twentieth centuries. U.S. archaeologists in the second decade of the twentieth century, and paleontologists shortly thereafter, developed distinct graphic styles that reflected the Darwinian variational model of evolution. Paleobiologists adopted the view of a species as a set of phenotypically variant individuals and graphed those variations either as central tendencies or as histograms of frequencies of variants. Archaeologists presumed their artifact types reflected cultural norms of prehistoric artisans and the frequency of specimens in each type reflected human choice and type popularity. They graphed cultural evolution as shifts in frequencies of specimens representing each of several artifact types. Confusion of pattern and process is exemplified by a paleobiologist misinterpreting the process illustrated by an archaeological graph, and an archaeologist misinterpreting the process illustrated by a paleobiological graph. Each style of graph displays particular evolutionary patterns and implies particular evolutionary processes. Graphs of a multistratum collection of prehistoric mammal remains and a multistratum collection of artifacts demonstrate that many graph styles can be used for both kinds of collections.
A GRAPH PARTITIONING APPROACH TO PREDICTING PATTERNS IN LATERAL INHIBITION SYSTEMS
RUFINO FERREIRA, ANA S.; ARCAK, MURAT
2017-01-01
We analyze spatial patterns on networks of cells where adjacent cells inhibit each other through contact signaling. We represent the network as a graph where each vertex represents the dynamics of identical individual cells and where graph edges represent cell-to-cell signaling. To predict steady-state patterns we find equitable partitions of the graph vertices and assign them into disjoint classes. We then use results from monotone systems theory to prove the existence of patterns that are structured in such a way that all the cells in the same class have the same final fate. To study the stability properties of these patterns, we rely on the graph partition to perform a block decomposition of the system. Then, to guarantee stability, we provide a small-gain type criterion that depends on the input-output properties of each cell in the reduced system. Finally, we discuss pattern formation in stochastic models. With the help of a modal decomposition we show that noise can enhance the parameter region where patterning occurs. PMID:29225552
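To make the notion of an equitable partition concrete, here is a small sketch that checks the defining property: every vertex in a class must have the same number of neighbours in each class. The function name `is_equitable` and the adjacency-dictionary representation are assumptions of this example; the abstract does not describe how the partitions are computed.

```python
def is_equitable(adj, classes):
    """Check whether `classes` is an equitable partition of the graph `adj`.

    adj maps each vertex to an iterable of its neighbours; classes is a
    list of disjoint vertex sets covering all vertices.
    """
    label = {v: i for i, cls in enumerate(classes) for v in cls}
    for cls in classes:
        profiles = set()
        for v in cls:
            # Count the neighbours of v that fall in each class.
            counts = [0] * len(classes)
            for w in adj[v]:
                counts[label[w]] += 1
            profiles.add(tuple(counts))
        if len(profiles) > 1:
            return False  # two vertices in one class see different counts
    return True


if __name__ == "__main__":
    cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    print(is_equitable(cycle4, [{0, 2}, {1, 3}]))
```

On the 4-cycle, alternating vertices form an equitable partition, which matches the familiar checkerboard pattern in which neighbouring cells take opposite fates.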
Digital Workflows for a 3D Semantic Representation of an Ancient Mining Landscape
NASA Astrophysics Data System (ADS)
Hiebel, G.; Hanke, K.
2017-08-01
The ancient mining landscape of Schwaz/Brixlegg in the Tyrol, Austria witnessed mining from prehistoric times to modern times creating a first order cultural landscape when it comes to one of the most important inventions in human history: the production of metal. In 1991 a part of this landscape was lost due to an enormous landslide that reshaped part of the mountain. With our work we want to propose a digital workflow to create a 3D semantic representation of this ancient mining landscape with its mining structures to preserve it for posterity. First, we define a conceptual model to integrate the data. It is based on the CIDOC CRM ontology and CRMgeo for geometric data. To transform our information sources to a formal representation of the classes and properties of the ontology we applied semantic web technologies and created a knowledge graph in RDF (Resource Description Framework). Through the CRMgeo extension coordinate information of mining features can be integrated into the RDF graph and thus related to the detailed digital elevation model that may be visualized together with the mining structures using Geoinformation systems or 3D visualization tools. The RDF network of the triple store can be queried using the SPARQL query language. We created a snapshot of mining, settlement and burial sites in the Bronze Age. The results of the query were loaded into a Geoinformation system and a visualization of known bronze age sites related to mining, settlement and burial activities was created.
NASA Astrophysics Data System (ADS)
Boulicaut, Jean-Francois; Jeudy, Baptiste
Knowledge Discovery in Databases (KDD) is a complex interactive process. The promising theoretical framework of inductive databases considers this to be essentially a querying process. It is enabled by a query language which can deal either with raw data or with patterns which hold in the data. Mining patterns turns out to be the so-called inductive query evaluation process, for which constraint-based Data Mining techniques have to be designed. An inductive query declaratively specifies the desired constraints, and algorithms are used to compute the patterns satisfying the constraints in the data. We survey important results of this active research domain. This chapter emphasizes a real breakthrough for hard problems concerning local pattern mining under various constraints, and it also points out the current directions of research.
Genecentric: a package to uncover graph-theoretic structure in high-throughput epistasis data.
Gallant, Andrew; Leiserson, Mark D M; Kachalov, Maxim; Cowen, Lenore J; Hescott, Benjamin J
2013-01-18
New technology has resulted in high-throughput screens for pairwise genetic interactions in yeast and other model organisms. For each pair in a collection of non-essential genes, an epistasis score is obtained, representing how much sicker (or healthier) the double-knockout organism will be compared to what would be expected from the sickness of the component single knockouts. Recent algorithmic work has identified graph-theoretic patterns in this data that can indicate functional modules, and even sets of genes that may occur in compensatory pathways, such as a BPM-type schema first introduced by Kelley and Ideker. However, to date, any algorithms for finding such patterns in the data were implemented internally, with no software being made publically available. Genecentric is a new package that implements a parallelized version of the Leiserson et al. algorithm (J Comput Biol 18:1399-1409, 2011) for generating generalized BPMs from high-throughput genetic interaction data. Given a matrix of weighted epistasis values for a set of double knock-outs, Genecentric returns a list of generalized BPMs that may represent compensatory pathways. Genecentric also has an extension, GenecentricGO, to query FuncAssociate (Bioinformatics 25:3043-3044, 2009) to retrieve GO enrichment statistics on generated BPMs. Python is the only dependency, and our web site provides working examples and documentation. We find that Genecentric can be used to find coherent functional and perhaps compensatory gene sets from high throughput genetic interaction data. Genecentric is made freely available for download under the GPLv2 from http://bcb.cs.tufts.edu/genecentric.
Genecentric: a package to uncover graph-theoretic structure in high-throughput epistasis data
2013-01-01
Background New technology has resulted in high-throughput screens for pairwise genetic interactions in yeast and other model organisms. For each pair in a collection of non-essential genes, an epistasis score is obtained, representing how much sicker (or healthier) the double-knockout organism will be compared to what would be expected from the sickness of the component single knockouts. Recent algorithmic work has identified graph-theoretic patterns in this data that can indicate functional modules, and even sets of genes that may occur in compensatory pathways, such as a BPM-type schema first introduced by Kelley and Ideker. However, to date, any algorithms for finding such patterns in the data were implemented internally, with no software being made publically available. Results Genecentric is a new package that implements a parallelized version of the Leiserson et al. algorithm (J Comput Biol 18:1399-1409, 2011) for generating generalized BPMs from high-throughput genetic interaction data. Given a matrix of weighted epistasis values for a set of double knock-outs, Genecentric returns a list of generalized BPMs that may represent compensatory pathways. Genecentric also has an extension, GenecentricGO, to query FuncAssociate (Bioinformatics 25:3043-3044, 2009) to retrieve GO enrichment statistics on generated BPMs. Python is the only dependency, and our web site provides working examples and documentation. Conclusion We find that Genecentric can be used to find coherent functional and perhaps compensatory gene sets from high throughput genetic interaction data. Genecentric is made freely available for download under the GPLv2 from http://bcb.cs.tufts.edu/genecentric. PMID:23331614
Neuro-symbolic representation learning on biological knowledge graphs.
Alshahrani, Mona; Khan, Mohammad Asif; Maddouri, Omar; Kinjo, Akira R; Queralt-Rosinach, Núria; Hoehndorf, Robert
2017-09-01
Biological data and knowledge bases increasingly rely on Semantic Web technologies and the use of knowledge graphs for data integration, retrieval and federated queries. In recent years, feature learning methods that are applicable to graph-structured data have become available, but have not yet been widely applied and evaluated on structured biological knowledge. Results: We develop a novel method for feature learning on biological knowledge graphs. Our method combines symbolic methods, in particular knowledge representation using symbolic logic and automated reasoning, with neural networks to generate embeddings of nodes that encode for related information within knowledge graphs. Through the use of symbolic logic, these embeddings contain both explicit and implicit information. We apply these embeddings to the prediction of edges in the knowledge graph representing problems of function prediction, finding candidate genes of diseases, protein-protein interactions, or drug target relations, and demonstrate performance that matches and sometimes outperforms traditional approaches based on manually crafted features. Our method can be applied to any biological knowledge graph, and will thereby open up the increasing amount of Semantic Web based knowledge bases in biology to use in machine learning and data analytics. Availability: https://github.com/bio-ontology-research-group/walking-rdf-and-owl. Contact: robert.hoehndorf@kaust.edu.sa. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
Collaborative mining and transfer learning for relational data
NASA Astrophysics Data System (ADS)
Levchuk, Georgiy; Eslami, Mohammed
2015-06-01
Many real-world problems, including human knowledge, communication, biological, and cyber network analysis, deal with data entities for which the essential information is contained in the relations among those entities. Such data must be modeled and analyzed as graphs, with attributes on both objects and relations encoding and differentiating their semantics. Traditional data mining algorithms were originally designed for analyzing discrete objects for which a set of features can be defined, and thus cannot be easily adapted to deal with graph data. This gave rise to the relational data mining field of research, of which graph pattern learning is a key sub-domain [11]. In this paper, we describe a model for learning graph patterns in a collaborative distributed manner. Distributed pattern learning is challenging due to dependencies between the nodes and relations in the graph, and variability across graph instances. We present three algorithms that trade off the benefits of parallelization and data aggregation, compare their performance to centralized graph learning, and discuss the individual benefits and weaknesses of each model. The presented algorithms are designed for linear speedup in distributed computing environments, and learn graph patterns that are both closer to ground truth and provide higher detection rates than a centralized mining algorithm.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sukumar, Sreenivas R.; Hong, Seokyong; Lee, Sangkeun
2016-06-01
GraphBench is a benchmark suite for graph pattern mining and graph analysis systems. The benchmark suite is a significant addition for conducting apples-to-apples comparisons of graph analysis software (databases, in-memory tools, triple stores, etc.).
Mining Longitudinal Web Queries: Trends and Patterns.
ERIC Educational Resources Information Center
Wang, Peiling; Berry, Michael W.; Yang, Yiheng
2003-01-01
Analyzed user queries submitted to an academic Web site during a four-year period, using a relational database, to examine users' query behavior, to identify problems they encounter, and to develop techniques for optimizing query analysis and mining. Linguistic analyses focus on query structures, lexicon, and word associations using statistical…
Xiao, Fuyuan; Aritsugi, Masayoshi; Wang, Qing; Zhang, Rong
2016-09-01
For efficient and sophisticated analysis of complex event patterns that appear in streams of big data from health care information systems and support for decision-making, a triaxial hierarchical model is proposed in this paper. Our triaxial hierarchical model is developed by focusing on hierarchies among nested event pattern queries with an event concept hierarchy, thereby allowing us to identify the relationships among the expressions and sub-expressions of the queries extensively. We devise a cost-based heuristic by means of the triaxial hierarchical model to find an optimised query execution plan in terms of the costs of both the operators and the communications between them. According to the triaxial hierarchical model, we can also calculate how to reuse the results of the common sub-expressions in multiple queries. By integrating the optimised query execution plan with the reuse schemes, a multi-query optimisation strategy is developed to accomplish efficient processing of multiple nested event pattern queries. We present empirical studies in which the performance of the multi-query optimisation strategy was examined under various stream input rates and workloads. Specifically, the workloads of pattern queries can be used for supporting the monitoring of patients' conditions. Experiments with varying stream input rates correspond to changes in the number of patients that a system should manage, whereas burst input rates correspond to rushes of patients to be taken care of. The experimental results show that, in Workload 1, our proposal improves throughput by about 4 and 2 times, respectively, compared with the related works; in Workload 2, by about 3 and 2 times, respectively, compared with the related works; and in Workload 3, by about 6 times compared with the related work. The experimental results demonstrate that our proposal is able to process complex queries efficiently, which can support health information systems and further decision-making. Copyright © 2016 Elsevier B.V. All rights reserved.
Alternative Fuels Data Center: Maps and Data
BioFuels Atlas is an interactive map for comparing biomass feedstocks and biofuels by location. This tool helps users select from and apply biomass data layers to a map, as well as query and download…
Statistically significant relational data mining :
DOE Office of Scientific and Technical Information (OSTI.GOV)
Berry, Jonathan W.; Leung, Vitus Joseph; Phillips, Cynthia Ann
This report summarizes the work performed under the project "Statistically significant relational data mining." The goal of the project was to add more statistical rigor to the fairly ad hoc area of data mining on graphs. Our goal was to develop better algorithms and better ways to evaluate algorithm quality. We concentrated on algorithms for community detection, approximate pattern matching, and graph similarity measures. Approximate pattern matching involves finding an instance of a relatively small pattern, expressed with tolerance, in a large graph of data observed with uncertainty. This report gathers the abstracts and references for the eight refereed publications that have appeared as part of this work. We then archive three pieces of research that have not yet been published. The first is theoretical and experimental evidence that a popular statistical measure for comparison of community assignments favors over-resolved communities over approximations to a ground truth. The second is a set of statistically motivated methods for measuring the quality of an approximate match of a small pattern in a large graph. The third is a new probabilistic random graph model. Statisticians favor these models for graph analysis. The new local structure graph model overcomes some of the issues with popular models such as exponential random graph models and latent variable models.
Man-Made Object Extraction from Remote Sensing Imagery by Graph-Based Manifold Ranking
NASA Astrophysics Data System (ADS)
He, Y.; Wang, X.; Hu, X. Y.; Liu, S. H.
2018-04-01
The automatic extraction of man-made objects from remote sensing imagery is useful in many applications. This paper proposes an algorithm for extracting man-made objects automatically by integrating a graph model with the manifold ranking algorithm. Initially, we estimate a priori value of the man-made objects with the use of symmetric and contrast features. The graph model is established to represent the spatial relationships among pre-segmented superpixels, which are used as the graph nodes. Multiple characteristics, namely colour, texture and main direction, are used to compute the weights of the adjacent nodes. Manifold ranking effectively explores the relationships among all the nodes in the feature space as well as initial query assignment; thus, it is applied to generate a ranking map, which indicates the scores of the man-made objects. The man-made objects are then segmented on the basis of the ranking map. Two typical segmentation algorithms are compared with the proposed algorithm. Experimental results show that the proposed algorithm can extract man-made objects with high recognition rate and low omission rate.
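For readers unfamiliar with manifold ranking, the sketch below shows the commonly used iterative form, f ← αSf + (1 − α)y, where S is the symmetrically normalised affinity matrix and y marks the initial query nodes. It is a generic illustration, not the paper's pipeline: the toy weights, the value of `alpha`, and the function name `manifold_rank` are assumptions, and the superpixel features used in the paper to build the weights are not modelled here.

```python
import math


def manifold_rank(W, y, alpha=0.5, iters=200):
    """Rank graph nodes relative to the query vector y.

    W is a symmetric affinity matrix (list of lists) with zero diagonal,
    y holds the initial query scores. Iterates f <- alpha*S*f + (1-alpha)*y
    with S = D^{-1/2} W D^{-1/2}; this converges for 0 <= alpha < 1.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    S = [
        [W[i][j] / math.sqrt(deg[i] * deg[j]) if W[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    f = list(y)
    for _ in range(iters):
        f = [
            alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
            for i in range(n)
        ]
    return f


if __name__ == "__main__":
    # Three nodes in a chain; node 0 is the query.
    W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(manifold_rank(W, [1.0, 0.0, 0.0]))
```

Scores fall off with graph distance from the query node, so thresholding the resulting ranking map gives a segmentation in the spirit of the final step described in the abstract.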
ERIC Educational Resources Information Center
Johnson, Millie
1997-01-01
Graphs from media sources and questions developed from them can be used in the middle school mathematics classroom. Graphs depict storage temperature on a milk carton; air pressure measurements on a package of shock absorbers; sleep-wake patterns of an infant; a dog's breathing patterns; and the angle, velocity, and radius of a leaning bicyclist…
Representation of activity in images using geospatial temporal graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brost, Randolph; McLendon, III, William C.; Parekh, Ojas D.
Various technologies pertaining to modeling patterns of activity observed in remote sensing images using geospatial-temporal graphs are described herein. Graphs are constructed by representing objects in remote sensing images as nodes, and connecting nodes with undirected edges representing either distance or adjacency relationships between objects and directed edges representing changes in time. Activity patterns may be discerned from the graphs by coding nodes representing persistent objects like buildings differently from nodes representing ephemeral objects like vehicles, and examining the geospatial-temporal relationships of ephemeral nodes within the graph.
Automatic Authorship Detection Using Textual Patterns Extracted from Integrated Syntactic Graphs
Gómez-Adorno, Helena; Sidorov, Grigori; Pinto, David; Vilariño, Darnes; Gelbukh, Alexander
2016-01-01
We apply the integrated syntactic graph feature extraction methodology to the task of automatic authorship detection. This graph-based representation allows integrating different levels of language description into a single structure. We extract textual patterns based on features obtained from shortest path walks over integrated syntactic graphs and apply them to determine the authors of documents. On average, our method outperforms the state of the art approaches and gives consistently high results across different corpora, unlike existing methods. Our results show that our textual patterns are useful for the task of authorship attribution. PMID:27589740
Caetano, Tibério S; McAuley, Julian J; Cheng, Li; Le, Quoc V; Smola, Alex J
2009-06-01
As a fundamental problem in pattern recognition, graph matching has applications in a variety of fields, from computer vision to computational biology. In graph matching, patterns are modeled as graphs and pattern recognition amounts to finding a correspondence between the nodes of different graphs. Many formulations of this problem can be cast in general as a quadratic assignment problem, where a linear term in the objective function encodes node compatibility and a quadratic term encodes edge compatibility. The main research focus in this theme is about designing efficient algorithms for approximately solving the quadratic assignment problem, since it is NP-hard. In this paper we turn our attention to a different question: how to estimate compatibility functions such that the solution of the resulting graph matching problem best matches the expected solution that a human would manually provide. We present a method for learning graph matching: the training examples are pairs of graphs and the 'labels' are matches between them. Our experimental results reveal that learning can substantially improve the performance of standard graph matching algorithms. In particular, we find that simple linear assignment with such a learning scheme outperforms Graduated Assignment with bistochastic normalisation, a state-of-the-art quadratic assignment relaxation algorithm.
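The quadratic assignment formulation described above can be written as follows. The notation is one common choice and is not taken from the abstract: y_{ii'} = 1 when node i of the first graph is matched to node i' of the second, c_{ii'} is the node compatibility, and d_{ii'jj'} is the edge compatibility.

```latex
\[
y^{*} = \operatorname*{argmax}_{y}\;
\sum_{i,i'} c_{ii'}\, y_{ii'}
\;+\;
\sum_{i,i',j,j'} d_{ii'jj'}\, y_{ii'}\, y_{jj'}
\quad \text{s.t.} \quad
y_{ii'} \in \{0,1\},\;\;
\sum_{i} y_{ii'} \le 1,\;\;
\sum_{i'} y_{ii'} \le 1 .
\]
```

Dropping the quadratic term leaves a linear assignment problem, which is the simpler model the abstract reports as competitive once the compatibility functions are learned.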
DGEM--a microarray gene expression database for primary human disease tissues.
Xia, Yuni; Campen, Andrew; Rigsby, Dan; Guo, Ying; Feng, Xingdong; Su, Eric W; Palakal, Mathew; Li, Shuyu
2007-01-01
Gene expression patterns can reflect gene regulations in human tissues under normal or pathologic conditions. Gene expression profiling data from studies of primary human disease samples are particularly valuable since these studies often span many years in order to collect patient clinical information and achieve a large sample size. Disease-to-Gene Expression Mapper (DGEM) provides a beneficial community resource to access and analyze these data; it currently includes Affymetrix oligonucleotide array datasets for more than 40 human diseases and 1400 samples. The data are normalized to the same scale and stored in a relational database. A statistical-analysis pipeline was implemented to identify genes abnormally expressed in disease tissues or genes whose expressions are associated with clinical parameters such as cancer patient survival. Data-mining results can be queried through a web-based interface at http://dgem.dhcp.iupui.edu/. The query tool enables dynamic generation of graphs and tables that are further linked to major gene and pathway resources that connect the data to relevant biology, including Entrez Gene and Kyoto Encyclopedia of Genes and Genomes (KEGG). In summary, DGEM provides scientists and physicians a valuable tool to study disease mechanisms, to discover potential disease biomarkers for diagnosis and prognosis, and to identify novel gene targets for drug discovery. The source code is freely available for non-profit use, on request to the authors.
Bakal, Gokhan; Talari, Preetham; Kakani, Elijah V; Kavuluru, Ramakanth
2018-06-01
Identifying new potential treatment options for medical conditions that cause human disease burden is a central task of biomedical research. Since not all candidate drugs can be tested with animal and clinical trials, in vitro approaches are first attempted to identify promising candidates. Likewise, identifying different causal relations between biomedical entities is also critical to understand biomedical processes. Generally, natural language processing (NLP) and machine learning are used to predict specific relations between any given pair of entities using the distant supervision approach. Our objective was to build high accuracy supervised predictive models to predict previously unknown treatment and causative relations between biomedical entities based only on semantic graph pattern features extracted from biomedical knowledge graphs. We used 7000 treats and 2918 causes hand-curated relations from the UMLS Metathesaurus to train and test our models. Our graph pattern features are extracted from simple paths connecting biomedical entities in the SemMedDB graph (based on the well-known SemMedDB database made available by the U.S. National Library of Medicine). Using these graph patterns connecting biomedical entities as features of logistic regression and decision tree models, we computed mean performance measures (precision, recall, F-score) over 100 distinct 80-20% train-test splits of the datasets. For all experiments, we used a positive:negative class imbalance of 1:10 in the test set to model relatively more realistic scenarios. Our models predict treats and causes relations with high F-scores of 99% and 90% respectively. Logistic regression model coefficients also help us identify highly discriminative patterns that have an intuitive interpretation. We are also able to predict some new plausible relations based on false positives that our models scored highly based on our collaborations with two physician co-authors. 
Finally, our decision tree models are able to retrieve over 50% of treatment relations from a recently created external dataset. We employed semantic graph patterns connecting pairs of candidate biomedical entities in a knowledge graph as features to predict treatment/causative relations between them. We provide what we believe is the first evidence in direct prediction of biomedical relations based on graph features. Our work complements lexical pattern based approaches in that the graph patterns can be used as additional features for weakly supervised relation prediction. Copyright © 2018 Elsevier Inc. All rights reserved.
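The idea of using simple paths between two entities as features can be illustrated with a minimal sketch. The triples, entity names, predicate names, and the `path_patterns` helper below are all hypothetical and are not taken from SemMedDB or from the authors' code; the sketch only shows how predicate sequences along simple paths could be counted as features for a pair of entities.

```python
from collections import Counter

# Toy knowledge graph of (subject, predicate, object) triples. Entirely made up.
TRIPLES = [
    ("drugA", "INHIBITS", "geneX"),
    ("geneX", "CAUSES", "diseaseY"),
    ("drugA", "INTERACTS_WITH", "proteinP"),
    ("proteinP", "ASSOCIATED_WITH", "diseaseY"),
    ("drugB", "INHIBITS", "geneZ"),
]


def build_adjacency(triples):
    """Map each subject to its outgoing (predicate, object) pairs."""
    adj = {}
    for subj, pred, obj in triples:
        adj.setdefault(subj, []).append((pred, obj))
    return adj


def path_patterns(adj, source, target, max_len=3):
    """Count predicate sequences along simple directed paths source -> target."""
    patterns = Counter()

    def walk(node, visited, preds):
        if node == target and preds:
            patterns[tuple(preds)] += 1
            return
        if len(preds) == max_len:
            return
        for pred, nxt in adj.get(node, []):
            if nxt not in visited:  # simple path: never revisit a node
                walk(nxt, visited | {nxt}, preds + [pred])

    walk(source, {source}, [])
    return patterns


adj = build_adjacency(TRIPLES)
features = path_patterns(adj, "drugA", "diseaseY")
print(features)
```

Each distinct predicate sequence would become one feature column; a classifier such as logistic regression could then be trained on these counts for labeled entity pairs.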
Graph rigidity, cyclic belief propagation, and point pattern matching.
McAuley, Julian J; Caetano, Tibério S; Barbosa, Marconi S
2008-11-01
A recent paper [1] proposed a provably optimal polynomial time method for performing near-isometric point pattern matching by means of exact probabilistic inference in a chordal graphical model. Its fundamental result is that the chordal graph in question is shown to be globally rigid, implying that exact inference provides the same matching solution as exact inference in a complete graphical model. This implies that the algorithm is optimal when there is no noise in the point patterns. In this paper, we present a new graph that is also globally rigid but has an advantage over the graph proposed in [1]: Its maximal clique size is smaller, rendering inference significantly more efficient. However, this graph is not chordal, and thus, standard Junction Tree algorithms cannot be directly applied. Nevertheless, we show that loopy belief propagation in such a graph converges to the optimal solution. This allows us to retain the optimality guarantee in the noiseless case, while substantially reducing both memory requirements and processing time. Our experimental results show that the accuracy of the proposed solution is indistinguishable from that in [1] when there is noise in the point patterns.
Agile Datacube Analytics (not just) for the Earth Sciences
NASA Astrophysics Data System (ADS)
Misev, Dimitar; Merticariu, Vlad; Baumann, Peter
2017-04-01
Metadata are considered small, smart, and queryable; data, on the other hand, are known as big, clumsy, hard to analyze. Consequently, gridded data - such as images, image timeseries, and climate datacubes - are managed separately from the metadata, and with different, restricted retrieval capabilities. One reason for this silo approach is that databases, while good at tables, XML hierarchies, RDF graphs, etc., traditionally do not support multi-dimensional arrays well. This gap is being closed by Array Databases which extend the SQL paradigm of "any query, anytime" to NoSQL arrays. They introduce semantically rich modelling combined with declarative, high-level query languages on n-D arrays. On the server side, such queries can be optimized, parallelized, and distributed based on partitioned array storage. This way, they offer new vistas in flexibility, scalability, performance, and data integration. In this respect, the forthcoming ISO SQL extension MDA ("Multi-dimensional Arrays") will be a game changer in Big Data Analytics. We introduce concepts and opportunities through the example of rasdaman ("raster data manager") which in fact has pioneered the field of Array Databases and forms the blueprint for ISO SQL/MDA and further Big Data standards, such as OGC WCPS for querying spatio-temporal Earth datacubes. With operational installations exceeding 140 TB, queries have been split across more than one thousand cloud nodes, using CPUs as well as GPUs. Installations can easily be mashed up securely, enabling large-scale location-transparent query processing in federations. Federation queries have been demonstrated live at EGU 2016 spanning Europe and Australia in the context of the intercontinental EarthServer initiative, visualized through NASA WorldWind.
Agile Datacube Analytics (not just) for the Earth Sciences
NASA Astrophysics Data System (ADS)
Baumann, P.
2016-12-01
Metadata are considered small, smart, and queryable; data, on the other hand, are known as big, clumsy, hard to analyze. Consequently, gridded data - such as images, image timeseries, and climate datacubes - are managed separately from the metadata, and with different, restricted retrieval capabilities. One reason for this silo approach is that databases, while good at tables, XML hierarchies, RDF graphs, etc., traditionally do not support multi-dimensional arrays well. This gap is being closed by Array Databases which extend the SQL paradigm of "any query, anytime" to NoSQL arrays. They introduce semantically rich modelling combined with declarative, high-level query languages on n-D arrays. On the server side, such queries can be optimized, parallelized, and distributed based on partitioned array storage. This way, they offer new vistas in flexibility, scalability, performance, and data integration. In this respect, the forthcoming ISO SQL extension MDA ("Multi-dimensional Arrays") will be a game changer in Big Data Analytics. We introduce concepts and opportunities through the example of rasdaman ("raster data manager") which in fact has pioneered the field of Array Databases and forms the blueprint for ISO SQL/MDA and further Big Data standards, such as OGC WCPS for querying spatio-temporal Earth datacubes. With operational installations exceeding 140 TB, queries have been split across more than one thousand cloud nodes, using CPUs as well as GPUs. Installations can easily be mashed up securely, enabling large-scale location-transparent query processing in federations. Federation queries have been demonstrated live at EGU 2016 spanning Europe and Australia in the context of the intercontinental EarthServer initiative, visualized through NASA WorldWind.
2012-01-01
Background: Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2^-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. Results: The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. Conclusions: The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems. PMID:22551152
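The suffix property stated above (sequences sharing an L-long suffix map to points within 2^-L of each other) can be checked with a short sketch. The corner assignment for the DNA alphabet, the starting point (0.5, 0.5), and the use of per-coordinate (Chebyshev) distance are assumptions made here for illustration; the abstract does not fix these details.

```python
# Corner assignment for the DNA alphabet; one common convention, assumed here.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr(sequence, start=(0.5, 0.5)):
    """Return the CGR coordinate reached after reading the whole sequence."""
    x, y = start
    for symbol in sequence:
        cx, cy = CORNERS[symbol]
        # Move halfway from the current point towards the symbol's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y


def chebyshev(p, q):
    """Largest per-coordinate difference between two points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


SUFFIX = "GATTACA"
a = cgr("ACGTTGCA" + SUFFIX)
b = cgr("TTTT" + SUFFIX)
L = len(SUFFIX)
print(chebyshev(a, b), 2 ** -L)
```

Each shared symbol halves the gap between the two points, so after L shared symbols the gap is at most the initial gap (below 1 in the unit square) divided by 2^L.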
An MBO Scheme for Minimizing the Graph Ohta-Kawasaki Functional
NASA Astrophysics Data System (ADS)
van Gennip, Yves
2018-06-01
We study a graph-based version of the Ohta-Kawasaki functional, which was originally introduced in a continuum setting to model pattern formation in diblock copolymer melts and has been studied extensively as a paradigmatic example of a variational model for pattern formation. Graph-based problems inspired by partial differential equations (PDEs) and variational methods have been the subject of many recent papers in the mathematical literature, because of their applications in areas such as image processing and data classification. This paper extends the area of PDE inspired graph-based problems to pattern-forming models, while continuing in the tradition of recent papers in the field. We introduce a mass conserving Merriman-Bence-Osher (MBO) scheme for minimizing the graph Ohta-Kawasaki functional with a mass constraint. We present three main results: (1) the Lyapunov functionals associated with this MBO scheme Γ-converge to the Ohta-Kawasaki functional (which includes the standard graph-based MBO scheme and total variation as a special case); (2) there is a class of graphs on which the Ohta-Kawasaki MBO scheme corresponds to a standard MBO scheme on a transformed graph and for which generalized comparison principles hold; (3) this MBO scheme allows for the numerical computation of (approximate) minimizers of the graph Ohta-Kawasaki functional with a mass constraint.
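To make the "diffuse, then threshold" structure of an MBO scheme concrete, here is a minimal sketch of one step of a plain graph MBO scheme with a mass-conserving threshold. It is not the paper's Ohta-Kawasaki scheme: the nonlocal Ohta-Kawasaki term is omitted, diffusion uses simple explicit Euler steps, and all function names and parameter values are assumptions made for illustration.

```python
def laplacian_apply(weights, u):
    """Return L u for the combinatorial graph Laplacian L = D - W."""
    n = len(u)
    return [sum(weights[i][j] * (u[i] - u[j]) for j in range(n)) for i in range(n)]


def diffuse(weights, u, tau, steps=100):
    """Approximate the graph heat flow du/dt = -L u up to time tau."""
    dt = tau / steps
    for _ in range(steps):
        lu = laplacian_apply(weights, u)
        u = [ui - dt * li for ui, li in zip(u, lu)]
    return u


def mass_conserving_mbo_step(weights, u, tau):
    """Diffuse a 0/1 node function, then set the highest-valued nodes to 1,
    keeping the number of 1-nodes (the mass) unchanged."""
    mass = round(sum(u))
    v = diffuse(weights, u, tau)
    ranked = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
    top = set(ranked[:mass])
    return [1.0 if i in top else 0.0 for i in range(len(v))]


# Path graph on 6 nodes with unit edge weights (a toy example).
n = 6
W = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]
u0 = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
u1 = mass_conserving_mbo_step(W, u0, tau=0.5)
print(u1)
```

Thresholding by rank rather than at a fixed level is what keeps the mass constant from one iterate to the next in this sketch.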
Plaisant, Catherine; Lam, Stanley; Shneiderman, Ben; Smith, Mark S.; Roseman, David; Marchand, Greg; Gillam, Michael; Feied, Craig; Handler, Jonathan; Rappaport, Hank
2008-01-01
As electronic health records (EHR) become more widespread, they enable clinicians and researchers to pose complex queries that can benefit immediate patient care and deepen understanding of medical treatment and outcomes. However, current query tools make complex temporal queries difficult to pose, and physicians have to rely on computer professionals to specify the queries for them. This paper describes our efforts to develop a novel query tool implemented in a large operational system at the Washington Hospital Center (Microsoft Amalga, formerly known as Azyxxi). We describe our design of the interface to specify temporal patterns and the visual presentation of results, and report on a pilot user study looking for adverse reactions following radiology studies using contrast. PMID:18999158
YAHA: fast and flexible long-read alignment with optimal breakpoint detection.
Faust, Gregory G; Hall, Ira M
2012-10-01
With improved short-read assembly algorithms and the recent development of long-read sequencers, split mapping will soon be the preferred method for structural variant (SV) detection. Yet, current alignment tools are not well suited for this. We present YAHA, a fast and flexible hash-based aligner. YAHA is as fast and accurate as BWA-SW at finding the single best alignment per query and is dramatically faster and more sensitive than both SSAHA2 and MegaBLAST at finding all possible alignments. Unlike other aligners that report all, or one, alignment per query, or that use simple heuristics to select alignments, YAHA uses a directed acyclic graph to find the optimal set of alignments that cover a query using a biologically relevant breakpoint penalty. YAHA can also report multiple mappings per defined segment of the query. We show that YAHA detects more breakpoints in less time than BWA-SW across all SV classes, and especially excels at complex SVs comprising multiple breakpoints. YAHA is currently supported on 64-bit Linux systems. Binaries and sample data are freely available for download from http://faculty.virginia.edu/irahall/YAHA. imh4y@virginia.edu.
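The idea of choosing a set of alignments that covers a query while paying a penalty per breakpoint can be sketched as a longest-path computation on a directed acyclic graph. The following is a simplified illustration and not YAHA's actual algorithm: the interval representation, the no-overlap rule for edges, the scoring, and the `best_chain` name are all assumptions.

```python
def best_chain(alignments, breakpoint_penalty):
    """Pick a chain of non-overlapping alignments along the query.

    alignments: list of (q_start, q_end, score) with half-open query intervals
    and q_start < q_end. Returns (total, chain), where total is the sum of
    scores minus breakpoint_penalty for every junction between consecutive
    alignments in the chain.
    """
    if not alignments:
        return 0, []
    # Sorting by query end gives a topological order of the DAG.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][1])
    best, prev = {}, {}
    for pos, i in enumerate(order):
        q_start, _, score = alignments[i]
        best[i], prev[i] = score, None
        for j in order[:pos]:
            # Edge j -> i exists when j ends at or before the start of i.
            if alignments[j][1] <= q_start:
                candidate = best[j] + score - breakpoint_penalty
                if candidate > best[i]:
                    best[i], prev[i] = candidate, j
    end = max(best, key=best.get)
    chain, node = [], end
    while node is not None:
        chain.append(alignments[node])
        node = prev[node]
    return best[end], chain[::-1]


hits = [(0, 50, 50), (50, 100, 45), (0, 100, 60), (40, 60, 10)]
print(best_chain(hits, breakpoint_penalty=20))
print(best_chain(hits, breakpoint_penalty=40))
```

With a small penalty the split mapping (two alignments joined by one breakpoint) wins; with a larger penalty the single lower-scoring full-length alignment is preferred.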
NASA Astrophysics Data System (ADS)
Zheng, Yan
2015-03-01
The Internet of Things (IoT), which focuses on providing users with information exchange and intelligent control, has attracted much attention from researchers all over the world since the beginning of this century. The IoT consists of large numbers of sensor nodes and data processing units, and its most important features are energy constraints, efficient communication and high redundancy. As the number of sensor nodes grows, communication efficiency and available communication bandwidth become bottlenecks. Much existing research assumes queries with few joins, which does not suit the growing number of multi-join queries across the Internet of Things. To improve the communication efficiency between parallel units in the distributed sensor network, this paper proposes a parallel query optimization algorithm based on a distribution attributes cost graph. The algorithm considers the storage information relations and the network communication cost, and establishes an optimized information changing rule. The experimental results show that the algorithm performs well and makes effective use of the resources of each node in the distributed sensor network. Therefore, the execution efficiency of multi-join queries between different nodes can be improved.
XLWrap - Querying and Integrating Arbitrary Spreadsheets with SPARQL
NASA Astrophysics Data System (ADS)
Langegger, Andreas; Wöß, Wolfram
In this paper a novel approach is presented for generating RDF graphs of arbitrary complexity from various spreadsheet layouts. Currently, none of the available spreadsheet-to-RDF wrappers supports cross tables and tables where data is not aligned in rows. Similar to RDF123, XLWrap is based on template graphs where fragments of triples can be mapped to specific cells of a spreadsheet. Additionally, it features a full expression algebra based on the syntax of OpenOffice Calc and various shift operations, which can be used to repeat similar mappings in order to wrap cross tables including multiple sheets and spreadsheet files. The set of available expression functions includes most of the native functions of OpenOffice Calc and can be easily extended by users of XLWrap.
Information Retrieval and Graph Analysis Approaches for Book Recommendation
Benkoussas, Chahinez; Bellot, Patrice
2015-01-01
A combination of multiple information retrieval approaches is proposed for the purpose of book recommendation. In this paper, book recommendation is based on complex user queries. We used different theoretical retrieval models, a probabilistic model (InL2, a Divergence from Randomness model) and a language model, and tested their interpolated combination. Graph analysis algorithms such as PageRank have been successful in Web environments. We consider the application of this algorithm in a new retrieval approach to a related document network comprised of social links. We call Directed Graph of Documents (DGD) a network constructed with documents and the social information provided by each of them. Specifically, this work tackles the problem of book recommendation in the context of INEX (Initiative for the Evaluation of XML retrieval) Social Book Search track. A series of reranking experiments demonstrate that combining retrieval models yields significant improvements in terms of standard ranked retrieval metrics. These results extend the applicability of link analysis algorithms to different environments. PMID:26504899
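As an illustration of the link analysis step, the sketch below runs a basic PageRank power iteration on a small directed graph of documents. The graph, the document names, the damping factor of 0.85, and the iteration count are assumptions for illustration; the abstract does not describe how the authors configured PageRank or combined it with the retrieval scores.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    links maps each node to the list of nodes it points to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical directed graph of documents linked by social information.
doc_links = {
    "book1": ["book2", "book3"],
    "book2": ["book3"],
    "book3": ["book1"],
    "book4": ["book3"],
}
scores = pagerank(doc_links)
for doc in sorted(scores, key=scores.get, reverse=True):
    print(doc, round(scores[doc], 4))
```

In a reranking setting, such scores could be interpolated with the retrieval model scores of the documents returned for a query.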
Mining and Querying Multimedia Data
2011-09-29
able to capture more subtle spatial variations such as repetitiveness. Local feature descriptors such as SIFT [74] and SURF [12] have also been widely...empirically set to s = 90%, r = 50%, K = 20, where small variations lead to little perturbation of the output. The pseudo-code of the algorithm is...by constructing a three-layer graph based on clustering outputs, and executing a slight variation of random walk with restart algorithm. It provided
Sehnal, David; Pravda, Lukáš; Svobodová Vařeková, Radka; Ionescu, Crina-Maria; Koča, Jaroslav
2015-07-01
Well defined biomacromolecular patterns such as binding sites, catalytic sites, specific protein or nucleic acid sequences, etc. precisely modulate many important biological phenomena. We introduce PatternQuery, a web-based application designed for detection and fast extraction of such patterns. The application uses a unique query language with Python-like syntax to define the patterns that will be extracted from datasets provided by the user, or from the entire Protein Data Bank (PDB). Moreover, the database-wide search can be restricted using a variety of criteria, such as PDB ID, resolution, and organism of origin, to provide only relevant data. The extraction generally takes a few seconds for several hundreds of entries, up to approximately one hour for the whole PDB. The detected patterns are made available for download to enable further processing, as well as presented in a clear tabular and graphical form directly in the browser. The unique design of the language and the provided service could pave the way towards novel PDB-wide analyses, which were either difficult or unfeasible in the past. The application is available free of charge at http://ncbr.muni.cz/PatternQuery. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Manchester visual query language
NASA Astrophysics Data System (ADS)
Oakley, John P.; Davis, Darryl N.; Shann, Richard T.
1993-04-01
We report a database language for visual retrieval which allows queries on image feature information which has been computed and stored along with images. The language is novel in that it provides facilities for dealing with feature data which has actually been obtained from image analysis. Each line in the Manchester Visual Query Language (MVQL) takes a set of objects as input and produces another, usually smaller, set as output. The MVQL constructs are mainly based on proven operators from the field of digital image analysis. An example is the Hough-group operator which takes as input a specification for the objects to be grouped, a specification for the relevant Hough space, and a definition of the voting rule. The output is a ranked list of high scoring bins. The query could be directed towards one particular image or an entire image database, in the latter case the bins in the output list would in general be associated with different images. We have implemented MVQL in two layers. The command interpreter is a Lisp program which maps each MVQL line to a sequence of commands which are used to control a specialized database engine. The latter is a hybrid graph/relational system which provides low-level support for inheritance and schema evolution. In the paper we outline the language and provide examples of useful queries. We also describe our solution to the engineering problems associated with the implementation of MVQL.
A fuzzy pattern matching method based on graph kernel for lithography hotspot detection
NASA Astrophysics Data System (ADS)
Nitta, Izumi; Kanazawa, Yuzi; Ishida, Tsutomu; Banno, Koji
2017-03-01
In advanced technology nodes, lithography hotspot detection has become one of the most significant issues in design for manufacturability. Recently, machine learning based lithography hotspot detection has been widely investigated, but it has a trade-off between detection accuracy and false alarms. To apply machine learning based techniques to the physical verification phase, designers need to minimize undetected hotspots to avoid yield degradation. They also need a ranking of known patterns similar to a detected hotspot to prioritize the layout patterns to be corrected. To achieve high detection accuracy and to prioritize detected hotspots, we propose a novel lithography hotspot detection method using Delaunay triangulation and graph kernel based machine learning. Delaunay triangulation extracts features of hotspot patterns where polygons are located irregularly and close to one another, and the graph kernel expresses the inner structure of graphs. Additionally, our method provides the similarity between two patterns and creates a list of training patterns similar to a detected hotspot. Experimental results on the ICCAD 2012 benchmarks show that our method achieves high accuracy within an allowable range of false alarms. We also show the ranking of the known patterns similar to a detected hotspot.
Monitoring operational data production applying Big Data tooling
NASA Astrophysics Data System (ADS)
Som de Cerff, Wim; de Jong, Hotze; van den Berg, Roy; Bos, Jeroen; Oosterhoff, Rijk; Klein Ikkink, Henk Jan; Haga, Femke; Elsten, Tom; Verhoef, Hans; Koutek, Michal; van de Vegte, John
2015-04-01
Within the KNMI Deltaplan programme for improving the KNMI operational infrastructure, a new fully automated system for monitoring the KNMI operational data production systems is being developed: PRISMA (PRocessflow Infrastructure Surveillance and Monitoring Application). Currently the KNMI operational (24/7) production systems consist of over 60 applications, running on different hardware systems and platforms. They are interlinked for the production of numerous data products, which are delivered to internal and external customers. All applications are individually monitored by different applications, complicating root cause and impact analysis. Also, the underlying hardware and network are monitored separately using Zabbix. The goal of the new system is to enable production chain monitoring, which enables root cause analysis (what is the root cause of the disruption) and impact analysis (what other products will be affected). The PRISMA system will make it possible to dispose of all the existing monitoring applications, providing one interface for monitoring the data production. For modeling the production chain, the Neo4j graph database is used to store and query the model. The model can be edited through the PRISMA web interface, but is mainly automatically provided by the applications and systems which are to be monitored. The graph enables us to do root cause and impact analysis. The graph can be visualized in the PRISMA web interface on different levels. Each 'monitored object' in the model will have a status (OK, error, warning, unknown). This status is derived by combining all log information available. For collecting and querying the log information Splunk is used. The system is developed using Scrum, by a multi-disciplinary team consisting of analysts, developers, a tester and an interaction designer. In the presentation we will focus on the lessons learned working with the 'Big data' tooling Splunk and Neo4j.
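Root cause and impact analysis on a production chain graph both reduce to graph traversal: impact analysis follows edges downstream, root cause analysis follows them upstream. The sketch below uses plain Python dictionaries rather than Neo4j, and every node name, status value, and function in it is hypothetical; it only illustrates the kind of query such a graph model makes possible.

```python
from collections import deque

# Hypothetical production chain: an edge A -> B means B consumes output of A.
FEEDS = {
    "radar_ingest": ["precip_product"],
    "model_run": ["precip_product", "wind_product"],
    "precip_product": ["public_website"],
    "wind_product": ["aviation_feed"],
}
STATUS = {
    "radar_ingest": "OK",
    "model_run": "error",
    "precip_product": "error",
    "wind_product": "error",
    "public_website": "warning",
    "aviation_feed": "OK",
}


def reachable(graph, start):
    """Breadth-first search: all nodes reachable from start (excluding start)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so traversal runs upstream instead of downstream."""
    rev = {}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


def impact(node):
    """Everything downstream of node: products that may be affected."""
    return reachable(FEEDS, node)


def root_causes(node):
    """Failing nodes at or upstream of node whose own inputs are not failing."""
    rev = reverse(FEEDS)
    failing = {n for n in reachable(rev, node) | {node} if STATUS[n] == "error"}
    return {n for n in failing
            if not any(STATUS[p] == "error" for p in rev.get(n, []))}


print(sorted(impact("model_run")))
print(sorted(root_causes("precip_product")))
```

In a graph database the same two questions would typically be expressed as variable-length path queries over the stored model.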
Visualizing and Validating Metadata Traceability within the CDISC Standards
Hume, Sam; Sarnikar, Surendra; Becnel, Lauren; Bennett, Dorine
2017-01-01
The Food & Drug Administration has begun requiring that electronic submissions of regulated clinical studies utilize the Clinical Data Interchange Standards Consortium data standards. Within regulated clinical research, traceability is a requirement and indicates that the analysis results can be traced back to the original source data. Current solutions for clinical research data traceability are limited in terms of querying, validation and visualization capabilities. This paper describes (1) the development of metadata models to support computable traceability and traceability visualizations that are compatible with industry data standards for the regulated clinical research domain, (2) adaptation of graph traversal algorithms to make them capable of identifying traceability gaps and validating traceability across the clinical research data lifecycle, and (3) development of a traceability query capability for retrieval and visualization of traceability information. PMID:28815125
cMapper: gene-centric connectivity mapper for EBI-RDF platform.
Shoaib, Muhammad; Ansari, Adnan Ahmad; Ahn, Sung-Min
2017-01-15
In this era of biological big data, data integration has become a common task and a challenge for biologists. The Resource Description Framework (RDF) was developed to enable interoperability of heterogeneous datasets. The EBI-RDF platform enables an efficient data integration of six independent biological databases using RDF technologies and shared ontologies. However, to take advantage of this platform, biologists need to be familiar with RDF technologies and SPARQL query language. To overcome this practical limitation of the EBI-RDF platform, we developed cMapper, a web-based tool that enables biologists to search the EBI-RDF databases in a gene-centric manner without a thorough knowledge of RDF and SPARQL. cMapper allows biologists to search data entities in the EBI-RDF platform that are connected to genes or small molecules of interest in multiple biological contexts. The input to cMapper consists of a set of genes or small molecules, and the output are data entities in six independent EBI-RDF databases connected with the given genes or small molecules in the user's query. cMapper provides output to users in the form of a graph in which nodes represent data entities and the edges represent connections between data entities and inputted set of genes or small molecules. Furthermore, users can apply filters based on database, taxonomy, organ and pathways in order to focus on a core connectivity graph of their interest. Data entities from multiple databases are differentiated based on background colors. cMapper also enables users to investigate shared connections between genes or small molecules of interest. Users can view the output graph on a web browser or download it in either GraphML or JSON formats. cMapper is available as a web application with an integrated MySQL database. The web application was developed using Java and deployed on Tomcat server. We developed the user interface using HTML5, JQuery and the Cytoscape Graph API. 
cMapper can be accessed at http://cmapper.ewostech.net. Readers can download the development manual from the website http://cmapper.ewostech.net/docs/cMapperDocumentation.pdf. Source code is available at https://github.com/muhammadshoaib/cmapper. Contact: smahn@gachon.ac.kr. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
RAG-3D: A search tool for RNA 3D substructures
Zahran, Mai; Sevim Bayrak, Cigdem; Elmetwaly, Shereef; ...
2015-08-24
In this study, to address many challenges in RNA structure/function prediction, the characterization of RNA's modular architectural units is required. Using the RNA-As-Graphs (RAG) database, we have previously explored the existence of secondary structure (2D) submotifs within larger RNA structures. Here we present RAG-3D, a dataset of RNA tertiary (3D) structures and substructures plus a web-based search tool, designed to exploit graph representations of RNAs for the goal of searching for similar 3D structural fragments. The objects in RAG-3D consist of 3D structures translated into 3D graphs, cataloged based on the connectivity between their secondary structure elements. Each graph is additionally described in terms of its subgraph building blocks. The RAG-3D search tool then compares a query RNA 3D structure to those in the database to obtain structurally similar structures and substructures. This comparison reveals conserved 3D RNA features and thus may suggest functional connections. Though RNA search programs based on similarity in sequence, 2D, and/or 3D structural elements are available, our graph-based search tool may be advantageous for illuminating similarities that are not obvious; using motifs rather than sequence space also reduces search times considerably. Ultimately, such substructuring could be useful for RNA 3D structure prediction, structure/function inference and inverse folding.
RAG-3D: a search tool for RNA 3D substructures
Zahran, Mai; Sevim Bayrak, Cigdem; Elmetwaly, Shereef; Schlick, Tamar
2015-01-01
To address many challenges in RNA structure/function prediction, the characterization of RNA's modular architectural units is required. Using the RNA-As-Graphs (RAG) database, we have previously explored the existence of secondary structure (2D) submotifs within larger RNA structures. Here we present RAG-3D—a dataset of RNA tertiary (3D) structures and substructures plus a web-based search tool—designed to exploit graph representations of RNAs for the goal of searching for similar 3D structural fragments. The objects in RAG-3D consist of 3D structures translated into 3D graphs, cataloged based on the connectivity between their secondary structure elements. Each graph is additionally described in terms of its subgraph building blocks. The RAG-3D search tool then compares a query RNA 3D structure to those in the database to obtain structurally similar structures and substructures. This comparison reveals conserved 3D RNA features and thus may suggest functional connections. Though RNA search programs based on similarity in sequence, 2D, and/or 3D structural elements are available, our graph-based search tool may be advantageous for illuminating similarities that are not obvious; using motifs rather than sequence space also reduces search times considerably. Ultimately, such substructuring could be useful for RNA 3D structure prediction, structure/function inference and inverse folding. PMID:26304547
RAG-3D: A search tool for RNA 3D substructures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zahran, Mai; Sevim Bayrak, Cigdem; Elmetwaly, Shereef
In this study, to address many challenges in RNA structure/function prediction, the characterization of RNA's modular architectural units is required. Using the RNA-As-Graphs (RAG) database, we have previously explored the existence of secondary structure (2D) submotifs within larger RNA structures. Here we present RAG-3D, a dataset of RNA tertiary (3D) structures and substructures plus a web-based search tool, designed to exploit graph representations of RNAs for the goal of searching for similar 3D structural fragments. The objects in RAG-3D consist of 3D structures translated into 3D graphs, cataloged based on the connectivity between their secondary structure elements. Each graph is additionally described in terms of its subgraph building blocks. The RAG-3D search tool then compares a query RNA 3D structure to those in the database to obtain structurally similar structures and substructures. This comparison reveals conserved 3D RNA features and thus may suggest functional connections. Though RNA search programs based on similarity in sequence, 2D, and/or 3D structural elements are available, our graph-based search tool may be advantageous for illuminating similarities that are not obvious; using motifs rather than sequence space also reduces search times considerably. Ultimately, such substructuring could be useful for RNA 3D structure prediction, structure/function inference and inverse folding.
Huang, Xiaoke; Zhao, Ye; Yang, Jing; Zhang, Chong; Ma, Chao; Ye, Xinyue
2016-01-01
We propose TrajGraph, a new visual analytics method, for studying urban mobility patterns by integrating graph modeling and visual analysis with taxi trajectory data. A special graph is created to store and manifest real traffic information recorded by taxi trajectories over city streets. It conveys urban transportation dynamics which can be discovered by applying graph analysis algorithms. To support interactive, multiscale visual analytics, a graph partitioning algorithm is applied to create region-level graphs which have smaller size than the original street-level graph. Graph centralities, including Pagerank and betweenness, are computed to characterize the time-varying importance of different urban regions. The centralities are visualized by three coordinated views including a node-link graph view, a map view and a temporal information view. Users can interactively examine the importance of streets to discover and assess city traffic patterns. We have implemented a fully working prototype of this approach and evaluated it using massive taxi trajectories of Shenzhen, China. TrajGraph's capability in revealing the importance of city streets was evaluated by comparing the calculated centralities with the subjective evaluations from a group of drivers in Shenzhen. Feedback from a domain expert was collected. The effectiveness of the visual interface was evaluated through a formal user study. We also present several examples and a case study to demonstrate the usefulness of TrajGraph in urban transportation analysis.
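The PageRank centrality mentioned in the abstract above can be illustrated with a short power-iteration sketch. This is not the TrajGraph implementation: the region names, the unweighted edges and the `pagerank` helper are assumptions made for illustration, and a real traffic graph would likely carry flow weights and be recomputed per time slice.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as {node: [successors]}."""
    nodes = sorted(set(out_links) | {v for targets in out_links.values() for v in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[u] for u in nodes if not out_links.get(u))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                # Each node shares its current rank equally among its successors.
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank


# Hypothetical region-level graph: an edge A -> B means taxis travel from region A to region B.
region_graph = {"A": ["B"], "B": ["A"], "C": ["B"], "D": ["B"]}
region_rank = pagerank(region_graph)
most_important = max(region_rank, key=region_rank.get)
print(most_important, round(region_rank[most_important], 3))
```

Running the same function on one graph per time window would give a time-varying importance score per region, which is the kind of quantity the coordinated views are described as displaying.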
NASA Astrophysics Data System (ADS)
McGibbney, L. J.; Jiang, Y.; Burgess, A. B.
2017-12-01
Big Earth observation data have been produced, archived and made available online, but discovering the right data in a manner that precisely and efficiently satisfies user needs presents a significant challenge to the Earth Science (ES) community. An emerging trend in the information retrieval community is to utilize knowledge graphs to assist users in quickly finding desired information from across knowledge sources. This is particularly prevalent within the fields of social media and complex multimodal information processing, to name but a few; however, building a domain-specific knowledge graph is labour-intensive and hard to keep up-to-date. In this work, we report our progress on the Earth Science Knowledge Graph (ESKG) project, an ESIP-funded testbed project which provides an automatic approach to building a dynamic knowledge graph for ES to improve interdisciplinary data discovery by leveraging implicit, latent existing knowledge present across several U.S. Federal Agencies, e.g. NASA, NOAA and USGS. ESKG strengthens ties between observations and user communities by: 1) developing a knowledge graph derived from various sources, e.g. Web pages, Web Services, etc., via natural language processing and knowledge extraction techniques; 2) allowing users to traverse, explore, query, reason and navigate ES data via knowledge graph interaction. ESKG has the potential to revolutionize the way in which ES communities interact with ES data in the open world through the entity, spatial and temporal linkages and characteristics that make it up. This project enables the advancement of ESIP collaboration areas including both Discovery and Semantic Technologies by putting graph information right at our fingertips in an interactive, modern manner and reducing the effort of constructing ontologies. 
To demonstrate the ESKG concept, we will demonstrate use of our framework across NASA JPL's PO.DAAC, NOAA's Earth Observation Requirements Evaluation System (EORES) and various USGS systems.
NASA Technical Reports Server (NTRS)
Aspinall, David; Denney, Ewen; Lueth, Christoph
2012-01-01
We motivate and introduce a query language PrQL designed for inspecting machine representations of proofs. PrQL natively supports hiproofs which express proof structure using hierarchical nested labelled trees. The core language presented in this paper is locally structured (first-order), with queries built using recursion and patterns over proof structure and rule names. We define the syntax and semantics of locally structured queries, demonstrate their power, and sketch some implementation experiments.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Guo Qiang; Luo, Lingyun; Ogbuji, Chime
The interaction of multiple types of relationships among anatomical classes in the Foundational Model of Anatomy (FMA) can provide inferred information valuable for quality assurance. This paper introduces a method called Motif Checking (MOCH) to study the effects of such multi-relation type interactions. MOCH represents patterns of multitype interaction as small labeled sub-graph motifs, whose nodes represent class variables, and labeled edges represent relational types. By representing FMA as an RDF graph and motifs as SPARQL queries, fragments of FMA are automatically obtained as auditing candidates. Leveraging the scalability and reconfigurability of Semantic Web Technology (OWL, RDF and SPARQL) and Virtuoso, we performed exhaustive analyses of three 2-node motifs, resulting in 638 matching FMA configurations; twelve 3-node motifs, resulting in 202,960 configurations. Using the Principal Ideal Explorer (PIE) methodology as an extension of MOCH, we were able to identify 755 root nodes with 4,100 respective descendants with opposing antonyms in their class names for arbitrary-length motifs. With possible disjointness implied by antonyms, we performed manual inspection of a subset of the resulting FMA fragments and tracked down a source of abnormal inferred conclusions (captured by the motifs), coming from a gender-neutral class being modeled as a part of gender-specific class, such as “Urinary system” is a part of “Female human body.” Our results demonstrate that MOCH and PIE provide a unique source of valuable information for quality assurance. Since our approach is general, it is applicable to any ontological system with an OWL representation.
Knowledge Discovery from Biomedical Ontologies in Cross Domains.
Shen, Feichen; Lee, Yugyung
2016-01-01
In recent years, there is an increasing demand for sharing and integration of medical data in biomedical research. In order to improve a health care system, it is required to support the integration of data by facilitating semantic interoperability systems and practices. Semantic interoperability is difficult to achieve in these systems as the conceptual models underlying datasets are not fully exploited. In this paper, we propose a semantic framework, called Medical Knowledge Discovery and Data Mining (MedKDD), that aims to build a topic hierarchy and serve the semantic interoperability between different ontologies. For the purpose, we fully focus on the discovery of semantic patterns about the association of relations in the heterogeneous information network representing different types of objects and relationships in multiple biological ontologies and the creation of a topic hierarchy through the analysis of the discovered patterns. These patterns are used to cluster heterogeneous information networks into a set of smaller topic graphs in a hierarchical manner and then to conduct cross domain knowledge discovery from the multiple biological ontologies. Thus, patterns made a greater contribution in the knowledge discovery across multiple ontologies. We have demonstrated the cross domain knowledge discovery in the MedKDD framework using a case study with 9 primary biological ontologies from Bio2RDF and compared it with the cross domain query processing approach, namely SLAP. We have confirmed the effectiveness of the MedKDD framework in knowledge discovery from multiple medical ontologies.
Knowledge Discovery from Biomedical Ontologies in Cross Domains
Shen, Feichen; Lee, Yugyung
2016-01-01
In recent years, there is an increasing demand for sharing and integration of medical data in biomedical research. In order to improve a health care system, it is required to support the integration of data by facilitating semantic interoperability systems and practices. Semantic interoperability is difficult to achieve in these systems as the conceptual models underlying datasets are not fully exploited. In this paper, we propose a semantic framework, called Medical Knowledge Discovery and Data Mining (MedKDD), that aims to build a topic hierarchy and serve the semantic interoperability between different ontologies. For the purpose, we fully focus on the discovery of semantic patterns about the association of relations in the heterogeneous information network representing different types of objects and relationships in multiple biological ontologies and the creation of a topic hierarchy through the analysis of the discovered patterns. These patterns are used to cluster heterogeneous information networks into a set of smaller topic graphs in a hierarchical manner and then to conduct cross domain knowledge discovery from the multiple biological ontologies. Thus, patterns made a greater contribution in the knowledge discovery across multiple ontologies. We have demonstrated the cross domain knowledge discovery in the MedKDD framework using a case study with 9 primary biological ontologies from Bio2RDF and compared it with the cross domain query processing approach, namely SLAP. We have confirmed the effectiveness of the MedKDD framework in knowledge discovery from multiple medical ontologies. PMID:27548262
Evaluating structural pattern recognition for handwritten math via primitive label graphs
NASA Astrophysics Data System (ADS)
Zanibbi, Richard; Mouchère, Harold; Viard-Gaudin, Christian
2013-01-01
Currently, structural pattern recognizer evaluations compare graphs of detected structure to target structures (i.e. ground truth) using recognition rates, recall and precision for object segmentation, classification and relationships. In document recognition, these target objects (e.g. symbols) are frequently comprised of multiple primitives (e.g. connected components, or strokes for online handwritten data), but current metrics do not characterize errors at the primitive level, from which object-level structure is obtained. Primitive label graphs are directed graphs defined over primitives and primitive pairs. We define new metrics obtained by Hamming distances over label graphs, which allow classification, segmentation and parsing errors to be characterized separately, or using a single measure. Recall and precision for detected objects may also be computed directly from label graphs. We illustrate the new metrics by comparing a new primitive-level evaluation to the symbol-level evaluation performed for the CROHME 2012 handwritten math recognition competition. A Python-based set of utilities for evaluating, visualizing and translating label graphs is publicly available.
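The Hamming-distance idea in the abstract above can be made concrete with a small sketch. The representation (dictionaries keyed by primitive and by ordered primitive pair), the function name `label_graph_hamming`, the placeholder label "_" and the toy labels "MERGE" and "SUP" are all assumptions for illustration; they are not taken from the authors' published utilities.

```python
def label_graph_hamming(nodes_a, edges_a, nodes_b, edges_b, missing="_"):
    """Count node-label and edge-label disagreements between two label graphs.

    nodes_*: {primitive_id: label}
    edges_*: {(primitive_id, primitive_id): label}; absent pairs get `missing`.
    """
    node_keys = set(nodes_a) | set(nodes_b)
    edge_keys = set(edges_a) | set(edges_b)
    node_errors = sum(nodes_a.get(k, missing) != nodes_b.get(k, missing) for k in node_keys)
    edge_errors = sum(edges_a.get(k, missing) != edges_b.get(k, missing) for k in edge_keys)
    return node_errors, edge_errors


# Toy ground truth: strokes s1 and s2 together form an "x", stroke s3 is a superscript "2".
truth_nodes = {"s1": "x", "s2": "x", "s3": "2"}
truth_edges = {
    ("s1", "s2"): "MERGE", ("s2", "s1"): "MERGE",  # same-symbol strokes
    ("s1", "s3"): "SUP", ("s2", "s3"): "SUP",      # spatial relation to the superscript
}

# Toy recognizer output: s2 was split off and classified as "c".
output_nodes = {"s1": "x", "s2": "c", "s3": "2"}
output_edges = {("s1", "s3"): "SUP"}

print(label_graph_hamming(truth_nodes, truth_edges, output_nodes, output_edges))  # (1, 3)
```

In this toy case the single node disagreement reflects the classification error, while the three edge disagreements reflect the segmentation and relationship errors, so the two kinds of mistake can be reported separately or summed into one measure.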
ERIC Educational Resources Information Center
Harris, David; Gomez Zwiep, Susan
2013-01-01
Graphs represent complex information. They show relationships and help students see patterns and compare data. Students often do not appreciate the illuminating power of graphs, interpreting them literally rather than as symbolic representations (Leinhardt, Zaslavsky, and Stein 1990). Students often read graphs point by point instead of seeing…
The Interplay of Graph and Text in the Acquisition of Historical Constructs
ERIC Educational Resources Information Center
Shand, Kristen
2009-01-01
Graphs are often conjoined with text passages in history textbooks to help students comprehend complex constructs. Four linkages connect text and graphs: appropriate elements, fitting patterns, suitable labels and causal markers. Graphs in current textbooks contain few such linkages and seldom mirror the construct under study. An experiment…
ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining.
Huan, Tianxiao; Sivachenko, Andrey Y; Harrison, Scott H; Chen, Jake Y
2008-08-12
New systems biology studies require researchers to understand how interplay among myriads of biomolecular entities is orchestrated in order to achieve high-level cellular and physiological functions. Many software tools have been developed in the past decade to help researchers visually navigate large networks of biomolecular interactions with built-in template-based query capabilities. To further advance researchers' ability to interrogate global physiological states of cells through multi-scale visual network explorations, new visualization software tools still need to be developed to empower the analysis. A robust visual data analysis platform driven by database management systems to perform bi-directional data processing-to-visualizations with declarative querying capabilities is needed. We developed ProteoLens as a JAVA-based visual analytic software tool for creating, annotating and exploring multi-scale biological networks. It supports direct database connectivity to either Oracle or PostgreSQL database tables/views, on which SQL statements using both Data Definition Languages (DDL) and Data Manipulation languages (DML) may be specified. The robust query languages embedded directly within the visualization software help users to bring their network data into a visualization context for annotation and exploration. ProteoLens supports graph/network represented data in standard Graph Modeling Language (GML) formats, and this enables interoperation with a wide range of other visual layout tools. 
The architectural design of ProteoLens enables the de-coupling of complex network data visualization tasks into two distinct phases: 1) creating network data association rules, which are mapping rules between network node IDs or edge IDs and data attributes such as functional annotations, expression levels, scores, synonyms, descriptions etc; 2) applying network data association rules to build the network and perform the visual annotation of graph nodes and edges according to associated data values. We demonstrated the advantages of these new capabilities through three biological network visualization case studies: human disease association network, drug-target interaction network and protein-peptide mapping network. The architectural design of ProteoLens makes it suitable for bioinformatics expert data analysts who are experienced with relational database management to perform large-scale integrated network visual explorations. ProteoLens is a promising visual analytic platform that will facilitate knowledge discoveries in future network and systems biology studies.
Jiang, Ying; Gao, Ge; Fang, Gang; Gustafson, Eric L; Laverty, Maureen; Yin, Yanbin; Zhang, Yong; Luo, Jingchu; Greene, Jonathan R; Bayne, Marvin L; Hedrick, Joseph A; Murgolo, Nicholas J
2003-05-01
PepPat, a hybrid method that combines pattern matching with similarity scoring, is described. We also report PepPat's application in the identification of a novel tachykinin-like peptide. PepPat takes as input a query peptide and a user-specified regular expression pattern within the peptide. It first performs a database pattern match and then ranks candidates on the basis of their similarity to the query peptide. PepPat calculates similarity over the pattern spanning region, enhancing PepPat's sensitivity for short query peptides. PepPat can also search for a user-specified number of occurrences of a repeated pattern within the target sequence. We illustrate PepPat's application in short peptide ligand mining. As a validation example, we report the identification of a novel tachykinin-like peptide, C14TKL-1, and show it is an NK1 (neurokinin receptor 1) agonist whose message is widely expressed in human periphery. PepPat is offered online at: http://peppat.cbi.pku.edu.cn.
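The two-step flow described above (pattern match first, then rank by similarity over the pattern-spanning region) might be sketched as follows. This is not PepPat's code: the function name, the position-wise identity score (a stand-in for whatever similarity scoring the tool uses) and the toy database sequences are assumptions for illustration.

```python
import re


def pattern_then_rank(query, pattern, database):
    """Step 1: regex match in each database sequence.
    Step 2: rank hits by identity to the query's pattern-spanning region."""
    regex = re.compile(pattern)
    query_hit = regex.search(query)
    if query_hit is None:
        raise ValueError("the pattern must occur within the query peptide")
    query_region = query_hit.group(0)

    ranked = []
    for name, sequence in database.items():
        for hit in regex.finditer(sequence):
            region = hit.group(0)
            # Assumed score: fraction of identical positions (a simple stand-in
            # for a substitution-matrix based similarity score).
            matches = sum(a == b for a, b in zip(query_region, region))
            score = matches / max(len(query_region), len(region))
            ranked.append((score, name, hit.start(), region))
    # Best score first; ties broken by name and position for a stable order.
    ranked.sort(key=lambda item: (-item[0], item[1], item[2]))
    return ranked


# Made-up toy database; the query is the substance P sequence.
toy_database = {
    "pepA": "DKFVGLMAA",
    "pepB": "HKTDSFYGLMGK",
    "pepC": "AAAAGGG",
    "pepD": "QQFFGLMKR",
}
hits = pattern_then_rank("RPKPQQFFGLM", r"F.GLM", toy_database)
for score, name, start, region in hits:
    print(name, start, region, score)
```

Restricting the score to the matched region means flanking residues do not dilute the comparison, which is the property the abstract credits for sensitivity with short query peptides.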
The Amordad database engine for metagenomics.
Behnam, Ehsan; Smith, Andrew D
2014-10-15
Several technical challenges in metagenomic data analysis, including assembling metagenomic sequence data or identifying operational taxonomic units, are both significant and well known. These forms of analysis are increasingly cited as conceptually flawed, given the extreme variation within traditionally defined species and rampant horizontal gene transfer. Furthermore, computational requirements of such analysis have hindered content-based organization of metagenomic data at large scale. In this article, we introduce the Amordad database engine for alignment-free, content-based indexing of metagenomic datasets. Amordad places the metagenome comparison problem in a geometric context, and uses an indexing strategy that combines random hashing with a regular nearest neighbor graph. This framework allows refinement of the database over time by continual application of random hash functions, with the effect of each hash function encoded in the nearest neighbor graph. This eliminates the need to explicitly maintain the hash functions in order for query efficiency to benefit from the accumulated randomness. Results on real and simulated data show that Amordad can support logarithmic query time for identifying similar metagenomes even as the database size reaches into the millions. Source code, licensed under the GNU general public license (version 3), is freely available for download from http://smithlabresearch.org/amordad. Contact: andrewds@usc.edu. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Improving the Accuracy of Attribute Extraction using the Relatedness between Attribute Values
NASA Astrophysics Data System (ADS)
Bollegala, Danushka; Tani, Naoki; Ishizuka, Mitsuru
Extracting attribute-values related to entities from web texts is an important step in numerous web related tasks such as information retrieval, information extraction, and entity disambiguation (namesake disambiguation). For example, for a search query that contains a personal name, we can not only return documents that contain that personal name, but if we have attribute-values such as the organization for which that person works, we can also suggest documents that contain information related to that organization, thereby improving the user's search experience. Despite numerous potential applications of attribute extraction, it remains a challenging task due to the inherent noise in web data -- often a single web page contains multiple entities and attributes. We propose a graph-based approach to select the correct attribute-values from a set of candidate attribute-values extracted for a particular entity. First, we build an undirected weighted graph in which, attribute-values are represented by nodes, and the edge that connects two nodes in the graph represents the degree of relatedness between the corresponding attribute-values. Next, we find the maximum spanning tree of this graph that connects exactly one attribute-value for each attribute-type. The proposed method outperforms previously proposed attribute extraction methods on a dataset that contains 5000 web pages.
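The selection step described above (one attribute-value per attribute-type, chosen so that the maximum spanning tree over the chosen values is as heavy as possible) can be sketched with exhaustive enumeration and Kruskal's algorithm. The enumeration is a simple stand-in and not necessarily how the authors search; the function names, attribute types and relatedness weights below are hypothetical.

```python
from itertools import combinations, product


def max_spanning_tree_weight(nodes, weight):
    """Kruskal's algorithm on the complete graph over `nodes`, heaviest edges first."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    total = 0.0
    for a, b in sorted(combinations(nodes, 2), key=lambda e: weight(*e), reverse=True):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # edge joins two components, so it belongs to the tree
            parent[root_a] = root_b
            total += weight(a, b)
    return total


def select_attribute_values(candidates, relatedness):
    """candidates: {attribute_type: [values]}; relatedness: {frozenset({v1, v2}): weight}."""
    def weight(a, b):
        return relatedness.get(frozenset((a[1], b[1])), 0.0)

    types = sorted(candidates)
    best_choice, best_score = None, float("-inf")
    # Try every way of picking exactly one value per attribute type.
    for choice in product(*([(t, v) for v in candidates[t]] for t in types)):
        score = max_spanning_tree_weight(choice, weight)
        if score > best_score:
            best_choice, best_score = choice, score
    return dict(best_choice), best_score


# Hypothetical candidates extracted for one person, with made-up relatedness weights.
candidates = {
    "affiliation": ["Univ. of Tokyo", "Acme Corp"],
    "occupation": ["professor", "plumber"],
    "location": ["Tokyo", "Springfield"],
}
relatedness = {
    frozenset(("Univ. of Tokyo", "professor")): 0.9,
    frozenset(("Univ. of Tokyo", "Tokyo")): 0.8,
    frozenset(("professor", "Tokyo")): 0.3,
    frozenset(("Acme Corp", "plumber")): 0.4,
    frozenset(("Acme Corp", "Springfield")): 0.2,
    frozenset(("plumber", "Springfield")): 0.5,
}
selected, score = select_attribute_values(candidates, relatedness)
print(selected, score)
```

The enumeration grows exponentially with the number of attribute types, so it is only suitable for small candidate sets like this one.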
Jing, Xia; Cimino, James J.
2011-01-01
Objective: To explore new graphical methods for reducing and analyzing large data sets in which the data are coded with a hierarchical terminology. Methods: We use a hierarchical terminology to organize a data set and display it in a graph. We reduce the size and complexity of the data set by considering the terminological structure and the data set itself (using a variety of thresholds) as well as contributions of child level nodes to parent level nodes. Results: We found that our methods can reduce large data sets to manageable size and highlight the differences among graphs. The thresholds used as filters to reduce the data set can be used alone or in combination. We applied our methods to two data sets containing information about how nurses and physicians query online knowledge resources. The reduced graphs make the differences between the two groups readily apparent. Conclusions: This is a new approach to reduce size and complexity of large data sets and to simplify visualization. This approach can be applied to any data sets that are coded with hierarchical terminologies. PMID:22195119
Combining conceptual graphs and argumentation for aiding in the teleexpertise.
Doumbouya, Mamadou Bilo; Kamsu-Foguem, Bernard; Kenfack, Hugues; Foguem, Clovis
2015-08-01
Current medical information systems are too complex to be meaningfully exploited. Hence there is a need to develop new strategies for maximising the exploitation of medical data to the benefit of medical professionals. It is against this backdrop that we want to propose a tangible contribution by providing a tool which combines conceptual graphs and Dung's argumentation system in order to assist medical professionals in their decision making process. The proposed tool allows medical professionals to easily manipulate and visualise queries and answers for making decisions during the practice of teleexpertise. The knowledge modelling is made using an open application programming interface (API) called CoGui, which offers the means for building structured knowledge bases with the dedicated functionalities of graph-based reasoning via retrieved data from different institutions (hospitals, national security centre, and nursing homes). The tool that we have described in this study supports a formal traceable structure of the reasoning with acceptable arguments to elucidate some ethical problems that occur very often in the telemedicine domain. Copyright © 2015 Elsevier Ltd. All rights reserved.
OLSVis: an animated, interactive visual browser for bio-ontologies
2012-01-01
Background: More than one million terms from biomedical ontologies and controlled vocabularies are available through the Ontology Lookup Service (OLS). Although OLS provides ample possibility for querying and browsing terms, the visualization of parts of the ontology graphs is rather limited and inflexible. Results: We created the OLSVis web application, a visualiser for browsing all ontologies available in the OLS database. OLSVis shows customisable subgraphs of the OLS ontologies. Subgraphs are animated via a real-time force-based layout algorithm which is fully interactive: each time the user makes a change, e.g. browsing to a new term, hiding, adding, or dragging terms, the algorithm performs smooth and only essential reorganisations of the graph. This assures an optimal viewing experience, because subsequent screen layouts are not grossly altered, and users can easily navigate through the graph. URL: http://ols.wordvis.com Conclusions: The OLSVis web application provides a user-friendly tool to visualise ontologies from the OLS repository. It broadens the possibilities to investigate and select ontology subgraphs through a smooth visualisation method. PMID:22646023
Selecting materialized views using random algorithm
NASA Astrophysics Data System (ADS)
Zhou, Lijuan; Hao, Zhongxiao; Liu, Chi
2007-04-01
The data warehouse is a repository of information collected from multiple, possibly heterogeneous, autonomous distributed databases. The information stored at the data warehouse is in the form of views referred to as materialized views. The selection of the materialized views is one of the most important decisions in designing a data warehouse. Materialized views are stored in the data warehouse for the purpose of efficiently implementing on-line analytical processing queries. The first issue for the user to consider is query response time. So in this paper, we develop algorithms to select a set of views to materialize in a data warehouse in order to minimize the total view maintenance cost under the constraint of a given query response time. We call it the query_cost view_selection problem. First, the cost graph and cost model of the query_cost view_selection problem are presented. Second, methods for selecting materialized views by using random algorithms are presented. The genetic algorithm is applied to the materialized view selection problem. However, as the genetic process advances, legal solutions become more and more difficult to produce, so many solutions are eliminated and the time needed to produce solutions lengthens. Therefore, an improved algorithm is presented in this paper, which combines the simulated annealing algorithm and the genetic algorithm to solve the query_cost view_selection problem. Finally, simulation experiments are used to test the function and efficiency of our algorithms. The experiments show that the given methods can provide near-optimal solutions in limited time and work better in practical cases. Randomized algorithms will become invaluable tools for data warehouse evolution.
Multi-Centrality Graph Spectral Decompositions and Their Application to Cyber Intrusion Detection
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Pin-Yu; Choudhury, Sutanay; Hero, Alfred
Many modern datasets can be represented as graphs and hence spectral decompositions such as graph principal component analysis (PCA) can be useful. Distinct from previous graph decomposition approaches based on subspace projection of a single topological feature, e.g., the centered graph adjacency matrix (graph Laplacian), we propose spectral decomposition approaches to graph PCA and graph dictionary learning that integrate multiple features, including graph walk statistics, centrality measures and graph distances to reference nodes. In this paper we propose a new PCA method for single graph analysis, called multi-centrality graph PCA (MC-GPCA), and a new dictionary learning method for ensembles of graphs, called multi-centrality graph dictionary learning (MC-GDL), both based on spectral decomposition of multi-centrality matrices. As an application to cyber intrusion detection, MC-GPCA can be an effective indicator of anomalous connectivity pattern and MC-GDL can provide discriminative basis for attack classification.
Bottom-Up Evaluation of Twig Join Pattern Queries in XML Document Databases
NASA Astrophysics Data System (ADS)
Chen, Yangjun
Since the extensible markup language XML emerged as a new standard for information representation and exchange on the Internet, the problem of storing, indexing, and querying XML documents has been among the major issues of database research. In this paper, we study twig pattern matching and discuss a new algorithm for processing ordered twig pattern queries. The time complexity of the algorithm is bounded by O(|D|·|Q| + |T|·leaf_Q) and its space overhead is bounded by O(leaf_T·leaf_Q), where T stands for a document tree, Q for a twig pattern, and D is the largest data stream associated with a node q of Q, which contains the database nodes that match the node predicate at q. leaf_T (resp. leaf_Q) represents the number of leaf nodes of T (resp. Q). In addition, the algorithm can be adapted to an indexing environment with XB-trees being used.
A hierarchical graph neuron scheme for real-time pattern recognition.
Nasution, B B; Khan, A I
2008-02-01
The hierarchical graph neuron (HGN) implements a single cycle memorization and recall operation through a novel algorithmic design. The HGN is an improvement on the already published original graph neuron (GN) algorithm. In this improved approach, it recognizes incomplete/noisy patterns. It also resolves the crosstalk problem, identified in previous publications, within closely matched patterns. To accomplish this, the HGN links multiple GN networks for filtering noise and crosstalk out of pattern data inputs. Intrinsically, the HGN is a lightweight in-network processing algorithm which does not require expensive floating point computations; hence, it is very suitable for real-time applications and tiny devices such as those in wireless sensor networks. This paper shows that the HGN's pattern matching capability and its small response time remain insensitive to increases in the number of stored patterns. Moreover, the HGN does not require definition of rules or setting of thresholds by the operator to achieve the desired results, nor does it require heuristics entailing iterative operations for memorization and recall of patterns.
Analyzing engagement in a web-based intervention platform through visualizing log-data.
Morrison, Cecily; Doherty, Gavin
2014-11-13
Engagement has emerged as a significant cross-cutting concern within the development of Web-based interventions. There have been calls to institute a more rigorous approach to the design of Web-based interventions, to increase both the quantity and quality of engagement. One approach would be to use log-data to better understand the process of engagement and patterns of use. However, an important challenge lies in organizing log-data for productive analysis. Our aim was to conduct an initial exploration of the use of visualizations of log-data to enhance understanding of engagement with Web-based interventions. We applied exploratory sequential data analysis to highlight sequential aspects of the log data, such as time or module number, to provide insights into engagement. After applying a number of processing steps, a range of visualizations were generated from the log-data. We then examined the usefulness of these visualizations for understanding the engagement of individual users and the engagement of cohorts of users. The visualizations created are illustrated with two datasets drawn from studies using the SilverCloud Platform: (1) a small, detailed dataset with interviews (n=19) and (2) a large dataset (n=326) with 44,838 logged events. We present four exploratory visualizations of user engagement with a Web-based intervention, including Navigation Graph, Stripe Graph, Start-Finish Graph, and Next Action Heat Map. The first represents individual usage and the last three, specific aspects of cohort usage. We provide examples of each with a discussion of salient features. Log-data analysis through data visualization is an alternative way of exploring user engagement with Web-based interventions, which can yield different insights than more commonly used summative measures. We describe how understanding the process of engagement through visualizations can support the development and evaluation of Web-based interventions. 
Specifically, we show how visualizations can (1) allow inspection of content or feature usage in a temporal relationship to the overall program at different levels of granularity, (2) detect different patterns of use to consider personalization in the design process, (3) detect usability issues, (4) enable exploratory analysis to support the design of statistical queries to summarize the data, (5) provide new opportunities for real-time evaluation, and (6) examine assumptions about interactivity that underlie many summative measures in this field.
Analyzing Engagement in a Web-Based Intervention Platform Through Visualizing Log-Data
2014-01-01
Background Engagement has emerged as a significant cross-cutting concern within the development of Web-based interventions. There have been calls to institute a more rigorous approach to the design of Web-based interventions, to increase both the quantity and quality of engagement. One approach would be to use log-data to better understand the process of engagement and patterns of use. However, an important challenge lies in organizing log-data for productive analysis. Objective Our aim was to conduct an initial exploration of the use of visualizations of log-data to enhance understanding of engagement with Web-based interventions. Methods We applied exploratory sequential data analysis to highlight sequential aspects of the log data, such as time or module number, to provide insights into engagement. After applying a number of processing steps, a range of visualizations were generated from the log-data. We then examined the usefulness of these visualizations for understanding the engagement of individual users and the engagement of cohorts of users. The visualizations created are illustrated with two datasets drawn from studies using the SilverCloud Platform: (1) a small, detailed dataset with interviews (n=19) and (2) a large dataset (n=326) with 44,838 logged events. Results We present four exploratory visualizations of user engagement with a Web-based intervention, including Navigation Graph, Stripe Graph, Start–Finish Graph, and Next Action Heat Map. The first represents individual usage and the last three, specific aspects of cohort usage. We provide examples of each with a discussion of salient features. Conclusions Log-data analysis through data visualization is an alternative way of exploring user engagement with Web-based interventions, which can yield different insights than more commonly used summative measures. 
We describe how understanding the process of engagement through visualizations can support the development and evaluation of Web-based interventions. Specifically, we show how visualizations can (1) allow inspection of content or feature usage in a temporal relationship to the overall program at different levels of granularity, (2) detect different patterns of use to consider personalization in the design process, (3) detect usability issues, (4) enable exploratory analysis to support the design of statistical queries to summarize the data, (5) provide new opportunities for real-time evaluation, and (6) examine assumptions about interactivity that underlie many summative measures in this field. PMID:25406097
Exact and approximate graph matching using random walks.
Gori, Marco; Maggini, Marco; Sarti, Lorenzo
2005-07-01
In this paper, we propose a general framework for graph matching which is suitable for different problems of pattern recognition. The pattern representation we assume is at the same time highly structured, as in classic syntactic and structural approaches, and of subsymbolic nature with real-valued features, as in connectionist and statistical approaches. We show that random walk based models, inspired by Google's PageRank, give rise to a spectral theory that nicely enhances the graph topological features at node level. As a straightforward consequence, we derive a polynomial algorithm for the classic graph isomorphism problem, under the restriction of dealing with Markovian spectrally distinguishable graphs (MSD), a class of graphs that does not seem to be easily reducible to others proposed in the literature. The experimental results that we found on different test-beds of the TC-15 graph database show that the defined MSD class "almost always" covers the database, and that the proposed algorithm is significantly more efficient than the top-scoring VF algorithm on the same data. Most interestingly, the proposed approach is very well-suited for dealing with partial and approximate graph matching problems, derived for instance from image retrieval tasks. We consider the objects of the COIL-100 visual collection and provide a graph-based representation, whose node labels contain appropriate visual features. We show that the adoption of classic bipartite graph matching algorithms offers a straightforward generalization of the algorithm given for graph isomorphism and, finally, we report very promising experimental results on the COIL-100 visual collection.
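To make the random-walk idea concrete, the sketch below computes PageRank scores by power iteration and uses the sorted score vector as a relabeling-invariant graph signature. It is only an illustration of why such scores help with matching, not the MSD algorithm from the paper; the function names, the damping factor of 0.85, and the iteration count are assumptions. Equal signatures are a necessary condition for isomorphism, not a sufficient one.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on a directed graph given as
    {node: [successor, ...]}; every node must appear as a key."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank


def signature(adj):
    """Sorted PageRank scores: identical (up to rounding) for isomorphic graphs."""
    return sorted(pagerank(adj).values())


def same_signature(adj_a, adj_b, tol=1e-9):
    sa, sb = signature(adj_a), signature(adj_b)
    return len(sa) == len(sb) and all(abs(x - y) <= tol for x, y in zip(sa, sb))
```

A candidate node correspondence can then be read off by pairing nodes with equal scores, which is unambiguous only when all scores are distinct.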
TreeNetViz: revealing patterns of networks over tree structures.
Gou, Liang; Zhang, Xiaolong Luke
2011-12-01
Network data often contain important attributes from various dimensions such as social affiliations and areas of expertise in a social network. If such attributes exhibit a tree structure, visualizing a compound graph consisting of tree and network structures becomes complicated. How to visually reveal patterns of a network over a tree has not been fully studied. In this paper, we propose a compound graph model, TreeNet, to support visualization and analysis of a network at multiple levels of aggregation over a tree. We also present a visualization design, TreeNetViz, to offer the multiscale and cross-scale exploration and interaction of a TreeNet graph. TreeNetViz uses a Radial, Space-Filling (RSF) visualization to represent the tree structure, a circle layout with novel optimization to show aggregated networks derived from TreeNet, and an edge bundling technique to reduce visual complexity. Our circular layout algorithm reduces both total edge-crossings and edge length and also considers hierarchical structure constraints and edge weight in a TreeNet graph. These experiments illustrate that the algorithm can reduce visual cluttering in TreeNet graphs. Our case study also shows that TreeNetViz has the potential to support the analysis of a compound graph by revealing multiscale and cross-scale network patterns. © 2011 IEEE
A study on PubMed search tag usage pattern: association rule mining of a full-day PubMed query log.
Mosa, Abu Saleh Mohammad; Yoo, Illhoi
2013-01-09
The practice of evidence-based medicine requires efficient biomedical literature search such as PubMed/MEDLINE. Retrieval performance relies highly on the efficient use of search field tags. The purpose of this study was to analyze PubMed log data in order to understand the usage pattern of search tags by the end user in PubMed/MEDLINE search. A PubMed query log file was obtained from the National Library of Medicine containing anonymous user identification, timestamp, and query text. Inconsistent records were removed from the dataset and the search tags were extracted from the query texts. A total of 2,917,159 queries, issued by a total of 613,061 users, were selected for this study. The analysis of frequent co-occurrences and usage patterns of the search tags was conducted using an association mining algorithm. The percentage of search tag usage was low (11.38% of the total queries) and only 2.95% of queries contained two or more tags. Three out of four users used no search tag and about two-thirds of them issued fewer than four queries. Among the queries containing at least one tagged search term, the average number of search tags was almost half of the number of total search terms. Navigational search tags are more frequently used than informational search tags. While no strong association was observed between informational and navigational tags, six (out of 19) informational tags and six (out of 29) navigational tags showed strong associations in PubMed searches. The low percentage of search tag usage implies that PubMed/MEDLINE users do not utilize the features of PubMed/MEDLINE widely, or are not aware of such features, or depend solely on the high-recall-focused query translation performed by PubMed's Automatic Term Mapping. Users need further education and an interactive search application for effective use of the search tags in order to fulfill their biomedical information needs from PubMed/MEDLINE.
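The abstract does not name the association mining algorithm used, so the following is only a minimal sketch of the underlying measures: support and confidence for pairwise rules over the set of tags found in each query. The function name `pair_rules`, the thresholds, and the sample tag sets are illustrative assumptions, not data from the study.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support, min_confidence):
    """Return rules (x, y, support, confidence) meaning "x implies y",
    where each transaction is the collection of tags used in one query."""
    n = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        items = sorted(set(t))            # ignore repeated tags within a query
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n               # fraction of queries containing both tags
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_count[x]   # P(y in query | x in query)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)
```

For example, calling `pair_rules(tag_sets, 0.4, 0.7)` on a list of per-query tag lists keeps only pairs that co-occur in at least 40% of the queries and whose conditional frequency is at least 70%.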
A Study on Pubmed Search Tag Usage Pattern: Association Rule Mining of a Full-day Pubmed Query Log
2013-01-01
Background The practice of evidence-based medicine requires efficient biomedical literature search such as PubMed/MEDLINE. Retrieval performance relies highly on the efficient use of search field tags. The purpose of this study was to analyze PubMed log data in order to understand the usage pattern of search tags by the end user in PubMed/MEDLINE search. Methods A PubMed query log file was obtained from the National Library of Medicine containing anonymous user identification, timestamp, and query text. Inconsistent records were removed from the dataset and the search tags were extracted from the query texts. A total of 2,917,159 queries, issued by a total of 613,061 users, were selected for this study. The analysis of frequent co-occurrences and usage patterns of the search tags was conducted using an association mining algorithm. Results The percentage of search tag usage was low (11.38% of the total queries) and only 2.95% of queries contained two or more tags. Three out of four users used no search tag and about two-thirds of them issued fewer than four queries. Among the queries containing at least one tagged search term, the average number of search tags was almost half of the number of total search terms. Navigational search tags are more frequently used than informational search tags. While no strong association was observed between informational and navigational tags, six (out of 19) informational tags and six (out of 29) navigational tags showed strong associations in PubMed searches. Conclusions The low percentage of search tag usage implies that PubMed/MEDLINE users do not utilize the features of PubMed/MEDLINE widely, or are not aware of such features, or depend solely on the high-recall-focused query translation performed by PubMed’s Automatic Term Mapping. Users need further education and an interactive search application for effective use of the search tags in order to fulfill their biomedical information needs from PubMed/MEDLINE. PMID:23302604
Seasonality in seeking mental health information on Google.
Ayers, John W; Althouse, Benjamin M; Allem, Jon-Patrick; Rosenquist, J Niels; Ford, Daniel E
2013-05-01
Population mental health surveillance is an important challenge limited by resource constraints, long time lags in data collection, and stigma. One promising approach to bridge similar gaps elsewhere has been the use of passively generated digital data. This article assesses the viability of aggregate Internet search queries for real-time monitoring of several mental health problems, specifically in regard to seasonal patterns of seeking out mental health information. All Google mental health queries were monitored in the U.S. and Australia from 2006 to 2010. Additionally, queries were subdivided among those including the terms ADHD (attention deficit-hyperactivity disorder); anxiety; bipolar; depression; anorexia or bulimia (eating disorders); OCD (obsessive-compulsive disorder); schizophrenia; and suicide. A wavelet phase analysis was used to isolate seasonal components in the trends, and based on this model, the mean search volume in winter was compared with that in summer, as performed in 2012. All mental health queries followed seasonal patterns with winter peaks and summer troughs amounting to a 14% (95% CI=11%, 16%) difference in volume for the U.S. and 11% (95% CI=7%, 15%) for Australia. These patterns also were evident for all specific subcategories of illness or problem. For instance, seasonal differences ranged from 7% (95% CI=5%, 10%) for anxiety (followed by OCD, bipolar, depression, suicide, ADHD, schizophrenia) to 37% (95% CI=31%, 44%) for eating disorder queries in the U.S. Several nonclinical motivators for query seasonality (such as media trends or academic interest) were explored and rejected. Information seeking on Google across all major mental illnesses and/or problems followed seasonal patterns similar to those found for seasonal affective disorder. 
These are the first data published on patterns of seasonality in information seeking encompassing all the major mental illnesses, notable also because they likely would have gone undetected using traditional surveillance. Copyright © 2013. Published by Elsevier Inc.
Empirical Determination of Pattern Match Confidence in Labeled Graphs
2014-02-07
were explored; Erdős–Rényi [6] random graphs, Barabási–Albert preferential attachment graphs [2], and Watts–Strogatz [18] small world graphs. The ER... [Figure: graph type (Erdős–Rényi, Barabási–Albert, Watts–Strogatz) against search limit (direct, within 2 nodes, within 4 nodes); axis ticks omitted] ...Barabási–Albert (BA, crosses) and Watts–Strogatz (WS, triangles) graphs were generated with sizes ranging from 50 to 2500 nodes, and labeled
Measuring Two-Event Structural Correlations on Graphs
2012-08-01
...by event simulation on the DBLP graph. Then we examine the efficiency and scalability of the framework with a Twitter network. The third part of...correlation pattern mining for large graphs. In Proc. of the 8th Workshop on Mining and Learning with Graphs, pages 119–126, 2010. [23] T. Smith. A
Ontology based heterogeneous materials database integration and semantic query
NASA Astrophysics Data System (ADS)
Zhao, Shuai; Qian, Quan
2017-10-01
Materials digital data, high throughput experiments and high throughput computations are regarded as the three key pillars of materials genome initiatives. With the fast growth of materials data, the integration and sharing of data has become urgent and has gradually become a hot topic in materials informatics. Due to the lack of semantic description, it is difficult to integrate data deeply at the semantic level when adopting conventional heterogeneous database integration approaches such as federated databases or data warehouses. In this paper, a semantic integration method is proposed that creates the semantic ontology by extracting the database schema semi-automatically. Other heterogeneous databases are integrated into the ontology by means of relational algebra and the rooted graph. Based on the integrated ontology, semantic queries can be performed using SPARQL. In the experiments, two well-known first-principles computational databases, OQMD and Materials Project, are used as the integration targets, demonstrating the availability and effectiveness of our method.
Metric learning with spectral graph convolutions on brain connectivity networks.
Ktena, Sofia Ira; Parisot, Sarah; Ferrante, Enzo; Rajchl, Martin; Lee, Matthew; Glocker, Ben; Rueckert, Daniel
2018-04-01
Graph representations are often used to model structured data at an individual or population level and have numerous applications in pattern recognition problems. In the field of neuroscience, where such representations are commonly used to model structural or functional connectivity between a set of brain regions, graphs have proven to be of great importance. This is mainly due to the capability of revealing patterns related to brain development and disease, which were previously unknown. Evaluating similarity between these brain connectivity networks in a manner that accounts for the graph structure and is tailored for a particular application is, however, non-trivial. Most existing methods fail to accommodate the graph structure, discarding information that could be beneficial for further classification or regression analyses based on these similarities. We propose to learn a graph similarity metric using a siamese graph convolutional neural network (s-GCN) in a supervised setting. The proposed framework takes into consideration the graph structure for the evaluation of similarity between a pair of graphs, by employing spectral graph convolutions that allow the generalisation of traditional convolutions to irregular graphs and operates in the graph spectral domain. We apply the proposed model on two datasets: the challenging ABIDE database, which comprises functional MRI data of 403 patients with autism spectrum disorder (ASD) and 468 healthy controls aggregated from multiple acquisition sites, and a set of 2500 subjects from UK Biobank. We demonstrate the performance of the method for the tasks of classification between matching and non-matching graphs, as well as individual subject classification and manifold learning, showing that it leads to significantly improved results compared to traditional methods. Copyright © 2017 Elsevier Inc. All rights reserved.
Physical Samples Linked Data in Action
NASA Astrophysics Data System (ADS)
Ji, P.; Arko, R. A.; Lehnert, K.; Bristol, S.
2017-12-01
Most data and metadata related to physical samples currently reside in isolated relational databases driven by diverse data models. The challenge of sharing, interchanging and integrating data from these different relational databases motivated us to publish Linked Open Data for collections of physical samples, using Semantic Web technologies including the Resource Description Framework (RDF), RDF Query Language (SPARQL), and Web Ontology Language (OWL). In the last few years, we have released four knowledge graphs concentrated on physical samples, including System for Earth Sample Registration (SESAR), USGS National Geochemical Database (NGDC), Ocean Biogeographic Information System (OBIS), and Earthchem Database. Currently the four knowledge graphs contain over 12 million facets (triples) about objects of interest to the geoscience domain. Choosing appropriate domain ontologies for representing the context of the data is the core of the whole work. The Geolink ontology developed by the Earthcube Geolink project was used as the top level to represent common concepts like person, organization, cruise, etc. The physical sample ontology developed by the Interdisciplinary Earth Data Alliance (IEDA) and the Darwin Core vocabulary were used as the second level to describe details about geological samples and biological diversity. We also focused on finding and building the best tool chains to support the whole life cycle of publishing our linked data, including information retrieval, linked data browsing and data visualization. Currently, Morph, Virtuoso Server, LodView, LodLive, and YASGUI are employed for converting, storing, representing, and querying data in a knowledge base (RDF triplestore). Persistent digital identifiers are another main point we concentrated on. 
Open Researcher & Contributor IDs (ORCIDs), International Geo Sample Numbers (IGSNs), Global Research Identifier Database (GRID) and other persistent identifiers were used to link different resources from various graphs with person, sample, organization, cruise, etc. This work is supported by the EarthCube "GeoLink" project (NSF# ICER14-40221 and others) and the "USGS-IEDA Partnership to Support a Data Lifecycle Framework and Tools" project (USGS# G13AC00381).
Seo, Dong-Woo; Sohn, Chang Hwan; Kim, Sung-Hoon; Ryoo, Seung Mok; Lee, Yoon-Seon; Lee, Jae Ho; Kim, Won Young; Lim, Kyoung Soo
2016-01-01
Background Digital surveillance using internet search queries can improve both the sensitivity and timeliness of the detection of a health event, such as an influenza outbreak. While it has recently been estimated that the mobile search volume surpasses the desktop search volume and mobile search patterns differ from desktop search patterns, the previous digital surveillance systems did not distinguish mobile and desktop search queries. The purpose of this study was to compare the performance of mobile and desktop search queries in terms of digital influenza surveillance. Methods and Results The study period was from September 6, 2010 through August 30, 2014, which consisted of four epidemiological years. Influenza-like illness (ILI) and virologic surveillance data from the Korea Centers for Disease Control and Prevention were used. A total of 210 combined queries from our previous survey work were used for this study. Mobile and desktop weekly search data were extracted from Naver, which is the largest search engine in Korea. Spearman’s correlation analysis was used to examine the correlation of the mobile and desktop data with ILI and virologic data in Korea. We also performed lag correlation analysis. We observed that the influenza surveillance performance of mobile search queries matched or exceeded that of desktop search queries over time. The mean correlation coefficients of mobile search queries and the number of queries with an r-value of ≥ 0.7 equaled or became greater than those of desktop searches over the four epidemiological years. A lag correlation analysis of up to two weeks showed similar trends. Conclusion Our study shows that mobile search queries for influenza surveillance have equaled or even become greater than desktop search queries over time. In the future development of influenza surveillance using search queries, the recognition of changing trend of mobile search data could be necessary. PMID:27391028
Shin, Soo-Yong; Kim, Taerim; Seo, Dong-Woo; Sohn, Chang Hwan; Kim, Sung-Hoon; Ryoo, Seung Mok; Lee, Yoon-Seon; Lee, Jae Ho; Kim, Won Young; Lim, Kyoung Soo
2016-01-01
Digital surveillance using internet search queries can improve both the sensitivity and timeliness of the detection of a health event, such as an influenza outbreak. While it has recently been estimated that the mobile search volume surpasses the desktop search volume and mobile search patterns differ from desktop search patterns, the previous digital surveillance systems did not distinguish mobile and desktop search queries. The purpose of this study was to compare the performance of mobile and desktop search queries in terms of digital influenza surveillance. The study period was from September 6, 2010 through August 30, 2014, which consisted of four epidemiological years. Influenza-like illness (ILI) and virologic surveillance data from the Korea Centers for Disease Control and Prevention were used. A total of 210 combined queries from our previous survey work were used for this study. Mobile and desktop weekly search data were extracted from Naver, which is the largest search engine in Korea. Spearman's correlation analysis was used to examine the correlation of the mobile and desktop data with ILI and virologic data in Korea. We also performed lag correlation analysis. We observed that the influenza surveillance performance of mobile search queries matched or exceeded that of desktop search queries over time. The mean correlation coefficients of mobile search queries and the number of queries with an r-value of ≥ 0.7 equaled or became greater than those of desktop searches over the four epidemiological years. A lag correlation analysis of up to two weeks showed similar trends. Our study shows that mobile search queries for influenza surveillance have equaled or even become greater than desktop search queries over time. In the future development of influenza surveillance using search queries, the recognition of changing trend of mobile search data could be necessary.
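The lag correlation analysis mentioned above can be sketched in a few lines. The version below implements Spearman's rank correlation from scratch (average ranks for ties, then Pearson correlation of the ranks) and shifts the search series so that it leads the surveillance series by a given number of weeks. The function names and the direction of the lag are assumptions; the abstract does not specify how the lag was applied.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def lagged_spearman(search, ili, lag):
    """Correlate search volume in week t with ILI in week t + lag (lag >= 0)."""
    if lag > 0:
        search, ili = search[:-lag], ili[lag:]
    return spearman(search, ili)
```

Running `lagged_spearman` for lags of 0, 1 and 2 weeks on each query's weekly series would give the per-query coefficients that the study then compares between mobile and desktop data.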
Unsupervised Spatial Event Detection in Targeted Domains with Applications to Civil Unrest Modeling
Zhao, Liang; Chen, Feng; Dai, Jing; Hua, Ting; Lu, Chang-Tien; Ramakrishnan, Naren
2014-01-01
Twitter has become a popular data source as a surrogate for monitoring and detecting events. Targeted domains such as crime, election, and social unrest require the creation of algorithms capable of detecting events pertinent to these domains. Due to the unstructured language, short-length messages, dynamics, and heterogeneity typical of Twitter data streams, it is technically difficult and labor-intensive to develop and maintain supervised learning systems. We present a novel unsupervised approach for detecting spatial events in targeted domains and illustrate this approach using one specific domain, viz. civil unrest modeling. Given a targeted domain, we propose a dynamic query expansion algorithm to iteratively expand domain-related terms, and generate a tweet homogeneous graph. An anomaly identification method is utilized to detect spatial events over this graph by jointly maximizing local modularity and spatial scan statistics. Extensive experiments conducted in 10 Latin American countries demonstrate the effectiveness of the proposed approach. PMID:25350136
Monitoring Moving Queries inside a Safe Region
Al-Khalidi, Haidar; Taniar, David; Alamri, Sultan
2014-01-01
With mobile moving range queries, there is a need to recalculate the relevant surrounding objects of interest whenever the query moves. Therefore, monitoring the moving query is very costly. The safe region is one method that has been proposed to minimise the communication and computation cost of continuously monitoring a moving range query. Inside the safe region the set of objects of interest to the query does not change; thus there is no need to update the query while it is inside its safe region. However, when the query leaves its safe region the mobile device has to reevaluate the query, necessitating communication with the server. Knowing when and where the mobile device will leave a safe region is widely known as a difficult problem. To solve this problem, we propose a novel method to monitor the position of the query over time using a linear function based on the direction of the query obtained by periodic monitoring of its position. Periodic monitoring ensures that the query is aware of its location all the time. This method reduces the costs associated with communications in a client-server architecture. Computational results show that our method is successful in handling moving query patterns. PMID:24696652
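As a rough illustration of predicting the exit point with a linear function, the sketch below estimates a velocity from two periodic position fixes and solves for the time at which straight-line motion crosses the boundary of a circular safe region. The circular shape, the constant-velocity assumption, and both function names are assumptions made for the example; the abstract does not state the shape of the safe region.

```python
import math


def estimate_velocity(p0, t0, p1, t1):
    """Velocity vector from two timestamped position fixes (t1 > t0)."""
    dt = t1 - t0
    return ((p1[0] - p0[0]) / dt, (p1[1] - p0[1]) / dt)


def time_to_leave_circle(pos, vel, center, radius):
    """Time until a point moving linearly from pos with velocity vel leaves
    the circle (center, radius). Returns 0.0 if already outside and
    math.inf if the point is not moving."""
    px, py = pos[0] - center[0], pos[1] - center[1]
    vx, vy = vel
    a = vx * vx + vy * vy
    if a == 0:
        return math.inf
    b = 2 * (px * vx + py * vy)
    c = px * px + py * py - radius * radius
    if c > 0:
        return 0.0
    # Inside the circle c <= 0, so the discriminant is non-negative and the
    # larger root of a*t^2 + b*t + c = 0 is the exit time.
    disc = b * b - 4 * a * c
    return (-b + math.sqrt(disc)) / (2 * a)
```

The client could schedule its next server contact at the predicted exit time and skip updates until then, re-estimating the velocity at each periodic fix.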
Dynamic Visualization of Co-expression in Systems Genetics Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
New, Joshua Ryan; Huang, Jian; Chesler, Elissa J
2008-01-01
Biologists hope to address grand scientific challenges by exploring the abundance of data made available through modern microarray technology and other high-throughput techniques. The impact of this data, however, is limited unless researchers can effectively assimilate such complex information and integrate it into their daily research; interactive visualization tools are called for to support the effort. Specifically, typical studies of gene co-expression require novel visualization tools that enable the dynamic formulation and fine-tuning of hypotheses to aid the process of evaluating sensitivity of key parameters. These tools should allow biologists to develop an intuitive understanding of the structure of biological networks and discover genes which reside in critical positions in networks and pathways. By using a graph as a universal data representation of correlation in gene expression data, our novel visualization tool employs several techniques that when used in an integrated manner provide innovative analytical capabilities. Our tool for interacting with gene co-expression data integrates techniques such as: graph layout, qualitative subgraph extraction through a novel 2D user interface, quantitative subgraph extraction using graph-theoretic algorithms or by querying an optimized b-tree, dynamic level-of-detail graph abstraction, and template-based fuzzy classification using neural networks. We demonstrate our system using a real-world workflow from a large-scale, systems genetics study of mammalian gene co-expression.
Expert system validation in prolog
NASA Technical Reports Server (NTRS)
Stock, Todd; Stachowitz, Rolf; Chang, Chin-Liang; Combs, Jacqueline
1988-01-01
This paper gives an overview of the Expert System Validation Assistant (EVA), which is being implemented in Prolog at the Lockheed AI Center. Prolog was chosen to facilitate rapid prototyping of the structure and logic checkers, and since February 1987 we have implemented code to check for irrelevance, subsumption, duplication, deadends, unreachability, and cycles. The architecture chosen is extremely flexible and expansible, yet concise and complementary with the normal interactive style of Prolog. The foundation of the system is the connection graph representation. Rules and facts are modeled as nodes in the graph, and arcs indicate common patterns between rules. The basic activity of the validation system is then a traversal of the connection graph, searching for various patterns the system recognizes as erroneous. To aid in specifying these patterns, a metalanguage is developed, providing the user with the basic facilities required to reason about the expert system. Using the metalanguage, the user can, for example, give the Prolog inference engine the goal of finding inconsistent conclusions among the rules, and Prolog will search the graph for instantiations that match the definition of inconsistency. Examples of code for some of the checkers are provided and the algorithms explained. Technical highlights include automatic construction of a connection graph, demonstration of the use of the metalanguage, the A* algorithm modified to detect all unique cycles, general-purpose stacks in Prolog, and a general-purpose database browser with pattern completion.
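The cycle check over a connection graph can be illustrated with a short sketch. This is not EVA's Prolog code and not the modified A* algorithm the abstract mentions; it swaps in a plain depth-first search, and the rule names and the `find_cycle` helper are hypothetical.

```python
def find_cycle(graph):
    """Return one directed cycle as a list of nodes (first node repeated at the end), or None.

    `graph` maps each rule name to the rules it has an arc to.
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:                       # back edge closes a cycle
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None

# Hypothetical rule base: r1 -> r2 -> r3 -> r1 forms a cycle.
rules = {"r1": ["r2"], "r2": ["r3"], "r3": ["r1"], "r4": ["r1"]}
cycle = find_cycle(rules)
```

A recursive search like this reports only the first cycle it meets; enumerating all unique cycles, as the abstract describes, would need additional bookkeeping.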
Searching social networks for subgraph patterns
NASA Astrophysics Data System (ADS)
Ogaard, Kirk; Kase, Sue; Roy, Heather; Nagi, Rakesh; Sambhoos, Kedar; Sudit, Moises
2013-06-01
Software tools for Social Network Analysis (SNA) are being developed which support various types of analysis of social networks extracted from social media websites (e.g., Twitter). Once extracted and stored in a database such social networks are amenable to analysis by SNA software. This data analysis often involves searching for occurrences of various subgraph patterns (i.e., graphical representations of entities and relationships). The authors have developed the Graph Matching Toolkit (GMT) which provides an intuitive Graphical User Interface (GUI) for a heuristic graph matching algorithm called the Truncated Search Tree (TruST) algorithm. GMT is a visual interface for graph matching algorithms processing large social networks. GMT enables an analyst to draw a subgraph pattern by using a mouse to select categories and labels for nodes and links from drop-down menus. GMT then executes the TruST algorithm to find the top five occurrences of the subgraph pattern within the social network stored in the database. GMT was tested using a simulated counter-insurgency dataset consisting of cellular phone communications within a populated area of operations in Iraq. The results indicated GMT (when executing the TruST graph matching algorithm) is a time-efficient approach to searching large social networks. GMT's visual interface to a graph matching algorithm enables intelligence analysts to quickly analyze and summarize the large amounts of data necessary to produce actionable intelligence.
Nadkarni, P M
1997-08-01
Concept Locator (CL) is a client-server application that accesses a Sybase relational database server containing a subset of the UMLS Metathesaurus for the purpose of retrieval of concepts corresponding to one or more query expressions supplied to it. CL's query grammar permits complex Boolean expressions, wildcard patterns, and parenthesized (nested) subexpressions. CL translates the query expressions supplied to it into one or more SQL statements that actually perform the retrieval. The generated SQL is optimized by the client to take advantage of the strengths of the server's query optimizer, and sidesteps its weaknesses, so that execution is reasonably efficient.
GrouseFlocks: steerable exploration of graph hierarchy space.
Archambault, Daniel; Munzner, Tamara; Auber, David
2008-01-01
Several previous systems allow users to interactively explore a large input graph through cuts of a superimposed hierarchy. This hierarchy is often created using clustering algorithms or topological features present in the graph. However, many graphs have domain-specific attributes associated with the nodes and edges, which could be used to create many possible hierarchies providing unique views of the input graph. GrouseFlocks is a system for the exploration of this graph hierarchy space. By allowing users to see several different possible hierarchies on the same graph, the system helps users investigate graph hierarchy space instead of a single fixed hierarchy. GrouseFlocks provides a simple set of operations so that users can create and modify their graph hierarchies based on selections. These selections can be made manually or based on patterns in the attribute data provided with the graph. It provides feedback to the user within seconds, allowing interactive exploration of this space.
Segmentation of touching handwritten Japanese characters using the graph theory method
NASA Astrophysics Data System (ADS)
Suwa, Misako
2000-12-01
Projection analysis methods have been widely used to segment Japanese character strings. However, if adjacent characters have overhanging strokes or a touching point doesn't correspond to the histogram minimum, the methods are prone to errors. In contrast, non-projection analysis methods proposed for use on numerals or alphabet characters cannot simply be applied to Japanese characters because of the differences in the structure of the characters. Based on the oversegmenting strategy, a new pre-segmentation method is presented in this paper: touching patterns are represented as graphs and touching strokes are regarded as the elements of proper edge cutsets. By using the graph theoretical technique, the cutset matrix is calculated. Then, by applying pruning rules, potential touching strokes are determined and the patterns are oversegmented. Moreover, this algorithm was confirmed to be valid for touching patterns with overhanging strokes and doubly connected patterns in simulations.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grossman, Max; Pritchard Jr., Howard Porter; Budimlic, Zoran
2016-12-22
Graph500 [14] is an effort to offer a standardized benchmark across large-scale distributed platforms which captures the behavior of common communication-bound graph algorithms. Graph500 differs from other large-scale benchmarking efforts (such as HPL [6] or HPGMG [7]) primarily in the irregularity of its computation and data access patterns. The core computational kernel of Graph500 is a breadth-first search (BFS) implemented on an undirected graph. The output of Graph500 is a spanning tree of the input graph, usually represented by a predecessor mapping for every node in the graph. The Graph500 benchmark defines several pre-defined input sizes for implementers to test against. This report summarizes investigation into implementing the Graph500 benchmark on OpenSHMEM, and focuses on first building a strong and practical understanding of the strengths and limitations of past work before proposing and developing novel extensions.
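The BFS kernel and its predecessor-map output can be sketched in a few lines. This is a single-process illustration, not the OpenSHMEM implementation the report investigates; the adjacency-dict input and the convention that the root is its own predecessor are assumptions made here.

```python
from collections import deque

def bfs_predecessors(adj, root):
    """Breadth-first search returning a predecessor map (a BFS spanning tree).

    `adj` maps each vertex to its neighbours in an undirected graph.
    The root is recorded as its own predecessor; unreachable vertices are omitted.
    """
    parent = {root: root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:     # first visit fixes the tree edge u -> v
                parent[v] = u
                queue.append(v)
    return parent

# Small undirected graph; vertex 4 is isolated.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2], 4: []}
tree = bfs_predecessors(adj, root=0)
```

The irregular, data-dependent lookups into `adj` and `parent` are what make this kernel communication-bound once the graph is partitioned across nodes.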
A Set of Handwriting Features for Use in Automated Writer Identification.
Miller, John J; Patterson, Robert Bradley; Gantz, Donald T; Saunders, Christopher P; Walch, Mark A; Buscaglia, JoAnn
2017-05-01
A writer's biometric identity can be characterized through the distribution of physical feature measurements ("writer's profile"); a graph-based system that facilitates the quantification of these features is described. To accomplish this quantification, handwriting is segmented into basic graphical forms ("graphemes"), which are "skeletonized" to yield the graphical topology of the handwritten segment. The graph-based matching algorithm compares the graphemes first by their graphical topology and then by their geometric features. Graphs derived from known writers can be compared against graphs extracted from unknown writings. The process is computationally intensive and relies heavily upon statistical pattern recognition algorithms. This article focuses on the quantification of these physical features and the construction of the associated pattern recognition methods for using the features to discriminate among writers. The graph-based system described in this article has been implemented in a highly accurate and approximately language-independent biometric recognition system of writers of cursive documents. © 2017 American Academy of Forensic Sciences.
Artificial Neural Networks for Processing Graphs with Application to Image Understanding: A Survey
NASA Astrophysics Data System (ADS)
Bianchini, Monica; Scarselli, Franco
In graphical pattern recognition, each datum is represented as an arrangement of elements that encodes both the properties of each element and the relations among them. Hence, patterns are modelled as labelled graphs where, in general, labels can be attached to both nodes and edges. Artificial neural networks able to process graphs are a powerful tool for addressing a great variety of real-world problems, where the information is naturally organized in entities and relationships among entities and, in fact, they have been widely used in computer vision, e.g., in logo recognition, in similarity retrieval, and for object detection. In this chapter, we propose a survey of neural network models able to process structured information, with a particular focus on those architectures tailored to address image understanding applications. Starting from the original recursive model (RNNs), we subsequently present different ways to represent images - by trees, forests of trees, multiresolution trees, directed acyclic graphs with labelled edges, general graphs - and, correspondingly, neural network architectures appropriate to process such structures.
NASA Astrophysics Data System (ADS)
Lee, Kyu J.; Kunii, T. L.; Noma, T.
1993-01-01
In this paper, we propose a syntactic pattern recognition method for non-schematic drawings, based on a new attributed graph grammar with flexible embedding. In our graph grammar, the embedding rule permits the nodes of a guest graph to be arbitrarily connected with the nodes of a host graph. The ambiguity caused by this flexible embedding is controlled with the evaluation of synthesized attributes and the check of context sensitivity. To integrate parsing with the synthesized attribute evaluation and the context sensitivity check, we also develop a bottom up parsing algorithm.
A linked GeoData map for enabling information access
Powell, Logan J.; Varanka, Dalia E.
2018-01-10
Overview: The Geospatial Semantic Web (GSW) is an emerging technology that uses the Internet for more effective knowledge engineering and information extraction. Among the aims of the GSW are to structure the semantic specifications of data to reduce ambiguity and to link those data more efficiently. The data are stored as triples, the basic data unit in graph databases, which are similar to the vector data model of geographic information systems (GIS); that is, a node-edge-node model that forms a graph of semantically related information. The GSW is supported by emerging technologies such as linked geospatial data, described below, that enable it to store and manage geographical data that require new cartographic methods for visualization. This report describes a map that can interact with linked geospatial data using a simulation of a data query approach called the browsable graph to find information that is semantically related to a subject of interest, visualized using the Data Driven Documents (D3) library. Such a semantically enabled map functions as a map knowledge base (MKB) (Varanka and Usery, 2017). An MKB differs from a database in an important way. The central element of a triple, alternatively called the edge or property, is composed of a logic formalization that structures the relation between the first and third parts, the nodes or objects. Node-edge-node represents the graphic form of the triple, and the subject-property-object terms represent the data structure. Object classes connect to build a federated graph, similar to a network in visual form. Because the triple property is a logical statement (a predicate), the data graph represents logical propositions or assertions accepted to be true about the subject matter. 
These logical formalizations can be manipulated to calculate new triples, representing inferred logical assertions, from the existing data. To demonstrate an MKB system, a technical proof-of-concept is developed that uses geographically attributed Resource Description Framework (RDF) serializations of linked data for mapping. The proof-of-concept focuses on accessing triple data from visual elements of a geographic map as the interface to the MKB. The map interface is embedded with other essential functions such as SPARQL Protocol and RDF Query Language (SPARQL) data query endpoint services and reasoning capabilities of Apache Marmotta (Apache Software Foundation, 2017). An RDF database of the Geographic Names Information System (GNIS), which contains official names of domestic features in the United States, was linked to a county data layer from The National Map of the U.S. Geological Survey. The county data are part of a broader Government Units theme offered to the public as Esri shapefiles. The shapefile used to draw the map itself was converted to a geographic-oriented JavaScript Object Notation (JSON) (GeoJSON) format and linked through various properties with a linked geodata version of the GNIS database called “GNIS–LD” (Butler and others, 2016; B. Regalia and others, University of California-Santa Barbara, written commun., 2017). The GNIS–LD files originated in Terse RDF Triple Language (Turtle) format but were converted to a JSON format specialized in linked data, “JSON–LD” (Beckett and Berners-Lee, 2011; Sporny and others, 2014). The GNIS–LD database is composed of roughly three predominant triple data graphs: Features, Names, and History. The graphs include a set of namespace prefixes used by each of the attributes. 
Predefining the prefixes made the conversion to the JSON–LD format simple to complete because Turtle and JSON–LD are variant specifications of the basic RDF concept. To convert a shapefile into GeoJSON format to capture the geospatial coordinate geometry objects, an online converter, Mapshaper, was used (Bloch, 2013). To convert the Turtle files, a custom converter written in Java reconstructs the files by parsing each grouping of attributes belonging to one subject and pasting the data into a new file that follows the syntax of JSON–LD. Additionally, the Features file contained its own set of geometries, which was exported into a separate JSON–LD file along with its elevation value to form a fourth file, named “features-geo.json.” Extracted data from external files can be represented in HyperText Markup Language (HTML) path objects. The goal was to import multiple JSON–LD files using this approach.
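The per-subject grouping step described above can be sketched as follows. This is not the report's Java converter: it assumes the triples have already been parsed into tuples, treats every object as a plain literal, and uses made-up prefixes and identifiers (`gnis:`, `ex:`) purely for illustration.

```python
import json

def triples_to_jsonld(triples, context):
    """Group (subject, predicate, object) triples by subject into a JSON-LD-style document.

    Simplification: objects are emitted as plain values; a fuller converter would
    wrap IRI objects as {"@id": ...} and carry datatypes and language tags.
    """
    nodes = {}
    for subject, predicate, obj in triples:
        node = nodes.setdefault(subject, {"@id": subject})
        node.setdefault(predicate, []).append(obj)
    return {"@context": context, "@graph": list(nodes.values())}

# Hypothetical prefixes and data, not taken from GNIS-LD.
context = {"gnis": "http://example.org/gnis/", "ex": "http://example.org/vocab/"}
triples = [
    ("gnis:1", "ex:name", "Example Creek"),
    ("gnis:1", "ex:elevation", 1200),
    ("gnis:2", "ex:name", "Example Peak"),
]
doc = triples_to_jsonld(triples, context)
text = json.dumps(doc, indent=2)
```

Declaring the prefixes once in `@context` mirrors the point made in the abstract: with prefixes predefined, the conversion reduces to regrouping the same statements under a different syntax.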
ERIC Educational Resources Information Center
Mueller, Derek
2012-01-01
Presented as a series of graphs, bibliographic data gathered from "College Composition and Communication" provides perspective useful for inquiring into the changing shape of the field as it continues to mature. In its focus on graphing, the article demonstrates an application of distant reading methods to present patterns not only reflective of…
SAFE: SPARQL Federation over RDF Data Cubes with Access Control.
Khan, Yasar; Saleem, Muhammad; Mehdi, Muntazir; Hogan, Aidan; Mehmood, Qaiser; Rebholz-Schuhmann, Dietrich; Sahay, Ratnesh
2017-02-01
Several query federation engines have been proposed for accessing public Linked Open Data sources. However, in many domains, resources are sensitive and access to these resources is tightly controlled by stakeholders; consequently, privacy is a major concern when federating queries over such datasets. In the Healthcare and Life Sciences (HCLS) domain, real-world datasets contain sensitive statistical information: strict ownership is granted to individuals working in hospitals, research labs, clinical trial organisers, etc. Therefore, the legal and ethical concerns of (i) preserving the anonymity of patients (or clinical subjects) and (ii) respecting data ownership through access control are key challenges faced by the data analytics community working within the HCLS domain. Likewise, statistical data play a key role in the domain, where the RDF Data Cube Vocabulary has been proposed as a standard format to enable the exchange of such data. However, to the best of our knowledge, no existing approach has looked to optimise federated queries over such statistical data. We present SAFE: a query federation engine that enables policy-aware access to sensitive statistical datasets represented as RDF data cubes. SAFE is designed specifically to query statistical RDF data cubes in a distributed setting, where access control is coupled with source selection, user profiles and their access rights. SAFE proposes a join-aware source selection method that avoids wasteful requests to irrelevant and unauthorised data sources. In order to preserve anonymity and enforce stricter access control, SAFE's indexing system does not hold any data instances; it stores only predicates and endpoints. The resulting data summary has a significantly lower index generation time and size compared to existing engines, which allows for faster updates when sources change. 
We validate the performance of the system with experiments over real-world datasets provided by three clinical organisations as well as legacy linked datasets. We show that SAFE enables granular graph-level access control over distributed clinical RDF data cubes and efficiently reduces the source selection and overall query execution time when compared with general-purpose SPARQL query federation engines in the targeted setting.
GRADIENT: Graph Analytic Approach for Discovering Irregular Events, Nascent and Temporal
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hogan, Emilie
2015-03-31
Finding a time-ordered signature within large graphs is a computationally complex problem due to the combinatorial explosion of potential patterns. GRADIENT is designed to search and understand that problem space.
GRADIENT: Graph Analytic Approach for Discovering Irregular Events, Nascent and Temporal
Hogan, Emilie
2018-01-16
Finding a time-ordered signature within large graphs is a computationally complex problem due to the combinatorial explosion of potential patterns. GRADIENT is designed to search and understand that problem space.
Preliminary Results on Uncertainty Quantification for Pattern Analytics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Stracuzzi, David John; Brost, Randolph; Chen, Maximillian Gene
2015-09-01
This report summarizes preliminary research into uncertainty quantification for pattern analytics within the context of the Pattern Analytics to Support High-Performance Exploitation and Reasoning (PANTHER) project. The primary focus of PANTHER was to make large quantities of remote sensing data searchable by analysts. The work described in this report adds nuance to both the initial data preparation steps and the search process. Search queries are transformed from "does the specified pattern exist in the data?" to "how certain is the system that the returned results match the query?" We show example results for both data processing and search, and discuss a number of possible improvements for each.
Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples
Wilks, Christopher; Gaddipati, Phani; Nellore, Abhinav
2018-01-01
Motivation: As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. These enable researchers to leverage vast datasets that would otherwise be difficult to obtain. Results: Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70 000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can score junctions according to tissue specificity or other criteria, and can score samples according to the relative frequency of different splicing patterns. We describe the software and outline biological questions that can be explored with Snaptron queries. Availability and implementation: Documentation is at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron and https://github.com/ChristopherWilks/snaptron-experiments with a CC BY-NC 4.0 license. Contact: chris.wilks@jhu.edu or langmea@cs.jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28968689
Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples.
Wilks, Christopher; Gaddipati, Phani; Nellore, Abhinav; Langmead, Ben
2018-01-01
As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. These enable researchers to leverage vast datasets that would otherwise be difficult to obtain. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70 000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can score junctions according to tissue specificity or other criteria, and can score samples according to the relative frequency of different splicing patterns. We describe the software and outline biological questions that can be explored with Snaptron queries. Documentation is at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron and https://github.com/ChristopherWilks/snaptron-experiments with a CC BY-NC 4.0 license. Contact: chris.wilks@jhu.edu or langmea@cs.jhu.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
Design and Implementation of a Metadata-rich File System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ames, S; Gokhale, M B; Maltzahn, C
2010-01-19
Despite continual improvements in the performance and reliability of large scale file systems, the management of user-defined file system metadata has changed little in the past decade. The mismatch between the size and complexity of large scale data stores and their ability to organize and query their metadata has led to a de facto standard in which raw data is stored in traditional file systems, while related, application-specific metadata is stored in relational databases. This separation of data and semantic metadata requires considerable effort to maintain consistency and can result in complex, slow, and inflexible system operation. To address these problems, we have developed the Quasar File System (QFS), a metadata-rich file system in which files, user-defined attributes, and file relationships are all first class objects. In contrast to hierarchical file systems and relational databases, QFS defines a graph data model composed of files and their relationships. QFS incorporates Quasar, an XPATH-extended query language for searching the file system. Results from our QFS prototype show the effectiveness of this approach. Compared to the de facto standard, the QFS prototype shows superior ingest performance and comparable query performance on user metadata-intensive operations and superior performance on normal file metadata operations.
Query-Adaptive Hash Code Ranking for Large-Scale Multi-View Visual Search.
Liu, Xianglong; Huang, Lei; Deng, Cheng; Lang, Bo; Tao, Dacheng
2016-10-01
Hash-based nearest neighbor search has become attractive in many applications. However, the quantization in hashing usually degrades the discriminative power when using Hamming distance ranking. Moreover, for large-scale visual search, existing hashing methods cannot directly support efficient search over data with multiple sources, even though the literature has shown that adaptively incorporating complementary information from diverse sources or views can significantly boost search performance. To address these problems, this paper proposes a novel and generic approach to building multiple hash tables with multiple views and generating fine-grained ranking results at bitwise and tablewise levels. For each hash table, a query-adaptive bitwise weighting is introduced to alleviate the quantization loss by simultaneously exploiting the quality of hash functions and their complement for nearest neighbor search. From the tablewise aspect, multiple hash tables are built for different data views as a joint index, over which a query-specific rank fusion is proposed to rerank all results from the bitwise ranking by diffusing in a graph. Comprehensive experiments on image search over three well-known benchmarks show that the proposed method achieves up to 17.11% and 20.28% performance gains on single and multiple table search over the state-of-the-art methods.
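Why bitwise weighting gives finer-grained ranking than plain Hamming distance can be shown with a small sketch. It is a generic weighted Hamming ranking, not the paper's weighting scheme; the weight values and function names are hypothetical, and in the paper's setting the weights would be derived per query.

```python
def weighted_hamming(query_bits, code_bits, weights):
    """Sum the weights of the bit positions where the two codes disagree."""
    return sum(w for q, c, w in zip(query_bits, code_bits, weights) if q != c)

def rank_by_weighted_hamming(query_bits, database, weights):
    """Return database indices ordered from nearest to farthest."""
    return sorted(
        range(len(database)),
        key=lambda i: weighted_hamming(query_bits, database[i], weights),
    )

query = [1, 0, 1, 1]
database = [
    [1, 0, 0, 1],   # differs in bit 2
    [0, 0, 1, 1],   # differs in bit 0
    [1, 0, 1, 1],   # identical to the query
]
weights = [0.9, 0.1, 0.2, 0.5]   # hypothetical per-bit reliabilities for this query
order = rank_by_weighted_hamming(query, database, weights)
```

With unweighted Hamming distance the first two codes tie at distance 1; the weights break the tie in favour of the code that disagrees on the less reliable bit.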
biochem4j: Integrated and extensible biochemical knowledge through graph databases.
Swainston, Neil; Batista-Navarro, Riza; Carbonell, Pablo; Dobson, Paul D; Dunstan, Mark; Jervis, Adrian J; Vinaixa, Maria; Williams, Alan R; Ananiadou, Sophia; Faulon, Jean-Loup; Mendes, Pedro; Kell, Douglas B; Scrutton, Nigel S; Breitling, Rainer
2017-01-01
Biologists and biochemists have at their disposal a number of excellent, publicly available data resources such as UniProt, KEGG, and NCBI Taxonomy, which catalogue biological entities. Despite the usefulness of these resources, they remain fundamentally unconnected. While links may appear between entries across these databases, users are typically only able to follow such links by manual browsing or through specialised workflows. Although many of the resources provide web-service interfaces for computational access, performing federated queries across databases remains a non-trivial but essential activity in interdisciplinary systems and synthetic biology programmes. What is needed are integrated repositories to catalogue both biological entities and, crucially, the relationships between them. Such a resource should be extensible, such that newly discovered relationships (for example, those between novel, synthetic enzymes and non-natural products) can be added over time. With the introduction of graph databases, the barrier to the rapid generation, extension and querying of such a resource has been lowered considerably. With a particular focus on metabolic engineering as an illustrative application domain, biochem4j, freely available at http://biochem4j.org, is introduced to provide an integrated, queryable database that warehouses chemical, reaction, enzyme and taxonomic data from a range of reliable resources. The biochem4j framework establishes a starting point for the flexible integration and exploitation of an ever-wider range of biological data sources, from public databases to laboratory-specific experimental datasets, for the benefit of systems biologists, biosystems engineers and the wider community of molecular biologists and biological chemists.
biochem4j: Integrated and extensible biochemical knowledge through graph databases
Batista-Navarro, Riza; Dunstan, Mark; Jervis, Adrian J.; Vinaixa, Maria; Ananiadou, Sophia; Faulon, Jean-Loup; Kell, Douglas B.
2017-01-01
Biologists and biochemists have at their disposal a number of excellent, publicly available data resources such as UniProt, KEGG, and NCBI Taxonomy, which catalogue biological entities. Despite the usefulness of these resources, they remain fundamentally unconnected. While links may appear between entries across these databases, users are typically only able to follow such links by manual browsing or through specialised workflows. Although many of the resources provide web-service interfaces for computational access, performing federated queries across databases remains a non-trivial but essential activity in interdisciplinary systems and synthetic biology programmes. What is needed are integrated repositories to catalogue both biological entities and–crucially–the relationships between them. Such a resource should be extensible, such that newly discovered relationships–for example, those between novel, synthetic enzymes and non-natural products–can be added over time. With the introduction of graph databases, the barrier to the rapid generation, extension and querying of such a resource has been lowered considerably. With a particular focus on metabolic engineering as an illustrative application domain, biochem4j, freely available at http://biochem4j.org, is introduced to provide an integrated, queryable database that warehouses chemical, reaction, enzyme and taxonomic data from a range of reliable resources. The biochem4j framework establishes a starting point for the flexible integration and exploitation of an ever-wider range of biological data sources, from public databases to laboratory-specific experimental datasets, for the benefit of systems biologists, biosystems engineers and the wider community of molecular biologists and biological chemists. PMID:28708831
Learning and Inductive Inference
1982-07-01
a set of graph grammars to describe visual scenes. Other researchers have applied graph grammars to the pattern recognition of handwritten characters...345 1. Issues / 345 2. Mostow's operationalizer / 350 0. Learning from examples / 360 1. Issues / 360 2. Learning in control and pattern recognition ...articles on rote learning and advice-taking. Kenneth Clarkson contributed the article on grammatical inference, and Geoff’ lroiney wrote
Google Trends terms reporting rhinitis and related topics differ in European countries.
Bousquet, J; Agache, I; Anto, J M; Bergmann, K C; Bachert, C; Annesi-Maesano, I; Bousquet, P J; D'Amato, G; Demoly, P; De Vries, G; Eller, E; Fokkens, W J; Fonseca, J; Haahtela, T; Hellings, P W; Just, J; Keil, T; Klimek, L; Kuna, P; Lodrup Carlsen, K C; Mösges, R; Murray, R; Nekam, K; Onorato, G; Papadopoulos, N G; Samolinski, B; Schmid-Grendelmeier, P; Thibaudon, M; Tomazic, P; Triggiani, M; Valiulis, A; Valovirta, E; Van Eerd, M; Wickman, M; Zuberbier, T; Sheikh, A
2017-08-01
Google Trends (GT) reports search trends for specific queries in Google and reflects the real-life epidemiology of allergic rhinitis. We compared Google Trends terms related to allergy and rhinitis in all European Union countries, Norway and Switzerland from 1 January 2011 to 20 December 2016. The aim was to assess whether the same terms could be used to report the seasonal variations of allergic diseases. Using the Google Trends 5-year graph, a clear annual seasonality of queries was found in all countries apart from Cyprus, Estonia, Latvia, Lithuania and Malta. Different terms were found to demonstrate seasonality depending on the country - namely 'hay fever', 'allergy' and 'pollen' - showing cultural differences. A single set of terms cannot be used across all European countries, but allergy seasonality can be compared across Europe provided that the above three terms are used. Using longitudinal data in different countries and multiple terms, we identified an awareness-related spike of searches (December 2016). © 2017 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
1989-09-30
parses, in a second experiment. This procedure used PUNDIT’s Selection Pattern Query and Response (SPQR) component [Lang1988]. We first used SPQR in...messages pattern. SPQR continues the analysis of the ISR. from each domain, and the resulting output is and the parsing of the sentence is allowed to...UNISYS P.O. Box 517, Paoli, PA 19301 ABSTRACT knowledge. This paper presents SPQR (Selectional Pat- One obvious benefit of acquiring domain- tern Queries
OPEX: Optimized Eccentricity Computation in Graphs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Henderson, Keith
2011-11-14
Real-world graphs have many properties of interest, but often these properties are expensive to compute. We focus on eccentricity, radius and diameter in this work. These properties are useful measures of the global connectivity patterns in a graph. Unfortunately, computing eccentricity for all nodes is O(n^2) for a graph with n nodes. We present OPEX, a novel combination of optimizations which improves computation time of these properties by orders of magnitude in real-world experiments on graphs of many different sizes. We run OPEX on graphs with up to millions of links. OPEX gives either exact results or bounded approximations, unlike its competitors which give probabilistic approximations or sacrifice node-level information (eccentricity) to compute graph-level information (diameter).
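For orientation, the three quantities named in this abstract can be computed with the naive baseline of one breadth-first search per node. The sketch below is not OPEX; it only shows the definitions, and it assumes an undirected, connected, unweighted graph stored as an adjacency dictionary. The function names and the toy path graph are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def eccentricities(adj):
    """Eccentricity of a node = largest hop distance to any other node.

    Assumes the graph is connected; runs one BFS per node (the expensive
    baseline that optimized methods try to avoid).
    """
    return {u: max(bfs_distances(adj, u).values()) for u in adj}


# Toy example (assumed data): the path graph a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
ecc = eccentricities(adj)
radius = min(ecc.values())    # smallest eccentricity
diameter = max(ecc.values())  # largest eccentricity
```

On the path graph the end nodes have eccentricity 3 and the inner nodes 2, so the radius is 2 and the diameter is 3.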
2007-04-19
define the patterns and are better at analyzing behavior. SPQR (System for Pattern Query and Recognition) [18, 58] can recognize pattern variants...Stotts. SPQR: Flexible automated design pattern extraction from source code. ase, 00:215, 2003. ISSN 1527-1366. doi: http://doi.ieeecomputersociety.org
Nelson, Carl A; Miller, David J; Oleynikov, Dmitry
2008-01-01
As modular systems come into the forefront of robotic telesurgery, streamlining the process of selecting surgical tools becomes an important consideration. This paper presents a method for optimal queuing of tools in modular surgical tool systems, based on patterns in tool-use sequences, in order to minimize time spent changing tools. The solution approach is to model the set of tools as a graph, with tool-change frequency expressed as edge weights in the graph, and to solve the Traveling Salesman Problem for the graph. In a set of simulations, this method has shown superior performance at optimizing tool arrangements for streamlining surgical procedures.
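One possible reading of the graph model in this abstract can be sketched as a brute-force Traveling Salesman search over a cyclic tool arrangement. Everything specific here is an assumption for illustration: the tool names, the frequency values, and the conversion of change frequency into a cost (maximum frequency minus pair frequency, so that frequently exchanged tools end up adjacent). The paper's actual cost function and solver are not given in the abstract.

```python
from itertools import permutations


def best_tool_cycle(tools, change_freq):
    """Exhaustively search cyclic tool orders (feasible only for few tools).

    change_freq maps an unordered tool pair, given as a tuple, to how often
    one tool is swapped for the other. Cost of placing two tools next to
    each other is (max frequency - pair frequency): an assumed conversion.
    """
    def freq(a, b):
        return change_freq.get((a, b), 0) + change_freq.get((b, a), 0)

    max_freq = max(
        (freq(a, b) for a in tools for b in tools if a != b), default=0
    )
    first, rest = tools[0], tools[1:]
    best_order, best_cost = None, None
    # Fixing the first tool removes rotations of the same cycle.
    for perm in permutations(rest):
        order = (first,) + perm
        cost = sum(
            max_freq - freq(order[i], order[(i + 1) % len(order)])
            for i in range(len(order))
        )
        if best_cost is None or cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost


# Hypothetical tools and swap counts.
tools = ["grasper", "scissors", "cautery", "needle"]
change_freq = {
    ("grasper", "scissors"): 9,
    ("scissors", "cautery"): 7,
    ("cautery", "needle"): 5,
    ("needle", "grasper"): 4,
    ("grasper", "cautery"): 1,
    ("scissors", "needle"): 1,
}
order, cost = best_tool_cycle(tools, change_freq)
```

Exhaustive search grows factorially with the number of tools, so a heuristic TSP solver would be needed for larger tool sets.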
Query Log Analysis of an Electronic Health Record Search Engine
Yang, Lei; Mei, Qiaozhu; Zheng, Kai; Hanauer, David A.
2011-01-01
We analyzed a longitudinal collection of query logs of a full-text search engine designed to facilitate information retrieval in electronic health records (EHR). The collection, 202,905 queries and 35,928 user sessions recorded over the course of 4 years, represents the information-seeking behavior of 533 medical professionals, including frontline practitioners, coding personnel, patient safety officers, and biomedical researchers, for patient data stored in EHR systems. In this paper, we present descriptive statistics of the queries, a categorization of information needs manifested through the queries, as well as temporal patterns of the users’ information-seeking behavior. The results suggest that information needs in the medical domain are substantially more sophisticated than those that general-purpose web search engines need to accommodate. Therefore, we envision that there exists a significant challenge, along with significant opportunities, to provide intelligent query recommendations to facilitate information retrieval in EHR. PMID:22195150
Probabilistic generation of random networks taking into account information on motifs occurrence.
Bois, Frederic Y; Gayraud, Ghislaine
2015-01-01
Because of the huge number of graphs possible even with a small number of nodes, inference on network structure is known to be a challenging problem. Generating large random directed graphs with prescribed probabilities of occurrences of some meaningful patterns (motifs) is also difficult. We show how to generate such random graphs according to a formal probabilistic representation, using fast Markov chain Monte Carlo methods to sample them. As an illustration, we generate realistic graphs with several hundred nodes mimicking a gene transcription interaction network in Escherichia coli.
Probabilistic Generation of Random Networks Taking into Account Information on Motifs Occurrence
Bois, Frederic Y.
2015-01-01
Abstract Because of the huge number of graphs possible even with a small number of nodes, inference on network structure is known to be a challenging problem. Generating large random directed graphs with prescribed probabilities of occurrences of some meaningful patterns (motifs) is also difficult. We show how to generate such random graphs according to a formal probabilistic representation, using fast Markov chain Monte Carlo methods to sample them. As an illustration, we generate realistic graphs with several hundred nodes mimicking a gene transcription interaction network in Escherichia coli. PMID:25493547
Knowledge Acquisition of Generic Queries for Information Retrieval
Seol, Yoon-Ho; Johnson, Stephen B.; Cimino, James J.
2002-01-01
Several studies have identified clinical questions posed by health care professionals to understand the nature of information needs during clinical practice. To support access to digital information sources, it is necessary to integrate the information needs with a computer system. We have developed a conceptual guidance approach in information retrieval, based on a knowledge base that contains the patterns of information needs. The knowledge base uses a formal representation of clinical questions based on the UMLS knowledge sources, called the Generic Query model. To improve the coverage of the knowledge base, we investigated a method for extracting plausible clinical questions from the medical literature. This poster presents the Generic Query model, shows how it is used to represent the patterns of clinical questions, and describes the framework used to extract knowledge from the medical literature.
Software-defined Quantum Networking Ecosystem
DOE Office of Scientific and Technical Information (OSTI.GOV)
Humble, Travis S.; Sadlier, Ronald
The software enables a user to perform modeling and simulation of software-defined quantum networks. The software addresses the problem of how to synchronize transmission of quantum and classical signals through multi-node networks and to demonstrate quantum information protocols such as quantum teleportation. The software approaches this problem by generating a graphical model of the underlying network and attributing properties to each node and link in the graph. The graphical model is then simulated using a combination of discrete-event simulators to calculate the expected state of each node and link in the graph at a future time. A user interacts with the software by providing an initial network model and instantiating methods for the nodes to transmit information with each other. This includes writing application scripts in Python that make use of the software library interfaces. A user then initiates the application scripts, which invoke the software simulation. The user then uses the built-in diagnostic tools to query the state of the simulation and to collect statistics on synchronization.
Fuzzy Intervals for Designing Structural Signature: An Application to Graphic Symbol Recognition
NASA Astrophysics Data System (ADS)
Luqman, Muhammad Muzzamil; Delalandre, Mathieu; Brouard, Thierry; Ramel, Jean-Yves; Lladós, Josep
The motivation behind our work is to present a new methodology for symbol recognition. The proposed method employs a structural approach for representing visual associations in symbols and a statistical classifier for recognition. We vectorize a graphic symbol, encode its topological and geometrical information by an attributed relational graph and compute a signature from this structural graph. We have addressed the sensitivity of structural representations to noise, by using data adapted fuzzy intervals. The joint probability distribution of signatures is encoded by a Bayesian network, which serves as a mechanism for pruning irrelevant features and choosing a subset of interesting features from structural signatures of underlying symbol set. The Bayesian network is deployed in a supervised learning scenario for recognizing query symbols. The method has been evaluated for robustness against degradations & deformations on pre-segmented 2D linear architectural & electronic symbols from GREC databases, and for its recognition abilities on symbols with context noise i.e. cropped symbols.
Kangaroo – A pattern-matching program for biological sequences
2002-01-01
Background Biologists are often interested in performing a simple database search to identify proteins or genes that contain a well-defined sequence pattern. Many databases do not provide straightforward or readily available query tools to perform simple searches, such as identifying transcription binding sites, protein motifs, or repetitive DNA sequences. However, in many cases simple pattern-matching searches can reveal a wealth of information. We present in this paper a regular expression pattern-matching tool that was used to identify short repetitive DNA sequences in human coding regions for the purpose of identifying potential mutation sites in mismatch repair deficient cells. Results Kangaroo is a web-based regular expression pattern-matching program that can search for patterns in DNA, protein, or coding region sequences in ten different organisms. The program is implemented to facilitate a wide range of queries with no restriction on the length or complexity of the query expression. The program is accessible on the web at http://bioinfo.mshri.on.ca/kangaroo/ and the source code is freely distributed at http://sourceforge.net/projects/slritools/. Conclusion A low-level simple pattern-matching application can prove to be a useful tool in many research settings. For example, Kangaroo was used to identify potential genetic targets in a human colorectal cancer variant that is characterized by a high frequency of mutations in coding regions containing mononucleotide repeats. PMID:12150718
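The kind of search described in this abstract, finding short mononucleotide repeats with a regular expression, can be sketched with Python's standard `re` module. This is not Kangaroo's code or query syntax; the function name, the minimum run length, and the toy sequence are assumptions for illustration.

```python
import re


def mononucleotide_repeats(sequence, min_length=6):
    """Return (start, run) pairs for runs of one base repeated min_length+ times.

    The pattern captures a single base and then requires at least
    (min_length - 1) further copies of that same base via a backreference.
    """
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [(m.start(), m.group(0)) for m in pattern.finditer(sequence.upper())]


# Toy sequence (assumed data): a run of seven A's and a run of six T's.
sequence = "ATG" + "A" * 7 + "GC" + "T" * 6 + "CC"
hits = mononucleotide_repeats(sequence, min_length=6)
```

Here `hits` contains the A run starting at offset 3 and the T run starting at offset 12; the trailing `CC` is too short to match.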
Adaptive Signal Recovery on Graphs via Harmonic Analysis for Experimental Design in Neuroimaging.
Kim, Won Hwa; Hwang, Seong Jae; Adluru, Nagesh; Johnson, Sterling C; Singh, Vikas
2016-10-01
Consider an experimental design of a neuroimaging study, where we need to obtain p measurements for each participant in a setting where p' (< p) are cheaper and easier to acquire while the remaining (p - p') are expensive. For example, the p' measurements may include demographics, cognitive scores or routinely offered imaging scans while the (p - p') measurements may correspond to more expensive types of brain image scans with a higher participant burden. In this scenario, it seems reasonable to seek an "adaptive" design for data acquisition so as to minimize the cost of the study without compromising statistical power. We show how this problem can be solved via harmonic analysis of a band-limited graph whose vertices correspond to participants and our goal is to fully recover a multi-variate signal on the nodes, given the full set of cheaper features and a partial set of more expensive measurements. This is accomplished using an adaptive query strategy derived from probing the properties of the graph in the frequency space. To demonstrate the benefits that this framework can provide, we present experimental evaluations on two independent neuroimaging studies and show that our proposed method can reliably recover the true signal with only partial observations, directly yielding substantial financial savings.
Mining the SDSS SkyServer SQL queries log
NASA Astrophysics Data System (ADS)
Hirota, Vitor M.; Santos, Rafael; Raddick, Jordan; Thakar, Ani
2016-05-01
SkyServer, the Internet portal for the Sloan Digital Sky Survey (SDSS) astronomic catalog, provides a set of tools that allows data access for astronomers and scientific education. One of SkyServer's data access interfaces allows users to enter ad-hoc SQL statements to query the catalog. SkyServer also presents some template queries that can be used as a basis for more complex queries. This interface has logged over 330 million queries submitted since 2001. It is expected that analysis of this data can be used to investigate usage patterns, identify potential new classes of queries, find similar queries, etc. and to shed some light on how users interact with the Sloan Digital Sky Survey data and how scientists have adopted the new paradigm of e-Science, which could in turn lead to enhancements to the user interfaces and experience in general. In this paper we review some approaches to SQL query mining, apply the traditional techniques used in the literature and present lessons learned, namely, that the general text mining approach for feature extraction and clustering does not seem to be adequate for this type of data, and, most importantly, we find that this type of analysis can result in very different queries being clustered together.
Human connectome module pattern detection using a new multi-graph MinMax cut model.
De, Wang; Wang, Yang; Nie, Feiping; Yan, Jingwen; Cai, Weidong; Saykin, Andrew J; Shen, Li; Huang, Heng
2014-01-01
Many recent scientific efforts have been devoted to constructing the human connectome using Diffusion Tensor Imaging (DTI) data for understanding the large-scale brain networks that underlie higher-level cognition in humans. However, suitable computational network analysis tools are still lacking in human connectome research. To address this problem, we propose a novel multi-graph min-max cut model to detect the consistent network modules from the brain connectivity networks of all studied subjects. A new multi-graph MinMax cut model is introduced to solve this challenging computational neuroscience problem and an efficient optimization algorithm is derived. In the identified connectome module patterns, each network module shows similar connectivity patterns in all subjects, which are potentially associated with specific brain functions shared by all subjects. We validate our method by analyzing the weighted fiber connectivity networks. The promising empirical results demonstrate the effectiveness of our method.
2015-01-01
Background In recent years, with advances in techniques for protein structure analysis, the knowledge about protein structure and function has been published in a vast number of articles. A method to search for specific publications from such a large pool of articles is needed. In this paper, we propose a method to search for related articles on protein structure analysis by using an article itself as a query. Results Each article is represented as a set of concepts in the proposed method. Then, by using similarities among concepts formulated from databases such as Gene Ontology, similarities between articles are evaluated. In this framework, the desired search results vary depending on the user's search intention because a variety of information is included in a single article. Therefore, the proposed method provides not only one input article (primary article) but also additional articles related to it as an input query to determine the search intention of the user, based on the relationship between two query articles. In other words, based on the concepts contained in the input article and additional articles, we actualize a relevant literature search that considers user intention by varying the degree of attention given to each concept and modifying the concept hierarchy graph. Conclusions We performed an experiment to retrieve relevant papers from articles on protein structure analysis registered in the Protein Data Bank by using three query datasets. The experimental results yielded search results with better accuracy than when user intention was not considered, confirming the effectiveness of the proposed method. PMID:25952498
The many faces of graph dynamics
NASA Astrophysics Data System (ADS)
Pignolet, Yvonne Anne; Roy, Matthieu; Schmid, Stefan; Tredan, Gilles
2017-06-01
The topological structure of complex networks has fascinated researchers for several decades, resulting in the discovery of many universal properties and recurring characteristics of different kinds of networks. However, much less is known today about the network dynamics: indeed, complex networks in reality are not static, but rather dynamically evolve over time. Our paper is motivated by the empirical observation that network evolution patterns seem far from random, but exhibit structure. Moreover, the specific patterns appear to depend on the network type, contradicting the existence of a ‘one fits it all’ model. However, we still lack observables to quantify these intuitions, as well as metrics to compare graph evolutions. Such observables and metrics are needed for extrapolating or predicting evolutions, as well as for interpolating graph evolutions. To explore the many faces of graph dynamics and to quantify temporal changes, this paper suggests building upon the concept of centrality, a measure of node importance in a network. In particular, we introduce the notion of centrality distance, a natural similarity measure for two graphs which depends on a given centrality, characterizing the graph type. Intuitively, centrality distances reflect the extent to which (non-anonymous) node roles are different or, in the case of dynamic graphs, have changed over time, between two graphs. We evaluate the centrality distance approach for five evolutionary models and seven real-world social and physical networks. Our results empirically show the usefulness of centrality distances for characterizing graph dynamics compared to a null-model of random evolution, and highlight the differences between the considered scenarios. Interestingly, our approach allows us to compare the dynamics of very different networks, in terms of scale and evolution speed.
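A minimal sketch of the centrality distance idea follows, under stated assumptions: both graphs share the same labeled node set, the centrality is plain degree, and the distance is the sum over nodes of the absolute centrality difference. The abstract does not spell out the formula, so this is one plausible reading; the function names and the two toy snapshots are illustrative.

```python
def degree_centrality(edges, nodes):
    """Degree of every node in an undirected graph given as an edge list."""
    degree = {v: 0 for v in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def centrality_distance(edges_a, edges_b, nodes, centrality=degree_centrality):
    """Sum over nodes of |C(G_a, v) - C(G_b, v)| for a chosen centrality C.

    Assumes both graphs are defined over the same labeled node set.
    """
    c_a = centrality(edges_a, nodes)
    c_b = centrality(edges_b, nodes)
    return sum(abs(c_a[v] - c_b[v]) for v in nodes)


# Two snapshots of a small evolving graph (assumed data).
nodes = ["a", "b", "c", "d"]
graph_t0 = [("a", "b"), ("b", "c")]
graph_t1 = [("a", "b"), ("b", "c"), ("c", "d")]
distance = centrality_distance(graph_t0, graph_t1, nodes)
```

Adding the edge c-d raises the degree of c and of d by one each, so the distance between the two snapshots is 2. Passing a different `centrality` function (for example closeness or betweenness) changes which structural changes the distance is sensitive to.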
Neural coding in graphs of bidirectional associative memories.
Bouchain, A David; Palm, Günther
2012-01-24
In recent years we have developed large neural network models for the realization of complex cognitive tasks in a neural network architecture that resembles the network of the cerebral cortex. We have used networks of several cortical modules that contain two populations of neurons (one excitatory, one inhibitory). The excitatory populations in these so-called "cortical networks" are organized as a graph of Bidirectional Associative Memories (BAMs), where edges of the graph correspond to BAMs connecting two neural modules and nodes of the graph correspond to excitatory populations with associative feedback connections (and inhibitory interneurons). The neural code in each of these modules consists essentially of the firing pattern of the excitatory population, where it is mainly the subset of active neurons that codes the contents to be represented. The overall activity can be used to distinguish different properties of the represented patterns, which we need to distinguish and control when performing complex tasks like language understanding with these cortical networks. The most important pattern properties or situations are: exactly fitting or matching input, incomplete information or partially matching pattern, superposition of several patterns, conflicting information, and new information that is to be learned. We show simple simulations of these situations in one area or module and discuss how to distinguish these situations based on the overall internal activation of the module. This article is part of a Special Issue entitled "Neural Coding". Copyright © 2011 Elsevier B.V. All rights reserved.
Jupiter, Daniel; Chen, Hailin; VanBuren, Vincent
2009-01-01
Background Although expression microarrays have become a standard tool used by biologists, analysis of data produced by microarray experiments may still present challenges. Comparison of data from different platforms, organisms, and labs may involve complicated data processing, and inferring relationships between genes remains difficult. Results STARNET 2 is a new web-based tool that allows post hoc visual analysis of correlations that are derived from expression microarray data. STARNET 2 facilitates user discovery of putative gene regulatory networks in a variety of species (human, rat, mouse, chicken, zebrafish, Drosophila, C. elegans, S. cerevisiae, Arabidopsis and rice) by graphing networks of genes that are closely co-expressed across a large heterogeneous set of preselected microarray experiments. For each of the represented organisms, raw microarray data were retrieved from NCBI's Gene Expression Omnibus for a selected Affymetrix platform. All pairwise Pearson correlation coefficients were computed for expression profiles measured on each platform, respectively. These precompiled results were stored in a MySQL database, and supplemented by additional data retrieved from NCBI. A web-based tool allows user-specified queries of the database, centered at a gene of interest. The result of a query includes graphs of correlation networks, graphs of known interactions involving genes and gene products that are present in the correlation networks, and initial statistical analyses. Two analyses may be performed in parallel to compare networks, which is facilitated by the new HEATSEEKER module. Conclusion STARNET 2 is a useful tool for developing new hypotheses about regulatory relationships between genes and gene products, and has coverage for 10 species. 
Interpretation of the correlation networks is supported with a database of previously documented interactions, a test for enrichment of Gene Ontology terms, and heat maps of correlation distances that may be used to compare two networks. The list of genes in a STARNET network may be useful in developing a list of candidate genes to use for the inference of causal networks. The tool is freely available at , and does not require user registration. PMID:19828039
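The core computation described for STARNET 2, all pairwise Pearson correlations between expression profiles followed by selection of closely co-expressed pairs, can be sketched as below. This is not the tool's code: the gene names, the expression values, and the 0.9 cutoff are assumptions for illustration, and the abstract does not state which threshold the tool applies.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_edges(profiles, threshold=0.9):
    """Edges (gene1, gene2, r) for every pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression profiles across four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = correlation_edges(profiles, threshold=0.9)
```

Only the geneA-geneB pair passes the cutoff here; geneC correlates at about -0.4 with both. Precomputing and storing such pairs, as the abstract describes, trades storage for fast queries centered on a gene of interest.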
GO2PUB: Querying PubMed with semantic expansion of gene ontology terms
2012-01-01
Background With the development of high throughput methods of gene analyses, there is a growing need for mining tools to retrieve relevant articles in PubMed. As PubMed grows, literature searches become more complex and time-consuming. Automated search tools with good precision and recall are necessary. We developed GO2PUB to automatically enrich PubMed queries with gene names, symbols and synonyms annotated by a GO term of interest or one of its descendants. Results GO2PUB enriches PubMed queries based on selected GO terms and keywords. It processes the result and displays the PMID, title, authors, abstract and bibliographic references of the articles. Gene names, symbols and synonyms that have been generated as extra keywords from the GO terms are also highlighted. GO2PUB is based on a semantic expansion of PubMed queries using the semantic inheritance between terms through the GO graph. Two experts manually assessed the relevance of GO2PUB, GoPubMed and PubMed on three queries about lipid metabolism. Experts’ agreement was high (kappa = 0.88). GO2PUB returned 69% of the relevant articles, GoPubMed: 40% and PubMed: 29%. GO2PUB and GoPubMed have 17% of their results in common, corresponding to 24% of the total number of relevant results. 70% of the articles returned by more than one tool were relevant. 36% of the relevant articles were returned only by GO2PUB, 17% only by GoPubMed and 14% only by PubMed. For determining whether these results can be generalized, we generated twenty queries based on random GO terms with a granularity similar to those of the first three queries and compared the proportions of GO2PUB and GoPubMed results. These were respectively of 77% and 40% for the first queries, and of 70% and 38% for the random queries. The two experts also assessed the relevance of seven of the twenty queries (the three related to lipid metabolism and four related to other domains). Expert agreement was high (0.93 and 0.8). 
GO2PUB and GoPubMed performances were similar to those of the first queries. Conclusions We demonstrated that the use of genes annotated by either GO terms of interest or a descendant of these GO terms yields some relevant articles ignored by other tools. The comparison of GO2PUB, based on semantic expansion, with GoPubMed, based on text mining techniques, showed that both tools are complementary. The analysis of the randomly-generated queries suggests that the results obtained about lipid metabolism can be generalized to other biological processes. GO2PUB is available at http://go2pub.genouest.org. PMID:22958570
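The semantic expansion described for GO2PUB can be sketched as a traversal of the ontology graph followed by query assembly: collect the chosen term and its descendants, gather the genes annotated to any of them, and join the gene names into the query. The term identifiers, gene names, and the exact query string format below are hypothetical; the real tool's query syntax is not given in the abstract.

```python
def descendants(children, term):
    """All terms reachable from term by following child links in a DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def expand_query(keywords, term, children, annotations):
    """Combine keywords with genes annotated to term or any descendant."""
    terms = {term} | descendants(children, term)
    genes = sorted({g for t in terms for g in annotations.get(t, [])})
    if not genes:
        return f"({keywords})"
    return f"({keywords}) AND ({' OR '.join(genes)})"


# Hypothetical ontology fragment: T3 is reachable from T1 by two paths.
children = {"T1": ["T2", "T3"], "T2": ["T3"]}
annotations = {"T1": ["geneX"], "T3": ["geneY", "geneZ"]}
query = expand_query("lipid metabolism", "T1", children, annotations)
```

The `seen` set keeps a term reachable by several paths, as is common in a DAG such as GO, from being visited twice.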
Land Treatment Digital Library
Pilliod, David S.; Welty, Justin L.
2013-01-01
The Land Treatment Digital Library (LTDL) was created by the U.S. Geological Survey to catalog legacy land treatment information on Bureau of Land Management lands in the western United States. The LTDL can be used by federal managers and scientists for compiling information for data-calls, producing maps, generating reports, and conducting analyses at varying spatial and temporal scales. The LTDL currently houses thousands of treatments from BLM lands across 10 states. Users can browse a map to find information on individual treatments, perform more complex queries to identify a set of treatments, and view graphs of treatment summary statistics.
Automatic micropropagation of plants--the vision-system: graph rewriting as pattern recognition
NASA Astrophysics Data System (ADS)
Schwanke, Joerg; Megnet, Roland; Jensch, Peter F.
1993-03-01
The automation of plant micropropagation is necessary to produce large amounts of biomass. Plants have to be dissected at particular cutting points. A vision system is needed for the recognition of the cutting points on the plants. Against this background, this contribution addresses the underlying formalism for determining cutting points on abstract plant models. We show the usefulness of pattern recognition by graph rewriting along with some examples in this context.
Patterns and Practices for Future Architectures
2014-08-01
14. SUBJECT TERMS computing architecture, graph algorithms, high-performance computing, big data, GPU 15. NUMBER OF PAGES 44 16. PRICE CODE 17...at Vertex 1 6 Figure 4: Data Structures Created by Kernel 1 of Single CPU, List Implementation Using the Graph in the Example from Section 1.2 9...Figure 5: Kernel 2 of Graph500 BFS Reference Implementation: Single CPU, List 10 Figure 6: Data Structures for Sequential CSR Algorithm 12 Figure 7
DOE Office of Scientific and Technical Information (OSTI.GOV)
Winlaw, Manda; De Sterck, Hans; Sanders, Geoffrey
In very simple terms, a network can be defined as a collection of points joined together by lines. Thus, networks can be used to represent connections between entities in a wide variety of fields including engineering, science, medicine, and sociology. Many large real-world networks share a surprising number of properties, leading to a strong interest in model development research, and techniques for building synthetic networks have been developed that capture these similarities and replicate real-world graphs. Modeling these real-world networks serves two purposes. First, building models that mimic the patterns and properties of real networks helps to understand the implications of these patterns and helps determine which patterns are important. If we develop a generative process to synthesize real networks we can also examine which growth processes are plausible and which are not. Second, high-quality, large-scale network data is often not available, because of economic, legal, technological, or other obstacles [7]. Thus, there are many instances where the systems of interest cannot be represented by a single exemplar network. As one example, consider the field of cybersecurity, where systems require testing across diverse threat scenarios and validation across diverse network structures. In these cases, where there is no single exemplar network, the systems must instead be modeled as a collection of networks in which the variation among them may be just as important as their common features. By developing processes to build synthetic models, so-called graph generators, we can build synthetic networks that capture both the essential features of a system and realistic variability. Then we can use such synthetic graphs to perform tasks such as simulations, analysis, and decision making. We can also use synthetic graphs to performance test graph analysis algorithms, including clustering algorithms and anomaly detection algorithms.
Compound analysis via graph kernels incorporating chirality.
Brown, J B; Urata, Takashi; Tamura, Takeyuki; Arai, Midori A; Kawabata, Takeo; Akutsu, Tatsuya
2010-12-01
High accuracy is paramount when predicting biochemical characteristics using Quantitative Structural-Property Relationships (QSPRs). Although existing graph-theoretic kernel methods combined with machine learning techniques are efficient for QSPR model construction, they cannot distinguish topologically identical chiral compounds which often exhibit different biological characteristics. In this paper, we propose a new method that extends the recently developed tree pattern graph kernel to accommodate stereoisomers. We show that Support Vector Regression (SVR) with a chiral graph kernel is useful for target property prediction by demonstrating its application to a set of human vitamin D receptor ligands currently under consideration for their potential anti-cancer effects.
SP2Bench: A SPARQL Performance Benchmark
NASA Astrophysics Data System (ADS)
Schmidt, Michael; Hornung, Thomas; Meier, Michael; Pinkel, Christoph; Lausen, Georg
A meaningful analysis and comparison of both existing storage schemes for RDF data and evaluation approaches for SPARQL queries necessitates a comprehensive and universal benchmark platform. We present SP2Bench, a publicly available, language-specific performance benchmark for the SPARQL query language. SP2Bench is set in the DBLP scenario and comprises a data generator for creating arbitrarily large DBLP-like documents and a set of carefully designed benchmark queries. The generated documents mirror key characteristics and social-world distributions encountered in the original DBLP data set, while the queries implement meaningful requests on top of this data, covering a variety of SPARQL operator constellations and RDF access patterns. In this chapter, we discuss requirements and desiderata for SPARQL benchmarks and present the SP2Bench framework, including its data generator, benchmark queries and performance metrics.
GO Explorer: A gene-ontology tool to aid in the interpretation of shotgun proteomics data.
Carvalho, Paulo C; Fischer, Juliana Sg; Chen, Emily I; Domont, Gilberto B; Carvalho, Maria Gc; Degrave, Wim M; Yates, John R; Barbosa, Valmir C
2009-02-24
Spectral counting is a shotgun proteomics approach comprising the identification and relative quantitation of thousands of proteins in complex mixtures. However, this strategy generates bewildering amounts of data whose biological interpretation is a challenge. Here we present a new algorithm, termed GO Explorer (GOEx), that leverages the gene ontology (GO) to aid in the interpretation of proteomic data. GOEx stands out because it combines data from protein fold changes with GO over-representation statistics to help draw conclusions. Moreover, it is tightly integrated within the PatternLab for Proteomics project and, thus, lies within a complete computational environment that provides parsers and pattern recognition tools designed for spectral counting. GOEx offers three independent methods to query data: an interactive directed acyclic graph, a specialist mode where key words can be searched, and an automatic search. Its usefulness is demonstrated by applying it to help interpret the effects of perillyl alcohol, a natural chemotherapeutic agent, on glioblastoma multiform cell lines (A172). We used a new multi-surfactant shotgun proteomic strategy and identified more than 2600 proteins; GOEx pinpointed key sets of differentially expressed proteins related to cell cycle, alcohol catabolism, the Ras pathway, apoptosis, and stress response, to name a few. GOEx facilitates organism-specific studies by leveraging GO and providing a rich graphical user interface. It is a simple to use tool, specialized for biologists who wish to analyze spectral counting data from shotgun proteomics. GOEx is available at http://pcarvalho.com/patternlab.
Content-Aware DataGuide with Incremental Index Update using Frequently Used Paths
NASA Astrophysics Data System (ADS)
Sharma, A. K.; Duhan, Neelam; Khattar, Priyanka
2010-11-01
The size of the WWW is increasing day by day. Because Web data is largely unstructured, it is very difficult for information retrieval tools to fully utilize Web information. XML pages address this problem by providing some structural information to users. Without efficient indexes, however, query processing can be quite inefficient because it requires an exhaustive traversal of the XML data. This paper proposes an improved content-centric variant of the Content-Aware DataGuide, an indexing technique for XML databases, that uses frequently used paths from historical query logs to improve query performance. The index can be updated incrementally according to changes in the query workload, which minimizes the overhead of reconstruction. Frequently used paths are extracted by applying any sequential pattern mining algorithm to subsequent queries in the query workload, after which the data structures are incrementally updated. The technique proves efficient: partial matching queries can be executed efficiently, and users get more relevant documents in their results.
Functional network organization of the human brain
Power, Jonathan D; Cohen, Alexander L; Nelson, Steven M; Wig, Gagan S; Barnes, Kelly Anne; Church, Jessica A; Vogel, Alecia C; Laumann, Timothy O; Miezin, Fran M; Schlaggar, Bradley L; Petersen, Steven E
2011-01-01
Real-world complex systems may be mathematically modeled as graphs, revealing properties of the system. Here we study graphs of functional brain organization in healthy adults using resting state functional connectivity MRI. We propose two novel brain-wide graphs, one of 264 putative functional areas, the other a modification of voxelwise networks that eliminates potentially artificial short-distance relationships. These graphs contain many subgraphs in good agreement with known functional brain systems. Other subgraphs lack established functional identities; we suggest possible functional characteristics for these subgraphs. Further, graph measures of the areal network indicate that the default mode subgraph shares network properties with sensory and motor subgraphs: it is internally integrated but isolated from other subgraphs, much like a “processing” system. The modified voxelwise graph also reveals spatial motifs in the patterning of systems across the cortex. PMID:22099467
From Provenance Standards and Tools to Queries and Actionable Provenance
NASA Astrophysics Data System (ADS)
Ludaescher, B.
2017-12-01
The W3C PROV standard provides a minimal core for sharing retrospective provenance information for scientific workflows and scripts. PROV extensions such as DataONE's ProvONE model are necessary for linking runtime observables in retrospective provenance records with conceptual-level prospective provenance information, i.e., workflow (or dataflow) graphs. Runtime provenance recorders, such as DataONE's RunManager for R, or noWorkflow for Python capture retrospective provenance automatically. YesWorkflow (YW) is a toolkit that allows researchers to declare high-level prospective provenance models of scripts via simple inline comments (YW-annotations), revealing the computational modules and dataflow dependencies in the script. By combining and linking both forms of provenance, important queries and use cases can be supported that neither provenance model can afford on its own. We present existing and emerging provenance tools developed for the DataONE and SKOPE (Synthesizing Knowledge of Past Environments) projects. We show how the different tools can be used individually and in combination to model, capture, share, query, and visualize provenance information. We also present challenges and opportunities for making provenance information more immediately actionable for the researchers who create it in the first place. We argue that such a shift towards "provenance-for-self" is necessary to accelerate the creation, sharing, and use of provenance in support of transparent, reproducible computational and data science.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gillen, David S.
Analysis activities for Nonproliferation and Arms Control verification require the use of many types of data. Tabular structured data, such as Excel spreadsheets and relational databases, have traditionally been used for data mining activities, where specific queries are issued against data to look for matching results. The application of visual analytics tools to structured data enables further exploration of datasets to promote discovery of previously unknown results. This paper discusses the application of a specific visual analytics tool to datasets related to the field of Arms Control and Nonproliferation to promote the use of visual analytics more broadly in this domain. Visual analytics focuses on analytical reasoning facilitated by interactive visual interfaces (Wong and Thomas 2004). It promotes exploratory analysis of data and complements data mining technologies, in which known patterns are mined. In addition, a human in the loop can bring in domain knowledge and subject matter expertise. Visual analytics has not been widely applied to this domain. In this paper, we focus on one type of data, structured data, and show the results of applying a specific visual analytics tool to answer questions in the Arms Control and Nonproliferation domain. We chose the T.Rex tool, a visual analytics tool developed at PNNL, which uses a variety of visual exploration patterns to discover relationships in structured datasets, including a facet view, graph view, matrix view, and timeline view. The facet view enables discovery of relationships between categorical information, such as countries and locations. The graph view visualizes node-link relationship patterns, such as the flow of materials being shipped between parties. The matrix view shows highly correlated categories of information. The timeline view shows temporal patterns in data.
In this paper, we use T.Rex with two different datasets to demonstrate how interactive exploration of the data can aid an analyst with arms control and nonproliferation verification activities. Using a dataset from PIERS (PIERS 2014), we show how container shipment imports and exports can aid an analyst in understanding the shipping patterns between two countries. We also use T.Rex to examine a collection of research publications from the IAEA International Nuclear Information System (IAEA 2014) to discover collaborations of concern. We hope this paper will encourage the use of visual analytics for structured data in the field of nonproliferation and arms control verification. Our paper outlines some of the challenges that exist before broad adoption of these kinds of tools can occur and offers next steps to overcome these challenges.
Intelligent web image retrieval system
NASA Astrophysics Data System (ADS)
Hong, Sungyong; Lee, Chungwoo; Nah, Yunmook
2001-07-01
Web sites such as e-business and shopping mall sites now handle large amounts of image information. To find a specific image in these sources, users typically rely on web search engines or image database engines that offer only keyword-based or color-based retrieval, with limited search capabilities. This paper presents an intelligent web image retrieval system. We propose the system architecture, texture- and color-based image classification and indexing techniques, and representation schemes for user usage patterns. A query can be given by providing keywords, by selecting one or more sample texture patterns, by assigning color values within positional color blocks, or by combining some or all of these factors. The system keeps track of users' preferences by generating user query logs and automatically adds more search information to subsequent user queries. To show the usefulness of the proposed system, experimental results for recall and precision are also presented.
Array Databases: Agile Analytics (not just) for the Earth Sciences
NASA Astrophysics Data System (ADS)
Baumann, P.; Misev, D.
2015-12-01
Gridded data, such as images, image timeseries, and climate datacubes, today are managed separately from the metadata, and with different, restricted retrieval capabilities. While databases are good at metadata modelled in tables, XML hierarchies, or RDF graphs, they traditionally do not support multi-dimensional arrays. This gap is being closed by Array Databases, pioneered by the scalable rasdaman ("raster data manager") array engine. Its declarative query language, rasql, extends SQL with array operators which are optimized and parallelized on the server side. Installations can easily be mashed up securely, thereby enabling large-scale location-transparent query processing in federations. Domain experts value the integration with their commonly used tools, leading to a quick learning curve. Earth, Space, and Life sciences, but also Social sciences as well as business, have massive amounts of data and complex analysis challenges that are answered by rasdaman. As of today, rasdaman is mature and in operational use on hundreds of Terabytes of timeseries datacubes, with transparent query distribution across more than 1,000 nodes. Additionally, its concepts have shaped international Big Data standards in the field, including the forthcoming array extension to ISO SQL, many of which are by now supported by both open-source and commercial systems. In the geo field, rasdaman is the reference implementation for the Open Geospatial Consortium (OGC) Big Data standard, WCS, now also under adoption by ISO. Further, rasdaman is in the final stage of OSGeo incubation. In this contribution we present array queries a la rasdaman, describe the architecture and novel optimization and parallelization techniques introduced in 2015, and put this in context of the intercontinental EarthServer initiative which utilizes rasdaman for enabling agile analytics on Petascale datacubes.
Attribute-based Decision Graphs: A framework for multiclass data classification.
Bertini, João Roberto; Nicoletti, Maria do Carmo; Zhao, Liang
2017-01-01
Graph-based algorithms have been successfully applied in machine learning and data mining tasks. A simple but widely used approach to building graphs from vector-based data is to consider each data instance as a vertex and to connect pairs of vertices using a similarity measure. Although this abstraction presents some advantages, such as arbitrary shape representation of the original data, it also has drawbacks: for example, it depends on the choice of a pre-defined distance metric and is biased by the local information among data instances. Aiming at exploring alternative ways to build graphs from data, this paper proposes an algorithm for constructing a new type of graph, called Attribute-based Decision Graph (AbDG). Given a vector-based data set, an AbDG is built by partitioning each data attribute range into disjoint intervals and representing each interval as a vertex. The edges are then established between vertices from different attributes according to a pre-defined pattern. Classification is performed through a matching process between the attribute values of the new instance and the AbDG. Moreover, AbDG provides an inner mechanism to handle missing attribute values, which contributes to expanding its applicability. Results of classification tasks have shown that AbDG is a competitive approach when compared to well-known multiclass algorithms. The main contribution of the proposed framework is the combination of the advantages of attribute-based and graph-based techniques to perform robust pattern matching data classification, while permitting the analysis of the input data considering only a subset of its attributes. Copyright © 2016 Elsevier Ltd. All rights reserved.
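The construction described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: equal-width intervals, edges only between consecutive attributes, per-class edge counts, and the names `build_abdg` and `classify` are all assumptions made for illustration, and missing-value handling is omitted.

```python
from bisect import bisect_right
from collections import Counter


def make_cut_points(values, n_intervals):
    """Equal-width cut points splitting [min, max] into n_intervals pieces (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    return [lo + width * k for k in range(1, n_intervals)]


def to_vertices(instance, cuts_per_attr):
    """Map each attribute value to an (attribute index, interval index) vertex."""
    return [(a, bisect_right(cuts, value))
            for a, (value, cuts) in enumerate(zip(instance, cuts_per_attr))]


def build_abdg(X, y, n_intervals=3):
    """Count, per class, the edges linking interval vertices of consecutive attributes."""
    n_attrs = len(X[0])
    cuts_per_attr = [make_cut_points([row[a] for row in X], n_intervals)
                     for a in range(n_attrs)]
    edge_counts = Counter()
    for row, label in zip(X, y):
        verts = to_vertices(row, cuts_per_attr)
        for u, v in zip(verts, verts[1:]):
            edge_counts[(u, v, label)] += 1
    return cuts_per_attr, edge_counts


def classify(instance, cuts_per_attr, edge_counts, labels):
    """Pick the class whose edges best match the instance's vertex path (hypothetical scoring)."""
    verts = to_vertices(instance, cuts_per_attr)
    pairs = list(zip(verts, verts[1:]))
    scores = {c: sum(edge_counts[(u, v, c)] for u, v in pairs) for c in labels}
    return max(scores, key=scores.get)


X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]]
y = ["a", "a", "b", "b"]
cuts, edges = build_abdg(X, y, n_intervals=2)
print(classify([0.2, 0.1], cuts, edges, ["a", "b"]))
```

Because vertices are intervals rather than data instances, no distance metric between instances is needed, which is the drawback of similarity graphs that the abstract highlights.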
Structure and strategy in encoding simplified graphs
NASA Technical Reports Server (NTRS)
Schiano, Diane J.; Tversky, Barbara
1992-01-01
Tversky and Schiano (1989) found a systematic bias toward the 45-deg line in memory for the slopes of identical lines when embedded in graphs, but not in maps, suggesting the use of a cognitive reference frame specifically for encoding meaningful graphs. The present experiments explore this issue further using the linear configurations alone as stimuli. Experiments 1 and 2 demonstrate that perception and immediate memory for the slope of a test line within orthogonal 'axes' are predictable from purely structural considerations. In Experiments 3 and 4, subjects were instructed to use a diagonal-reference strategy in viewing the stimuli, which were described as 'graphs' only in Experiment 3. Results for both studies showed the diagonal bias previously found only for graphs. This pattern provides converging evidence for the diagonal as a cognitive reference frame in encoding linear graphs, and demonstrates that even in highly simplified displays, strategic factors can produce encoding biases not predictable from stimulus structure alone.
The graph neural network model.
Scarselli, Franco; Gori, Marco; Tsoi, Ah Chung; Hagenbuchner, Markus; Monfardini, Gabriele
2009-01-01
Many underlying relationships among data in several areas of science and engineering, e.g., computer vision, molecular chemistry, molecular biology, pattern recognition, and data mining, can be represented in terms of graphs. In this paper, we propose a new neural network model, called graph neural network (GNN) model, that extends existing neural network methods for processing the data represented in graph domains. This GNN model, which can directly process most of the practically useful types of graphs, e.g., acyclic, cyclic, directed, and undirected, implements a function $\tau(G,n) \in \mathbb{R}^m$ that maps a graph $G$ and one of its nodes $n$ into an $m$-dimensional Euclidean space. A supervised learning algorithm is derived to estimate the parameters of the proposed GNN model. The computational cost of the proposed algorithm is also considered. Some experimental results are shown to validate the proposed learning algorithm, and to demonstrate its generalization capabilities.
HAL: a hierarchical format for storing and analyzing multiple genome alignments.
Hickey, Glenn; Paten, Benedict; Earl, Dent; Zerbino, Daniel; Haussler, David
2013-05-15
Large multiple genome alignments and inferred ancestral genomes are ideal resources for comparative studies of molecular evolution, and advances in sequencing and computing technology are making them increasingly obtainable. These structures can provide a rich understanding of the genetic relationships between all subsets of species they contain. Current formats for storing genomic alignments, such as XMFA and MAF, are all indexed or ordered using a single reference genome, however, which limits the information that can be queried with respect to other species and clades. This loss of information grows with the number of species under comparison, as well as their phylogenetic distance. We present HAL, a compressed, graph-based hierarchical alignment format for storing multiple genome alignments and ancestral reconstructions. HAL graphs are indexed on all genomes they contain. Furthermore, they are organized phylogenetically, which allows for modular and parallel access to arbitrary subclades without fragmentation because of rearrangements that have occurred in other lineages. HAL graphs can be created or read with a comprehensive C++ API. A set of tools is also provided to perform basic operations, such as importing and exporting data, identifying mutations and coordinate mapping (liftover). All documentation and source code for the HAL API and tools are freely available at http://github.com/glennhickey/hal. Contact: hickey@soe.ucsc.edu or haussler@soe.ucsc.edu. Supplementary data are available at Bioinformatics online.
Pattern Discovery and Change Detection of Online Music Query Streams
NASA Astrophysics Data System (ADS)
Li, Hua-Fu
In this paper, an efficient stream mining algorithm, called FTP-stream (Frequent Temporal Pattern mining of streams), is proposed to find the frequent temporal patterns over melody sequence streams. In the framework of our proposed algorithm, an effective bit-sequence representation is used to reduce the time and memory needed to slide the windows. The FTP-stream algorithm can calculate the support threshold in only a single pass based on the concept of bit-sequence representation. It takes advantage of the "left" and "and" operations of the representation. Experiments show that the proposed algorithm scans the music query stream only once, runs significantly faster, and consumes less memory than existing algorithms such as SWFI-stream and Moment.
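The bit-sequence idea in this abstract can be sketched in Python as follows. This is an illustrative reading, not the FTP-stream algorithm itself: it assumes the "left" operation is a left bit shift that ages the window by one position and that the "and" operation intersects per-item bit sequences to count co-occurrences. The class name `BitSequenceWindow` is hypothetical, and the sketch counts unordered item sets rather than temporal patterns.

```python
class BitSequenceWindow:
    """Sliding window in which each item keeps one bit per transaction (hypothetical helper)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.mask = (1 << window_size) - 1  # keeps only the newest window_size bits
        self.bits = {}                      # item -> integer used as a bit sequence

    def slide(self, transaction):
        """Append a transaction; the left shift drops the oldest one from the window."""
        items = set(transaction)
        for item in items:
            self.bits.setdefault(item, 0)
        for item in list(self.bits):
            shifted = (self.bits[item] << 1) & self.mask
            if item in items:
                shifted |= 1                # newest transaction is the lowest bit
            if shifted:
                self.bits[item] = shifted
            else:
                del self.bits[item]         # item no longer occurs in the window

    def support(self, pattern):
        """Count window transactions containing every item of pattern via bitwise AND."""
        combined = self.mask
        for item in pattern:
            combined &= self.bits.get(item, 0)
        return bin(combined).count("1")


window = BitSequenceWindow(window_size=3)
for notes in (["C", "E"], ["C"], ["C", "E"], ["G"]):
    window.slide(notes)
print(window.support(["C"]), window.support(["C", "E"]))
```

Each arriving transaction costs one shift per tracked item, and a support count is one AND per pattern item plus a bit count, which is consistent with the single-pass claim in the abstract.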
sc-PDB-Frag: a database of protein-ligand interaction patterns for Bioisosteric replacements.
Desaphy, Jérémy; Rognan, Didier
2014-07-28
Bioisosteric replacement plays an important role in medicinal chemistry by keeping the biological activity of a molecule while changing either its core scaffold or substituents, thereby facilitating lead optimization and patenting. Bioisosteres are classically chosen in order to keep the main pharmacophoric moieties of the substructure to replace. However, notably when changing a scaffold, no attention is usually paid as to whether all atoms of the reference scaffold are equally important for binding to the desired target. We herewith propose a novel database for bioisosteric replacement (scPDBFrag), capitalizing on our recently published structure-based approach to scaffold hopping, focusing on interaction pattern graphs. Protein-bound ligands are first fragmented and the interaction of the corresponding fragments with their protein environment is computed on the fly. Using an in-house developed graph alignment tool, interaction pattern graphs can be compared, aligned, and sorted by decreasing similarity to any reference. In the herein presented sc-PDB-Frag database (http://bioinfo-pharma.u-strasbg.fr/scPDBFrag), fragments, interaction patterns, alignments, and pairwise similarity scores have been extracted from the sc-PDB database of 8077 druggable protein-ligand complexes and further stored in a relational database. We herewith present the database, its Web implementation, and procedures for identifying true bioisosteric replacements based on conserved interaction patterns.
Semi-Automated Annotation of Biobank Data Using Standard Medical Terminologies in a Graph Database.
Hofer, Philipp; Neururer, Sabrina; Goebel, Georg
2016-01-01
Data describing biobank resources frequently contain unstructured free-text information or follow insufficient coding standards. (Bio-) medical ontologies like Orphanet Rare Diseases Ontology (ORDO) or the Human Disease Ontology (DOID) provide a high number of concepts, synonyms and entity relationship properties. Such standard terminologies increase quality and granularity of input data by adding comprehensive semantic background knowledge from validated entity relationships. Moreover, cross-references between terminology concepts facilitate data integration across databases using different coding standards. In order to encourage the use of standard terminologies, our aim is to identify and link relevant concepts with free-text diagnosis inputs within a biobank registry. Relevant concepts are selected automatically by lexical matching and SPARQL queries against an RDF triplestore. To ensure correctness of annotations, proposed concepts have to be confirmed by medical data administration experts before they are entered into the registry database. Relevant (bio-) medical terminologies describing diseases and phenotypes were identified and stored in a graph database which was tied to a local biobank registry. Concept recommendations during data input trigger a structured description of medical data and facilitate data linkage between heterogeneous systems.
Supervised graph hashing for histopathology image retrieval and classification.
Shi, Xiaoshuang; Xing, Fuyong; Xu, KaiDi; Xie, Yuanpu; Su, Hai; Yang, Lin
2017-12-01
In pathology image analysis, morphological characteristics of cells are critical to grade many diseases. With the development of cell detection and segmentation techniques, it is possible to extract cell-level information for further analysis in pathology images. However, it is challenging to conduct efficient analysis of cell-level information on a large-scale image dataset because each image usually contains hundreds or thousands of cells. In this paper, we propose a novel image retrieval based framework for large-scale pathology image analysis. For each image, we encode each cell into binary codes to generate image representation using a novel graph based hashing model and then conduct image retrieval by applying a group-to-group matching method to similarity measurement. In order to improve both computational efficiency and memory requirement, we further introduce matrix factorization into the hashing model for scalable image retrieval. The proposed framework is extensively validated with thousands of lung cancer images, and it achieves 97.98% classification accuracy and 97.50% retrieval precision with all cells of each query image used. Copyright © 2017 Elsevier B.V. All rights reserved.
EEG analysis of seizure patterns using visibility graphs for detection of generalized seizures.
Wang, Lei; Long, Xi; Arends, Johan B A M; Aarts, Ronald M
2017-10-01
The traditional EEG features in the time and frequency domain show limited seizure detection performance in the epileptic population with intellectual disability (ID). In addition, the influence of EEG seizure patterns on detection performance was less studied. A single-channel EEG signal can be mapped into visibility graphs (VGS), including basic visibility graph (VG), horizontal VG (HVG), and difference VG (DVG). These graphs were used to characterize different EEG seizure patterns. To demonstrate its effectiveness in identifying EEG seizure patterns and detecting generalized seizures, EEG recordings of 615 h on one EEG channel from 29 epileptic patients with ID were analyzed. A novel feature set with discriminative power for seizure detection was obtained by using the VGS method. The degree distributions (DDs) of DVG can clearly distinguish EEG of each seizure pattern. The degree entropy and power-law degree power in DVG were proposed here for the first time, and they show significant difference between seizure and non-seizure EEG. The connecting structure measured by HVG can better distinguish seizure EEG from background than those by VG and DVG. A traditional EEG feature set based on frequency analysis was used here as a benchmark feature set. With a support vector machine (SVM) classifier, the seizure detection performance of the benchmark feature set (sensitivity of 24%, FDt/h of 1.8 s) can be improved by combining our proposed VGS features extracted from one EEG channel (sensitivity of 38%, FDt/h of 1.4 s). The proposed VGS-based features can help improve seizure detection for ID patients. Copyright © 2017 Elsevier B.V. All rights reserved.
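To make the mapping from a signal to a graph concrete, here is a minimal Python sketch of the two standard constructions named in this abstract. It uses the usual visibility criteria from the time-series literature, assumes uniformly spaced samples, and runs in quadratic-or-worse time for clarity. The function names are illustrative, the DVG is not covered, and nothing here reproduces the paper's feature extraction.

```python
def visibility_graph(series):
    """Natural VG: samples a and b are linked if every sample between them lies
    strictly below the straight line joining (a, series[a]) and (b, series[b])."""
    n = len(series)
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            if all(series[c] < series[b] + (series[a] - series[b]) * (b - c) / (b - a)
                   for c in range(a + 1, b)):
                edges.add((a, b))
    return edges


def horizontal_visibility_graph(series):
    """HVG: samples a and b are linked if every sample between them is strictly
    lower than both endpoints."""
    n = len(series)
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            if all(series[c] < min(series[a], series[b]) for c in range(a + 1, b)):
                edges.add((a, b))
    return edges


def degrees(edges, n):
    """Degree of each node; its histogram gives the degree distribution."""
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


signal = [3, 1, 2, 4]  # toy samples, not EEG data
vg = visibility_graph(signal)
hvg = horizontal_visibility_graph(signal)
print(sorted(hvg), degrees(hvg, len(signal)))
```

Every HVG edge is also a VG edge, so the horizontal graph is a sparser view of the same signal; features such as degree entropy would then be computed from the output of `degrees`.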
Movement Forms: A Graph-Dynamic Perspective
Saltzman, Elliot; Holt, Ken
2014-01-01
The focus of this paper is on characterizing the physical movement forms (e.g., walk, crawl, roll, etc.) that can be used to actualize abstract, functionally-specified behavioral goals (e.g., locomotion). Emphasis is placed on how such forms are distinguished from one another, in part, by the set of topological patterns of physical contact between agent and environment (i.e., the set of physical graphs associated with each form) and the transitions among these patterns displayed over the course of performance (i.e., the form’s physical graph dynamics). Crucial in this regard is the creation and dissolution of loops in these graphs, which can be related to the distinction between open and closed kinematic chains. Formal similarities are described within the theoretical framework of task-dynamics between physically-closed kinematic chains (physical loops) that are created during various movement forms and functionally-closed kinematic chains (functional loops) that are associated with task-space control of end-effectors; it is argued that both types of loop must be flexibly incorporated into the coordinative structures that govern skilled action. Final speculation is focused on the role of graphs and their dynamics, not only in processes of coordination and control for individual agents, but also in processes of inter-agent coordination and the coupling of agents with (non-sentient) environmental objects. PMID:24910507
Huang, Chung-Chi; Lu, Zhiyong
2016-01-01
Identifying relevant papers from the literature is a common task in biocuration. Most current biomedical literature search systems primarily rely on matching user keywords. Semantic search, on the other hand, seeks to improve search accuracy by understanding the entities and contextual relations in user keywords. However, past research has mostly focused on semantically identifying biological entities (e.g. chemicals, diseases and genes) with little effort on discovering semantic relations. In this work, we aim to discover biomedical semantic relations in PubMed queries in an automated and unsupervised fashion. Specifically, we focus on extracting and understanding the contextual information (or context patterns) that is used by PubMed users to represent semantic relations between entities such as ‘CHEMICAL-1 compared to CHEMICAL-2.’ With the advances in automatic named entity recognition, we first tag entities in PubMed queries and then use tagged entities as knowledge to recognize pattern semantics. More specifically, we transform PubMed queries into context patterns involving participating entities, which are subsequently projected to latent topics via latent semantic analysis (LSA) to avoid the data sparseness and specificity issues. Finally, we mine semantically similar contextual patterns or semantic relations based on LSA topic distributions. Our two separate evaluation experiments of chemical-chemical (CC) and chemical–disease (CD) relations show that the proposed approach significantly outperforms a baseline method, which simply measures pattern semantics by similarity in participating entities. The highest performance achieved by our approach is nearly 0.9 and 0.85 respectively for the CC and CD task when compared against the ground truth in terms of normalized discounted cumulative gain (nDCG), a standard measure of ranking quality. 
These results suggest that our approach can effectively identify and return related semantic patterns in a ranked order covering diverse bio-entity relations. To assess the potential utility of our automated top-ranked patterns of a given relation in semantic search, we performed a pilot study on frequently sought semantic relations in PubMed and observed improved literature retrieval effectiveness based on post-hoc human relevance evaluation. Further investigation in larger tests and in real-world scenarios is warranted. PMID:27016698
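For readers unfamiliar with the ranking metric cited in this abstract, the following Python sketch computes nDCG for one ranked list. It assumes the common linear-gain form with a log2 rank discount; the abstract does not say which nDCG variant the authors used, and the relevance grades below are made-up illustration values.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each grade is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades of returned patterns, in ranked order.
print(ndcg([3, 2, 1]), ndcg([1, 2, 3]))
```

A score of 1 means the returned order matches the ideal order, so the reported values of nearly 0.9 and 0.85 indicate rankings close to the ground truth.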
Riding the Hype Wave: Evaluating new AI Techniques for their Applicability in Earth Science
NASA Astrophysics Data System (ADS)
Ramachandran, R.; Zhang, J.; Maskey, M.; Lee, T. J.
2016-12-01
Every few years a new technology rides the hype wave generated by the computer science community. Converts to this new technology who surface from both the science community and the informatics community promulgate that it can radically improve or even change the existing scientific process. Recent examples of new technology following in the footsteps of "big data" now include deep learning algorithms and knowledge graphs. Deep learning algorithms mimic the human brain and process information through multiple stages of transformation and representation. These algorithms are able to learn complex functions that map pixels directly to outputs without relying on human-crafted features and solve some of the complex classification problems that exist in science. Similarly, knowledge graphs aggregate information around defined topics that enable users to resolve their query without having to navigate and assemble information manually. Knowledge graphs could potentially be used in scientific research to assist in hypothesis formulation, testing, and review. The challenge for the Earth science research community is to evaluate these new technologies by asking the right questions and considering what-if scenarios. What is this new technology enabling/providing that is innovative and different? Can one justify the adoption costs with respect to the research returns? Since nothing comes for free, utilizing a new technology entails adoption costs that may outweigh the benefits. Furthermore, these technologies may require significant computing infrastructure in order to be utilized effectively. Results from two different projects will be presented along with lessons learned from testing these technologies. The first project primarily evaluates deep learning techniques for different applications of image retrieval within Earth science while the second project builds a prototype knowledge graph constructed for Hurricane science.
Identifying Threats Using Graph-based Anomaly Detection
NASA Astrophysics Data System (ADS)
Eberle, William; Holder, Lawrence; Cook, Diane
Much of the data collected during the monitoring of cyber and other infrastructures is structural in nature, consisting of various types of entities and relationships between them. The detection of threatening anomalies in such data is crucial to protecting these infrastructures. We present an approach to detecting anomalies in a graph-based representation of such data that explicitly represents these entities and relationships. The approach consists of first finding normative patterns in the data using graph-based data mining and then searching for small, unexpected deviations from these normative patterns, assuming illicit behavior tries to mimic legitimate, normative behavior. The approach is evaluated using several synthetic and real-world datasets. Results show that the approach has high true-positive rates and low false-positive rates, and is capable of detecting complex structural anomalies in real-world domains including email communications, cellphone calls and network traffic.
Atmospheric Pressure Patterns Before and During Dust Storm
2012-11-27
This graph compares a typical daily pattern of changing atmospheric pressure (blue) with the pattern during a regional dust storm hundreds of miles away (red). The data are from the Rover Environmental Monitoring Station (REMS) on NASA's Curiosity rover.
Interest in tanning beds and sunscreen in German-speaking countries.
Kirchberger, Michael C; Kirchberger, Laura F; Eigentler, Thomas K; Reinhard, Raphael; Berking, Carola; Schuler, Gerold; Heinzerling, Lucie; Heppt, Markus V
2017-12-01
The growing incidence of nearly all types of skin cancer can be attributed to increased exposure to natural or artificial ultraviolet (UV) radiation. However, there is a scarcity of statistical data on risk behavior or sunscreen use, which would be important for any prevention efforts. Using the search engine Google®, we analyzed search patterns for the terms Solarium (tanning bed), Sonnencreme (sunscreen), and Sonnenschutz (sun protection) in Germany, Austria, and Switzerland between 2004 and 2016, and compared them to search patterns worldwide. For this purpose, "normalized search volumes" (NSVs) were calculated for the various search queries. The corresponding polynomial functions were then compared with each other over the course of time. Since 2001, there has been a marked worldwide decrease in the search queries for tanning bed, whereas those for sunscreen have steadily increased. In German-speaking countries, on the other hand, there have consistently, for years, been more search queries for tanning bed than for sunscreen. There is an annual periodicity of the queries, with the highest NSVs for tanning bed between March and May and those for sunscreen in the summer months around June. In Germany, the city-states of Hamburg and Berlin have particularly high NSVs for tanning bed. Compared to the rest of the world, German-speaking countries show a strikingly unfavorable search pattern. There is still great need for education and prevention with respect to sunscreen use and avoidance of artificial UV exposure. © 2017 Deutsche Dermatologische Gesellschaft (DDG). Published by John Wiley & Sons Ltd.
Mining and Modeling Real-World Networks: Patterns, Anomalies, and Tools
ERIC Educational Resources Information Center
Akoglu, Leman
2012-01-01
Large real-world graph (a.k.a network, relational) data are omnipresent, in online media, businesses, science, and the government. Analysis of these massive graphs is crucial, in order to extract descriptive and predictive knowledge with many commercial, medical, and environmental applications. In addition to its general structure, knowing what…
Enabling Graph Mining in RDF Triplestores using SPARQL for Holistic In-situ Graph Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lee, Sangkeun; Sukumar, Sreenivas R; Hong, Seokyong
Graph analysis is now considered a promising technique for discovering useful knowledge in data from a new perspective. We envision that there are two dimensions of graph analysis: OnLine Graph Analytic Processing (OLGAP) and Graph Mining (GM), which focus respectively on subgraph pattern matching and automatic knowledge discovery in graphs. Moreover, as these two dimensions aim to complementarily solve complex problems, holistic in-situ graph analysis that covers both OLGAP and GM in a single system is critical for minimizing the burdens of operating multiple graph systems and transferring intermediate result sets between those systems. Nevertheless, most existing graph analysis systems are capable of only one dimension of graph analysis. In this work, we take an approach to enabling GM capabilities (e.g., PageRank, connected-component analysis, node eccentricity, etc.) in RDF triplestores, which were originally developed to store RDF datasets and provide OLGAP capability. More specifically, to achieve our goal, we implemented six representative graph mining algorithms using SPARQL. The approach makes a wide range of available RDF data sets directly applicable for holistic graph analysis within a single system. To validate our approach, we evaluate the performance of our implementations with nine real-world datasets and three different computing environments: a laptop computer, an Amazon EC2 instance, and a shared-memory Cray XMT2 URIKA-GD graph-processing appliance. The experimental results show that our implementation can provide promising and scalable performance for real-world graph analysis in all tested environments. The developed software is publicly available in an open-source project that we initiated.
Enabling Graph Mining in RDF Triplestores using SPARQL for Holistic In-situ Graph Analysis
Lee, Sangkeun; Sukumar, Sreenivas R; Hong, Seokyong; ...
2016-01-01
Graph analysis is now considered a promising technique for discovering useful knowledge in data from a new perspective. We envision that there are two dimensions of graph analysis: OnLine Graph Analytic Processing (OLGAP) and Graph Mining (GM), which focus respectively on subgraph pattern matching and automatic knowledge discovery in graphs. Moreover, as these two dimensions aim to complementarily solve complex problems, holistic in-situ graph analysis that covers both OLGAP and GM in a single system is critical for minimizing the burdens of operating multiple graph systems and transferring intermediate result sets between those systems. Nevertheless, most existing graph analysis systems are capable of only one dimension of graph analysis. In this work, we take an approach to enabling GM capabilities (e.g., PageRank, connected-component analysis, node eccentricity, etc.) in RDF triplestores, which were originally developed to store RDF datasets and provide OLGAP capability. More specifically, to achieve our goal, we implemented six representative graph mining algorithms using SPARQL. The approach makes a wide range of available RDF data sets directly applicable for holistic graph analysis within a single system. To validate our approach, we evaluate the performance of our implementations with nine real-world datasets and three different computing environments: a laptop computer, an Amazon EC2 instance, and a shared-memory Cray XMT2 URIKA-GD graph-processing appliance. The experimental results show that our implementation can provide promising and scalable performance for real-world graph analysis in all tested environments. The developed software is publicly available in an open-source project that we initiated.
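As a rough illustration of what a graph-mining primitive over RDF-style data involves, the sketch below runs PageRank by power iteration over a list of (subject, predicate, object) triples in plain Python. It is not the authors' SPARQL implementation: the toy triples, the `pagerank` function, the damping factor of 0.85 and the iteration count are all assumptions made for this example.

```python
from collections import defaultdict


def pagerank(triples, damping=0.85, iterations=50):
    """Power-iteration PageRank over (subject, predicate, object) triples.

    Each triple is treated as a directed edge subject -> object; the
    predicate is ignored. Rank held by nodes without outgoing edges is
    redistributed uniformly so that the ranks always sum to 1.
    """
    out_links = defaultdict(set)
    nodes = set()
    for subject, _predicate, obj in triples:
        out_links[subject].add(obj)
        nodes.update((subject, obj))
    if not nodes:
        return {}

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            targets = out_links[v]
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy data: three resources linking to each other.
triples = [
    ("ex:a", "ex:links", "ex:b"),
    ("ex:b", "ex:links", "ex:c"),
    ("ex:c", "ex:links", "ex:a"),
    ("ex:a", "ex:links", "ex:c"),
]
ranks = pagerank(triples)
```

In a triplestore the same iteration would presumably be expressed as repeated aggregate queries over the stored edges; the in-memory dictionary here only stands in for that storage.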
An efficient randomized algorithm for contact-based NMR backbone resonance assignment.
Kamisetty, Hetunandan; Bailey-Kellogg, Chris; Pandurangan, Gopal
2006-01-15
Backbone resonance assignment is a critical bottleneck in studies of protein structure, dynamics and interactions by nuclear magnetic resonance (NMR) spectroscopy. A minimalist approach to assignment, which we call 'contact-based', seeks to dramatically reduce experimental time and expense by replacing the standard suite of through-bond experiments with the through-space (nuclear Overhauser enhancement spectroscopy, NOESY) experiment. In the contact-based approach, spectral data are represented in a graph with vertices for putative residues (of unknown relation to the primary sequence) and edges for hypothesized NOESY interactions, such that observed spectral peaks could be explained if the residues were 'close enough'. Due to experimental ambiguity, several incorrect edges can be hypothesized for each spectral peak. An assignment is derived by identifying consistent patterns of edges (e.g. for alpha-helices and beta-sheets) within a graph and by mapping the vertices to the primary sequence. The key algorithmic challenge is to be able to uncover these patterns even when they are obscured by significant noise. This paper develops, analyzes and applies a novel algorithm for the identification of polytopes representing consistent patterns of edges in a corrupted NOESY graph. Our randomized algorithm aggregates simplices into polytopes and fixes inconsistencies with simple local modifications, called rotations, that maintain most of the structure already uncovered. In characterizing the effects of experimental noise, we employ an NMR-specific random graph model in proving that our algorithm gives optimal performance in expected polynomial time, even when the input graph is significantly corrupted. We confirm this analysis in simulation studies with graphs corrupted by up to 500% noise. Finally, we demonstrate the practical application of the algorithm on several experimental beta-sheet datasets. 
Our approach is able to eliminate a large majority of noise edges and to uncover large consistent sets of interactions. Our algorithm has been implemented in platform-independent Python code. The software can be freely obtained for academic use by request from the authors.
Identifying patients with Alzheimer's disease using resting-state fMRI and graph theory.
Khazaee, Ali; Ebrahimzadeh, Ata; Babajani-Feremi, Abbas
2015-11-01
The study of brain networks on the basis of resting-state functional magnetic resonance imaging (fMRI) has provided promising results for investigating changes in connectivity among different brain regions caused by disease. Graph theory can efficiently characterize different aspects of the brain network by calculating measures of integration and segregation. In this study, we combine graph theoretical approaches with advanced machine learning methods to study functional brain network alteration in patients with Alzheimer's disease (AD). A support vector machine (SVM) was used to explore the ability of graph measures to diagnose AD. We applied our method to the resting-state fMRI data of twenty patients with AD and twenty age- and gender-matched healthy subjects. The data were preprocessed and each subject's graph was constructed by parcellation of the whole brain into 90 distinct regions using the automated anatomical labeling (AAL) atlas. The graph measures were then calculated and used as the discriminating features. Extracted network-based features were fed to different feature selection algorithms to choose the most significant features. In addition to the machine learning approach, statistical analysis was performed on connectivity matrices to find altered connectivity patterns in patients with AD. Using the selected features, we were able to classify patients with AD from healthy subjects with an accuracy of 100%. Results of this study show that pattern recognition and graphs of the brain network, on the basis of resting-state fMRI data, can efficiently assist in the diagnosis of AD. Classification based on resting-state fMRI can be used as a non-invasive and automatic tool for the diagnosis of Alzheimer's disease. Copyright © 2015 International Federation of Clinical Neurophysiology. All rights reserved.
linkedISA: semantic representation of ISA-Tab experimental metadata.
González-Beltrán, Alejandra; Maguire, Eamonn; Sansone, Susanna-Assunta; Rocca-Serra, Philippe
2014-01-01
Reporting and sharing experimental metadata, such as the experimental design, characteristics of the samples, and procedures applied, along with the analysis results, in a standardised manner ensures that datasets are comprehensible and, in principle, reproducible, comparable and reusable. Furthermore, sharing datasets in formats designed for consumption by humans and machines will also maximize their use. The Investigation/Study/Assay (ISA) open source metadata tracking framework facilitates standards-compliant collection, curation, visualization, storage and sharing of datasets, leveraging other platforms to enable analysis and publication. The ISA software suite includes several components used in an increasingly diverse set of life science and biomedical domains; it is underpinned by a general-purpose format, ISA-Tab, and conversions exist into formats required by public repositories. While ISA-Tab works well mainly as a human readable format, we have also implemented a linked data approach to semantically define the ISA-Tab syntax. We present a semantic web representation of the ISA-Tab syntax that complements ISA-Tab's syntactic interoperability with semantic interoperability. We introduce the linkedISA conversion tool from ISA-Tab to the Resource Description Framework (RDF), supporting mappings from the ISA syntax to multiple community-defined, open ontologies and capitalising on user-provided ontology annotations in the experimental metadata. We describe insights of the implementation and how annotations can be expanded driven by the metadata. We applied the conversion tool as part of Bio-GraphIIn, a web-based application supporting integration of the semantically-rich experimental descriptions. 
Designed in a user-friendly manner, the Bio-GraphIIn interface hides most of the complexities from the users, exposing a familiar tabular view of the experimental description to allow seamless interaction with the RDF representation, and visualising descriptors to drive the query over the semantic representation of the experimental design. In addition, we defined queries over the linkedISA RDF representation and demonstrated its use over the linkedISA conversion of datasets from Nature's Scientific Data online publication. Our linked data approach has allowed us to: 1) make the ISA-Tab semantics explicit and machine-processable, 2) exploit the existing ontology-based annotations in the ISA-Tab experimental descriptions, 3) augment the ISA-Tab syntax with new descriptive elements, 4) visualise and query elements related to the experimental design. Reasoning over ISA-Tab metadata and associated data will facilitate data integration and knowledge discovery.
GIS tool for California state legislature electoral history
NASA Astrophysics Data System (ADS)
Artham, Swathi
The California State Legislature consists of two bodies: the lower house, the California State Assembly, with eighty members, and the upper house, the California State Senate, with forty members. Elections are held every two years for both Senate and Assembly. The terms of the Senators are staggered so that half the membership is elected every two years, whereas all the Assembly members are elected every two years. The electoral district boundaries change after every 10-year census. My main objective is to provide a summary of both California State Senate and California State Assembly election results in a single GIS tool, from the years 1970 to 2012. This tool provides information about different trends in the California State Senate and State Assembly elections over the years. This tool was designed to help students and teachers interactively learn about the California State Legislature elections. Users can view the election results by selecting a particular year for Senate or Assembly, which adds a new layer to the map with a coloring scheme for better understanding of changes of party: red for Republicans, blue for Democrats and green for Independents. Users can click on any district shown on the map using a hotlink tool to see the electoral trends for that district over past years. This application provides a powerful Structured Query Language (SQL) query option to enter queries and get election results in the form of tables with various fields. This data can be further used to aid other analysis as per user requirements. This tool also provides various visual statistics using graphs and tables for voter turnout, number of seats won by each party, and number of seats changed from one party to another. It also features a color matrix table that helps users see trends in the California State Senate and Assembly. The results of each two-year election are shown in the form of graphs and tables for better understanding by the user. 
The tool provides two quiz options for users who wish to test the knowledge they gained using the tool. This tool was developed using Java Swing and AWT, Map Objects Java Objects (MOJO), Apache Derby, DBF Explorer, HTML5, CSS3 and JavaScript.
Large-Scale Constraint-Based Pattern Mining
ERIC Educational Resources Information Center
Zhu, Feida
2009-01-01
We studied the problem of constraint-based pattern mining for three different data formats, item-set, sequence and graph, and focused on mining patterns of large sizes. Colossal patterns in each data format are studied to discover pruning properties that are useful for direct mining of these patterns. For item-set data, we observed robustness of…
Tadić, Bosiljka; Andjelković, Miroslav; Boshkoska, Biljana Mileva; Levnajić, Zoran
2016-01-01
Human behaviour in various circumstances mirrors the corresponding brain connectivity patterns, which are suitably represented by functional brain networks. While the objective analysis of these networks by graph theory tools deepened our understanding of brain functions, the multi-brain structures and connections underlying human social behaviour remain largely unexplored. In this study, we analyse the aggregate graph that maps coordination of EEG signals previously recorded during spoken communications in two groups of six listeners and two speakers. Applying an innovative approach based on the algebraic topology of graphs, we analyse higher-order topological complexes consisting of mutually interwoven cliques of a high order to which the identified functional connections organise. Our results reveal that the topological quantifiers provide new suitable measures for differences in the brain activity patterns and inter-brain synchronisation between speakers and listeners. Moreover, the higher topological complexity correlates with the listener’s concentration to the story, confirmed by self-rating, and closeness to the speaker’s brain activity pattern, which is measured by network-to-network distance. The connectivity structures of the frontal and parietal lobe consistently constitute distinct clusters, which extend across the listener’s group. Formally, the topology quantifiers of the multi-brain communities exceed the sum of those of the participating individuals and also reflect the listener’s rated attributes of the speaker and the narrated subject. In the broader context, the presented study exposes the relevance of higher topological structures (besides standard graph measures) for characterising functional brain networks under different stimuli. PMID:27880802
Expediting Scientific Data Analysis with Reorganization of Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Byna, Surendra; Wu, Kesheng
2013-08-19
Data producers typically optimize the layout of data files to minimize write time. In most cases, data analysis tasks read these files in access patterns different from the write patterns, causing poor read performance. In this paper, we introduce Scientific Data Services (SDS), a framework for bridging the performance gap between writing and reading scientific data. SDS reorganizes data to match the read patterns of analysis tasks and enables transparent data reads from the reorganized data. We implemented an HDF5 Virtual Object Layer (VOL) plugin to redirect HDF5 dataset read calls to the reorganized data. To demonstrate the effectiveness of SDS, we applied two parallel data organization techniques: a sort-based organization on plasma physics data and a transpose-based organization on mass spectrometry imaging data. We also extended the HDF5 data access API to allow selection of data based on their values through a query interface, called SDS Query. We evaluated the execution time in accessing various subsets of data through the existing HDF5 Read API and SDS Query. We showed that reading the reorganized data using SDS is up to 55X faster than reading the original data.
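The benefit of a sort-based reorganization for value-based selection can be sketched in a few lines: once a read-optimized copy is sorted by value, a range query becomes two binary searches instead of a full scan. The code below is only an in-memory illustration of that idea, not the SDS or HDF5 API; `reorganize`, `range_query` and the sample values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def reorganize(values):
    """Build a read-optimized copy: values sorted, plus their original positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return [values[i] for i in order], order


def range_query(sorted_values, order, low, high):
    """Return original indices of all records with low <= value <= high."""
    start = bisect_left(sorted_values, low)
    stop = bisect_right(sorted_values, high)
    return sorted(order[start:stop])


# Hypothetical particle energies, stored in the order they were written.
energy = [0.7, 3.2, 1.5, 9.8, 2.1, 0.2]
sorted_energy, order = reorganize(energy)
hits = range_query(sorted_energy, order, 1.0, 3.5)
```

Keeping the original positions alongside the sorted values is what lets the caller map query results back to the records in the file as it was written.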
Graph Theoretical Framework of Brain Networks in Multiple Sclerosis: A Review of Concepts.
Fleischer, Vinzenz; Radetz, Angela; Ciolac, Dumitru; Muthuraman, Muthuraman; Gonzalez-Escamilla, Gabriel; Zipp, Frauke; Groppa, Sergiu
2017-11-01
Network science provides powerful access to essential organizational principles of the human brain. It has been applied in combination with graph theory to characterize brain connectivity patterns. In multiple sclerosis (MS), analysis of the brain networks derived from either structural or functional imaging provides new insights into pathological processes within the gray and white matter. Beyond focal lesions and diffuse tissue damage, network connectivity patterns could be important for closely tracking and predicting the disease course. In this review, we describe concepts of graph theory, highlight novel issues of tissue reorganization in acute and chronic neuroinflammation and address pitfalls with regard to network analysis in MS patients. We further provide an outline of functional and structural connectivity patterns observed in MS, spanning from disconnection and disruption on one hand to adaptation and compensation on the other. Moreover, we link network changes and their relation to clinical disability based on the current literature. Finally, we discuss the perspective of network science in MS for future research and postulate its role in the clinical framework. Copyright © 2017 IBRO. Published by Elsevier Ltd. All rights reserved.
Thinking graphically: Connecting vision and cognition during graph comprehension.
Ratwani, Raj M; Trafton, J Gregory; Boehm-Davis, Deborah A
2008-03-01
Task analytic theories of graph comprehension account for the perceptual and conceptual processes required to extract specific information from graphs. Comparatively, the processes underlying information integration have received less attention. We propose a new framework for information integration that highlights visual integration and cognitive integration. During visual integration, pattern recognition processes are used to form visual clusters of information; these visual clusters are then used to reason about the graph during cognitive integration. In 3 experiments, the processes required to extract specific information and to integrate information were examined by collecting verbal protocol and eye movement data. Results supported the task analytic theories for specific information extraction and the processes of visual and cognitive integration for integrative questions. Further, the integrative processes scaled up as graph complexity increased, highlighting the importance of these processes for integration in more complex graphs. Finally, based on this framework, design principles to improve both visual and cognitive integration are described. PsycINFO Database Record (c) 2008 APA, all rights reserved
Visual Routines Are Associated with Specific Graph Interpretations
ERIC Educational Resources Information Center
Michal, Audrey L.; Franconeri, Steven L.
2017-01-01
We argue that people compare values in graphs with a "visual routine"--attending to data values in an ordered pattern over time. Do these visual routines exist to manage capacity limitations in how many values can be encoded at once, or do they actually affect the relations that are extracted? We measured eye movements while people…
New methods for analyzing semantic graph based assessments in science education
NASA Astrophysics Data System (ADS)
Vikaros, Lance Steven
This research investigated how the scoring of semantic graphs (known by many as concept maps) could be improved and automated in order to address issues of inter-rater reliability and scalability. As part of the NSF funded SENSE-IT project to introduce secondary school science students to sensor networks (NSF Grant No. 0833440), semantic graphs illustrating how temperature change affects water ecology were collected from 221 students across 16 schools. The graphing task did not constrain students' use of terms, as is often done with semantic graph based assessment due to coding and scoring concerns. The graphing software used provided real-time feedback to help students learn how to construct graphs, stay on topic and effectively communicate ideas. The collected graphs were scored by human raters using assessment methods expected to boost reliability, which included adaptations of traditional holistic and propositional scoring methods, use of expert raters, topical rubrics, and criterion graphs. High levels of inter-rater reliability were achieved, demonstrating that vocabulary constraints may not be necessary after all. To investigate a new approach to automating the scoring of graphs, thirty-two different graph features characterizing graphs' structure, semantics, configuration and process of construction were then used to predict human raters' scoring of graphs in order to identify feature patterns correlated to raters' evaluations of graphs' topical accuracy and complexity. Results led to the development of a regression model able to predict raters' scoring with 77% accuracy, with 46% accuracy expected when used to score new sets of graphs, as estimated via cross-validation tests. Although such performance is comparable to other graph and essay based scoring systems, cross-context testing of the model and methods used to develop it would be needed before it could be recommended for widespread use. 
Still, the findings suggest techniques for improving the reliability and scalability of semantic graph based assessments without requiring constraint of how ideas are expressed.
Yu, Qingbao; Erhardt, Erik B.; Sui, Jing; Du, Yuhui; He, Hao; Hjelm, Devon; Cetin, Mustafa S.; Rachakonda, Srinivas; Miller, Robyn L.; Pearlson, Godfrey; Calhoun, Vince D.
2014-01-01
Graph theory-based analysis has been widely employed in brain imaging studies, and altered topological properties of brain connectivity have emerged as important features of mental diseases such as schizophrenia. However, most previous studies have focused on graph metrics of stationary brain graphs, ignoring that brain connectivity exhibits fluctuations over time. Here we develop a new framework for accessing dynamic graph properties of time-varying functional brain connectivity in resting state fMRI data and apply it to healthy controls (HCs) and patients with schizophrenia (SZs). Specifically, nodes of brain graphs are defined by intrinsic connectivity networks (ICNs) identified by group independent component analysis (ICA). Dynamic graph metrics of the time-varying brain connectivity estimated by the correlation of sliding time-windowed ICA time courses of ICNs are calculated. First- and second-level connectivity states are detected based on the correlation of nodal connectivity strength between time-varying brain graphs. Our results indicate that SZs show decreased variance in the dynamic graph metrics. Consistent with prior stationary functional brain connectivity works, graph measures of identified first-level connectivity states show lower values in SZs. In addition, more first-level connectivity states are disassociated with the second-level connectivity state which resembles the stationary connectivity pattern computed by the entire scan. Collectively, the findings provide new evidence about altered dynamic brain graphs in schizophrenia which may underscore the abnormal brain performance in this mental illness. PMID:25514514
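The sliding-window step described above can be sketched as follows: for each window, correlate every pair of node time courses to obtain one weighted graph, then track how a graph metric (here, nodal connectivity strength) varies across windows. This is a minimal pure-Python sketch under assumed inputs; the function names, the window length and the toy time courses are hypothetical, and the study's ICA preprocessing and state detection are not reproduced.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def sliding_window_graphs(series, window, step=1):
    """One weighted graph per window: {(node_a, node_b): correlation}."""
    names = sorted(series)
    length = len(series[names[0]])
    graphs = []
    for start in range(0, length - window + 1, step):
        edges = {}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                edges[(a, b)] = pearson(series[a][start:start + window],
                                        series[b][start:start + window])
        graphs.append(edges)
    return graphs


def strength_variance(graphs, node):
    """Variance across windows of a node's strength (sum of |correlation|)."""
    strengths = [sum(abs(r) for pair, r in g.items() if node in pair)
                 for g in graphs]
    mean = sum(strengths) / len(strengths)
    return sum((s - mean) ** 2 for s in strengths) / len(strengths)


# Hypothetical time courses for three nodes (8 time points each).
series = {
    "icn1": [1, 2, 3, 4, 5, 6, 7, 8],
    "icn2": [2, 4, 6, 8, 7, 5, 3, 1],
    "icn3": [3, 1, 4, 1, 5, 9, 2, 6],
}
graphs = sliding_window_graphs(series, window=4)
variability = strength_variance(graphs, "icn1")
```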
Model-based multiple patterning layout decomposition
NASA Astrophysics Data System (ADS)
Guo, Daifeng; Tian, Haitong; Du, Yuelin; Wong, Martin D. F.
2015-10-01
As one of the most promising next-generation lithography technologies, multiple patterning lithography (MPL) plays an important role in the attempts to keep pace with the 10 nm technology node and beyond. As feature sizes keep shrinking, it has become impossible to print dense layouts within a single exposure. As a result, MPL such as double patterning lithography (DPL) and triple patterning lithography (TPL) has been widely adopted. There is a large volume of literature on DPL/TPL layout decomposition, and the current approach is to formulate the problem as a classical graph-coloring problem: layout features (polygons) are represented by vertices in a graph G, and there is an edge between two vertices if and only if the distance between the two corresponding features is less than a minimum distance threshold value dmin. The problem is to color the vertices of G using k colors (k = 2 for DPL, k = 3 for TPL) such that no two vertices connected by an edge are given the same color. This is a rule-based approach, which imposes a geometric distance as a minimum constraint and simply decomposes polygons within that distance into different masks. It is not desirable in practice because this criterion cannot completely capture the behavior of the optics. For example, it lacks information such as the optical source characteristics and the effects between polygons outside the minimum distance. To remedy this deficiency, a model-based layout decomposition approach that bases the decomposition criteria on simulation results was first introduced at SPIE 2013 [1]. However, the algorithm [1] is based on simplified assumptions about the optical simulation model, and therefore its usage on real layouts is limited. Recently, ASML [2] also proposed a model-based approach to layout decomposition by iteratively simulating the layout, which requires excessive computational resources and may lead to sub-optimal solutions. The approach [2] also potentially generates too many stitches. 
In this paper, we propose a model-based MPL layout decomposition method using a pre-simulated library of frequent layout patterns. Instead of using the graph G in the standard graph-coloring formulation, we build an expanded graph H where each vertex represents a group of adjacent features together with a coloring solution. By utilizing the library and running sophisticated graph algorithms on H, our approach can obtain optimal decomposition results efficiently. Our model-based solution can achieve a practical mask design which significantly improves the lithography quality on the wafer compared to the rule based decomposition.
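The rule-based formulation described above (build a conflict graph from the dmin threshold, then k-color it) can be sketched as follows. This is not the model-based method the abstract proposes: features are reduced to hypothetical center points rather than polygons, and `build_conflict_graph`, `k_color` and the coordinates are assumptions made for illustration.

```python
from itertools import combinations


def build_conflict_graph(features, d_min):
    """Connect two features when their center distance is below d_min."""
    graph = {name: set() for name in features}
    for a, b in combinations(features, 2):
        (xa, ya), (xb, yb) = features[a], features[b]
        if ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 < d_min:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def k_color(graph, k):
    """Backtracking k-coloring; returns {vertex: mask index} or None if impossible."""
    order = sorted(graph, key=lambda v: -len(graph[v]))  # most constrained first
    colors = {}

    def assign(i):
        if i == len(order):
            return True
        vertex = order[i]
        for color in range(k):
            if all(colors.get(u) != color for u in graph[vertex]):
                colors[vertex] = color
                if assign(i + 1):
                    return True
                del colors[vertex]
        return False

    if assign(0):
        return dict(colors)
    return None


# Hypothetical feature centers in nm; p1, p2, p3 are mutually closer than d_min.
features = {"p1": (0, 0), "p2": (30, 0), "p3": (15, 20), "p4": (100, 0)}
conflicts = build_conflict_graph(features, d_min=40)
double_patterning = k_color(conflicts, 2)  # the triangle cannot be 2-colored
triple_patterning = k_color(conflicts, 3)
```

The three mutually conflicting features form an odd cycle, which is the textbook case where two masks fail and a third is needed.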
Using Zipf-Mandelbrot law and graph theory to evaluate animal welfare
NASA Astrophysics Data System (ADS)
de Oliveira, Caprice G. L.; Miranda, José G. V.; Japyassú, Hilton F.; El-Hani, Charbel N.
2018-02-01
This work deals with the construction and testing of metrics of welfare based on behavioral complexity, using assumptions derived from Zipf-Mandelbrot law and graph theory. To test these metrics we compared yellow-breasted capuchins (Sapajus xanthosternos) (Wied-Neuwied, 1826) (PRIMATES CEBIDAE) found in two institutions, subjected to different captive conditions: a Zoobotanical Garden (hereafter, ZOO; n = 14), in good welfare condition, and a Wildlife Rescue Center (hereafter, WRC; n = 8), in poor welfare condition. In the Zipf-Mandelbrot-based analysis, the power law exponent was calculated using behavior frequency values versus behavior rank value. These values allow us to evaluate variations in individual behavioral complexity. For each individual we also constructed a graph using the sequence of behavioral units displayed in each recording (average recording time per individual: 4 h 26 min in the ZOO, 4 h 30 min in the WRC). Then, we calculated the values of the main graph attributes, which allowed us to analyze the complexity of the connectivity of the behaviors displayed in the individuals' behavioral sequences. We found significant differences between the two groups for the slope values in the Zipf-Mandelbrot analysis. The slope values for the ZOO individuals approached -1, with graphs representing a power law, while the values for the WRC individuals diverged from -1, differing from a power law pattern. Likewise, we found significant differences for the graph attributes average degree, weighted average degree, and clustering coefficient when comparing the ZOO and WRC individual graphs. However, no significant difference was found for the attributes modularity and average path length. Both analyses were effective in detecting differences between the patterns of behavioral complexity in the two groups. The slope values for the ZOO individuals indicated a higher behavioral complexity when compared to the WRC individuals. 
Similarly, graph construction and the calculation of its attributes values allowed us to show that the complexity of the connectivity among the behaviors was higher in the ZOO than in the WRC individual graphs. These results show that the two measuring approaches introduced and tested in this paper were capable of capturing the differences in welfare levels between the two conditions, as shown by differences in behavioral complexity.
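The two measures described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes a plain log-log least-squares fit of frequency against rank (without the Mandelbrot shift term) and an undirected graph whose edges link consecutive behavioral units. The function names `zipf_slope` and `average_degree` are hypothetical.

```python
import math
from collections import Counter

def zipf_slope(behaviors):
    """Least-squares slope of log(frequency) versus log(rank)."""
    counts = sorted(Counter(behaviors).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct behaviors")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def average_degree(behaviors):
    """Average degree of the undirected graph linking consecutive behaviors."""
    # Assumption: self-transitions (same behavior twice in a row) add no edge.
    edges = {frozenset(p) for p in zip(behaviors, behaviors[1:]) if p[0] != p[1]}
    return 2 * len(edges) / len(set(behaviors))
```

A sequence whose frequencies fall off as 1/rank yields a slope close to -1, the value the ZOO individuals approached.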
Indexed variation graphs for efficient and accurate resistome profiling.
Rowe, Will P M; Winn, Martyn D
2018-05-14
Antimicrobial resistance remains a major threat to global health. Profiling the collective antimicrobial resistance genes within a metagenome (the "resistome") facilitates greater understanding of antimicrobial resistance gene diversity and dynamics. In turn, this can allow for gene surveillance, individualised treatment of bacterial infections and more sustainable use of antimicrobials. However, resistome profiling can be complicated by high similarity between reference genes, as well as the sheer volume of sequencing data and the complexity of analysis workflows. We have developed an efficient and accurate method for resistome profiling that addresses these complications and improves upon currently available tools. Our method combines a variation graph representation of gene sets with an LSH Forest indexing scheme to allow for fast classification of metagenomic sequence reads using similarity-search queries. Subsequent hierarchical local alignment of classified reads against graph traversals enables accurate reconstruction of full-length gene sequences using a scoring scheme. We provide our implementation, GROOT, and show it to be both faster and more accurate than a current reference-dependent tool for resistome profiling. GROOT runs on a laptop and can process a typical 2 gigabyte metagenome in 2 minutes using a single CPU. Our method is not restricted to resistome profiling and has the potential to improve current metagenomic workflows. GROOT is written in Go and is available at https://github.com/will-rowe/groot (MIT license). Contact: will.rowe@stfc.ac.uk. Supplementary data are available at Bioinformatics online.
Simon, Steffen T; Higginson, Irene J; Benalia, Hamid; Gysels, Marjolein; Murtagh, Fliss Em; Spicer, James; Bausewein, Claudia
2013-06-01
Despite the high prevalence and impact of episodic breathlessness, information about characteristics and patterns is scarce. To explore the experience of patients with advanced disease suffering from episodic breathlessness, in order to describe types and patterns. Qualitative design using in-depth interviews with patients suffering from advanced stages of chronic heart failure, chronic obstructive pulmonary disease, lung cancer or motor neurone disease. As part of the interviews, patients were asked to draw a graph to illustrate typical patterns of breathlessness episodes. Interviews were tape-recorded, transcribed verbatim and analysed using Framework Analysis. The graphs were grouped according to their patterns. Fifty-one participants (15 chronic heart failure, 14 chronic obstructive pulmonary disease, 13 lung cancer and 9 motor neurone disease) were included (mean age 68.2 years, 30 of 51 men, mean Karnofsky 63.1, mean breathlessness intensity 3.2 of 10). Five different types of episodic breathlessness were described: triggered with normal level of breathlessness, triggered with predictable response (always related to trigger level, e.g. slight exertion causes severe breathlessness), triggered with unpredictable response (not related to trigger level), non-triggered attack-like (quick onset, often severe) and wave-like (triggered or non-triggered, gradual onset). Four patterns of episodic breathlessness could be identified based on the graphs with differences regarding onset and recovery of episodes. These did not correspond with the types of breathlessness described before. Patients with advanced disease experience clearly distinguishable types and patterns of episodic breathlessness. The understanding of these will help clinicians to tailor specific management strategies for patients who suffer from episodes of breathlessness.
Immune networks: multitasking capabilities near saturation
NASA Astrophysics Data System (ADS)
Agliari, E.; Annibale, A.; Barra, A.; Coolen, A. C. C.; Tantari, D.
2013-10-01
Pattern-diluted associative networks were recently introduced as models for the immune system, with nodes representing T-lymphocytes and stored patterns representing signalling protocols between T- and B-lymphocytes. It was shown earlier that in the regime of extreme pattern dilution, a system with $N_T$ T-lymphocytes can manage a number $N_B = \mathcal{O}(N_T^{\delta})$ of B-lymphocytes simultaneously, with $\delta < 1$. Here we study this model in the extensive load regime $N_B = \alpha N_T$, with a high degree of pattern dilution, in agreement with immunological findings. We use graph theory and statistical mechanical analysis based on replica methods to show that in the finite-connectivity regime, where each T-lymphocyte interacts with a finite number of B-lymphocytes as $N_T \to \infty$, the T-lymphocytes can coordinate effective immune responses to an extensive number of distinct antigen invasions in parallel. As $\alpha$ increases, the system eventually undergoes a second order transition to a phase with clonal cross-talk interference, where the system’s performance degrades gracefully. Mathematically, the model is equivalent to a spin system on a finitely connected graph with many short loops, so one would expect the available analytical methods, which all assume locally tree-like graphs, to fail. Yet it turns out to be solvable. Our results are supported by numerical simulations.
Dimitriadis, S I; Laskaris, N A; Tzelepi, A; Economou, G
2012-05-01
There is growing interest in studying the association of functional connectivity patterns with particular cognitive tasks. The ability of graphs to encapsulate relational data has been exploited in many related studies, where functional networks (sketched by different neural synchrony estimators) are characterized by a rich repertoire of graph-related metrics. We introduce commute times (CTs) as an alternative way to capture the true interplay between the nodes of a functional connectivity graph (FCG). CT is a measure of the time taken for a random walk to set out and return between a pair of nodes on a graph. Its computation is considered here as a robust and accurate integration, over the FCG, of the individual pairwise measurements of functional coupling. To demonstrate the benefits of our approach, we attempted the characterization of time-evolving connectivity patterns derived from EEG signals recorded while the subject was engaged in an eye-movement task. Compared with the standard ways currently employed to characterize connectivity, an improved detection of event-related dynamical changes is noticeable. CTs appear to be a promising technique for deriving temporal fingerprints of the brain's dynamic functional organization.
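For reference, the commute time between two nodes has a standard closed form in terms of the graph Laplacian. The expression below is the textbook identity for a connected, undirected, weighted graph; the symbols $w_{ij}$, $d_i$, $L$ and $L^{+}$ are notation assumed here and are not taken from the abstract, which does not state how the authors computed CTs.

```latex
% Edge weights w_{ij} (e.g. pairwise functional coupling), node degrees d_i,
% Laplacian L = D - W, and its Moore-Penrose pseudoinverse L^{+}.
\[
  d_i = \sum_{j} w_{ij}, \qquad
  L = D - W, \qquad
  \operatorname{vol}(G) = \sum_{i} d_i ,
\]
\[
  \mathrm{CT}(i,j) \;=\; \operatorname{vol}(G)\,
  \bigl( L^{+}_{ii} + L^{+}_{jj} - 2\,L^{+}_{ij} \bigr).
\]
```

Because the pseudoinverse aggregates all paths between the two nodes, the resulting distance reflects the whole graph rather than a single pairwise coupling value.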
Sleep-wake time perception varies by direct or indirect query.
Alameddine, Y; Ellenbogen, J M; Bianchi, M T
2015-01-15
The diagnosis of insomnia rests on self-report of difficulty initiating or maintaining sleep. However, subjective reports may be unreliable, and possibly may vary by the method of inquiry. We investigated this possibility by comparing within-individual response to direct versus indirect time queries after overnight polysomnography. We obtained self-reported sleep-wake times via morning questionnaires in 879 consecutive adult diagnostic polysomnograms. Responses were compared within subjects (direct versus indirect query) and across groups defined by apnea-hypopnea index and by self-reported insomnia symptoms in pre-sleep questionnaires. Direct queries required a time duration response, while indirect queries required clock times from which we calculated time durations. Direct and indirect queries of sleep latency were the same in only 41% of cases, and total sleep time queries matched in only 5.4%. For both latency and total sleep, the most common discrepancy involved the indirect value being larger than the direct response. The discrepancy between direct and indirect queries was not related to objective sleep metrics. The degree of discrepancy was not related to the presence of insomnia symptoms, although patients reporting insomnia symptoms showed underestimation of total sleep duration by direct response. Self-reported sleep latency and total sleep time are often internally inconsistent when comparing direct and indirect survey queries of each measure. These discrepancies represent substantive challenges to effective clinical practice, particularly when diagnosis and management depends on self-reported sleep patterns, as with insomnia. Although self-reported sleep-wake times remains fundamental to clinical practice, objective measures provide clinically relevant adjunctive information. © 2015 American Academy of Sleep Medicine.
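The distinction between the two query types can be illustrated with a short sketch: an indirect query supplies clock times that must be converted to a duration, which is then compared with the directly reported duration. The function names and the "HH:MM" input format are assumptions made for illustration; the abstract does not describe the questionnaire's data format.

```python
from datetime import datetime, timedelta

def duration_minutes(start_clock, end_clock):
    """Minutes from start_clock to end_clock ("HH:MM"), rolling past midnight."""
    start = datetime.strptime(start_clock, "%H:%M")
    end = datetime.strptime(end_clock, "%H:%M")
    if end < start:
        # Assumption: an earlier end time means the interval crossed midnight.
        end += timedelta(days=1)
    return int((end - start).total_seconds() // 60)

def discrepancy(direct_minutes, start_clock, end_clock):
    """Indirect minus direct estimate; positive means the indirect value is larger."""
    return duration_minutes(start_clock, end_clock) - direct_minutes
```

For example, a patient who reports sleeping 6 hours (direct) but gives 23:30 and 06:15 as sleep and wake times (indirect) has an indirect value 45 minutes larger, the direction of discrepancy the study found most common.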
Graphical Representations of Electronic Search Patterns.
ERIC Educational Resources Information Center
Lin, Xia; And Others
1991-01-01
Discussion of search behavior in electronic environments focuses on the development of GRIP (Graphic Representor of Interaction Patterns), a graphing tool based on HyperCard that produces graphic representations of search patterns. Search state spaces are explained, and forms of data available from electronic searches are described. (34…
Huang, Chung-Chi; Lu, Zhiyong
2016-01-01
Identifying relevant papers from the literature is a common task in biocuration. Most current biomedical literature search systems primarily rely on matching user keywords. Semantic search, on the other hand, seeks to improve search accuracy by understanding the entities and contextual relations in user keywords. However, past research has mostly focused on semantically identifying biological entities (e.g. chemicals, diseases and genes) with little effort on discovering semantic relations. In this work, we aim to discover biomedical semantic relations in PubMed queries in an automated and unsupervised fashion. Specifically, we focus on extracting and understanding the contextual information (or context patterns) that is used by PubMed users to represent semantic relations between entities such as 'CHEMICAL-1 compared to CHEMICAL-2'. With the advances in automatic named entity recognition, we first tag entities in PubMed queries and then use tagged entities as knowledge to recognize pattern semantics. More specifically, we transform PubMed queries into context patterns involving participating entities, which are subsequently projected to latent topics via latent semantic analysis (LSA) to avoid the data sparseness and specificity issues. Finally, we mine semantically similar contextual patterns or semantic relations based on LSA topic distributions. Our two separate evaluation experiments of chemical-chemical (CC) and chemical-disease (CD) relations show that the proposed approach significantly outperforms a baseline method, which simply measures pattern semantics by similarity in participating entities. The highest performance achieved by our approach is nearly 0.9 and 0.85 respectively for the CC and CD task when compared against the ground truth in terms of normalized discounted cumulative gain (nDCG), a standard measure of ranking quality. 
These results suggest that our approach can effectively identify and return related semantic patterns in a ranked order covering diverse bio-entity relations. To assess the potential utility of our automated top-ranked patterns of a given relation in semantic search, we performed a pilot study on frequently sought semantic relations in PubMed and observed improved literature retrieval effectiveness based on post-hoc human relevance evaluation. Further investigation in larger tests and in real-world scenarios is warranted. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.
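The evaluation metric named in this abstract, nDCG, can be sketched in a few lines. This version assumes the common formulation with linear gain and a log2 position discount; the abstract does not say which variant the authors used, and the function names are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that already lists items in descending relevance scores 1.0, and any misordering lowers the score toward 0.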
Zhou, Mu; Zhang, Qiao; Xu, Kunjie; Tian, Zengshan; Wang, Yanmeng; He, Wei
2015-01-01
Due to the wide deployment of wireless local area networks (WLAN), received signal strength (RSS)-based indoor WLAN localization has attracted considerable attention in both academia and industry. In this paper, we propose a novel page rank-based indoor mapping and localization (PRIMAL) by using the gene-sequenced unlabeled WLAN RSS for simultaneous localization and mapping (SLAM). Specifically, first of all, based on the observation of the motion patterns of the people in the target environment, we use the Allen logic to construct the mobility graph to characterize the connectivity among different areas of interest. Second, the concept of gene sequencing is utilized to assemble the sporadically-collected RSS sequences into a signal graph based on the transition relations among different RSS sequences. Third, we apply the graph drawing approach to exhibit both the mobility graph and signal graph in a more readable manner. Finally, the page rank (PR) algorithm is proposed to construct the mapping from the signal graph into the mobility graph. The experimental results show that the proposed approach achieves satisfactory localization accuracy and meanwhile avoids the intensive time and labor cost involved in the conventional location fingerprinting-based indoor WLAN localization. PMID:26404274
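The page rank step mentioned in this abstract builds on the standard PageRank computation, sketched below as a plain power iteration. This is a generic illustration only: how PRIMAL uses the scores to map the signal graph onto the mobility graph is not described in the abstract, and the damping factor of 0.85 and the dictionary-based graph format are assumptions.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration; graph maps each node to a list of successor nodes.

    Every successor must itself be a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without successors is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in graph[v]:
                new_rank[w] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank
```

On a directed 3-cycle every node receives the same score, while a node that all others point to receives the highest score.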
Inferring ontology graph structures using OWL reasoning.
Rodríguez-García, Miguel Ángel; Hoehndorf, Robert
2018-01-05
Ontologies are representations of a conceptualization of a domain. Traditionally, ontologies in biology were represented as directed acyclic graphs (DAG) which represent the backbone taxonomy and additional relations between classes. These graphs are widely exploited for data analysis in the form of ontology enrichment or computation of semantic similarity. More recently, ontologies are developed in a formal language such as the Web Ontology Language (OWL) and consist of a set of axioms through which classes are defined or constrained. While the taxonomy of an ontology can be inferred directly from the axioms of an ontology as one of the standard OWL reasoning tasks, creating general graph structures from OWL ontologies that exploit the ontologies' semantic content remains a challenge. We developed a method to transform ontologies into graphs using an automated reasoner while taking into account all relations between classes. Searching for (existential) patterns in the deductive closure of ontologies, we can identify relations between classes that are implied but not asserted and generate graph structures that encode for a large part of the ontologies' semantic content. We demonstrate the advantages of our method by applying it to inference of protein-protein interactions through semantic similarity over the Gene Ontology and demonstrate that performance is increased when graph structures are inferred using deductive inference according to our method. Our software and experiment results are available at http://github.com/bio-ontology-research-group/Onto2Graph . Onto2Graph is a method to generate graph structures from OWL ontologies using automated reasoning. The resulting graphs can be used for improved ontology visualization and ontology-based data analysis.
Projections for fast protein structure retrieval
Bhattacharya, Sourangshu; Bhattacharyya, Chiranjib; Chandra, Nagasuma R
2006-01-01
Background In recent times, there has been an exponential rise in the number of protein structures in databases e.g. PDB. So, design of fast algorithms capable of querying such databases is becoming an increasingly important research issue. This paper reports an algorithm, motivated from spectral graph matching techniques, for retrieving protein structures similar to a query structure from a large protein structure database. Each protein structure is specified by the 3D coordinates of residues of the protein. The algorithm is based on a novel characterization of the residues, called projections, leading to a similarity measure between the residues of the two proteins. This measure is exploited to efficiently compute the optimal equivalences. Results Experimental results show that, the current algorithm outperforms the state of the art on benchmark datasets in terms of speed without losing accuracy. Search results on SCOP 95% nonredundant database, for fold similarity with 5 proteins from different SCOP classes show that the current method performs competitively with the standard algorithm CE. The algorithm is also capable of detecting non-topological similarities between two proteins which is not possible with most of the state of the art tools like Dali. PMID:17254310
Sumner, Walton; Xu, Jin Zhong; Roussel, Guy; Hagen, Michael D
2007-10-11
The American Board of Family Medicine deployed virtual patient simulations in 2004 to evaluate Diplomates' diagnostic and management skills. A previously reported dynamic process generates general symptom histories from time series data representing baseline values and reactions to medications. The simulator also must answer queries about details such as palliation and provocation. These responses often describe some recurring pattern, such as, "this medicine relieves my symptoms in a few minutes." The simulator can provide a detail stored as text, or it can evaluate a reference to a second query object. The second query object can generate details using a single Bayesian network to evaluate the effect of each drug in a virtual patient's medication list. A new medication option may not require redesign of the second query object if its implementation is consistent with related drugs. We expect this mechanism to maintain realistic responses to detail questions in complex simulations.
A novel approach of an absolute coding pattern based on Hamiltonian graph
NASA Astrophysics Data System (ADS)
Wang, Ya'nan; Wang, Huawei; Hao, Fusheng; Liu, Liqiang
2017-02-01
In this paper, a novel coding pattern for an optical absolute rotary encoder is presented. The concept follows the principle of the absolute encoder: find a unique sequence that ensures an unambiguous shaft position at any angle. We design a single-ring and an n-by-2 matrix absolute encoder coding pattern by using variations of the Hamiltonian graph principle. In the single ring, 12 encoding bits are read by a linear array CCD to achieve a 1080-position cycle encoding. In addition, a 2-by-2 matrix is used as a unit in the 2-track disk to achieve a 16-bit encoding pattern by using an area array CCD sensor (as a sample). Finally, a higher resolution can be gained by electronic subdivision of the signals. Compared with the conventional Gray or binary code pattern (for a 2^n resolution), this new pattern has a higher resolution (2^n*n) with fewer coding tracks, which means the new pattern can lead to a smaller encoder, which is essential in industrial production.
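The property that makes a single-ring code absolute is that every window of consecutive bits read by the sensor occurs only once around the ring. The sketch below checks that property and builds a lookup from window to position. It is not the authors' Hamiltonian-graph construction; the 8-bit track and the function name `window_table` are hypothetical examples chosen for brevity.

```python
def window_table(track, width):
    """Map each cyclic window of `width` symbols to its start index.

    Raises ValueError if any window repeats, since a repeated window
    would make the shaft position ambiguous.
    """
    n = len(track)
    table = {}
    for start in range(n):
        window = "".join(track[(start + k) % n] for k in range(width))
        if window in table:
            raise ValueError(
                f"window {window!r} repeats at {table[window]} and {start}"
            )
        table[window] = start
    return table

# Illustrative ring: all eight cyclic 3-bit windows are distinct.
table = window_table("00010111", 3)
```

Reading the window "101" from this ring identifies position 3 without any reference mark, which is the behavior an absolute encoder needs.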
A Ruby API to query the Ensembl database for genomic features.
Strozzi, Francesco; Aerts, Jan
2011-04-01
The Ensembl database makes genomic features available via its Genome Browser. It is also possible to access the underlying data through a Perl API for advanced querying. We have developed a full-featured Ruby API to the Ensembl databases, providing the same functionality as the Perl interface with additional features. A single Ruby API is used to access different releases of the Ensembl databases and is also able to query multi-species databases. Most functionality of the API is provided using the ActiveRecord pattern. The library depends on introspection to make it release independent. The API is available through the Rubygem system and can be installed with the command gem install ruby-ensembl-api.
Patch-based iterative conditional geostatistical simulation using graph cuts
NASA Astrophysics Data System (ADS)
Li, Xue; Mariethoz, Gregoire; Lu, DeTang; Linde, Niklas
2016-08-01
Training image-based geostatistical methods are increasingly popular in groundwater hydrology even if existing algorithms present limitations that often make real-world applications difficult. These limitations include a computational cost that can be prohibitive for high-resolution 3-D applications, the presence of visual artifacts in the model realizations, and a low variability between model realizations due to the limited pool of patterns available in a finite-size training image. In this paper, we address these issues by proposing an iterative patch-based algorithm which adapts a graph cuts methodology that is widely used in computer graphics. Our adapted graph cuts method optimally cuts patches of pixel values borrowed from the training image and assembles them successively, each time accounting for the information of previously stitched patches. The initial simulation result might display artifacts, which are identified as regions of high cost. These artifacts are reduced by iteratively placing new patches in high-cost regions. In contrast to most patch-based algorithms, the proposed scheme can also efficiently address point conditioning. An advantage of the method is that the cut process results in the creation of new patterns that are not present in the training image, thereby increasing pattern variability. To quantify this effect, a new measure of variability, the merging index, is developed; it quantifies the pattern variability in the realizations with respect to the training image. A series of sensitivity analyses demonstrates the stability of the proposed graph cuts approach, which produces satisfying simulations for a wide range of parameter values. Applications to 2-D and 3-D cases are compared to state-of-the-art multiple-point methods. The results show that the proposed approach obtains significant speedups and increases variability between realizations. 
Connectivity functions applied to 2-D models and transport simulations in 3-D models are used to demonstrate that pattern continuity is preserved.
Tana, Jonas Christoffer; Kettunen, Jyrki; Eirola, Emil; Paakkonen, Heikki
2018-01-01
Background Some of the temporal variations and clock-like rhythms that govern several different health-related behaviors can be traced in near real-time with the help of search engine data. This is especially useful when studying phenomena where little or no traditional data exist. One specific area where traditional data are incomplete is the study of diurnal mood variations, or daily changes in individuals’ overall mood state in relation to depression-like symptoms. Objective The objective of this exploratory study was to analyze diurnal variations for interest in depression on the Web to discover hourly patterns of depression interest and help seeking. Methods Hourly query volume data for 6 depression-related queries in Finland were downloaded from Google Trends in March 2017. A continuous wavelet transform (CWT) was applied to the hourly data to focus on the diurnal variation. Longer term trends and noise were also eliminated from the data to extract the diurnal variation for each query term. An analysis of variance was conducted to determine the statistical differences between the distributions of each hour. Data were also trichotomized and analyzed in 3 time blocks to make comparisons between different time periods during the day. Results Search volumes for all depression-related query terms showed a unimodal regular pattern during the 24 hours of the day. All queries feature clear peaks during the nighttime hours around 11 PM to 4 AM and troughs between 5 AM and 10 PM. In the means of the CWT-reconstructed data, the differences in nighttime and daytime interest are evident, with a difference of 37.3 percentage points (pp) for the term “Depression,” 33.5 pp for “Masennustesti,” 30.6 pp for “Masennus,” 12.8 pp for “Depression test,” 12.0 pp for “Masennus testi,” and 11.8 pp for “Masennus oireet.” The trichotomization showed peaks in the first time block (00.00 AM-7.59 AM) for all 6 terms. 
The search volumes then decreased significantly during the second time block (8.00 AM-3.59 PM) for the terms “Masennus oireet” (P<.001), “Masennus” (P=.001), “Depression” (P=.005), and “Depression test” (P=.004). Higher search volumes for the terms “Masennus” (P=.14), “Masennustesti” (P=.07), and “Depression test” (P=.10) were present between the second and third time blocks. Conclusions Help seeking for depression has clear diurnal patterns, with significant rise in depression-related query volumes toward the evening and night. Thus, search engine query data support the notion of the evening-worse pattern in diurnal mood variation. Information on the timely nature of depression-related interest on an hourly level could improve the chances for early intervention, which is beneficial for positive health outcomes. PMID:29792291
Tana, Jonas Christoffer; Kettunen, Jyrki; Eirola, Emil; Paakkonen, Heikki
2018-05-23
Some of the temporal variations and clock-like rhythms that govern several different health-related behaviors can be traced in near real-time with the help of search engine data. This is especially useful when studying phenomena where little or no traditional data exist. One specific area where traditional data are incomplete is the study of diurnal mood variations, or daily changes in individuals' overall mood state in relation to depression-like symptoms. The objective of this exploratory study was to analyze diurnal variations for interest in depression on the Web to discover hourly patterns of depression interest and help seeking. Hourly query volume data for 6 depression-related queries in Finland were downloaded from Google Trends in March 2017. A continuous wavelet transform (CWT) was applied to the hourly data to focus on the diurnal variation. Longer term trends and noise were also eliminated from the data to extract the diurnal variation for each query term. An analysis of variance was conducted to determine the statistical differences between the distributions of each hour. Data were also trichotomized and analyzed in 3 time blocks to make comparisons between different time periods during the day. Search volumes for all depression-related query terms showed a unimodal regular pattern during the 24 hours of the day. All queries feature clear peaks during the nighttime hours around 11 PM to 4 AM and troughs between 5 AM and 10 PM. In the means of the CWT-reconstructed data, the differences in nighttime and daytime interest are evident, with a difference of 37.3 percentage points (pp) for the term "Depression," 33.5 pp for "Masennustesti," 30.6 pp for "Masennus," 12.8 pp for "Depression test," 12.0 pp for "Masennus testi," and 11.8 pp for "Masennus oireet." The trichotomization showed peaks in the first time block (00.00 AM-7.59 AM) for all 6 terms. 
The search volumes then decreased significantly during the second time block (8.00 AM-3.59 PM) for the terms "Masennus oireet" (P<.001), "Masennus" (P=.001), "Depression" (P=.005), and "Depression test" (P=.004). Higher search volumes for the terms "Masennus" (P=.14), "Masennustesti" (P=.07), and "Depression test" (P=.10) were present between the second and third time blocks. Help seeking for depression has clear diurnal patterns, with significant rise in depression-related query volumes toward the evening and night. Thus, search engine query data support the notion of the evening-worse pattern in diurnal mood variation. Information on the timely nature of depression-related interest on an hourly level could improve the chances for early intervention, which is beneficial for positive health outcomes. ©Jonas Christoffer Tana, Jyrki Kettunen, Emil Eirola, Heikki Paakkonen. Originally published in JMIR Mental Health (http://mental.jmir.org), 23.05.2018.
Classification of Automated Search Traffic
NASA Astrophysics Data System (ADS)
Buehrer, Greg; Stokes, Jack W.; Chellapilla, Kumar; Platt, John C.
As web search providers seek to improve both relevance and response times, they are challenged by the ever-increasing tax of automated search query traffic. Third party systems interact with search engines for a variety of reasons, such as monitoring a web site’s rank, augmenting online games, or possibly to maliciously alter click-through rates. In this paper, we investigate automated traffic (sometimes referred to as bot traffic) in the query stream of a large search engine provider. We define automated traffic as any search query not generated by a human in real time. We first provide examples of different categories of query logs generated by automated means. We then develop many different features that distinguish between queries generated by people searching for information, and those generated by automated processes. We categorize these features into two classes, either an interpretation of the physical model of human interactions, or as behavioral patterns of automated interactions. Using these detection features, we next classify the query stream using multiple binary classifiers. In addition, a multiclass classifier is then developed to identify subclasses of both normal and automated traffic. An active learning algorithm is used to suggest which user sessions to label to improve the accuracy of the multiclass classifier, while also seeking to discover new classes of automated traffic. A performance analysis is then provided. Finally, the multiclass classifier is used to predict the subclass distribution for the search query stream.
Web queries as a source for syndromic surveillance.
Hulth, Anette; Rydevik, Gustaf; Linde, Annika
2009-01-01
In the field of syndromic surveillance, various sources are exploited for outbreak detection, monitoring and prediction. This paper describes a study on queries submitted to a medical web site, with influenza as a case study. The hypothesis of the work was that queries on influenza and influenza-like illness would provide a basis for the estimation of the timing of the peak and the intensity of the yearly influenza outbreaks that would be as good as the existing laboratory and sentinel surveillance. We calculated the occurrence of various queries related to influenza from search logs submitted to a Swedish medical web site for two influenza seasons. These figures were subsequently used to generate two models, one to estimate the number of laboratory verified influenza cases and one to estimate the proportion of patients with influenza-like illness reported by selected General Practitioners in Sweden. We applied an approach designed for highly correlated data, partial least squares regression. In our work, we found that certain web queries on influenza follow the same pattern as that obtained by the two other surveillance systems for influenza epidemics, and that they have equal power for the estimation of the influenza burden in society. Web queries give a unique access to ill individuals who are not (yet) seeking care. This paper shows the potential of web queries as an accurate, cheap and labour extensive source for syndromic surveillance.
Pattern detection in forensic case data using graph theory: application to heroin cutting agents.
Terrettaz-Zufferey, Anne-Laure; Ratle, Frédéric; Ribaux, Olivier; Esseiva, Pierre; Kanevski, Mikhail
2007-04-11
Pattern recognition techniques can be very useful in forensic sciences to point to relevant sets of events and potentially encourage an intelligence-led style of policing. In this study, these techniques have been applied to categorical data corresponding to cutting agents found in heroin seizures. Graph theoretic methods were applied to highlight possible relationships between the location of seizures and co-occurrences of particular heroin cutting agents. The co-occurrences were then analysed to establish several main combinations. Results illustrate the practical potential of mathematical models in forensic data analysis.
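A co-occurrence graph of the kind this abstract refers to can be sketched by counting how often each pair of agents appears in the same seizure; the counts become edge weights. The function name and the placeholder labels in the usage note are invented for illustration and are not data from the study.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(seizures):
    """Count how often each pair of cutting agents appears in the same seizure."""
    edges = Counter()
    for agents in seizures:
        # Sort and deduplicate so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(agents)), 2):
            edges[pair] += 1
    return edges
```

Given three made-up seizures `[["A", "B"], ["B", "A", "C"], ["A"]]`, the pair `("A", "B")` receives weight 2, and the heaviest edges are natural candidates for the main combinations.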
Burrows, Nilka R.; Geiss, Linda S.
2014-01-01
The Diabetes Interactive Atlas is a recently released Web-based collection of maps that allows users to view geographic patterns and examine trends in diabetes and its risk factors over time across the United States and within states. The atlas provides maps, tables, graphs, and motion charts that depict national, state, and county data. Large amounts of data can be viewed in various ways simultaneously. In this article, we describe the design and technical issues for developing the atlas and provide an overview of the atlas’ maps and graphs. The Diabetes Interactive Atlas improves visualization of geographic patterns, highlights observation of trends, and demonstrates the concomitant geographic and temporal growth of diabetes and obesity. PMID:24503340
Bone marrow cavity segmentation using graph-cuts with wavelet-based texture feature.
Shigeta, Hironori; Mashita, Tomohiro; Kikuta, Junichi; Seno, Shigeto; Takemura, Haruo; Ishii, Masaru; Matsuda, Hideo
2017-10-01
Emerging bioimaging technologies enable us to capture various dynamic cellular activities [Formula: see text]. As large amounts of data are obtained these days and it is becoming unrealistic to manually process massive number of images, automatic analysis methods are required. One of the issues for automatic image segmentation is that image-taking conditions are variable. Thus, commonly, many manual inputs are required according to each image. In this paper, we propose a bone marrow cavity (BMC) segmentation method for bone images as BMC is considered to be related to the mechanism of bone remodeling, osteoporosis, and so on. To reduce manual inputs to segment BMC, we classified the texture pattern using wavelet transformation and support vector machine. We also integrated the result of texture pattern classification into the graph-cuts-based image segmentation method because texture analysis does not consider spatial continuity. Our method is applicable to a particular frame in an image sequence in which the condition of fluorescent material is variable. In the experiment, we evaluated our method with nine types of mother wavelets and several sets of scale parameters. The proposed method with graph-cuts and texture pattern classification performs well without manual inputs by a user.
Effective Multi-Query Expansions: Collaborative Deep Networks for Robust Landmark Retrieval.
Wang, Yang; Lin, Xuemin; Wu, Lin; Zhang, Wenjie
2017-03-01
Given a query photo issued by a user (q-user), the landmark retrieval is to return a set of photos with their landmarks similar to those of the query, while the existing studies on the landmark retrieval focus on exploiting geometries of landmarks for similarity matches between candidate photos and a query photo. We observe that the same landmarks provided by different users over social media community may convey different geometry information depending on the viewpoints and/or angles, and may, subsequently, yield very different results. In fact, dealing with the landmarks with low quality shapes caused by the photography of q-users is often nontrivial and has seldom been studied. In this paper, we propose a novel framework, namely, multi-query expansions, to retrieve semantically robust landmarks by two steps. First, we identify the top- k photos regarding the latent topics of a query landmark to construct multi-query set so as to remedy its possible low quality shape. For this purpose, we significantly extend the techniques of Latent Dirichlet Allocation. Then, motivated by the typical collaborative filtering methods, we propose to learn a collaborative deep networks-based semantically, nonlinear, and high-level features over the latent factor for landmark photo as the training set, which is formed by matrix factorization over collaborative user-photo matrix regarding the multi-query set. The learned deep network is further applied to generate the features for all the other photos, meanwhile resulting into a compact multi-query set within such space. Then, the final ranking scores are calculated over the high-level feature space between the multi-query set and all other photos, which are ranked to serve as the final ranking list of landmark retrieval. 
Extensive experiments are conducted on real-world social media data with both landmark photos together with their user information to show the superior performance over the existing methods, especially our recently proposed multi-query based mid-level pattern representation method [1].
CUFID-query: accurate network querying through random walk based network flow estimation.
Jeong, Hyundoo; Qian, Xiaoning; Yoon, Byung-Jun
2017-12-28
Functional modules in biological networks consist of numerous biomolecules and their complicated interactions. Recent studies have shown that biomolecules in a functional module tend to have similar interaction patterns and that such modules are often conserved across biological networks of different species. As a result, such conserved functional modules can be identified through comparative analysis of biological networks. In this work, we propose a novel network querying algorithm based on the CUFID (Comparative network analysis Using the steady-state network Flow to IDentify orthologous proteins) framework combined with an efficient seed-and-extension approach. The proposed algorithm, CUFID-query, can accurately detect conserved functional modules as small subnetworks in the target network that are expected to perform similar functions to the given query functional module. The CUFID framework was recently developed for probabilistic pairwise global comparison of biological networks, and it has been applied to pairwise global network alignment, where the framework was shown to yield accurate network alignment results. In the proposed CUFID-query algorithm, we adopt the CUFID framework and extend it for local network alignment, specifically to solve network querying problems. First, in the seed selection phase, the proposed method utilizes the CUFID framework to compare the query and the target networks and to predict the probabilistic node-to-node correspondence between the networks. Next, the algorithm selects and greedily extends the seed in the target network by iteratively adding nodes that have frequent interactions with other nodes in the seed network, in a way that the conductance of the extended network is maximally reduced. Finally, CUFID-query removes irrelevant nodes from the querying results based on the personalized PageRank vector for the induced network that includes the fully extended network and its neighboring nodes. 
Through extensive performance evaluation based on biological networks with known functional modules, we show that CUFID-query outperforms the existing state-of-the-art algorithms in terms of prediction accuracy and biological significance of the predictions.
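The seed-extension step described above can be illustrated with a small sketch. It uses the standard definition of conductance (cut edges divided by the smaller of the two volumes) and a plain greedy loop; the abstract does not give CUFID-query's exact scoring or stopping rule, so `conductance`, `extend_seed`, and the toy graph are illustrative assumptions rather than the authors' implementation.

```python
def conductance(adj, subset):
    """Cut edges leaving `subset` divided by the smaller side's volume."""
    subset = set(subset)
    cut = sum(1 for u in subset for v in adj[u] if v not in subset)
    vol_in = sum(len(adj[u]) for u in subset)
    vol_out = sum(len(adj[u]) for u in adj if u not in subset)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


def extend_seed(adj, seed):
    """Greedily add the neighbouring node that lowers conductance the most.

    Assumption: stop as soon as no neighbour reduces the conductance.
    """
    module = set(seed)
    best = conductance(adj, module)
    while True:
        frontier = sorted({v for u in module for v in adj[u]} - module)
        if not frontier:
            break
        score, node = min((conductance(adj, module | {v}), v) for v in frontier)
        if score >= best:
            break
        module.add(node)
        best = score
    return module, best


# Toy undirected network: two triangles joined by the single edge 2-3.
toy = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
module, score = extend_seed(toy, {0})
print(sorted(module), round(score, 3))
```

Starting from node 0, the loop grows the seed to the first triangle and stops when adding node 3 would raise the conductance again.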
Seasonal trends in sleep-disordered breathing: evidence from Internet search engine query data.
Ingram, David G; Matthews, Camilla K; Plante, David T
2015-03-01
The primary aim of the current study was to test the hypothesis that there is a seasonal component to snoring and obstructive sleep apnea (OSA) through the use of Google search engine query data. Internet search engine query data were retrieved from Google Trends from January 2006 to December 2012. Monthly normalized search volume was obtained over that 7-year period in the USA and Australia for the following search terms: "snoring" and "sleep apnea". Seasonal effects were investigated by fitting cosinor regression models. In addition, the search terms "snoring children" and "sleep apnea children" were evaluated to examine seasonal effects in pediatric populations. Statistically significant seasonal effects were found using cosinor analysis in both USA and Australia for "snoring" (p < 0.00001 for both countries). Similarly, seasonal patterns were observed for "sleep apnea" in the USA (p = 0.001); however, cosinor analysis was not significant for this search term in Australia (p = 0.13). Seasonal patterns for "snoring children" and "sleep apnea children" were observed in the USA (p = 0.002 and p < 0.00001, respectively), with insufficient search volume to examine these search terms in Australia. All searches peaked in the winter or early spring in both countries, with the magnitude of seasonal effect ranging from 5 to 50 %. Our findings indicate that there are significant seasonal trends for both snoring and sleep apnea internet search engine queries, with a peak in the winter and early spring. Further research is indicated to determine the mechanisms underlying these findings, whether they have clinical impact, and if they are associated with other comorbid medical conditions that have similar patterns of seasonal exacerbation.
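A cosinor model of the kind mentioned above fits a mean level plus a single cosine of known period. The sketch below is a minimal version, assuming equally spaced monthly values covering whole years, in which case the least-squares coefficients reduce to simple sums. The function name `fit_cosinor` and the synthetic series are hypothetical, and the significance testing reported in the study is not reproduced.

```python
import math


def fit_cosinor(values, period=12):
    """Fit y = mesor + a*cos(wt) + b*sin(wt) to equally spaced samples.

    Assumes len(values) is a whole number of periods, so the cosine and
    sine regressors are orthogonal and the least-squares fit is closed-form.
    """
    n = len(values)
    if n == 0 or n % period != 0:
        raise ValueError("need a whole number of complete periods")
    omega = 2 * math.pi / period
    mesor = sum(values) / n
    a = 2 / n * sum(y * math.cos(omega * t) for t, y in enumerate(values))
    b = 2 / n * sum(y * math.sin(omega * t) for t, y in enumerate(values))
    amplitude = math.hypot(a, b)
    peak = (math.atan2(b, a) / omega) % period  # time index of the maximum
    return mesor, amplitude, peak


# Synthetic 7-year monthly series peaking at month index 1 of each year.
series = [50 + 10 * math.cos(2 * math.pi * (t - 1) / 12) for t in range(84)]
mesor, amplitude, peak = fit_cosinor(series)
```

The ratio of amplitude to mesor gives a relative seasonal effect for the fitted series; how the study itself defined the magnitude of the seasonal effect is not stated in the abstract.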
Ali, Nadia; Peebles, David
2013-02-01
We report three experiments investigating the ability of undergraduate college students to comprehend 2 x 2 "interaction" graphs from two-way factorial research designs. Factorial research designs are an invaluable research tool widely used in all branches of the natural and social sciences, and the teaching of such designs lies at the core of many college curricula. Such data can be represented in bar or line graph form. Previous studies have shown, however, that people interpret these two graphical forms differently. In Experiment 1, participants were required to interpret interaction data in either bar or line graphs while thinking aloud. Verbal protocol analysis revealed that line graph users were significantly more likely to misinterpret the data or fail to interpret the graph altogether. The patterns of errors line graph users made were interpreted as arising from the operation of Gestalt principles of perceptual organization, and this interpretation was used to develop two modified versions of the line graph, which were then tested in two further experiments. One of the modifications resulted in a significant improvement in performance. Results of the three experiments support the proposed explanation and demonstrate the effects (both positive and negative) of Gestalt principles of perceptual organization on graph comprehension. We propose that our new design provides a more balanced representation of the data than the standard line graph for nonexpert users to comprehend the full range of relationships in two-way factorial research designs and may therefore be considered a more appropriate representation for use in educational and other nonexpert contexts.
Discrete Mathematical Approaches to Graph-Based Traffic Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joslyn, Cliff A.; Cowley, Wendy E.; Hogan, Emilie A.
2014-04-01
Modern cyber defense and analytics requires general, formal models of cyber systems. Multi-scale network models are prime candidates for such formalisms, using discrete mathematical methods based in hierarchically-structured directed multigraphs which also include rich sets of labels. An exemplar of an application of such an approach is traffic analysis, that is, observing and analyzing connections between clients, servers, hosts, and actors within IP networks, over time, to identify characteristic or suspicious patterns. Towards that end, NetFlow (or more generically, IPFLOW) data are available from routers and servers which summarize coherent groups of IP packets flowing through the network. In this paper, we consider traffic analysis of Netflow using both basic graph statistics and two new mathematical measures involving labeled degree distributions and time interval overlap measures. We do all of this over the VAST test data set of 96M synthetic Netflow graph edges, against which we can identify characteristic patterns of simulated ground-truth network attacks.
Graph theory applied to the analysis of motor activity in patients with schizophrenia and depression
Fasmer, Erlend Eindride; Berle, Jan Øystein; Oedegaard, Ketil J.; Hauge, Erik R.
2018-01-01
Depression and schizophrenia are defined only by their clinical features, and diagnostic separation between them can be difficult. Disturbances in motor activity pattern are central features of both types of disorders. We introduce a new method to analyze time series, called the similarity graph algorithm. Time series of motor activity, obtained from actigraph registrations over 12 days in depressed and schizophrenic patients, were mapped into a graph and we then applied techniques from graph theory to characterize these time series, primarily looking for changes in complexity. The most marked finding was that depressed patients were found to be significantly different from both controls and schizophrenic patients, with evidence of less regularity of the time series, when analyzing the recordings with one hour intervals. These findings support the contention that there are important differences in control systems regulating motor behavior in patients with depression and schizophrenia. The similarity graph algorithm we have described can easily be applied to the study of other types of time series. PMID:29668743
Fasmer, Erlend Eindride; Fasmer, Ole Bernt; Berle, Jan Øystein; Oedegaard, Ketil J; Hauge, Erik R
2018-01-01
Depression and schizophrenia are defined only by their clinical features, and diagnostic separation between them can be difficult. Disturbances in motor activity pattern are central features of both types of disorders. We introduce a new method to analyze time series, called the similarity graph algorithm. Time series of motor activity, obtained from actigraph registrations over 12 days in depressed and schizophrenic patients, were mapped into a graph and we then applied techniques from graph theory to characterize these time series, primarily looking for changes in complexity. The most marked finding was that depressed patients were found to be significantly different from both controls and schizophrenic patients, with evidence of less regularity of the time series, when analyzing the recordings with one hour intervals. These findings support the contention that there are important differences in control systems regulating motor behavior in patients with depression and schizophrenia. The similarity graph algorithm we have described can easily be applied to the study of other types of time series.
NASA Technical Reports Server (NTRS)
Shewhart, Mark
1991-01-01
Statistical Process Control (SPC) charts are one of several tools used in quality control. Other tools include flow charts, histograms, cause and effect diagrams, check sheets, Pareto diagrams, graphs, and scatter diagrams. A control chart is simply a graph which indicates process variation over time. The purpose of drawing a control chart is to detect any changes in the process signalled by abnormal points or patterns on the graph. The Artificial Intelligence Support Center (AISC) of the Acquisition Logistics Division has developed a hybrid machine learning expert system prototype which automates the process of constructing and interpreting control charts.
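As a rough illustration of how a control chart flags abnormal points, the sketch below computes a centre line and three-sigma limits from a baseline sample and reports which new observations fall outside them. This is a simplified assumption: it uses the sample standard deviation and only the single "point beyond the limits" rule, and the function names are hypothetical. The rules used by the AISC prototype are not described in the abstract.

```python
import statistics


def control_limits(baseline, sigmas=3.0):
    """Return (lower, centre, upper) limits estimated from in-control data."""
    centre = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return centre - sigmas * spread, centre, centre + sigmas * spread


def out_of_control(points, lower, upper):
    """Indices of observations that fall outside the control limits."""
    return [i for i, x in enumerate(points) if x < lower or x > upper]


lower, centre, upper = control_limits([10.0, 10.2, 9.8, 10.1, 9.9])
flagged = out_of_control([10.0, 12.0, 9.9, 9.0], lower, upper)
```

Pattern-based rules (runs, trends) would need additional checks on consecutive points and are left out here.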
PlantTribes: a gene and gene family resource for comparative genomics in plants
Wall, P. Kerr; Leebens-Mack, Jim; Müller, Kai F.; Field, Dawn; Altman, Naomi S.; dePamphilis, Claude W.
2008-01-01
The PlantTribes database (http://fgp.huck.psu.edu/tribe.html) is a plant gene family database based on the inferred proteomes of five sequenced plant species: Arabidopsis thaliana, Carica papaya, Medicago truncatula, Oryza sativa and Populus trichocarpa. We used the graph-based clustering algorithm MCL [Van Dongen (Technical Report INS-R0010 2000) and Enright et al. (Nucleic Acids Res. 2002; 30: 1575–1584)] to classify all of these species’ protein-coding genes into putative gene families, called tribes, using three clustering stringencies (low, medium and high). For all tribes, we have generated protein and DNA alignments and maximum-likelihood phylogenetic trees. A parallel database of microarray experimental results is linked to the genes, which lets researchers identify groups of related genes and their expression patterns. Unified nomenclatures were developed, and tribes can be related to traditional gene families and conserved domain identifiers. SuperTribes, constructed through a second iteration of MCL clustering, connect distant, but potentially related gene clusters. The global classification of nearly 200 000 plant proteins was used as a scaffold for sorting ∼4 million additional cDNA sequences from over 200 plant species. All data and analyses are accessible through a flexible interface allowing users to explore the classification, to place query sequences within the classification, and to download results for further study. PMID:18073194
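For readers unfamiliar with MCL, the following is a minimal pure-Python sketch of the algorithm's two alternating steps, expansion (squaring the column-stochastic matrix) and inflation (elementwise power followed by column renormalisation). The fixed iteration count, the `1e-6` threshold, and the toy similarity graph are assumptions for illustration. The abstract does not say how the low, medium and high stringencies were configured, and the production `mcl` program uses sparse matrices and pruning that are omitted here.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50):
    """Toy Markov clustering on a dense adjacency matrix (list of lists)."""
    n = len(adjacency)
    # Self-loops are commonly added before normalisation.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: take the matrix square.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: elementwise power, then renormalise the columns.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    # Each row that keeps mass is an attractor; its nonzero columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two groups of three mutually similar sequences with no cross-group hits.
similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
tribes = mcl(similarity)
```

In general, a higher inflation value yields finer-grained clusters and a lower value yields coarser ones.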
Watanabe, Katsumi; Yoshimura, Yuko; Kikuchi, Mitsuru; Minabe, Yoshio; Aihara, Kazuyuki
2017-01-01
Autism spectrum disorder (ASD) is a developmental disorder that involves developmental delays. It has been hypothesized that aberrant neural connectivity in ASD may cause atypical brain network development. Brain graphs not only describe the differences in brain networks between clinical and control groups, but also provide information about network development within each group. In the present study, graph indices of brain networks were estimated in children with ASD and in typically developing (TD) children using magnetoencephalography performed while the children viewed a cartoon video. We examined brain graphs from a developmental point of view, and compared the networks between children with ASD and TD children. Network development patterns (NDPs) were assessed by examining the association between the graph indices and the raw scores on the achievement scale or the age of the children. The ASD and TD groups exhibited different NDPs at both network and nodal levels. In the left frontal areas, the nodal degree and efficiency of the ASD group were negatively correlated with the achievement scores. Reduced network connections were observed in the temporal and posterior areas of TD children. These results suggested that the atypical network developmental trajectory in children with ASD is associated with the development score rather than age. PMID:28886147
Graph theory network function in Parkinson's disease assessed with electroencephalography.
Utianski, Rene L; Caviness, John N; van Straaten, Elisabeth C W; Beach, Thomas G; Dugger, Brittany N; Shill, Holly A; Driver-Dunckley, Erika D; Sabbagh, Marwan N; Mehta, Shyamal; Adler, Charles H; Hentz, Joseph G
2016-05-01
To determine what differences exist in graph theory network measures derived from electroencephalography (EEG), between Parkinson's disease (PD) patients who are cognitively normal (PD-CN) and matched healthy controls; and between PD-CN and PD dementia (PD-D). EEG recordings were analyzed via graph theory network analysis to quantify changes in global efficiency and local integration. This included minimal spanning tree analysis. T-tests and correlations were used to assess differences between groups and assess the relationship with cognitive performance. Network measures showed increased local integration across all frequency bands between control and PD-CN; in contrast, decreased local integration occurred in PD-D when compared to PD-CN in the alpha1 frequency band. Differences found in PD-MCI mirrored PD-D. Correlations were found between network measures and assessments of global cognitive performance in PD. Our results reveal distinct patterns of band and network measure type alteration and breakdown for PD, as well as with cognitive decline in PD. These patterns suggest specific ways that interaction between cortical areas becomes abnormal and contributes to PD symptoms at various stages. Graph theory analysis by EEG suggests that network alteration and breakdown are robust attributes of PD cortical dysfunction pathophysiology. Copyright © 2016 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.
An ontology-based comparative anatomy information system
Travillian, Ravensara S.; Diatchka, Kremena; Judge, Tejinder K.; Wilamowska, Katarzyna; Shapiro, Linda G.
2010-01-01
Introduction This paper describes the design, implementation, and potential use of a comparative anatomy information system (CAIS) for querying on similarities and differences between homologous anatomical structures across species, the knowledge base it operates upon, the method it uses for determining the answers to the queries, and the user interface it employs to present the results. The relevant informatics contributions of our work include (1) the development and application of the structural difference method, a formalism for symbolically representing anatomical similarities and differences across species; (2) the design of the structure of a mapping between the anatomical models of two different species and its application to information about specific structures in humans, mice, and rats; and (3) the design of the internal syntax and semantics of the query language. These contributions provide the foundation for the development of a working system that allows users to submit queries about the similarities and differences between mouse, rat, and human anatomy; delivers result sets that describe those similarities and differences in symbolic terms; and serves as a prototype for the extension of the knowledge base to any number of species. Additionally, we expanded the domain knowledge by identifying medically relevant structural questions for the human, the mouse, and the rat, and made an initial foray into the validation of the application and its content by means of user questionnaires, software testing, and other feedback. Methods The anatomical structures of the species to be compared, as well as the mappings between species, are modeled on templates from the Foundational Model of Anatomy knowledge base, and compared using graph-matching techniques. A graphical user interface allows users to issue queries that retrieve information concerning similarities and differences between structures in the species being examined. 
Queries from diverse information sources, including domain experts, peer-reviewed articles, and reference books, have been used to test the system and to illustrate its potential use in comparative anatomy studies. Results 157 test queries were submitted to the CAIS system, and all of them were correctly answered. The interface was evaluated in terms of clarity and ease of use. This testing determined that the application works well, and is fairly intuitive to use, but users want to see more clarification of the meaning of the different types of possible queries. Some of the interface issues will naturally be resolved as we refine our conceptual model to deal with partial and complex homologies in the content. Conclusions The CAIS system and its associated methods are expected to be useful to biologists and translational medicine researchers. Possible applications range from supporting theoretical work in clarifying and modeling ontogenetic, physiological, pathological, and evolutionary transformations, to concrete techniques for improving the analysis of genotype–phenotype relationships among various animal models in support of a wide array of clinical and scientific initiatives. PMID:21146377
Semantics based approach for analyzing disease-target associations.
Kaalia, Rama; Ghosh, Indira
2016-08-01
A complex disease is caused by heterogeneous biological interactions between genes and their products along with the influence of environmental factors. There have been many attempts for understanding the cause of these diseases using experimental, statistical and computational methods. In the present work the objective is to address the challenge of representation and integration of information from heterogeneous biomedical aspects of a complex disease using semantics based approach. Semantic web technology is used to design Disease Association Ontology (DAO-db) for representation and integration of disease associated information with diabetes as the case study. The functional associations of disease genes are integrated using RDF graphs of DAO-db. Three semantic web based scoring algorithms (PageRank, HITS (Hyperlink Induced Topic Search) and HITS with semantic weights) are used to score the gene nodes on the basis of their functional interactions in the graph. Disease Association Ontology for Diabetes (DAO-db) provides a standard ontology-driven platform for describing genes, proteins, pathways involved in diabetes and for integrating functional associations from various interaction levels (gene-disease, gene-pathway, gene-function, gene-cellular component and protein-protein interactions). An automatic instance loader module is also developed in present work that helps in adding instances to DAO-db on a large scale. Our ontology provides a framework for querying and analyzing the disease associated information in the form of RDF graphs. The above developed methodology is used to predict novel potential targets involved in diabetes disease from the long list of loose (statistically associated) gene-disease associations. Copyright © 2016 Elsevier Inc. All rights reserved.
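Of the three scoring algorithms named above, PageRank is the simplest to sketch. The example below runs a plain power iteration on a tiny directed graph. The node labels (`geneA`, `geneB`, `geneC`) and the edges are hypothetical, and the semantic weighting used with HITS in the paper is not reproduced.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict mapping node -> list of targets."""
    nodes = sorted(set(out_links) | {v for ts in out_links.values() for v in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        rank = new
    return rank


# Hypothetical functional links between three gene nodes.
links = {"geneA": ["geneC"], "geneB": ["geneC"], "geneC": ["geneA"]}
scores = pagerank(links)
top = max(scores, key=scores.get)
```

Here `geneC` receives links from both other nodes and ends up with the highest score, while `geneB`, which nothing points to, keeps only the baseline share.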
Considerations on the Use of Custom Accelerators for Big Data Analytics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castellana, Vito G.; Tumeo, Antonino; Minutoli, Marco
Accelerators, including Graphic Processing Units (GPUs) for general purpose computation and many-core designs with wide vector units (e.g., Intel Phi), have become a common component of many high performance clusters. The appearance of more stable and reliable tools that can automatically convert code written in high-level specifications with annotations (such as C or C++) to hardware description languages (High-Level Synthesis - HLS) is also setting the stage for a broader use of reconfigurable devices (e.g., Field Programmable Gate Arrays - FPGAs) in high performance systems for the implementation of custom accelerators, helped by the fact that new processors include advanced cache-coherent interconnects for these components. In this chapter, we briefly survey the status of the use of accelerators in high performance systems targeted at big data analytics applications. We argue that, although the progress in the use of accelerators for this class of applications has been significant, differently from scientific simulations there still are gaps to close. This is particularly true for the "irregular" behaviors exhibited by no-SQL, graph databases. We focus our attention on the limits of HLS tools for data analytics and graph methods, and discuss a new architectural template that better fits the requirement of this class of applications. We validate the new architectural templates by modifying the Graph Engine for Multithreaded System (GEMS) framework to support accelerators generated with such a methodology, and testing with queries coming from the Lehigh University Benchmark (LUBM). The architectural template enables better supporting the task and memory level parallelism present in graph methods by supporting a new control model and an enhanced memory interface. We show that our solution allows generating parallel accelerators, providing speed ups with respect to conventional HLS flows. 
We finally draw conclusions and present a perspective on the use of reconfigurable devices and Design Automation tools for data analytics.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gong, Zhenhuan; Boyuka, David; Zou, X
The size and scope of cutting-edge scientific simulations are growing much faster than the I/O and storage capabilities of their run-time environments. The growing gap is exacerbated by exploratory, data-intensive analytics, such as querying simulation data with multivariate, spatio-temporal constraints, which induces heterogeneous access patterns that stress the performance of the underlying storage system. Previous work addresses data layout and indexing techniques to improve query performance for a single access pattern, which is not sufficient for complex analytics jobs. We present PARLO, a parallel run-time layout optimization framework, to achieve multi-level data layout optimization for scientific applications at run-time before data is written to storage. The layout schemes optimize for heterogeneous access patterns with user-specified priorities. PARLO is integrated with ADIOS, a high-performance parallel I/O middleware for large-scale HPC applications, to achieve user-transparent, light-weight layout optimization for scientific datasets. It offers simple XML-based configuration for users to achieve flexible layout optimization without the need to modify or recompile application codes. Experiments show that PARLO improves performance by 2 to 26 times for queries with heterogeneous access patterns compared to state-of-the-art scientific database management systems. Compared to traditional post-processing approaches, its underlying run-time layout optimization achieves a 56% savings in processing time and a reduction in storage overhead of up to 50%. PARLO also exhibits a low run-time resource requirement, while also limiting the performance impact on running applications to a reasonable level.
Local Table Condensation in Rough Set Approach for Jumping Emerging Pattern Induction
NASA Astrophysics Data System (ADS)
Terlecki, Pawel; Walczak, Krzysztof
This paper extends the rough set approach for JEP induction based on the notion of a condensed decision table. The original transaction database is transformed to a relational form and patterns are induced by means of local reducts. The transformation employs an item aggregation obtained by coloring a graph that reflects conflicts among items. For efficiency reasons we propose to perform this preprocessing locally, i.e. at the transaction level, to achieve a higher dimensionality gain. A special maintenance strategy is also used to avoid graph rebuilds. Both the global and local approaches have been tested and discussed for dense and synthetically generated sparse datasets.
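The item aggregation step can be pictured with a small sketch. It assumes that two items "conflict" when they occur together in at least one transaction, so that items sharing a color never co-occur and can be folded into one attribute of the condensed table. That reading of "conflict", the greedy coloring heuristic, and the function names are assumptions for illustration only; the abstract does not specify them.

```python
from itertools import combinations


def conflict_graph(transactions):
    """Link every pair of items that co-occur in some transaction."""
    graph = {}
    for transaction in transactions:
        for item in transaction:
            graph.setdefault(item, set())
        for a, b in combinations(sorted(transaction), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Assign each node the smallest color unused by its neighbours,
    visiting nodes by decreasing degree (ties broken by name)."""
    colors = {}
    for node in sorted(graph, key=lambda v: (-len(graph[v]), v)):
        used = {colors[nb] for nb in graph[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


transactions = [{"a", "b"}, {"b", "c"}, {"d"}]
graph = conflict_graph(transactions)
colors = greedy_coloring(graph)
attributes = len(set(colors.values()))  # condensed attribute count
```

In this toy database four items collapse into two attributes, which is the kind of dimensionality gain the condensation aims for.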
Graph theoretical model of a sensorimotor connectome in zebrafish.
Stobb, Michael; Peterson, Joshua M; Mazzag, Borbala; Gahtan, Ethan
2012-01-01
Mapping the detailed connectivity patterns (connectomes) of neural circuits is a central goal of neuroscience. The best quantitative approach to analyzing connectome data is still unclear but graph theory has been used with success. We present a graph theoretical model of the posterior lateral line sensorimotor pathway in zebrafish. The model includes 2,616 neurons and 167,114 synaptic connections. Model neurons represent known cell types in zebrafish larvae, and connections were set stochastically following rules based on biological literature. Thus, our model is a uniquely detailed computational representation of a vertebrate connectome. The connectome has low overall connection density, with 2.45% of all possible connections, a value within the physiological range. We used graph theoretical tools to compare the zebrafish connectome graph to small-world, random and structured random graphs of the same size. For each type of graph, 100 randomly generated instantiations were considered. Degree distribution (the number of connections per neuron) varied more in the zebrafish graph than in same size graphs with less biological detail. There was high local clustering and a short average path length between nodes, implying a small-world structure similar to other neural connectomes and complex networks. The graph was found not to be scale-free, in agreement with some other neural connectomes. An experimental lesion was performed that targeted three model brain neurons, including the Mauthner neuron, known to control fast escape turns. The lesion decreased the number of short paths between sensory and motor neurons analogous to the behavioral effects of the same lesion in zebrafish. This model is expandable and can be used to organize and interpret a growing database of information on the zebrafish connectome.
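The connection density quoted above can be checked with a few lines of arithmetic. The sketch assumes a directed graph without self-connections, so that n(n - 1) connections are possible; the abstract does not state which denominator the authors used, and `directed_density` is a hypothetical helper name.

```python
def directed_density(num_nodes, num_edges):
    """Fraction of possible directed connections present (no self-loops)."""
    possible = num_nodes * (num_nodes - 1)
    return num_edges / possible


density = directed_density(2616, 167114)
print(f"{density:.2%}")
```

Under this assumption the value comes out at roughly 2.44%, close to the 2.45% reported; the small difference may reflect rounding or a different convention for counting possible connections.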
Structural graph-based morphometry: A multiscale searchlight framework based on sulcal pits.
Takerkart, Sylvain; Auzias, Guillaume; Brun, Lucile; Coulon, Olivier
2017-01-01
Studying the topography of the cortex has proved valuable in order to characterize populations of subjects. In particular, the recent interest towards the deepest parts of the cortical sulci - the so-called sulcal pits - has opened new avenues in that regard. In this paper, we introduce the first fully automatic brain morphometry method based on the study of the spatial organization of sulcal pits - Structural Graph-Based Morphometry (SGBM). Our framework uses attributed graphs to model local patterns of sulcal pits, and further relies on three original contributions. First, a graph kernel is defined to provide a new similarity measure between pit-graphs, with few parameters that can be efficiently estimated from the data. Secondly, we present the first searchlight scheme dedicated to brain morphometry, yielding dense information maps covering the full cortical surface. Finally, a multi-scale inference strategy is designed to jointly analyze the searchlight information maps obtained at different spatial scales. We demonstrate the effectiveness of our framework by studying gender differences and cortical asymmetries: we show that SGBM can both localize informative regions and estimate their spatial scales, while providing results which are consistent with the literature. Thanks to the modular design of our kernel and the vast array of available kernel methods, SGBM can easily be extended to include a more detailed description of the sulcal patterns and solve different statistical problems. Therefore, we suggest that our SGBM framework should be useful for both reaching a better understanding of the normal brain and defining imaging biomarkers in clinical settings. Copyright © 2016 Elsevier B.V. All rights reserved.
Using Betweenness Centrality to Identify Manifold Shortcuts
Cukierski, William J.; Foran, David J.
2010-01-01
High-dimensional data presents a challenge to tasks of pattern recognition and machine learning. Dimensionality reduction (DR) methods remove the unwanted variance and make these tasks tractable. Several nonlinear DR methods, such as the well known ISOMAP algorithm, rely on a neighborhood graph to compute geodesic distances between data points. These graphs can contain unwanted edges which connect disparate regions of one or more manifolds. This topological sensitivity is well known [1], [2], [3], yet handling high-dimensional, noisy data in the absence of a priori manifold knowledge, remains an open and difficult problem. This work introduces a divisive, edge-removal method based on graph betweenness centrality which can robustly identify manifold-shorting edges. The problem of graph construction in high dimension is discussed and the proposed algorithm is fit into the ISOMAP workflow. ROC analysis is performed and the performance is tested on synthetic and real datasets. PMID:20607142
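The idea of flagging shortcut edges by betweenness can be sketched as follows. This is a minimal illustration using Brandes-style accumulation on an unweighted toy graph; the abstract does not give the authors' exact divisive procedure or thresholds, so the helper names and the two-cluster example are assumptions.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness for an undirected, unweighted graph."""
    scores = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist = {s: 0}
        sigma = dict.fromkeys(adj, 0.0)   # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):         # accumulate dependencies back toward s
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                scores[frozenset((v, w))] += share
                delta[v] += share
    # every unordered pair was counted once from each endpoint
    return {edge: value / 2.0 for edge, value in scores.items()}


def build_graph(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Two tight neighbourhoods joined by a single "shorting" edge (2, 3).
graph = build_graph([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
scores = edge_betweenness(graph)
shortcut = max(scores, key=scores.get)
print(sorted(shortcut), scores[shortcut])
```

In an ISOMAP-style workflow, the highest-scoring edges would be candidates for removal before geodesic distances are computed; how many to remove is a design choice the sketch leaves open.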
Social Structure and Depression in TrevorSpace.
Homan, Christopher M; Lu, Naiji; Tu, Xin; Lytle, Megan C; Silenzio, Vincent M B
2014-02-01
We discover patterns related to depression in the social graph of an online community of approximately 20,000 lesbian, gay, and bisexual, transgender, and questioning youth. With survey data on fewer than two hundred community members and the network graph of the entire community (which is completely anonymous except for the survey responses), we detected statistically significant correlations between a number of graph properties and those TrevorSpace users showing a higher likelihood of depression, according to the Patient Healthcare Questionnaire-9, a standard instrument for estimating depression. Our results suggest that those who are less depressed are more deeply integrated into the social fabric of TrevorSpace than those who are more depressed. Our techniques may apply to other hard-to-reach online communities, like gay men on Facebook, where obtaining detailed information about individuals is difficult or expensive, but obtaining the social graph is not.
Social Structure and Depression in TrevorSpace
Homan, Christopher M.; Lu, Naiji; Tu, Xin; Lytle, Megan C.; Silenzio, Vincent M.B.
2016-01-01
We discover patterns related to depression in the social graph of an online community of approximately 20,000 lesbian, gay, and bisexual, transgender, and questioning youth. With survey data on fewer than two hundred community members and the network graph of the entire community (which is completely anonymous except for the survey responses), we detected statistically significant correlations between a number of graph properties and those TrevorSpace users showing a higher likelihood of depression, according to the Patient Healthcare Questionnaire-9, a standard instrument for estimating depression. Our results suggest that those who are less depressed are more deeply integrated into the social fabric of TrevorSpace than those who are more depressed. Our techniques may apply to other hard-to-reach online communities, like gay men on Facebook, where obtaining detailed information about individuals is difficult or expensive, but obtaining the social graph is not. PMID:28492067
Principal curve detection in complicated graph images
NASA Astrophysics Data System (ADS)
Liu, Yuncai; Huang, Thomas S.
2001-09-01
Finding principal curves in an image is an important low-level processing task in computer vision and pattern recognition. Principal curves are those curves in an image that represent boundaries or contours of objects of interest. In general, a principal curve should be smooth, satisfy a certain length constraint, and allow either smooth or sharp turning. In this paper, we present a method that can efficiently detect principal curves in complicated map images. A given feature image, obtained from edge detection of an intensity image or from a thinning operation on a pictorial map image, is first converted to a graph representation. In the graph image domain, principal curve detection is performed to identify useful image features. Shortest path and directional deviation schemes are used in our algorithm of principal curve detection, which proves very efficient when working with real graph images.
State transfer in highly connected networks and a quantum Babinet principle
NASA Astrophysics Data System (ADS)
Tsomokos, D. I.; Plenio, M. B.; de Vega, I.; Huelga, S. F.
2008-12-01
The transfer of a quantum state between distant nodes in two-dimensional networks is considered. The fidelity of state transfer is calculated as a function of the number of interactions in networks that are described by regular graphs. It is shown that perfect state transfer is achieved in a network of size N, whose structure is that of an (N/2)-cross polytope graph, if N is a multiple of 4. The result is reminiscent of the Babinet principle of classical optics. A quantum Babinet principle is derived, which allows for the identification of complementary graphs leading to the same fidelity of state transfer, in analogy with complementary screens providing identical diffraction patterns.
Interactive design of generic chemical patterns.
Schomburg, Karen T; Wetzer, Lars; Rarey, Matthias
2013-07-01
Every medicinal chemist has to create chemical patterns occasionally for querying databases, applying filters or describing functional groups. However, the representations of chemical patterns have been so far limited to languages with highly complex syntax, handicapping the application of patterns. Graphic pattern editors similar to chemical editors can facilitate the work with patterns. In this article, we review the interfaces of frequently used web search engines for chemical patterns. We take a look at pattern editing concepts of standalone chemical editors and finally present a completely new, unpublished graphical approach to pattern design, the SMARTSeditor. Copyright © 2013 Elsevier Ltd. All rights reserved.
Development of an Electronic Kit for detecting asthma in Human Respiratory System
NASA Astrophysics Data System (ADS)
Shek Hong, Cheow; Ghani, Ahmad Shahrizan Abdul; Khairuddin, Ismail Mohd
2018-03-01
In this paper, a prototype carbon dioxide (CO2) measurement device is designed to detect asthma and to monitor asthma patients. Nowadays, capnogram devices are widely used in monitoring asthma and in asthma-related medical services. However, a capnogram device is very costly and unaffordable for patients, especially those in low-income households. Thus, the proposed device is cost-effective, affordable, and built to detect and monitor the severity of asthma. Meanwhile, a flow meter can cause patients chest pain because they need maximum effort to blow into the device. To overcome these limitations, this prototype electronic kit is easy to use and suitable for all ranges of patients. The kit consists of an MH-Z14A carbon dioxide (CO2) sensor that detects the concentration of carbon dioxide in the user's exhaled air. An Arduino microcontroller is used to process the data, while a TFT display shield is used for data presentation. In addition, an HC-06 Bluetooth module is used to communicate with a PC for further analysis of the captured graph. The device was tested with 3 asthmatic and 3 normal users. The results showed that asthmatic users have a different graph pattern from normal users, and these graphs are clearly differentiated on the device's TFT screen. Asthmatic users produce a “shark fin”-like pattern whereas normal users produce a “square wave”-like pattern. The device successfully produced distinguishable patterns for asthmatic and normal users; therefore, it is suitable for asthma monitoring.
White, Ryen W; Horvitz, Eric
2017-03-01
A statistical model that predicts the appearance of strong evidence of a lung carcinoma diagnosis via analysis of large-scale anonymized logs of web search queries from millions of people across the United States. To evaluate the feasibility of screening patients at risk of lung carcinoma via analysis of signals from online search activity. We identified people who issue special queries that provide strong evidence of a recent diagnosis of lung carcinoma. We then considered patterns of symptoms expressed as searches about concerning symptoms over several months prior to the appearance of the landmark web queries. We built statistical classifiers that predict the future appearance of landmark queries based on the search log signals. This was a retrospective log analysis of the online activity of millions of web searchers seeking health-related information online. Of web searchers who queried for symptoms related to lung carcinoma, some (n = 5443 of 4 813 985) later issued queries that provide strong evidence of recent clinical diagnosis of lung carcinoma and are regarded as positive cases in our analysis. Additional evidence on the reliability of these queries as representing clinical diagnoses is based on the significant increase in follow-on searches for treatments and medications for these searchers and on the correlation between lung carcinoma incidence rates and our log-based statistics. The remaining symptom searchers (n = 4 808 542) are regarded as negative cases. Performance of the statistical model for early detection from online search behavior, for different lead times, different sets of signals, and different cohorts of searchers stratified by potential risk. 
The statistical classifier predicting the future appearance of landmark web queries based on search log signals identified searchers who later input queries consistent with a lung carcinoma diagnosis, with a true-positive rate ranging from 3% to 57% for false-positive rates ranging from 0.00001 to 0.001, respectively. The methods can be used to identify people at highest risk up to a year in advance of the inferred diagnosis time. The 5 factors associated with the highest relative risk (RR) were evidence of family history (RR = 7.548; 95% CI, 3.937-14.470), age (RR = 3.558; 95% CI, 3.357-3.772), radon (RR = 2.529; 95% CI, 1.137-5.624), primary location (RR = 2.463; 95% CI, 1.364-4.446), and occupation (RR = 1.969; 95% CI, 1.143-3.391). Evidence of smoking (RR = 1.646; 95% CI, 1.032-2.260) was important but not top-ranked, which was due to the difficulty of identifying smoking history from search terms. Pattern recognition based on data drawn from large-scale web search queries holds opportunity for identifying risk factors and frames new directions with early detection of lung carcinoma.
Tune the topology to create or destroy patterns
NASA Astrophysics Data System (ADS)
Asllani, Malbor; Carletti, Timoteo; Fanelli, Duccio
2016-12-01
We consider the dynamics of a reaction-diffusion system on a multigraph. The species share the same set of nodes but can access different links to explore the embedding spatial support. By acting on the topology of the networks we can control the ability of the system to self-organise in macroscopic patterns, emerging as a symmetry-breaking instability of a homogeneous fixed point. Two different case studies are considered: on the one hand, we produce a global modification of the networks, starting from the limiting setting where species are hosted on the same graph. On the other, we consider the effect of inserting just one additional link to differentiate the two graphs. In both cases, patterns can be generated or destroyed by the imposed small topological perturbation. Approximate analytical formulae allow one to grasp the essence of the phenomenon and can potentially inspire innovative control strategies to shape the macroscopic dynamics on multigraph networks.
Generalised power graph compression reveals dominant relationship patterns in complex networks
Ahnert, Sebastian E.
2014-01-01
We introduce a framework for the discovery of dominant relationship patterns in complex networks, by compressing the networks into power graphs with overlapping power nodes. When paired with enrichment analysis of node classification terms, the most compressible sets of edges provide a highly informative sketch of the dominant relationship patterns that define the network. In addition, this procedure also gives rise to a novel, link-based definition of overlapping node communities in which nodes are defined by their relationships with sets of other nodes, rather than through connections within the community. We show that this completely general approach can be applied to undirected, directed, and bipartite networks, yielding valuable insights into the large-scale structure of real-world networks, including social networks and food webs. Our approach therefore provides a novel way in which network architecture can be studied, defined and classified. PMID:24663099
Yeung, Daniel; Boes, Peter; Ho, Meng Wei; Li, Zuofeng
2015-05-08
Image-guided radiotherapy (IGRT), based on radiopaque markers placed in the prostate gland, was used for proton therapy of prostate patients. Orthogonal X-rays and the IBA Digital Image Positioning System (DIPS) were used for setup correction prior to treatment and were repeated after treatment delivery. Following a rationale for margin estimates similar to that of van Herk,(1) the daily post-treatment DIPS data were analyzed to determine if an adaptive radiotherapy plan was necessary. A Web application using ASP.NET MVC5, Entity Framework, and an SQL database was designed to automate this process. The designed features included state-of-the-art Web technologies, a domain model closely matching the workflow, a database supporting concurrency and data mining, access to the DIPS database, secured user access and roles management, and graphing and analysis tools. The Model-View-Controller (MVC) paradigm allowed clean domain logic, unit testing, and extensibility. Client-side technologies, such as jQuery, jQuery plug-ins, and Ajax, were adopted to achieve a rich user environment and fast response. Data models included patients, staff, treatment fields and records, correction vectors, DIPS images, and association logic. Data entry, analysis, workflow logic, and notifications were implemented. The system effectively modeled the clinical workflow and IGRT process.
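As a rough illustration of a van Herk style margin estimate, the widely cited population recipe M = 2.5Σ + 0.7σ can be computed from per-patient shift records. The abstract only says the rationale is "similar to" van Herk's, so the exact rule used by the authors is not known here; the data values, units, and function name below are hypothetical.

```python
import math
import statistics


def van_herk_margin(shifts_by_patient):
    """Return (Sigma, sigma, margin) for one axis of residual setup shifts.

    Sigma: systematic error, the SD of the per-patient mean shifts.
    sigma: random error, the root mean square of the per-patient SDs.
    margin: 2.5 * Sigma + 0.7 * sigma.
    """
    means = [statistics.mean(s) for s in shifts_by_patient.values()]
    variances = [statistics.variance(s) for s in shifts_by_patient.values()]
    systematic = statistics.stdev(means)
    random_error = math.sqrt(statistics.mean(variances))
    return systematic, random_error, 2.5 * systematic + 0.7 * random_error


# Hypothetical post-treatment shifts in mm along a single axis.
shifts = {
    "patient_a": [1.0, 3.0],
    "patient_b": [4.0, 6.0],
    "patient_c": [-1.0, 1.0],
}
systematic, random_error, margin = van_herk_margin(shifts)
print(round(systematic, 2), round(random_error, 2), round(margin, 2))
```

A tool like the one described could compare such a running estimate against the planned margin to decide whether an adaptive plan is worth reviewing; that decision rule is an assumption, not something stated in the abstract.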
BIOZON: a system for unification, management and analysis of heterogeneous biological data.
Birkland, Aaron; Yona, Golan
2006-02-15
Integration of heterogeneous data types is a challenging problem, especially in biology, where the number of databases and data types increase rapidly. Amongst the problems that one has to face are integrity, consistency, redundancy, connectivity, expressiveness and updatability. Here we present a system (Biozon) that addresses these problems, and offers biologists a new knowledge resource to navigate through and explore. Biozon unifies multiple biological databases consisting of a variety of data types (such as DNA sequences, proteins, interactions and cellular pathways). It is fundamentally different from previous efforts as it uses a single extensive and tightly connected graph schema wrapped with hierarchical ontology of documents and relations. Beyond warehousing existing data, Biozon computes and stores novel derived data, such as similarity relationships and functional predictions. The integration of similarity data allows propagation of knowledge through inference and fuzzy searches. Sophisticated methods of query that span multiple data types were implemented and first-of-a-kind biological ranking systems were explored and integrated. The Biozon system is an extensive knowledge resource of heterogeneous biological data. Currently, it holds more than 100 million biological documents and 6.5 billion relations between them. The database is accessible through an advanced web interface that supports complex queries, "fuzzy" searches, data materialization and more, online at http://biozon.org.
The Endpoint Hypothesis: A Topological-Cognitive Assessment of Geographic Scale Movement Patterns
NASA Astrophysics Data System (ADS)
Klippel, Alexander; Li, Rui
Movement patterns of individual entities at the geographic scale are becoming a prominent research focus in spatial sciences. One pertinent question is how cognitive and formal characterizations of movement patterns relate. In other words, are (mostly qualitative) formal characterizations cognitively adequate? This article experimentally evaluates movement patterns that can be characterized as paths through a conceptual neighborhood graph, that is, two extended spatial entities changing their topological relationship gradually. The central questions addressed are: (a) Do humans naturally use topology to create cognitive equivalent classes, that is, is topology the basis for categorizing movement patterns spatially? (b) Are ‘all’ topological relations equally salient? (c) Does language influence categorization? The first two questions are addressed using a modification of the endpoint hypothesis, which states that movement patterns are distinguished by the topological relation they end in. The third question addresses whether language has an influence on the classification of movement patterns, that is, whether there is a difference between linguistic and non-linguistic category construction. In contrast to our previous findings we were able to document the importance of topology for conceptualizing movement patterns but also reveal differences in the cognitive saliency of topological relations. The latter aspect calls for a weighted conceptual neighborhood graph to model human conceptualization processes in a cognitively adequate way.
Motivated Proteins: A web application for studying small three-dimensional protein motifs
Leader, David P; Milner-White, E James
2009-01-01
Background Small loop-shaped motifs are common constituents of the three-dimensional structure of proteins. Typically they comprise between three and seven amino acid residues, and are defined by a combination of dihedral angles and hydrogen bonding partners. The most abundant of these are αβ-motifs, asx-motifs, asx-turns, β-bulges, β-bulge loops, β-turns, nests, niches, Schellmann loops, ST-motifs, ST-staples and ST-turns. We have constructed a database of such motifs from a range of high-quality protein structures and built a web application as a visual interface to this. Description The web application, Motivated Proteins, provides access to these 12 motifs (with 48 sub-categories) in a database of over 400 representative proteins. Queries can be made for specific categories or sub-categories of motif, motifs in the vicinity of ligands, motifs which include part of an enzyme active site, overlapping motifs, or motifs which include a particular amino acid sequence. Individual proteins can be specified, or, where appropriate, motifs for all proteins listed. The results of queries are presented in textual form as an (X)HTML table, and may be saved as parsable plain text or XML. Motifs can be viewed and manipulated either individually or in the context of the protein in the Jmol applet structural viewer. Cartoons of the motifs imposed on a linear representation of protein secondary structure are also provided. Summary information for the motifs is available, as are histograms of amino acid distribution, and graphs of dihedral angles at individual positions in the motifs. Conclusion Motivated Proteins is a publicly and freely accessible web application that enables protein scientists to study small three-dimensional motifs without requiring knowledge of either Structured Query Language or the underlying database schema. PMID:19210785
On the structure of Bayesian network for Indonesian text document paraphrase identification
NASA Astrophysics Data System (ADS)
Prayogo, Ario Harry; Syahrul Mubarok, Mohamad; Adiwijaya
2018-03-01
Paraphrase identification is an important process within natural language processing. The idea is to automatically recognize phrases that have different forms but the same meaning. For example, if we input the query “causing fire hazard”, the computer has to recognize that this query has the same meaning as “the cause of fire hazard”. Paraphrasing is an activity that reveals the meaning of an expression, writing, or speech using different words or forms, especially to achieve greater clarity. In this research we focus on classifying whether two Indonesian sentences are paraphrases of each other or not. There are four steps in this research: preprocessing, feature extraction, classifier building, and performance evaluation. Preprocessing consists of tokenization, non-alphanumeric character removal, and stemming. After preprocessing we conduct feature extraction in order to build new features from the given dataset. There are two kinds of features in the research: syntactic features and semantic features. Syntactic features consist of a normalized Levenshtein distance feature, a term-frequency-based cosine similarity feature, and an LCS (Longest Common Subsequence) feature. Semantic features consist of the Wu and Palmer feature and the Shortest Path feature. We use Bayesian networks as the method for training the classifier. The parameter estimation that we use is MAP (Maximum A Posteriori). For structure learning of the Bayesian network DAG (Directed Acyclic Graph), we use the BDeu (Bayesian Dirichlet equivalent uniform) scoring function, and to find the DAG with the best BDeu score we use the K2 algorithm. In the evaluation step we perform cross-validation. The average results from testing the classifier are as follows: Precision 75.2%, Recall 76.5%, F1-Measure 75.8%, and Accuracy 75.6%.
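The three syntactic features named above can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: normalizing the edit distance by the longer string length and splitting tokens on whitespace are assumptions, and the stemming step is omitted.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cosine_tf(tokens_a, tokens_b):
    """Cosine similarity between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


sentence_a = "causing fire hazard".split()
sentence_b = "the cause of fire hazard".split()
features = {
    "norm_levenshtein": normalized_levenshtein(" ".join(sentence_a), " ".join(sentence_b)),
    "cosine_tf": cosine_tf(sentence_a, sentence_b),
    "lcs_tokens": lcs_length(sentence_a, sentence_b),
}
print(features)
```

Each sentence pair would yield one such feature vector, which the classifier then consumes alongside the semantic features.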
Fast 3D shape screening of large chemical databases through alignment-recycling
Fontaine, Fabien; Bolton, Evan; Borodina, Yulia; Bryant, Stephen H
2007-01-01
Background Large chemical databases require fast, efficient, and simple ways of looking for similar structures. Although such tasks are now fairly well resolved for graph-based similarity queries, they remain an issue for 3D approaches, particularly for those based on 3D shape overlays. Inspired by a recent technique developed to compare molecular shapes, we designed a hybrid methodology, alignment-recycling, that enables efficient retrieval and alignment of structures with similar 3D shapes. Results Using a dataset of more than one million PubChem compounds of limited size (< 28 heavy atoms) and flexibility (< 6 rotatable bonds), we obtained a set of a few thousand diverse structures covering entirely the 3D shape space of the conformers of the dataset. Transformation matrices gathered from the overlays between these diverse structures and the 3D conformer dataset allowed us to drastically (100-fold) reduce the CPU time required for shape overlay. The alignment-recycling heuristic produces results consistent with de novo alignment calculation, with better than 80% hit list overlap on average. Conclusion Overlay-based 3D methods are computationally demanding when searching large databases. Alignment-recycling reduces the CPU time to perform shape similarity searches by breaking the alignment problem into three steps: selection of diverse shapes to describe the database shape-space; overlay of the database conformers to the diverse shapes; and non-optimized overlay of query and database conformers using common reference shapes. The precomputation, required by the first two steps, is a significant cost of the method; however, once performed, querying is two orders of magnitude faster. Extensions and variations of this methodology, for example, to handle more flexible and larger small-molecules are discussed. PMID:17880744
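The third step, overlaying a query on a database conformer through a shared reference shape, amounts to composing stored rigid transforms. The sketch below shows that composition with plain 4x4 homogeneous matrices; the helper names and the toy transforms are assumptions, and real overlays would come from a shape-alignment tool rather than being constructed by hand.

```python
import math


def rigid(theta_z, translation):
    """4x4 homogeneous transform: rotation about z by theta_z, then translation."""
    c, s = math.cos(theta_z), math.sin(theta_z)
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx], [s, c, 0.0, ty], [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def invert_rigid(m):
    """Inverse of a rigid transform [R t; 0 1], namely [R^T, -R^T t; 0 1]."""
    rt = [[m[j][i] for j in range(3)] for i in range(3)]
    t = [m[i][3] for i in range(3)]
    nt = [-sum(rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [rt[0] + [nt[0]], rt[1] + [nt[1]], rt[2] + [nt[2]], [0.0, 0.0, 0.0, 1.0]]


def recycle_alignment(query_to_ref, db_to_ref):
    """Map query coordinates into the database conformer's frame via the reference."""
    return matmul(invert_rigid(db_to_ref), query_to_ref)


# Toy setup: pick a "true" query-to-database pose, then derive a consistent
# database-to-reference transform so the recycled result can be checked.
query_to_ref = rigid(0.3, (1.0, -2.0, 0.5))
true_query_to_db = rigid(-1.1, (4.0, 0.0, 2.0))
db_to_ref = matmul(query_to_ref, invert_rigid(true_query_to_db))
recycled = recycle_alignment(query_to_ref, db_to_ref)
```

Only the query-to-reference overlays have to be computed at search time; the database-to-reference transforms are precomputed once, which is where the reported speed-up would come from. With real data the composed overlay is approximate, in line with the abstract's "non-optimized" wording.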
Chen, Josephine; Zhao, Po; Massaro, Donald; Clerch, Linda B.; Almon, Richard R.; DuBois, Debra C.; Jusko, William J.; Hoffman, Eric P.
2004-01-01
Publicly accessible DNA databases (genome browsers) are rapidly accelerating post-genomic research (see http://www.genome.ucsc.edu/), with integrated genomic DNA, gene structure, EST/splicing and cross-species ortholog data. DNA databases have relatively low dimensionality; the genome is a linear code that anchors all associated data. In contrast, RNA expression and protein databases need to be able to handle very high dimensional data, with time, tissue, cell type and genes as interrelated variables. The high dimensionality of microarray expression profile data and the lack of a standard experimental platform have complicated the development of web-accessible databases and analytical tools. We have designed and implemented a public resource of expression profile data containing 1024 human, mouse and rat Affymetrix GeneChip expression profiles, generated in the same laboratory and subject to the same quality and procedural controls (Public Expression Profiling Resource; PEPR). Our Oracle-based PEPR data warehouse includes a novel time series query analysis tool (SGQT), enabling dynamic generation of graphs and spreadsheets showing the action of any transcript of interest over time. In this report, we demonstrate the utility of this tool using a 27 time point, in vivo muscle regeneration series. This data warehouse and its associated analysis tools provide access to multidimensional microarray data through web-based interfaces, both for download of all types of raw data for independent analysis and for straightforward gene-based queries. Planned implementations of PEPR will include web-based remote entry of projects adhering to quality control and standard operating procedure (QC/SOP) criteria, and automated output of alternative probe set algorithms for each project (see http://microarray.cnmcresearch.org/pgadatatable.asp). PMID:14681485
A new method for the automatic retrieval of medical cases based on the RadLex ontology.
Spanier, A B; Cohen, D; Joskowicz, L
2017-03-01
The goal of medical case-based image retrieval (M-CBIR) is to assist radiologists in the clinical decision-making process by finding medical cases in large archives that most resemble a given case. Cases are described by radiology reports comprised of radiological images and textual information on the anatomy and pathology findings. The textual information, when available in standardized terminology, e.g., the RadLex ontology, and used in conjunction with the radiological images, provides a substantial advantage for M-CBIR systems. We present a new method for incorporating textual radiological findings from medical case reports in M-CBIR. The input is a database of medical cases, a query case, and the number of desired relevant cases. The output is an ordered list of the most relevant cases in the database. The method is based on a new case formulation, the Augmented RadLex Graph and an Anatomy-Pathology List. It uses a new case relatedness metric [Formula: see text] that prioritizes more specific medical terms in the RadLex tree over less specific ones and that incorporates the length of the query case. An experimental study on 8 CT queries from the 2015 VISCERAL 3D Case Retrieval Challenge database consisting of 1497 volumetric CT scans shows that our method has accuracy rates of 82 and 70% on the first 10 and 30 most relevant cases, respectively, thereby outperforming six other methods. The increasing amount of medical imaging data acquired in clinical practice constitutes a vast database of untapped diagnostically relevant information. This paper presents a new hybrid approach to retrieving the most relevant medical cases based on textual and image information.
NASA Technical Reports Server (NTRS)
Abiteboul, Serge
1997-01-01
The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in the systems to highly structured data in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes it exists but we prefer to ignore it for certain purposes such as browsing. We call semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases. As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research. The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data.
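To make the notion of irregular, implicitly structured data concrete, the sketch below stores records as a label-addressed tree and answers a path query that tolerates missing or differently shaped fields. It is an illustration only: the record contents and the `follow` helper are hypothetical and do not reproduce the syntax of the Stanford or U. Penn proposals mentioned above.

```python
def follow(node, path):
    """Yield every value reachable from node along a sequence of edge labels."""
    if isinstance(node, list):
        # A list stands for several edges with the same label; try each target.
        for item in node:
            yield from follow(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from follow(node[path[0]], path[1:])
    # Any other shape (e.g. a plain string where an object was expected)
    # simply contributes no result instead of raising a type error.


# Records with the same label but different shapes: no fixed schema.
guide = {
    "restaurant": [
        {"name": "Chez Anne", "address": {"city": "Paris", "zip": "75005"}},
        {"name": "Luigi", "address": "12 Main St"},   # address is a bare string
        {"name": "Noodle Bar"},                        # no address at all
    ]
}

names = list(follow(guide, ["restaurant", "name"]))
cities = list(follow(guide, ["restaurant", "address", "city"]))
print(names, cities)
```

The query for cities returns a result only for the record that happens to have that structure, which is the kind of behaviour a query language for semi-structured data has to support.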
Finding patterns in biomolecular data, particularly in DNA and RNA, is at the center of modern biological research. These data are complex and growing rapidly, so the search for patterns requires increasingly sophisticated computer methods. This book provides a summary of principal techniques. Each chapter describes techniques that are drawn from many fields, including graph
Evaluation of gingival vascularisation using laser Doppler flowmetry
NASA Astrophysics Data System (ADS)
Vitez, B.; Todea, C.; Velescu, A.; Şipoş, C.
2016-03-01
Aim: The present study aims to assess the level of vascularisation of the lower frontal gingiva of smoker patients, in comparison with non-smokers, by using Laser Doppler Flowmetry (LDF), in order to determine the changes in gingival microcirculation. Material & methods: 16 volunteers were included in this study and separated into 2 equal groups: non-smoker subjects in Group I and smoker subjects in Group II. All patients were submitted to a visual examination and professional cleaning. The gingival blood flow of each patient was recorded in 5 zones using LDF, resulting in a total of 80 recordings. LDF was done with the Moor Instruments Ltd. "moorLAB" Laser Doppler. All data were collected as graphs and raw values and statistically analyzed. Results: After strict analysis, results show that Group II presents a steady level of gingival microcirculation with even patterns in the graph, while Group I shows many signs of damage to its microvascular system through many irregularities in the microcirculation level and graph patterns. Conclusion: The results suggest that prolonged smoking has a definitive effect on the gingival vascularisation, making it a key factor in periodontal pathology.
Spectral mapping of brain functional connectivity from diffusion imaging.
Becker, Cassiano O; Pequito, Sérgio; Pappas, George J; Miller, Michael B; Grafton, Scott T; Bassett, Danielle S; Preciado, Victor M
2018-01-23
Understanding the relationship between the dynamics of neural processes and the anatomical substrate of the brain is a central question in neuroscience. On the one hand, modern neuroimaging technologies, such as diffusion tensor imaging, can be used to construct structural graphs representing the architecture of white matter streamlines linking cortical and subcortical structures. On the other hand, temporal patterns of neural activity can be used to construct functional graphs representing temporal correlations between brain regions. Although some studies provide evidence that whole-brain functional connectivity is shaped by the underlying anatomy, the observed relationship between function and structure is weak, and the rules by which anatomy constrains brain dynamics remain elusive. In this article, we introduce a methodology to map the functional connectivity of a subject at rest from his or her structural graph. Using our methodology, we are able to systematically account for the role of structural walks in the formation of functional correlations. Furthermore, in our empirical evaluations, we observe that the eigenmodes of the mapped functional connectivity are associated with activity patterns associated with different cognitive systems.
NASA Astrophysics Data System (ADS)
Holme, Petter; Saramäki, Jari
2012-10-01
A great variety of systems in nature, society and technology, from the web of sexual contacts to the Internet, from the nervous system to power grids, can be modeled as graphs of vertices coupled by edges. The network structure, describing how the graph is wired, helps us understand, predict and optimize the behavior of dynamical systems. In many cases, however, the edges are not continuously active. As an example, in networks of communication via e-mail, text messages, or phone calls, edges represent sequences of instantaneous or practically instantaneous contacts. In some cases, edges are active for non-negligible periods of time: e.g., the proximity patterns of inpatients at hospitals can be represented by a graph where an edge between two individuals is on throughout the time they are at the same ward. Like network topology, the temporal structure of edge activations can affect dynamics of systems interacting through the network, from disease contagion on the network of patients to information diffusion over an e-mail network. In this review, we present the emergent field of temporal networks, and discuss methods for analyzing topological and temporal structure and models for elucidating their relation to the behavior of dynamical systems. In the light of traditional network theory, one can see this framework as moving the information of when things happen from the dynamical system on the network, to the network itself. Since fundamental properties, such as the transitivity of edges, do not necessarily hold in temporal networks, many of these methods need to be quite different from those for static networks. The study of temporal networks is very interdisciplinary in nature. Reflecting this, even the object of study has many names: temporal graphs, evolving graphs, time-varying graphs, time-aggregated graphs, time-stamped graphs, dynamic networks, dynamic graphs, dynamical graphs, and so on.
This review covers different fields where temporal graphs are considered, but does not attempt to unify related terminology; rather, we want to make papers readable across disciplines.
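The remark that transitivity of edges need not hold in temporal networks can be illustrated with a small sketch. It assumes instantaneous, undirected contacts given as (u, v, t) triples with distinct timestamps; the function name `earliest_arrival` and the toy contact list are illustrative and are not taken from the review.

```python
def earliest_arrival(contacts, source, start_time=0):
    """Earliest time each node is reachable from `source` along
    time-respecting paths (contact times never decrease along a path).

    `contacts` is a list of (u, v, t) undirected instantaneous contacts.
    Assumption: timestamps are distinct, so no multi-hop relay happens
    within a single instant.
    """
    arrival = {source: start_time}
    # Scan contacts in chronological order; a contact relays information
    # only if one endpoint was already reached at or before time t.
    for u, v, t in sorted(contacts, key=lambda c: c[2]):
        for a, b in ((u, v), (v, u)):
            if a in arrival and arrival[a] <= t < arrival.get(b, float("inf")):
                arrival[b] = t
    return arrival


# B meets C at t=1, then A meets B at t=2.
contacts = [("B", "C", 1), ("A", "B", 2)]
from_a = earliest_arrival(contacts, "A")  # A reaches B, but never C
from_c = earliest_arrival(contacts, "C")  # C reaches B at 1 and A at 2
```

In the aggregated static graph A, B and C are all connected, yet nothing starting at A can reach C, because the B-C contact happened before A met B. Reachability in this setting is therefore neither symmetric nor transitive.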
An initial log analysis of usage patterns on a research networking system.
Boland, Mary Regina; Trembowelski, Sylvia; Bakken, Suzanne; Weng, Chunhua
2012-08-01
Usage data for research networking systems (RNSs) are valuable but generally unavailable for understanding scientific professionals' information needs and online collaborator seeking behaviors. This study contributes a method for evaluating RNSs and initial usage knowledge of one RNS obtained from using this method. We designed a log for an institutional RNS, defined categories of users and tasks, and analyzed correlations between usage patterns and user and query types. Our results show that scientific professionals spend more time performing deep Web searching on RNSs than generic Google users and we also show that retrieving scientist profiles is faster on an RNS than on Google (3.5 seconds vs. 34.2 seconds) whereas organization-specific browsing on an RNS takes longer than on Google (117.0 seconds vs. 34.2 seconds). Usage patterns vary by user role, e.g., faculty performed more informational queries than administrators, which implies role-specific user support is needed for RNSs. © 2012 Wiley Periodicals, Inc.
An Initial Log Analysis of Usage Patterns on a Research Networking System
Boland, Mary Regina; Trembowelski, Sylvia; Bakken, Suzanne; Weng, Chunhua
2012-01-01
Usage data for research networking systems (RNSs) are valuable but generally unavailable for understanding scientific professionals’ information needs and online collaborator seeking behaviors. This study contributes a method for evaluating RNSs and initial usage knowledge of one RNS obtained from using this method. We designed a log for an institutional RNS, defined categories of users and tasks, and analyzed correlations between usage patterns and user and query types. Our results show that scientific professionals spend more time performing deep Web searching on RNSs than generic Google users and we also show that retrieving scientist profiles is faster on an RNS than on Google (3.5 seconds vs. 34.2 seconds) whereas organization‐specific browsing on an RNS takes longer than on Google (117.0 seconds vs. 34.2 seconds). Usage patterns vary by user role, e.g., faculty performed more informational queries than administrators, which implies role‐specific user support is needed for RNSs. Clin Trans Sci 2012; Volume 5: 340–347. PMID:22883612
Novel Surveillance of Psychological Distress during the Great Recession
Ayers, John W.; Althouse, Benjamin M.; Allem, Jon-Patrick; Childers, Matthew A.; Zafar, Waleed; Latkin, Carl; Ribisl, Kurt M.; Brownstein, John S.
2015-01-01
Background: Economic stressors have been retrospectively associated with net population increases in nonspecific psychological distress (PD). However, no sentinels exist to evaluate contemporaneous associations. Aggregate Internet search query surveillance was used to monitor population changes in PD around the United States’ Great Recession. Methods: Monthly PD query trends were compared with unemployment, underemployment, homes in delinquency and foreclosure, median home value or sale prices, and S&P 500 trends for 2004–2010. Time series analyses, where economic indicators predicted PD one to seven months into the future, were performed in 2011. Results: PD queries surpassed 1,000,000 per month, of which 300,000 may be attributable to the Great Recession. A one percentage point increase in mortgage delinquencies and foreclosures was associated with a 16% (95%CI, 9–24) increase in PD queries one-month, and 11% (95%CI, 3–18) four months later, in reference to a pre-Great Recession mean. Unemployment and underemployment had similar associations half and one-quarter the intensity. “Anxiety disorder,” “what is depression,” “signs of depression,” “depression symptoms,” and “symptoms of depression” were the queries exhibiting the strongest associations with mortgage delinquencies and foreclosures, unemployment or underemployment. Housing prices and S&P 500 trends were not associated with PD queries. Limitations: A non-traditional measure of PD was used. It is unclear if actual clinically significant depression or anxiety increased during the Great Recession. Alternative explanations for strong associations between the Great Recession and PD queries, such as media, were explored and rejected. Conclusions: Because the economy is constantly changing, this work not only provides a snapshot of recent associations between the economy and PD queries but also a framework and toolkit for real-time surveillance going forward.
Health resources, clinician screening patterns, and policy debate may potentially be informed by changes in PD query trends. PMID:22835843
Novel surveillance of psychological distress during the great recession.
Ayers, John W; Althouse, Benjamin M; Allem, Jon-Patrick; Childers, Matthew A; Zafar, Waleed; Latkin, Carl; Ribisl, Kurt M; Brownstein, John S
2012-12-15
Economic stressors have been retrospectively associated with net population increases in nonspecific psychological distress (PD). However, no sentinels exist to evaluate contemporaneous associations. Aggregate Internet search query surveillance was used to monitor population changes in PD around the United States' Great Recession. Monthly PD query trends were compared with unemployment, underemployment, homes in delinquency and foreclosure, median home value or sale prices, and S&P 500 trends for 2004-2010. Time series analyses, where economic indicators predicted PD one to seven months into the future, were performed in 2011. PD queries surpassed 1,000,000 per month, of which 300,000 may be attributable to the Great Recession. A one percentage point increase in mortgage delinquencies and foreclosures was associated with a 16% (95%CI, 9-24) increase in PD queries one-month, and 11% (95%CI, 3-18) four months later, in reference to a pre-Great Recession mean. Unemployment and underemployment had similar associations half and one-quarter the intensity. "Anxiety disorder", "what is depression", "signs of depression", "depression symptoms", and "symptoms of depression" were the queries exhibiting the strongest associations with mortgage delinquencies and foreclosures, unemployment or underemployment. Housing prices and S&P 500 trends were not associated with PD queries. A non-traditional measure of PD was used. It is unclear if actual clinically significant depression or anxiety increased during the Great Recession. Alternative explanations for strong associations between the Great Recession and PD queries, such as media, were explored and rejected. Because the economy is constantly changing, this work not only provides a snapshot of recent associations between the economy and PD queries but also a framework and toolkit for real-time surveillance going forward. 
Health resources, clinician screening patterns, and policy debate may be informed by changes in PD query trends. Copyright © 2012 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
He, Xianjin; Zhang, Xinchang; Xin, Qinchuan
2018-02-01
Recognition of building group patterns (i.e., the arrangement and form exhibited by a collection of buildings at a given mapping scale) is important to the understanding and modeling of geographic space and is hence essential to a wide range of downstream applications such as map generalization. Most of the existing methods develop rigid rules based on the topographic relationships between building pairs to identify building group patterns and thus their applications are often limited. This study proposes a method to identify a variety of building group patterns that allow for map generalization. The method first identifies building group patterns from potential building clusters based on a machine-learning algorithm and further partitions the building clusters with no recognized patterns based on the graph partitioning method. The proposed method is applied to the datasets of three cities that are representative of the complex urban environment in Southern China. Assessment of the results based on the reference data suggests that the proposed method is able to recognize both regular (e.g., the collinear, curvilinear, and rectangular patterns) and irregular (e.g., the L-shaped, H-shaped, and high-density patterns) building group patterns well, given that the correctness values are consistently nearly 90% and the completeness values are all above 91% for the three study areas. The proposed method shows promise in automated recognition of building group patterns that allows for map generalization.
Indexing Volumetric Shapes with Matching and Packing
Koes, David Ryan; Camacho, Carlos J.
2014-01-01
We describe a novel algorithm for bulk-loading an index with high-dimensional data and apply it to the problem of volumetric shape matching. Our matching and packing algorithm is a general approach for packing data according to a similarity metric. First an approximate k-nearest neighbor graph is constructed using vantage-point initialization, an improvement to previous work that decreases construction time while improving the quality of approximation. Then graph matching is iteratively performed to pack related items closely together. The end result is a dense index with good performance. We define a new query specification for shape matching that uses minimum and maximum shape constraints to explicitly specify the spatial requirements of the desired shape. This specification provides a natural language for performing volumetric shape matching and is readily supported by the geometry-based similarity search (GSS) tree, an indexing structure that maintains explicit representations of volumetric shape. We describe our implementation of a GSS tree for volumetric shape matching and provide a comprehensive evaluation of parameter sensitivity, performance, and scalability. Compared to previous bulk-loading algorithms, we find that matching and packing can construct a GSS-tree index in the same amount of time that is denser, flatter, and better performing, with an observed average performance improvement of 2X. PMID:26085707
Science and Technology Pocket Data Book.
ERIC Educational Resources Information Center
National Science Foundation, Washington, DC. Div. of Science Resources Studies.
This pocket guide contains a collection of graphed data, available in 1994, on science and technology funding patterns within the United States, public attitudes toward science and technology, and international trends in science and technology. Sections contain: (1) national research and development (R&D) funding patterns; (2) academic R&D…
Linking Models: Reasoning from Patterns to Tables and Equations
ERIC Educational Resources Information Center
Switzer, J. Matt
2013-01-01
Patterns are commonly used in middle years mathematics classrooms to teach students about functions and modelling with tables, graphs, and equations. Grade 6 students are expected to, "continue and create sequences involving whole numbers, fractions and decimals," and "describe the rule used to create the sequence." (Australian…
NASA Astrophysics Data System (ADS)
Zhang, Honghai; Abiose, Ademola K.; Campbell, Dwayne N.; Sonka, Milan; Martins, James B.; Wahle, Andreas
2010-03-01
Quantitative analysis of the left ventricular shape and motion patterns associated with left ventricular mechanical dyssynchrony (LVMD) is essential for diagnosis and treatment planning in congestive heart failure. Real-time 3D echocardiography (RT3DE) used for LVMD analysis is frequently limited by heavy speckle noise or partially incomplete data, thus a segmentation method utilizing learned global shape knowledge is beneficial. In this study, the endocardial surface of the left ventricle (LV) is segmented using a hybrid approach combining active shape model (ASM) with optimal graph search. The latter is used to achieve landmark refinement in the ASM framework. Optimal graph search translates the 3D segmentation into the detection of a minimum-cost closed set in a graph and can produce a globally optimal result. Various types of information (gradient, intensity distributions, and regional-property terms) are used to define the costs for the graph search. The developed method was tested on 44 RT3DE datasets acquired from 26 LVMD patients. The segmentation accuracy was assessed by surface positioning error and volume overlap measured for the whole LV as well as 16 standard LV regions. The segmentation produced very good results that were not achievable using ASM or graph search alone.
Building Specialized Multilingual Lexical Graphs Using Community Resources
NASA Astrophysics Data System (ADS)
Daoud, Mohammad; Boitet, Christian; Kageura, Kyo; Kitamoto, Asanobu; Mangeot, Mathieu; Daoud, Daoud
We describe methods for compiling domain-dedicated multilingual terminological data from various resources. We focus on collecting data from online community users as a main source; therefore, our approach depends on acquiring contributions from volunteers (explicit approach) and on analyzing users' behaviors to extract interesting patterns and facts (implicit approach). As a generic repository that can handle the collected multilingual terminological data, we describe the concept of dedicated Multilingual Preterminological Graphs (MPGs), and some automatic approaches for constructing them by analyzing the behavior of online community users. A Multilingual Preterminological Graph is a special lexical resource that contains a massive number of terms related to a special domain. We call it preterminological because it is a raw material that can be used to build a standardized terminological repository. Building such a graph is difficult using traditional approaches, as it needs huge efforts by domain specialists and terminologists. In our approach, we build such a graph by analyzing the access log files of the website of the community, and by finding the important terms that have been used to search in that website, and their association with each other. We aim to make this graph a seed repository to which multilingual volunteers can contribute. We are experimenting with this approach on the Digital Silk Road Project. We have used its access log files since its beginning in 2003, and obtained an initial graph of around 116000 terms. As an application, we used this graph to obtain a preterminological multilingual database that is serving a CLIR system for the DSR project.
A Multilayer Network Approach for Guiding Drug Repositioning in Neglected Diseases
Chernomoretz, Ariel; Agüero, Fernán
2016-01-01
Drug development for neglected diseases has been historically hampered due to lack of market incentives. The advent of public domain resources containing chemical information from high throughput screenings is changing the landscape of drug discovery for these diseases. In this work we took advantage of data from extensively studied organisms like human, mouse, E. coli and yeast, among others, to develop a novel integrative network model to prioritize and identify candidate drug targets in neglected pathogen proteomes, and bioactive drug-like molecules. We modeled genomic (proteins) and chemical (bioactive compounds) data as a multilayer weighted network graph that takes advantage of bioactivity data across 221 species, chemical similarities between 1.7 × 10^5 compounds and several functional relations among 1.67 × 10^5 proteins. These relations comprised orthology, sharing of protein domains, and shared participation in defined biochemical pathways. We showcase the application of this network graph to the problem of prioritization of new candidate targets, based on the information available in the graph for known compound-target associations. We validated this strategy by performing a cross validation procedure for known mouse and Trypanosoma cruzi targets and showed that our approach outperforms classic alignment-based approaches. Moreover, our model provides additional flexibility as two different network definitions could be considered, finding in both cases qualitatively different but sensible candidate targets. We also showcase the application of the network to suggest targets for orphan compounds that are active against Plasmodium falciparum in high-throughput screens. In this case our approach provided a reduced prioritization list of target proteins for the query molecules and showed the ability to propose new testable hypotheses for each compound.
Moreover, we found that some predictions highlighted by our network model were supported by independent experimental validations as found post-facto in the literature. PMID:26735851
NASA Technical Reports Server (NTRS)
Zhang, Zhong
1997-01-01
The development of large-scale, composite software in a geographically distributed environment is an evolutionary process. Often, in such evolving systems, striving for consistency is complicated by many factors, because development participants have various locations, skills, responsibilities, roles, opinions, languages, and terminology, and employ different degrees of abstraction. This naturally leads to many partial specifications or viewpoints. These multiple views on the system being developed usually overlap, which gives rise to the potential for inconsistency. Existing CASE tools do not efficiently manage inconsistencies in a distributed development environment for a large-scale project. Based on the ViewPoints framework, the WHERE (Web-Based Hypertext Environment for requirements Evolution) toolkit aims to tackle inconsistency management issues within geographically distributed software development projects. Consequently, the WHERE project helps make software more robust and supports the software assurance process. The long-term goal of the WHERE tools is inconsistency analysis and management in requirements specifications. A framework based on Graph Grammar theory and the TCMJAVA toolkit is proposed to detect inconsistencies among viewpoints. This systematic approach uses three basic operations (UNION, DIFFERENCE, INTERSECTION) to study the static behaviors of graphic and tabular notations. From these operations, subgraph Query, Selection, Merge, and Replacement operations can be derived. This approach uses graph PRODUCTIONS (rewriting rules) to study the dynamic transformations of graphs. We discuss the feasibility of implementing these operations. We also present the process of porting the original TCM (Toolkit for Conceptual Modeling) project from C++ to the Java programming language in this thesis. A scenario based on the NASA International Space Station Specification is discussed to show the applicability of our approach.
Finally, conclusions and future work on inconsistency management issues in the WHERE project are summarized.
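As a rough illustration of how UNION, DIFFERENCE and INTERSECTION can expose disagreements between two viewpoints, the sketch below represents each viewpoint as a set of labelled edges. This representation, the function name `compare_viewpoints`, and the example triples are all assumptions made for illustration; none of it is the TCMJAVA or WHERE API.

```python
def compare_viewpoints(view_a, view_b):
    """Compare two viewpoints given as sets of
    (source, relation, target) edge triples."""
    return {
        "agreed": view_a & view_b,     # INTERSECTION: edges both views state
        "only_in_a": view_a - view_b,  # DIFFERENCE: candidates for inconsistency
        "only_in_b": view_b - view_a,
        "merged": view_a | view_b,     # UNION: combined specification
    }


# Two hypothetical viewpoints on the same subsystem.
designer = {("Crew", "operates", "Airlock"), ("Airlock", "connects", "Module")}
operator = {("Crew", "operates", "Airlock"), ("Airlock", "connects", "Dock")}
report = compare_viewpoints(designer, operator)
```

Edges that appear in only one view are not necessarily errors; they are the places where the two partial specifications need to be reconciled by their owners.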
A Multilayer Network Approach for Guiding Drug Repositioning in Neglected Diseases.
Berenstein, Ariel José; Magariños, María Paula; Chernomoretz, Ariel; Agüero, Fernán
2016-01-01
Drug development for neglected diseases has been historically hampered due to lack of market incentives. The advent of public domain resources containing chemical information from high throughput screenings is changing the landscape of drug discovery for these diseases. In this work we took advantage of data from extensively studied organisms like human, mouse, E. coli and yeast, among others, to develop a novel integrative network model to prioritize and identify candidate drug targets in neglected pathogen proteomes, and bioactive drug-like molecules. We modeled genomic (proteins) and chemical (bioactive compounds) data as a multilayer weighted network graph that takes advantage of bioactivity data across 221 species, chemical similarities between 1.7 × 10^5 compounds and several functional relations among 1.67 × 10^5 proteins. These relations comprised orthology, sharing of protein domains, and shared participation in defined biochemical pathways. We showcase the application of this network graph to the problem of prioritization of new candidate targets, based on the information available in the graph for known compound-target associations. We validated this strategy by performing a cross validation procedure for known mouse and Trypanosoma cruzi targets and showed that our approach outperforms classic alignment-based approaches. Moreover, our model provides additional flexibility as two different network definitions could be considered, finding in both cases qualitatively different but sensible candidate targets. We also showcase the application of the network to suggest targets for orphan compounds that are active against Plasmodium falciparum in high-throughput screens. In this case our approach provided a reduced prioritization list of target proteins for the query molecules and showed the ability to propose new testable hypotheses for each compound.
Moreover, we found that some predictions highlighted by our network model were supported by independent experimental validations as found post-facto in the literature.
Query-seeded iterative sequence similarity searching improves selectivity 5–20-fold
Li, Weizhong; Lopez, Rodrigo
2017-01-01
Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic of a protein family. But models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity. PMID:27923999
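The query-seeding step described above can be sketched for a single pairwise alignment. This is a minimal illustration that assumes the alignment is given as two equal-length gapped strings; the function name `query_seeded_row` and the toy sequences are hypothetical, and the sketch does not reproduce psisearch or psiblast internals.

```python
def query_seeded_row(aligned_query, aligned_subject, gap="-"):
    """Build the subject's row of a query-anchored MSA, filling the
    subject's gapped positions with the original query residues."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    row = []
    for q, s in zip(aligned_query, aligned_subject):
        if q == gap:
            # Insertion in the subject relative to the query: a
            # query-based MSA has no column for it, so skip it.
            continue
        # Where the subject has a gap, seed the column with the query residue.
        row.append(q if s == gap else s)
    return "".join(row)


row = query_seeded_row("MKT-AYIA", "MRTGA--A")
```

The resulting row has one residue per query position, so columns that would otherwise be empty for this subject contribute the query's own residue to the scoring model, which is the anchoring effect the abstract describes.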
Menon, K Venugopal; Kumar, Dinesh; Thomas, Tessamma
2014-02-01
Study Design: Preliminary evaluation of new tool. Objective: To ascertain whether the newly developed content-based image retrieval (CBIR) software can be used successfully to retrieve images of similar cases of adolescent idiopathic scoliosis (AIS) from a database to help plan treatment without adhering to a classification scheme. Methods: Sixty-two operated cases of AIS were entered into the newly developed CBIR database. Five new cases of different curve patterns were used as query images. The images were fed into the CBIR database that retrieved similar images from the existing cases. These were analyzed by a senior surgeon for conformity to the query image. Results: Within the limits of variability set for the query system, all the resultant images conformed to the query image. One case had no similar match in the series. The other four retrieved several images that were matching with the query. No matching case was left out in the series. The postoperative images were then analyzed to check for surgical strategies. Broad guidelines for treatment could be derived from the results. More precise query settings, inclusion of bending films, and a larger database will enhance accurate retrieval and better decision making. Conclusion: The CBIR system is an effective tool for accurate documentation and retrieval of scoliosis images. Broad guidelines for surgical strategies can be made from the postoperative images of the existing cases without adhering to any classification scheme.
Scratch Your Brain Where It Itches: Math Games, Tricks and Quick Activities, Book D-1 Algebra.
ERIC Educational Resources Information Center
Brumbaugh, Doug
This resource book for algebra contains games, tricks, and quick activities for the classroom. Categories of activities include puzzlers, patterns, manipulatives, measurement, graphing, and a section that contains reproducible statement and value cards. Twenty one puzzle problems, four pattern activities, and 11 quick activities that engage…
Pattern recognition tool based on complex network-based approach
NASA Astrophysics Data System (ADS)
Casanova, Dalcimar; Backes, André Ricardo; Martinez Bruno, Odemir
2013-02-01
This work proposes a generalization of the method previously proposed by the authors in 'A complex network-based approach for boundary shape analysis'. Instead of modelling a contour as a graph and using complex network rules to characterize it, here we generalize the technique. In this way, the work proposes a mathematical tool for the characterization of signals, curves and sets of points. To evaluate the pattern description power of the proposal, an experiment on plant identification based on leaf vein images is conducted. Leaf venation is a taxonomic characteristic used for plant identification purposes, and these structures are complex and difficult to represent as a signal or curve, and thus difficult to analyze with a classical pattern recognition approach. Here, we model the veins as a set of points and model them as graphs. As features, we use the degree and joint degree measurements in a dynamic evolution. The results demonstrate that the technique has good discriminative power and can be used for plant identification, as well as other complex pattern recognition tasks.
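One way to read "degree measurements in a dynamic evolution" is sketched below: the point set becomes a graph whose edges join points closer than a threshold, and degree statistics are recorded as the threshold grows. The thresholding rule, the normalisation, and the name `degree_features` are assumptions made for illustration; joint degree is omitted, and this is not the authors' implementation.

```python
import math


def degree_features(points, thresholds):
    """For each threshold t in [0, 1], connect points whose normalised
    Euclidean distance is <= t and return (mean degree, max degree),
    both divided by n - 1 so they lie in [0, 1]."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    dists = [[math.dist(p, q) for q in points] for p in points]
    # Normalise by the largest pairwise distance (fall back to 1.0 if
    # all points coincide, to avoid dividing by zero).
    d_max = max(max(row) for row in dists) or 1.0
    features = []
    for t in thresholds:
        degrees = [
            sum(1 for j in range(n) if j != i and dists[i][j] / d_max <= t)
            for i in range(n)
        ]
        features.append((sum(degrees) / (n * (n - 1)), max(degrees) / (n - 1)))
    return features


# Three nearby points and one distant point.
pts = [(0, 0), (1, 0), (0, 1), (10, 10)]
feats = degree_features(pts, [0.2, 1.0])
```

Concatenating the pairs over a range of thresholds gives a fixed-length descriptor for a point set of any size, which can then be passed to an ordinary classifier.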
NASA Astrophysics Data System (ADS)
Ghaderi, A. H.; Darooneh, A. H.
The behavior of nonlinear systems can be analyzed by artificial neural networks (ANNs). Air temperature change is one example of a nonlinear system. In this work, a new neural network method is proposed for forecasting maximum air temperature in two cities. In this method, the regular graph concept is used to construct some partially connected neural networks that have regular structures. The learning results of a fully connected ANN and of networks built with the proposed method are compared. In some cases, the proposed method gives better results than the conventional ANN. After specifying the best network, the effect of the number of input patterns on the prediction is studied, and the results show that increasing the number of input patterns has a direct effect on the prediction accuracy.
Modelling disease outbreaks in realistic urban social networks
NASA Astrophysics Data System (ADS)
Eubank, Stephen; Guclu, Hasan; Anil Kumar, V. S.; Marathe, Madhav V.; Srinivasan, Aravind; Toroczkai, Zoltán; Wang, Nan
2004-05-01
Most mathematical models for the spread of disease use differential equations based on uniform mixing assumptions or ad hoc models for the contact process. Here we explore the use of dynamic bipartite graphs to model the physical contact patterns that result from movements of individuals between specific locations. The graphs are generated by large-scale individual-based urban traffic simulations built on actual census, land-use and population-mobility data. We find that the contact network among people is a strongly connected small-world-like graph with a well-defined scale for the degree distribution. However, the locations graph is scale-free, which allows highly efficient outbreak detection by placing sensors in the hubs of the locations network. Within this large-scale simulation framework, we then analyse the relative merits of several proposed mitigation strategies for smallpox spread. Our results suggest that outbreaks can be contained by a strategy of targeted vaccination combined with early detection without resorting to mass vaccination of a population.
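The bipartite people-locations structure described above can be sketched in a few lines. This is a minimal, static illustration under stated assumptions: the `visits` pairs are toy data, time is ignored (the paper's graphs are dynamic), and "hub" is taken to mean the location with the most distinct visitors.

```python
from collections import defaultdict
from itertools import combinations


def build_graphs(visits):
    """Build the bipartite location->people map and the projected contact graph.

    visits: iterable of (person, location) pairs (toy input, not census data).
    Two people are in contact if they share at least one location.
    """
    people_at = defaultdict(set)
    for person, location in visits:
        people_at[location].add(person)

    contacts = defaultdict(set)
    for people in people_at.values():
        for a, b in combinations(sorted(people), 2):
            contacts[a].add(b)
            contacts[b].add(a)
    return dict(people_at), dict(contacts)


def top_hubs(people_at, k):
    """The k locations with the most distinct visitors (candidate sensor sites)."""
    return sorted(people_at, key=lambda loc: (-len(people_at[loc]), loc))[:k]


visits = [
    ("p1", "school"), ("p2", "school"), ("p3", "school"),
    ("p3", "office"), ("p4", "office"),
    ("p5", "home"),
]
people_at, contacts = build_graphs(visits)
hubs = top_hubs(people_at, 1)
```

Ranking locations by visitor count is only the simplest reading of "placing sensors in the hubs"; a time-resolved version would also need visit intervals.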
Fazeli Dehkordy, Soudabeh; Carlos, Ruth C; Hall, Kelli S; Dalton, Vanessa K
2014-09-01
Millions of people use online search engines every day to find health-related information and voluntarily share their personal health status and behaviors on various Web sites. Thus, data from tracking online information seekers' behavior offer potential opportunities for use in public health surveillance and research. Google Trends is a feature of Google which allows Internet users to graph the frequency of searches for a single term or phrase over time or by geographic region. We used Google Trends to describe patterns of information-seeking behavior on the subject of dense breasts and to examine their correlation with the passage or introduction of dense breast notification legislation. To capture the temporal variations of information seeking about dense breasts, the Web search query "dense breast" was entered in the Google Trends tool. We then mapped the dates of legislative actions regarding dense breasts that received widespread coverage in the lay media to information-seeking trends about dense breasts over time. Newsworthy events and legislative actions appear to correlate well with peaks in search volume of "dense breast". Geographic regions with the highest search volumes have passed, denied, or are currently considering dense breast legislation. Our study demonstrated that legislative action and the respective news coverage correlate with an increase in information seeking for "dense breast" on Google, suggesting that Google Trends has the potential to serve as a data source for policy-relevant research. Copyright © 2014 AUR. Published by Elsevier Inc. All rights reserved.
Dictionary-driven protein annotation.
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-09-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. 
Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
Drory Retwitzer, Matan; Polishchuk, Maya; Churkin, Elena; Kifer, Ilona; Yakhini, Zohar; Barash, Danny
2015-01-01
Searching for RNA sequence-structure patterns is becoming an essential tool for RNA practitioners. Novel discoveries of regulatory non-coding RNAs in targeted organisms and the motivation to find them across a wide range of organisms have prompted the use of computational RNA pattern matching as an enhancement to sequence similarity. State-of-the-art programs differ by the flexibility of patterns allowed as queries and by their simplicity of use. In particular—no existing method is available as a user-friendly web server. A general program that searches for RNA sequence-structure patterns is RNA Structator. However, it is not available as a web server and does not provide the option to allow flexible gap pattern representation with an upper bound of the gap length being specified at any position in the sequence. Here, we introduce RNAPattMatch, a web-based application that is user friendly and makes sequence/structure RNA queries accessible to practitioners of various background and proficiency. It also extends RNA Structator and allows a more flexible variable gaps representation, in addition to analysis of results using energy minimization methods. RNAPattMatch service is available at http://www.cs.bgu.ac.il/rnapattmatch. A standalone version of the search tool is also available to download at the site. PMID:25940619
NASA Astrophysics Data System (ADS)
Yamaguchi, Atsuko; Ohashi, Takeyoshi; Kawasaki, Takahiro; Inoue, Osamu; Kawada, Hiroki
2013-04-01
A new method for calculating critical dimensions (CDs) at the top and bottom of three-dimensional (3D) pattern profiles from a critical-dimension scanning electron microscope (CD-SEM) image, called the "T-sigma method", is proposed and evaluated. Without requiring a library or database to be prepared in advance, T-sigma can estimate a feature of a pattern sidewall. Furthermore, it supplies the optimum edge definition (i.e., the threshold level for determining the edge position from a CD-SEM signal) to detect the top and bottom of the pattern. This method consists of three steps. First, two components of line-edge roughness (LER), the noise-induced bias (i.e., LER bias) and the unbiased component (i.e., bias-free LER), are calculated with a set threshold level. Second, these components are calculated with various threshold values, and the threshold dependence of these two components, the "T-sigma graph", is obtained. Finally, the optimum threshold values for the top and the bottom edge detection are given by the analysis of the T-sigma graph. T-sigma was applied to CD-SEM images of three kinds of resist-pattern samples. In addition, reference metrology was performed with an atomic force microscope (AFM) and a scanning transmission electron microscope (STEM). The sensitivity of the CD measured by T-sigma to the reference CD was higher than or equal to that measured by the conventional edge definition. Regarding the absolute measurement accuracy, T-sigma showed better results than the conventional definition. Furthermore, T-sigma graphs were calculated from CD-SEM images of two kinds of resist samples and compared with the corresponding STEM observation results. Both bias-free LER and LER bias increased as the detected edge point moved from the bottom to the top of the pattern when the pattern had a straight sidewall and a round top. On the other hand, they were almost constant when the pattern had a re-entrant profile. T-sigma will therefore be able to reveal a re-entrant feature.
From these results, it is found that T-sigma method can provide rough cross-sectional pattern features and achieve quick, easy and accurate measurements of top and bottom CD.
Search query data to monitor interest in behavior change: application for public health.
Carr, Lucas J; Dunsiger, Shira I
2012-01-01
There is a need for effective interventions and policies that target the leading preventable causes of death in the U.S. (e.g., smoking, overweight/obesity, physical inactivity). Such efforts could be aided by the use of publicly available, real-time search query data that illustrate times and locations of high and low public interest in behaviors related to preventable causes of death. This study explored patterns of search query activity for the terms 'weight', 'diet', 'fitness', and 'smoking' using Google Insights for Search. Search activity for 'weight', 'diet', 'fitness', and 'smoking' conducted within the United States via Google between January 4th, 2004 (first date data was available) and November 28th, 2011 (date of data download and analysis) were analyzed. Using a generalized linear model, we explored the effects of time (month) on mean relative search volume for all four terms. Models suggest a significant effect of month on mean search volume for all four terms. Search activity for all four terms was highest in January with observable declines throughout the remainder of the year. These findings demonstrate discernable temporal patterns of search activity for four areas of behavior change. These findings could be used to inform the timing, location and messaging of interventions, campaigns and policies targeting these behaviors.
NASA Astrophysics Data System (ADS)
Cheng, Shaobo; Zhang, Dong; Deng, Shiqing; Li, Xing; Li, Jun; Tan, Guotai; Zhu, Yimei; Zhu, Jing
2018-04-01
Topological defects and their interactions often give rise to multiple types of emerging phenomena, from edge states in Skyrmions to disclination pairs in liquid crystals. In hexagonal manganites, partial edge dislocations, a prototype topological defect, are ubiquitous and they significantly alter the topologically protected domains and their behaviors. Herein, combining electron microscopy experiments and graph theory analysis, we report a systematic study of the connections and configurations of domains in this dislocation-embedded system. Rules for domain arrangement are established. The dividing line between domains, which can be attributed to the strain field of dislocations, is accurately described by a genus model from a higher dimension in graph theory. Our results open a door to the understanding of domain patterns in topologically protected multiferroic systems.
Got Graphs? An Assessment of Data Visualization Tools
NASA Technical Reports Server (NTRS)
Schaefer, C. M.; Foy, M.
2015-01-01
Graphs are powerful tools for simplifying complex data. They are useful for quickly assessing patterns and relationships among one or more variables from a dataset. As the amount of data increases, it becomes more difficult to visualize potential associations. Lifetime Surveillance of Astronaut Health (LSAH) was charged with assessing its current visualization tools along with others on the market to determine whether new tools would be useful for supporting NASA's occupational surveillance effort. It was concluded by members of LSAH that the current tools hindered their ability to provide quick results to researchers working with the department. Due to the high volume of data requests and the many iterations of visualizations requested by researchers, software with a better ability to replicate graphs and edit quickly could improve LSAH's efficiency and lead to faster research results.
Emergent 1D Ising Behavior in an Elementary Cellular Automaton Model
NASA Astrophysics Data System (ADS)
Kassebaum, Paul G.; Iannacchione, Germano S.
The fundamental nature of an evolving one-dimensional (1D) Ising model is investigated with an elementary cellular automaton (CA) simulation. The emergent CA simulation employs an ensemble of cells in one spatial dimension, each cell capable of two microstates interacting with simple nearest-neighbor rules and incorporating an external field. The behavior of the CA model provides insight into the dynamics of coupled two-state systems not expressible by exact analytical solutions. For instance, state progression graphs show the causal dynamics of a system through time in relation to the system's entropy. Unique graphical analysis techniques are introduced through difference patterns, diffusion patterns, and state progression graphs of the 1D ensemble visualizing the evolution. All analyses are consistent with the known behavior of the 1D Ising system. The CA simulation and new pattern recognition techniques are scalable (in dimension, complexity, and size) and have many potential applications such as complex design of materials, control of agent systems, and evolutionary mechanism design.
VisualUrText: A Text Analytics Tool for Unstructured Textual Data
NASA Astrophysics Data System (ADS)
Zainol, Zuraini; Jaymes, Mohd T. H.; Nohuddin, Puteri N. E.
2018-05-01
The amount of unstructured text on the Internet is growing tremendously. Text repositories come from Web 2.0, business intelligence and social networking applications. It is also believed that 80-90% of future data growth will be in the form of unstructured text databases that may potentially contain interesting patterns and trends. Text mining is a well-known technique for discovering interesting patterns and trends, which constitute non-trivial knowledge, from massive unstructured text data. Text mining covers multidisciplinary fields involving information retrieval (IR), text analysis, natural language processing (NLP), data mining, machine learning, statistics and computational linguistics. This paper discusses the development of a text analytics tool that is proficient in extracting, processing and analyzing unstructured text data and visualizing the cleaned text data in multiple forms such as Document Term Matrix (DTM), Frequency Graph, Network Analysis Graph, Word Cloud and Dendrogram. This tool, VisualUrText, was developed to assist students and researchers in extracting interesting patterns and trends in document analyses.
Gurunathan, Rajalakshmi; Van Emden, Bernard; Panchanathan, Sethuraman; Kumar, Sudhir
2004-01-01
Background: Modern developmental biology relies heavily on the analysis of embryonic gene expression patterns. Investigators manually inspect hundreds or thousands of expression patterns to identify those that are spatially similar and to ultimately infer potential gene interactions. However, the rapid accumulation of gene expression pattern data over the last two decades, facilitated by high-throughput techniques, has produced a need for the development of efficient approaches for direct comparison of images, rather than their textual descriptions, to identify spatially similar expression patterns. Results: The effectiveness of the Binary Feature Vector (BFV) and Invariant Moment Vector (IMV) based digital representations of the gene expression patterns in finding biologically meaningful patterns was compared for a small (226 images) and a large (1819 images) dataset. For each dataset, an ordered list of images, with respect to a query image, was generated to identify overlapping and similar gene expression patterns, in a manner comparable to what a developmental biologist might do. The results showed that the BFV representation consistently outperforms the IMV representation in finding biologically meaningful matches when spatial overlap of the gene expression pattern and the genes involved are considered. Furthermore, we explored the value of conducting image-content based searches in a dataset where individual expression components (or domains) of multi-domain expression patterns were also included separately. We found that this technique improves performance of both IMV and BFV based searches. Conclusions: We conclude that the BFV representation consistently produces a more extensive and better list of biologically useful patterns than the IMV representation. The high quality of results obtained scales well as the search database becomes larger, which encourages efforts to build automated image query and retrieval systems for spatial gene expression patterns.
PMID:15603586
SKIMMR: facilitating knowledge discovery in life sciences by machine-aided skim reading
Burns, Gully A.P.C.
2014-01-01
Background. Unlike full reading, ‘skim-reading’ involves the process of looking quickly over information in an attempt to cover more material whilst still being able to retain a superficial view of the underlying content. Within this work, we specifically emulate this natural human activity by providing a dynamic graph-based view of entities automatically extracted from text. For the extraction, we use shallow parsing, co-occurrence analysis and semantic similarity computation techniques. Our main motivation is to assist biomedical researchers and clinicians in coping with increasingly large amounts of potentially relevant articles that are being published ongoingly in life sciences. Methods. To construct the high-level network overview of articles, we extract weighted binary statements from the text. We consider two types of these statements, co-occurrence and similarity, both organised in the same distributional representation (i.e., in a vector-space model). For the co-occurrence weights, we use point-wise mutual information that indicates the degree of non-random association between two co-occurring entities. For computing the similarity statement weights, we use cosine distance based on the relevant co-occurrence vectors. These statements are used to build fuzzy indices of terms, statements and provenance article identifiers, which support fuzzy querying and subsequent result ranking. These indexing and querying processes are then used to construct a graph-based interface for searching and browsing entity networks extracted from articles, as well as articles relevant to the networks being browsed. Last but not least, we describe a methodology for automated experimental evaluation of the presented approach. The method uses formal comparison of the graphs generated by our tool to relevant gold standards based on manually curated PubMed, TREC challenge and MeSH data. Results. 
We provide a web-based prototype (called ‘SKIMMR’) that generates a network of inter-related entities from a set of documents which a user may explore through our interface. When a particular area of the entity network looks interesting to a user, the tool displays the documents that are the most relevant to those entities of interest currently shown in the network. We present this as a methodology for browsing a collection of research articles. To illustrate the practical applicability of SKIMMR, we present examples of its use in the domains of Spinal Muscular Atrophy and Parkinson’s Disease. Finally, we report on the results of experimental evaluation using the two domains and one additional dataset based on the TREC challenge. The results show how the presented method for machine-aided skim reading outperforms tools like PubMed regarding focused browsing and informativeness of the browsing context. PMID:25097821
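The two statement weights named in the Methods, point-wise mutual information for co-occurrence and a cosine measure over co-occurrence vectors for similarity, can be sketched as follows. This is a minimal sketch, not SKIMMR's implementation: the definition of a "context" (here, a set of entities seen together), the probability estimates, and all names are assumptions, and the code computes cosine similarity rather than a distance.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_weights(contexts):
    """PMI for each co-occurring entity pair: log2(p(x, y) / (p(x) * p(y))).

    Probabilities are estimated as the fraction of contexts containing
    the entity (or the pair); this estimator is an assumption.
    """
    n = len(contexts)
    single, pair = Counter(), Counter()
    for ctx in contexts:
        entities = sorted(set(ctx))
        single.update(entities)
        pair.update(combinations(entities, 2))
    return {
        (a, b): math.log2((k / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), k in pair.items()
    }


def cooccurrence_vectors(weights):
    """Sparse vector per entity: neighbour -> PMI weight."""
    vectors = {}
    for (a, b), w in weights.items():
        vectors.setdefault(a, {})[b] = w
        vectors.setdefault(b, {})[a] = w
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy contexts with placeholder entity names (hypothetical data).
contexts = [{"a", "b"}, {"a", "b"}, {"b", "c"}, {"d", "e"}]
weights = pmi_weights(contexts)
vectors = cooccurrence_vectors(weights)
```

In this toy input, entities "a" and "c" never co-occur directly but share the neighbour "b", so the similarity statement can link them even though no co-occurrence statement does.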
... 2) attainable (doable); and (3) forgiving (less than perfect). "Exercise more" is a great goal, but it's ... graph or table, however, remember that one day's diet and exercise patterns won't have a measurable ...
A Note on Interfacing Object Warehouses and Mass Storage Systems for Data Mining Applications
NASA Technical Reports Server (NTRS)
Grossman, Robert L.; Northcutt, Dave
1996-01-01
Data mining is the automatic discovery of patterns, associations, and anomalies in data sets. Data mining requires numerically and statistically intensive queries. Our assumption is that data mining requires a specialized data management infrastructure to support the aforementioned intensive queries, but because of the sizes of data involved, this infrastructure is layered over a hierarchical storage system. In this paper, we discuss the architecture of a system which is layered for modularity, but exploits specialized lightweight services to maintain efficiency. Rather than use a full functioned database for example, we use light weight object services specialized for data mining. We propose using information repositories between layers so that components on either side of the layer can access information in the repositories to assist in making decisions about data layout, the caching and migration of data, the scheduling of queries, and related matters.
Stratification-Based Outlier Detection over the Deep Web
Xian, Xuefeng; Zhao, Pengpeng; Sheng, Victor S.; Fang, Ligang; Gu, Caidong; Yang, Yuanfeng; Cui, Zhiming
2016-01-01
For many applications, finding rare instances or outliers can be more interesting than finding common patterns. Existing work in outlier detection never considers the context of deep web. In this paper, we argue that, for many scenarios, it is more meaningful to detect outliers over deep web. In the context of deep web, users must submit queries through a query interface to retrieve corresponding data. Therefore, traditional data mining methods cannot be directly applied. The primary contribution of this paper is to develop a new data mining method for outlier detection over deep web. In our approach, the query space of a deep web data source is stratified based on a pilot sample. Neighborhood sampling and uncertainty sampling are developed in this paper with the goal of improving recall and precision based on stratification. Finally, a careful performance evaluation of our algorithm confirms that our approach can effectively detect outliers in deep web. PMID:27313603
Horvath, Dragos; Marcou, Gilles; Varnek, Alexandre
2013-07-22
This study is an exhaustive analysis of the neighborhood behavior over a large coherent data set (ChEMBL target/ligand pairs of known Ki, for 165 targets with >50 associated ligands each). It focuses on similarity-based virtual screening (SVS) success defined by the ascertained optimality index. This is a weighted compromise between purity and retrieval rate of active hits in the neighborhood of an active query. One key issue addressed here is the impact of Tversky asymmetric weighing of query vs candidate features (represented as integer-value ISIDA colored fragment/pharmacophore triplet count descriptor vectors). The nearly 3/4 million independent SVS runs showed that Tversky scores with a strong bias in favor of query-specific features are, by far, the most successful and the least failure-prone out of a set of nine other dissimilarity scores. These include classical Tanimoto, which failed to defend its privileged status in practical SVS applications. Tversky performance is not significantly conditioned by tuning of its bias parameter α. Both initial "guesses" of α = 0.9 and 0.7 were more successful than Tanimoto (which was, in turn, better than Euclid). Tversky was eventually tested in exhaustive similarity searching within the library of 1.6 M commercial + bioactive molecules at http://infochim.u-strasbg.fr/webserv/VSEngine.html , comparing favorably to Tanimoto in terms of "scaffold hopping" propensity. Therefore, it should be used at least as often as, and perhaps in parallel with, Tanimoto in SVS. Analysis with respect to query subclasses highlighted relationships of query complexity (simply expressed in terms of pharmacophore pattern counts) and/or target nature vs SVS success likelihood. SVS using more complex queries are more robust with respect to the choice of their operational premises (descriptors, metric). Yet, they are best handled by "pro-query" Tversky scores at α > 0.5.
Among simpler queries, one may distinguish between "growable" (allowing for active analogs with additional features), and a few "conservative" queries not allowing any growth. These (typically bioactive amine transporter ligands) form the specific application domain of "pro-candidate" biased Tversky scores at α < 0.5.
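To make the asymmetric weighting concrete, here is a minimal sketch of a Tversky score on integer count vectors. The parametrization is an assumption: query-only features are weighted by α and candidate-only features by 1 − α, so α > 0.5 is "pro-query"; the abstract does not spell out the exact formula used in the study. Shared counts are taken as the element-wise minimum, and the vectors below are toy data.

```python
def _overlap(query, candidate):
    """Shared, query-only and candidate-only feature counts."""
    common = sum(min(q, c) for q, c in zip(query, candidate))
    only_query = sum(max(q - c, 0) for q, c in zip(query, candidate))
    only_candidate = sum(max(c - q, 0) for q, c in zip(query, candidate))
    return common, only_query, only_candidate


def tversky(query, candidate, alpha):
    """Tversky similarity with weights alpha (query-only) and 1 - alpha (candidate-only).

    This one-parameter form is an assumed convention for illustration.
    """
    common, only_query, only_candidate = _overlap(query, candidate)
    denom = common + alpha * only_query + (1.0 - alpha) * only_candidate
    return common / denom if denom else 0.0


def tanimoto(query, candidate):
    """Symmetric reference score: both kinds of mismatch weigh 1."""
    common, only_query, only_candidate = _overlap(query, candidate)
    denom = common + only_query + only_candidate
    return common / denom if denom else 0.0


# Hypothetical fragment-count vectors for a query and a larger candidate.
query = [2, 1, 0]
candidate = [1, 1, 3]
pro_query = tversky(query, candidate, 0.9)
pro_candidate = tversky(query, candidate, 0.1)
```

With α close to 1, a candidate that contains everything in the query plus extra features is penalized very little, which matches the "growable" query case described above; α < 0.5 penalizes those extras instead.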
Foundations for Streaming Model Transformations by Complex Event Processing.
Dávid, István; Ráth, István; Varró, Dániel
2018-01-01
Streaming model transformations represent a novel class of transformations to manipulate models whose elements are continuously produced or modified in high volume and with rapid rate of change. Executing streaming transformations requires efficient techniques to recognize activated transformation rules over a live model and a potentially infinite stream of events. In this paper, we propose foundations of streaming model transformations by innovatively integrating incremental model query, complex event processing (CEP) and reactive (event-driven) transformation techniques. Complex event processing makes it possible to identify relevant patterns and sequences of events over an event stream. Our approach enables event streams to include model change events which are automatically and continuously populated by incremental model queries. Furthermore, a reactive rule engine carries out transformations on identified complex event patterns. We provide an integrated domain-specific language with precise semantics for capturing complex event patterns and streaming transformations together with an execution engine, all of which is now part of the Viatra reactive transformation framework. We demonstrate the feasibility of our approach with two case studies: one in an advanced model engineering workflow; and one in the context of on-the-fly gesture recognition.
An ontology design pattern for surface water features
Sinha, Gaurav; Mark, David; Kolas, Dave; Varanka, Dalia; Romero, Boleslo E.; Feng, Chen-Chieh; Usery, E. Lynn; Liebermann, Joshua; Sorokine, Alexandre
2014-01-01
Surface water is a primary concept of human experience but concepts are captured in cultures and languages in many different ways. Still, many commonalities exist due to the physical basis of many of the properties and categories. An abstract ontology of surface water features based only on those physical properties of landscape features has the best potential for serving as a foundational domain ontology for other more context-dependent ontologies. The Surface Water ontology design pattern was developed both for domain knowledge distillation and to serve as a conceptual building-block for more complex or specialized surface water ontologies. A fundamental distinction is made in this ontology between landscape features that act as containers (e.g., stream channels, basins) and the bodies of water (e.g., rivers, lakes) that occupy those containers. The semantics of concave (container) landforms are specified in a Dry module and the semantics of contained bodies of water in a Wet module. The pattern is implemented in OWL, but Description Logic axioms and a detailed explanation are provided in this paper. The OWL ontology will be an important contribution to Semantic Web vocabulary for annotating surface water feature datasets. Also provided is a discussion of why there is a need to complement the pattern with other ontologies, especially the previously developed Surface Network pattern. Finally, the practical value of the pattern in semantic querying of surface water datasets is illustrated through an annotated geospatial dataset and sample queries using the classes of the Surface Water pattern.
Asking better questions: How presentation formats influence information search.
Wu, Charley M; Meder, Björn; Filimon, Flavia; Nelson, Jonathan D
2017-08-01
While the influence of presentation formats has been widely studied in Bayesian reasoning tasks, we present the first systematic investigation of how presentation formats influence information search decisions. Four experiments were conducted across different probabilistic environments, where subjects (N = 2,858) chose between 2 possible search queries, each with binary probabilistic outcomes, with the goal of maximizing classification accuracy. We studied 14 different numerical and visual formats for presenting information about the search environment, constructed across 6 design features that have been prominently related to improvements in Bayesian reasoning accuracy (natural frequencies, posteriors, complement, spatial extent, countability, and part-to-whole information). The posterior variants of the icon array and bar graph formats led to the highest proportion of correct responses, and were substantially better than the standard probability format. Results suggest that presenting information in terms of posterior probabilities and visualizing natural frequencies using spatial extent (a perceptual feature) were especially helpful in guiding search decisions, although environments with a mixture of probabilistic and certain outcomes were challenging across all formats. Subjects who made more accurate probability judgments did not perform better on the search task, suggesting that simple decision heuristics may be used to make search decisions without explicitly applying Bayesian inference to compute probabilities. We propose a new take-the-difference (TTD) heuristic that identifies the accuracy-maximizing query without explicit computation of posterior probabilities. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
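For reference, the "accuracy-maximizing query" in a task of this kind can be found by direct Bayesian computation: for each query, sum over outcomes the joint probability of the most likely class. The sketch below does exactly that; it is the explicit calculation, not the TTD heuristic (which the abstract does not define), and the priors, likelihoods and names are hypothetical.

```python
def expected_accuracy(prior, likelihoods):
    """Expected classification accuracy after observing a query's outcome.

    prior[c] = P(class c); likelihoods[c][o] = P(outcome o | class c).
    After each outcome the most probable class is chosen, so the expected
    accuracy is the sum over outcomes of max_c P(c) * P(o | c).
    """
    n_outcomes = len(likelihoods[0])
    return sum(
        max(prior[c] * likelihoods[c][o] for c in range(len(prior)))
        for o in range(n_outcomes)
    )


def best_query(prior, queries):
    """Name of the query with the highest expected accuracy."""
    return max(queries, key=lambda name: expected_accuracy(prior, queries[name]))


# Two classes, two candidate queries with binary outcomes (toy numbers).
prior = [0.7, 0.3]
queries = {
    "A": [[0.9, 0.1], [0.2, 0.8]],
    "B": [[0.6, 0.4], [0.5, 0.5]],
}
choice = best_query(prior, queries)
```

In this toy environment query B never changes the best guess, so its expected accuracy equals the prior of the more likely class, while query A improves on it.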
Boes, Peter; Ho, Meng Wei; Li, Zuofeng
2015-01-01
Image‐guided radiotherapy (IGRT), based on radiopaque markers placed in the prostate gland, was used for proton therapy of prostate patients. Orthogonal X‐rays and the IBA Digital Image Positioning System (DIPS) were used for setup correction prior to treatment and were repeated after treatment delivery. Following a rationale for margin estimates similar to that of van Herk,(1) the daily post‐treatment DIPS data were analyzed to determine if an adaptive radiotherapy plan was necessary. A Web application using ASP.NET MVC5, Entity Framework, and an SQL database was designed to automate this process. The designed features included state‐of‐the‐art Web technologies, a domain model closely matching the workflow, a database‐supporting concurrency and data mining, access to the DIPS database, secured user access and roles management, and graphing and analysis tools. The Model‐View‐Controller (MVC) paradigm allowed clean domain logic, unit testing, and extensibility. Client‐side technologies, such as jQuery, jQuery Plug‐ins, and Ajax, were adopted to achieve a rich user environment and fast response. Data models included patients, staff, treatment fields and records, correction vectors, DIPS images, and association logics. Data entry, analysis, workflow logics, and notifications were implemented. The system effectively modeled the clinical workflow and IGRT process. PACS number: 87 PMID:26103504
Phase-locked patterns of the Kuramoto model on 3-regular graphs
NASA Astrophysics Data System (ADS)
DeVille, Lee; Ermentrout, Bard
2016-09-01
We consider the existence of non-synchronized fixed points to the Kuramoto model defined on sparse networks: specifically, networks where each vertex has degree exactly three. We show that "most" such networks support multiple attracting phase-locked solutions that are not synchronized and study the depth and width of the basins of attraction of these phase-locked solutions. We also show that it is common in "large enough" graphs to find phase-locked solutions where one or more of the links have angle difference greater than π/2.
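As a small illustration of a non-synchronized phase-locked state on a 3-regular graph, the sketch below checks a "twisted" state on a Möbius ladder (a cycle plus antipodal chords) for identical oscillators with unit coupling. The choice of graph, the twisted state, and all function names are assumptions made for this example; they are not taken from the paper.

```python
import math

def mobius_ladder(n):
    """Adjacency lists of a 3-regular graph: an n-cycle plus antipodal chords (n even)."""
    assert n % 2 == 0 and n >= 4
    return [[(i - 1) % n, (i + 1) % n, (i + n // 2) % n] for i in range(n)]

def kuramoto_rhs(theta, adj):
    """d(theta_i)/dt = sum_j sin(theta_j - theta_i): identical oscillators, unit coupling."""
    return [sum(math.sin(theta[j] - theta[i]) for j in adj[i])
            for i in range(len(theta))]

def twisted_state(n, q):
    """Phases that wind q times around the cycle."""
    return [2 * math.pi * q * i / n for i in range(n)]

def order_parameter(theta):
    """Magnitude of the mean of exp(i*theta); 1 means full synchrony."""
    re = sum(math.cos(t) for t in theta) / len(theta)
    im = sum(math.sin(t) for t in theta) / len(theta)
    return math.hypot(re, im)

n, q = 16, 2
adj = mobius_ladder(n)
theta = twisted_state(n, q)

# Largest right-hand-side entry: (numerically) zero at a fixed point.
residual = max(abs(v) for v in kuramoto_rhs(theta, adj))

# Smallest cos(angle difference) over all edges. If every value is positive,
# the linearization is minus a weighted graph Laplacian with positive weights,
# which is a sufficient condition for stability up to a global phase shift.
min_cos = min(math.cos(theta[j] - theta[i]) for i in range(n) for j in adj[i])

print(residual, order_parameter(theta), min_cos)
```

In this particular example all angle differences stay below π/2, so it does not exhibit the larger-than-π/2 links discussed in the abstract; it only shows that a stable, non-synchronized fixed point can exist on a degree-three network.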
A Survey of Kurdish Students' Sound Segment & Syllabic Pattern Errors in the Course of Learning EFL
ERIC Educational Resources Information Center
Mohammadi, Jahangir
2014-01-01
This paper is devoted to finding adequate answers to the following queries: (A) what are the segmental and syllabic pattern errors made by Kurdish students in their pronunciation? (B) Can the problematic areas in pronunciation be predicted by a systematic comparison of the sound systems of both native and target languages? (C) Can there be any…
An incremental database access method for autonomous interoperable databases
NASA Technical Reports Server (NTRS)
Roussopoulos, Nicholas; Sellis, Timos
1994-01-01
We investigated a number of design and performance issues of interoperable database management systems (DBMS's). The major results of our investigation were obtained in the areas of client-server database architectures for heterogeneous DBMS's, incremental computation models, buffer management techniques, and query optimization. We finished a prototype of an advanced client-server workstation-based DBMS which allows access to multiple heterogeneous commercial DBMS's. Experiments and simulations were then run to compare its performance with the standard client-server architectures. The focus of this research was on adaptive optimization methods of heterogeneous database systems. Adaptive buffer management accounts for the random and object-oriented access methods for which no known characterization of the access patterns exists. Adaptive query optimization means that value distributions and selectivities, which play the most significant role in query plan evaluation, are continuously refined to reflect the actual values as opposed to static ones that are computed off-line. Query feedback is a concept that was first introduced to the literature by our group. We employed query feedback for both adaptive buffer management and for computing value distributions and selectivities. For adaptive buffer management, we use the page faults of prior executions to achieve more 'informed' management decisions. For the estimation of the distributions of the selectivities, we use curve-fitting techniques, such as least squares and splines, for regressing on these values.
New concepts for building vocabulary for cell image ontologies.
Plant, Anne L; Elliott, John T; Bhat, Talapady N
2011-12-21
There are significant challenges associated with the building of ontologies for cell biology experiments including the large numbers of terms and their synonyms. These challenges make it difficult to simultaneously query data from multiple experiments or ontologies. If vocabulary terms were consistently used and reused across and within ontologies, queries would be possible through shared terms. One approach to achieving this is to strictly control the terms used in ontologies in the form of a pre-defined schema, but this approach limits the individual researcher's ability to create new terms when needed to describe new experiments. Here, we propose the use of a limited number of highly reusable common root terms, and rules for an experimentalist to locally expand terms by adding more specific terms under more general root terms to form specific new vocabulary hierarchies that can be used to build ontologies. We illustrate the application of the method to build vocabularies and a prototype database for cell images that uses a visual data-tree of terms to facilitate sophisticated queries based on experimental parameters. We demonstrate how the terminology might be extended by adding new vocabulary terms into the hierarchy of terms in an evolving process. In this approach, image data and metadata are handled separately, so we also describe a robust file-naming scheme to unambiguously identify image and other files associated with each metadata value. The prototype database http://sbd.nist.gov/ consists of more than 2000 images of cells and benchmark materials, and 163 metadata terms that describe experimental details, including many details about cell culture and handling. Image files of interest can be retrieved, and their data can be compared, by choosing one or more relevant metadata values as search terms. Metadata values for any dataset can be compared with corresponding values of another dataset through logical operations.
Organizing metadata for cell imaging experiments under a framework of rules that include highly reused root terms will facilitate the addition of new terms into a vocabulary hierarchy and encourage the reuse of terms. These vocabulary hierarchies can be converted into XML schema or RDF graphs for displaying and querying, but this is not necessary for using it to annotate cell images. Vocabulary data trees from multiple experiments or laboratories can be aligned at the root terms to facilitate query development. This approach of developing vocabularies is compatible with the major advances in database technology and could be used for building the Semantic Web.
New concepts for building vocabulary for cell image ontologies
2011-01-01
Background There are significant challenges associated with the building of ontologies for cell biology experiments including the large numbers of terms and their synonyms. These challenges make it difficult to simultaneously query data from multiple experiments or ontologies. If vocabulary terms were consistently used and reused across and within ontologies, queries would be possible through shared terms. One approach to achieving this is to strictly control the terms used in ontologies in the form of a pre-defined schema, but this approach limits the individual researcher's ability to create new terms when needed to describe new experiments. Results Here, we propose the use of a limited number of highly reusable common root terms, and rules for an experimentalist to locally expand terms by adding more specific terms under more general root terms to form specific new vocabulary hierarchies that can be used to build ontologies. We illustrate the application of the method to build vocabularies and a prototype database for cell images that uses a visual data-tree of terms to facilitate sophisticated queries based on experimental parameters. We demonstrate how the terminology might be extended by adding new vocabulary terms into the hierarchy of terms in an evolving process. In this approach, image data and metadata are handled separately, so we also describe a robust file-naming scheme to unambiguously identify image and other files associated with each metadata value. The prototype database http://sbd.nist.gov/ consists of more than 2000 images of cells and benchmark materials, and 163 metadata terms that describe experimental details, including many details about cell culture and handling. Image files of interest can be retrieved, and their data can be compared, by choosing one or more relevant metadata values as search terms. Metadata values for any dataset can be compared with corresponding values of another dataset through logical operations.
Conclusions Organizing metadata for cell imaging experiments under a framework of rules that include highly reused root terms will facilitate the addition of new terms into a vocabulary hierarchy and encourage the reuse of terms. These vocabulary hierarchies can be converted into XML schema or RDF graphs for displaying and querying, but this is not necessary for using it to annotate cell images. Vocabulary data trees from multiple experiments or laboratories can be aligned at the root terms to facilitate query development. This approach of developing vocabularies is compatible with the major advances in database technology and could be used for building the Semantic Web. PMID:22188658
Kawazoe, Yoshimasa; Imai, Takeshi; Ohe, Kazuhiko
2016-04-05
Health level seven version 2.5 (HL7 v2.5) is a widespread messaging standard for information exchange between clinical information systems. By applying Semantic Web technologies for handling HL7 v2.5 messages, it is possible to integrate large-scale clinical data with life science knowledge resources. The objective was to show the feasibility of a querying method over large-scale resource description framework (RDF)-ized HL7 v2.5 messages using publicly available drug databases. We developed a method to convert HL7 v2.5 messages into the RDF. We also converted five kinds of drug databases into RDF and provided explicit links between the corresponding items among them. With those linked drug data, we then developed a method for query expansion to search the clinical data using semantic information on drug classes along with four types of temporal patterns. For evaluation purposes, medication orders and laboratory test results for a 3-year period at the University of Tokyo Hospital were used, and the query execution times were measured. Approximately 650 million RDF triples for medication orders and 790 million RDF triples for laboratory test results were converted. Taking three types of query in use cases for detecting adverse events of drugs as an example, we confirmed that these queries could be represented in SPARQL Protocol and RDF Query Language (SPARQL) using our methods, and comparisons with conventional query expressions were performed. The measurement results confirm that the query time is feasible and increases logarithmically or linearly with the amount of data and without diverging. The proposed methods enabled query expressions that separate knowledge resources and clinical data, thereby suggesting the feasibility for improving the usability of clinical data by enhancing the knowledge resources. We also demonstrate that when HL7 v2.5 messages are automatically converted into RDF, searches are still possible through SPARQL without modifying the structure.
As such, the proposed method benefits not only our hospitals, but also numerous hospitals that handle HL7 v2.5 messages. Our approach highlights a potential of large-scale data federation techniques to retrieve clinical information, which could be applied as applications of clinical intelligence to improve clinical practices, such as adverse drug event monitoring and cohort selection for a clinical study as well as discovering new knowledge from clinical information.
GEM-TREND: a web tool for gene expression data mining toward relevant network discovery
Feng, Chunlai; Araki, Michihiro; Kunimoto, Ryo; Tamon, Akiko; Makiguchi, Hiroki; Niijima, Satoshi; Tsujimoto, Gozoh; Okuno, Yasushi
2009-01-01
Background DNA microarray technology provides us with a first step toward the goal of uncovering gene functions on a genomic scale. In recent years, vast amounts of gene expression data have been collected, much of which are available in public databases, such as the Gene Expression Omnibus (GEO). To date, most researchers have been manually retrieving data from databases through web browsers using accession numbers (IDs) or keywords, but gene-expression patterns are not considered when retrieving such data. The Connectivity Map was recently introduced to compare gene expression data by introducing gene-expression signatures (represented by a set of genes with up- or down-regulated labels according to their biological states) and is available as a web tool for detecting similar gene-expression signatures from a limited data set (approximately 7,000 expression profiles representing 1,309 compounds). In order to support researchers to utilize the public gene expression data more effectively, we developed a web tool for finding similar gene expression data and generating its co-expression networks from a publicly available database. Results GEM-TREND, a web tool for searching gene expression data, allows users to search data from GEO using gene-expression signatures or gene expression ratio data as a query and retrieve gene expression data by comparing gene-expression pattern between the query and GEO gene expression data. The comparison methods are based on the nonparametric, rank-based pattern matching approach of Lamb et al. (Science 2006) with the additional calculation of statistical significance. The web tool was tested using gene expression ratio data randomly extracted from the GEO and with in-house microarray data, respectively. The results validated the ability of GEM-TREND to retrieve gene expression entries biologically related to a query from GEO. 
For further analysis, a network visualization interface is also provided, whereby genes and gene annotations are dynamically linked to external data repositories. Conclusion GEM-TREND was developed to retrieve gene expression data by comparing query gene-expression pattern with those of GEO gene expression data. It could be a very useful resource for finding similar gene expression profiles and constructing its gene co-expression networks from a publicly available database. GEM-TREND was designed to be user-friendly and is expected to support knowledge discovery. GEM-TREND is freely available at http://cgs.pharm.kyoto-u.ac.jp/services/network. PMID:19728865
GEM-TREND: a web tool for gene expression data mining toward relevant network discovery.
Feng, Chunlai; Araki, Michihiro; Kunimoto, Ryo; Tamon, Akiko; Makiguchi, Hiroki; Niijima, Satoshi; Tsujimoto, Gozoh; Okuno, Yasushi
2009-09-03
DNA microarray technology provides us with a first step toward the goal of uncovering gene functions on a genomic scale. In recent years, vast amounts of gene expression data have been collected, much of which are available in public databases, such as the Gene Expression Omnibus (GEO). To date, most researchers have been manually retrieving data from databases through web browsers using accession numbers (IDs) or keywords, but gene-expression patterns are not considered when retrieving such data. The Connectivity Map was recently introduced to compare gene expression data by introducing gene-expression signatures (represented by a set of genes with up- or down-regulated labels according to their biological states) and is available as a web tool for detecting similar gene-expression signatures from a limited data set (approximately 7,000 expression profiles representing 1,309 compounds). In order to support researchers to utilize the public gene expression data more effectively, we developed a web tool for finding similar gene expression data and generating its co-expression networks from a publicly available database. GEM-TREND, a web tool for searching gene expression data, allows users to search data from GEO using gene-expression signatures or gene expression ratio data as a query and retrieve gene expression data by comparing gene-expression pattern between the query and GEO gene expression data. The comparison methods are based on the nonparametric, rank-based pattern matching approach of Lamb et al. (Science 2006) with the additional calculation of statistical significance. The web tool was tested using gene expression ratio data randomly extracted from the GEO and with in-house microarray data, respectively. The results validated the ability of GEM-TREND to retrieve gene expression entries biologically related to a query from GEO. 
For further analysis, a network visualization interface is also provided, whereby genes and gene annotations are dynamically linked to external data repositories. GEM-TREND was developed to retrieve gene expression data by comparing query gene-expression pattern with those of GEO gene expression data. It could be a very useful resource for finding similar gene expression profiles and constructing its gene co-expression networks from a publicly available database. GEM-TREND was designed to be user-friendly and is expected to support knowledge discovery. GEM-TREND is freely available at http://cgs.pharm.kyoto-u.ac.jp/services/network.
C-quence: a tool for analyzing qualitative sequential data.
Duncan, Starkey; Collier, Nicholson T
2002-02-01
C-quence is a software application that matches sequential patterns of qualitative data specified by the user and calculates the rate of occurrence of these patterns in a data set. Although it was designed to facilitate analyses of face-to-face interaction, it is applicable to any data set involving categorical data and sequential information. C-quence queries are constructed using a graphical user interface. The program does not limit the complexity of the sequential patterns specified by the user.
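To make the idea of matching a sequential pattern and reporting its rate of occurrence concrete, here is a minimal sketch. It is not C-quence's own algorithm or query format: the pattern representation (one set of acceptable categories per step), the rate definition (matches divided by possible start positions), and the event labels are all assumptions chosen for illustration.

```python
def pattern_rate(events, pattern):
    """Return (count, rate) of contiguous matches of `pattern` in `events`.

    `events` is a list of category labels in temporal order.
    `pattern` is a list of sets; step k matches if the event is in pattern[k].
    The rate is the match count divided by the number of start positions.
    """
    k = len(pattern)
    windows = len(events) - k + 1
    if k == 0 or windows <= 0:
        return 0, 0.0
    count = sum(
        all(events[start + offset] in allowed
            for offset, allowed in enumerate(pattern))
        for start in range(windows)
    )
    return count, count / windows

# Hypothetical coded interaction stream.
events = ["gaze", "smile", "speak", "gaze", "nod", "speak", "gaze", "smile", "nod"]

# "gaze" immediately followed by either "smile" or "nod".
pattern = [{"gaze"}, {"smile", "nod"}]

count, rate = pattern_rate(events, pattern)
print(count, rate)
```

A real tool would also need gaps, overlaps, and concurrent event streams; this sketch covers only strictly consecutive single-stream patterns.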
Applying network theory to animal movements to identify properties of landscape space use.
Bastille-Rousseau, Guillaume; Douglas-Hamilton, Iain; Blake, Stephen; Northrup, Joseph M; Wittemyer, George
2018-04-01
Network (graph) theory is a popular analytical framework to characterize the structure and dynamics among discrete objects and is particularly effective at identifying critical hubs and patterns of connectivity. The identification of such attributes is a fundamental objective of animal movement research, yet network theory has rarely been applied directly to animal relocation data. We develop an approach that allows the analysis of movement data using network theory by defining occupied pixels as nodes and connection among these pixels as edges. We first quantify node-level (local) metrics and graph-level (system) metrics on simulated movement trajectories to assess the ability of these metrics to pull out known properties in movement paths. We then apply our framework to empirical data from African elephants (Loxodonta africana), giant Galapagos tortoises (Chelonoidis spp.), and mule deer (Odocoileus hemionus). Our results indicate that certain node-level metrics, namely degree, weight, and betweenness, perform well in capturing local patterns of space use, such as the definition of core areas and paths used for inter-patch movement. These metrics were generally applicable across data sets, indicating their robustness to assumptions structuring analysis or strategies of movement. Other metrics capture local patterns effectively, but were sensitive to specified graph properties, indicating case specific applications. Our analysis indicates that graph-level metrics are unlikely to outperform other approaches for the categorization of general movement strategies (central place foraging, migration, nomadism). By identifying critical nodes, our approach provides a robust quantitative framework to identify local properties of space use that can be used to evaluate the effect of the loss of specific nodes on range wide connectivity.
Our network approach is intuitive, and can be implemented across imperfectly sampled or large-scale data sets efficiently, providing a framework for conservationists to analyze movement data. Functions created for the analyses are available within the R package moveNT. © 2018 by the Ecological Society of America.
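The construction described above (occupied pixels as nodes, movements between pixels as edges) can be sketched as follows. This is not the moveNT implementation: the square grid, the metric definitions (weight as relocations per pixel, degree as number of distinct neighbouring pixels), the function name, and the toy track are assumptions for illustration only.

```python
from collections import Counter

def movement_network(points, cell_size):
    """Build a pixel network from an ordered list of (x, y) relocations.

    Returns:
      weight: Counter mapping pixel -> number of relocations in that pixel
      edges:  Counter mapping (from_pixel, to_pixel) -> number of transitions
      degree: dict mapping pixel -> number of distinct connected pixels
    """
    pixels = [(int(x // cell_size), int(y // cell_size)) for x, y in points]
    weight = Counter(pixels)
    # Consecutive relocations in different pixels define a directed edge.
    edges = Counter((a, b) for a, b in zip(pixels, pixels[1:]) if a != b)
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    degree = {p: len(neighbours.get(p, ())) for p in weight}
    return weight, edges, degree

# Hypothetical track in arbitrary map units.
track = [(0.2, 0.1), (0.7, 0.4), (1.3, 0.2), (1.6, 1.4), (0.4, 0.3), (1.1, 0.9)]
weight, edges, degree = movement_network(track, cell_size=1.0)
print(weight.most_common(1), dict(edges), degree)
```

Betweenness and the graph-level metrics mentioned in the abstract would require a shortest-path computation on top of `edges`, which is omitted here.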
Seasonal trends in tinnitus symptomatology: evidence from Internet search engine query data.
Plante, David T; Ingram, David G
2015-10-01
The primary aim of this study was to test the hypothesis that the symptom of tinnitus demonstrates a seasonal pattern with worsening in the winter relative to the summer using Internet search engine query data. Normalized search volume for the term 'tinnitus' from January 2004 through December 2013 was retrieved from Google Trends. Seasonal effects were evaluated using cosinor regression models. Primary countries of interest were the United States and Australia. Secondary exploratory analyses were also performed using data from Germany, the United Kingdom, Canada, Sweden, and Switzerland. Significant seasonal effects for 'tinnitus' search queries were found in the United States and Australia (p < 0.00001 for both countries), with peaks in the winter and troughs in the summer. Secondary analyses demonstrated similarly significant seasonal effects for Germany (p < 0.00001), Canada (p < 0.00001), and Sweden (p = 0.0008), again with increased search volume in the winter relative to the summer. Our findings indicate that there are significant seasonal trends for Internet search queries for tinnitus, with a zenith in winter months. Further research is indicated to determine the biological mechanisms underlying these findings, as they may provide insights into the pathophysiology of this common and debilitating medical symptom.
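A single-component cosinor model with a 12-month period can be written as y(t) = M + A cos(2πt/12) + B sin(2πt/12), with amplitude sqrt(A² + B²) and peak time given by atan2(B, A). The sketch below fits it to synthetic monthly data, not to the Google Trends series used in the study. It assumes evenly spaced monthly samples covering whole years, in which case the least-squares solution reduces to simple projections; the function name and the test signal are hypothetical.

```python
import math

def fit_cosinor_monthly(values, period=12):
    """Least-squares cosinor fit for evenly spaced data spanning whole periods.

    Returns (mesor, amplitude, peak_time), where peak_time is the time index
    (modulo `period`) at which the fitted curve is highest.
    """
    n = len(values)
    assert n % period == 0, "this shortcut assumes complete periods"
    w = 2 * math.pi / period
    mesor = sum(values) / n
    # With complete periods the cosine and sine regressors are orthogonal,
    # so each coefficient is a plain projection of the data.
    a = 2 / n * sum(y * math.cos(w * t) for t, y in enumerate(values))
    b = 2 / n * sum(y * math.sin(w * t) for t, y in enumerate(values))
    amplitude = math.hypot(a, b)
    peak_time = (math.atan2(b, a) / w) % period
    return mesor, amplitude, peak_time

# Synthetic ten-year series: mean 50, amplitude 10, peak one month after t = 0.
series = [50 + 10 * math.cos(2 * math.pi * (t - 1) / 12) for t in range(120)]
mesor, amplitude, peak_time = fit_cosinor_monthly(series)
print(mesor, amplitude, peak_time)
```

For irregular sampling or incomplete years, the same model would have to be fitted with a general linear least-squares solver, and significance testing of the seasonal term is a separate step not shown here.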
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cheng, Shaobo; Zhang, Dong; Deng, Shiqing
Topological defects and their interactions often arouse multiple types of emerging phenomena from edge states in Skyrmions to disclination pairs in liquid crystals. In hexagonal manganites, partial edge dislocations, a prototype topological defect, are ubiquitous and they significantly alter the topologically protected domains and their behaviors. In this work, combining electron microscopy experiment and graph theory analysis, we report a systematic study of the connections and configurations of domains in this dislocation embedded system. Rules for domain arrangement are established. The dividing line between domains, which can be attributed to the strain field of dislocations, is accurately described by a genus model from a higher dimension in the graph theory. In conclusion, our results open a door for the understanding of domain patterns in topologically protected multiferroic systems.
Multiscale limited penetrable horizontal visibility graph for analyzing nonlinear time series
NASA Astrophysics Data System (ADS)
Gao, Zhong-Ke; Cai, Qing; Yang, Yu-Xuan; Dang, Wei-Dong; Zhang, Shan-Shan
2016-10-01
The visibility graph has established itself as a powerful tool for analyzing time series. In this paper, we develop a novel multiscale limited penetrable horizontal visibility graph (MLPHVG). We use nonlinear time series from two typical complex systems, i.e., EEG signals and two-phase flow signals, to demonstrate the effectiveness of our method. Combining MLPHVG and a support vector machine, we detect epileptic seizures from the EEG signals recorded from healthy subjects and epilepsy patients, and the classification accuracy is 100%. In addition, we derive MLPHVGs from oil-water two-phase flow signals and find that the average clustering coefficient at different scales allows faithfully identifying and characterizing three typical oil-water flow patterns. These findings render our MLPHVG method particularly useful for analyzing nonlinear time series from the perspective of multiscale network analysis.
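A rough sketch of how such a graph might be built is given below. It assumes the common reading of "limited penetrable horizontal visibility" in which two samples are linked when at most a fixed number of intermediate samples reach the lower of their two heights, and it assumes that "multiscale" means coarse-graining by non-overlapping window averages. The paper's exact definitions may differ, and the function and parameter names are hypothetical.

```python
def coarse_grain(series, scale):
    """Average consecutive non-overlapping windows of length `scale`."""
    return [sum(series[i:i + scale]) / scale
            for i in range(0, len(series) - scale + 1, scale)]

def lphvg_edges(series, penetrable=0):
    """Edges (i, j) with i < j of a limited penetrable horizontal visibility graph.

    Samples i and j are linked when at most `penetrable` samples strictly
    between them are at least as high as min(series[i], series[j]).
    penetrable=0 gives the ordinary horizontal visibility graph.
    """
    edges = set()
    n = len(series)
    for i in range(n - 1):
        for j in range(i + 1, n):
            level = min(series[i], series[j])
            blockers = sum(1 for k in range(i + 1, j) if series[k] >= level)
            if blockers <= penetrable:
                edges.add((i, j))
    return edges

signal = [3, 1, 2, 4]
hvg = lphvg_edges(signal, penetrable=0)
lphvg = lphvg_edges(signal, penetrable=1)
scale2 = coarse_grain(signal, 2)
print(sorted(hvg), sorted(lphvg), scale2)
```

The double loop is quadratic in the series length (cubic with the blocker count), which is acceptable for a demonstration but would need a faster construction for long EEG or flow recordings.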
Cheng, Shaobo; Zhang, Dong; Deng, Shiqing; ...
2018-04-19
Topological defects and their interactions often arouse multiple types of emerging phenomena from edge states in Skyrmions to disclination pairs in liquid crystals. In hexagonal manganites, partial edge dislocations, a prototype topological defect, are ubiquitous and they significantly alter the topologically protected domains and their behaviors. In this work, combining electron microscopy experiment and graph theory analysis, we report a systematic study of the connections and configurations of domains in this dislocation embedded system. Rules for domain arrangement are established. The dividing line between domains, which can be attributed to the strain field of dislocations, is accurately described by a genus model from a higher dimension in the graph theory. In conclusion, our results open a door for the understanding of domain patterns in topologically protected multiferroic systems.
Volatility behavior of visibility graph EMD financial time series from Ising interacting system
NASA Astrophysics Data System (ADS)
Zhang, Bo; Wang, Jun; Fang, Wen
2015-08-01
A financial market dynamics model is developed and investigated using a stochastic Ising system, where the Ising model is the most popular ferromagnetic model in statistical physics. Applying two graph-based analyses and the multiscale entropy method, we investigate and compare the statistical volatility behavior of the return time series and the corresponding intrinsic mode function (IMF) series derived from the empirical mode decomposition (EMD) method. Real stock market indices are studied for comparison with the simulation data of the proposed model. Further, we find that the degree distribution of the visibility graph for the simulation series has power-law tails, and the assortative network exhibits the mixing pattern property. All these features are in agreement with the real market data, which confirms that the financial model established by the Ising system is reasonable.
Figure-Ground Segmentation Using Factor Graphs
Shen, Huiying; Coughlan, James; Ivanchenko, Volodymyr
2009-01-01
Foreground-background segmentation has recently been applied [26,12] to the detection and segmentation of specific objects or structures of interest from the background as an efficient alternative to techniques such as deformable templates [27]. We introduce a graphical model (i.e. Markov random field)-based formulation of structure-specific figure-ground segmentation based on simple geometric features extracted from an image, such as local configurations of linear features, that are characteristic of the desired figure structure. Our formulation is novel in that it is based on factor graphs, which are graphical models that encode interactions among arbitrary numbers of random variables. The ability of factor graphs to express interactions higher than pairwise order (the highest order encountered in most graphical models used in computer vision) is useful for modeling a variety of pattern recognition problems. In particular, we show how this property makes factor graphs a natural framework for performing grouping and segmentation, and demonstrate that the factor graph framework emerges naturally from a simple maximum entropy model of figure-ground segmentation. We cast our approach in a learning framework, in which the contributions of multiple grouping cues are learned from training data, and apply our framework to the problem of finding printed text in natural scenes. Experimental results are described, including a performance analysis that demonstrates the feasibility of the approach. PMID:20160994
Multiscale Feature Analysis of Salivary Gland Branching Morphogenesis
Baydil, Banu; Daley, William P.; Larsen, Melinda; Yener, Bülent
2012-01-01
Pattern formation in developing tissues involves dynamic spatio-temporal changes in cellular organization and subsequent evolution of functional adult structures. Branching morphogenesis is a developmental mechanism by which patterns are generated in many developing organs, which is controlled by underlying molecular pathways. Understanding the relationship between molecular signaling, cellular behavior and resulting morphological change requires quantification and categorization of the cellular behavior. In this study, tissue-level and cellular changes in developing salivary gland in response to disruption of ROCK-mediated signaling by are modeled by building cell-graphs to compute mathematical features capturing structural properties at multiple scales. These features were used to generate multiscale cell-graph signatures of untreated and ROCK signaling disrupted salivary gland organ explants. From confocal images of mouse submandibular salivary gland organ explants in which epithelial and mesenchymal nuclei were marked, a multiscale feature set capturing global structural properties, local structural properties, spectral, and morphological properties of the tissues was derived. Six feature selection algorithms and multiway modeling of the data were applied to identify distinct subsets of cell graph features that can uniquely classify and differentiate between different cell populations. Multiscale cell-graph analysis was most effective in classification of the tissue state. Cellular and tissue organization, as defined by a multiscale subset of cell-graph features, are both quantitatively distinct in epithelial and mesenchymal cell types both in the presence and absence of ROCK inhibitors. Whereas tensor analysis demonstrates that epithelial tissue was affected the most by inhibition of ROCK signaling, significant multiscale changes in mesenchymal tissue organization were identified with this analysis that were not identified in previous biological studies.
We here show how to define and calculate a multiscale feature set as an effective computational approach to identify and quantify changes at multiple biological scales and to distinguish between different states in developing tissues. PMID:22403724
Pattern formations and optimal packing.
Mityushev, Vladimir
2016-04-01
Patterns of different symmetries may arise as solutions of reaction-diffusion equations. Hexagonal arrays, layers and their perturbations are observed in different models after numerical solution of the corresponding initial-boundary value problems. We demonstrate an intimate connection between pattern formations and optimal random packing on the plane. The main study is based on the following two points. First, the diffusive flux in reaction-diffusion systems is approximated by piecewise linear functions in the framework of structural approximations. This leads to a discrete network approximation of the considered continuous problem. Second, the discrete energy minimization yields optimal random packing of the domains (disks) in the representative cell. Therefore, the general problem of pattern formations based on the reaction-diffusion equations is reduced to the geometric problem of random packing. It is demonstrated that all random packings can be divided into classes associated with classes of isomorphic graphs obtained from the Delaunay triangulation. The unique optimal solution is constructed in each class of the random packings. If the number of disks per representative cell is finite, the number of classes of isomorphic graphs, and hence the number of optimal packings, is also finite. Copyright © 2016 Elsevier Inc. All rights reserved.
Self-similarity analysis of eubacteria genome based on weighted graph.
Qi, Zhao-Hui; Li, Ling; Zhang, Zhi-Meng; Qi, Xiao-Qin
2011-07-07
We introduce a weighted graph model to investigate the self-similarity characteristics of eubacteria genomes. Similarity comparisons of genomes are usually used to estimate the evolutionary distance among different genomes. Few studies focus on the overall statistical characteristics of each gene compared with other genes in the same genome. In our model, each genome is represented as a weighted graph, whose topology describes the similarity relationship among genes in the same genome. Based on the related weighted graph theory, we extract some quantified statistical variables from the topology, and give the distribution of some variables derived from the largest social structure in the topology. The 23 eubacteria recently studied by Sorimachi and Okayasu are clearly classified into two different groups by their double logarithmic point-plots describing the similarity relationship among genes of the largest social structure in the genome. The results show that the proposed model may provide some new insights into the structures and evolution patterns determined from the complete genomes. Copyright © 2011 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Levchuk, Georgiy; Colonna-Romano, John; Eslami, Mohammed
2017-05-01
The United States increasingly relies on cyber-physical systems to conduct military and commercial operations. Attacks on these systems have increased dramatically around the globe. The attackers constantly change their methods, making state-of-the-art commercial and military intrusion detection systems ineffective. In this paper, we present a model to identify functional behavior of network devices from netflow traces. Our model includes two innovations. First, we define novel features for a host IP using detection of application graph patterns in IP's host graph constructed from 5-min aggregated packet flows. Second, we present the first application, to the best of our knowledge, of Graph Semi-Supervised Learning (GSSL) to the space of IP behavior classification. Using a cyber-attack dataset collected from NetFlow packet traces, we show that GSSL trained with only 20% of the data achieves higher attack detection rates than Support Vector Machines (SVM) and Naïve Bayes (NB) classifiers trained with 80% of data points. We also show how to improve detection quality by filtering out web browsing data, and conclude with discussion of future research directions.
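The abstract above does not spell out its GSSL formulation, so the following is only a generic label-propagation sketch of the graph semi-supervised idea: a few nodes carry known labels, and every other node repeatedly takes the mean score of its neighbours. The function name, the 0-to-1 "attack score" convention and the toy graph are assumptions for illustration, not details from the paper.

```python
def propagate_labels(adjacency, seeds, iterations=50):
    """Spread known scores over a graph by repeated neighbour averaging.

    adjacency: dict mapping each node to a list of neighbouring nodes.
    seeds: dict mapping labelled nodes to a score in [0, 1]
           (assumed convention: 1.0 = attack, 0.0 = benign).
    """
    # Unlabelled nodes start at the neutral value 0.5.
    scores = {node: seeds.get(node, 0.5) for node in adjacency}
    for _ in range(iterations):
        updated = {}
        for node, neighbours in adjacency.items():
            if node in seeds:
                updated[node] = seeds[node]  # labelled nodes stay clamped
            elif neighbours:
                updated[node] = sum(scores[n] for n in neighbours) / len(neighbours)
            else:
                updated[node] = scores[node]  # isolated node keeps its score
        scores = updated
    return scores


# Toy chain of four hosts: "a" is a known attacker, "d" is known benign.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = propagate_labels(chain, {"a": 1.0, "d": 0.0})
```

On this chain the unlabelled hosts settle on scores that fall off linearly between the two labelled ends, which is the behaviour that lets a small labelled fraction inform the rest of the graph.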
PREFACE: Complex Networks: from Biology to Information Technology
NASA Astrophysics Data System (ADS)
Barrat, A.; Boccaletti, S.; Caldarelli, G.; Chessa, A.; Latora, V.; Motter, A. E.
2008-06-01
The field of complex networks is one of the most active areas in contemporary statistical physics. Ten years after seminal work initiated the modern study of networks, interest in the field is in fact still growing, as indicated by the ever increasing number of publications in network science. The reason for such a resounding success is most likely the simplicity and broad significance of the approach that, through graph theory, allows researchers to address a variety of different complex systems within a common framework. This special issue comprises a selection of contributions presented at the workshop 'Complex Networks: from Biology to Information Technology' held in July 2007 in Pula (Cagliari), Italy as a satellite of the general conference STATPHYS23. The contributions cover a wide range of problems that are currently among the most important questions in the area of complex networks and that are likely to stimulate future research. The issue is organised into four sections. The first two sections describe 'methods' to study the structure and the dynamics of complex networks, respectively. After this methodological part, the issue proceeds with a section on applications to biological systems. The issue closes with a section concentrating on applications to the study of social and technological networks. The first section, entitled Methods: The Structure, consists of six contributions focused on the characterisation and analysis of structural properties of complex networks: The paper Motif-based communities in complex networks by Arenas et al is a study of the occurrence of characteristic small subgraphs in complex networks. These subgraphs, known as motifs, are used to define general classes of nodes and their communities by extending the mathematical expression of the Newman-Girvan modularity. 
The same line of research, aimed at characterising network structure through the analysis of particular subgraphs, is explored by Bianconi and Gulbahce in Algorithm for counting large directed loops. This work proposes a belief-propagation algorithm for counting long loops in directed networks, which is then applied to networks of different sizes and loop structure. In The anatomy of a large query graph, Baeza-Yates and Tiberi show that scale invariance is present also in the structure of a graph derived from query logs. This graph is determined not only by the queries but also by the subsequent actions of the users. The graph analysed in this study is generated by more than twenty million queries and is less sparse than suggested by previous studies. A different class of networks is considered by Travençolo and da F Costa in Hierarchical spatial organisation of geographical networks. This work proposes a hierarchical extension of the polygonality index as a means to characterise geographical planar networks and, in particular, to obtain more complete information about the spatial order of the network at progressive spatial scales. The paper Border trees of complex networks by Villas Boas et al focuses instead on the statistical properties of the boundary of graphs, constituted by the vertices of degree one (the leaves of border trees). The authors study the local properties, the depth, and the number of leaves of these border trees, finding that in some real networks more than half of the nodes belong to the border trees. The last contribution to the first section is The generation of random directed networks with prescribed 1-node and 2-node degree correlations by Zamora-López et al. This study deals with the generation of random directed networks and shows that often a large number of links cannot be 'randomised' without altering the degree correlations. This permits fast generation of ensembles of maximally random networks. 
In the section Methods: The Dynamics, significant attention is given to the study of synchronisation processes on networks: Díaz-Guilera's contribution Dynamics towards synchronisation in hierarchical networks consists of an overview of recent studies on hierarchical networks of phase oscillators. By analysing the evolution of the synchronous dynamics, one can infer details about the underlying network topology. Thus a connection between the dynamical and topological properties of the system is established. The paper Network synchronisation: optimal and pessimal scale-free topologies by Donetti et al explores an optimisation algorithm to study the properties of optimally synchronisable unweighted networks with scale-free degree distribution. It is shown that optimisation leads to a tendency towards disassortativity while networks that are optimally 'un-synchronisable' have a highly assortative string-like structure. The paper Critical line in undirected Kauffman Boolean networks—the role of percolation by Fronczak and Fronczak demonstrates that the percolation underlying the process of damage spreading impacts the position of the critical line in random boolean networks. The critical line results from the fact that the ordered behaviour of small clusters shields the chaotic behaviour of the giant component. In Impact of the updating scheme on stationary states of networks, Radicchi et al explore an interpolation between synchronous and asynchronous updating in a one-dimensional chain of Ising spins to locate a phase transition between phases with an absorbing and a fluctuating stationary state. The properties of attractors in the yeast cell-cycle network are also shown to depend sensitively on the updating mode. As this last contribution shows, a large part of the theoretical activity in the field can be applied to the study of biological systems. 
The section Biological Applications brings together the following contributions: In Applying weighted network measures to microarray distance matrices, Ahnert et al present a new approach to the analysis of weighted networks, which provides a generalisation to any network measure defined on unweighted networks. The clustering coefficient constructed using this approach is used to identify a number of biologically significant genes in data sets from microarray experiments. The paper Quantifying the taxonomic diversity in real species communities by Caretta Cartozo et al reports on universal statistical properties in taxonomic trees. The results, which are obtained by sampling a large pool of species from all over the world, suggest that it is possible to quantitatively distinguish real species assemblage from random collections. In the contribution Insights into biological information processing: structural and dynamical analysis of a human protein signalling network, de la Fuente et al investigate the dynamical properties of a human protein signalling network while accounting for edge directionality and topological properties both at the local and global scale. The relationship between the node degrees and the distribution of signals through the network is characterised using degree correlation profiles. A study of a brain network is presented by de Vico Fallani et al in Persistent patterns of interconnection in time-varying cortical networks estimated from high-resolution EEG recordings in humans during a simple motor act. The authors introduce an approach based on the estimate of time-varying graph indexes that allows the capture of schemes of communication within the network. The method is applied to a set of high resolution EEG data recorded from a group of subjects performing a simple foot movement. 
The last section, devoted to Social and Technological Applications, includes nine contributions in the broad area of infrastructure, economic, and social systems: The paper Uncovering individual and collective human dynamics from mobile phone records by Cándia et al explores extensive phone records resolved in both time and space to study collective behaviour and the occurrence of anomalous events. At the individual level, it is shown that the distribution of time intervals between consecutive calls is heavy tailed, which agrees with results previously reported on other human activities. In Mining the inner structure of the Web graph, Donato et al present a series of measurements of the Web, which offer a better understanding of the individual components of its bow-tie structure. The scale-free properties permeate all bow-tie components although they do not exhibit self-similarity and their inner structure is quite distinct. Effects of network topology on wealth distributions, by Garlaschelli and Loffredo, shows that a networked economic system self-organises towards a stationary state whose associated wealth distribution depends crucially on the underlying interaction network. In particular, this study implies that first-order topological properties alone (such as the scale-free property) are not enough to explain the emergence of the empirically observed mixed form of the wealth distribution. In the paper Resource allocation pattern in infrastructure networks, Kim and Motter show that real communication and transportation networks tend to exhibit larger load-to-capacity ratio in nodes and links with larger capacities. This surprising pattern, which is a consequence of decentralised evolution and network traffic fluctuations, suggests that infrastructure networks have evolved to prevent local failures but not necessarily large-scale failures that can be caused by cascading processes. 
The paper Consensus formation on coevolving networks: groups' formation and structure by Kozma and Barrat addresses the effect of adaptivity on a social model of opinion dynamics and consensus formation. The authors find that on adaptive networks the rewiring process fosters group formation by enhancing communication between agents of similar opinion, though it also makes possible the division of clusters. This result is significantly different from the percolation phenomena observed to govern the process in static networks. Capocci and Caldarelli, in the paper Folksonomies and clustering in the collaborative system CiteULike, analyse an online collaborative tagging system where users bookmark and annotate scientific papers. Such a system can be naturally represented as a tripartite graph whose nodes represent papers, users and tags connected by individual tag assignments. The semantics of tags is studied in order to uncover hidden relationships between tags. The authors find that the clustering coefficient reflects the semantical patterns among tags. Lambiotte's contribution, Majority rule on heterogeneous networks, focuses on the majority rule model for opinion formation when the agents interact through a complex network. It is shown that on networks with modular structures the system may exhibit an asymmetric regime, where nodes in different communities reach opposite average opinions. In addition, the node degree heterogeneity is shown to play an important role in the emergence of collective behaviour. In Structural analysis of behavioural networks from the Internet, Meiss et al analyse the structure of the Internet. The authors present a characterisation of the properties of the behavioural networks generated by several million users of the Abilene (Internet2) network. Structural features of these networks offer new insights into scaling properties of network activity and ways of distinguishing particular patterns of traffic. 
The final contribution, A social network's changing statistical properties and the quality of human innovation by Uzzi, is an analysis of the collaboration network of artists that made Broadway musicals in the post World War II period. It is shown that when the clustering coefficient in this network is low or high, the financial and artistic success of the industry is low while an intermediate level of clustering is associated with successful shows. We hope that this special issue will serve as a reference of the state of the knowledge in this exciting area of interdisciplinary research and that it will appeal to both experts and newcomers to the field. Finally, we would like to thank all participants of the workshop for their very significant contributions and the IOP Publishing team, particularly Rebecca Gillan, for the careful production of this special issue.
Graphical User Interface Development for Representing Air Flow Patterns
NASA Technical Reports Server (NTRS)
Chaudhary, Nilika
2004-01-01
In the Turbine Branch, scientists carry out experimental and computational work to advance the efficiency and diminish the noise production of jet engine turbines. One way to do this is by decreasing the heat that the turbine blades receive. Most of the experimental work is carried out by taking a single turbine blade and analyzing the air flow patterns around it, because this data indicates the sections of the turbine blade that are getting too hot. Since the cost of doing turbine blade air flow experiments is very high, researchers try to do computational work that fits the experimental data. The goal of computational fluid dynamics is for scientists to find a numerical way to predict the complex flow patterns around different turbine blades without physically having to perform tests or costly experiments. When visualizing flow patterns, scientists need a way to represent the flow conditions around a turbine blade. A researcher will assign specific zones that surround the turbine blade. In a two-dimensional view, the zones are usually quadrilaterals. The next step is to assign boundary conditions which define how the flow enters or exits one side of a zone. Researchers such as my mentor, Dr. David Ashpis, need a quick, user-friendly way of setting up computational zones and grids, visualizing flow patterns, and storing all the flow conditions in a file on the computer for future computation. Such a program is necessary because the only method for creating flow pattern graphs is by hand, which is tedious and time-consuming. By using a computer program to create the zones and grids, the graph would be faster to make and easier to edit. Basically, the user would run a program that is an editable graph. The user could click and drag with the mouse to form various zones and grids, then edit the locations of these grids, add flow and boundary conditions, and finally save the graph for future use and analysis. My goal this summer is to create a graphical user interface (GUI) that incorporates all of these elements.
I am writing the program in Java, a language that is portable among platforms, because it can run on different operating systems such as Windows and Unix without having to be rewritten. I had no prior experience of programming in Java at the start of my internship; I am continuously learning as I create the program. I have written the part of the program that enables a user to draw several zones, edit them, and store their locations. The next phase of my project is to allow the user to click on the side of a zone and create a boundary condition for it. A previous intern wrote a program that allows the user to input boundary conditions. I can integrate the two programs to create a larger, more usable program. After that, I will develop a way for the user to save the graph for future reference. Another eventual goal is to make the GUI capable of creating three-dimensional zones as well.
Dynamic pattern matcher using incomplete data
NASA Technical Reports Server (NTRS)
Johnson, Gordon G. (Inventor); Wang, Lui (Inventor)
1993-01-01
This invention relates generally to pattern matching systems, and more particularly to a method for dynamically adapting the system to enhance the effectiveness of a pattern match. Apparatus and methods for calculating the similarity between patterns are known. There is considerable interest, however, in the storage and retrieval of data, particularly, when the search is called or initiated by incomplete information. For many search algorithms, a query initiating a data search requires exact information, and the data file is searched for an exact match. Inability to find an exact match thus results in a failure of the system or method.
Search Query Data to Monitor Interest in Behavior Change: Application for Public Health
Carr, Lucas J.; Dunsiger, Shira I.
2012-01-01
There is a need for effective interventions and policies that target the leading preventable causes of death in the U.S. (e.g., smoking, overweight/obesity, physical inactivity). Such efforts could be aided by the use of publicly available, real-time search query data that illustrate times and locations of high and low public interest in behaviors related to preventable causes of death. Objectives: This study explored patterns of search query activity for the terms ‘weight’, ‘diet’, ‘fitness’, and ‘smoking’ using Google Insights for Search. Methods: Search activity for ‘weight’, ‘diet’, ‘fitness’, and ‘smoking’ conducted within the United States via Google between January 4th, 2004 (first date data was available) and November 28th, 2011 (date of data download and analysis) was analyzed. Using a generalized linear model, we explored the effects of time (month) on mean relative search volume for all four terms. Results: Models suggest a significant effect of month on mean search volume for all four terms. Search activity for all four terms was highest in January with observable declines throughout the remainder of the year. Conclusions: These findings demonstrate discernible temporal patterns of search activity for four areas of behavior change. These findings could be used to inform the timing, location and messaging of interventions, campaigns and policies targeting these behaviors. PMID:23110198
An automated algorithm for determining photometric redshifts of quasars
NASA Astrophysics Data System (ADS)
Wang, Dan; Zhang, Yanxia; Zhao, Yongheng
2010-07-01
We employ the k-nearest neighbor (KNN) algorithm for photometric redshift measurement of quasars with the Fifth Data Release (DR5) of the Sloan Digital Sky Survey (SDSS). KNN is an instance-based learning algorithm in which the result of a new instance query is predicted from the closest training samples. The regressor does not fit any model and is based only on memory. Given a query quasar, we find the known quasars (training points) closest to the query point, and its redshift is simply assigned as the average of the values of its k nearest neighbors. Three kinds of colors (PSF, Model or Fiber) and spectral redshifts are used as input parameters, separately. The combination of the three kinds of colors is also taken as input. The experimental results indicate that the best input pattern is PSF + Model + Fiber colors in all experiments. With this pattern, 59.24%, 77.34% and 84.68% of photometric redshifts are obtained within Δz < 0.1, 0.2 and 0.3, respectively. If only one kind of color is used as input, the Model colors achieve the best performance. However, when using two kinds of colors, the best result is achieved by PSF + Fiber colors. In addition, the nearest neighbor method (k = 1) shows its superiority compared to KNN (k ≠ 1) for the given sample.
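The averaging rule described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance in colour space and made-up toy values; the abstract does not state the distance metric or any preprocessing, and `knn_predict` is a hypothetical name.

```python
import math


def knn_predict(train_features, train_targets, query, k=3):
    """Predict a value for `query` as the mean target of its k nearest
    training points (Euclidean distance is an assumption here)."""
    by_distance = sorted(
        (math.dist(features, query), target)
        for features, target in zip(train_features, train_targets)
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Toy data: two-colour feature vectors with known redshifts (invented values).
colours = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
redshifts = [1.0, 2.0, 3.0, 10.0]

estimate_k3 = knn_predict(colours, redshifts, (0.1, 0.1), k=3)  # mean of 1, 2, 3
estimate_k1 = knn_predict(colours, redshifts, (0.1, 0.1), k=1)  # nearest point only
```

Setting `k=1` gives the plain nearest-neighbour method that the abstract reports as performing best on its sample.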
DOE Office of Scientific and Technical Information (OSTI.GOV)
Demeure, I.M.
The research presented here is concerned with representation techniques and tools to support the design, prototyping, simulation, and evaluation of message-based parallel, distributed computations. The author describes ParaDiGM (Parallel, Distributed computation Graph Model), a visual representation technique for parallel, message-based distributed computations. ParaDiGM provides several views of a computation depending on the aspect of concern. It is made of two complementary submodels, the DCPG (Distributed Computing Precedence Graph) model and the PAM (Process Architecture Model) model. DCPGs are precedence graphs used to express the functionality of a computation in terms of tasks, message-passing, and data. PAM graphs are used to represent the partitioning of a computation into schedulable units or processes, and the pattern of communication among those units. There is a natural mapping between the two models. He illustrates the utility of ParaDiGM as a representation technique by applying it to various computations (e.g., an adaptive global optimization algorithm, the client-server model). ParaDiGM representations are concise. They can be used in documenting the design and the implementation of parallel, distributed computations, in describing such computations to colleagues, and in comparing and contrasting various implementations of the same computation. He then describes VISA (VISual Assistant), a software tool to support the design, prototyping, and simulation of message-based parallel, distributed computations. VISA is based on the ParaDiGM model. In particular, it supports the editing of ParaDiGM graphs to describe the computations of interest, and the animation of these graphs to provide visual feedback during simulations. The graphs are supplemented with various attributes, simulation parameters, and interpretations which are procedures that can be executed by VISA.
Image databases: Problems and perspectives
NASA Technical Reports Server (NTRS)
Gudivada, V. Naidu
1989-01-01
With the increasing number of computer graphics, image processing, and pattern recognition applications, economical storage, efficient representation and manipulation, and powerful and flexible query languages for retrieval of image data are of paramount importance. These and related issues pertinent to image data bases are examined.
A Study of the Efficiency of Spatial Indexing Methods Applied to Large Astronomical Databases
NASA Astrophysics Data System (ADS)
Donaldson, Tom; Berriman, G. Bruce; Good, John; Shiao, Bernie
2018-01-01
Spatial indexing of astronomical databases generally uses quadrature methods, which partition the sky into cells used to create an index (usually a B-tree) written as a database column. We report the results of a study to compare the performance of two common indexing methods, HTM and HEALPix, on Solaris and Windows database servers installed with a PostgreSQL database, and a Windows Server installed with MS SQL Server. The indexing was applied to the 2MASS All-Sky Catalog and to the Hubble Source Catalog. On each server, the study compared indexing performance by submitting 1 million queries at each index level with random sky positions and random cone search radius, which was computed on a logarithmic scale between 1 arcsec and 1 degree, and measuring the time to complete the query and write the output. These simulated queries, intended to model realistic use patterns, were run in a uniform way on many combinations of indexing method and indexing level. The query times in all simulations are strongly I/O-bound and scale linearly with the number of records returned for large numbers of sources. There are, however, considerable differences between simulations, which reveal that hardware I/O throughput is a more important factor in managing the performance of a DBMS than the choice of indexing scheme. The choice of index itself is relatively unimportant: for comparable index levels, the performance is consistent within the scatter of the timings. At small index levels (large cells; e.g. level 4; cell size 3.7 deg), there is large scatter in the timings because of wide variations in the number of sources found in the cells. At larger index levels, performance improves and scatter decreases, but the improvement at level 8 (14 min) and higher is masked to some extent in the timing scatter caused by the range of query sizes.
At very high levels (20; 0.0004 arcsec), the granularity of the cells becomes so high that a large number of extraneous empty cells begin to degrade performance. Thus, for the use patterns studied here the database performance is not critically dependent on the exact choices of index or level.
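As a rough check on how cell size relates to index level, the sketch below computes the characteristic width of an equal-area HEALPix pixel, assuming the study's "index level" corresponds to the HEALPix order (so nside = 2^level and the sky holds 12·nside² pixels). It also draws a cone radius uniformly in log space between 1 arcsec and 1 degree, which is one reading of "computed on a logarithmic scale". HTM triangles have different geometry, so these figures are indicative only, and both function names are hypothetical.

```python
import math
import random

# Area of the full sky in square degrees (about 41253).
FULL_SKY_DEG2 = 4 * math.pi * (180 / math.pi) ** 2


def healpix_pixel_size_deg(level):
    """Square root of the pixel area for HEALPix order `level`, in degrees."""
    nside = 2 ** level
    n_pixels = 12 * nside ** 2
    return math.sqrt(FULL_SKY_DEG2 / n_pixels)


def random_cone_radius_deg(rng):
    """Cone radius drawn log-uniformly between 1 arcsec and 1 degree."""
    low = math.log10(1 / 3600)  # 1 arcsec expressed in degrees
    high = 0.0                  # log10 of 1 degree
    return 10 ** rng.uniform(low, high)


size_level4_deg = healpix_pixel_size_deg(4)          # roughly 3.7 degrees
size_level8_arcmin = healpix_pixel_size_deg(8) * 60  # roughly 14 arcmin
radius = random_cone_radius_deg(random.Random(0))
```

Under these assumptions the level 4 and level 8 values come out close to the 3.7 deg and 14 arcmin cell sizes quoted in the abstract.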
Topological patterns in street networks of self-organized urban settlements
NASA Astrophysics Data System (ADS)
Buhl, J.; Gautrais, J.; Reeves, N.; Solé, R. V.; Valverde, S.; Kuntz, P.; Theraulaz, G.
2006-02-01
Many urban settlements result from a spatially distributed, decentralized building process. Here we analyze the topological patterns of organization of a large collection of such settlements using the approach of complex networks. The global efficiency (based on the inverse of shortest-path lengths), robustness to disconnections and cost (in terms of length) of these graphs are studied and their possible origins analyzed. A wide range of patterns is found, from tree-like settlements (highly vulnerable to random failures) to meshed urban patterns. The latter are shown to be more robust and efficient.
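Global efficiency, as commonly defined, is the average of 1/d(i, j) over all ordered pairs of distinct nodes, where d is the shortest-path length. The sketch below uses hop counts on an unweighted graph; the paper may weight paths by street length, so treat this as an assumed simplification with hypothetical function names.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def global_efficiency(adjacency):
    """Mean inverse shortest-path length over ordered pairs of distinct nodes.
    Unreachable pairs contribute zero."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adjacency:
        for target, d in hop_distances(adjacency, source).items():
            if target != source:
                total += 1.0 / d
    return total / (n * (n - 1))


# A tree-like street layout (a path) versus a meshed one (a triangle).
path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
eff_path = global_efficiency(path)
eff_triangle = global_efficiency(triangle)
```

The meshed triangle reaches the maximum efficiency of 1, while the tree-like path scores lower, in line with the abstract's contrast between tree-like and meshed settlements.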
Emergent spectral properties of river network topology: an optimal channel network approach.
Abed-Elmdoust, Armaghan; Singh, Arvind; Yang, Zong-Liang
2017-09-13
Characterization of river drainage networks has been a subject of research for many years. However, most previous studies have been limited to quantities which are loosely connected to the topological properties of these networks. In this work, through a graph-theoretic formulation of drainage river networks, we investigate the eigenvalue spectra of their adjacency matrix. First, we introduce a graph theory model for river networks and explore the properties of the network through its adjacency matrix. Next, we show that the eigenvalue spectra of such complex networks follow distinct patterns and exhibit striking features including a spectral gap in which no eigenvalue exists as well as a finite number of zero eigenvalues. We show that such spectral features are closely related to the branching topology of the associated river networks. In this regard, we find an empirical relation for the spectral gap and nullity in terms of the energy dissipation exponent of the drainage networks. In addition, the eigenvalue distribution is found to follow a finite-width probability density function with certain skewness which is related to the drainage pattern. Our results are based on optimal channel network simulations and validated through examples obtained from physical experiments on landscape evolution. These results suggest the potential of the spectral graph techniques in characterizing and modeling river networks.
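The nullity mentioned above is the number of zero eigenvalues of the adjacency matrix, which equals the number of nodes minus the matrix rank. The sketch below computes it exactly with rational arithmetic, assuming a symmetric (undirected) adjacency matrix; the abstract does not say whether the authors treat the network as directed, and the toy network and function names are illustrative only.

```python
from fractions import Fraction


def adjacency_matrix(n, edges):
    """Symmetric 0/1 adjacency matrix with exact rational entries."""
    matrix = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = matrix[v][u] = Fraction(1)
    return matrix


def matrix_rank(matrix):
    """Rank via Gauss-Jordan elimination on a copy of the matrix."""
    rows = [row[:] for row in matrix]
    rank = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def nullity(n, edges):
    """Number of zero eigenvalues of the symmetric adjacency matrix."""
    return n - matrix_rank(adjacency_matrix(n, edges))


# Toy network: outlet 0, junction 1, and two source channels 2 and 3.
confluence = [(0, 1), (1, 2), (1, 3)]
confluence_nullity = nullity(4, confluence)

# An unbranched channel of four nodes for comparison.
channel_nullity = nullity(4, [(0, 1), (1, 2), (2, 3)])
```

The branching confluence has two zero eigenvalues while the unbranched channel has none, a small instance of nullity depending on branching structure.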
ERIC Educational Resources Information Center
School Science Review, 1981
1981-01-01
Outlines several laboratory procedures and demonstrations including electric fields using sawdust, experiments with capacitors, particle spacing in a vapor and a liquid, metrology, momentum, Moire patterns and interference fringes, equipping for practical electronics, and using programmable calculators for rapid plotting of graphs. (DS)
Seasonal Cycles in Curiosity's First Two Martian Years
2016-05-11
By monitoring weather throughout two Martian years since landing in Gale Crater in 2012, NASA's Curiosity Mars rover has documented seasonal patterns such as shown in these graphs of temperature, water-vapor content and air pressure.
ERIC Educational Resources Information Center
Young, Sharon L.
1991-01-01
Presented are activities that focus on gathering, using, and interpreting data about fingerprints as a basis for integrating mathematics and science. Patterns, classification, logical reasoning, and mathematical relationships are explored by making graphs, classifying fingerprints, and matching identical fingerprints. A parent-involvement activity…
Martínez-Costa, Catalina; Cornet, Ronald; Karlsson, Daniel; Schulz, Stefan; Kalra, Dipak
2015-05-01
To improve semantic interoperability of electronic health records (EHRs) by ontology-based mediation across syntactically heterogeneous representations of the same or similar clinical information. Our approach is based on a semantic layer that consists of: (1) a set of ontologies supported by (2) a set of semantic patterns. The first aspect of the semantic layer helps standardize the clinical information modeling task and the second shields modelers from the complexity of ontology modeling. We applied this approach to heterogeneous representations of an excerpt of a heart failure summary. Using a set of finite top-level patterns to derive semantic patterns, we demonstrate that those patterns, or compositions thereof, can be used to represent information from clinical models. Homogeneous querying of the same or similar information, when represented according to heterogeneous clinical models, is feasible. Our approach focuses on the meaning embedded in EHRs, regardless of their structure. This complex task requires a clear ontological commitment (ie, agreement to consistently use the shared vocabulary within some context), together with formalization rules. These requirements are supported by semantic patterns. Other potential uses of this approach, such as clinical models validation, require further investigation. We show how an ontology-based representation of a clinical summary, guided by semantic patterns, allows homogeneous querying of heterogeneous information structures. Whether there are a finite number of top-level patterns is an open question. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Response-Guided Community Detection: Application to Climate Index Discovery
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bello, Gonzalo; Angus, Michael; Pedemane, Navya
Discovering climate indices, time series that summarize spatiotemporal climate patterns, is a key task in the climate science domain. In this work, we approach this task as a problem of response-guided community detection; that is, identifying communities in a graph associated with a response variable of interest. To this end, we propose a general strategy for response-guided community detection that explicitly incorporates information of the response variable during the community detection process, and introduce a graph representation of spatiotemporal data that leverages information from multiple variables. We apply our proposed methodology to the discovery of climate indices associated with seasonal rainfall variability. Our results suggest that our methodology is able to capture the underlying patterns known to be associated with the response variable of interest and to improve its predictability compared to existing methodologies for data-driven climate index discovery and official forecasts.
Xiong, Zheng; He, Yinyan; Hattrick-Simpers, Jason R; Hu, Jianjun
2017-03-13
The creation of composition-processing-structure relationships currently represents a key bottleneck for data analysis for high-throughput experimental (HTE) material studies. Here we propose an automated phase diagram attribution algorithm for HTE data analysis that uses a graph-based segmentation algorithm and Delaunay tessellation to create a crystal phase diagram from high throughput libraries of X-ray diffraction (XRD) patterns. We also propose the sample-pair based objective evaluation measures for the phase diagram prediction problem. Our approach was validated using 278 diffraction patterns from a Fe-Ga-Pd composition spread sample with a prediction precision of 0.934 and a Matthews Correlation Coefficient score of 0.823. The algorithm was then applied to the open Ni-Mn-Al thin-film composition spread sample to obtain the first predicted phase diagram mapping for that sample.
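The two scores reported above can be illustrated with a small sketch. It assumes that "sample-pair based" means every pair of samples is scored by whether the predicted and the reference labelings place the pair in the same phase; that reading, the function names, and the toy labels are assumptions for illustration and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pair_confusion(pred, truth):
    """Count sample pairs by whether each labeling puts them in the same phase."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Fraction of predicted same-phase pairs that are truly same-phase."""
    return tp / (tp + fp) if tp + fp else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from the four confusion counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, `pair_confusion([0, 0, 0, 1], [0, 0, 1, 1])` counts one true-positive pair, two false-positive pairs, one false-negative pair and two true-negative pairs.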
On a phase diagram for random neural networks with embedded spike timing dependent plasticity.
Turova, Tatyana S; Villa, Alessandro E P
2007-01-01
This paper presents an original mathematical framework based on graph theory which is a first attempt to investigate the dynamics of a model of neural networks with embedded spike timing dependent plasticity. The neurons correspond to integrate-and-fire units located at the vertices of a finite subset of a 2D lattice. There are two types of vertices, corresponding to the inhibitory and the excitatory neurons. The edges are directed and labelled by the discrete values of the synaptic strength. We assume that there is an initial firing pattern corresponding to a subset of units that generate a spike. The number of externally activated vertices is a small fraction of the entire network. The model presented here describes how such a pattern propagates throughout the network as a random walk on a graph. Several results are compared with computational simulations and new data are presented for identifying critical parameters of the model.
NASA Astrophysics Data System (ADS)
Ke, Xianhua; Jiang, Hao; Lv, Wen; Liu, Shiyuan
2016-03-01
Triple patterning (TP) lithography is becoming a feasible manufacturing technology as feature sizes scale down further to sub-14/10 nm. In TP, a layout is decomposed into three masks, each followed by its own exposure and etch/freeze process. Previous work mostly focuses on layout decomposition that simultaneously minimizes conflicts and stitches. However, because any native conflict (NC) forces layout redesign or modification and a repeat of the time-consuming decomposition, an effective method for detecting NCs in a layout is desirable. In this paper, a bin-based library matching method is proposed for NC detection and layout decomposition. First, a layout is divided into bins and the corresponding conflict graph in each bin is constructed. Then, we match the conflict graph against a prebuilt colored library, and as a result the NCs can be located and highlighted quickly.
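To make the notion of a native conflict concrete: a bin's conflict graph can be assigned to three masks only if it is 3-colorable. The sketch below checks this by plain backtracking; it is a stand-in for illustration and not the library matching method of the paper, and the function name and graph encoding are hypothetical.

```python
def three_color(nodes, edges):
    """Return {node: mask index 0..2}, or None if no conflict-free assignment exists."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Visit high-degree features first so dead ends are found early.
    order = sorted(nodes, key=lambda n: -len(adj[n]))
    colors = {}

    def assign(k):
        if k == len(order):
            return True
        node = order[k]
        for c in range(3):
            if all(colors.get(m) != c for m in adj[node]):
                colors[node] = c
                if assign(k + 1):
                    return True
                del colors[node]
        return False

    return dict(colors) if assign(0) else None
```

Four mutually conflicting features (a complete graph on four nodes) return `None`, which is the simplest case that cannot be resolved with three masks; a triangle of conflicts returns a valid assignment.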
Universal structures of normal and pathological heart rate variability.
Gañán-Calvo, Alfonso M; Fajardo-López, Juan
2016-02-25
The circulatory system of living organisms is an autonomous mechanical system softly tuned with the respiratory system, and both developed by evolution as a response to the complex oxygen demand patterns associated with motion. Circulatory health is rooted in adaptability, which entails an inherent variability. Here, we show that a generalized N-dimensional normalized graph representing heart rate variability reveals two universal arrhythmic patterns as specific signatures of health: one reflects cardiac adaptability, and the other the cardiac-respiratory rate tuning. In addition, we identify at least three universal arrhythmic profiles whose presence increases, to the proportional detriment of the two healthy ones, in pathological conditions (myocardial infarction; heart failure; and recovery from sudden death). The presence of the identified universal arrhythmic structures together with the position of the centre of mass of the heart rate variability graph provide a unique quantitative assessment of the health-pathology gradient.
Dictionary-driven protein annotation
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-01-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. 
Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/. PMID:12202776
Connecting Provenance with Semantic Descriptions in the NASA Earth Exchange (NEX)
NASA Astrophysics Data System (ADS)
Votava, P.; Michaelis, A.; Nemani, R. R.
2012-12-01
NASA Earth Exchange (NEX) is a data, modeling and knowledge collaboratory that houses NASA satellite data, climate data and ancillary data where a focused community may come together to share modeling and analysis codes, scientific results, knowledge and expertise on a centralized platform. Some of the main goals of NEX are transparency and repeatability, and to that end we have been adding components that enable tracking of provenance of both scientific processes and datasets produced by these processes. As scientific processes become more complex, they are often developed collaboratively and it becomes increasingly important for the research team to be able to track the development of the process and the datasets that are produced along the way. Additionally, we want to be able to link the processes and the datasets developed on NEX to existing information and knowledge, so that the users can query and compare the provenance of any dataset or process with regard to the component-specific attributes such as data quality, geographic location, related publications, user comments and annotations etc. We have developed several ontologies that describe datasets and workflow components available on NEX using the OWL ontology language as well as a simple ontology that provides a linking mechanism to the collected provenance information. The provenance is captured in two ways: we utilize the existing provenance infrastructure of VisTrails, which is used as a workflow engine on NEX, and we extend the captured provenance using the PROV data model expressed through the PROV-O ontology. We do this in order to link and query the provenance more easily in the context of the existing NEX information and knowledge. The captured provenance graph is processed and stored using RDFLib with a MySQL backend that can be queried using either RDFLib or SPARQL. As a concrete example, we show how this information is captured during an anomaly detection process in large satellite datasets.
Metnitz, P G; Laback, P; Popow, C; Laback, O; Lenz, K; Hiesmayr, M
1995-01-01
Patient Data Management Systems (PDMS) for ICUs collect, present and store clinical data. Various purposes, such as quality control or research, make analysis of these digitally stored data desirable. The aim of the Intensive Care Data Evaluation project (ICDEV) was to provide a database tool for the analysis of data recorded at various ICUs at the University Clinics of Vienna, General Hospital of Vienna, where two different PDMSs are used: CareVue 9000 (Hewlett Packard, Andover, USA) at two ICUs (one medical ICU and one neonatal ICU) and PICIS Chart+ (PICIS, Paris, France) at one Cardiothoracic ICU. CONCEPT AND METHODS: Clinically oriented analysis of the data collected in a PDMS at an ICU was the beginning of the development. After defining the database structure we established a client-server based database system under Microsoft Windows NT and developed a user friendly data querying application using Microsoft Visual C++ and Visual Basic. ICDEV was successfully installed at three different ICUs; adjustments to the different PDMS configurations were made within a few days. The database structure developed by us enables a powerful query concept representing an 'EXPERT QUESTION COMPILER' which may help to answer almost any clinical question. Several program modules facilitate queries at the patient, group and unit level. Results from ICDEV queries are automatically transferred to Microsoft Excel for display (in the form of configurable tables and graphs) and further processing. The ICDEV concept is configurable for adjustment to different intensive care information systems and can be used to support computerized quality control. However, as long as there exists no sufficient artifact recognition or data validation software for automatically recorded patient data, the reliability of these data and their usage for computer assisted quality control remain unclear and should be further studied.
A method for independent component graph analysis of resting-state fMRI.
Ribeiro de Paula, Demetrius; Ziegler, Erik; Abeyasinghe, Pubuditha M; Das, Tushar K; Cavaliere, Carlo; Aiello, Marco; Heine, Lizette; di Perri, Carol; Demertzi, Athena; Noirhomme, Quentin; Charland-Verville, Vanessa; Vanhaudenhuyse, Audrey; Stender, Johan; Gomez, Francisco; Tshibanda, Jean-Flory L; Laureys, Steven; Owen, Adrian M; Soddu, Andrea
2017-03-01
Independent component analysis (ICA) has been extensively used for reducing task-free BOLD fMRI recordings into spatial maps and their associated time-courses. The spatially identified independent components can be considered as intrinsic connectivity networks (ICNs) of non-contiguous regions. To date, the spatial patterns of the networks have been analyzed with techniques developed for volumetric data. Here, we detail a graph building technique that allows these ICNs to be analyzed with graph theory. First, ICA was performed at the single-subject level in 15 healthy volunteers using a 3T MRI scanner. The identification of nine networks was performed by a multiple-template matching procedure and a subsequent component classification based on the network "neuronal" properties. Second, for each of the identified networks, the nodes were defined as 1,015 anatomically parcellated regions. Third, between-node functional connectivity was established by building edge weights for each network. Group-level graph analysis was finally performed for each network and compared to the classical network. Network graph comparison between the classically constructed network and the nine networks showed significant differences in the auditory and visual medial networks with regard to the average degree and the number of edges, while the visual lateral network showed a significant difference in the small-worldness. This novel approach permits us to take advantage of the well-recognized power of ICA in BOLD signal decomposition and, at the same time, to make use of well-established graph measures to evaluate connectivity differences. Moreover, by providing a graph for each separate network, it can offer the possibility to extract graph measures in a specific way for each network. 
This increased specificity could be relevant for studying pathological brain activity or altered states of consciousness as induced by anesthesia or sleep, where specific networks are known to be altered in different strength.
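Two of the graph measures named in the abstract, the number of edges and the average degree, can be computed from a matrix of edge weights as sketched below. The thresholding step, the threshold value, and the function names are assumptions for illustration; the abstract does not state how weighted edges were binarized.

```python
def binarize(weights, threshold):
    """Turn a symmetric weight matrix into a 0/1 adjacency matrix (no self-loops)."""
    n = len(weights)
    return [
        [1 if i != j and weights[i][j] > threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def edge_count(adj):
    """Number of undirected edges, counting each node pair once."""
    n = len(adj)
    return sum(adj[i][j] for i in range(n) for j in range(i + 1, n))


def average_degree(adj):
    """Mean number of neighbours per node; each edge contributes to two nodes."""
    n = len(adj)
    return 2 * edge_count(adj) / n if n else 0.0
```

With one such adjacency matrix per network, the same two functions can be applied to each network separately, which is the kind of per-network comparison the abstract describes.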
Brain Tumor Segmentation Using Deep Belief Networks and Pathological Knowledge.
Zhan, Tianming; Chen, Yi; Hong, Xunning; Lu, Zhenyu; Chen, Yunjie
2017-01-01
In this paper, we propose an automatic brain tumor segmentation method based on Deep Belief Networks (DBNs) and pathological knowledge. The proposed method targets gliomas (both low and high grade) imaged in multi-sequence magnetic resonance images (MRIs). First, a novel deep architecture is proposed that combines multi-sequence intensity feature extraction with classification to obtain the classification probabilities of each voxel. Then, graph cut based optimization is executed on the classification probabilities to strengthen the spatial relationships of voxels. Finally, pathological knowledge of gliomas is applied to remove some false positives. Our method was validated on the Brain Tumor Segmentation Challenge 2012 and 2013 databases (BRATS 2012, 2013). The segmentation results demonstrate that our proposal provides a solution competitive with state-of-the-art methods. Copyright © Bentham Science Publishers.
CentiServer: A Comprehensive Resource, Web-Based Application and R Package for Centrality Analysis
Jalili, Mahdi; Salehzadeh-Yazdi, Ali; Asgari, Yazdan; Arab, Seyed Shahriar; Yaghmaie, Marjan; Ghavamzadeh, Ardeshir; Alimoghaddam, Kamran
2015-01-01
Various disciplines are trying to solve one of the most noteworthy queries and broadly used concepts in biology, essentiality. Centrality is a primary index and a promising method for identifying essential nodes, particularly in biological networks. The newly created CentiServer is a comprehensive online resource that provides over 110 definitions of different centrality indices, their computational methods, and algorithms in the form of an encyclopedia. In addition, CentiServer allows users to calculate 55 centralities with the help of an interactive web-based application tool and provides a numerical result as a comma separated value (csv) file format or a mapped graphical format as a graph modeling language (GML) file. The standalone version of this application has been developed in the form of an R package. The web-based application (CentiServer) and R package (centiserve) are freely available at http://www.centiserver.org/ PMID:26571275
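As an illustration of what a centrality index computes, the sketch below implements two textbook definitions, degree centrality and closeness centrality, on an undirected graph stored as an adjacency dictionary. It is not the CentiServer or centiserve code; the function names and the graph encoding are hypothetical.

```python
from collections import deque


def degree_centrality(adj):
    """Share of the other nodes that each node is directly linked to."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(neighbours) / (n - 1) for v, neighbours in adj.items()}


def closeness_centrality(adj):
    """Reachable-node count divided by the summed shortest-path distance."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search gives unweighted shortest paths
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result
```

On the path graph a-b-c, the middle node b scores 1.0 on both indices, while each end node scores 0.5 for degree and 2/3 for closeness.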
UCbase 2.0: ultraconserved sequences database (2014 update)
Lomonaco, Vincenzo; Martoglia, Riccardo; Mandreoli, Federica; Anderlucci, Laura; Emmett, Warren; Bicciato, Silvio; Taccioli, Cristian
2014-01-01
UCbase 2.0 (http://ucbase.unimore.it) is an update, extension and evolution of UCbase, a Web tool dedicated to the analysis of ultraconserved sequences (UCRs). UCRs are 481 sequences >200 bases sharing 100% identity among human, mouse and rat genomes. They are frequently located in genomic regions known to be involved in cancer or differentially expressed in human leukemias and carcinomas. UCbase 2.0 is a platform-independent Web resource that includes the updated version of the human genome annotation (hg19), information linking disorders to chromosomal coordinates based on the Systematized Nomenclature of Medicine classification, a query tool to search for Single Nucleotide Polymorphisms (SNPs) and a new text box to directly interrogate the database using a MySQL interface. To facilitate the interactive visual interpretation of UCR chromosomal positioning, UCbase 2.0 now includes a graph visualization interface directly linked to UCSC genome browser. Database URL: http://ucbase.unimore.it PMID:24951797
Semantic technologies in a decision support system
NASA Astrophysics Data System (ADS)
Wasielewska, K.; Ganzha, M.; Paprzycki, M.; Bǎdicǎ, C.; Ivanovic, M.; Lirkov, I.
2015-10-01
The aim of our work is to design a decision support system based on ontological representation of domain(s) and semantic technologies. Specifically, we consider the case when a Grid/Cloud user describes his/her requirements regarding a "resource" as a class expression from an ontology, while the instances of (the same) ontology represent available resources. The goal is to help the user find the best option with respect to his/her requirements, while remembering that the user's knowledge may be "limited." In this context, we discuss multiple approaches based on semantic data processing, which involve different "forms" of user interaction with the system. Specifically, we consider: (a) ontological matchmaking based on SPARQL queries and class expressions, (b) graph-based semantic closeness of instances representing user requirements (constructed from the class expression) and available resources, and (c) multicriterial analysis based on the AHP method, which utilizes expert domain knowledge (also ontologically represented).
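For approach (c), the core AHP step turns a matrix of pairwise comparisons between criteria into priority weights. The sketch below uses the row geometric-mean method, a common approximation of the principal-eigenvector weights; the function name and the example matrix are hypothetical, and the abstract does not say which AHP variant the authors use.

```python
from math import prod


def ahp_weights(pairwise):
    """Priority weights from a reciprocal pairwise-comparison matrix.

    pairwise[i][j] states how strongly criterion i is preferred over j,
    with pairwise[j][i] == 1 / pairwise[i][j].
    """
    n = len(pairwise)
    geometric_means = [prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geometric_means)
    return [g / total for g in geometric_means]
```

For a perfectly consistent matrix such as `[[1, 2, 6], [1/2, 1, 3], [1/6, 1/3, 1]]` the method recovers the underlying weights 0.6, 0.3 and 0.1 up to rounding error.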
Thoth: Software for data visualization & statistics
NASA Astrophysics Data System (ADS)
Laher, R. R.
2016-10-01
Thoth is a standalone software application with a graphical user interface for making it easy to query, display, visualize, and analyze tabular data stored in relational databases and data files. From imported data tables, it can create pie charts, bar charts, scatter plots, and many other kinds of data graphs with simple menus and mouse clicks (no programming required), by leveraging the open-source JFreeChart library. It also computes useful table-column data statistics. A mature tool, having undergone development and testing over several years, it is written in the Java computer language, and hence can be run on any computing platform that has a Java Virtual Machine and graphical-display capability. It can be downloaded and used by anyone free of charge, and has general applicability in science, engineering, medical, business, and other fields. Special tools and features for common tasks in astronomy and astrophysical research are included in the software.
Using soft-hard fusion for misinformation detection and pattern of life analysis in OSINT
NASA Astrophysics Data System (ADS)
Levchuk, Georgiy; Shabarekh, Charlotte
2017-05-01
Today's battlefields are shifting to "denied areas", where the use of U.S. Military air and ground assets is limited. To succeed, U.S. intelligence analysts increasingly rely on available open-source intelligence (OSINT), which is fraught with inconsistencies, biased reporting and fake news. Analysts need automated tools for retrieval of information from OSINT sources, and these solutions must identify and resolve conflicting and deceptive information. In this paper, we present a misinformation detection model (MDM) which converts text to attributed knowledge graphs and runs graph-based analytics to identify misinformation. At the core of our solution is identification of knowledge conflicts in the fused multi-source knowledge graph, and semi-supervised learning to compute locally consistent reliability and credibility scores for the documents and sources, respectively. We present validation of the proposed method using an open-source dataset constructed from the online investigations of the MH17 downing in Eastern Ukraine.
Measuring geographic segregation: a graph-based approach
NASA Astrophysics Data System (ADS)
Hong, Seong-Yun; Sadahiro, Yukio
2014-04-01
Residential segregation is a multidimensional phenomenon that encompasses several conceptually distinct aspects of geographical separation between populations. While various indices have been developed as a response to different definitions of segregation, the reliance on such single-figure indices could oversimplify the complex, multidimensional phenomena. In this regard, this paper suggests an alternative graph-based approach that provides more detailed information than simple indices: The concentration profile graphically conveys information about how evenly a population group is distributed over the study region, and the spatial proximity profile depicts the degree of clustering across different threshold levels. These graphs can also be summarized into single numbers for comparative purposes, but the interpretation can be more accurate by inspecting the additional information. To demonstrate the use of these methods, the residential patterns of three major ethnic groups in Auckland, namely Māori, Pacific peoples, and Asians, are examined using the 2006 census data.
Goldstone STDN 9-meter radiation test
NASA Astrophysics Data System (ADS)
Blain, J. R.
1981-12-01
The Goldstone spaceflight tracking and data network (STDN) 9-meter tests were conducted from February through July 1981 to characterize the near-field radiation patterns of the S-band and fourth harmonic frequency emissions. The test configurations and results are presented with graphs of the antenna patterns. The tests indicated that X-band leakage may be suppressed to levels of approximately -190 dBm/sq cm at 200 meters.
Neural network for intelligent query of an FBI forensic database
NASA Astrophysics Data System (ADS)
Uvanni, Lee A.; Rainey, Timothy G.; Balasubramanian, Uma; Brettle, Dean W.; Weingard, Fred; Sibert, Robert W.; Birnbaum, Eric
1997-02-01
Examiner is an automated fired cartridge case identification system utilizing a dual-use neural network pattern recognition technology, called the statistical-multiple object detection and location system (S-MODALS), developed by Booz·Allen & Hamilton, Inc. in conjunction with Rome Laboratory. S-MODALS was originally designed for automatic target recognition (ATR) of tactical and strategic military targets using multisensor fusion [electro-optical (EO), infrared (IR), and synthetic aperture radar (SAR)] sensors. Since S-MODALS is a learning system readily adaptable to problem domains other than automatic target recognition, the pattern matching problem of microscopic marks for firearms evidence was analyzed using S-MODALS. The physics; phenomenology; discrimination and search strategies; robustness requirements; error level and confidence level propagation that apply to the pattern matching problem of military targets were found to be applicable to the ballistic domain as well. The Examiner system uses S-MODALS to rank a set of queried cartridge case images from the most similar to the least similar image in reference to an investigative fired cartridge case image. The paper presents three independent tests and evaluation studies of the Examiner system utilizing the S-MODALS technology for the Federal Bureau of Investigation.
Accuracy and Completeness of Clinical Coding Using ICD-10 for Ambulatory Visits
Horsky, Jan; Drucker, Elizabeth A.; Ramelson, Harley Z.
2017-01-01
This study describes a simulation of diagnostic coding using an EHR. Twenty-three ambulatory clinicians were asked to enter appropriate codes for six standardized scenarios with two different EHRs. Their interactions with the query interface were analyzed for patterns and variations in search strategies and the resulting sets of entered codes for accuracy and completeness. Just over a half of entered codes were appropriate for a given scenario and about a quarter were omitted. Crohn’s disease and diabetes scenarios had the highest rate of inappropriate coding and code variation. The omission rate was higher for secondary than for primary visit diagnoses. Codes for immunization, dialysis dependence and nicotine dependence were the most often omitted. We also found a high rate of variation in the search terms used to query the EHR for the same diagnoses. Changes to the training of clinicians and improved design of EHR query modules may lower the rate of inappropriate and omitted codes. PMID:29854158
Interactive and Versatile Navigation of Structural Databases.
Korb, Oliver; Kuhn, Bernd; Hert, Jérôme; Taylor, Neil; Cole, Jason; Groom, Colin; Stahl, Martin
2016-05-12
We present CSD-CrossMiner, a novel tool for pharmacophore-based searches in crystal structure databases. Intuitive pharmacophore queries describing, among others, protein-ligand interaction patterns, ligand scaffolds, or protein environments can be built and modified interactively. Matching crystal structures are overlaid onto the query and visualized as soon as they are available, enabling the researcher to quickly modify a hypothesis on the fly. We exemplify the utility of the approach by showing applications relevant to real-world drug discovery projects, including the identification of novel fragments for a specific protein environment or scaffold hopping. The ability to concurrently search protein-ligand binding sites extracted from the Protein Data Bank (PDB) and small organic molecules from the Cambridge Structural Database (CSD) using the same pharmacophore query further emphasizes the flexibility of CSD-CrossMiner. We believe that CSD-CrossMiner closes an important gap in mining structural data and will allow users to extract more value from the growing number of available crystal structures.
Faster Bit-Parallel Algorithms for Unordered Pseudo-tree Matching and Tree Homeomorphism
NASA Astrophysics Data System (ADS)
Kaneta, Yusaku; Arimura, Hiroki
In this paper, we consider the unordered pseudo-tree matching problem: given two unordered labeled trees P and T, find all occurrences of P in T via many-one embeddings that preserve node labels and parent-child relationships. This problem is closely related to the tree pattern matching problem for XPath queries with the child axis only. For m > w, we present an efficient algorithm that solves the problem in O(nm log(w)/w) time using O(hm/w + m log(w)/w) space and O(m log(w)) preprocessing on a unit-cost arithmetic RAM model with addition, where m is the number of nodes in P, n is the number of nodes in T, h is the height of T, and w is the word length. We also discuss a modification of our algorithm for the unordered tree homeomorphism problem, which corresponds to a tree pattern matching problem for XPath queries with the descendant axis only.
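The matching condition itself can be stated in a few lines. The sketch below is a naive recursive check, not the bit-parallel algorithm of the paper: a pattern node maps to a text node when the labels agree and every child of the pattern node maps to some child of the text node, with several pattern children allowed to share one target (the many-one property). The tuple encoding `(label, [children])` and the function names are assumptions for illustration.

```python
def matches(p, t):
    """True if pattern node p can be embedded at text node t."""
    p_label, p_children = p
    t_label, t_children = t
    if p_label != t_label:
        return False
    # Many-one: each pattern child needs some matching text child,
    # and different pattern children may pick the same one.
    return all(any(matches(pc, tc) for tc in t_children) for pc in p_children)


def occurrences(pattern, tree):
    """Preorder indices of the nodes of `tree` at which `pattern` occurs."""
    found = []
    index = 0
    stack = [tree]
    while stack:
        node = stack.pop()
        if matches(pattern, node):
            found.append(index)
        index += 1
        stack.extend(reversed(node[1]))  # keep left-to-right preorder
    return found
```

With the pattern `("a", [("b", []), ("b", [])])` and the tree `("a", [("b", []), ("a", [("b", []), ("c", [])])])`, both nodes labeled a are occurrences, because the two pattern children labeled b may map to the same text child.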
Dehkordy, Soudabeh Fazeli; Carlos, Ruth C.; Hall, Kelli S.; Dalton, Vanessa K.
2015-01-01
Rationale and Objectives: Millions of people use online search engines every day to find health-related information and voluntarily share their personal health status and behaviors on various websites. Thus, data from tracking of online information seekers' behavior offer potential opportunities for use in public health surveillance and research. Google Trends is a feature of Google which allows internet users to graph the frequency of searches for a single term or phrase over time or by geographic region. We used Google Trends to describe patterns of information seeking behavior in the subject of dense breasts and to examine their correlation with the passage or introduction of dense breast notification legislation. Materials and Methods: In order to capture the temporal variations of information seeking about dense breasts, the web search query “dense breast” was entered in the Google Trends tool. We then mapped the dates of legislative actions regarding dense breasts that received widespread coverage in the lay media to information seeking trends about dense breasts over time. Results: Newsworthy events and legislative actions appear to correlate well with peaks in search volume of “dense breast”. Geographic regions with the highest search volumes have either passed, denied, or are currently considering the dense breast legislation. Conclusions: Our study demonstrated that any legislative action and respective news coverage correlate with an increase in information seeking for “dense breast” on Google, suggesting that Google Trends has the potential to serve as a data source for policy-relevant research. PMID:24998689
Building a SuAVE browse interface to R2R's Linked Data
NASA Astrophysics Data System (ADS)
Clark, D.; Stocks, K. I.; Arko, R. A.; Zaslavsky, I.; Whitenack, T.
2017-12-01
The Rolling Deck to Repository program (R2R) is creating and evaluating a new browse portal based on the SuAVE platform and the R2R linked data graph. R2R manages the underway sensor data collected by the fleet of US academic research vessels, and provides a discovery and access point to those data at its website, www.rvdata.us. R2R has a database-driven search interface, but seeks a more capable and extensible browse interface that could be built off of the substantial R2R linked data resources. R2R's Linked Data graph organizes its data holdings around key concepts (e.g. cruise, vessel, device type, operator, award, organization, publication), anchored by persistent identifiers where feasible. The "Survey Analysis via Visual Exploration" or SuAVE platform (suave.sdsc.edu) is a system for online publication, sharing, and analysis of images and metadata. It has been implemented as an interface to diverse data collections, but has not been driven off of linked data in the past. SuAVE supports several features of interest to R2R, including faceted searching, collaborative annotations, efficient subsetting, Google maps-like navigation over an image gallery, and several types of data analysis. Our initial SuAVE-based implementation was through a CSV export from the R2R PostGIS-enabled PostgreSQL database. This served to demonstrate the utility of SuAVE but was static and required reloading as R2R data holdings grew. We are now working to implement a SPARQL-based ("RDF Query Language") service that directly leverages the R2R Linked Data graph and offers the ability to subset and/or customize output. We will show examples of SuAVE faceted searches on R2R linked data concepts, and discuss our experience to date with this work in progress.
NASA Astrophysics Data System (ADS)
Abidin, Anas Zainul; D'Souza, Adora M.; Nagarajan, Mahesh B.; Wismüller, Axel
2016-03-01
The use of functional Magnetic Resonance Imaging (fMRI) has provided interesting insights into our understanding of the brain. In clinical setups these scans have been used to detect and study changes in the brain network properties in various neurological disorders. A large percentage of subjects infected with HIV present cognitive deficits, which are known as HIV associated neurocognitive disorder (HAND). In this study we propose to use our novel technique named Mutual Connectivity Analysis (MCA) to detect differences in brain networks in subjects with and without HIV infection. Resting state functional MRI scans acquired from 10 subjects (5 HIV+ and 5 HIV-) were subjected to standard preprocessing routines. Subsequently, the average time-series for each brain region of the Automated Anatomic Labeling (AAL) atlas were extracted and used with the MCA framework to obtain a graph characterizing the interactions between them. The network graphs obtained for different subjects were then compared using Network-Based Statistics (NBS), which is an approach to detect differences between graph edges while controlling for the family-wise error rate when mass univariate testing is performed. Applying this approach to the graphs obtained yields a single network encompassing 42 nodes and 65 edges, which is significantly different between the two subject groups. Specifically, connections to the regions in and around the basal ganglia are significantly decreased. Also, some nodes corresponding to the posterior cingulate cortex are affected. These results are in line with our current understanding of pathophysiological mechanisms of HIV associated neurocognitive disorder (HAND) and other HIV based fMRI connectivity studies. Hence, we illustrate the applicability of our novel approach with network-based statistics in a clinical case-control study to detect differences in connectivity patterns.
Automated diagnosis of interstitial lung diseases and emphysema in MDCT imaging
NASA Astrophysics Data System (ADS)
Fetita, Catalin; Chang Chien, Kuang-Che; Brillet, Pierre-Yves; Prêteux, Françoise
2007-09-01
Diffuse lung diseases (DLD) include a heterogeneous group of non-neoplastic diseases resulting from damage to the lung parenchyma by varying patterns of inflammation. Characterization and quantification of DLD severity using MDCT, mainly in interstitial lung diseases and emphysema, is an important issue in clinical research for the evaluation of new therapies. This paper develops a 3D automated approach for detection and diagnosis of diffuse lung diseases such as fibrosis/honeycombing, ground glass and emphysema. The proposed methodology combines multi-resolution 3D morphological filtering (exploiting the sup-constrained connection cost operator) and graph-based classification for a full characterization of the parenchymal tissue. The morphological filtering performs a multi-level segmentation of the low- and medium-attenuated lung regions as well as their classification with respect to a granularity criterion (multi-resolution analysis). The original intensity range of the CT data volume is thus reduced in the segmented data to a number of levels equal to the resolution depth used (generally ten levels). The specificity of such morphological filtering is to extract tissue patterns locally contrasting with their neighborhood and of size inferior to the resolution depth, while preserving their original shape. A multi-valued hierarchical graph describing the segmentation result is built up according to the resolution level and the adjacency of the different segmented components. The graph nodes are then enriched with the textural information carried by their associated components. A graph analysis-reorganization based on the node attributes delivers the final classification of the lung parenchyma into normal and ILD/emphysematous regions. It also makes it possible to discriminate between different types, or development stages, within the same class of diseases.
Enhanced Contact Graph Routing (ECGR) MACHETE Simulation Model
NASA Technical Reports Server (NTRS)
Segui, John S.; Jennings, Esther H.; Clare, Loren P.
2013-01-01
Contact Graph Routing (CGR) for Delay/Disruption Tolerant Networking (DTN) space-based networks makes use of the predictable nature of node contacts to make real-time routing decisions given unpredictable traffic patterns. The contact graph will have been disseminated to all nodes before the start of route computation. CGR was designed for space-based networking environments where future contact plans are known or are independently computable (e.g., using known orbital dynamics). For each data item (known as a bundle in DTN), a node independently performs route selection by examining possible paths to the destination. Route computation could conceivably run thousands of times a second, so computational load is important. This work refers to the simulation software model of Enhanced Contact Graph Routing (ECGR) for DTN Bundle Protocol in JPL's MACHETE simulation tool. The simulation model was used for performance analysis of CGR and led to several performance enhancements. The simulation model was used to demonstrate the improvements of ECGR over CGR as well as other routing methods in space network scenarios. ECGR moved to using earliest arrival time because it is a global monotonically increasing metric that guarantees the safety properties needed for the solution's correctness since route re-computation occurs at each node to accommodate unpredicted changes (e.g., traffic pattern, link quality). Furthermore, using earliest arrival time enabled the use of the standard Dijkstra algorithm for path selection. The Dijkstra algorithm for path selection has a well-known inexpensive computational cost. These enhancements have been integrated into the open source CGR implementation. The ECGR model is also useful for route metric experimentation and comparisons with other DTN routing protocols particularly when combined with MACHETE's space networking models and Delay Tolerant Link State Routing (DTLSR) model.
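The role that earliest arrival time plays as a Dijkstra metric can be illustrated with a minimal sketch. This is not the MACHETE model or the open source CGR implementation; the contact tuple layout `(from_node, to_node, t_start, t_end, delay)` and the function name `earliest_arrival` are assumptions made for illustration only, and bundle size, link capacity and queuing are ignored.

```python
import heapq


def earliest_arrival(contacts, source, dest, start_time=0.0):
    """Return the earliest time a bundle leaving `source` at `start_time`
    can reach `dest`, or None if no sequence of contacts gets there.

    contacts: iterable of (from_node, to_node, t_start, t_end, delay)
    (hypothetical layout; `delay` is a one-way propagation delay).
    """
    by_sender = {}
    for frm, to, t_start, t_end, delay in contacts:
        by_sender.setdefault(frm, []).append((to, t_start, t_end, delay))

    best = {source: start_time}          # best known arrival time per node
    heap = [(start_time, source)]
    while heap:
        t, node = heapq.heappop(heap)
        if t > best.get(node, float("inf")):
            continue                     # stale queue entry
        if node == dest:
            return t
        for to, t_start, t_end, delay in by_sender.get(node, []):
            depart = max(t, t_start)     # wait for the contact to open
            if depart >= t_end:
                continue                 # contact already closed
            arrive = depart + delay
            if arrive < best.get(to, float("inf")):
                best[to] = arrive
                heapq.heappush(heap, (arrive, to))
    return None
```

Because waiting at a node never makes a later contact unusable earlier, arrival time only grows along a path, which is the property that lets the standard Dijkstra relaxation apply in this simplified setting.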
NASA Astrophysics Data System (ADS)
Sharma, Harshita; Zerbe, Norman; Heim, Daniel; Wienert, Stephan; Lohmann, Sebastian; Hellwich, Olaf; Hufnagl, Peter
2016-03-01
This paper describes a novel graph-based method for efficient representation and subsequent classification in histological whole slide images of gastric cancer. Her2/neu immunohistochemically stained and haematoxylin and eosin stained histological sections of gastric carcinoma are digitized. Immunohistochemical staining is used in practice by pathologists to determine the extent of malignancy; however, it is laborious to visually discriminate the corresponding malignancy levels in the more commonly used haematoxylin and eosin stain. This study attempts to solve this problem using a computer-based method. Cell nuclei are first isolated at high magnification using an automatic cell nuclei segmentation strategy, followed by construction of cell nuclei attributed relational graphs of the tissue regions. These graphs represent tissue architecture comprehensively, as they contain information about cell nuclei morphology as vertex attributes, along with knowledge of neighborhood in the form of edge linking and edge attributes. Global graph characteristics are derived and ensemble learning is used to discriminate between three types of malignancy levels, namely, non-tumor, Her2/neu positive tumor and Her2/neu negative tumor. Performance is compared with state of the art methods including four texture feature groups (Haralick, Gabor, Local Binary Patterns and Varma Zisserman features), color and intensity features, and Voronoi diagram and Delaunay triangulation. Texture, color and intensity information is also combined with graph-based knowledge, followed by correlation analysis. Quantitative assessment is performed using two cross validation strategies. The experimental results indicate that the proposed method provides a promising way for computer-based analysis of histopathological images of gastric cancer.
Akama, Hiroyuki; Miyake, Maki; Jung, Jaeyoung; Murphy, Brian
2015-01-01
In this study, we introduce an original distance definition for graphs, called the Markov-inverse-F measure (MiF). This measure enables the integration of classical graph theory indices with new knowledge pertaining to structural feature extraction from semantic networks. MiF improves on the conventional Jaccard and/or Simpson indices, and reconciles both the geodesic information (random walk) and co-occurrence adjustment (degree balance and distribution). We measure the effectiveness of graph-based coefficients by applying linguistic graph information to neural activity recorded during conceptual processing in the human brain. Specifically, the MiF distance is computed between each of the nouns used in a previous neural experiment and each of the in-between words in a subgraph derived from the Edinburgh Word Association Thesaurus of English. From the MiF-based information matrix, a machine learning model can accurately obtain a scalar parameter that specifies the degree to which each voxel in (the MRI image of) the brain is activated by each word or each principal component of the intermediate semantic features. Furthermore, by correlating the voxel information with the MiF-based principal components, a new computational neurolinguistics model with a network connectivity paradigm is created. This allows two dimensions of context space to be incorporated with both semantic and neural distributional representations.
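For orientation, the two conventional baselines named above can be sketched on the neighbor sets of two vertices. This is only an illustration of the Jaccard index and of the Simpson index in its overlap-coefficient form; it is not the MiF measure, whose definition is not given in the abstract, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index: shared neighbors divided by all distinct neighbors."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def simpson(a, b):
    """Simpson (overlap) index: shared neighbors divided by the size of
    the smaller neighbor set."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0
```

Both indices depend only on direct neighbor overlap, so neither reflects path-based (random walk) information between two words in the graph.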
Ashkenazy, Haim; Abadi, Shiran; Martz, Eric; Chay, Ofer; Mayrose, Itay; Pupko, Tal; Ben-Tal, Nir
2016-01-01
The degree of evolutionary conservation of an amino acid in a protein or a nucleic acid in DNA/RNA reflects a balance between its natural tendency to mutate and the overall need to retain the structural integrity and function of the macromolecule. The ConSurf web server (http://consurf.tau.ac.il), established over 15 years ago, analyses the evolutionary pattern of the amino/nucleic acids of the macromolecule to reveal regions that are important for structure and/or function. Starting from a query sequence or structure, the server automatically collects homologues, infers their multiple sequence alignment and reconstructs a phylogenetic tree that reflects their evolutionary relations. These data are then used, within a probabilistic framework, to estimate the evolutionary rates of each sequence position. Here we introduce several new features into ConSurf, including automatic selection of the best evolutionary model used to infer the rates, the ability to homology-model query proteins, prediction of the secondary structure of query RNA molecules from sequence, the ability to view the biological assembly of a query (in addition to the single chain), mapping of the conservation grades onto 2D RNA models and an advanced view of the phylogenetic tree that enables interactively rerunning ConSurf with the taxa of a sub-tree. PMID:27166375
Griffin: A Tool for Symbolic Inference of Synchronous Boolean Molecular Networks.
Muñoz, Stalin; Carrillo, Miguel; Azpeitia, Eugenio; Rosenblueth, David A
2018-01-01
Boolean networks are important models of biochemical systems, located at the high end of the abstraction spectrum. A number of Boolean gene networks have been inferred following essentially the same method. Such a method first considers experimental data for a typically underdetermined "regulation" graph. Next, Boolean networks are inferred by using biological constraints to narrow the search space, such as a desired set of (fixed-point or cyclic) attractors. We describe Griffin, a computer tool enhancing this method. Griffin incorporates a number of well-established algorithms, such as Dubrova and Teslenko's algorithm for finding attractors in synchronous Boolean networks. In addition, a formal definition of regulation allows Griffin to employ "symbolic" techniques, able to represent both large sets of network states and Boolean constraints. We observe that when the set of attractors is required to be an exact set, prohibiting additional attractors, a naive Boolean coding of this constraint may be infeasible. Such cases may be intractable even with symbolic methods, as the number of Boolean constraints may be astronomically large. To overcome this problem, we employ an Artificial Intelligence technique known as "clause learning", considerably increasing Griffin's scalability. Without clause learning only toy examples prohibiting additional attractors are solvable: only one out of seven queries reported here is answered. With clause learning, by contrast, all seven queries are answered. We illustrate Griffin with three case studies drawn from the Arabidopsis thaliana literature. Griffin is available at: http://turing.iimas.unam.mx/griffin.
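What "fixed-point or cyclic attractors" of a synchronous Boolean network means can be shown with a brute-force sketch. This is explicitly not Griffin and not Dubrova and Teslenko's algorithm: it enumerates all 2^n states, so it is only usable for toy networks. The function name `attractors` and the representation of update rules as Python callables are assumptions for illustration.

```python
from itertools import product


def attractors(update_fns):
    """Enumerate attractors of a synchronous Boolean network by brute force.

    update_fns: one function per node; each takes the current state
    (a tuple of 0/1 values) and returns that node's next value.
    Returns a set of cycles, each a tuple of states rotated so that it
    starts at its smallest state.
    """
    n = len(update_fns)

    def step(state):
        # Synchronous update: every node reads the same current state.
        return tuple(int(bool(f(state))) for f in update_fns)

    found = set()
    for state in product((0, 1), repeat=n):
        seen = {}            # state -> position on the current trajectory
        path = []
        while state not in seen:
            seen[state] = len(path)
            path.append(state)
            state = step(state)
        cycle = path[seen[state]:]           # part of the path that repeats
        k = cycle.index(min(cycle))          # canonical rotation
        found.add(tuple(cycle[k:] + cycle[:k]))
    return found
```

For a two-node mutual-inhibition network, `attractors([lambda s: not s[1], lambda s: not s[0]])` yields two fixed points and one cycle of length two, which is the kind of attractor set that a constraint such as "exactly these attractors" would be stated against.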
The Bologna Annotation Resource (BAR 3.0): improving protein functional annotation
Casadio, Rita
2017-01-01
BAR 3.0 updates our server BAR (Bologna Annotation Resource) for predicting protein structural and functional features from sequence. We increase data volume, query capabilities and information conveyed to the user. The core of BAR 3.0 is a graph-based clustering procedure of UniProtKB sequences, following strict pairwise similarity criteria (sequence identity ≥40% with alignment coverage ≥90%). Each cluster contains the available annotation downloaded from UniProtKB, GO, PFAM and PDB. After statistical validation, GO terms and PFAM domains are cluster-specific and annotate new sequences entering the cluster after satisfying similarity constraints. BAR 3.0 includes 28 869 663 sequences in 1 361 773 clusters, of which 22.2% (22 241 661 sequences) and 47.4% (24 555 055 sequences) have at least one validated GO term and one PFAM domain, respectively. 1.4% of the clusters (36% of all sequences) include PDB structures and the cluster is associated with a hidden Markov model that allows building template-target alignments suitable for structural modeling. Some other 3 399 026 sequences are singletons. BAR 3.0 offers an improved search interface, allowing queries by UniProtKB-accession, Fasta sequence, GO-term, PFAM-domain, organism, PDB and ligand/s. When evaluated on the CAFA2 targets, BAR 3.0 largely outperforms our previous version and scores among state-of-the-art methods. BAR 3.0 is publicly available and accessible at http://bar.biocomp.unibo.it/bar3. PMID:28453653
Bhadra, Pratiti; Pal, Debnath
2017-04-01
Dynamics is integral to the function of proteins, yet the use of molecular dynamics (MD) simulation as a technique remains under-explored for molecular function inference. This is more important in the context of genomics projects where novel proteins are determined with limited evolutionary information. Recently we developed a method to match the query protein's flexible segments to infer function using a novel approach combining analysis of residue fluctuation-graphs and auto-correlation vectors derived from coarse-grained (CG) MD trajectory. The method was validated on a diverse dataset with sequence identity between proteins as low as 3%, with high function-recall rates. Here we share its implementation as a publicly accessible web service, named DynFunc (Dynamics Match for Function) to query protein function from ≥1 µs long CG dynamics trajectory information of protein subunits. Users are provided with the custom-developed coarse-grained molecular mechanics (CGMM) forcefield to generate the MD trajectories for their protein of interest. On upload of trajectory information, the DynFunc web server identifies specific flexible regions of the protein linked to putative molecular function. Our unique application does not use evolutionary information to infer molecular function from MD information and can, therefore, work for all proteins, including moonlighting and the novel ones, whenever structural information is available. Our pipeline is expected to be of utility to all structural biologists working with novel proteins and interested in moonlighting functions. Copyright © 2017 Elsevier Ltd. All rights reserved.
GALAHAD: 1. Pharmacophore identification by hypermolecular alignment of ligands in 3D
NASA Astrophysics Data System (ADS)
Richmond, Nicola J.; Abrams, Charlene A.; Wolohan, Philippa R. N.; Abrahamian, Edmond; Willett, Peter; Clark, Robert D.
2006-09-01
Alignment of multiple ligands based on shared pharmacophoric and pharmacosteric features is a long-recognized challenge in drug discovery and development. This is particularly true when the spatial overlap between structures is incomplete, in which case no good template molecule is likely to exist. Pair-wise rigid ligand alignment based on linear assignment (the LAMDA algorithm) has the potential to address this problem (Richmond et al. in J Mol Graph Model 23:199-209, 2004). Here we present the version of LAMDA embodied in the GALAHAD program, which carries out multi-way alignments by iterative construction of hypermolecules that retain the aggregate as well as the individual attributes of the ligands. We have also generalized the cost function from being purely atom-based to being one that operates on ionic, hydrogen bonding, hydrophobic and steric features. Finally, we have added the ability to generate useful partial-match 3D search queries from the hypermolecules obtained. By running frozen conformations through the GALAHAD program, one can utilize the extended version of LAMDA to generate pharmacophores and pharmacosteres that agree well with crystal structure alignments for a range of literature datasets, with minor adjustments of the default parameters generating even better models. Allowing for inclusion of partial match constraints in the queries yields pharmacophores that are consistently a superset of full-match pharmacophores identified in previous analyses, with the additional features representing points of potentially beneficial interaction with the target.
Evaluating a NoSQL Alternative for Chilean Virtual Observatory Services
NASA Astrophysics Data System (ADS)
Antognini, J.; Araya, M.; Solar, M.; Valenzuela, C.; Lira, F.
2015-09-01
Currently, the standards and protocols for data access in the Virtual Observatory architecture (DAL) are generally implemented with relational databases based on SQL. In particular, the Astronomical Data Query Language (ADQL), the language used by IVOA to represent queries to VO services, was created to satisfy the different data access protocols, such as Simple Cone Search. ADQL is based on SQL92, and has extra functionality implemented using PgSphere. An emerging alternative to SQL is the family of so-called NoSQL databases, which can be classified into several categories such as Column, Document, Key-Value, Graph, Object, etc., each one recommended for different scenarios. Among their notable characteristics are: schema-free design, easy replication support, simple API, Big Data, etc. The Chilean Virtual Observatory (ChiVO) is developing a functional prototype based on the IVOA architecture, with the following relevant factors: Performance, Scalability, Flexibility, Complexity, and Functionality. Currently, it is very difficult to compare these factors, due to a lack of alternatives. The objective of this paper is to compare NoSQL alternatives with SQL through the implementation of a REST Web API that satisfies ChiVO's needs: a SESAME-style name resolver for the data from ALMA. Therefore, we propose a test scenario by configuring a NoSQL database with data from different sources and evaluating the feasibility of creating a Simple Cone Search service and its performance. This comparison will pave the way for the application of Big Data databases in the Virtual Observatory.
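The core of a cone search, independent of the database behind it, is a filter on angular separation from a sky position. The following is a minimal in-memory sketch under stated assumptions: rows are dictionaries with hypothetical `ra` and `dec` keys in degrees, and the haversine formula is used for the separation. It does not reproduce the IVOA Simple Cone Search protocol, ADQL, PgSphere, or the ChiVO prototype.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions
    given in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    d_ra = ra2 - ra1
    d_dec = dec2 - dec1
    a = (math.sin(d_dec / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(d_ra / 2) ** 2)
    # Clamp against rounding so asin never sees a value above 1.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(rows, ra, dec, radius):
    """Return the rows lying within `radius` degrees of (ra, dec).
    Each row is a dict with 'ra' and 'dec' keys (hypothetical schema)."""
    return [r for r in rows
            if angular_separation_deg(ra, dec, r["ra"], r["dec"]) <= radius]
```

A linear scan like this is the baseline any backend has to beat; a SQL or NoSQL store would typically add a spatial or range index so that only candidate rows near the cone are examined.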
An Intelligent System for Document Retrieval in Distributed Office Environments.
ERIC Educational Resources Information Center
Mukhopadhyay, Uttam; And Others
1986-01-01
MINDS (Multiple Intelligent Node Document Servers) is a distributed system of knowledge-based query engines for efficiently retrieving multimedia documents in an office environment of distributed workstations. By learning document distribution patterns and user interests and preferences during system usage, it customizes document retrievals for…
Scale-Independent Relational Query Processing
ERIC Educational Resources Information Center
Armbrust, Michael Paul
2013-01-01
An increasingly common pattern is for newly-released web applications to succumb to a "Success Disaster". In this scenario, overloaded database machines and resultant high response times destroy a previously good user experience, just as a site is becoming popular. Unfortunately, the data independence provided by a traditional relational…
Classroom Proven Motivational Mathematics Games, Monograph No. 1.
ERIC Educational Resources Information Center
Michigan Council of Teachers of Mathematics.
This collection includes 50 mathematical games and puzzles for classroom use at all grade levels. Also included is a wide variety of activities with cubes, flash cards, graphs, dots, number patterns, geometric shapes, cross-number puzzles, and magic squares. (MM)
NASA Astrophysics Data System (ADS)
Saputra, M. A.; Prajitno, P.
2018-04-01
Blood glucose is a molecule needed for human life. It is usually measured invasively (by taking blood), but invasive measurement has its own vulnerabilities, which makes a non-invasive alternative very attractive. The article [1] explains the relationship between the movement of the arterial pulse and glucose concentration; a study of the correlation between blood glucose and the movement of the laser speckle pattern produced by arterial movement is therefore promising as a non-invasive method for measuring blood glucose concentration. In this study, the laser speckle pattern imaging method, in which the microscopic movement of an object illuminated by a laser beam is recorded by a high-speed camera at fixed time intervals, is used to identify the movement patterns of the artery. From the image processing, graphs resembling an electrocardiogram (ECG) can be extracted. The average of the maximum peaks of the graph can be correlated with the blood glucose concentration, as shown in the article [2]. The data obtained in this research indicate that the movement of the speckle tends to increase as blood glucose concentration rises.
Temporal dynamics and impact of event interactions in cyber-social populations
NASA Astrophysics Data System (ADS)
Zhang, Yi-Qing; Li, Xiang
2013-03-01
The advance of information technologies provides powerful measures to digitize social interactions and facilitate quantitative investigations. To explore large-scale indoor interactions of a social population, we analyze 18 715 users' Wi-Fi access logs recorded on a Chinese university campus during 3 months, and define event interaction (EI) to characterize the concurrent interactions of multiple users inferred from their geographic coincidences, that is, co-locating in the same small region at the same time. We propose three rules to construct a transmission graph, which depicts the topological and temporal features of event interactions. The vertex dynamics of the transmission graph show that the active durations of EIs follow truncated power-law distributions that are independent of the number of involved individuals. The edge dynamics of the transmission graph show that the transmission durations present a truncated power-law pattern independent of the daily and weekly periodicities. Besides, in the aggregated transmission graph, low-degree vertices previously neglected in the aggregated static networks may participate in the large-degree EIs, which is verified by three data sets covering different sizes of social populations with various rendezvouses. This work highlights the temporal significance of event interactions in cyber-social populations.
Gaussian covariance graph models accounting for correlated marker effects in genome-wide prediction.
Martínez, C A; Khare, K; Rahman, S; Elzo, M A
2017-10-01
Several statistical models used in genome-wide prediction assume uncorrelated marker allele substitution effects, but it is known that these effects may be correlated. In statistics, graphical models have been identified as a useful tool for covariance estimation in high-dimensional problems, and this area has recently experienced a great expansion. In Gaussian covariance graph models (GCovGM), the joint distribution of a set of random variables is assumed to be Gaussian and the pattern of zeros of the covariance matrix is encoded in terms of an undirected graph G. In this study, methods adapting the theory of GCovGM to genome-wide prediction were developed (Bayes GCov, Bayes GCov-KR and Bayes GCov-H). In simulated data sets, improvements in correlation between phenotypes and predicted breeding values and accuracies of predicted breeding values were found. Our models account for correlation of marker effects and can accommodate general structures, as opposed to models proposed in previous studies, which consider spatial correlation only. In addition, they allow incorporation of biological information in the prediction process through its use when constructing graph G, and their extension to the multi-allelic loci case is straightforward. © 2017 Blackwell Verlag GmbH.
Multiresolution analysis over graphs for a motor imagery based online BCI game.
Asensio-Cubero, Javier; Gan, John Q; Palaniappan, Ramaswamy
2016-01-01
Multiresolution analysis (MRA) over graph representation of EEG data has proved to be a promising method for offline brain-computer interfacing (BCI) data analysis. For the first time we aim to prove the feasibility of the graph lifting transform in an online BCI system. Instead of developing a pointer device or a wheel-chair controller as test bed for human-machine interaction, we have designed and developed an engaging game which can be controlled by means of imaginary limb movements. Some modifications to the existing MRA analysis over graphs for BCI have also been proposed, such as the use of common spatial patterns for feature extraction at the different levels of decomposition, and sequential floating forward search as a best basis selection technique. In the online game experiment we obtained for three classes an average classification rate of 63.0% for fourteen naive subjects. The application of a best basis selection method helps significantly decrease the computing resources needed. The present study allows us to further understand and assess the benefits of the use of tailored wavelet analysis for processing motor imagery data and contributes to the further development of BCI for gaming purposes. Copyright © 2015 Elsevier Ltd. All rights reserved.
Associative Pattern Recognition In Analog VLSI Circuits
NASA Technical Reports Server (NTRS)
Tawel, Raoul
1995-01-01
Winner-take-all circuit selects best-match stored pattern. Prototype cascadable very-large-scale integrated (VLSI) circuit chips built and tested to demonstrate concept of electronic associative pattern recognition. Based on low-power, sub-threshold analog complementary metal oxide/semiconductor (CMOS) VLSI circuitry, each chip can store 128 sets (vectors) of 16 analog values (vector components), vectors representing known patterns as diverse as spectra, histograms, graphs, or brightnesses of pixels in images. Chips exploit parallel nature of vector quantization architecture to implement highly parallel processing in relatively simple computational cells. Through collective action, cells classify input pattern in fraction of microsecond while consuming power of few microwatts.
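A software analogue of the best-match selection can be sketched as follows. The chip's actual similarity measure is not stated in the abstract, so squared Euclidean distance is an assumption here, as is the function name `best_match`; the analog circuit evaluates all stored vectors in parallel, whereas this sketch loops over them.

```python
def best_match(stored, query):
    """Winner-take-all selection over stored pattern vectors.

    stored: list of equal-length vectors (e.g. 128 vectors of 16 values).
    query:  input vector of the same length.
    Returns (index of the closest stored vector, its squared distance).
    """
    def sq_dist(vec):
        # Squared Euclidean distance (assumed metric, see lead-in).
        return sum((a - b) ** 2 for a, b in zip(vec, query))

    winner = min(range(len(stored)), key=lambda i: sq_dist(stored[i]))
    return winner, sq_dist(stored[winner])
```

The single `min` over all candidates is the sequential counterpart of the winner-take-all stage: each cell computes its own distance, and only the smallest one determines the output class.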
Netgram: Visualizing Communities in Evolving Networks
Mall, Raghvendra; Langone, Rocco; Suykens, Johan A. K.
2015-01-01
Real-world complex networks are dynamic in nature and change over time. The change is usually observed in the interactions within the network over time. Complex networks exhibit community like structures. A key feature of the dynamics of complex networks is the evolution of communities over time. Several methods have been proposed to detect and track the evolution of these groups over time. However, there is no generic tool which visualizes all the aspects of group evolution in dynamic networks including birth, death, splitting, merging, expansion, shrinkage and continuation of groups. In this paper, we propose Netgram: a tool for visualizing evolution of communities in time-evolving graphs. Netgram maintains evolution of communities over 2 consecutive time-stamps in tables which are used to create a query database using the sql outer-join operation. It uses a line-based visualization technique which adheres to certain design principles and aesthetic guidelines. Netgram uses a greedy solution to order the initial community information provided by the evolutionary clustering technique such that we have fewer line cross-overs in the visualization. This makes it easier to track the progress of individual communities in time evolving graphs. Netgram is a generic toolkit which can be used with any evolutionary community detection algorithm as illustrated in our experiments. We use Netgram for visualization of topic evolution in the NIPS conference over a period of 11 years and observe the emergence and merging of several disciplines in the field of information processing systems. PMID:26356538
Event Detection for Hydrothermal Plumes: A case study at Grotto Vent
NASA Astrophysics Data System (ADS)
Bemis, K. G.; Ozer, S.; Xu, G.; Rona, P. A.; Silver, D.
2012-12-01
Evidence is mounting that geologic events such as volcanic eruptions (and intrusions) and earthquakes (near and far) influence the flow rates and temperatures of hydrothermal systems. Connecting such suppositions to observations of hydrothermal output is challenging, but new ongoing time series have the potential to capture such events. This study explores using activity detection, a technique modified from computer vision, to identify pre-defined events within an extended time series recorded by COVIS (Cabled Observatory Vent Imaging Sonar) and applies it to a time series, with gaps, from Sept 2010 to the present; available measurements include plume orientation, plume rise rate, and diffuse flow area at the NEPTUNE Canada Observatory at Grotto Vent, Main Endeavour Field, Juan de Fuca Ridge. Activity detection is the process of finding a pattern (activity) in a data set containing many different types of patterns. Among many approaches proposed to model and detect activities, we have chosen a graph-based technique, Petri Nets, as they do not require training data to model the activity. They use the domain expert's knowledge to build the activity as a combination of feature states and their transitions (actions). Starting from a conceptual model of how hydrothermal plumes respond to daily tides, we have developed a Petri Net based detection algorithm that identifies deviations from the specified response. Initially we assumed that the orientation of the plume would change smoothly and symmetrically in a consistent daily pattern. However, results indicate that the rate of directional changes varies. The present Petri Net detects unusually large and rapid changes in direction or amount of bending; however inspection of Figure 1 suggests that many of the events detected may be artifacts resulting from gaps in the data or from the large temporal spacing. 
Still, considerable complexity overlies the "normal" tidal response pattern (the data has a dominant frequency of ~12.9 hours). We are in the process of defining several events of particular scientific interest: 1) transient behavioral changes associated with atmospheric storms, earthquakes or volcanic intrusions or eruptions, 2) mutual interaction of neighboring plumes on each other's behavior, and 3) rapid shifts in plume direction that indicate the presence of unusual currents or changes in currents. We will query the existing data to see if these relationships are ever observed as well as testing our understanding of the "normal" pattern of response to tidal currents.
Figure 1. Arrows indicate plume orientation at a given time (time axis in days after 9/29/10) and stars indicate times when orientation changes rapidly.
Determination of geographic variance in stroke prevalence using Internet search engine analytics.
Walcott, Brian P; Nahed, Brian V; Kahle, Kristopher T; Redjal, Navid; Coumans, Jean-Valery
2011-06-01
Previous methods to determine stroke prevalence, such as nationwide surveys, are labor-intensive endeavors. Recent advances in search engine query analytics have led to a new metric for disease surveillance to evaluate symptomatic phenomena, such as influenza. The authors hypothesized that the use of search engine query data can determine the prevalence of stroke. The Google Insights for Search database was accessed to analyze anonymized search engine query data. The authors' search strategy utilized common search queries used when attempting either to identify the signs and symptoms of a stroke or to perform stroke education. The search logic was as follows: (stroke signs + stroke symptoms + mini stroke--heat) from January 1, 2005, to December 31, 2010. The relative number of searches performed (the interest level) for this search logic was established for all 50 states and the District of Columbia. A Pearson product-moment correlation coefficient was calculated from the state-specific stroke prevalence data previously reported. Web search engine interest level was available for all 50 states and the District of Columbia over the period from January 1, 2005, to December 31, 2010. The interest level was highest in Alabama and Tennessee (100 and 96, respectively) and lowest in California and Virginia (58 and 53, respectively). The Pearson correlation coefficient (r) was calculated to be 0.47 (p = 0.0005, 2-tailed). Search engine query data analysis allows for the determination of relative stroke prevalence. Further investigation will reveal the reliability of this metric to determine temporal pattern analysis and prevalence in this and other symptomatic diseases.
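As an illustration of the statistic named in the abstract above, the following is a minimal sketch of the Pearson product-moment correlation coefficient in plain Python. The function name `pearson_r` and the sample values are assumptions made for illustration only; they are not the authors' code or data, and the abstract does not say which software was used.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT the study's data:
# one search interest level and one prevalence figure per region.
interest_level = [90.0, 80.0, 70.0, 60.0, 50.0]
prevalence_pct = [3.4, 2.9, 3.1, 2.5, 2.6]
r = pearson_r(interest_level, prevalence_pct)
```

A value of r near 1 or -1 indicates a strong linear relationship, so the reported r = 0.47 would correspond to a moderate positive association between interest level and prevalence.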
NASA Astrophysics Data System (ADS)
Slynko, Inna; Da Silva, Franck; Bret, Guillaume; Rognan, Didier
2016-09-01
High affinity ligands for a given target tend to share key molecular interactions with important anchoring amino acids and therefore often present quite conserved interaction patterns. This simple concept was formalized in a topological knowledge-based scoring function (GRIM) for selecting the most appropriate docking poses from previously X-rayed interaction patterns. GRIM first converts protein-ligand atomic coordinates (docking poses) into a simple 3D graph describing the corresponding interaction pattern. In a second step, proposed graphs are compared to those found in template structures in the Protein Data Bank. Last, all docking poses are rescored according to an empirical score (GRIMscore) accounting for overlap of maximum common subgraphs. Taking the opportunity of the public D3R Grand Challenge 2015, GRIM was used to rescore docking poses for 36 ligands (6 HSP90α inhibitors, 30 MAP4K4 inhibitors) prior to the release of the corresponding protein-ligand X-ray structures. When applied to the HSP90α dataset, for which many protein-ligand X-ray structures are already available, GRIM provided very high quality solutions (mean rmsd = 1.06 Å, n = 6) as top-ranked poses, and significantly outperformed a state-of-the-art scoring function. In the case of MAP4K4 inhibitors, for which preexisting 3D knowledge is scarce and chemical diversity is much larger, the accuracy of GRIM poses decays (mean rmsd = 3.18 Å, n = 30) although GRIM still outperforms an energy-based scoring function. GRIM rescoring appears to be quite robust in comparison with the other approaches competing in the same challenge (42 submissions for the HSP90 dataset, 27 for the MAP4K4 dataset), as it ranked 3rd and 2nd, respectively, for the two investigated datasets. The rescoring method is quite simple to implement, independent of the docking engine, and applicable to any target for which at least one holo X-ray structure is available.
Optimizing graph-based patterns to extract biomedical events from the literature
2015-01-01
In BioNLP-ST 2013: We participated in the BioNLP 2013 shared tasks on event extraction. Our extraction method is based on the search for an approximate subgraph isomorphism between key context dependencies of events and graphs of input sentences. Our system was able to address both the GENIA (GE) task focusing on 13 molecular biology related event types and the Cancer Genetics (CG) task targeting a challenging group of 40 cancer biology related event types with varying arguments concerning 18 kinds of biological entities. In addition to adapting our system to the two tasks, we also attempted to integrate semantics into the graph matching scheme using a distributional similarity model for more events, and evaluated the event extraction impact of using paths of all possible lengths as key context dependencies beyond using only the shortest paths in our system. We achieved a 46.38% F-score in the CG task (ranking 3rd) and a 48.93% F-score in the GE task (ranking 4th). After BioNLP-ST 2013: We explored three ways to further extend our event extraction system in our previously published work: (1) We allow non-essential nodes to be skipped, and incorporated a node skipping penalty into the subgraph distance function of our approximate subgraph matching algorithm. (2) Instead of assigning a unified subgraph distance threshold to all patterns of an event type, we learned a customized threshold for each pattern. (3) We implemented the well-known Empirical Risk Minimization (ERM) principle to optimize the event pattern set by balancing prediction errors on training data against regularization. When evaluated on the official GE task test data, these extensions help to improve the extraction precision from 62% to 65%. However, the overall F-score stays equivalent to the previous performance due to a 1% drop in recall. PMID:26551594
SSMILes: Investigating Various Volcanic Eruptions and Volcano Heights.
ERIC Educational Resources Information Center
Wagner-Pine, Linda; Keith, Donna Graham
1994-01-01
Presents an integrated math/science activity that shows students the differences among the three types of volcanoes using observation, classification, graphing, sorting, problem solving, measurement, averages, pattern relationships, calculators, computers, and research skills. Includes reproducible student worksheet. Lists 13 teacher resources.…
Primary Place. Math Projects That Count.
ERIC Educational Resources Information Center
Buschman, Larry; And Others
1993-01-01
Offers elementary math-centered recycling activities and ideas on transforming throwaways into valuable classroom resources. The math activities teach estimating, counting, measuring, weighing, graphing, patterning, thinking, comparing, proportion, and dimensions. The recycling ideas present ways to use pieces of trash to create educational games.…
Visibility in the topology of complex networks
NASA Astrophysics Data System (ADS)
Tsiotas, Dimitrios; Charakopoulos, Avraam
2018-09-01
Taking its inspiration from the visibility algorithm, which was proposed by Lacasa et al. (2008) to convert a time-series into a complex network, this paper develops and proposes a novel expansion of this algorithm that allows generating a visibility graph from a complex network rather than only from a time-series, to which the algorithm is currently limited. The purpose of this approach is to apply the idea of visibility from the field of time-series to complex networks in order to interpret the network topology as a landscape. Visibility in complex networks is a multivariate property producing an associated visibility graph that maps the ability of a node "to see" other nodes in the network that lie beyond the range of its neighborhood, in terms of a control-attribute. Within this context, this paper examines the visibility topology produced by connectivity (degree) in comparison with the original (source) network, in order to detect what patterns or forces describe the mechanism under which a network is converted to a visibility graph. The overall analysis shows that visibility is a property that increases the connectivity in networks, that it may contribute to pattern recognition (including the detection of scale-free topology), and that it is worth applying to complex networks in order to reveal the potential of signal processing beyond the range of a node's neighborhood. Generally, this paper promotes interdisciplinary research in complex networks, providing new insights to network science.