Large-Scale 1:1 Computing Initiatives: An Open Access Database
ERIC Educational Resources Information Center
Richardson, Jayson W.; McLeod, Scott; Flora, Kevin; Sauers, Nick J.; Kannan, Sathiamoorthy; Sincar, Mehmet
2013-01-01
This article details the spread and scope of large-scale 1:1 computing initiatives around the world. What follows is a review of the existing literature around 1:1 programs followed by a description of the large-scale 1:1 database. Main findings include: 1) the XO and the Classmate PC dominate large-scale 1:1 initiatives; 2) if professional…
Mackey, Aaron J; Pearson, William R
2004-10-01
Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. Relational databases are essential for the management and analysis of large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, extending seqdb_demo to store sequence similarity search results, and using various kinds of stored search results to address aspects of comparative genomic analysis.
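For illustration, a minimal sketch of the first step described above, generating a taxonomic library subset from a relational store, is shown below using Python's sqlite3. The table and column names are assumptions for this example, not the actual seqdb_demo schema.

```python
import sqlite3

# Hypothetical schema loosely modeled on a seqdb_demo-style layout
# (table and column names are assumptions, not the unit's actual schema).
con = sqlite3.connect("seqdb_demo.sqlite")

query = """
SELECT p.accession, p.sequence
FROM   protein  p
JOIN   taxonomy t ON t.tax_id = p.tax_id
WHERE  t.lineage LIKE '%Enterobacteriaceae%'
"""

# Write the subset out as FASTA for use as a smaller,
# less redundant similarity-search library.
with open("subset.fasta", "w") as fasta:
    for accession, sequence in con.execute(query):
        fasta.write(f">{accession}\n{sequence}\n")
con.close()
```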
Design and implementation of a distributed large-scale spatial database system based on J2EE
NASA Astrophysics Data System (ADS)
Gong, Jianya; Chen, Nengcheng; Zhu, Xinyan; Zhang, Xia
2003-03-01
With the increasing maturity of distributed object technology, CORBA, .NET, and EJB are widely used in the traditional IT field. However, the theory and practice of distributed spatial databases need further improvement, owing to the tension between large-scale spatial data and limited network bandwidth, and between short-lived sessions and long transaction processing. The differences and trends among CORBA, .NET, and EJB are discussed in detail, after which the concept, architecture, and characteristics of a distributed large-scale seamless spatial database system based on J2EE are presented, comprising a GIS client application, a web server, a GIS application server, and a spatial data server. The design and implementation of the components are then explained: the GIS client application based on JavaBeans, the GIS engine based on servlets, and the GIS application server based on GIS Enterprise JavaBeans (containing session beans and entity beans). In addition, experiments on the relation between spatial data volume and response time under different conditions were conducted, which show that a distributed spatial database system based on J2EE can be used to manage, distribute, and share large-scale spatial data on the Internet. Lastly, a distributed large-scale seamless image database based on the Internet is presented.
Large Scale Landslide Database System Established for the Reservoirs in Southern Taiwan
NASA Astrophysics Data System (ADS)
Tsai, Tsai-Tsung; Tsai, Kuang-Jung; Shieh, Chjeng-Lun
2017-04-01
The severe impact of Typhoon Morakot on southern Taiwan awakened public awareness of large-scale landslide disasters, which produce large quantities of sediment that impair the operating functions of reservoirs. To reduce the risk of these disasters within the study area, the establishment of a database for hazard mitigation and disaster prevention is necessary. Real-time data and extensive archives of engineering data, environmental information, photos, and video not only help people make appropriate decisions, but also pose a major challenge of processing and adding value. This study defined basic data formats and standards for the various types of data collected about these reservoirs and then provided a management platform based on those formats and standards. To ensure practicality and convenience, the large-scale landslide disaster database system was built with both information provision and reception capabilities, so that users can work with it on different types of devices. Because IT technology progresses extremely quickly and even the most modern system can become outdated at any time, the system reserves the possibility of user-defined data formats/standards and user-defined system structure in order to provide long-term service. The system established by this study is based on the HTML5 standard language and uses responsive web design technology, making it easy for users to operate and develop this large-scale landslide disaster database system.
Using SQL Databases for Sequence Similarity Searching and Analysis.
Pearson, William R; Mackey, Aaron J
2017-09-13
Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc.
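As a sketch of how stored search results support such summaries, the hypothetical query below counts, per biological kingdom, how many query proteins have at least one significant hit. The schema (hit and protein tables) is assumed for illustration and is not search_demo's actual layout.

```python
import sqlite3

# Hypothetical search_demo-style tables (names are assumptions):
#   hit(query_acc, subject_acc, evalue)
#   protein(accession, kingdom)
con = sqlite3.connect("search_demo.sqlite")
rows = con.execute("""
SELECT p.kingdom, COUNT(DISTINCT h.query_acc) AS queries_with_homolog
FROM   hit h
JOIN   protein p ON p.accession = h.subject_acc
WHERE  h.evalue < 1e-6
GROUP BY p.kingdom
""")
for kingdom, n in rows:
    print(kingdom, n)
```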
Using Large-Scale Databases in Evaluation: Advances, Opportunities, and Challenges
ERIC Educational Resources Information Center
Penuel, William R.; Means, Barbara
2011-01-01
Major advances in the number, capabilities, and quality of state, national, and transnational databases have opened up new opportunities for evaluators. Both large-scale data sets collected for administrative purposes and those collected by other researchers can provide data for a variety of evaluation-related activities. These include (a)…
Orthographic and Phonological Neighborhood Databases across Multiple Languages.
Marian, Viorica
2017-01-01
The increased globalization of science and technology and the growing number of bilinguals and multilinguals in the world have made research with multiple languages a mainstay for scholars who study human function and especially those who focus on language, cognition, and the brain. Such research can benefit from large-scale databases and online resources that describe and measure lexical, phonological, orthographic, and semantic information. The present paper discusses currently-available resources and underscores the need for tools that enable measurements both within and across multiple languages. A general review of language databases is followed by a targeted introduction to databases of orthographic and phonological neighborhoods. A specific focus on CLEARPOND illustrates how databases can be used to assess and compare neighborhood information across languages, to develop research materials, and to provide insight into broad questions about language. As an example of how using large-scale databases can answer questions about language, a closer look at neighborhood effects on lexical access reveals that not only orthographic, but also phonological neighborhoods can influence visual lexical access both within and across languages. We conclude that capitalizing upon large-scale linguistic databases can advance, refine, and accelerate scientific discoveries about the human linguistic capacity.
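A neighborhood database of this kind rests on a simple definition: two words of equal length are orthographic neighbors if they differ by exactly one letter. A minimal sketch of that definition (not CLEARPOND's implementation) is:

```python
def orthographic_neighbors(word, lexicon):
    """Return lexicon entries of the same length that differ from
    `word` by exactly one letter (Coltheart's-N-style neighbors)."""
    def one_substitution(a, b):
        return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1
    return [w for w in lexicon if one_substitution(word, w)]

# Toy lexicon for illustration; a real neighborhood count runs over
# a full frequency-weighted lexicon, per language.
lexicon = ["cat", "hat", "cut", "cart", "bat", "can"]
print(orthographic_neighbors("cat", lexicon))  # ['hat', 'cut', 'bat', 'can']
```

Phonological neighborhoods are computed the same way over phoneme transcriptions rather than letter strings, which is what allows cross-language comparisons of the two neighborhood types.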
[Privacy and public benefit in using large scale health databases].
Yamamoto, Ryuichi
2014-01-01
In Japan, large-scale health databases have been constructed in the past few years, such as the national claims and health checkup database (NDB) and the Japanese Sentinel project. But there are legal issues in striking an adequate balance between privacy and public benefit in the use of such databases. The NDB operates under the act for elderly persons' health care, but this act says nothing about using the database for general public benefit. Researchers who use this database are therefore forced to pay great attention to anonymization and information security, which may hinder the research work itself. The Japanese Sentinel project is a national project for detecting adverse drug reactions using large-scale distributed clinical databases of large hospitals. Although patients give future consent for such general public-good purposes, the use of insufficiently anonymized data is still under discussion. Generally speaking, researchers conducting studies for public benefit will not infringe patients' privacy, but vague and complex legislative requirements for personal data protection may hinder such research. Medical science does not progress without the use of clinical information; therefore, adequate legislation that is simple and clear for both researchers and patients is strongly required. In Japan, a specific act for balancing privacy and public benefit is now under discussion. The author recommends that researchers, including those in the field of pharmacology, pay attention to, participate in the discussion of, and make suggestions for such acts and regulations.
Ice Accretion Test Results for Three Large-Scale Swept-Wing Models in the NASA Icing Research Tunnel
NASA Technical Reports Server (NTRS)
Broeren, Andy; Potapczuk, Mark; Lee, Sam; Malone, Adam; Paul, Ben; Woodard, Brian
2016-01-01
The design and certification of modern transport airplanes for flight in icing conditions increasingly relies on three-dimensional numerical simulation tools for ice accretion prediction. There is currently no publicly available, high-quality ice accretion database upon which to evaluate the performance of icing simulation tools for large-scale swept wings that are representative of modern commercial transport airplanes. The purpose of this presentation is to present the results of a series of icing wind tunnel test campaigns whose aim was to provide an ice accretion database for large-scale, swept wings.
Large-scale annotation of small-molecule libraries using public databases.
Zhou, Yingyao; Zhou, Bin; Chen, Kaisheng; Yan, S Frank; King, Frederick J; Jiang, Shumei; Winzeler, Elizabeth A
2007-01-01
While many large publicly accessible databases provide excellent annotation for biological macromolecules, the same is not true for small chemical compounds. Commercial data sources also fail to provide an annotation interface for large numbers of compounds and tend to be too cost-prohibitive to be widely available to biomedical researchers. Therefore, using annotation information for the selection of lead compounds from a modern-day high-throughput screening (HTS) campaign presently occurs on only a very limited scale. The recent rapid expansion of the NIH PubChem database provides an opportunity to link existing biological databases with compound catalogs and provides relevant information that could potentially improve the information garnered from large-scale screening efforts. Using the 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) as a model, we determined that approximately 4% of the library contained compounds with potential annotation in such databases as PubChem and the World Drug Index (WDI) as well as related databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and ChemIDplus. Furthermore, exact structure match analysis showed that 32% of GNF compounds can be linked to third-party databases via PubChem. We also showed that annotations such as MeSH (Medical Subject Headings) terms can be applied to in-house HTS databases to identify signature biological inhibition profiles of interest as well as to expedite the assay validation process. The automated annotation of thousands of screening hits in batch is becoming feasible and has the potential to play an essential role in the hit-to-lead decision-making process.
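A minimal sketch of such exact-structure linking is shown below, joining an in-house hit list to cached third-party annotations. Using InChIKeys as the join key, and all table and column names, are assumptions made for illustration, not the workflow the abstract describes.

```python
import sqlite3

# Hypothetical tables (assumed, not GNF's actual schema):
#   screening_hit(compound_id, inchikey, assay, activity_uM)
#   pubchem_annotation(inchikey, cid, mesh_terms)
con = sqlite3.connect("hts_annotation.sqlite")
rows = con.execute("""
SELECT s.compound_id, a.cid, a.mesh_terms
FROM   screening_hit s
JOIN   pubchem_annotation a USING (inchikey)   -- exact structure match
WHERE  s.assay = 'kinase_panel' AND s.activity_uM < 1.0
""")
for compound_id, cid, mesh in rows:
    print(compound_id, cid, mesh)
```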
Application of Large-Scale Database-Based Online Modeling to Plant State Long-Term Estimation
NASA Astrophysics Data System (ADS)
Ogawa, Masatoshi; Ogai, Harutoshi
Recently, attention has been drawn to a local modeling technique based on a new idea called “Just-In-Time (JIT) modeling”. To apply JIT modeling to large databases online, “Large-scale database-based Online Modeling (LOM)” has been proposed. LOM is a technique that makes the retrieval of neighboring data more efficient by using both “stepwise selection” and quantization. In order to predict the long-term state of a plant without using future data of manipulated variables, an Extended Sequential Prediction method of LOM (ESP-LOM) has been proposed. In this paper, the LOM and the ESP-LOM are introduced.
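At its core, JIT modeling answers each query by fitting a small local model to the nearest stored samples rather than maintaining one global model. The sketch below shows only that core idea, assuming a plain Euclidean neighbor search; LOM's stepwise selection and quantization, which make the retrieval efficient at database scale, are not reproduced here.

```python
import numpy as np

def jit_predict(X_db, y_db, x_query, k=50):
    """Just-In-Time-style local modeling sketch: retrieve the k nearest
    stored samples and fit a distance-weighted local linear model."""
    d = np.linalg.norm(X_db - x_query, axis=1)
    idx = np.argsort(d)[:k]                      # neighbor retrieval
    w = 1.0 / (d[idx] + 1e-9)                    # closer samples weigh more
    A = np.hstack([X_db[idx], np.ones((k, 1))])  # affine local model
    coef, *_ = np.linalg.lstsq(A * w[:, None], y_db[idx] * w, rcond=None)
    return np.append(x_query, 1.0) @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                   # stored database
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=1000)
print(jit_predict(X, y, np.array([0.2, -0.1, 0.3])))  # ~0.55
```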
Iris indexing based on local intensity order pattern
NASA Astrophysics Data System (ADS)
Emerich, Simina; Malutan, Raul; Crisan, Septimiu; Lefkovits, Laszlo
2017-03-01
In recent years, iris biometric systems have increased in popularity and have proven capable of handling large-scale databases. The main advantages of these systems are accuracy and reliability. A proper classification of iris patterns is expected to reduce the matching time in huge databases. This paper presents an iris indexing technique based on the Local Intensity Order Pattern. The performance of the present approach is evaluated on the UPOL database and compared with other recent systems designed for iris indexing. The results illustrate the potential of the proposed method for large-scale iris identification.
Intra-reach headwater fish assemblage structure
McKenna, James E.
2017-01-01
Large-scale conservation efforts can take advantage of modern large databases and regional modeling and assessment methods. However, these broad-scale efforts often assume uniform average habitat conditions and/or species assemblages within stream reaches.
2009-01-01
Background Insertional mutagenesis is an effective method for functional genomic studies in various organisms. It can rapidly generate easily tractable mutations. A large-scale insertional mutagenesis screen with the piggyBac (PB) transposon is currently being performed in mice at the Institute of Developmental Biology and Molecular Medicine (IDM), Fudan University in Shanghai, China. This project is carried out via collaborations among multiple groups overseeing interconnected experimental steps and generates a large volume of experimental data continuously. The project therefore calls for an efficient database system for recording, management, statistical analysis, and information exchange. Results This paper presents a database application called MP-PBmice (insertional mutation mapping system of the PB Mutagenesis Information Center), which was developed to serve the ongoing large-scale PB insertional mutagenesis project. A lightweight enterprise-level development framework, Struts-Spring-Hibernate, is used to ensure constructive and flexible support for the application. The MP-PBmice database system has three major features: strict access control, efficient workflow control, and good expandability. It supports collaboration among different groups that enter data and exchange information on a daily basis, and is capable of providing real-time progress reports for the whole project. MP-PBmice can be easily adapted for other large-scale insertional mutation mapping projects, and the source code of this software is freely available at http://www.idmshanghai.cn/PBmice. Conclusion MP-PBmice is a web-based application for large-scale insertional mutation mapping onto the mouse genome, implemented with the widely used Struts-Spring-Hibernate framework. This system is already in use by the ongoing genome-wide PB insertional mutation mapping project at IDM, Fudan University. PMID:19958505
A Review of Stellar Abundance Databases and the Hypatia Catalog Database
NASA Astrophysics Data System (ADS)
Hinkel, Natalie Rose
2018-01-01
The astronomical community is interested in elements from lithium to thorium, from solar twins to peculiarities of stellar evolution, because they give insight into different regimes of star formation and evolution. However, while some trends between elements and other stellar or planetary properties are well known, many other trends are not as obvious and are a point of conflict. For example, stars that host giant planets are found to be consistently enriched in iron, but the same cannot be definitively said for any other element. Therefore, it is time to take advantage of large stellar abundance databases in order to better understand not only the large-scale patterns, but also the more subtle, small-scale trends within the data. In this overview of the special session, I will present a review of large stellar abundance databases that are both currently available (e.g., RAVE, APOGEE) and soon to be online (e.g., Gaia-ESO, GALAH). Additionally, I will discuss the Hypatia Catalog Database (www.hypatiacatalog.com), which includes abundances from individual literature sources that observed stars within 150 pc. The Hypatia Catalog currently contains 72 elements as measured within ~6000 stars, with a total of ~240,000 unique abundance determinations. The online database offers a variety of solar normalizations, stellar properties, and planetary properties (where applicable) that can all be viewed through multiple interactive plotting interfaces as well as in a tabular format. By analyzing stellar abundances for large populations of stars and from a variety of different perspectives, a wealth of information can be revealed on both large and small scales.
USDA-ARS?s Scientific Manuscript database
Tomato Functional Genomics Database (TFGD; http://ted.bti.cornell.edu) provides a comprehensive systems biology resource to store, mine, analyze, visualize and integrate large-scale tomato functional genomics datasets. The database is expanded from the previously described Tomato Expression Database...
DEXTER: Disease-Expression Relation Extraction from Text.
Gupta, Samir; Dingerdissen, Hayley; Ross, Karen E; Hu, Yu; Wu, Cathy H; Mazumder, Raja; Vijay-Shanker, K
2018-01-01
Gene expression levels affect biological processes and play a key role in many diseases. Characterizing expression profiles is useful for clinical research, and for the diagnostics and prognostics of diseases. There are currently several high-quality databases that capture gene expression information, obtained mostly from large-scale studies, such as microarray and next-generation sequencing technologies, in the context of disease. The scientific literature is another rich source of information on gene expression-disease relationships that not only have been captured from large-scale studies but have also been observed in thousands of small-scale studies. Expression information obtained from literature through manual curation can extend expression databases. While many of the existing databases include information from literature, they are limited by the time-consuming nature of manual curation and have difficulty keeping up with the explosion of publications in the biomedical field. In this work, we describe an automated text-mining tool, Disease-Expression Relation Extraction from Text (DEXTER), to extract information from literature on gene and microRNA expression in the context of disease. One of the motivations in developing DEXTER was to extend the BioXpress database, a cancer-focused gene expression database that includes data derived from large-scale experiments and manual curation of publications. The literature-based portion of BioXpress lags significantly behind the expression information obtained from large-scale studies and can benefit from our text-mined results. We have conducted two different evaluations to measure the accuracy of our text-mining tool and achieved average F-scores of 88.51 and 81.81% for the two evaluations, respectively. Also, to demonstrate the ability to extract rich expression information in different disease-related scenarios, we used DEXTER to extract information on differential expression for 2024 genes in lung cancer, 115 glycosyltransferases in 62 cancers and 826 microRNAs in 171 cancers. All extractions using DEXTER are integrated in the literature-based portion of BioXpress. Database URL: http://biotm.cis.udel.edu/DEXTER.
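To make the extraction task concrete, the toy sketch below applies a single illustrative surface pattern of the kind rule-based relation extractors use; it is not one of DEXTER's actual rules, and the gene/disease patterns are deliberately simplistic.

```python
import re

# One illustrative pattern (an assumption, not DEXTER's rule set):
#   "<GENE> is over/underexpressed in <disease> cancer"
PATTERN = re.compile(
    r"(?P<gene>[A-Z0-9]{2,})\s+is\s+(?P<dir>over|under)expressed\s+in\s+"
    r"(?P<disease>[a-z ]+?cancer)")

abstract = ("We found that ERBB2 is overexpressed in breast cancer, "
            "while BRCA1 is underexpressed in ovarian cancer.")

for m in PATTERN.finditer(abstract):
    print(m.group("gene"), m.group("dir") + "expressed", m.group("disease"))
# ERBB2 overexpressed breast cancer
# BRCA1 underexpressed ovarian cancer
```

Real systems layer many such patterns over parsed sentences and normalize the matched names against gene and disease vocabularies before loading results into a database such as BioXpress.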
Spasojevic, Marko J; Bahlai, Christie A; Bradley, Bethany A; Butterfield, Bradley J; Tuanmu, Mao-Ning; Sistla, Seeta; Wiederholt, Ruscena; Suding, Katharine N
2016-04-01
Understanding the mechanisms underlying ecosystem resilience - why some systems have an irreversible response to disturbances while others recover - is critical for conserving biodiversity and ecosystem function in the face of global change. Despite the widespread acceptance of a positive relationship between biodiversity and resilience, empirical evidence for this relationship remains fairly limited in scope and localized in scale. Assessing resilience at the large landscape and regional scales most relevant to land management and conservation practices has been limited by the ability to measure both diversity and resilience over large spatial scales. Here, we combined tools used in large-scale studies of biodiversity (remote sensing and trait databases) with theoretical advances developed from small-scale experiments to ask whether the functional diversity within a range of woodland and forest ecosystems influences the recovery of productivity after wildfires across the four-corner region of the United States. We additionally asked how environmental variation (topography, macroclimate) across this geographic region influences such resilience, either directly or indirectly via changes in functional diversity. Using path analysis, we found that functional diversity in regeneration traits (fire tolerance, fire resistance, resprout ability) was a stronger predictor of the recovery of productivity after wildfire than the functional diversity of seed mass or species richness. Moreover, slope, elevation, and aspect either directly or indirectly influenced the recovery of productivity, likely via their effect on microclimate, while macroclimate had no direct or indirect effects. Our study provides some of the first direct empirical evidence for functional diversity increasing resilience at large spatial scales. Our approach highlights the power of combining theory based on local-scale studies with tools used in studies at large spatial scales and trait databases to understand pressing environmental issues. © 2015 John Wiley & Sons Ltd.
The EpiSLI Database: A Publicly Available Database on Speech and Language
ERIC Educational Resources Information Center
Tomblin, J. Bruce
2010-01-01
Purpose: This article describes a database that was created in the process of conducting a large-scale epidemiologic study of specific language impairment (SLI). As such, this database will be referred to as the EpiSLI database. Children with SLI have unexpected and unexplained difficulties learning and using spoken language. Although there is no…
Pattern-based, multi-scale segmentation and regionalization of EOSD land cover
NASA Astrophysics Data System (ADS)
Niesterowicz, Jacek; Stepinski, Tomasz F.
2017-10-01
The Earth Observation for Sustainable Development of Forests (EOSD) map is a 25 m resolution thematic map of Canadian forests. Because of its large spatial extent and relatively high resolution, the EOSD is difficult to analyze using standard GIS methods. In this paper we propose multi-scale segmentation and regionalization of EOSD as new methods for analyzing EOSD on large spatial scales. Segments, which we refer to as forest land units (FLUs), are delineated as tracts of forest characterized by cohesive patterns of EOSD categories; we delineated from 727 to 91,885 FLUs within the spatial extent of EOSD, depending on the selected scale of a pattern. The pattern of EOSD categories within each FLU is described by 1037 landscape metrics. A shapefile containing boundaries of all FLUs together with an attribute table listing landscape metrics make up an SQL-searchable spatial database providing detailed information on the composition and pattern of land cover types in Canadian forest. The shapefile format and extensive attribute table pertaining to the entire legend of EOSD are designed to facilitate a broad range of investigations in which assessment of the composition and pattern of forest over large areas is needed. We calculated four such databases using different spatial scales of pattern. We illustrate the use of the FLU database for producing forest regionalization maps of two Canadian provinces, Quebec and Ontario. Such maps capture the broad-scale variability of forest at the spatial scale of the entire province. We also demonstrate how the FLU database can be used to map the variability of landscape metrics, and thus the character of landscape, over the entire Canada.
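A shapefile-plus-attribute-table database of this kind can be queried directly from Python; the sketch below filters FLUs on two metrics and maps one of them. The column names used here are assumptions for illustration, standing in for two of the 1037 metrics in the actual table.

```python
import geopandas as gpd

# Column names below are assumptions for illustration; the real FLU
# attribute table carries 1037 landscape metrics per segment.
flu = gpd.read_file("flu_scale2.shp")

# Select forest land units dominated by coniferous cover with high
# landscape diversity, then map one metric.
subset = flu[(flu["PLAND_CONIF"] > 60) & (flu["SHDI"] > 1.5)]
subset.plot(column="SHDI", legend=True)
```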
ERIC Educational Resources Information Center
Rice, Michael; Gladstone, William; Weir, Michael
2004-01-01
We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a…
Multiresource inventories incorporating GIS, GPS, and database management systems
Loukas G. Arvanitis; Balaji Ramachandran; Daniel P. Brackett; Hesham Abd-El Rasol; Xuesong Du
2000-01-01
Large-scale natural resource inventories generate enormous data sets. Their effective handling requires a sophisticated database management system. Such a system must be robust enough to efficiently store large amounts of data and flexible enough to allow users to manipulate a wide variety of information. In a pilot project, related to a multiresource inventory of the...
ERIC Educational Resources Information Center
Alexopoulou, Theodora; Michel, Marije; Murakami, Akira; Meurers, Detmar
2017-01-01
Large-scale learner corpora collected from online language learning platforms, such as the EF-Cambridge Open Language Database (EFCAMDAT), provide opportunities to analyze learner data at an unprecedented scale. However, interpreting the learner language in such corpora requires a precise understanding of tasks: How does the prompt and input of a…
LSD: Large Survey Database framework
NASA Astrophysics Data System (ADS)
Juric, Mario
2012-09-01
The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures.
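The core operation such frameworks accelerate is positional cross-matching between catalogs. The toy sketch below shows that operation with astropy rather than LSD's own API (which is not reproduced here); LSD's contribution is doing this over billions of rows across many nodes.

```python
from astropy.coordinates import SkyCoord
import astropy.units as u

# Two tiny toy catalogs; real survey catalogs hold >10^9 rows.
cat1 = SkyCoord(ra=[10.001, 45.2] * u.deg, dec=[-5.0, 20.1] * u.deg)
cat2 = SkyCoord(ra=[10.002, 45.3, 80.0] * u.deg,
                dec=[-5.001, 20.0, 1.0] * u.deg)

# For each source in cat1, find its nearest neighbor in cat2,
# then keep only matches within a 1-arcsecond radius.
idx, sep2d, _ = cat1.match_to_catalog_sky(cat2)
matched = sep2d < 1 * u.arcsec
print(idx, sep2d.arcsec, matched)
```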
Active Exploration of Large 3D Model Repositories.
Gao, Lin; Cao, Yan-Pei; Lai, Yu-Kun; Huang, Hao-Zhi; Kobbelt, Leif; Hu, Shi-Min
2015-12-01
With broader availability of large-scale 3D model repositories, the need for efficient and effective exploration becomes more and more urgent. Existing model retrieval techniques do not scale well with the size of the database since often a large number of very similar objects are returned for a query, and the possibilities to refine the search are quite limited. We propose an interactive approach where the user feeds an active learning procedure by labeling either entire models or parts of them as "like" or "dislike" such that the system can automatically update an active set of recommended models. To provide an intuitive user interface, candidate models are presented based on their estimated relevance for the current query. From the methodological point of view, our main contribution is to exploit not only the similarity between a query and the database models but also the similarities among the database models themselves. We achieve this by an offline pre-processing stage, where global and local shape descriptors are computed for each model and a sparse distance metric is derived that can be evaluated efficiently even for very large databases. We demonstrate the effectiveness of our method by interactively exploring a repository containing over 100 K models.
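The feedback loop at the heart of such exploration can be sketched as a Rocchio-style scoring rule over precomputed shape descriptors: boost models similar to "liked" examples and penalize those similar to "disliked" ones. This is a minimal stand-in, not the paper's learning procedure or its sparse distance metric.

```python
import numpy as np

def rf_scores(descriptors, liked_idx, disliked_idx, alpha=1.0, beta=0.5):
    """Rocchio-style relevance feedback sketch: rank all models by mean
    cosine similarity to liked examples minus similarity to disliked."""
    def mean_sim(idx):
        if len(idx) == 0:
            return np.zeros(len(descriptors))
        labeled = descriptors[idx]
        num = descriptors @ labeled.T               # dot products
        den = (np.linalg.norm(descriptors, axis=1, keepdims=True) *
               np.linalg.norm(labeled, axis=1))
        return (num / den).mean(axis=1)             # mean cosine similarity
    return alpha * mean_sim(liked_idx) - beta * mean_sim(disliked_idx)

D = np.random.default_rng(1).normal(size=(1000, 128))  # shape descriptors
scores = rf_scores(D, liked_idx=[3, 17], disliked_idx=[42])
print(np.argsort(-scores)[:10])                     # top-10 recommendations
```

In the paper's setting the offline stage additionally precomputes model-to-model similarities, so each feedback round only re-weights stored distances instead of touching the raw geometry.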
NASA Astrophysics Data System (ADS)
Chen, J.; Wang, D.; Zhao, R. L.; Zhang, H.; Liao, A.; Jiu, J.
2014-04-01
Geospatial databases are an irreplaceable national treasure of immense importance. Their up-to-dateness, i.e., their consistency with respect to the real world, plays a critical role in their value and applications. The continuous updating of map databases at 1:50,000 scale is a massive and difficult task for large countries covering more than several million square kilometers. This paper presents the research and technological development supporting national map updating at 1:50,000 scale in China, including the development of updating models and methods, production tools and systems for large-scale and rapid updating, as well as the design and implementation of the continuous updating workflow. The work required the use of many data sources and their integration into a high-accuracy, quality-checked product, which in turn required up-to-date techniques of image matching, semantic integration, generalization, database management, and conflict resolution. Specific software tools and packages were designed and developed to support large-scale updating production with high-resolution imagery and large-scale data generalization, such as map generalization, GIS-supported change interpretation from imagery, DEM interpolation, image-matching-based orthophoto generation, and data control at different levels. A national 1:50,000 database updating strategy and its production workflow were designed, including a full-coverage updating pattern characterized by all-element topographic data modeling, change detection in all related areas, and whole-process data quality control; a series of technical production specifications; and a network of updating production units in different geographic regions of the country.
NASA Astrophysics Data System (ADS)
Gong, L.
2013-12-01
Large-scale hydrological models and land surface models are by far the only tools for assessing future water resources in climate change impact studies. Those models estimate discharge with large uncertainties, due to the complex interaction between climate and hydrology, the limited quality and availability of data, and model uncertainties. A new, purely data-based scale-extrapolation method is proposed to estimate water resources for a large basin solely from selected small sub-basins, which are typically two orders of magnitude smaller than the large basin. Those small sub-basins contain sufficient information, not only on climate and land surface but also on hydrological characteristics, for the large basin. In the Baltic Sea drainage basin, the best discharge estimation for the gauged area was achieved with sub-basins that cover 2-4% of the gauged area. There exist multiple sets of sub-basins that resemble the climate and hydrology of the basin equally well. Those multiple sets estimate annual discharge for the gauged area consistently well, with a 5% average error. The scale-extrapolation method is completely data-based; therefore it does not force any modelling error into the prediction. The multiple predictions are expected to bracket the inherent variations and uncertainties of the climate and hydrology of the basin. The method can be applied to both un-gauged basins and un-gauged periods, with uncertainty estimation.
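In its simplest conceivable form (a minimal illustration of area-based extrapolation, not necessarily the paper's exact estimator), the large-basin discharge follows from the mean specific discharge of $n$ representative sub-basins:

$$\hat{Q}_{\text{large}} \;=\; A_{\text{large}} \cdot \frac{1}{n}\sum_{i=1}^{n}\frac{Q_i}{A_i},$$

where $Q_i$ and $A_i$ are the observed discharge and area of sub-basin $i$. Selecting sub-basins whose climate and hydrology resemble the whole basin is what makes the extrapolation defensible, and using several equally good sub-basin sets yields the bracketing predictions described above.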
Sadygov, Rovshan G; Cociorva, Daniel; Yates, John R
2004-12-01
Database searching is an essential element of large-scale proteomics. Because these methods are widely used, it is important to understand the rationale of the algorithms. Most algorithms are based on concepts first developed in SEQUEST and PeptideSearch. Four basic approaches are used to determine a match between a spectrum and sequence: descriptive, interpretative, stochastic and probability-based matching. We review the basic concepts used by most search algorithms, the computational modeling of peptide identification and current challenges and limitations of this approach for protein identification.
Digital geomorphological landslide hazard mapping of the Alpago area, Italy
NASA Astrophysics Data System (ADS)
van Westen, Cees J.; Soeters, Rob; Sijmons, Koert
Large-scale geomorphological maps of mountainous areas are traditionally made using complex symbol-based legends. They can serve as excellent "geomorphological databases", from which an experienced geomorphologist can extract a large amount of information for hazard mapping. However, these maps are not designed to be used in combination with a GIS, due to their complex cartographic structure. In this paper, two methods are presented for digital geomorphological mapping at large scales using GIS and digital cartographic software. The methods are applied to an area with a complex geomorphological setting in the Borsoia catchment, located in the Alpago region, near Belluno in the Italian Alps. The GIS database set-up is presented with an overview of the data layers that have been generated and how they are interrelated. The GIS database was also converted into a paper map, using a digital cartographic package. The resulting large-scale geomorphological hazard map is attached. The resulting GIS database and cartographic product can be used to analyse the hazard type and hazard degree for each polygon, and to find the reasons for the hazard classification.
High Performance Semantic Factoring of Giga-Scale Semantic Graph Databases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joslyn, Cliff A.; Adolf, Robert D.; Al-Saffar, Sinan
2010-10-04
As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture, and present the results of our deploying that for the analysis of the Billion Triple dataset with respect to its semantic factors.
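One elementary "semantic factor" of a triple store is the distribution of triples over predicate namespaces. The toy-scale sketch below computes it with rdflib; the Billion Triple analysis itself required the multithreaded Cray XMT pipeline described above, not a single-machine script like this.

```python
from collections import Counter
from rdflib import Graph
from rdflib.namespace import split_uri

# Toy-scale sketch: tally triples per predicate namespace.
g = Graph()
g.parse("sample.nt", format="nt")   # hypothetical small N-Triples file

ns_counts = Counter(split_uri(p)[0] for _, p, _ in g)
for namespace, n in ns_counts.most_common(10):
    print(n, namespace)
```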
Large-Scale medical image analytics: Recent methodologies, applications and Future directions.
Zhang, Shaoting; Metaxas, Dimitris
2016-10-01
Despite the ever-increasing amount and complexity of annotated medical image data, the development of large-scale medical image analysis algorithms has not kept pace with the need for methods that bridge the semantic gap between images and diagnoses. The goal of this position paper is to discuss and explore innovative and large-scale data science techniques in medical image analytics, which will benefit clinical decision-making and facilitate efficient medical data management. In particular, we advocate that the scale of image retrieval systems should be significantly increased, so that interactive systems can be effective for knowledge discovery in potentially large databases of medical images. For clinical relevance, such systems should return results in real time, incorporate expert feedback, and be able to cope with the size, quality, and variety of the medical images and their associated metadata for a particular domain. The design, development, and testing of such a framework can significantly impact interactive mining in medical image databases that are growing rapidly in size and complexity, and enable novel methods of analysis at much larger scales in an efficient, integrated fashion. Copyright © 2016. Published by Elsevier B.V.
Content Is King: Databases Preserve the Collective Information of Science.
Yates, John R
2018-04-01
Databases store sequence information experimentally gathered to create resources that further science. In the last 20 years databases have become critical components of fields like proteomics where they provide the basis for large-scale and high-throughput proteomic informatics. Amos Bairoch, winner of the Association of Biomolecular Resource Facilities Frederick Sanger Award, has created some of the important databases proteomic research depends upon for accurate interpretation of data.
Comparison of the Frontier Distributed Database Caching System to NoSQL Databases
NASA Astrophysics Data System (ADS)
Dykstra, Dave
2012-12-01
One of the main attractions of non-relational “NoSQL” databases is their ability to scale to large numbers of readers, including readers spread over a wide area. The Frontier distributed database caching system, used in production by the Large Hadron Collider CMS and ATLAS detector projects for Conditions data, is based on traditional SQL databases but also adds high scalability and the ability to be distributed over a wide area for an important subset of applications. This paper compares the major characteristics of the two different approaches and identifies the criteria for choosing which approach to prefer over the other. It also compares in some detail the NoSQL databases used by CMS and ATLAS: MongoDB, CouchDB, HBase, and Cassandra.
Comparison of the Frontier Distributed Database Caching System to NoSQL Databases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dykstra, Dave
One of the main attractions of non-relational NoSQL databases is their ability to scale to large numbers of readers, including readers spread over a wide area. The Frontier distributed database caching system, used in production by the Large Hadron Collider CMS and ATLAS detector projects for Conditions data, is based on traditional SQL databases but also adds high scalability and the ability to be distributed over a wide area for an important subset of applications. This paper compares the major characteristics of the two different approaches and identifies the criteria for choosing which approach to prefer over the other. It also compares in some detail the NoSQL databases used by CMS and ATLAS: MongoDB, CouchDB, HBase, and Cassandra.
Large-scale feature searches of collections of medical imagery
NASA Astrophysics Data System (ADS)
Hedgcock, Marcus W.; Karshat, Walter B.; Levitt, Tod S.; Vosky, D. N.
1993-09-01
Large scale feature searches of accumulated collections of medical imagery are required for multiple purposes, including clinical studies, administrative planning, epidemiology, teaching, quality improvement, and research. To perform a feature search of large collections of medical imagery, one can either search text descriptors of the imagery in the collection (usually the interpretation), or (if the imagery is in digital format) the imagery itself. At our institution, text interpretations of medical imagery are all available in our VA Hospital Information System. These are downloaded daily into an off-line computer. The text descriptors of most medical imagery are usually formatted as free text, and so require a user friendly database search tool to make searches quick and easy for any user to design and execute. We are tailoring such a database search tool (Liveview), developed by one of the authors (Karshat). To further facilitate search construction, we are constructing (from our accumulated interpretation data) a dictionary of medical and radiological terms and synonyms. If the imagery database is digital, the imagery which the search discovers is easily retrieved from the computer archive. We describe our database search user interface, with examples, and compare the efficacy of computer assisted imagery searches from a clinical text database with manual searches. Our initial work on direct feature searches of digital medical imagery is outlined.
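A minimal sketch of this kind of free-text interpretation search is shown below using SQLite's FTS5 full-text index; it illustrates the query style only and is not the Liveview tool described above. The report texts are invented examples.

```python
import sqlite3

# Full-text search over radiology interpretation text (illustrative data).
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE report USING fts5(accession, body)")
con.executemany("INSERT INTO report VALUES (?, ?)", [
    ("r1", "Right lower lobe pneumonia, no pleural effusion."),
    ("r2", "Solitary pulmonary nodule in the left upper lobe."),
    ("r3", "Normal chest radiograph."),
])

# Find reports whose interpretation mentions either feature; a synonym
# dictionary like the one described above would expand the query terms.
for acc, body in con.execute(
        "SELECT accession, body FROM report "
        "WHERE report MATCH 'nodule OR pneumonia'"):
    print(acc, body)
```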
Open source database of images DEIMOS: extension for large-scale subjective image quality assessment
NASA Astrophysics Data System (ADS)
Vítek, Stanislav
2014-09-01
DEIMOS (Database of Images: Open Source) is an open-source database of images and video sequences for testing, verification and comparison of various image and/or video processing techniques such as compression, reconstruction and enhancement. This paper deals with an extension of the database that allows large-scale web-based subjective image quality assessment to be performed. The extension implements both an administrative and a client interface. The proposed system is aimed mainly at mobile communication devices, taking advantage of HTML5 technology; participants do not need to install any application, and assessment can be performed using a web browser. The assessment campaign administrator can select images from the large database and then apply rules defined by various test procedure recommendations. The standard test procedures may be fully customized and saved as templates. Alternatively, the administrator can define a custom test using images from the pool and other components, such as evaluation forms and ongoing questionnaires. The image sequence is delivered to the online client, e.g. a smartphone or tablet, as a fully automated assessment sequence, or the viewer can decide on the timing of the assessment if required. Environmental data and viewing conditions (e.g. illumination, vibrations, GPS coordinates, etc.) may be collected and subsequently analyzed.
Large-scale silviculture experiments of western Oregon and Washington.
Nathan J. Poage; Paul D. Anderson
2007-01-01
We review 12 large-scale silviculture experiments (LSSEs) in western Washington and Oregon with which the Pacific Northwest Research Station of the USDA Forest Service is substantially involved. We compiled and arrayed information about the LSSEs as a series of matrices in a relational database, which is included on the compact disc published with this report and...
Intelligent Interfaces for Mining Large-Scale RNAi-HCS Image Databases
Lin, Chen; Mak, Wayne; Hong, Pengyu; Sepp, Katharine; Perrimon, Norbert
2010-01-01
Recently, high-content screening (HCS) has been combined with RNA interference (RNAi) to become an essential image-based high-throughput method for studying genes and biological networks through RNAi-induced cellular phenotype analyses. However, a genome-wide RNAi-HCS screen typically generates tens of thousands of images, most of which remain uncategorized due to the inadequacies of existing HCS image analysis tools. Until now, it has still required highly trained scientists to browse a prohibitively large RNAi-HCS image database and produce only a handful of qualitative results regarding cellular morphological phenotypes. For this reason, we have developed intelligent interfaces to facilitate the application of HCS technology in biomedical research. Our new interfaces empower biologists with computational power not only to effectively and efficiently explore large-scale RNAi-HCS image databases, but also to apply their knowledge and experience to the interactive mining of cellular phenotypes using Content-Based Image Retrieval (CBIR) with Relevance Feedback (RF) techniques. PMID:21278820
van Staa, T-P; Klungel, O; Smeeth, L
2014-06-01
A solid foundation of evidence of the effects of an intervention is a prerequisite of evidence-based medicine. The best source of such evidence is considered to be randomized trials, which are able to avoid confounding. However, they may not always estimate effectiveness in clinical practice. Databases that collate anonymized electronic health records (EHRs) from different clinical centres have been widely used for many years in observational studies. Randomized point-of-care trials have been initiated recently to recruit and follow patients using the data from EHR databases. In this review, we describe how EHR databases can be used for conducting large-scale simple trials and discuss the advantages and disadvantages of their use. © 2014 The Association for the Publication of the Journal of Internal Medicine.
High performance semantic factoring of giga-scale semantic graph databases.
DOE Office of Scientific and Technical Information (OSTI.GOV)
al-Saffar, Sinan; Adolf, Bob; Haglin, David
2010-10-01
As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture, and present the results of our deploying that for the analysis of the Billion Triple dataset with respect to its semantic factors, including basic properties, connected components, namespace interaction, and typed paths.
Uniform standards for genome databases in forest and fruit trees
USDA-ARS?s Scientific Manuscript database
TreeGenes and tfGDR serve the international forestry and fruit tree genomics research communities, respectively. These databases hold similar sequence data and provide resources for the submission and recovery of this information in order to enable comparative genomics research. Large-scale genotype...
Very large database of lipids: rationale and design.
Martin, Seth S; Blaha, Michael J; Toth, Peter P; Joshi, Parag H; McEvoy, John W; Ahmed, Haitham M; Elshazly, Mohamed B; Swiger, Kristopher J; Michos, Erin D; Kwiterovich, Peter O; Kulkarni, Krishnaji R; Chimera, Joseph; Cannon, Christopher P; Blumenthal, Roger S; Jones, Steven R
2013-11-01
Blood lipids have major cardiovascular and public health implications. Lipid-lowering drugs are prescribed based in part on categorization of patients into normal or abnormal lipid metabolism, yet relatively little emphasis has been placed on: (1) the accuracy of current lipid measures used in clinical practice, (2) the reliability of current categorizations of dyslipidemia states, and (3) the relationship of advanced lipid characterization to other cardiovascular disease biomarkers. To these ends, we developed the Very Large Database of Lipids (NCT01698489), an ongoing database protocol that harnesses deidentified data from the daily operations of a commercial lipid laboratory. The database includes individuals who were referred for clinical purposes for a Vertical Auto Profile (Atherotech Inc., Birmingham, AL), which directly measures cholesterol concentrations of low-density lipoprotein, very low-density lipoprotein, intermediate-density lipoprotein, high-density lipoprotein, their subclasses, and lipoprotein(a). Individual Very Large Database of Lipids studies, ranging from studies of measurement accuracy, to dyslipidemia categorization, to biomarker associations, to characterization of rare lipid disorders, are investigator-initiated and utilize peer-reviewed statistical analysis plans to address a priori hypotheses/aims. In the first database harvest (Very Large Database of Lipids 1.0) from 2009 to 2011, there were 1 340 614 adult and 10 294 pediatric patients; the adult sample had a median age of 59 years (interquartile range, 49-70 years) with even representation by sex. Lipid distributions closely matched those from the population-representative National Health and Nutrition Examination Survey. The second harvest of the database (Very Large Database of Lipids 2.0) is underway. Overall, the Very Large Database of Lipids database provides an opportunity for collaboration and new knowledge generation through careful examination of granular lipid data on a large scale. © 2013 Wiley Periodicals, Inc.
2013-01-01
Background A large-scale, highly accurate, machine-understandable drug-disease treatment relationship knowledge base is important for computational approaches to drug repurposing. The large body of published biomedical research articles and clinical case reports available on MEDLINE is a rich source of FDA-approved drug-disease indication as well as drug-repurposing knowledge that is crucial for applying FDA-approved drugs for new diseases. However, much of this information is buried in free text and not captured in any existing databases. The goal of this study is to extract a large number of accurate drug-disease treatment pairs from published literature. Results In this study, we developed a simple but highly accurate pattern-learning approach to extract treatment-specific drug-disease pairs from 20 million biomedical abstracts available on MEDLINE. We extracted a total of 34,305 unique drug-disease treatment pairs, the majority of which are not included in existing structured databases. Our algorithm achieved a precision of 0.904 and a recall of 0.131 in extracting all pairs, and a precision of 0.904 and a recall of 0.842 in extracting frequent pairs. In addition, we have shown that the extracted pairs strongly correlate with both drug target genes and therapeutic classes, therefore may have high potential in drug discovery. Conclusions We demonstrated that our simple pattern-learning relationship extraction algorithm is able to accurately extract many drug-disease pairs from the free text of biomedical literature that are not captured in structured databases. The large-scale, accurate, machine-understandable drug-disease treatment knowledge base that is resultant of our study, in combination with pairs from structured databases, will have high potential in computational drug repurposing tasks. PMID:23742147
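The reported precision/recall pairs determine the corresponding balanced F-scores via the standard formula:

$$F_1 = \frac{2PR}{P+R}, \qquad F_1^{\text{all}} = \frac{2(0.904)(0.131)}{0.904+0.131} \approx 0.229, \qquad F_1^{\text{frequent}} = \frac{2(0.904)(0.842)}{0.904+0.842} \approx 0.872$$

That is, the pattern-learning extractor trades recall for precision on rare pairs but is both precise and comprehensive on frequently mentioned drug-disease pairs.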
ERIC Educational Resources Information Center
Hampden-Thompson, Gillian; Lubben, Fred; Bennett, Judith
2011-01-01
Quantitative secondary analysis of large-scale data can be combined with in-depth qualitative methods. In this paper, we discuss the role of this combined methods approach in examining the uptake of physics and chemistry in post compulsory schooling for students in England. The secondary data analysis of the National Pupil Database (NPD) served…
Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency.
Aniceto, Rodrigo; Xavier, Rene; Guimarães, Valeria; Hondo, Fernanda; Holanda, Maristela; Walter, Maria Emilia; Lifschitz, Sérgio
2015-01-01
Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them concerns the management of massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. Finding an alternative to the frequently considered relational database model becomes a compelling task. Other data models may be more effective when dealing with very large amounts of nonconventional data, especially for writing and retrieving operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB.
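The write-heavy pattern the paper evaluates looks roughly like the sketch below, using the DataStax Python driver; the keyspace, table, and column names are illustrative choices, not the paper's actual schema.

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra node (assumes one is running).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Illustrative schema: sequencing reads partitioned by run.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS genomics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS genomics.read (
        run_id text, read_id text, sequence text, quality text,
        PRIMARY KEY (run_id, read_id))
""")

# The write path a sequencer-ingest loop would exercise.
session.execute(
    "INSERT INTO genomics.read (run_id, read_id, sequence, quality) "
    "VALUES (%s, %s, %s, %s)",
    ("run42", "r0001", "ACGTACGT", "IIIIIIII"))
```

Partitioning by `run_id` keeps each sequencing run's reads together on disk, which suits the append-mostly, scan-by-run access pattern of genomic pipelines.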
Advancing the large-scale CCS database for metabolomics and lipidomics at the machine-learning era.
Zhou, Zhiwei; Tu, Jia; Zhu, Zheng-Jiang
2018-02-01
Metabolomics and lipidomics aim to comprehensively measure the dynamic changes of all metabolites and lipids that are present in biological systems. The use of ion mobility-mass spectrometry (IM-MS) for metabolomics and lipidomics has facilitated the separation and identification of metabolites and lipids in complex biological samples. The collision cross-section (CCS) value derived from IM-MS is a valuable physiochemical property for the unambiguous identification of metabolites and lipids. However, CCS values obtained from experimental measurement and computational modeling are of limited availability, which significantly restricts the application of IM-MS. In this review, we discuss the recently developed machine-learning based prediction approach, which can efficiently generate precise CCS databases on a large scale. We will also highlight the applications of CCS databases to support metabolomics and lipidomics. Copyright © 2017 Elsevier Ltd. All rights reserved.
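The skeleton of descriptor-based CCS prediction is ordinary supervised regression, as sketched below; the random descriptors and synthetic CCS values are placeholders, whereas published predictors of the kind this review discusses train on curated molecular descriptors and experimental CCS libraries.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Placeholder data: rows are molecules, columns are molecular descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = 150 + 10 * X[:, 0] + rng.normal(scale=2, size=500)   # CCS values (A^2)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)

# Median relative error is the figure of merit usually quoted
# for CCS prediction.
rel_err = np.abs(model.predict(X_te) - y_te) / y_te
print("median relative error:", np.median(rel_err))
```

Once trained, such a model is run over an entire compound library to produce the large-scale predicted-CCS database used for identification.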
Competitive code-based fast palmprint identification using a set of cover trees
NASA Astrophysics Data System (ADS)
Yue, Feng; Zuo, Wangmeng; Zhang, David; Wang, Kuanquan
2009-06-01
A palmprint identification system recognizes a query palmprint image by searching for its nearest neighbor from among all the templates in a database. When applied in a large-scale identification system, it is often necessary to speed up the nearest-neighbor searching process. We use competitive code, which has very fast feature extraction and matching speeds, for palmprint identification. To speed up the identification process, we extend the cover tree method and propose to use a set of cover trees to facilitate fast and accurate nearest-neighbor searching. We can use the cover tree method because, as we show, the angular distance used in competitive code can be decomposed into a set of metrics. Using the Hong Kong PolyU palmprint database (version 2) and a large-scale palmprint database, our experimental results show that the proposed method searches for nearest neighbors faster than brute-force searching.
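The angular distance in question compares, pixel by pixel, the winning orientation indices of two competitive-code maps with wrap-around. Because the total is a sum of per-pixel terms that are themselves metrics, metric-tree indexes such as cover trees become applicable. A simplified sketch of that distance (our reading of the standard competitive-code matcher, not the authors' exact implementation):

```python
import numpy as np

N_ORIENT = 6  # competitive code assigns one of N discrete orientations

def angular_distance(code_a, code_b):
    """Sum over pixels of the wrap-around difference between winning
    orientation indices; each per-pixel term is itself a metric, which
    is what permits metric-tree (e.g., cover tree) indexing."""
    d = np.abs(code_a - code_b)
    return np.minimum(d, N_ORIENT - d).sum()

rng = np.random.default_rng(0)
template = rng.integers(0, N_ORIENT, size=(32, 32))
query = template.copy()
query[:4] = rng.integers(0, N_ORIENT, size=(4, 32))  # perturb a few rows
print(angular_distance(query, template))             # small distance
```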
Validation of a common data model for active safety surveillance research
Ryan, Patrick B; Reich, Christian G; Hartzema, Abraham G; Stang, Paul E
2011-01-01
Objective Systematic analysis of observational medical databases for active safety surveillance is hindered by the variation in data models and coding systems. Data analysts often find robust clinical data models difficult to understand and ill suited to support their analytic approaches. Further, some models do not facilitate the computations required for systematic analysis across many interventions and outcomes for large datasets. Translating the data from these idiosyncratic data models to a common data model (CDM) could facilitate both the analysts' understanding and the suitability for large-scale systematic analysis. In addition to facilitating analysis, a suitable CDM has to faithfully represent the source observational database. Before beginning to use the Observational Medical Outcomes Partnership (OMOP) CDM and a related dictionary of standardized terminologies for a study of large-scale systematic active safety surveillance, the authors validated the model's suitability for this use by example. Validation by example To validate the OMOP CDM, the model was instantiated into a relational database, data from 10 different observational healthcare databases were loaded into separate instances, a comprehensive array of analytic methods that operate on the data model was created, and these methods were executed against the databases to measure performance. Conclusion There was acceptable representation of the data from 10 observational databases in the OMOP CDM using the standardized terminologies selected, and a range of analytic methods was developed and executed with sufficient performance to be useful for active safety surveillance. PMID:22037893
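To make the "systematic analysis" concrete, the sketch below shows the flavor of a CDM query that counts persons experiencing an outcome within 30 days of a drug exposure. The table and column names follow the OMOP CDM, but the concept IDs and the 30-day risk window are illustrative choices, not part of the validation described above.

```python
import sqlite3

# Simplified OMOP-CDM-flavored cohort count (illustrative parameters).
con = sqlite3.connect("cdm.sqlite")
(n,) = con.execute("""
SELECT COUNT(DISTINCT e.person_id)
FROM   drug_exposure e
JOIN   condition_occurrence c
       ON  c.person_id = e.person_id
       AND julianday(c.condition_start_date)
           BETWEEN julianday(e.drug_exposure_start_date)
           AND     julianday(e.drug_exposure_start_date) + 30
WHERE  e.drug_concept_id = 1124300        -- hypothetical concept ids
  AND  c.condition_concept_id = 316139
""").fetchone()
print(n)
```

Because every source database is translated into the same tables and standardized concept IDs, this one query runs unchanged across all ten databases, which is precisely what makes large-scale systematic surveillance tractable.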
Alecu, I M; Zheng, Jingjing; Zhao, Yan; Truhlar, Donald G
2010-09-14
Optimized scale factors for calculating vibrational harmonic and fundamental frequencies and zero-point energies have been determined for 145 electronic model chemistries, including 119 based on approximate functionals depending on occupied orbitals, 19 based on single-level wave function theory, three based on the neglect-of-diatomic-differential-overlap, two based on doubly hybrid density functional theory, and two based on multicoefficient correlation methods. Forty of the scale factors are obtained from large databases, which are also used to derive two universal scale factor ratios that can be used to interconvert between scale factors optimized for various properties, enabling the derivation of three key scale factors at the effort of optimizing only one of them. A reduced scale factor optimization model is formulated in order to further reduce the cost of optimizing scale factors, and the reduced model is illustrated by using it to obtain 105 additional scale factors. Using root-mean-square errors from the values in the large databases, we find that scaling reduces errors in zero-point energies by a factor of 2.3 and errors in fundamental vibrational frequencies by a factor of 3.0, but it reduces errors in harmonic vibrational frequencies by only a factor of 1.3. It is shown that, upon scaling, the balanced multicoefficient correlation method based on coupled cluster theory with single and double excitations (BMC-CCSD) can lead to very accurate predictions of vibrational frequencies. With a polarized, minimally augmented basis set, the density functionals with zero-point energy scale factors closest to unity are MPWLYP1M (1.009), τHCTHhyb (0.989), BB95 (1.012), BLYP (1.013), BP86 (1.014), B3LYP (0.986), MPW3LYP (0.986), and VSXC (0.986).
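The optimization underlying such scale factors is a one-parameter least-squares fit. Writing $\omega_i$ for the computed harmonic frequencies and $\nu_i$ for the reference values (our notation), the scale factor minimizing the root-mean-square error has the closed form

$$\lambda^{*} \;=\; \arg\min_{\lambda}\sum_i \left(\lambda\,\omega_i - \nu_i\right)^{2} \;=\; \frac{\sum_i \omega_i\,\nu_i}{\sum_i \omega_i^{2}},$$

evaluated separately against harmonic-frequency, fundamental-frequency, or zero-point-energy reference data; the universal scale factor ratios mentioned above then interconvert the three.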
WikiPEATia - a web based platform for assembling peatland data through ‘crowd sourcing’
NASA Astrophysics Data System (ADS)
Wisser, D.; Glidden, S.; Fieseher, C.; Treat, C. C.; Routhier, M.; Frolking, S. E.
2009-12-01
The Earth System Science community is realizing that peatlands are an important and unique terrestrial ecosystem that has not yet been well-integrated into large-scale earth system analyses. A major hurdle is the lack of accessible, geospatial data of peatland distribution, coupled with data on peatland properties (e.g., vegetation composition, peat depth, basal dates, soil chemistry, peatland class) at the global scale. These data, however, are available at the local scale. Although a comprehensive global database on peatlands probably lags similar data on more economically important ecosystems such as forests, grasslands, and croplands, a large amount of field data have been collected over the past several decades. A few efforts have been made to map peatlands at large scales, but the existing data either have not been assembled into a single, publicly accessible geospatial database or do not depict peatlands with the level of detail needed by the Earth System Science community. A global peatland database would contribute to advances in a number of research fields such as hydrology, vegetation and ecosystem modeling, permafrost modeling, and earth system modeling. We present a Web 2.0 approach that uses state-of-the-art web server and innovative online mapping technologies and is designed to create such a global database through ‘crowd-sourcing’. Primary functions of the online system include form-driven textual user input of peatland research metadata, spatial data input of peatland areas via a mapping interface, database editing and querying capabilities, as well as advanced visualization and data analysis tools. WikiPEATia provides an integrated information technology platform for assembling, integrating, and posting peatland-related geospatial datasets, and it facilitates and encourages research community involvement. A successful effort will make existing peatland data much more useful to the research community, and will help to identify significant data gaps.
Ogishima, Soichi; Takai, Takako; Shimokawa, Kazuro; Nagaie, Satoshi; Tanaka, Hiroshi; Nakaya, Jun
2015-01-01
The Tohoku Medical Megabank project is a national project for the revitalization of the area of the Tohoku region devastated by the Great East Japan Earthquake, and it has conducted a large-scale prospective genome-cohort study. Along with the prospective genome-cohort study, we have developed an integrated database and knowledge base that will be a key database for realizing personalized prevention and medicine.
Hermjakob, Henning; Montecchi-Palazzi, Luisa; Bader, Gary; Wojcik, Jérôme; Salwinski, Lukasz; Ceol, Arnaud; Moore, Susan; Orchard, Sandra; Sarkans, Ugis; von Mering, Christian; Roechert, Bernd; Poux, Sylvain; Jung, Eva; Mersch, Henning; Kersey, Paul; Lappe, Michael; Li, Yixue; Zeng, Rong; Rana, Debashis; Nikolski, Macha; Husi, Holger; Brun, Christine; Shanker, K; Grant, Seth G N; Sander, Chris; Bork, Peer; Zhu, Weimin; Pandey, Akhilesh; Brazma, Alvis; Jacq, Bernard; Vidal, Marc; Sherman, David; Legrain, Pierre; Cesareni, Gianni; Xenarios, Ioannis; Eisenberg, David; Steipe, Boris; Hogue, Chris; Apweiler, Rolf
2004-02-01
A major goal of proteomics is the complete description of the protein interaction network underlying cell physiology. A large number of small scale and, more recently, large-scale experiments have contributed to expanding our understanding of the nature of the interaction network. However, the necessary data integration across experiments is currently hampered by the fragmentation of publicly available protein interaction data, which exists in different formats in databases, on authors' websites or sometimes only in print publications. Here, we propose a community standard data model for the representation and exchange of protein interaction data. This data model has been jointly developed by members of the Proteomics Standards Initiative (PSI), a work group of the Human Proteome Organization (HUPO), and is supported by major protein interaction data providers, in particular the Biomolecular Interaction Network Database (BIND), Cellzome (Heidelberg, Germany), the Database of Interacting Proteins (DIP), Dana Farber Cancer Institute (Boston, MA, USA), the Human Protein Reference Database (HPRD), Hybrigenics (Paris, France), the European Bioinformatics Institute's (EMBL-EBI, Hinxton, UK) IntAct, the Molecular Interactions (MINT, Rome, Italy) database, the Protein-Protein Interaction Database (PPID, Edinburgh, UK) and the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING, EMBL, Heidelberg, Germany).
Soranno, Patricia A; Bissell, Edward G; Cheruvelil, Kendra S; Christel, Samuel T; Collins, Sarah M; Fergus, C Emi; Filstrup, Christopher T; Lapierre, Jean-Francois; Lottig, Noah R; Oliver, Samantha K; Scott, Caren E; Smith, Nicole J; Stopyak, Scott; Yuan, Shuai; Bremigan, Mary Tate; Downing, John A; Gries, Corinna; Henry, Emily N; Skaff, Nick K; Stanley, Emily H; Stow, Craig A; Tan, Pang-Ning; Wagner, Tyler; Webster, Katherine E
2015-01-01
Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km(2)). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database. Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.
Scale out databases for CERN use cases
NASA Astrophysics Data System (ADS)
Baranowski, Zbigniew; Grzybek, Maciej; Canali, Luca; Lanza Garcia, Daniel; Surdy, Kacper
2015-12-01
Data generation rates are expected to grow very fast for some database workloads going into LHC run 2 and beyond. In particular this is expected for data coming from controls, logging and monitoring systems. Storing, administering and accessing big data sets in a relational database system can quickly become a very hard technical challenge, as the size of the active data set and the number of concurrent users increase. Scale-out database technologies are a rapidly developing set of solutions for deploying and managing very large data warehouses on commodity hardware and with open source software. In this paper we will describe the architecture and tests on database systems based on Hadoop and the Cloudera Impala engine. We will discuss the results of our tests, including tests of data loading and integration with existing data sources and in particular with relational databases. We will report on query performance tests done with various data sets of interest at CERN, notably data from the accelerator log database.
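A sketch of the kind of harness such query-performance tests rely on, written against the generic Python DB-API so the same code can drive a relational connector or an Impala driver (e.g. the impyla package); the connection call in the comment and the table name are assumptions, not details from the paper.

```python
import time

def best_query_time(conn, sql, n_runs=3):
    """Execute a query several times and return the best wall-clock time,
    which damps warm-up and caching effects between runs."""
    best = float("inf")
    for _ in range(n_runs):
        cur = conn.cursor()
        t0 = time.perf_counter()
        cur.execute(sql)
        cur.fetchall()            # force full result materialization
        cur.close()
        best = min(best, time.perf_counter() - t0)
    return best

# Usage sketch (driver, host, and table are assumptions):
#   from impala.dbapi import connect
#   conn = connect(host="impalad.example.cern.ch", port=21050)
#   print(best_query_time(conn, "SELECT COUNT(*) FROM accelerator_log"))
```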
Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency
Aniceto, Rodrigo; Xavier, Rene; Guimarães, Valeria; Hondo, Fernanda; Holanda, Maristela; Walter, Maria Emilia; Lifschitz, Sérgio
2015-01-01
Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them refers to management of massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. To find an alternative to the frequently considered relational database model becomes a compelling task. Other data models may be more effective when dealing with a very large amount of nonconventional data, especially for writing and retrieving operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB. PMID:26558254
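A minimal sketch of persisting sequencer reads in Cassandra, assuming the DataStax cassandra-driver package and a locally running cluster; the keyspace, table, and replication settings are illustrative, not the paper's actual schema.

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])       # assumes a local Cassandra node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS genomics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS genomics.reads (
        run_id text, read_id text, sequence text, quality text,
        PRIMARY KEY (run_id, read_id))
""")

# Write-heavy workloads like sequencer output map naturally onto
# Cassandra's partition-per-run layout.
insert = session.prepare(
    "INSERT INTO genomics.reads (run_id, read_id, sequence, quality) "
    "VALUES (?, ?, ?, ?)")
session.execute(insert, ("run001", "read0001", "ACGTACGT", "IIIIHHHH"))

row = session.execute(
    "SELECT sequence FROM genomics.reads WHERE run_id = 'run001'").one()
print(row.sequence)
```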
Naghdi, Mohammad Reza; Smail, Katia; Wang, Joy X; Wade, Fallou; Breaker, Ronald R; Perreault, Jonathan
2017-03-15
The discovery of noncoding RNAs (ncRNAs) and their importance for gene regulation led us to develop bioinformatics tools to pursue the discovery of novel ncRNAs. Finding ncRNAs de novo is challenging, first due to the difficulty of retrieving large numbers of sequences for given gene activities, and second due to exponential demands on calculation needed for comparative genomics on a large scale. Recently, several tools for the prediction of conserved RNA secondary structure were developed, but many of them are not designed to uncover new ncRNAs, or are too slow for conducting analyses on a large scale. Here we present various approaches using the database RiboGap as a primary tool for finding known ncRNAs and for uncovering simple sequence motifs with regulatory roles. This database also can be used to easily extract intergenic sequences of eubacteria and archaea to find conserved RNA structures upstream of given genes. We also show how to extend analysis further to choose the best candidate ncRNAs for experimental validation. Copyright © 2017 Elsevier Inc. All rights reserved.
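RiboGap is a relational (MySQL) resource, so pulling the intergenic sequences upstream of a given gene reduces to a SQL join. The sketch below uses the pymysql driver, but the host, credentials, and all table and column names are hypothetical, since the abstract does not reproduce the actual RiboGap schema.

```python
import pymysql  # pip install pymysql

# Host, credentials, and the table/column names below are hypothetical;
# consult the RiboGap documentation for the real schema.
conn = pymysql.connect(host="localhost", user="reader",
                       password="secret", database="ribogap")

sql = """
SELECT g.gene_name, i.sequence
FROM genes AS g
JOIN intergenic AS i ON i.downstream_gene_id = g.gene_id
WHERE g.gene_name = %s
"""

with conn.cursor() as cur:
    # e.g. genes commonly preceded by TPP riboswitches
    cur.execute(sql, ("thiC",))
    for gene_name, seq in cur.fetchall():
        print(f">{gene_name}\n{seq}")
conn.close()
```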
Iavindrasana, Jimison; Depeursinge, Adrien; Ruch, Patrick; Spahni, Stéphane; Geissbuhler, Antoine; Müller, Henning
2007-01-01
The diagnostic and therapeutic processes, as well as the development of new treatments, are hindered by the fragmentation of the information which underlies them. In a multi-institutional research study database, the clinical information system (CIS) contains the primary data input. A large part of the budget of large-scale clinical studies is often spent on data creation and maintenance. The objective of this work is to design a decentralized, scalable, reusable database architecture with lower maintenance costs for managing and integrating the distributed heterogeneous data required as the basis for a large-scale research project. Technical and legal aspects are taken into account based on various use case scenarios. The architecture contains four layers: data storage and access, decentralized at the production source; a connector acting as a proxy between the CIS and the external world; an information mediator serving as a data access point; and the client side. The proposed design will be implemented inside six clinical centers participating in the @neurIST project as part of a larger system on data integration and reuse for aneurysm treatment.
Multiple Object Retrieval in Image Databases Using Hierarchical Segmentation Tree
ERIC Educational Resources Information Center
Chen, Wei-Bang
2012-01-01
The purpose of this research is to develop a new visual information analysis, representation, and retrieval framework for automatic discovery of salient objects of user's interest in large-scale image databases. In particular, this dissertation describes a content-based image retrieval framework which supports multiple-object retrieval. The…
Visual Attention Modeling for Stereoscopic Video: A Benchmark and Computational Model.
Fang, Yuming; Zhang, Chi; Li, Jing; Lei, Jianjun; Perreira Da Silva, Matthieu; Le Callet, Patrick
2017-10-01
In this paper, we investigate the visual attention modeling for stereoscopic video from the following two aspects. First, we build one large-scale eye tracking database as the benchmark of visual attention modeling for stereoscopic video. The database includes 47 video sequences and their corresponding eye fixation data. Second, we propose a novel computational model of visual attention for stereoscopic video based on Gestalt theory. In the proposed model, we extract the low-level features, including luminance, color, texture, and depth, from discrete cosine transform coefficients, which are used to calculate feature contrast for the spatial saliency computation. The temporal saliency is calculated by the motion contrast from the planar and depth motion features in the stereoscopic video sequences. The final saliency is estimated by fusing the spatial and temporal saliency with uncertainty weighting, which is estimated by the laws of proximity, continuity, and common fate in Gestalt theory. Experimental results show that the proposed method outperforms the state-of-the-art stereoscopic video saliency detection models on our built large-scale eye tracking database and one other database (DML-ITRACK-3D).
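The fusion step described above can be illustrated with uncertainty weighting in its simplest form: each saliency map is weighted by the inverse of its estimated uncertainty, and the weights are normalized. A numpy sketch in which the paper's Gestalt-based uncertainty estimates are replaced by given scalars:

```python
import numpy as np

def fuse_saliency(spatial, temporal, u_spatial, u_temporal, eps=1e-8):
    """Fuse two saliency maps with uncertainty weighting: the map whose
    uncertainty is lower receives the larger weight."""
    w_s = 1.0 / (u_spatial + eps)
    w_t = 1.0 / (u_temporal + eps)
    return (w_s * spatial + w_t * temporal) / (w_s + w_t)

spatial = np.random.rand(120, 160)    # stand-in spatial saliency map
temporal = np.random.rand(120, 160)   # stand-in temporal saliency map
fused = fuse_saliency(spatial, temporal, u_spatial=0.2, u_temporal=0.5)
print(fused.shape, float(fused.min()), float(fused.max()))
```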
Temporal and Fine-Grained Pedestrian Action Recognition on Driving Recorder Database
Satoh, Yutaka; Aoki, Yoshimitsu; Oikawa, Shoko; Matsui, Yasuhiro
2018-01-01
The paper presents an emerging issue of fine-grained pedestrian action recognition that enables advanced pre-crash safety by estimating a pedestrian's intention in advance. The fine-grained pedestrian actions include visually slight differences (e.g., walking straight and crossing), which are difficult to distinguish from each other. It is believed that fine-grained action recognition enables pedestrian intention estimation for helpful advanced driver-assistance systems (ADAS). The following difficulties have been studied to achieve fine-grained and accurate pedestrian action recognition: (i) in order to analyze the fine-grained motion of a pedestrian appearing in a vehicle-mounted drive recorder, a method to describe subtle changes of motion characteristics occurring in a short time is necessary; (ii) even when the background moves greatly due to the driving of the vehicle, it is necessary to detect changes in the subtle motion of the pedestrian; (iii) the collection of large-scale fine-grained actions is very difficult, and therefore the focus must be on a relatively small database. We show how to learn an effective recognition model with only a small-scale database. Here, we have thoroughly evaluated several types of configurations to explore an effective approach to fine-grained pedestrian action recognition without a large-scale database. Moreover, two different datasets have been collected in order to raise the issue. Finally, our proposal attained 91.01% on the National Traffic Science and Environment Laboratory database (NTSEL) and 53.23% on the near-miss driving recorder database (NDRDB), improving on baseline two-stream fusion convnets by +8.28% and +6.53%. PMID:29461473
The statistical power to detect cross-scale interactions at macroscales
Wagner, Tyler; Fergus, C. Emi; Stow, Craig A.; Cheruvelil, Kendra S.; Soranno, Patricia A.
2016-01-01
Macroscale studies of ecological phenomena are increasingly common because stressors such as climate and land-use change operate at large spatial and temporal scales. Cross-scale interactions (CSIs), where ecological processes operating at one spatial or temporal scale interact with processes operating at another scale, have been documented in a variety of ecosystems and contribute to complex system dynamics. However, studies investigating CSIs are often dependent on compiling multiple data sets from different sources to create multithematic, multiscaled data sets, which results in structurally complex, and sometimes incomplete, data sets. The statistical power to detect CSIs needs to be evaluated because of their importance and the challenge of quantifying CSIs using data sets with complex structures and missing observations. We studied this problem using a spatially hierarchical model that measures CSIs between regional agriculture and its effects on the relationship between lake nutrients and lake productivity. We used an existing large multithematic, multiscaled database, the LAke multi-scaled GeOSpatial and temporal database (LAGOS), to parameterize the power analysis simulations. We found that the power to detect CSIs was more strongly related to the number of regions in the study than to the number of lakes nested within each region. CSI power analyses will not only help ecologists design large-scale studies aimed at detecting CSIs, but will also focus attention on CSI effect sizes and the degree to which they are ecologically relevant and detectable with large data sets.
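The headline result, that power depends mainly on the number of regions, can be reproduced in miniature by simulation: generate lakes nested within regions with a known cross-scale interaction, refit the model many times, and count how often the interaction is detected. A deliberately simplified sketch using ordinary least squares (the authors' model is spatially hierarchical, which this ignores):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def csi_power(n_regions=30, lakes_per_region=20, beta_csi=0.3,
              n_sims=200, seed=0):
    """Fraction of simulations in which the region-by-lake interaction
    (the cross-scale interaction) is significant at alpha = 0.05."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        # Regional predictor (e.g. agriculture), repeated for each lake.
        ag = np.repeat(rng.normal(size=n_regions), lakes_per_region)
        nut = rng.normal(size=ag.size)          # lake-level nutrients
        y = 1.0 * nut + beta_csi * ag * nut + rng.normal(size=ag.size)
        df = pd.DataFrame({"y": y, "nut": nut, "ag": ag})
        fit = smf.ols("y ~ nut * ag", data=df).fit()
        hits += fit.pvalues["nut:ag"] < 0.05
    return hits / n_sims

print(csi_power(n_regions=10))   # fewer regions -> lower power
print(csi_power(n_regions=40))   # more regions -> higher power
```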
The Sequenced Angiosperm Genomes and Genome Databases.
Chen, Fei; Dong, Wei; Zhang, Jiawei; Guo, Xinyue; Chen, Junhao; Wang, Zhengjia; Lin, Zhenguo; Tang, Haibao; Zhang, Liangsheng
2018-01-01
Angiosperms, the flowering plants, provide the essential resources for human life, such as food, energy, oxygen, and materials. They have also promoted the evolution of humans, animals, and the planet Earth. Despite the numerous advances in genome reports and sequencing technologies, no review covers all the released angiosperm genomes and the genome databases for data sharing. Based on the rapid advances and innovations in database construction in the last few years, here we provide a comprehensive review of three major types of angiosperm genome databases: databases for a single species, for a specific angiosperm clade, and for multiple angiosperm species. The scope, tools, and data of each type of database and their features are concisely discussed. Genome databases for a single species or a clade of species are especially popular with specific groups of researchers, while a regularly updated comprehensive database is more powerful for addressing major scientific mysteries at the genome scale. Considering the low coverage of flowering plants in any available database, we propose construction of a comprehensive database to facilitate large-scale comparative studies of angiosperm genomes and to promote collaborative study of important questions in plant biology.
EPA'S LANDSCAPE SCIENCES RESEARCH: NUTRIENT POLLUTION, FLOODING, AND HABITAT
There is a growing need to understand the pattern of landscape change at regional scales and to determine how such changes affect environmental values. Key to conducting these assessments is the development of land-cover databases that permit large-scale analyses, such as an exam...
Hirano, Yoko; Asami, Yuko; Kuribayashi, Kazuhiko; Kitazaki, Shigeru; Yamamoto, Yuji; Fujimoto, Yoko
2018-05-01
Many pharmacoepidemiologic studies using large-scale databases have recently been conducted to evaluate the safety and effectiveness of drugs in Western countries. In Japan, however, conventional methodology has been applied to postmarketing surveillance (PMS) to collect safety and effectiveness information on new drugs to meet regulatory requirements. Conventional PMS entails enormous costs and resources despite being an uncontrolled observational study method. This study is aimed at examining the possibility of database research as a more efficient pharmacovigilance approach by comparing a health care claims database and PMS with regard to the characteristics and safety profiles of sertraline-prescribed patients. The characteristics of sertraline-prescribed patients recorded in a large-scale Japanese health insurance claims database developed by MinaCare Co. Ltd. were scanned and compared with the PMS results. We also explored the possibility of detecting signals indicative of adverse reactions based on the claims database by using sequence symmetry analysis. Diabetes mellitus, hyperlipidemia, and hyperthyroidism served as exploratory events, and their detection criteria for the claims database were reported by the Pharmaceuticals and Medical Devices Agency in Japan. Most of the characteristics of sertraline-prescribed patients in the claims database did not differ markedly from those in the PMS. There was no tendency for higher risks of the exploratory events after exposure to sertraline, and this was consistent with sertraline's known safety profile. Our results support the concept of using database research as a cost-effective pharmacovigilance tool that is free of selection bias. Further investigation using database research is required to confirm our preliminary observations. Copyright © 2018. Published by Elsevier Inc.
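Sequence symmetry analysis, the signal-detection method mentioned above, compares how often a marker drug is first dispensed after versus before the index drug within a window; a crude sequence ratio well above 1 suggests a signal. A minimal sketch, assuming simple patient-to-first-dispensing-date records (the real method also corrects for prescribing trends, which is omitted here):

```python
from datetime import date

def crude_sequence_ratio(first_index_rx, first_marker_rx, max_days=365):
    """first_index_rx / first_marker_rx: dicts mapping patient id -> date of
    first dispensing. Returns (#marker-after-index) / (#marker-before-index)
    within the window: the crude sequence ratio."""
    after = before = 0
    for pid, d_index in first_index_rx.items():
        d_marker = first_marker_rx.get(pid)
        if d_marker is None or d_marker == d_index:
            continue
        if abs((d_marker - d_index).days) > max_days:
            continue
        if d_marker > d_index:
            after += 1
        else:
            before += 1
    return after / before if before else float("inf")

# Toy records (illustrative): sertraline as index drug, an antidiabetic
# as the marker drug for a diabetes-mellitus event.
sertraline = {1: date(2016, 1, 10), 2: date(2016, 3, 5), 3: date(2016, 6, 1)}
antidiabetic = {1: date(2016, 4, 2), 2: date(2016, 1, 20), 3: date(2016, 9, 9)}
print(crude_sequence_ratio(sertraline, antidiabetic))  # 2 after / 1 before = 2.0
```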
Consistency Analysis of Genome-Scale Models of Bacterial Metabolism: A Metamodel Approach
Ponce-de-Leon, Miguel; Calle-Espinosa, Jorge; Peretó, Juli; Montero, Francisco
2015-01-01
Genome-scale metabolic models usually contain inconsistencies that manifest as blocked reactions and gap metabolites. With the purpose to detect recurrent inconsistencies in metabolic models, a large-scale analysis was performed using a previously published dataset of 130 genome-scale models. The results showed that a large number of reactions (~22%) are blocked in all the models where they are present. To unravel the nature of such inconsistencies a metamodel was constructed by joining the 130 models in a single network. This metamodel was manually curated using the unconnected modules approach, and then it was used as a reference network to perform a gap-filling on each individual genome-scale model. Finally, a set of 36 models that had not been considered during the construction of the metamodel was used, as a proof of concept, to extend the metamodel with new biochemical information, and to assess its impact on gap-filling results. The analysis performed on the metamodel allowed us to conclude: 1) the recurrent inconsistencies found in the models were already present in the metabolic database used during the reconstruction process; 2) the presence of inconsistencies in a metabolic database can be propagated to the reconstructed models; 3) there are reactions not manifested as blocked which are active as a consequence of some classes of artifacts, and; 4) the results of an automatic gap-filling are highly dependent on the consistency and completeness of the metamodel or metabolic database used as the reference network. In conclusion, the consistency analysis should be applied to metabolic databases in order to detect and fill gaps as well as to detect and remove artifacts and redundant information. PMID:26629901
Future of applied watershed science at regional scales
Lee Benda; Daniel Miller; Steve Lanigan; Gordon Reeves
2009-01-01
Resource managers must deal increasingly with land use and conservation plans applied at large spatial scales (watersheds, landscapes, states, regions) involving multiple interacting federal agencies and stakeholders. Access to a geographically focused and application-oriented database would allow users in different locations and with different concerns to quickly...
Distributed database kriging for adaptive sampling (D²KAS)
Roehm, Dominic; Pavel, Robert S.; Barros, Kipton; ...
2015-03-18
We present an adaptive sampling method supplemented by a distributed database and a prediction method for multiscale simulations using the Heterogeneous Multiscale Method. A finite-volume scheme integrates the macro-scale conservation laws for elastodynamics, which are closed by momentum and energy fluxes evaluated at the micro-scale. In the original approach, molecular dynamics (MD) simulations are launched for every macro-scale volume element. Our adaptive sampling scheme replaces a large fraction of costly micro-scale MD simulations with fast table lookup and prediction. The cloud database Redis provides the plain table lookup, and with locality-aware hashing we gather input data for our prediction scheme. For the latter we use kriging, which estimates an unknown value and its uncertainty (error) at a specific location in parameter space by using weighted averages of the neighboring points. We find that our adaptive scheme significantly improves simulation performance by a factor of 2.5 to 25, while retaining high accuracy for various choices of the algorithm parameters.
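A sketch of the lookup-then-predict pattern described above, assuming the redis-py client and a reachable Redis server: check the database for a previously computed flux, otherwise predict from nearby stored points, and only fall back to a fresh micro-scale MD run when no neighbors are close enough. True kriging also returns an uncertainty estimate; the distance-weighted average below is a simplified stand-in.

```python
import json
import math
import redis  # pip install redis; assumes a reachable Redis server

r = redis.Redis(host="localhost", port=6379)

def key_for(state, digits=3):
    """Discretize the macro-scale input state into a lookup key."""
    return "flux:" + ",".join(f"{x:.{digits}f}" for x in state)

def lookup_or_predict(state, neighbors, run_md, max_dist=0.05):
    cached = r.get(key_for(state))
    if cached is not None:                      # plain table lookup
        return json.loads(cached)
    close = [(math.dist(state, s), f) for s, f in neighbors
             if math.dist(state, s) <= max_dist]
    if close:                                   # simplified kriging stand-in:
        weights = [1.0 / (d + 1e-12) for d, _ in close]
        flux = sum(w * f for w, (_, f) in zip(weights, close)) / sum(weights)
    else:
        flux = run_md(state)                    # costly micro-scale MD fallback
    r.set(key_for(state), json.dumps(flux))
    return flux

# Toy usage: neighbors are (state, flux) pairs; run_md is a stand-in for MD.
neighbors = [((0.10, 0.20), 1.5), ((0.12, 0.21), 1.7)]
print(lookup_or_predict((0.11, 0.20), neighbors, run_md=lambda s: 0.0))
```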
QSAR Modeling Using Large-Scale Databases: Case Study for HIV-1 Reverse Transcriptase Inhibitors.
Tarasova, Olga A; Urusova, Aleksandra F; Filimonov, Dmitry A; Nicklaus, Marc C; Zakharov, Alexey V; Poroikov, Vladimir V
2015-07-27
Large-scale databases are important sources of training sets for various QSAR modeling approaches. Generally, these databases contain information extracted from different sources. This variety of sources can produce inconsistency in the data, defined as sometimes widely diverging activity results for the same compound against the same target. Because such inconsistency can reduce the accuracy of predictive models built from these data, we are addressing the question of how best to use data from publicly and commercially accessible databases to create accurate and predictive QSAR models. We investigate the suitability of commercially and publicly available databases to QSAR modeling of antiviral activity (HIV-1 reverse transcriptase (RT) inhibition). We present several methods for the creation of modeling (i.e., training and test) sets from two, either commercially or freely available, databases: Thomson Reuters Integrity and ChEMBL. We found that the typical predictivities of QSAR models obtained using these different modeling set compilation methods differ significantly from each other. The best results were obtained using training sets compiled for compounds tested using only one method and material (i.e., a specific type of biological assay). Compound sets aggregated by target only typically yielded poorly predictive models. We discuss the possibility of "mix-and-matching" assay data across aggregating databases such as ChEMBL and Integrity and their current severe limitations for this purpose. One of them is the general lack of complete and semantic/computer-parsable descriptions of assay methodology carried by these databases that would allow one to determine mix-and-matchability of result sets at the assay level.
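The best-performing compilation strategy described above, restricting a training set to records measured with a single assay method, can be sketched in pandas; the column names mirror ChEMBL-style activity exports but are assumptions, not the paper's exact field names.

```python
import pandas as pd

# Toy ChEMBL-style activity records; column names are illustrative.
records = pd.DataFrame({
    "smiles":     ["CCO", "CCO", "c1ccccc1", "CCN", "CCN"],
    "target":     ["HIV-1 RT"] * 5,
    "assay_type": ["enzymatic", "cell-based", "enzymatic",
                   "enzymatic", "enzymatic"],
    "pIC50":      [5.1, 6.9, 4.2, 5.8, 5.9],
})

# Aggregating by target only mixes incompatible assays: the same compound
# can carry widely diverging values (CCO: 5.1 vs 6.9).
by_target = records.groupby("smiles")["pIC50"].agg(["mean", "std"])
print(by_target)

# Restricting to one assay method yields a more consistent modeling set.
one_assay = records[records["assay_type"] == "enzymatic"]
training = one_assay.groupby("smiles", as_index=False)["pIC50"].mean()
print(training)
```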
GLAD: a system for developing and deploying large-scale bioinformatics grid.
Teo, Yong-Meng; Wang, Xianbing; Ng, Yew-Kwong
2005-03-01
Grid computing is used to solve large-scale bioinformatics problems with gigabyte-scale databases by distributing the computation across multiple platforms. Until now, in developing bioinformatics grid applications, it has been extremely tedious to design and implement the component algorithms and parallelization techniques for different classes of problems, and to access remotely located sequence database files of varying formats across the grid. In this study, we propose a grid programming toolkit, GLAD (Grid Life sciences Applications Developer), which facilitates the development and deployment of bioinformatics applications on a grid. GLAD has been developed using ALiCE (Adaptive scaLable Internet-based Computing Engine), a Java-based grid middleware, which exploits task-based parallelism. Two benchmark bioinformatics applications, distributed sequence comparison and distributed progressive multiple sequence alignment, have been developed using GLAD.
What do data used to develop ground-motion prediction equations tell us about motions near faults?
Boore, David M.
2014-01-01
A large database of ground motions from shallow earthquakes occurring in active tectonic regions around the world, recently developed in the Pacific Earthquake Engineering Center’s NGA-West2 project, has been used to investigate what such a database can say about the properties and processes of crustal fault zones. There are a relatively small number of near-rupture records, implying that few recordings in the database are within crustal fault zones, but the records that do exist emphasize the complexity of ground-motion amplitudes and polarization close to individual faults. On average over the whole data set, however, the scaling of ground motions with magnitude at a fixed distance, and the distance dependence of the ground motions, seem to be largely consistent with simple seismological models of source scaling, path propagation effects, and local site amplification. The data show that ground motions close to large faults, as measured by elastic response spectra, tend to saturate and become essentially constant for short periods. This saturation seems to be primarily a geometrical effect, due to the increasing size of the rupture surface with magnitude, and not due to a breakdown in self similarity.
Saunders, Rebecca E; Instrell, Rachael; Rispoli, Rossella; Jiang, Ming; Howell, Michael
2013-01-01
High-throughput screening (HTS) uses technologies such as RNA interference to generate loss-of-function phenotypes on a genomic scale. As these technologies become more popular, many research institutes have established core facilities of expertise to deal with the challenges of large-scale HTS experiments. As the efforts of core facility screening projects come to fruition, focus has shifted towards managing the results of these experiments and making them available in a useful format that can be further mined for phenotypic discovery. The HTS-DB database provides a public view of data from screening projects undertaken by the HTS core facility at the CRUK London Research Institute. All projects and screens are described with comprehensive assay protocols, and datasets are provided with complete descriptions of analysis techniques. This format allows users to browse and search data from large-scale studies in an informative and intuitive way. It also provides a repository for additional measurements obtained from screens that were not the focus of the project, such as cell viability, and groups these data so that it can provide a gene-centric summary across several different cell lines and conditions. All datasets from our screens that can be made available can be viewed interactively and mined for further hit lists. We believe that in this format, the database provides researchers with rapid access to results of large-scale experiments that might facilitate their understanding of genes/compounds identified in their own research. DATABASE URL: http://hts.cancerresearchuk.org/db/public.
ERIC Educational Resources Information Center
Lloyd-Strovas, Jenny D.; Arsuffi, Thomas L.
2016-01-01
We examined the diversity of environmental education (EE) in Texas, USA, by developing a framework to assess EE organizations and programs at a large scale: the Environmental Education Database of Organizations and Programs (EEDOP). This framework consisted of the following characteristics: organization/visitor demographics, pedagogy/curriculum,…
Chen, Mingyang; Stott, Amanda C; Li, Shenggang; Dixon, David A
2012-04-01
A robust metadata database called the Collaborative Chemistry Database Tool (CCDBT) for massive amounts of computational chemistry raw data has been designed and implemented. It performs data synchronization and simultaneously extracts the metadata. Computational chemistry data in various formats from different computing sources, software packages, and users can be parsed into uniform metadata for storage in a MySQL database. Parsing is performed by a parsing pyramid, including parsers written for different levels of data types and sets created by the parser loader after loading parser engines and configurations. Copyright © 2011 Elsevier Inc. All rights reserved.
[Status of libraries and databases for natural products abroad].
Zhao, Li-Mei; Tan, Ning-Hua
2015-01-01
Because natural products are one of the important sources for drug discovery, libraries and databases of natural products are significant for the research and development of natural products. At present, most compound libraries abroad consist of synthetic or combinatorially synthesized molecules, making access to natural products difficult; and because information on natural products is scattered across sources with different standards, it is difficult to construct convenient, comprehensive and large-scale databases for natural products. This paper reviews the status of currently accessible libraries and databases for natural products abroad and provides some important information for the development of libraries and databases for natural products.
MouseNet database: digital management of a large-scale mutagenesis project.
Pargent, W; Heffner, S; Schäble, K F; Soewarto, D; Fuchs, H; Hrabé de Angelis, M
2000-07-01
The Munich ENU Mouse Mutagenesis Screen is a large-scale mutant production, phenotyping, and mapping project. It encompasses two animal breeding facilities and a number of screening groups located in the general area of Munich. A central database is required to manage and process the immense amount of data generated by the mutagenesis project. This database, which we named MouseNet(c), runs on a Sybase platform and will finally store and process all data from the entire project. In addition, the system comprises a portfolio of functions needed to support the workflow management of the core facility and the screening groups. MouseNet(c) will make all of the data available to the participating screening groups, and later to the international scientific community. MouseNet(c) will consist of three major software components:
* Animal Management System (AMS)
* Sample Tracking System (STS)
* Result Documentation System (RDS)
MouseNet(c) provides the following major advantages:
* being accessible from different client platforms via the Internet
* being a full-featured multi-user system (including access restriction and data locking mechanisms)
* relying on a professional RDBMS (relational database management system) which runs on a UNIX server platform
* supplying workflow functions and a variety of plausibility checks.
Hierarchical Data Distribution Scheme for Peer-to-Peer Networks
NASA Astrophysics Data System (ADS)
Bhushan, Shashi; Dave, M.; Patel, R. B.
2010-11-01
In the past few years, peer-to-peer (P2P) networks have become an extremely popular mechanism for large-scale content sharing. P2P systems have focused on specific application domains (e.g. music files, video files) or on providing file system like capabilities. P2P is a powerful paradigm, which provides a large-scale and cost-effective mechanism for data sharing. A P2P system may be used for storing data globally, but can a conventional database be implemented on a P2P system? Successful implementation of conventional databases on P2P systems is yet to be reported. In this paper we present a mathematical model for the replication of partitions and a hierarchy-based data distribution scheme for P2P networks. We also analyze the resource utilization and throughput of the P2P system with respect to availability when a conventional database is implemented over the P2P system with a variable query rate. Simulation results show that database partitions placed on peers with a higher availability factor perform better. Degradation index, throughput, and resource utilization are the parameters evaluated with respect to the availability factor.
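The role of the availability factor in partition placement follows from simple replication arithmetic: assuming peers fail independently, a partition replicated on peers with availabilities a_1..a_k is unreachable only when every replica is down. A small illustrative calculation, not the paper's exact model:

```python
from math import prod

def partition_availability(peer_availabilities):
    """Probability that at least one replica of a partition is reachable,
    assuming independent peer failures: 1 - prod(1 - a_i)."""
    return 1.0 - prod(1.0 - a for a in peer_availabilities)

# Placing replicas on higher-availability peers pays off quickly:
print(partition_availability([0.6, 0.6]))        # 0.84
print(partition_availability([0.9, 0.9]))        # 0.99
print(partition_availability([0.9, 0.9, 0.9]))   # 0.999
```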
Icing Simulation Research Supporting the Ice-Accretion Testing of Large-Scale Swept-Wing Models
NASA Technical Reports Server (NTRS)
Yadlin, Yoram; Monnig, Jaime T.; Malone, Adam M.; Paul, Bernard P.
2018-01-01
The work summarized in this report is a continuation of NASA's Large-Scale, Swept-Wing Test Articles Fabrication; Research and Test Support for NASA IRT contract (NNC10BA05 -NNC14TA36T) performed by Boeing under the NASA Research and Technology for Aerospace Propulsion Systems (RTAPS) contract. In the study conducted under RTAPS, a series of icing tests in the Icing Research Tunnel (IRT) have been conducted to characterize ice formations on large-scale swept wings representative of modern commercial transport airplanes. The outcome of that campaign was a large database of ice-accretion geometries that can be used for subsequent aerodynamic evaluation in other experimental facilities and for validation of ice-accretion prediction codes.
Computerization of Library and Information Services in Mainland China.
ERIC Educational Resources Information Center
Lin, Sharon Chien
1994-01-01
Describes two phases of the automation of library and information services in mainland China. From 1974-86, much effort was concentrated on developing computer systems, databases, online retrieval, and networking. From 1986 to the present, practical progress became possible largely because of CD-ROM technology; and large scale networking for…
Neural Network Modeling of UH-60A Pilot Vibration
NASA Technical Reports Server (NTRS)
Kottapalli, Sesi
2003-01-01
Full-scale flight-test pilot floor vibration is modeled using neural networks and full-scale wind tunnel test data for low speed level flight conditions. Neural network connections between the wind tunnel test data and the three flight test pilot vibration components (vertical, lateral, and longitudinal) are studied. Two full-scale UH-60A Black Hawk databases are used. The first database is the NASA/Army UH-60A Airloads Program flight test database. The second database is the UH-60A rotor-only wind tunnel database that was acquired in the NASA Ames 80- by 120-Foot Wind Tunnel with the Large Rotor Test Apparatus (LRTA). Using neural networks, the flight-test pilot vibration is modeled using the wind tunnel rotating system hub accelerations, and separately, using the hub loads. The results show that the wind tunnel rotating system hub accelerations and the operating parameters can represent the flight test pilot vibration. The six components of the wind tunnel N/rev balance-system hub loads and the operating parameters can also represent the flight test pilot vibration. The present neural network connections can significantly increase the value of wind tunnel testing.
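A sketch of the kind of mapping described above, using a small feed-forward network from scikit-learn to connect hub accelerations plus operating parameters to one pilot vibration component; the data are random stand-ins, since the UH-60A databases are not reproduced here.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)

# Stand-in inputs: three hub-acceleration channels plus two operating
# parameters (e.g. advance ratio, thrust coefficient); synthetic target
# standing in for the vertical pilot vibration component.
X = rng.normal(size=(200, 5))
y = 0.8 * X[:, 0] - 0.3 * X[:, 3] + 0.1 * rng.normal(size=200)

model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
model.fit(X[:150], y[:150])
print("holdout R^2:", round(model.score(X[150:], y[150:]), 3))
```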
Architectural Implications for Spatial Object Association Algorithms*
Kumar, Vijay S.; Kurc, Tahsin; Saltz, Joel; Abdulla, Ghaleb; Kohn, Scott R.; Matarazzo, Celeste
2013-01-01
Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server®, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST). PMID:25692244
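The crossmatch operation itself is easy to state: for each object in one catalog, find the objects in the other catalog within a small angular radius. A minimal in-memory sketch with a k-d tree from scipy, using a flat small-angle approximation rather than proper spherical geometry, so it is only adequate for tiny radii away from the poles:

```python
import numpy as np
from scipy.spatial import cKDTree

def crossmatch(ra1, dec1, ra2, dec2, radius_arcsec=1.0):
    """Return index pairs (i, j) where catalog-1 object i and catalog-2
    object j lie within the match radius (flat-sky approximation)."""
    # Scale RA by cos(dec) so angular offsets are roughly isotropic.
    p1 = np.column_stack([ra1 * np.cos(np.radians(dec1)), dec1])
    p2 = np.column_stack([ra2 * np.cos(np.radians(dec2)), dec2])
    tree = cKDTree(p2)
    r_deg = radius_arcsec / 3600.0
    return [(i, j)
            for i, js in enumerate(tree.query_ball_point(p1, r_deg))
            for j in js]

ra1, dec1 = np.array([10.0, 20.0]), np.array([-5.0, 30.0])
ra2, dec2 = np.array([10.0001, 50.0]), np.array([-5.0001, 10.0])
print(crossmatch(ra1, dec1, ra2, dec2))   # [(0, 0)]
```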
Large scale database scrubbing using object oriented software components.
Herting, R L; Barnes, M R
1998-01-01
Now that case managers, quality improvement teams, and researchers use medical databases extensively, the ability to share and disseminate such databases while maintaining patient confidentiality is paramount. A process called scrubbing addresses this problem by removing personally identifying information while keeping the integrity of the medical information intact. Scrubbing entire databases, containing multiple tables, requires that the implicit relationships between data elements in different tables of the database be maintained. To address this issue we developed DBScrub, a Java program that interfaces with any JDBC compliant database and scrubs the database while maintaining the implicit relationships within it. DBScrub uses a small number of highly configurable object-oriented software components to carry out the scrubbing. We describe the structure of these software components and how they maintain the implicit relationships within the database.
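The heart of scrubbing a multi-table database while preserving implicit relationships is a consistent pseudonym map: every occurrence of an identifier, in whatever table, must map to the same surrogate so that joins between scrubbed tables still line up. A minimal sketch with sqlite3 (DBScrub itself is Java/JDBC and far more configurable):

```python
import sqlite3
from itertools import count

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (mrn TEXT, name TEXT);
CREATE TABLE visits   (mrn TEXT, note TEXT);
INSERT INTO patients VALUES ('MRN-001', 'Alice'), ('MRN-002', 'Bob');
INSERT INTO visits   VALUES ('MRN-001', 'follow-up'), ('MRN-001', 'labs');
""")

surrogate = {}
counter = count(1)

def scrub_id(real_id):
    """Map a real identifier to a stable surrogate, reused across tables
    so that the implicit patient-visit relationship is maintained."""
    if real_id not in surrogate:
        surrogate[real_id] = f"P{next(counter):06d}"
    return surrogate[real_id]

for table, id_col in [("patients", "mrn"), ("visits", "mrn")]:
    rows = conn.execute(f"SELECT rowid, {id_col} FROM {table}").fetchall()
    for rowid, real_id in rows:
        conn.execute(f"UPDATE {table} SET {id_col} = ? WHERE rowid = ?",
                     (scrub_id(real_id), rowid))
conn.execute("UPDATE patients SET name = '[REDACTED]'")  # drop direct identifiers

print(conn.execute("SELECT * FROM patients").fetchall())
print(conn.execute("SELECT * FROM visits").fetchall())
```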
Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization
Wei, Chih-Hsuan; Hakala, Kai; Pyysalo, Sampo; Ananiadou, Sophia; Kao, Hung-Yu; Lu, Zhiyong; Salakoski, Tapio; Van de Peer, Yves; Ginter, Filip
2013-01-01
Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license. PMID:23613707
Classification of time series patterns from complex dynamic systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schryver, J.C.; Rao, N.
1998-07-01
An increasing availability of high-performance computing and data storage media at decreasing cost is making possible the proliferation of large-scale numerical databases and data warehouses. Numeric warehousing enterprises on the order of hundreds of gigabytes to terabytes are a reality in many fields such as finance, retail sales, process systems monitoring, biomedical monitoring, surveillance and transportation. Large-scale databases are becoming more accessible to larger user communities through the internet, web-based applications and database connectivity. Consequently, most researchers now have access to a variety of massive datasets. This trend will probably only continue to grow over the next several years. Unfortunately, the availability of integrated tools to explore, analyze and understand the data warehoused in these archives is lagging far behind the ability to gain access to the same data. In particular, locating and identifying patterns of interest in numerical time series data is an increasingly important problem for which there are few available techniques. Temporal pattern recognition poses many interesting problems in classification, segmentation, prediction, diagnosis and anomaly detection. This research focuses on the problem of classification or characterization of numerical time series data. Highway vehicles and their drivers are examples of complex dynamic systems (CDS) which are being used by transportation agencies for field testing to generate large-scale time series datasets. Tools for effective analysis of numerical time series in databases generated by highway vehicle systems are not yet available, or have not been adapted to the target problem domain. However, analysis tools from similar domains may be adapted to the problem of classification of numerical time series data.
Efficient hemodynamic event detection utilizing relational databases and wavelet analysis
NASA Technical Reports Server (NTRS)
Saeed, M.; Mark, R. G.
2001-01-01
Development of a temporal query framework for time-oriented medical databases has hitherto been a challenging problem. We describe a novel method for the detection of hemodynamic events in multiparameter trends utilizing wavelet coefficients in a MySQL relational database. Storage of the wavelet coefficients allowed for a compact representation of the trends, and provided robust descriptors for the dynamics of the parameter time series. A data model was developed to allow for simplified queries along several dimensions and time scales. Of particular importance, the data model and wavelet framework allowed for queries to be processed with minimal table-join operations. A web-based search engine was developed to allow for user-defined queries. Typical queries required between 0.01 and 0.02 seconds, with at least two orders of magnitude improvement in speed over conventional queries. This powerful and innovative structure will facilitate research on large-scale time-oriented medical databases.
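A sketch of the storage idea, using the PyWavelets package in place of the authors' MySQL-side implementation: decompose each parameter trend, keep only the coarse coefficients as compact descriptors in a relational table, and run event queries against those coefficients rather than the raw samples. The schema and threshold are illustrative.

```python
import sqlite3
import numpy as np
import pywt  # PyWavelets: pip install PyWavelets

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trend_coeffs (
    patient_id INTEGER, param TEXT, level INTEGER, idx INTEGER, coeff REAL)""")

def store_trend(patient_id, param, samples, wavelet="db2", level=3):
    """Store only the coarse approximation coefficients: a compact,
    dynamics-preserving descriptor of the parameter trend."""
    approx = pywt.wavedec(samples, wavelet, level=level)[0]
    conn.executemany(
        "INSERT INTO trend_coeffs VALUES (?, ?, ?, ?, ?)",
        [(patient_id, param, level, i, float(c)) for i, c in enumerate(approx)])

t = np.linspace(0, 8, 256)
abp = 80 + 10 * np.tanh(t - 4) + np.random.default_rng(0).normal(0, 1, t.size)
store_trend(1, "ABP", abp)

# Query: large steps between neighboring coarse coefficients flag candidate
# hemodynamic events without touching the raw series.
rows = conn.execute("""SELECT idx, coeff FROM trend_coeffs
                       WHERE patient_id = 1 AND param = 'ABP'
                       ORDER BY idx""").fetchall()
jumps = [(i, b - a) for (i, a), (_, b) in zip(rows, rows[1:]) if abs(b - a) > 15]
print(jumps)
```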
A blue carbon soil database: Tidal wetland stocks for the US National Greenhouse Gas Inventory
NASA Astrophysics Data System (ADS)
Feagin, R. A.; Eriksson, M.; Hinson, A.; Najjar, R. G.; Kroeger, K. D.; Herrmann, M.; Holmquist, J. R.; Windham-Myers, L.; MacDonald, G. M.; Brown, L. N.; Bianchi, T. S.
2015-12-01
Coastal wetlands contain large reservoirs of carbon, and in 2015 the US National Greenhouse Gas Inventory began the work of placing blue carbon within the national regulatory context. The potential value of a wetland carbon stock, in relation to its location, soon could be influential in determining governmental policy and management activities, or in stimulating market-based CO2 sequestration projects. To meet the national need for high-resolution maps, a blue carbon stock database was developed linking National Wetlands Inventory datasets with the USDA Soil Survey Geographic Database. Users of the database can identify the economic potential for carbon conservation or restoration projects within specific estuarine basins, states, wetland types, physical parameters, and land management activities. The database is geared towards both national-level assessments and local-level inquiries. Spatial analysis of the stocks show high variance within individual estuarine basins, largely dependent on geomorphic position on the landscape, though there are continental scale trends to the carbon distribution as well. Future plans including linking this database with a sedimentary accretion database to predict carbon flux in US tidal wetlands.
Ahmad, Riaz; Naz, Saeeda; Afzal, Muhammad Zeshan; Amin, Sayed Hassan; Breuel, Thomas
2015-01-01
The presence of a large number of unique shapes called ligatures in cursive languages, along with variations due to scaling, orientation and location, provides one of the most challenging pattern recognition problems. Recognition of the large number of ligatures is often a complicated task in oriental languages such as Pashto, Urdu, Persian and Arabic. Research on cursive script recognition often ignores the fact that scaling, orientation, location and font variations are common in printed cursive text. Therefore, these variations are not included in image databases and in experimental evaluations. This research uncovers challenges faced by Arabic cursive script recognition in a holistic framework by considering Pashto as a test case, because the Pashto language has a larger alphabet set than Arabic, Persian and Urdu. A database containing 8000 images of 1000 unique ligatures with scaling, orientation and location variations is introduced. In this article, a feature space based on the scale invariant feature transform (SIFT), along with a segmentation framework, is proposed for overcoming the above mentioned challenges. The experimental results show a significantly improved performance of the proposed scheme over traditional feature extraction techniques such as principal component analysis (PCA). PMID:26368566
Interactive Exploration for Continuously Expanding Neuron Databases.
Li, Zhongyu; Metaxas, Dimitris N; Lu, Aidong; Zhang, Shaoting
2017-02-15
This paper proposes a novel framework to help biologists explore and analyze neurons based on retrieval of data from neuron morphological databases. In recent years, the continuously expanding neuron databases provide a rich source of information to associate neuronal morphologies with their functional properties. We design a coarse-to-fine framework for efficient and effective data retrieval from large-scale neuron databases. At the coarse level, for efficiency at large scale, we employ a binary coding method to compress morphological features into binary codes of tens of bits. Short binary codes allow for real-time similarity searching in Hamming space. Because the neuron databases are continuously expanding, it is inefficient to re-train the binary coding model from scratch when adding new neurons. To solve this problem, we extend binary coding with online updating schemes, which consider only the newly added neurons and update the model on-the-fly, without accessing the whole neuron database. At the fine-grained level, we introduce domain experts/users into the framework, who can give relevance feedback on the binary coding based retrieval results. This interactive strategy can improve the retrieval performance by re-ranking the above coarse results, where we design a new similarity measure and take the feedback into account. Our framework is validated on more than 17,000 neuron cells, showing promising retrieval accuracy and efficiency. Moreover, we demonstrate its use case in assisting biologists to identify and explore unknown neurons. Copyright © 2017 Elsevier Inc. All rights reserved.
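The real-time search in Hamming space mentioned above reduces to XOR plus popcount over packed binary codes. A numpy sketch, with random hyperplane projections standing in for the trained binary coding model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_features, n_neurons = 32, 16, 10_000

# Stand-in coding model: random hyperplanes (a trained model would be used).
planes = rng.normal(size=(n_features, n_bits))

def encode(features):
    """Compress morphological feature vectors into packed binary codes."""
    bits = (features @ planes > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)          # tens of bits -> a few bytes

database = encode(rng.normal(size=(n_neurons, n_features)))

def search(query_features, k=5):
    """Hamming-distance ranking: XOR, then popcount, then sort."""
    q = encode(query_features[None, :])
    dist = np.unpackbits(np.bitwise_xor(database, q), axis=1).sum(axis=1)
    return np.argsort(dist)[:k], np.sort(dist)[:k]

ids, dists = search(rng.normal(size=n_features))
print(ids, dists)
```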
Spatial distribution of GRBs and large scale structure of the Universe
NASA Astrophysics Data System (ADS)
Bagoly, Zsolt; Rácz, István I.; Balázs, Lajos G.; Tóth, L. Viktor; Horváth, István
We studied the spatial distribution of starburst galaxies from the Millennium XXL database at z = 0.82. We examined the starburst distribution in the classical Millennium I simulation (De Lucia et al. 2006) using a semi-analytical model for the genesis of the galaxies, and simulated a starburst galaxy sample with a Markov Chain Monte Carlo method. We also checked, on a defined scale, the connection between the homogeneity of large-scale structures and the distribution of starburst groups (Kofman and Shandarin 1998; Suhhonenko et al. 2011; Liivamägi et al. 2012; Park et al. 2012; Horvath et al. 2014; Horvath et al. 2015).
[Adverse Effect Predictions Based on Computational Toxicology Techniques and Large-scale Databases].
Uesawa, Yoshihiro
2018-01-01
Understanding the features of chemical structures related to the adverse effects of drugs is useful for identifying potential adverse effects of new drugs. Such analyses can be based on the limited information available from post-marketing surveillance, assessment of the potential toxicities of metabolites and illegal drugs with unclear characteristics, screening of lead compounds at the drug discovery stage, and identification of leads for the discovery of new pharmacological mechanisms. This paper describes techniques used in computational toxicology to investigate the content of large-scale spontaneous report databases of adverse effects, illustrated with examples. Furthermore, volcano plotting, a new visualization method for clarifying the relationships between drugs and adverse effects via comprehensive analyses, is introduced. These analyses may produce a great amount of data that can be applied to drug repositioning.
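A hedged sketch of volcano plotting as described: each point is a drug-adverse event pair, placed by effect size and significance. A real analysis would derive both axes from 2x2 contingency tables in a spontaneous-report database; the values below are simulated:

```python
# Volcano plot for drug/adverse-event signals: effect size (log2
# reporting odds ratio) against significance (-log10 p-value).
# Data here are simulated, not drawn from a real report database.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
log2_ror = rng.normal(0, 1.5, 2000)           # simulated effect sizes
neg_log10_p = np.abs(rng.normal(0, 2, 2000))  # simulated significance

# Pairs that are both strong and significant stand out in the "wings".
signal = (np.abs(log2_ror) > 1) & (neg_log10_p > 2)
plt.scatter(log2_ror, neg_log10_p, s=5, c=np.where(signal, "red", "gray"))
plt.xlabel("log2 reporting odds ratio")
plt.ylabel("-log10 p-value")
plt.title("Volcano plot of drug-adverse effect associations (simulated)")
plt.show()
```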
Databases for multilevel biophysiology research available at Physiome.jp.
Asai, Yoshiyuki; Abe, Takeshi; Li, Li; Oka, Hideki; Nomura, Taishin; Kitano, Hiroaki
2015-01-01
Physiome.jp (http://physiome.jp) is a portal site inaugurated in 2007 to support model-based research in physiome and systems biology. At Physiome.jp, several tools and databases are available to support construction of physiological, multi-hierarchical, large-scale models. There are three databases in Physiome.jp, housing mathematical models, morphological data, and time-series data. In late 2013, the site was fully renovated, and in May 2015, new functions were implemented to provide information infrastructure to support collaborative activities for developing models and performing simulations within the database framework. This article describes updates to the databases implemented since 2013, including cooperation among the three databases, interactive model browsing, user management, version management of models, management of parameter sets, and interoperability with applications.
Visual Systems for Interactive Exploration and Mining of Large-Scale Neuroimaging Data Archives
Bowman, Ian; Joshi, Shantanu H.; Van Horn, John D.
2012-01-01
Technological advancements in neuroimaging scanner engineering have improved the efficiency of data acquisition, and electronic data capture methods will likewise significantly expedite the populating of large-scale neuroimaging databases. As these archives grow in size, a particular challenge lies in examining and interacting with the information they contain through compelling, user-driven approaches for data exploration and mining. In this article, we introduce the informatics visualization for neuroimaging (INVIZIAN) framework for the graphical rendering of, and dynamic interaction with, the contents of large-scale neuroimaging data sets. We describe the rationale behind INVIZIAN, detail its development, and demonstrate its usage in examining a collection of over 900 T1-anatomical magnetic resonance imaging (MRI) image volumes from across a diverse set of clinical neuroimaging studies drawn from a leading neuroimaging database. Using a collection of cortical surface metrics and means for examining brain similarity, INVIZIAN graphically displays brain surfaces as points in a coordinate space and enables classification of clusters of neuroanatomically similar MRI images and data mining. As an initial step toward addressing the need for such user-friendly tools, INVIZIAN provides a unique means to interact with large quantities of electronic brain imaging archives in ways suitable for hypothesis generation and data mining. PMID:22536181
Resources for Functional Genomics Studies in Drosophila melanogaster
Mohr, Stephanie E.; Hu, Yanhui; Kim, Kevin; Housden, Benjamin E.; Perrimon, Norbert
2014-01-01
Drosophila melanogaster has become a system of choice for functional genomic studies. Many resources, including online databases and software tools, are now available to support the design or identification of relevant fly stocks and reagents, as well as the analysis and mining of existing functional genomic, transcriptomic, proteomic, and other datasets. These include large community collections of fly stocks and plasmid clones, "meta" information sites like FlyBase and FlyMine, and an increasing number of more specialized reagents, databases, and online tools. Here, we introduce key resources useful to plan large-scale functional genomics studies in Drosophila and to analyze, integrate, and mine the results of those studies in ways that facilitate identification of highest-confidence results and generation of new hypotheses. We also discuss ways in which existing resources can be used and might be improved and suggest a few areas of future development that would further support large- and small-scale studies in Drosophila and facilitate use of Drosophila information by the research community more generally. PMID:24653003
Factors Affecting Volunteering among Older Rural and City Dwelling Adults in Australia
ERIC Educational Resources Information Center
Warburton, Jeni; Stirling, Christine
2007-01-01
In the absence of large scale Australian studies of volunteering among older adults, this study compared the relevance of two theoretical approaches--social capital theory and sociostructural resources theory--to predict voluntary activity in relation to a large national database. The paper explores volunteering by older people (aged 55+) in order…
Su, Xiaoquan; Xu, Jian; Ning, Kang
2012-10-01
Effectively comparing different microbial communities (also referred to as 'metagenomic samples' here) at a large scale has long intrigued scientists: given a set of unknown samples, find similar metagenomic samples from a large repository and examine how similar these samples are. With the metagenomic samples accumulated to date, it is possible to build a database of metagenomic samples of interest, against which any metagenomic sample could then be searched to find the most similar sample(s). However, on one hand, current databases with a large number of metagenomic samples mostly serve as data repositories that offer few functionalities for analysis; on the other hand, methods to measure the similarity of metagenomic data work well only for small sets of samples by pairwise comparison. It is not yet clear how to efficiently search for metagenomic samples against a large metagenomic database. In this study, we have proposed a novel method, Meta-Storms, that can systematically and efficiently organize and search metagenomic data. It includes the following components: (i) creating a database of metagenomic samples based on their taxonomical annotations, (ii) efficient indexing of samples in the database based on a hierarchical taxonomy indexing strategy, (iii) searching for a metagenomic sample against the database by a fast scoring function based on quantitative phylogeny and (iv) managing the database by index export, index import, data insertion, data deletion and database merging. We have collected more than 1300 metagenomic datasets from the public domain and in-house facilities, and tested the Meta-Storms method on them. Our experimental results show that Meta-Storms is capable of database creation and effective searching for a large number of metagenomic samples, and it achieves accuracies similar to the current popular significance testing-based methods. Meta-Storms would serve as a suitable database management and search system to quickly identify similar metagenomic samples from a large pool of samples. Contact: ningkang@qibebt.ac.cn. Supplementary data are available at Bioinformatics online.
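A simplified illustration of taxonomy-aware scoring in the spirit of component (iii) above, not the published Meta-Storms scoring function: abundances are aggregated along lineages, and overlap at deeper taxonomic levels contributes more to the score:

```python
# Toy taxonomy-aware similarity between two metagenomic samples.
# Each sample maps a lineage tuple to a relative abundance; deeper
# shared taxa are weighted more heavily (an illustrative choice).
from collections import defaultdict

def lineage_profile(sample):
    """Aggregate abundances bottom-up along every lineage prefix."""
    agg = defaultdict(float)
    for lineage, ab in sample.items():
        for depth in range(1, len(lineage) + 1):
            agg[lineage[:depth]] += ab
    return agg

def similarity(s1, s2):
    p1, p2 = lineage_profile(s1), lineage_profile(s2)
    shared = 0.0
    for taxon, ab in p1.items():
        if taxon in p2:
            shared += min(ab, p2[taxon]) * len(taxon)  # depth weighting
    return shared

a = {("Bacteria", "Proteobacteria", "Escherichia"): 0.6,
     ("Bacteria", "Firmicutes"): 0.4}
b = {("Bacteria", "Proteobacteria", "Shigella"): 0.5,
     ("Bacteria", "Firmicutes"): 0.5}
print(similarity(a, b))
```

Indexing samples by their lineage prefixes is also what makes the database searchable without all-pairs comparison: a query only needs to touch samples sharing high-level taxa.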
Image segmentation evaluation for very-large datasets
NASA Astrophysics Data System (ADS)
Reeves, Anthony P.; Liu, Shuang; Xie, Yiting
2016-03-01
With the advent of modern machine learning methods and fully automated image analysis, there is a need for very large image datasets with documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual marking do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes is achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to facilitate fully automated measurement of a number of important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.
Edgren, Gustaf; Hjalgrim, Henrik
2010-11-01
At current safety levels, with adverse events from transfusions being relatively rare, further progress in risk reductions will require large-scale investigations. Thus, truly prospective studies may prove unfeasible and other alternatives deserve consideration. In this review, we will try to give an overview of recent and historical developments in the use of blood donation and transfusion databases in research. In addition, we will go over important methodological issues. There are at least three nationwide or near-nationwide donation/transfusion databases with the possibility for long-term follow-up of donors and recipients. During the past few years, a large number of reports have been published utilizing such data sources to investigate transfusion-associated risks. In addition, numerous clinics systematically collect and use such data on a smaller scale. Combining systematically recorded donation and transfusion data with long-term health follow-up opens up exciting opportunities for transfusion medicine research. However, the correct analysis of such data requires close attention to methodological issues, especially including the indication for transfusion and reverse causality.
Remote visual analysis of large turbulence databases at multiple scales
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pulido, Jesus; Livescu, Daniel; Kanov, Kalin
The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.
Daniell, Nathan; Fraysse, François; Paul, Gunther
2012-01-01
Anthropometry has long been used for a range of ergonomic applications and product design. Although products are often designed for specific cohorts, anthropometric data are typically sourced from large-scale surveys representative of the general population. Additionally, few data are available for emerging markets like China and India. This study measured 80 Chinese males who were representative of a specific cohort targeted for the design of a new product. Thirteen anthropometric measurements were recorded and compared to two large databases that represented a general population: a Chinese database and a Western database. Substantial differences were identified between the Chinese males measured in this study and both databases. The subjects were substantially taller, heavier and broader than subjects in the older Chinese database. However, they were still substantially smaller, lighter and thinner than Western males. Data from current Western anthropometric surveys are unlikely to accurately represent the target population for product designers and manufacturers in emerging markets like China.
Architectural Implications for Spatial Object Association Algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kumar, V S; Kurc, T; Saltz, J
2009-01-29
Spatial object association, also referred to as cross-match of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two cross-match algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial cross-match algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST).
The future of medical diagnostics: large digitized databases.
Kerr, Wesley T; Lau, Edward P; Owens, Gwen E; Trefler, Aaron
2012-09-01
The electronic health record mandate within the American Recovery and Reinvestment Act of 2009 will have a far-reaching effect on medicine. In this article, we provide an in-depth analysis of how this mandate is expected to stimulate the production of large-scale, digitized databases of patient information. There is evidence to suggest that millions of patients and the National Institutes of Health will fully support the mining of such databases to better understand the process of diagnosing patients. This data mining likely will reaffirm and quantify known risk factors for many diagnoses. This quantification may be leveraged to further develop computer-aided diagnostic tools that weigh risk factors and provide decision support for health care providers. We expect that the creation of these databases will stimulate the development of computer-aided diagnostic support tools that will become an integral part of modern medicine.
Scale-Up of GRCop: From Laboratory to Rocket Engines
NASA Technical Reports Server (NTRS)
Ellis, David L.
2016-01-01
GRCop is a high temperature, high thermal conductivity copper-based series of alloys designed primarily for use in regeneratively cooled rocket engine liners. It began with laboratory-level production of a few grams of ribbon produced by chill block melt spinning and has grown to commercial-scale production of large-scale rocket engine liners. Along the way, a variety of methods of consolidating and working the alloy were examined, a database of properties was developed and a variety of commercial and government applications were considered. This talk will briefly address the basic material properties used for selection of compositions to scale up, the methods used to go from simple ribbon to rocket engines, the need to develop a suitable database, and the issues related to getting the alloy into a rocket engine or other application.
Angermeier, Paul L.; Frimpong, Emmanuel A.
2009-01-01
The need for integrated and widely accessible sources of species traits data to facilitate studies of ecology, conservation, and management has motivated development of traits databases for various taxa. In spite of the increasing number of traits-based analyses of freshwater fishes in the United States, no consolidated database of traits of this group exists publicly, and much useful information on these species is documented only in obscure sources. The largely inaccessible and unconsolidated traits information makes large-scale analysis involving many fishes and/or traits particularly challenging. FishTraits is a database of >100 traits for 809 (731 native and 78 exotic) fish species found in freshwaters of the conterminous United States, including 37 native families and 145 native genera. The database contains information on four major categories of traits: (1) trophic ecology, (2) body size and reproductive ecology (life history), (3) habitat associations, and (4) salinity and temperature tolerances. Information on geographic distribution and conservation status is also included. Together, we refer to the traits, distribution, and conservation status information as attributes. Many sources were consulted to compile attributes, including state and regional species accounts and other databases.
bpRNA: large-scale automated annotation and analysis of RNA secondary structure.
Danaee, Padideh; Rouches, Mason; Wiley, Michelle; Deng, Dezhong; Huang, Liang; Hendrix, David
2018-05-09
While RNA secondary structure prediction from sequence data has made remarkable progress, there is a need for improved strategies for annotating the features of RNA secondary structures. Here, we present bpRNA, a novel annotation tool capable of parsing RNA structures, including complex pseudoknot-containing RNAs, to yield an objective, precise, compact, unambiguous, easily interpretable description of all loops, stems, and pseudoknots, along with the positions, sequence, and flanking base pairs of each such structural feature. We also introduce several new informative representations of RNA structure types to improve structure visualization and interpretation. We have further used bpRNA to generate a web-accessible meta-database, 'bpRNA-1m', of over 100,000 single-molecule, known secondary structures; this is both more fully and accurately annotated and over 20 times larger than existing databases. We use a subset of the database with highly similar (≥90% identical) sequences filtered out to report on statistical trends in sequence, flanking base pairs, and length. Both the bpRNA method and the bpRNA-1m database will be valuable resources, both for specific analysis of individual RNA molecules and for large-scale analyses such as updating RNA energy parameters for computational thermodynamic predictions, improving machine learning models for structure prediction, and benchmarking structure-prediction algorithms.
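A toy sketch of the kind of parsing bpRNA automates: build a pair table from dot-bracket notation and locate hairpin loops. bpRNA itself also handles pseudoknots, multiloops, bulges, and more; this minimal version does not:

```python
# Minimal dot-bracket parser: pair table plus hairpin-loop detection.
def pair_table(db):
    stack, pairs = [], {}
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs

def hairpin_loops(db):
    pairs = pair_table(db)
    loops = []
    for i, j in pairs.items():
        # A hairpin closes (i, j) with nothing but unpaired bases inside.
        if i < j and all(k not in pairs for k in range(i + 1, j)):
            loops.append((i + 1, j - 1))
    return loops

structure = "((((...))))..((...))"
print(len(pair_table(structure)) // 2, "base pairs;",
      "hairpin loop spans:", hairpin_loops(structure))
```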
Chloroplast 2010: A Database for Large-Scale Phenotypic Screening of Arabidopsis Mutants
Lu, Yan; Savage, Linda J.; Larson, Matthew D.; Wilkerson, Curtis G.; Last, Robert L.
2011-01-01
Large-scale phenotypic screening presents challenges and opportunities not encountered in typical forward or reverse genetics projects. We describe a modular database and laboratory information management system that was implemented in support of the Chloroplast 2010 Project, an Arabidopsis (Arabidopsis thaliana) reverse genetics phenotypic screen of more than 5,000 mutants (http://bioinfo.bch.msu.edu/2010_LIMS; www.plastid.msu.edu). The software and laboratory work environment were designed to minimize operator error and detect systematic process errors. The database uses Ruby on Rails and Flash technologies to present complex quantitative and qualitative data and pedigree information in a flexible user interface. Examples are presented where the database was used to find opportunities for process changes that improved data quality. We also describe the use of the data-analysis tools to discover mutants defective in enzymes of leucine catabolism (heteromeric mitochondrial 3-methylcrotonyl-coenzyme A carboxylase [At1g03090 and At4g34030] and putative hydroxymethylglutaryl-coenzyme A lyase [At2g26800]) based upon a syndrome of pleiotropic seed amino acid phenotypes that resembles previously described isovaleryl coenzyme A dehydrogenase (At3g45300) mutants. In vitro assay results support the computational annotation of At2g26800 as hydroxymethylglutaryl-coenzyme A lyase. PMID:21224340
Lessons Learned from Managing a Petabyte
DOE Office of Scientific and Technical Information (OSTI.GOV)
Becla, J
2005-01-20
The amount of data collected and stored by the average business doubles each year. Many commercial databases are already approaching hundreds of terabytes, and at this rate, will soon be managing petabytes. More data enables new functionality and capability, but the larger scale reveals new problems and issues hidden in "smaller" terascale environments. This paper presents some of these new problems along with implemented solutions in the framework of a petabyte dataset for a large High Energy Physics experiment. Through experience with two persistence technologies, a commercial database and a file-based approach, we expose format-independent concepts and issues prevalent at this new scale of computing.
SureChEMBL: a large-scale, chemically annotated patent document database.
Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P
2016-01-04
SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
NASA Technical Reports Server (NTRS)
Okong'o, Nora; Bellan, Josette
2005-01-01
Models for large eddy simulation (LES) are assessed on a database obtained from direct numerical simulations (DNS) of supercritical binary-species temporal mixing layers. The analysis is performed at the DNS transitional states for heptane/nitrogen, oxygen/hydrogen and oxygen/helium mixing layers. The incorporation of simplifying assumptions that are validated on the DNS database leads to a set of LES equations that requires only models for the subgrid scale (SGS) fluxes, which arise from filtering the convective terms in the DNS equations. Constant-coefficient versions of three different models for the SGS fluxes are assessed and calibrated. The Smagorinsky SGS-flux model shows poor correlations with the SGS fluxes, while the Gradient and Similarity models have high correlations, as well as good quantitative agreement with the SGS fluxes when the calibrated coefficients are used.
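The a priori testing described above can be sketched as follows, under the standard definition of the SGS flux, tau_ij = bar(u_i u_j) - bar(u_i) bar(u_j), with a box (top-hat) filter. The synthetic random field below stands in for DNS data, and the coefficient is the textbook Gradient-model prefactor rather than a calibrated one:

```python
# Sketch of a priori SGS-flux testing on a DNS-like field. The velocity
# field here is synthetic noise, not actual DNS data.
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(2)
u = rng.standard_normal((64, 64, 64))
v = rng.standard_normal((64, 64, 64))
width = 4                                   # filter width in grid cells

bar = lambda f: uniform_filter(f, size=width, mode="wrap")  # box filter
tau_uv = bar(u * v) - bar(u) * bar(v)       # exact SGS flux from the "DNS"

# Gradient-model estimate: tau_uv ~ (Delta^2 / 12) grad(ubar).grad(vbar)
du = np.gradient(bar(u))
dv = np.gradient(bar(v))
model = (width ** 2 / 12.0) * sum(a * b for a, b in zip(du, dv))

# Correlation between exact and modeled flux is the usual a priori metric.
corr = np.corrcoef(tau_uv.ravel(), model.ravel())[0, 1]
print("exact-vs-model SGS flux correlation:", corr)
```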
Data-Mining Techniques in Detecting Factors Linked to Academic Achievement
ERIC Educational Resources Information Center
Martínez Abad, Fernando; Chaparro Caso López, Alicia A.
2017-01-01
In light of the emergence of statistical analysis techniques based on data mining in education sciences, and the potential they offer to detect non-trivial information in large databases, this paper presents a procedure used to detect factors linked to academic achievement in large-scale assessments. The study is based on a non-experimental,…
CLAST: CUDA implemented large-scale alignment search tool.
Yano, Masahiro; Mori, Hiroshi; Akiyama, Yutaka; Yamada, Takuji; Kurokawa, Ken
2014-12-11
Metagenomics is a powerful methodology to study microbial communities, but it is highly dependent on nucleotide sequence similarity searching against sequence databases. Metagenomic analyses with next-generation sequencing technologies produce enormous numbers of reads from microbial communities, and many reads are derived from microbes whose genomes have not yet been sequenced, limiting the usefulness of existing sequence similarity search tools. Therefore, there is a clear need for a sequence similarity search tool that can rapidly detect weak similarity in large datasets. We developed a tool, which we named CLAST (CUDA implemented large-scale alignment search tool), that enables analyses of millions of reads and thousands of reference genome sequences, and runs on NVIDIA Fermi architecture graphics processing units. CLAST has four main advantages over existing alignment tools. First, CLAST was capable of identifying sequence similarities ~80.8 times faster than BLAST and 9.6 times faster than BLAT. Second, CLAST executes global alignment as the default (local alignment is also an option), enabling CLAST to assign reads to taxonomic and functional groups based on evolutionarily distant nucleotide sequences with high accuracy. Third, CLAST does not need a preprocessed sequence database like Burrows-Wheeler Transform-based tools, and this enables CLAST to incorporate large, frequently updated sequence databases. Fourth, CLAST requires <2 GB of main memory, making it possible to run CLAST on a standard desktop computer or server node. CLAST achieved very high speed (similar to the Burrows-Wheeler Transform-based Bowtie 2 for long reads) and sensitivity (equal to BLAST, BLAT, and FR-HIT) without the need for extensive database preprocessing or a specialized computing platform. Our results demonstrate that CLAST has the potential to be one of the most powerful and realistic approaches to analyze the massive amount of sequence data from next-generation sequencing technologies.
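For reference, the global alignment that CLAST defaults to is the classic Needleman-Wunsch recurrence; a toy scoring-only version is sketched below. Real tools run banded, vectorized GPU kernels rather than this O(nm) Python loop, and the score values here are arbitrary:

```python
# Toy Needleman-Wunsch global alignment score with linear gap penalty,
# using two rolling rows to keep memory at O(m).
def global_align_score(a, b, match=1, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    prev = [j * gap for j in range(m + 1)]   # aligning "" against b[:j]
    for i in range(1, n + 1):
        cur = [i * gap] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[m]

print(global_align_score("ACGTACGT", "ACGAACGT"))  # one mismatch -> 6
```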
Mitchell, Joshua M.; Fan, Teresa W.-M.; Lane, Andrew N.; Moseley, Hunter N. B.
2014-01-01
Large-scale identification of metabolites is key to elucidating and modeling metabolism at the systems level. Advances in metabolomics technologies, particularly ultra-high resolution mass spectrometry (MS), enable comprehensive and rapid analysis of metabolites. However, a significant barrier to meaningful data interpretation is the identification of a wide range of metabolites, including unknowns, and the determination of their role(s) in various metabolic networks. Chemoselective (CS) probes to tag metabolite functional groups, combined with high mass accuracy, provide additional structural constraints for metabolite identification and quantification. We have developed a novel algorithm, Chemically Aware Substructure Search (CASS), that efficiently detects functional groups within existing metabolite databases, allowing for combined molecular formula and functional group (from CS tagging) queries to aid in metabolite identification without a priori knowledge. Analysis of the isomeric compounds in both the Human Metabolome Database (HMDB) and KEGG Ligand demonstrated a high percentage of isomeric molecular formulae (43 and 28%, respectively), indicating the necessity for techniques such as CS-tagging. Furthermore, these two databases have only moderate overlap in molecular formulae. Thus, it is prudent to use multiple databases in metabolite assignment, since each major metabolite database represents a different portion of metabolism within the biosphere. In silico analysis of various CS-tagging strategies under different conditions for adduct formation demonstrates that combined FT-MS derived molecular formulae and CS-tagging can uniquely identify up to 71% of KEGG and 37% of the combined KEGG/HMDB database, vs. 41 and 17%, respectively, without adduct formation. This difference in database isomer disambiguation highlights the strength of CS-tagging for non-lipid metabolite identification. However, unique identification of complex lipids still needs additional information. PMID:25120557
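Functional-group detection of the kind CASS performs can be approximated with SMARTS substructure queries; the sketch below uses RDKit with a few illustrative patterns, which are not CASS's actual rules:

```python
# Sketch: detect functional groups in a metabolite via SMARTS matching.
# Patterns and the example molecule are illustrative only.
from rdkit import Chem

functional_groups = {
    "carboxylic_acid": Chem.MolFromSmarts("C(=O)[OH]"),
    "primary_amine": Chem.MolFromSmarts("[NX3;H2]"),
    "carbonyl": Chem.MolFromSmarts("[CX3]=[OX1]"),
}

mol = Chem.MolFromSmiles("NCC(=O)O")  # glycine
for name, patt in functional_groups.items():
    print(name, mol.HasSubstructMatch(patt))
```

Combining such hits with an FT-MS molecular formula then narrows the candidate isomers, which is the disambiguation effect the abstract quantifies.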
Large-scale Health Information Database and Privacy Protection.
Yamamoto, Ryuichi
2016-09-01
Japan was once progressive in the digitalization of healthcare fields but unfortunately has fallen behind in terms of the secondary use of data for public interest. There has recently been a trend to establish large-scale health databases in the nation, and a conflict between data use for public interest and privacy protection has surfaced as this trend has progressed. Databases for health insurance claims or for specific health checkups and guidance services were created according to the law that aims to ensure healthcare for the elderly; however, there is no mention in the act about using these databases for public interest in general. Thus, an initiative for such use must proceed carefully and attentively. The PMDA projects that collect a large amount of medical record information from large hospitals and the health database development project that the Ministry of Health, Labour and Welfare (MHLW) is working on will soon begin to operate according to a general consensus; however, the validity of this consensus can be questioned if issues of anonymity arise. The likelihood that researchers conducting a study for public interest would intentionally invade the privacy of their subjects is slim. However, patients could develop a sense of distrust about their data being used since legal requirements are ambiguous. Nevertheless, without using patients' medical records for public interest, progress in medicine will grind to a halt. Proper legislation that is clear for both researchers and patients will therefore be highly desirable. A revision of the Act on the Protection of Personal Information is currently in progress. In reality, however, privacy is not something that laws alone can protect; it will also require guidelines and self-discipline. We now live in an information capitalization age. I will introduce the trends in legal reform regarding healthcare information and discuss some basics to help people properly face the issue of health big data and privacy protection with a sense of ownership.
NASA Technical Reports Server (NTRS)
Strom, Stephen; Sargent, Wallace L. W.; Wolff, Sidney; Ahearn, Michael F.; Angel, J. Roger; Beckwith, Steven V. W.; Carney, Bruce W.; Conti, Peter S.; Edwards, Suzan; Grasdalen, Gary
1991-01-01
Optical/infrared (O/IR) astronomy in the 1990's is reviewed. The following subject areas are included: research environment; science opportunities; technical development of the 1980's and opportunities for the 1990's; and ground-based O/IR astronomy outside the U.S. Recommendations are presented for: (1) large scale programs (Priority 1: a coordinated program for large O/IR telescopes); (2) medium scale programs (Priority 1: a coordinated program for high angular resolution; Priority 2: a new generation of 4-m class telescopes); (3) small scale programs (Priority 1: near-IR and optical all-sky surveys; Priority 2: a National Astrometric Facility); and (4) infrastructure issues (develop, purchase, and distribute optical CCDs and infrared arrays; a program to support large optics technology; a new generation of large filled aperture telescopes; a program to archive and disseminate astronomical databases; and a program for training new instrumentalists)
Cloud-Based Distributed Control of Unmanned Systems
2015-04-01
during mission execution. At best, the data is saved onto hard drives and is accessible only by the local team. Data history in a form available and... following open source technologies: GeoServer, OpenLayers, PostgreSQL, and PostGIS are chosen to implement the back-end database and server. A brief... geospatial map data. 3. PostgreSQL: An SQL-compliant object-relational database that easily scales to accommodate large amounts of data - upwards to
Zhang, Yaoyang; Xu, Tao; Shan, Bing; Hart, Jonathan; Aslanian, Aaron; Han, Xuemei; Zong, Nobel; Li, Haomin; Choi, Howard; Wang, Dong; Acharya, Lipi; Du, Lisa; Vogt, Peter K; Ping, Peipei; Yates, John R
2015-11-03
Shotgun proteomics generates valuable information from large-scale and target protein characterizations, including protein expression, protein quantification, protein post-translational modifications (PTMs), protein localization, and protein-protein interactions. Typically, peptides derived from proteolytic digestion, rather than intact proteins, are analyzed by mass spectrometers because peptides are more readily separated, ionized and fragmented. The amino acid sequences of peptides can be interpreted by matching the observed tandem mass spectra to theoretical spectra derived from a protein sequence database. Identified peptides serve as surrogates for their proteins and are often used to establish what proteins were present in the original mixture and to quantify protein abundance. Two major issues exist for assigning peptides to their originating protein. The first issue is maintaining a desired false discovery rate (FDR) when comparing or combining multiple large datasets generated by shotgun analysis and the second issue is properly assigning peptides to proteins when homologous proteins are present in the database. Herein we demonstrate a new computational tool, ProteinInferencer, which can be used for protein inference with both small- or large-scale data sets to produce a well-controlled protein FDR. In addition, ProteinInferencer introduces confidence scoring for individual proteins, which makes protein identifications evaluable. This article is part of a Special Issue entitled: Computational Proteomics. Copyright © 2015. Published by Elsevier B.V.
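One common way to maintain a desired FDR at the protein level is target-decoy counting; a minimal sketch is below. ProteinInferencer's actual confidence scoring is more sophisticated than this ratio:

```python
# Minimal target-decoy FDR estimate at a score threshold.
def protein_fdr(scored, threshold):
    """scored: list of (score, is_decoy). Estimated FDR above threshold."""
    targets = sum(1 for s, d in scored if s >= threshold and not d)
    decoys = sum(1 for s, d in scored if s >= threshold and d)
    return decoys / max(targets, 1)   # decoys approximate false targets

scored = [(9.1, False), (8.7, False), (8.2, True), (7.9, False), (7.5, True)]
print(protein_fdr(scored, threshold=7.0))  # 2 decoys / 3 targets ~ 0.67
```

Sweeping the threshold until the estimate drops below the desired level (say 1%) gives the score cutoff used to report proteins.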
Tropical Cyclone Information System
NASA Technical Reports Server (NTRS)
Li, P. Peggy; Knosp, Brian W.; Vu, Quoc A.; Yi, Chao; Hristova-Veleva, Svetla M.
2009-01-01
The JPL Tropical Cyclone Information System (TCIS) is a Web portal (http://tropicalcyclone.jpl.nasa.gov) that provides researchers with an extensive set of observed hurricane parameters together with large-scale and convection resolving model outputs. It provides a comprehensive set of high-resolution satellite, airborne, and in-situ observations in both image and data formats. Large-scale datasets depict the surrounding environmental parameters such as SST (Sea Surface Temperature) and aerosol loading. Model outputs and analysis tools are provided to evaluate model performance and compare observations from different platforms. The system pertains to the thermodynamic and microphysical structure of the storm, the air-sea interaction processes, and the larger-scale environment as depicted by ocean heat content and the aerosol loading of the environment. Currently, the TCIS is populated with satellite observations of all tropical cyclones observed globally during 2005. There is a plan to extend the database both forward in time till present as well as backward to 1998. The portal is powered by a MySQL database and an Apache/Tomcat Web server on a Linux system. The interactive graphic user interface is provided by Google Map.
Large-scale mapping of hard-rock aquifer properties applied to Burkina Faso.
Courtois, Nathalie; Lachassagne, Patrick; Wyns, Robert; Blanchin, Raymonde; Bougaïré, Francis D; Somé, Sylvain; Tapsoba, Aïssata
2010-01-01
A country-scale (1:1,000,000) methodology has been developed for hydrogeologic mapping of hard-rock aquifers (granitic and metamorphic rocks) of the type that underlie a large part of the African continent. The method is based on quantifying the "useful thickness" and hydrodynamic properties of such aquifers and uses a recent conceptual model developed for this hydrogeologic context. This model links hydrodynamic parameters (transmissivity, storativity) to lithology and the geometry of the various layers constituting a weathering profile. The country-scale hydrogeological mapping was implemented in Burkina Faso, where a recent 1:1,000,000-scale digital geological map and a database of some 16,000 water wells were used to evaluate the methodology.
In-Memory Graph Databases for Web-Scale Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castellana, Vito G.; Morari, Alessandro; Weaver, Jesse R.
RDF databases have emerged as one of the most relevant ways of organizing, integrating, and managing exponentially growing, often heterogeneous, and not rigidly structured data for a variety of scientific and commercial fields. In this paper we discuss the solutions integrated in GEMS (Graph database Engine for Multithreaded Systems), a software framework for implementing RDF databases on commodity, distributed-memory high-performance clusters. Unlike the majority of current RDF databases, GEMS has been designed from the ground up to primarily employ graph-based methods. This is reflected in all the layers of its stack. The GEMS framework is composed of: a SPARQL-to-C++ compiler, a library of data structures and related methods to access and modify them, and a custom runtime providing lightweight software multithreading, network message aggregation and a partitioned global address space. We provide an overview of the framework, detailing its components and how they have been closely designed and customized to address issues of graph methods applied to large-scale datasets on clusters. We discuss in detail the principles that enable automatic translation of the queries (expressed in SPARQL, the query language of choice for RDF databases) to graph methods, and identify differences with respect to other RDF databases.
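The conceptual core of SPARQL-to-graph translation is that a triple pattern becomes a scan or traversal with variable binding; a toy in-memory version is sketched below (GEMS compiles such patterns to C++ over distributed graph structures, which this does not attempt):

```python
# Toy execution of a SPARQL-like triple pattern over an in-memory
# triple store; strings starting with '?' act as variables.
triples = {("alice", "knows", "bob"),
           ("bob", "knows", "carol"),
           ("carol", "worksAt", "lab")}

def match(pattern):
    for s, p, o in triples:
        binding = {}
        ok = True
        for term, val in zip(pattern, (s, p, o)):
            if term.startswith("?"):
                if binding.setdefault(term, val) != val:
                    ok = False
                    break
            elif term != val:
                ok = False
                break
        if ok:
            yield binding

# SELECT ?x ?y WHERE { ?x knows ?y } as a scan with variable binding
print(list(match(("?x", "knows", "?y"))))
```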
Molecular signatures database (MSigDB) 3.0.
Liberzon, Arthur; Subramanian, Aravind; Pinchback, Reid; Thorvaldsdóttir, Helga; Tamayo, Pablo; Mesirov, Jill P
2011-06-15
Well-annotated gene sets representing the universe of the biological processes are critical for meaningful and insightful interpretation of large-scale genomic data. The Molecular Signatures Database (MSigDB) is one of the most widely used repositories of such sets. We report the availability of a new version of the database, MSigDB 3.0, with over 6700 gene sets, a complete revision of the collection of canonical pathways and experimental signatures from publications, enhanced annotations and upgrades to the web site. MSigDB is freely available for non-commercial use at http://www.broadinstitute.org/msigdb.
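The typical use of MSigDB gene sets is an over-representation test on the overlap between an experimental hit list and each set; a minimal hypergeometric sketch with made-up counts:

```python
# One-sided over-representation test for a single gene set.
from scipy.stats import hypergeom

M = 20000   # genes in the universe
n = 150     # genes in the MSigDB gene set
N = 400     # genes in the experimental hit list
k = 12      # overlap between the two (made-up numbers)

# P(overlap >= k) under random sampling without replacement
p_value = hypergeom.sf(k - 1, M, n, N)
print(p_value)
```

Running this over all 6700+ sets, followed by multiple-testing correction, is the standard enrichment workflow the database supports.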
Mehryary, Farrokh; Kaewphan, Suwisa; Hakala, Kai; Ginter, Filip
2016-01-01
Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature and containing more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classification produced by the unsupervised clustering method and manual annotation. The method is evaluated on the official test set of the BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not bound solely to the EVEX resource and can thus be used to improve the quality of any event extraction system or database. The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.
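The unsupervised step can be sketched as follows: hierarchically cluster trigger-word embeddings and cut the tree into candidate clusters for manual review. The vectors below are random stand-ins for real word embeddings:

```python
# Hierarchical clustering of (stand-in) trigger-word embeddings,
# followed by pruning the tree into a fixed number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
embeddings = rng.standard_normal((200, 50))   # 200 trigger words, 50-dim

tree = linkage(embeddings, method="ward")     # agglomerative cluster tree
labels = fcluster(tree, t=10, criterion="maxclust")  # cut into 10 clusters
print(np.bincount(labels)[1:])                # cluster sizes
```

In the paper's workflow, whole clusters whose members can never act as triggers are discarded, so one manual judgment covers many words at once.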
Pharmacogenomic agreement between two cancer cell line data sets.
2015-12-03
Large cancer cell line collections broadly capture the genomic diversity of human cancers and provide valuable insight into anti-cancer drug response. Here we show substantial agreement and biological consilience between drug sensitivity measurements and their associated genomic predictors from two publicly available large-scale pharmacogenomics resources: The Cancer Cell Line Encyclopedia and the Genomics of Drug Sensitivity in Cancer databases.
Mining large heterogeneous data sets in drug discovery.
Wild, David J
2009-10-01
Increasingly, effective drug discovery involves the searching and data mining of large volumes of information from many sources covering the domains of chemistry, biology and pharmacology amongst others. This has led to a proliferation of databases and data sources relevant to drug discovery. This paper provides a review of the publicly-available large-scale databases relevant to drug discovery, describes the kinds of data mining approaches that can be applied to them and discusses recent work in integrative data mining that looks for associations that pan multiple sources, including the use of Semantic Web techniques. The future of mining large data sets for drug discovery requires intelligent, semantic aggregation of information from all of the data sources described in this review, along with the application of advanced methods such as intelligent agents and inference engines in client applications.
Transformation of social networks in the late pre-Hispanic US Southwest.
Mills, Barbara J; Clark, Jeffery J; Peeples, Matthew A; Haas, W R; Roberts, John M; Hill, J Brett; Huntley, Deborah L; Borck, Lewis; Breiger, Ronald L; Clauset, Aaron; Shackley, M Steven
2013-04-09
The late pre-Hispanic period in the US Southwest (A.D. 1200-1450) was characterized by large-scale demographic changes, including long-distance migration and population aggregation. To reconstruct how these processes reshaped social networks, we compiled a comprehensive artifact database from major sites dating to this interval in the western Southwest. We combine social network analysis with geographic information systems approaches to reconstruct network dynamics over 250 y. We show how social networks were transformed across the region at previously undocumented spatial, temporal, and social scales. Using well-dated decorated ceramics, we track changes in network topology at 50-y intervals to show a dramatic shift in network density and settlement centrality from the northern to the southern Southwest after A.D. 1300. Both obsidian sourcing and ceramic data demonstrate that long-distance network relationships also shifted from north to south after migration. Surprisingly, social distance does not always correlate with spatial distance because of the presence of network relationships spanning long geographic distances. Our research shows how a large network in the southern Southwest grew and then collapsed, whereas networks became more fragmented in the northern Southwest but persisted. The study also illustrates how formal social network analysis may be applied to large-scale databases of material culture to illustrate multigenerational changes in network structure.
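The per-interval network measures described above (density, settlement centrality) are straightforward to compute with standard tools; the sketch below uses random graphs as stand-ins for the ceramic-similarity networks actually built from assemblage data:

```python
# Track density and the most central node across 50-year intervals.
# Edges here are random; real edges come from ceramic similarity scores.
import random
import networkx as nx

random.seed(4)
for interval in ("1200-1250", "1250-1300", "1300-1350"):
    g = nx.gnm_random_graph(n=40, m=random.randint(60, 160))
    centrality = nx.degree_centrality(g)
    hub = max(centrality, key=centrality.get)
    print(interval, "density=%.3f" % nx.density(g),
          "most central settlement:", hub)
```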
Rice, Michael; Gladstone, William; Weir, Michael
2004-01-01
We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a custom algorithm using Drosophila cDNA transcripts and genomic DNA and supports a set of procedures for analyzing splice-site sequence space. A generic Web interface permits the execution of the procedures with a variety of parameter settings and also supports custom structured query language queries. Moreover, new analytical procedures can be added by updating special metatables in the database without altering the Web interface. The database provides a powerful setting for students to develop informatic thinking skills.
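As a flavor of the approach, a toy splice-site table and one analysis query in SQLite are sketched below; the schema and values are invented for illustration and are not the Wesleyan database's actual schema:

```python
# Toy relational representation of splice sites, queried with plain SQL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE splice_site (
    gene TEXT, chrom TEXT, position INTEGER,
    kind TEXT CHECK (kind IN ('donor', 'acceptor')),
    context TEXT)""")
con.executemany(
    "INSERT INTO splice_site VALUES (?, ?, ?, ?, ?)",
    [("dpp", "2L", 2445000, "donor", "CAGGTAAGT"),
     ("dpp", "2L", 2446100, "acceptor", "TTTCAGGT"),
     ("wg", "2L", 7310500, "donor", "AAGGTGAGT")])

# Analyses are plain SQL: new questions need no application-code changes.
for row in con.execute(
        "SELECT gene, COUNT(*) FROM splice_site "
        "WHERE kind='donor' GROUP BY gene"):
    print(row)
```

This separation, data and analysis procedures in the database, interface kept generic, is what lets new procedures be added via metatables without altering the Web front end.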
The impact of large-scale, long-term optical surveys on pulsating star research
NASA Astrophysics Data System (ADS)
Soszyński, Igor
2017-09-01
The era of large-scale photometric variability surveys began a quarter of a century ago, when three microlensing projects - EROS, MACHO, and OGLE - started their operation. These surveys initiated a revolution in the field of variable stars and in the next years they inspired many new observational projects. Large-scale optical surveys multiplied the number of variable stars known in the Universe. The huge, homogeneous and complete catalogs of pulsating stars, such as Cepheids, RR Lyrae stars, or long-period variables, offer an unprecedented opportunity to calibrate and test the accuracy of various distance indicators, to trace the three-dimensional structure of the Milky Way and other galaxies, to discover exotic types of intrinsically variable stars, or to study previously unknown features and behaviors of pulsators. We present historical and recent findings on various types of pulsating stars obtained from the optical large-scale surveys, with particular emphasis on the OGLE project which currently offers the largest photometric database among surveys for stellar variability.
Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework.
Lewis, Steven; Csordas, Attila; Killcoyne, Sarah; Hermjakob, Henning; Hoopmann, Michael R; Moritz, Robert L; Deutsch, Eric W; Boyle, John
2012-12-05
For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
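The shape of the computation, a map phase that scores each spectrum independently followed by a reduce phase that collects confident matches, can be mirrored in plain Python. This only echoes the structure; Hydra runs the K-score algorithm on Hadoop, and the masses and tolerance below are made up:

```python
# Conceptual map/reduce for spectrum-vs-peptide matching.
from multiprocessing import Pool

PEPTIDE_MASSES = {"PEPTIDE": 799.36, "PROTEIN": 856.46, "SCIENCE": 779.29}

def score_spectrum(spectrum):
    """Map step: match one spectrum against every candidate peptide."""
    mz, tol = spectrum
    hits = [(pep, abs(mass - mz)) for pep, mass in PEPTIDE_MASSES.items()
            if abs(mass - mz) <= tol]
    return min(hits, key=lambda h: h[1]) if hits else None

if __name__ == "__main__":
    spectra = [(799.4, 0.5), (856.5, 0.5), (123.0, 0.5)]
    with Pool(2) as pool:                    # parallel "map" phase
        results = pool.map(score_spectrum, spectra)
    # "Reduce" phase: keep only the spectra that found a match
    print([r for r in results if r is not None])
```

Because each spectrum is scored independently, throughput scales with the number of workers, which is the property the abstract reports for the Hadoop cluster.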
Angermeier, Paul L.; Frimpong, Emmanuel A.
2011-01-01
The need for integrated and widely accessible sources of species traits data to facilitate studies of ecology, conservation, and management has motivated development of traits databases for various taxa. In spite of the increasing number of traits-based analyses of freshwater fishes in the United States, no consolidated database of traits of this group exists publicly, and much useful information on these species is documented only in obscure sources. This largely inaccessible and unconsolidated traits information makes large-scale analysis involving many fishes and/or traits particularly challenging. We have compiled a database of > 100 traits for 809 (731 native and 78 nonnative) fish species found in freshwaters of the conterminous United States, including 37 native families and 145 native genera. The database, named Fish Traits, contains information on four major categories of traits: (1) trophic ecology; (2) body size, reproductive ecology, and life history; (3) habitat preferences; and (4) salinity and temperature tolerances. Information on geographic distribution and conservation status was also compiled. The database opens many opportunities for research on fish species traits and constitutes the first step toward establishing a central repository for a continually expanding set of traits of North American fishes.
NVST Data Archiving System Based On FastBit NoSQL Database
NASA Astrophysics Data System (ADS)
Liu, Ying-bo; Wang, Feng; Ji, Kai-fan; Deng, Hui; Dai, Wei; Liang, Bo
2014-06-01
The New Vacuum Solar Telescope (NVST) is a 1-meter vacuum solar telescope that aims to observe the fine structures of active regions on the Sun. The main tasks of the NVST are high-resolution imaging and spectral observations, including measurements of the solar magnetic field. The NVST has collected more than 20 million FITS files since it began routine observations in 2012 and produces up to 120 thousand observational records in a day. Given the large number of files, effective archiving and retrieval has become a critical and urgent problem. In this study, we implement a new data archiving system for the NVST based on the FastBit Not Only Structured Query Language (NoSQL) database. Compared to a relational database (i.e., MySQL; My Structured Query Language), the FastBit database shows distinct advantages in indexing and querying performance. In a large-scale database of 40 million records, multi-field combined queries respond about 15 times faster with FastBit, fully meeting the requirements of the NVST. Our study offers a new approach to massive astronomical data archiving and can inform the design of data management systems for other astronomical telescopes.
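The speedup reported for FastBit comes from bitmap indexing. A minimal sketch of the idea, with invented field names, using Python integers as bitmaps (FastBit itself uses compressed bitmaps implemented in C++):

```python
# One precomputed bitmap per (field, value); a multi-field combined query
# is just a bitwise AND of bitmaps, with no per-record scan at query time.
from collections import defaultdict

records = [
    {"instrument": "TiO", "quality": "good", "day": "2014-06-01"},
    {"instrument": "Ha",  "quality": "good", "day": "2014-06-01"},
    {"instrument": "TiO", "quality": "bad",  "day": "2014-06-02"},
]

index = defaultdict(int)
for i, rec in enumerate(records):
    for field, value in rec.items():
        index[(field, value)] |= 1 << i   # set bit i in this value's bitmap

hits = index[("instrument", "TiO")] & index[("quality", "good")]
print([i for i in range(len(records)) if hits >> i & 1])  # -> [0]
```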
Menditto, Enrica; Bolufer De Gea, Angela; Cahir, Caitriona; Marengoni, Alessandra; Riegler, Salvatore; Fico, Giuseppe; Costa, Elisio; Monaco, Alessandro; Pecorelli, Sergio; Pani, Luca; Prados-Torres, Alexandra
2016-01-01
Computerized health care databases have been widely described as an excellent opportunity for research. The availability of "big data" has brought about a wave of innovation in health services research projects. Most of the available secondary data sources are restricted to the geographical scope of a given country and present heterogeneous structure and content. Under the umbrella of the European Innovation Partnership on Active and Healthy Ageing, collaborative work conducted by the partners of the group on "adherence to prescription and medical plans" identified the use of observational and large-population databases to monitor medication-taking behavior in the elderly. This article describes the methodology used to gather the information from available databases among the Adherence Action Group partners, with the aim of improving data sharing on a European level. A total of six databases belonging to three European countries (Spain, Republic of Ireland, and Italy) were included in the analysis. Preliminary results suggest that there are some similarities. However, these results should be tested in different contexts and European countries, supporting the idea that large European studies should be designed in order to get the most out of already available databases.
Schmedes, Sarah E; King, Jonathan L; Budowle, Bruce
2015-01-01
Whole-genome data are invaluable for large-scale comparative genomic studies. Current sequencing technologies have made it feasible to sequence entire bacterial genomes with relative ease and speed, and at a substantially reduced cost per nucleotide, hence cost per genome. More than 3,000 bacterial genomes have been sequenced and are available at the finished status. Publicly available genomes can be readily downloaded; however, it is challenging to verify the specific supporting data contained within the download and to identify errors and inconsistencies that may be present within the organizational data content and metadata. AutoCurE, an automated tool for bacterial genome database curation in Excel, was developed to facilitate local curation of the supporting data that accompany genomes downloaded from the National Center for Biotechnology Information. AutoCurE provides an automated approach to curating local genomic databases by comparing the downloaded supporting data to the genome reports to verify genome names, RefSeq accession numbers, the presence of archaea, BioProject/UIDs, and sequence file descriptions. Flags are generated for nine metadata fields if there are inconsistencies between the downloaded genomes and the genome reports, or if erroneous or missing data are evident. AutoCurE is an easy-to-use tool for local curation of large-scale genome data prior to downstream analyses.
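A minimal sketch of the kind of field-by-field consistency check AutoCurE automates, assuming hypothetical record dictionaries (the tool's actual nine metadata fields and its Excel workflow differ):

```python
# Compare a downloaded genome record against the genome report and emit a
# flag per mismatched or missing field. Field names are illustrative.
def flag_inconsistencies(downloaded, report, fields):
    flags = []
    for f in fields:
        dl, rp = downloaded.get(f), report.get(f)
        if dl is None or rp is None:
            flags.append((f, "missing"))
        elif dl != rp:
            flags.append((f, f"mismatch: {dl!r} != {rp!r}"))
    return flags

downloaded = {"name": "E. coli K-12", "refseq": "NC_000913.3"}
report = {"name": "E. coli K-12", "refseq": "NC_000913.2",
          "bioproject": "PRJNA57779"}
print(flag_inconsistencies(downloaded, report,
                           ["name", "refseq", "bioproject"]))
# -> [('refseq', "mismatch: 'NC_000913.3' != 'NC_000913.2'"),
#     ('bioproject', 'missing')]
```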
Arntzen, Magnus Ø; Thiede, Bernd
2012-02-01
Apoptosis is the most commonly described form of programmed cell death, and its dysfunction is implicated in a large number of human diseases. Many quantitative proteome analyses of apoptosis have been performed to gain insight into the proteins involved in the process. This has resulted in large and complex data sets that are difficult to evaluate. Therefore, we developed the ApoptoProteomics database for storage, browsing, and analysis of the outcome of large-scale proteome analyses of apoptosis in human, mouse, and rat. The proteomics data of 52 publications were integrated and unified with protein annotations from UniProt-KB, the caspase substrate database homepage (CASBAH), and gene ontology. Currently, more than 2300 records of more than 1500 unique proteins are included, covering a large proportion of the core signaling pathways of apoptosis. Analysis of the data set revealed a high level of agreement between the directionality of changes reported in proteomics studies and expected apoptosis-related function, and it may disclose proteins without a currently recognized involvement in apoptosis based on gene ontology. Comparison between induction of apoptosis by the intrinsic and the extrinsic apoptotic signaling pathways revealed slight differences. Furthermore, proteomics has significantly contributed to the field of apoptosis by identifying hundreds of caspase substrates. The database is available at http://apoptoproteomics.uio.no.
A global, open-source database of flood protection standards
NASA Astrophysics Data System (ADS)
Scussolini, Paolo; Aerts, Jeroen; Jongman, Brenden; Bouwer, Laurens; Winsemius, Hessel; de Moel, Hans; Ward, Philip
2016-04-01
Accurate flood risk estimation is pivotal in that it enables risk-informed policies in disaster risk reduction, as emphasized in the recent Sendai Framework for Disaster Risk Reduction. To improve our understanding of flood risk, models are now capable of providing actionable risk information on the (sub)global scale. Still, the accuracy of their results is greatly limited by the lack of information on the flood protection standards actually in place, and researchers thus make strong assumptions about the extent of protection. With our work we propose a first global, open-source database of FLOod PROtection Standards, FLOPROS, covering a range of spatial scales. FLOPROS is structured in three layers of information, merged into one consistent database: 1) the Design layer contains empirical information about the standard of protection presently in place; 2) the Policy layer contains intended protection standards from normative documents; 3) the Model layer uses a validated numerical approach to calculate protection standards for areas not covered in the other layers. The FLOPROS database can be used for more accurate risk assessment exercises across scales. As the database should be continually updated to reflect new interventions, we invite researchers and practitioners to contribute information. Further, we look for partners within the risk community to participate in additional strategies to improve the amount and accuracy of information contained in this first version of FLOPROS.
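A minimal sketch of the three-layer precedence described, assuming toy return-period values (the published merging procedure involves more careful harmonization across scales):

```python
# For each region, take the Design-layer standard if present, else the
# Policy layer, else the modelled value. Numbers are invented.
def merge_flopros(design, policy, model):
    merged = {}
    for r in set(design) | set(policy) | set(model):
        if r in design:
            merged[r] = ("design", design[r])
        elif r in policy:
            merged[r] = ("policy", policy[r])
        else:
            merged[r] = ("model", model[r])
    return merged

design = {"Rotterdam": 10000}           # protection in place (return period, yr)
policy = {"Jakarta": 100}               # intended standard from documents
model = {"Jakarta": 30, "Lagos": 10}    # modelled estimate elsewhere
print(merge_flopros(design, policy, model))
# -> {'Rotterdam': ('design', 10000), 'Jakarta': ('policy', 100),
#     'Lagos': ('model', 10)}
```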
Pemberton, T J; Jakobsson, M; Conrad, D F; Coop, G; Wall, J D; Pritchard, J K; Patel, P I; Rosenberg, N A
2008-07-01
When performing association studies in populations that have not been the focus of large-scale investigations of haplotype variation, it is often helpful to rely on genomic databases in other populations for study design and analysis - such as in the selection of tag SNPs and in the imputation of missing genotypes. One way of improving the use of these databases is to rely on a mixture of database samples that is similar to the population of interest, rather than using the single most similar database sample. We demonstrate the effectiveness of the mixture approach in the application of African, European, and East Asian HapMap samples for tag SNP selection in populations from India, a genetically intermediate region underrepresented in genomic studies of haplotype variation.
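A hedged sketch of the mixture idea: approximate a target population's allele frequencies as a nonnegative, normalized combination of reference panels rather than picking the single closest one. The frequencies below are toy numbers, and nonnegative least squares is an illustrative stand-in for the paper's estimator:

```python
import numpy as np
from scipy.optimize import nnls

# rows = SNPs, columns = reference panels (e.g., African, European,
# East Asian HapMap samples); values are allele frequencies.
ref = np.array([[0.10, 0.40, 0.55],
                [0.80, 0.30, 0.20],
                [0.50, 0.60, 0.35],
                [0.25, 0.70, 0.65]])
target = np.array([0.45, 0.28, 0.55, 0.66])   # population of interest

w, _ = nnls(ref, target)    # nonnegative least-squares mixture weights
w /= w.sum()                # normalize to a proper mixture
print(np.round(w, 3))       # weights over the three reference panels
```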
Towards a New Assessment of Urban Areas from Local to Global Scales
NASA Astrophysics Data System (ADS)
Bhaduri, B. L.; Roy Chowdhury, P. K.; McKee, J.; Weaver, J.; Bright, E.; Weber, E.
2015-12-01
Since the early 2000s, starting with NASA MODIS, satellite-based remote sensing has facilitated collection of imagery with medium spatial resolution but high temporal resolution (daily). This trend continues with an increasing number of sensors and data products. Increasing spatial and temporal resolutions of remotely sensed data archives, from both public and commercial sources, have significantly enhanced the quality of mapping and change-detection data products. However, even with automation of such analysis on evolving computing platforms, rates of data processing have been suboptimal, largely because of the ever-increasing pixel-to-processor ratio coupled with limitations of the computing architectures. Novel approaches utilizing spatiotemporal data mining techniques and computational architectures have emerged that demonstrate the potential for sustained and geographically scalable landscape monitoring to become operational. We exemplify this challenge with two broad research initiatives on High Performance Geocomputation at Oak Ridge National Laboratory: (a) mapping global settlement distribution; (b) developing national critical infrastructure databases. Our present effort, on large GPU-based architectures, to exploit high-resolution (1 m or less) satellite and airborne imagery for extracting settlements at global scale is yielding an understanding of human settlement patterns and urban areas at unprecedented resolution. Comparison of this urban land cover database with existing national and global land cover products, at various geographic scales in selected parts of the world, is revealing intriguing patterns and insights for urban assessment. Early results from the USA, Taiwan, and Egypt indicate closer agreement (5-10%) in urban area assessments among databases at larger, aggregated geographic extents. However, spatial variability at local scales can differ significantly (over 50% disagreement).
Suchard, Marc A; Zorych, Ivan; Simpson, Shawn E; Schuemie, Martijn J; Ryan, Patrick B; Madigan, David
2013-10-01
The self-controlled case series (SCCS) offers potential as a statistical method for risk identification involving medical products from large-scale observational healthcare data. However, analytic design choices remain in encoding longitudinal health records into the SCCS framework, and its risk identification performance across real-world databases is unknown. To evaluate the performance of SCCS and its design choices as a tool for risk identification in observational healthcare data, we examined the risk identification performance of SCCS across five design choices using 399 drug-health outcome pairs in five real observational databases (four administrative claims and one electronic health records). In these databases, the pairs involve 165 positive controls and 234 negative controls. We also consider several synthetic databases with known relative risks between drug-outcome pairs. We evaluate risk identification performance by estimating the area under the receiver-operator characteristic curve (AUC), and bias and coverage probability in the synthetic examples. The SCCS achieves strong predictive performance: twelve of the twenty health outcome-database scenarios return AUCs >0.75 across all drugs. Including all adverse events instead of just the first per patient, and applying a multivariate adjustment for concomitant drug use, are the most important design choices. However, the SCCS as applied here returns relative risk point-estimates biased towards the null value of 1 with low coverage probability. The SCCS, recently extended to apply a multivariate adjustment for concomitant drug use, offers promise as a statistical tool for risk identification in large-scale observational healthcare databases. Poor estimator calibration dampens enthusiasm, but ongoing work should correct this shortcoming.
GenoMycDB: a database for comparative analysis of mycobacterial genes and genomes.
Catanho, Marcos; Mascarenhas, Daniel; Degrave, Wim; Miranda, Antonio Basílio de
2006-03-31
Several databases and computational tools have been created with the aim of organizing, integrating and analyzing the wealth of information generated by large-scale sequencing projects of mycobacterial genomes and those of other organisms. However, with very few exceptions, these databases and tools do not allow for massive and/or dynamic comparison of these data. GenoMycDB (http://www.dbbm.fiocruz.br/GenoMycDB) is a relational database built for large-scale comparative analyses of completely sequenced mycobacterial genomes, based on their predicted protein content. Its central structure is composed of the results obtained after pairwise sequence alignments among all the predicted proteins coded by the genomes of six mycobacteria: Mycobacterium tuberculosis (strains H37Rv and CDC1551), M. bovis AF2122/97, M. avium subsp. paratuberculosis K10, M. leprae TN, and M. smegmatis MC2 155. The database stores the computed similarity parameters of every aligned pair, providing for each protein sequence the predicted subcellular localization, the assigned cluster of orthologous groups, the features of the corresponding gene, and links to several important databases. Tables containing pairs or groups of potential homologs between selected species/strains can be produced dynamically by user-defined criteria, based on one or multiple sequence similarity parameters. In addition, searches can be restricted according to the predicted subcellular localization of the protein, the DNA strand of the corresponding gene, and/or the description of the protein. Massive data search and/or retrieval are available, and different ways of exporting the results are offered. GenoMycDB provides an on-line resource for the functional classification of mycobacterial proteins as well as for the analysis of genome structure, organization, and evolution.
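A minimal sketch of the style of user-defined similarity query such a database supports, expressed against a toy SQLite schema (the actual GenoMycDB schema and column names may differ):

```python
# Select potential homolog pairs under user-defined similarity thresholds,
# restricted by predicted subcellular localization.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE alignments (
    query_protein TEXT, subject_protein TEXT,
    identity REAL, coverage REAL, evalue REAL,
    subcellular_loc TEXT)""")
db.executemany("INSERT INTO alignments VALUES (?,?,?,?,?,?)", [
    ("Rv0001", "Mb0001", 98.5, 99.0, 1e-180, "cytoplasm"),
    ("Rv0002", "Mb0002", 45.0, 60.0, 1e-8,   "membrane"),
])

rows = db.execute("""SELECT query_protein, subject_protein, identity
                     FROM alignments
                     WHERE identity >= 80 AND coverage >= 90
                       AND evalue <= 1e-10
                       AND subcellular_loc = 'cytoplasm'""").fetchall()
print(rows)  # -> [('Rv0001', 'Mb0001', 98.5)]
```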
DBMap: a TreeMap-based framework for data navigation and visualization of brain research registry
NASA Astrophysics Data System (ADS)
Zhang, Ming; Zhang, Hong; Tjandra, Donny; Wong, Stephen T. C.
2003-05-01
The purpose of this study is to investigate and apply a new, intuitive, and space-conscious visualization framework to facilitate efficient data presentation and exploration of large-scale data warehouses. We have implemented the DBMap framework for the UCSF Brain Research Registry. Such a utility helps medical specialists and clinical researchers better explore and evaluate the many attributes organized in the brain research registry. The current UCSF Brain Research Registry consists of a federation of disease-oriented database modules, including Epilepsy, Brain Tumor, Intracerebral Hemorrhage, and CJD (Creutzfeldt-Jakob disease). These database modules organize large volumes of imaging and non-imaging data to support Web-based clinical research. While the data warehouse supports general information retrieval and analysis, it lacks an effective way to visualize and present the voluminous and complex data stored. This study investigates whether the TreeMap algorithm can be adapted to display and navigate a categorical biomedical data warehouse or registry. TreeMap is a space-constrained graphical representation of large hierarchical data sets, mapped to a matrix of rectangles whose size and color represent database fields of interest. It allows the display of a large amount of numerical and categorical information in the limited real estate of a computer screen with an intuitive user interface. The paper describes DBMap, the proposed data visualization framework for large biomedical databases. Built upon XML, Java and JDBC technologies, the prototype system includes a set of software modules that reside in the application server tier and interface with the back-end database tier and the front-end Web tier of the brain registry.
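A minimal slice-and-dice treemap layout, the classic algorithm family behind TreeMap displays: rectangle areas are proportional to node weights and the split direction alternates with depth. The names and sizes below are illustrative, not the DBMap schema:

```python
# Recursively assign each node a rectangle proportional to its 'size',
# alternating horizontal/vertical splits with depth.
def treemap(node, x, y, w, h, depth=0, out=None):
    out = [] if out is None else out
    children = node.get("children")
    if not children:
        out.append((node["name"], round(x, 1), round(y, 1),
                    round(w, 1), round(h, 1)))
        return out
    total = sum(c["size"] for c in children)
    offset = 0.0
    for c in children:
        frac = c["size"] / total
        if depth % 2 == 0:   # split along the x-axis
            treemap(c, x + offset * w, y, w * frac, h, depth + 1, out)
        else:                # split along the y-axis
            treemap(c, x, y + offset * h, w, h * frac, depth + 1, out)
        offset += frac
    return out

registry = {"name": "registry", "size": 100, "children": [
    {"name": "Epilepsy", "size": 60},
    {"name": "Brain Tumor", "size": 30},
    {"name": "CJD", "size": 10},
]}
for rect in treemap(registry, 0, 0, 800, 600):
    print(rect)   # (name, x, y, width, height), areas proportional to size
```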
GIS applications for military operations in coastal zones
Fleming, S.; Jordan, T.; Madden, M.; Usery, E.L.; Welch, R.
2009-01-01
In order to successfully support current and future US military operations in coastal zones, geospatial information must be rapidly integrated and analyzed to meet ongoing force structure evolution and new mission directives. Coastal zones in a military-operational environment are complex regions that include sea, land and air features that demand high-volume databases of extreme detail within relatively narrow geographic corridors. Static products in the form of analog maps at varying scales traditionally have been used by military commanders and their operational planners. The rapidly changing battlefield of 21st Century warfare, however, demands dynamic mapping solutions. Commercial geographic information system (GIS) software for military-specific applications is now being developed and employed with digital databases to provide customized digital maps of variable scale, content and symbolization tailored to the unique demands of military units. Research conducted by the Center for Remote Sensing and Mapping Science at the University of Georgia demonstrated the utility of GIS-based analysis and digital map creation when developing large-scale (1:10,000) products from littoral warfare databases. The methodology employed - selection of data sources (including high-resolution commercial images and Lidar), establishment of analysis/modeling parameters, conduct of vehicle mobility analysis, development of models, and generation of products (such as a continuous sea-land DEM and geo-visualization of changing shorelines with tidal levels) - is discussed. Based on observations and identified needs from the National Geospatial-Intelligence Agency, formerly the National Imagery and Mapping Agency, and the Department of Defense, prototype GIS models for military operations in sea, land and air environments were created from multiple data sets of a study area at US Marine Corps Base Camp Lejeune, North Carolina. Results of these models, along with methodologies for developing large-scale littoral warfare databases, aid the National Geospatial-Intelligence Agency in meeting littoral warfare analysis, modeling and map generation requirements for US military organizations. © 2008 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS).
Coates, Jennifer C; Colaiezzi, Brooke A; Bell, Winnie; Charrondiere, U Ruth; Leclercq, Catherine
2017-03-16
An increasing number of low-income countries (LICs) exhibit high rates of malnutrition coincident with rising rates of overweight and obesity. Individual-level dietary data are needed to inform effective responses, yet dietary data from large-scale surveys conducted in LICs remain extremely limited. This discussion paper first seeks to highlight the barriers to collection and use of individual-level dietary data in LICs. Second, it introduces readers to new technological developments and research initiatives to remedy this situation, led by the International Dietary Data Expansion (INDDEX) Project. Constraints to conducting large-scale dietary assessments include significant costs, time burden, technical complexity, and limited investment in dietary research infrastructure, including the necessary tools and databases required to collect individual-level dietary data in large surveys. To address existing bottlenecks, the INDDEX Project is developing a dietary assessment platform for LICs, called INDDEX24, consisting of a mobile application integrated with a web database application, which is expected to facilitate seamless data collection and processing. These tools will be subject to rigorous testing including feasibility, validation, and cost studies. To scale up dietary data collection and use in LICs, the INDDEX Project will also invest in food composition databases, an individual-level dietary data dissemination platform, and capacity development activities. Although the INDDEX Project activities are expected to improve the ability of researchers and policymakers in low-income countries to collect, process, and use dietary data, the global nutrition community is urged to commit further significant investments in order to adequately address the range and scope of challenges described in this paper.
Vullo, Carlos M; Romero, Magdalena; Catelli, Laura; Šakić, Mustafa; Saragoni, Victor G; Jimenez Pleguezuelos, María Jose; Romanini, Carola; Anjos Porto, Maria João; Puente Prieto, Jorge; Bofarull Castro, Alicia; Hernandez, Alexis; Farfán, María José; Prieto, Victoria; Alvarez, David; Penacino, Gustavo; Zabalza, Santiago; Hernández Bolaños, Alejandro; Miguel Manterola, Irati; Prieto, Lourdes; Parsons, Thomas
2016-03-01
The GHEP-ISFG Working Group has recognized the importance of assisting DNA laboratories to gain expertise in handling disaster victim identification (DVI) or missing persons identification (MPI) projects, which involve the need for large-scale genetic profile comparisons. Eleven laboratories participated in a DNA matching exercise to identify victims from a hypothetical conflict with 193 missing persons. The post mortem database comprised 87 skeletal remain profiles from a secondary mass grave displaying a minimum number of 58 individuals with evidence of commingling. The reference database was represented by 286 family reference profiles with diverse pedigrees. The goal of the exercise was to correctly discover re-associations and family matches. The results of direct matching for commingled remains re-associations were correct and fully concordant among all laboratories. However, the kinship analysis for missing persons identifications showed variable results among the participants. There was a group of laboratories with correct, concordant results, but nearly half of the others showed discrepant results, exhibiting likelihood ratio differences of several orders of magnitude in some cases. Three main errors were detected: (a) some laboratories did not use the complete reference family genetic data to report the match with the remains, (b) the identity and/or non-identity hypotheses were sometimes wrongly expressed in the likelihood ratio calculations, and (c) many laboratories did not properly evaluate the prior odds for the event. The results suggest that large-scale profile comparison for DVI or MPI is a challenge for forensic genetics laboratories, and that the statistical treatment of DNA matching and the Bayesian framework should be better standardized among laboratories. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
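Error (c) concerns the Bayesian step of combining a likelihood ratio with prior odds. A minimal sketch, assuming for illustration a naive prior of one target among the 193 missing persons (real casework requires a case-specific prior):

```python
# Posterior probability of identity from a DNA likelihood ratio (LR)
# and prior odds: posterior_odds = LR * prior_odds.
def posterior_probability(lr, prior_odds):
    post_odds = lr * prior_odds
    return post_odds / (1.0 + post_odds)

lr = 1.0e6           # likelihood ratio from the kinship analysis
prior = 1.0 / 193    # illustrative prior: one target among 193 missing
print(f"posterior = {posterior_probability(lr, prior):.6f}")
```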
MetReS, an Efficient Database for Genomic Applications.
Vilaplana, Jordi; Alves, Rui; Solsona, Francesc; Mateo, Jordi; Teixidó, Ivan; Pifarré, Marc
2018-02-01
MetReS (Metabolic Reconstruction Server) is a genomic database that is shared between two software applications that address important biological problems. Biblio-MetReS is a data-mining tool that enables the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the processes of interest and their function. The main goal of this work was to identify the areas where the performance of the MetReS database could be improved and to test whether this improvement would scale to larger datasets and more complex types of analysis. The study started with the relational database MySQL, the current database server used by the applications. We also tested the performance of an alternative data-handling framework, Apache Hadoop, which is widely used for large-scale data processing. We found that this data-handling framework is likely to greatly improve the efficiency of the MetReS applications as the dataset and the processing needs increase by several orders of magnitude, as is expected to happen in the near future.
Database recovery using redundant disk arrays
NASA Technical Reports Server (NTRS)
Mourad, Antoine N.; Fuchs, W. K.; Saab, Daniel G.
1992-01-01
Redundant disk arrays provide a way for achieving rapid recovery from media failures with a relatively low storage cost for large scale database systems requiring high availability. In this paper a method is proposed for using redundant disk arrays to support rapid recovery from system crashes and transaction aborts in addition to their role in providing media failure recovery. A twin page scheme is used to store the parity information in the array so that the time for transaction commit processing is not degraded. Using an analytical model, it is shown that the proposed method achieves a significant increase in the throughput of database systems using redundant disk arrays by reducing the number of recovery operations needed to maintain the consistency of the database.
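A minimal sketch of the parity principle such arrays rely on: the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors (the paper's twin-page scheme additionally keeps parity on two pages so commit processing is not delayed):

```python
# XOR parity: parity = d0 ^ d1 ^ ... ^ dn; a lost block equals the XOR of
# the remaining data blocks and the parity block.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"page-one", b"page-two", b"page-ten"]   # equal-sized blocks
parity = xor_blocks(data)

lost = data[1]                                   # pretend disk 1 failed
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == lost
print(recovered)   # b'page-two'
```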
Recovery issues in databases using redundant disk arrays
NASA Technical Reports Server (NTRS)
Mourad, Antoine N.; Fuchs, W. K.; Saab, Daniel G.
1993-01-01
Redundant disk arrays provide a way for achieving rapid recovery from media failures with a relatively low storage cost for large scale database systems requiring high availability. In this paper we propose a method for using redundant disk arrays to support rapid recovery from system crashes and transaction aborts in addition to their role in providing media failure recovery. A twin page scheme is used to store the parity information in the array so that the time for transaction commit processing is not degraded. Using an analytical model, we show that the proposed method achieves a significant increase in the throughput of database systems using redundant disk arrays by reducing the number of recovery operations needed to maintain the consistency of the database.
Cormode, Graham; Dasgupta, Anirban; Goyal, Amit; Lee, Chi Hoon
2018-01-01
Many modern applications of AI, such as web search, mobile browsing, image processing, and natural language processing, rely on finding similar items in a large database of complex objects. Due to the very large scale of the data involved (e.g., users' queries from commercial search engines), computing such near or nearest neighbors is a non-trivial task, as the computational cost grows significantly with the number of items. To address this challenge, we adopt Locality Sensitive Hashing (LSH) methods and evaluate four variants in a distributed computing environment (specifically, Hadoop). We identify several optimizations which improve performance and are suitable for deployment in very large scale settings. The experimental results demonstrate that our LSH variants achieve robust performance with better recall compared with "vanilla" LSH, even when using the same amount of space.
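A hedged sketch of one standard LSH family that fits this setting: random-hyperplane signatures banded into hash buckets, so items sharing any band bucket become candidate neighbors. The band and row counts are illustrative tuning knobs, not the paper's configuration:

```python
# Random-hyperplane LSH for cosine similarity: each band hashes a vector
# to the sign pattern of ROWS random projections; collisions in any band
# make two vectors candidate neighbors.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
DIM, BANDS, ROWS = 64, 4, 8
planes = rng.normal(size=(BANDS, ROWS, DIM))

def candidate_buckets(v):
    sig = (np.einsum("brd,d->br", planes, v) >= 0).astype(np.uint8)
    return [(b, sig[b].tobytes()) for b in range(BANDS)]

index = defaultdict(set)
data = {i: rng.normal(size=DIM) for i in range(1000)}
data[1001] = data[0] + 0.05 * rng.normal(size=DIM)   # a near-duplicate

for key, vec in data.items():
    for bucket in candidate_buckets(vec):
        index[bucket].add(key)

# Query: union of items sharing any band bucket with the query vector.
cands = set().union(*(index[b] for b in candidate_buckets(data[0])))
print(1001 in cands, len(cands) < len(data))   # very likely: True True
```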
Wilshire, Howard G.; Bedford, David R.; Coleman, Teresa
2002-01-01
3. Plottable map representations of the database at 1:24,000 scale in PostScript and Adobe PDF formats. The plottable files consist of a color geologic map derived from the spatial database, composited with a topographic base map in the form of the USGS Digital Raster Graphic for the map area. Color symbology from each of these datasets is maintained, which can cause plot file sizes to be large.
A Two-Layer Least Squares Support Vector Machine Approach to Credit Risk Assessment
NASA Astrophysics Data System (ADS)
Liu, Jingli; Li, Jianping; Xu, Weixuan; Shi, Yong
Least squares support vector machine (LS-SVM) is a revised version of the support vector machine (SVM) and has proven to be a useful tool for pattern recognition. LS-SVM offers excellent generalization performance at low computational cost. In this paper, we propose a new method, called the two-layer least squares support vector machine, which combines kernel principal component analysis (KPCA) with the linear programming form of the least squares support vector machine. With this method, sparseness and robustness are obtained while solving high-dimensional, large-scale database problems. A U.S. commercial credit card database is used to test the efficiency of our method, and the results proved satisfactory.
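For reference, plain LS-SVM training reduces to solving one linear system rather than a quadratic program. A minimal sketch of that base method (not the paper's two-layer KPCA variant), with illustrative hyperparameters:

```python
# LS-SVM dual: solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y].
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]           # bias b, coefficients alpha

def lssvm_predict(X, Xtr, b, alpha, sigma=1.0):
    return np.sign(rbf(X, Xtr, sigma) @ alpha + b)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])     # XOR, labels in {-1, +1}
b, alpha = lssvm_fit(X, y)
print(lssvm_predict(X, X, b, alpha))  # -> [-1.  1.  1. -1.]
```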
Sorokin, Anatoly; Selkov, Gene; Goryanin, Igor
2012-07-16
The volume of the experimentally measured time series data is rapidly growing, while storage solutions offering better data types than simple arrays of numbers or opaque blobs for keeping series data are sorely lacking. A number of indexing methods have been proposed to provide efficient access to time series data, but none has so far been integrated into a tried-and-proven database system. To explore the possibility of such integration, we have developed a data type for time series storage in PostgreSQL, an object-relational database system, and equipped it with an access method based on SAX (Symbolic Aggregate approXimation). This new data type has been successfully tested in a database supporting a large-scale plant gene expression experiment, and it was additionally tested on a very large set of simulated time series data. Copyright © 2011 Elsevier B.V. All rights reserved.
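A minimal sketch of the SAX transform underlying the access method: z-normalize, reduce with piecewise aggregate approximation (PAA), then map segment means to symbols via Gaussian breakpoints (alphabet size 4 here; the series length is assumed divisible by the segment count):

```python
# SAX: z-normalize, PAA-reduce, then discretize segment means into symbols
# using the quartile breakpoints of the standard normal distribution.
import numpy as np

BREAKPOINTS = [-0.6745, 0.0, 0.6745]   # quartiles of N(0,1) for alphabet 4
ALPHABET = "abcd"

def sax(series, n_segments):
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                   # z-normalize
    paa = x.reshape(n_segments, -1).mean(axis=1)   # piecewise means
    return "".join(ALPHABET[np.searchsorted(BREAKPOINTS, v)] for v in paa)

print(sax([1, 2, 3, 4, 10, 12, 11, 9, 3, 2, 1, 2], n_segments=4))  # 'adca'
```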
Massive parallelization of serial inference algorithms for a complex generalized linear model
Suchard, Marc A.; Simpson, Shawn E.; Zorych, Ivan; Ryan, Patrick; Madigan, David
2014-01-01
Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. In this paper we show how high-performance statistical computation, including graphics processing units (GPUs), relatively inexpensive and highly parallel computing devices, can enable complex methods in large databases. We focus on optimization and massive parallelization of cyclic coordinate descent approaches to fit a conditioned generalized linear model involving tens of millions of observations and thousands of predictors in a Bayesian context. We find orders-of-magnitude improvement in overall run-time. Coordinate descent approaches are ubiquitous in high-dimensional statistics, and the algorithms we propose open up exciting new methodological possibilities with the potential to significantly improve drug safety. PMID:25328363
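To make the algorithmic pattern concrete, here is cyclic coordinate descent on a toy L1-penalized least-squares problem, where each coordinate update is closed-form; the paper applies the same per-coordinate pattern to a conditioned generalized linear model at vastly larger scale:

```python
# Cyclic coordinate descent for the lasso: sweep over coordinates,
# updating each via soft-thresholding of its partial residual.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam=0.05, sweeps=100):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            beta[j] = soft_threshold(X[:, j] @ r, n * lam) / col_sq[j]
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=200)
print(np.round(lasso_cd(X, y), 2))   # close to [2, 0, -1, 0, 0.5]
```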
PREPping Students for Authentic Science
ERIC Educational Resources Information Center
Dolan, Erin L.; Lally, David J.; Brooks, Eric; Tax, Frans E.
2008-01-01
In this article, the authors describe a large-scale research collaboration, the Partnership for Research and Education in Plants (PREP), which has capitalized on publicly available databases that contain massive amounts of biological information; stock centers that house and distribute inexpensive organisms with different genotypes; and the…
Xu, Weijia; Ozer, Stuart; Gutell, Robin R
2009-01-01
With an increasingly large number of properly aligned sequences, comparative sequence analysis can accurately identify not only common structures formed by standard base pairing but also new types of structural elements and constraints. However, traditional methods are too computationally expensive to perform well on large-scale alignments and are less effective with sequences from diversified phylogenetic classifications. We propose a new approach that utilizes coevolution rates among pairs of nucleotide positions, using the phylogenetic and evolutionary relationships of the organisms of the aligned sequences. With a novel data schema to manage the relevant information within a relational database, our method, implemented with Microsoft SQL Server 2005, showed 90% sensitivity in identifying base pair interactions among 16S ribosomal RNA sequences from Bacteria, at a scale 40 times larger, and with 50% better sensitivity, than a previous study. The results also indicated covariation signals for a few sets of cross-strand base stacking pairs in secondary structure helices, and other subtle constraints in the RNA structure.
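A minimal sketch of the covariation signal at the heart of such analyses: mutual information between two alignment columns, which is high when positions co-vary as base-paired partners do. A real pipeline also weights sequences by phylogeny, which this toy omits:

```python
# Mutual information between two alignment columns (in bits).
from collections import Counter
from math import log2

def mutual_information(col_i, col_j):
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    return sum((c / n) * log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

# Toy alignment columns: 1 and 2 covary via compensatory changes that
# preserve Watson-Crick pairing; column 3 is invariant.
col1 = "GGGCCCAAUU"
col2 = "CCCGGGUUAA"        # complementary partner of col1 in every sequence
col3 = "AAAAAAAAAA"
print(mutual_information(col1, col2))   # high (equals the entropy of col1)
print(mutual_information(col1, col3))   # exactly 0
```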
Large Survey Database: A Distributed Framework for Storage and Analysis of Large Datasets
NASA Astrophysics Data System (ADS)
Juric, Mario
2011-01-01
The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes and can be made to function in "shared nothing" architectures. An LSD database consists of a set of vertically and horizontally partitioned tables, physically stored as compressed HDF5 files. Vertically, we partition the tables into groups of related columns ("column groups"), storing together logically related data (e.g., astrometry, photometry). Horizontally, the tables are partitioned into partially overlapping "cells" by position in space (lon, lat) and time (t). This organization allows for fast lookups based on spatial and temporal coordinates, as well as data and task distribution. The design was inspired by the success of Google BigTable (Chang et al., 2006). Our programming model is a pipelined extension of MapReduce (Dean and Ghemawat, 2004). An SQL-like query language is used to access data. For complex tasks, map-reduce "kernels" that operate on query results on a per-cell basis can be written, with the framework taking care of scheduling and execution. The combination leverages users' familiarity with SQL, while offering a fully distributed computing environment. LSD adds little overhead compared to direct Python file I/O. In tests, we swept through 1.1 billion rows of Pan-STARRS+SDSS data (220 GB) in less than 15 minutes on a dual-CPU machine. In a cluster environment, we achieved bandwidths of 17 Gbit/s (I/O limited). Based on current experience, we believe LSD should scale to be useful for analysis and storage of LSST-scale datasets. It can be downloaded from http://mwscience.net/lsd.
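A minimal sketch of the horizontal partitioning idea: rows are binned into (lon, lat, t) cells so positional and temporal lookups touch only the relevant partitions. Cell sizes are illustrative, not LSD's actual scheme:

```python
# Bin catalog rows into spatial-temporal cells keyed by (lon, lat, t).
from collections import defaultdict

def cell_key(lon, lat, t, dlon=1.0, dlat=1.0, dt=30.0):
    """Cell index for a row; dlon/dlat in degrees, dt in days (MJD)."""
    return (int(lon // dlon), int(lat // dlat), int(t // dt))

cells = defaultdict(list)
rows = [(210.3, -5.2, 55100.5), (210.7, -5.9, 55112.0), (12.0, 40.1, 55100.9)]
for lon, lat, t in rows:
    cells[cell_key(lon, lat, t)].append((lon, lat, t))

# A positional query only needs to open the matching cell's storage.
print(cells[cell_key(210.5, -5.5, 55100.0)])   # -> [(210.3, -5.2, 55100.5)]
```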
Unified Access Architecture for Large-Scale Scientific Datasets
NASA Astrophysics Data System (ADS)
Karna, Risav
2014-05-01
Data-intensive sciences have to deploy diverse large-scale database technologies for data analytics, as scientists now deal with much larger data volumes than ever before. While array databases have bridged many gaps between the needs of data-intensive research fields and DBMS technologies (Zhang 2011), invocation of the other big data tools accompanying these databases is still manual and separate from the database management interface. We identify this as an architectural challenge that will increasingly complicate the user's workflow owing to the growing number of useful but isolated and niche database tools. Such use of data analysis tools in effect leaves the burden on the user's end to synchronize the results from other data manipulation and analysis tools with the database management system. To this end, we propose a unified access interface for using big data tools within a large-scale scientific array database, using the database queries themselves to embed foreign routines belonging to the big data tools. Such invocation of foreign data manipulation routines inside a database query can be made possible through a user-defined function (UDF). UDFs that allow such levels of freedom as to call modules from another language, and to interface back and forth between the query body and the side-loaded functions, are needed for this purpose. For the purpose of this research we attempt coupling of four widely used tools - Hadoop (hadoop1), Matlab (matlab1), R (r1) and ScaLAPACK (scalapack1) - with the UDF feature of rasdaman (Baumann 98), an array-based data manager, to investigate this concept. The native array data model used by an array-based data manager provides compact data storage and high-performance operations on ordered data such as spatial data, temporal data, and matrix-based data for linear algebra operations (scidbusr1). Performance issues arising from the coupling of tools with different paradigms, niche functionalities, separate processes and output data formats have been anticipated and considered during the design of the unified architecture. The research focuses on the feasibility of the designed coupling mechanism and the evaluation of the efficiency and benefits of our proposed unified access architecture. Zhang 2011: Zhang, Ying and Kersten, Martin and Ivanova, Milena and Nes, Niels, SciQL: Bridging the Gap Between Science and Relational DBMS, Proceedings of the 15th Symposium on International Database Engineering Applications, 2011. Baumann 98: Baumann, P., Dehmel, A., Furtado, P., Ritsch, R., Widmann, N., "The Multidimensional Database System RasDaMan", SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, 1998. hadoop1: hadoop.apache.org, "Hadoop", http://hadoop.apache.org/ [Online; accessed 12-Jan-2014]. scalapack1: netlib.org/scalapack, "ScaLAPACK", http://www.netlib.org/scalapack [Online; accessed 12-Jan-2014]. r1: r-project.org, "R", http://www.r-project.org/ [Online; accessed 12-Jan-2014]. matlab1: mathworks.com, "Matlab Documentation", http://www.mathworks.de/de/help/matlab/ [Online; accessed 12-Jan-2014]. scidbusr1: scidb.org, "SciDB User's Guide", http://scidb.org/HTMLmanual/13.6/scidb_ug [Online; accessed 01-Dec-2013].
DOE Office of Scientific and Technical Information (OSTI.GOV)
Roehm, Dominic; Pavel, Robert S.; Barros, Kipton
We present an adaptive sampling method supplemented by a distributed database and a prediction method for multiscale simulations using the Heterogeneous Multiscale Method. A finite-volume scheme integrates the macro-scale conservation laws for elastodynamics, which are closed by momentum and energy fluxes evaluated at the micro-scale. In the original approach, molecular dynamics (MD) simulations are launched for every macro-scale volume element. Our adaptive sampling scheme replaces a large fraction of costly micro-scale MD simulations with fast table lookup and prediction. The cloud database Redis provides the plain table lookup, and with locality-aware hashing we gather input data for our prediction scheme. For the latter we use kriging, which estimates an unknown value and its uncertainty (error) at a specific location in parameter space by using weighted averages of the neighboring points. We find that our adaptive scheme significantly improves simulation performance by a factor of 2.5 to 25, while retaining high accuracy for various choices of the algorithm parameters.
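A minimal sketch of the kriging-plus-fallback logic described, assuming a toy one-dimensional parameter space, an RBF kernel, and an illustrative uncertainty threshold (the production scheme operates on higher-dimensional strain states and a Redis-backed table):

```python
# Simple kriging: predict a value and its variance from stored neighbors;
# fall back to a full MD evaluation when the variance is too large.
import numpy as np

def kriging_predict(Xtr, ytr, xq, length=1.0, noise=1e-8):
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length ** 2)
    K = k(Xtr, Xtr) + noise * np.eye(len(Xtr))
    kq = k(Xtr, xq[None, :])[:, 0]
    w = np.linalg.solve(K, kq)        # kriging weights
    mean = w @ ytr
    var = 1.0 - kq @ w                # prior variance k(x, x) = 1
    return mean, max(var, 0.0)

Xtr = np.array([[0.0], [0.5], [1.0]])   # previously sampled states
ytr = np.sin(Xtr[:, 0])                 # stored micro-scale responses (toy)
mean, var = kriging_predict(Xtr, ytr, np.array([0.4]))
if var > 1e-3:                          # uncertainty too high to trust?
    print("launch MD simulation instead")
else:
    print(f"use prediction {mean:.4f} (var {var:.2e})")
```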
Large scale validation of the M5L lung CAD on heterogeneous CT datasets.
Torres, E Lopez; Fiorina, E; Pennazio, F; Peroni, C; Saletta, M; Camarlinghi, N; Fantacci, M E; Cerello, P
2015-04-01
M5L, a fully automated computer-aided detection (CAD) system for the detection and segmentation of lung nodules in thoracic computed tomography (CT), is presented and validated on several image datasets. M5L is the combination of two independent subsystems, based on the Channeler Ant Model as a segmentation tool [lung channeler ant model (lungCAM)] and on a voxel-based neural approach. The lungCAM was upgraded with a scan equalization module and a new procedure to recover nodules connected to other lung structures; its classification module, which makes use of a feed-forward neural network, is based on a small number of features (13), so as to minimize the risk of poor generalization given the large difference between the sizes of the training and testing datasets, which contain 94 and 1019 CTs, respectively. The lungCAM (standalone) and M5L (combined) performance was extensively tested on 1043 CT scans from three independent datasets, including a detailed analysis of the full Lung Image Database Consortium/Image Database Resource Initiative database, which is not yet found in the literature. The lungCAM and M5L performance is consistent across the databases, with a sensitivity of about 70% and 80%, respectively, at eight false positive findings per scan, despite the variable annotation criteria and acquisition and reconstruction conditions. A reduced sensitivity is found for subtle nodules and ground-glass opacity (GGO) structures. A comparison with other CAD systems is also presented. The M5L performance on a large and heterogeneous dataset is stable and satisfactory, although the development of a dedicated module for GGO detection could further improve it, as could an iterative optimization of the training procedure. The main aim of the present study was accomplished: M5L results do not deteriorate as the dataset size increases, making it a candidate for supporting radiologists in large-scale screenings and clinical programs.
A Priori Subgrid Analysis of Temporal Mixing Layers with Evaporating Droplets
NASA Technical Reports Server (NTRS)
Okongo, Nora; Bellan, Josette
1999-01-01
Subgrid analysis of a transitional temporal mixing layer with evaporating droplets has been performed using three sets of results from a Direct Numerical Simulation (DNS) database, with Reynolds numbers (based on initial vorticity thickness) as large as 600 and with droplet mass loadings as large as 0.5. In the DNS, the gas phase is computed using an Eulerian formulation, with Lagrangian droplet tracking. The Large Eddy Simulation (LES) equations corresponding to the DNS are first derived, and key assumptions made in deriving them are confirmed by computing the relevant terms from the DNS database. Since LES of this flow requires the computation of unfiltered gas-phase variables at droplet locations from filtered gas-phase variables at the grid points, it is proposed to model the unfiltered variables as the sum of the filtered variables and a correction based on the filtered standard deviation; this correction is then computed from the Subgrid Scale (SGS) standard deviation. This model predicts the unfiltered variables at droplet locations considerably better than simply interpolating the filtered variables. Three methods are investigated for modeling the SGS standard deviation: the Smagorinsky approach, the Gradient model, and the Scale-Similarity formulation. When the proportionality constant inherent in the SGS models is properly calculated, the Gradient and Scale-Similarity methods give results in excellent agreement with the DNS.
Monitoring aquatic resources for regional assessments requires an accurate and comprehensive inventory of the resource and a useful classification of ecosystem similarities. Our research effort to create an electronic database and work with various ways to classify coastal wetlands...
Learning Deep Representations for Ground to Aerial Geolocalization (Open Access)
2015-10-15
The proposed approach, Where-CNN, is inspired by deep learning success in face verification and achieves significant improvements over traditional hand-crafted features and existing deep features learned from other large-scale databases. We show the effectiveness of Where-CNN in finding matches between ground-level and aerial images.
Health-Terrain: Visualizing Large Scale Health Data
2014-12-01
... systems can only be realized if the quality of emerging large medical databases can be characterized and the meaning of the data understood. ... Designed and tested an evaluation procedure for the health data visualization system. This visualization framework offers a real-time, web-based solution. ... Each rule is shown in the table with its quality measures, including support, confidence, Laplace, gain, p-s, lift, and conviction.
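For reference, the first three of those rule-quality measures can be computed directly from transaction data; a minimal sketch with toy records (the report's table also lists Laplace, gain, p-s, and conviction variants):

```python
# Support, confidence, and lift for an association rule A -> C over a
# list of transactions represented as sets.
def rule_metrics(transactions, antecedent, consequent):
    n = len(transactions)
    a = sum(antecedent <= t for t in transactions)            # count(A)
    ab = sum((antecedent | consequent) <= t for t in transactions)
    c = sum(consequent <= t for t in transactions)            # count(C)
    support = ab / n
    confidence = ab / a if a else 0.0
    lift = confidence / (c / n) if c else 0.0
    return support, confidence, lift

tx = [{"fever", "cough", "flu"}, {"fever", "flu"},
      {"cough"}, {"fever", "rash"}]
print(rule_metrics(tx, {"fever"}, {"flu"}))   # (0.5, 0.667, 1.333)
```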
Deep learning with non-medical training used for chest pathology identification
NASA Astrophysics Data System (ADS)
Bar, Yaniv; Diamant, Idit; Wolf, Lior; Greenspan, Hayit
2015-03-01
In this work, we examine the strength of deep learning approaches for pathology detection in chest radiograph data. Convolutional neural network (CNN) deep-architecture classification approaches have gained popularity due to their ability to learn mid- and high-level image representations. We explore the ability of a CNN to identify different types of pathologies in chest x-ray images. Moreover, since very large training sets are generally not available in the medical domain, we explore the feasibility of using a deep learning approach based on non-medical learning. We tested our algorithm on a dataset of 93 images. We use a CNN that was trained with ImageNet, a well-known large-scale non-medical image database. The best performance was achieved using a combination of features extracted from the CNN and a set of low-level features. We obtained an area under the curve (AUC) of 0.93 for right pleural effusion detection, 0.89 for enlarged heart detection, and 0.79 for classification between healthy and abnormal chest x-rays, where all pathologies are combined into one large class. This is a first-of-its-kind experiment showing that deep learning with large-scale non-medical image databases may be sufficient for general medical image recognition tasks.
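A hedged sketch of this transfer-learning recipe using torchvision's ResNet-18 as a stand-in for the authors' ImageNet-trained CNN; the synthetic images and labels below are placeholders for the real chest x-ray data, which is not bundled here:

```python
# Extract features from an ImageNet-pretrained CNN, train a simple
# classifier, and report AUC.
import numpy as np
import torch
import torchvision
from torchvision import transforms
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()    # expose the 512-d penultimate features
model.eval()

prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def features(images):
    batch = torch.stack([prep(im.convert("RGB")) for im in images])
    with torch.no_grad():
        return model(batch).numpy()

# Placeholder data standing in for the x-ray set (random noise here).
rng = np.random.default_rng(0)
fake = lambda: Image.fromarray(rng.integers(0, 255, (256, 256), dtype=np.uint8))
train_imgs, y_train = [fake() for _ in range(16)], [0, 1] * 8
test_imgs, y_test = [fake() for _ in range(8)], [0, 1] * 4

clf = LogisticRegression(max_iter=1000).fit(features(train_imgs), y_train)
print(roc_auc_score(y_test, clf.predict_proba(features(test_imgs))[:, 1]))
```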
Real-time terrain storage generation from multiple sensors towards mobile robot operation interface.
Song, Wei; Cho, Seoungjae; Xi, Yulong; Cho, Kyungeun; Um, Kyhyun
2014-01-01
A mobile robot mounted with multiple sensors is used to rapidly collect 3D point clouds and video images so as to allow accurate terrain modeling. In this study, we develop a real-time terrain storage generation and representation system including a nonground point database (PDB), ground mesh database (MDB), and texture database (TDB). A voxel-based flag map is proposed for incrementally registering large-scale point clouds in a terrain model in real time. We quantize the 3D point clouds into 3D grids of the flag map as a comparative table in order to remove the redundant points. We integrate the large-scale 3D point clouds into a nonground PDB and a node-based terrain mesh using the CPU. Subsequently, we program a graphics processing unit (GPU) to generate the TDB by mapping the triangles in the terrain mesh onto the captured video images. Finally, we produce a nonground voxel map and a ground textured mesh as a terrain reconstruction result. Our proposed methods were tested in an outdoor environment. Our results show that the proposed system was able to rapidly generate terrain storage and provide high resolution terrain representation for mobile mapping services and a graphical user interface between remote operators and mobile robots.
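A minimal sketch of the voxel-based flag map described: incoming points are quantized to a voxel grid and kept only if their voxel has not been seen before, which removes redundant points during incremental registration (the voxel size is an illustrative parameter):

```python
# Incremental registration with a voxel flag set: one representative
# point per occupied voxel.
def register(points, voxel=0.1, flags=None):
    flags = set() if flags is None else flags
    kept = []
    for x, y, z in points:
        key = (int(x // voxel), int(y // voxel), int(z // voxel))
        if key not in flags:          # first point seen in this voxel
            flags.add(key)
            kept.append((x, y, z))
    return kept, flags

scan1 = [(0.01, 0.02, 0.0), (0.03, 0.04, 0.0), (0.95, 0.10, 0.2)]
scan2 = [(0.02, 0.01, 0.0), (1.50, 0.00, 0.0)]   # first point is redundant
kept1, flags = register(scan1)
kept2, flags = register(scan2, flags=flags)
print(len(kept1), len(kept2))   # -> 2 1
```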
NASA Technical Reports Server (NTRS)
Parsons, David S.; Ordway, David O.; Johnson, Kenneth L.
2013-01-01
This experimental study seeks to quantify the impact various composite parameters have on the structural response of a composite structure in a pyroshock environment. The prediction of an aerospace structure's response to pyroshock induced loading is largely dependent on empirical databases created from collections of development and flight test data. While there is significant structural response data due to pyroshock induced loading for metallic structures, there is much less data available for composite structures. One challenge of developing a composite pyroshock response database as well as empirical prediction methods for composite structures is the large number of parameters associated with composite materials. This experimental study uses data from a test series planned using design of experiments (DOE) methods. Statistical analysis methods are then used to identify which composite material parameters most greatly influence a flat composite panel's structural response to pyroshock induced loading. The parameters considered are panel thickness, type of ply, ply orientation, and pyroshock level induced into the panel. The results of this test will aid in future large scale testing by eliminating insignificant parameters as well as aid in the development of empirical scaling methods for composite structures' response to pyroshock induced loading.
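A hedged sketch, not the study's actual analysis, of how a full-factorial design over the four quoted factors can be screened with ANOVA to rank parameter effects. The statsmodels library is assumed, and the response values are synthetic placeholders.

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

levels = {
    "thickness":   ["thin", "thick"],
    "ply_type":    ["uni", "fabric"],
    "orientation": ["0_90", "45"],
    "shock_level": ["low", "high"],
}
# Full-factorial DOE layout: one run per combination of factor levels.
runs = pd.DataFrame(list(itertools.product(*levels.values())),
                    columns=list(levels.keys()))
rng = np.random.default_rng(0)
runs["peak_response"] = rng.normal(100, 10, len(runs))  # placeholder shock response (g)

# Main-effects ANOVA: which factors most influence the panel response?
model = smf.ols(
    "peak_response ~ C(thickness) + C(ply_type) + C(orientation) + C(shock_level)",
    data=runs,
).fit()
print(sm.stats.anova_lm(model, typ=2))
```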
Privacy-preserving search for chemical compound databases.
Shimizu, Kana; Nuida, Koji; Arai, Hiromi; Mitsunari, Shigeo; Attrapadung, Nuttapong; Hamada, Michiaki; Tsuda, Koji; Hirokawa, Takatsugu; Sakuma, Jun; Hanaoka, Goichiro; Asai, Kiyoshi
2015-01-01
Searching for similar compounds in a database is the most important process for in-silico drug screening. Since a query compound is an important starting point for the new drug, a query holder, who is afraid of the query being monitored by the database server, usually downloads all the records in the database and uses them in a closed network. However, a serious dilemma arises when the database holder also wants to output no information except for the search results, and such a dilemma prevents the use of many important data resources. In order to overcome this dilemma, we developed a novel cryptographic protocol that enables database searching while keeping both the query holder's privacy and database holder's privacy. Generally, the application of cryptographic techniques to practical problems is difficult because versatile techniques are computationally expensive while computationally inexpensive techniques can perform only trivial computation tasks. In this study, our protocol is successfully built only from an additive-homomorphic cryptosystem, which allows only addition performed on encrypted values but is computationally efficient compared with versatile techniques such as general purpose multi-party computation. In an experiment searching ChEMBL, which consists of more than 1,200,000 compounds, the proposed method was 36,900 times faster in CPU time and 12,000 times as efficient in communication size compared with general purpose multi-party computation. We proposed a novel privacy-preserving protocol for searching chemical compound databases. The proposed method, easily scaling for large-scale databases, may help to accelerate drug discovery research by making full use of unused but valuable data that includes sensitive information. PMID:26678650
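A minimal sketch of the additive-homomorphic primitive the protocol builds on, assuming the open-source `phe` (python-paillier) library; this illustrates the general idea of computing on encrypted fingerprint bits, not the paper's actual protocol. The server sums selected ciphertexts without ever seeing the query.

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

query_fp = [1, 0, 1, 1, 0, 1]      # query compound fingerprint (client side)
db_fp    = [1, 1, 1, 0, 0, 1]      # one database compound (server side)

# Client encrypts each query bit and sends the ciphertexts.
enc_query = [public_key.encrypt(b) for b in query_fp]

# Server computes the encrypted dot product using only additions of
# ciphertexts selected by its own plaintext bits.
enc_dot = sum(c for c, b in zip(enc_query, db_fp) if b)

# Client decrypts the shared-bit count (here 3), a building block for
# fingerprint similarity scores.
print(private_key.decrypt(enc_dot))
```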
Ascoli, Davide; Vacchiano, Giorgio; Turco, Marco; Conedera, Marco; Drobyshev, Igor; Maringer, Janet; Motta, Renzo; Hacket-Pain, Andrew
2017-12-20
Climate teleconnections drive highly variable and synchronous seed production (masting) over large scales. Disentangling the effect of high-frequency (inter-annual variation) from low-frequency (decadal trends) components of climate oscillations will improve our understanding of masting as an ecosystem process. Using century-long observations on masting (the MASTREE database) and data on the North Atlantic Oscillation (NAO), we show that in the last 60 years both the high-frequency summer and spring NAO and the low-frequency winter NAO components are highly correlated with continent-wide masting in European beech and Norway spruce. Relationships are weaker (non-stationary) in the early twentieth century. This finding improves our understanding of how climate variation affects large-scale synchronization of tree masting. Moreover, it supports the connection between proximate and ultimate causes of masting: indeed, large-scale features of atmospheric circulation coherently drive cues and resources for masting, as well as its evolutionary drivers, such as pollination efficiency, abundance of seed dispersers, and natural disturbance regimes.
Advanced Model for Extreme Lift and Improved Aeroacoustics (AMELIA)
NASA Technical Reports Server (NTRS)
Lichtwardt, Jonathan; Paciano, Eric; Jameson, Tina; Fong, Robert; Marshall, David
2012-01-01
With the very recent advent of NASA's Environmentally Responsible Aviation (ERA) Project, which is dedicated to designing aircraft that will reduce the impact of aviation on the environment, there is a need for research and development of methodologies to minimize fuel burn, emissions, and community noise produced by regional airliners. ERA tackles airframe technology, propulsion technology, and vehicle systems integration to meet performance objectives in a time frame that brings the aircraft to a Technology Readiness Level (TRL) of 4-6 by the year 2020 (deemed N+2). The preceding project that investigated similar goals was NASA's Subsonic Fixed Wing (SFW) project, which focused on conducting research to improve prediction methods and technologies for lower-noise, lower-emission, and higher-performing subsonic aircraft for the Next Generation Air Transportation System. The work provided in this investigation was a NASA Research Announcement (NRA) contract (#NNL07AA55C) funded by Subsonic Fixed Wing. The project started in 2007 with the specific goal of conducting a large-scale wind tunnel test along with the development of new and improved predictive codes for advanced powered-lift concepts. Many of the predictive codes were used to refine the wind tunnel model's outer mold line design. The goal of the large-scale wind tunnel test was to investigate powered-lift technologies and provide an experimental database to validate current and future modeling techniques. The powered-lift concept investigated was a Circulation Control (CC) wing in conjunction with over-the-wing mounted engines, whose exhaust is entrained to further increase the lift generated by CC technologies alone. The NRA was a five-year effort: during the first year, the objective was to select and refine CESTOL concepts and complete a preliminary design of a large-scale wind tunnel model; during the second, third, and fourth years, the model was designed, manufactured, and calibrated; and during the fifth year, the wind tunnel test was conducted. This technical memo describes all phases of the Advanced Model for Extreme Lift and Improved Aeroacoustics (AMELIA) project and provides a brief summary of the background and modeling efforts involved in the NRA. The conceptual designs considered for the project and the decision process leading to the configuration adapted for a wind tunnel model are briefly discussed, as are the internal configuration of AMELIA and the internal measurements chosen to satisfy the requirement of obtaining an experimental database for future computational model validations. The external experimental techniques employed during the test, along with the large-scale wind tunnel facility, are covered in detail. Experimental measurements in the database include forces and moments, surface pressure distributions, local skin friction measurements, boundary and shear layer velocity profiles, far-field acoustic data, and noise signatures from turbofan propulsion simulators. Results and discussion of the circulation control performance, the over-the-wing mounted engines, and the combined performance are also presented.
Quadratic integrand double-hybrid made spin-component-scaled
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brémond, Éric, E-mail: eric.bremond@iit.it; Savarese, Marika; Sancho-García, Juan C.
2016-03-28
We propose two analytical expressions aiming to rationalize the spin-component-scaled (SCS) and spin-opposite-scaled (SOS) schemes for double-hybrid exchange-correlation density functionals. Their performances are extensively tested within the framework of the nonempirical quadratic integrand double-hybrid (QIDH) model on energetic properties included in the very large GMTKN30 benchmark database, and on structural properties of semirigid medium-sized organic compounds. The SOS variant is revealed as a less computationally demanding alternative that reaches the accuracy of the original QIDH model without losing any theoretical background.
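For orientation, the generic structure of a double-hybrid functional with a spin-component-scaled perturbative term is sketched below; this is the standard textbook form, quoted as an illustration, and the specific coefficients and analytical expressions of the QIDH variants are given in the paper itself.

```latex
% Generic double-hybrid exchange-correlation energy with an SCS perturbative
% correlation term. The same-spin (ss) and opposite-spin (os) MP2-like
% contributions are scaled separately; SOS schemes set c_ss = 0.
\begin{align}
E_{xc} &= a_x E_x^{\mathrm{HF}} + (1-a_x)\,E_x^{\mathrm{DFT}}
        + (1-a_c)\,E_c^{\mathrm{DFT}} + a_c E_c^{\mathrm{PT2}},\\
E_c^{\mathrm{PT2}} &= c_{\mathrm{os}}\,E_c^{\mathrm{os}}
        + c_{\mathrm{ss}}\,E_c^{\mathrm{ss}}
\qquad (\text{SOS: } c_{\mathrm{ss}} = 0).
\end{align}
```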
2018-01-01
Many modern applications of AI such as web search, mobile browsing, image processing, and natural language processing rely on finding similar items from a large database of complex objects. Due to the very large scale of data involved (e.g., users' queries from commercial search engines), computing such near or nearest neighbors is a non-trivial task, as the computational cost grows significantly with the number of items. To address this challenge, we adopt Locality Sensitive Hashing (a.k.a. LSH) methods and evaluate four variants in a distributed computing environment (specifically, Hadoop). We identify several optimizations which improve performance, suitable for deployment in very large scale settings. The experimental results demonstrate that our variants of LSH achieve robust performance with better recall compared with "vanilla" LSH, even when using the same amount of space. PMID:29346410
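A minimal sketch, in pure NumPy with illustrative parameters, of random-hyperplane LSH for cosine similarity, one of the standard families such evaluations draw on: items whose signatures collide in any band become candidate near neighbors, so only a small fraction of the database is compared exactly.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes, n_bands = 64, 32, 8          # 8 bands of 4 bits each
planes = rng.normal(size=(n_planes, dim))
band_size = n_planes // n_bands

def signature(v):
    """Bit signature: the side of each random hyperplane the vector falls on."""
    return (planes @ v > 0).astype(np.uint8)

def band_keys(sig):
    return [(b, sig[b * band_size:(b + 1) * band_size].tobytes())
            for b in range(n_bands)]

# Index a toy database.
db = rng.normal(size=(10_000, dim))
buckets = defaultdict(list)
for idx, vec in enumerate(db):
    for key in band_keys(signature(vec)):
        buckets[key].append(idx)

# Query: union of colliding buckets, then exact cosine re-ranking.
q = db[42] + 0.05 * rng.normal(size=dim)    # a noisy copy of item 42
cands = {i for key in band_keys(signature(q)) for i in buckets[key]}
best = max(cands, key=lambda i: db[i] @ q / (np.linalg.norm(db[i]) * np.linalg.norm(q)))
print(len(cands), best)                     # few candidates; best is likely 42
```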
MycoDB, a global database of plant response to mycorrhizal fungi.
Chaudhary, V Bala; Rúa, Megan A; Antoninka, Anita; Bever, James D; Cannon, Jeffery; Craig, Ashley; Duchicela, Jessica; Frame, Alicia; Gardes, Monique; Gehring, Catherine; Ha, Michelle; Hart, Miranda; Hopkins, Jacob; Ji, Baoming; Johnson, Nancy Collins; Kaonongbua, Wittaya; Karst, Justine; Koide, Roger T; Lamit, Louis J; Meadow, James; Milligan, Brook G; Moore, John C; Pendergast, Thomas H; Piculell, Bridget; Ramsby, Blake; Simard, Suzanne; Shrestha, Shubha; Umbanhowar, James; Viechtbauer, Wolfgang; Walters, Lawrence; Wilson, Gail W T; Zee, Peter C; Hoeksema, Jason D
2016-05-10
Plants form belowground associations with mycorrhizal fungi in one of the most common symbioses on Earth. However, few large-scale generalizations exist for the structure and function of mycorrhizal symbioses, as the nature of this relationship varies from mutualistic to parasitic and is largely context-dependent. We announce the public release of MycoDB, a database of 4,010 studies (from 438 unique publications) to aid in multi-factor meta-analyses elucidating the ecological and evolutionary context in which mycorrhizal fungi alter plant productivity. Over 10 years with nearly 80 collaborators, we compiled data on the response of plant biomass to mycorrhizal fungal inoculation, including meta-analysis metrics and 24 additional explanatory variables that describe the biotic and abiotic context of each study. We also include phylogenetic trees for all plants and fungi in the database. To our knowledge, MycoDB is the largest ecological meta-analysis database. We aim to share these data to highlight significant gaps in mycorrhizal research and encourage synthesis to explore the ecological and evolutionary generalities that govern mycorrhizal functioning in ecosystems. PMID:27163938
DOT National Transportation Integrated Search
1999-12-01
This paper analyzes the freight demand characteristics that drive modal choice by means of a large scale, national, disaggregate revealed preference database for shippers in France in 1988, using a nested logit. Particular attention is given to priva...
The Starkey habitat database for ungulate research: construction, documentation, and use.
Mary M. Rowland; Priscilla K. Coe; Rosemary J. Stussy; [and others].
1998-01-01
The Starkey Project, a large-scale, multidisciplinary research venture, began in 1987 in the Starkey Experimental Forest and Range in northeast Oregon. Researchers are studying effects of forest management on interactions and habitat use of mule deer (Odocoileus hemionus hemionus), elk (Cervus elaphus nelsoni), and cattle. A...
Large Eddy Simulation of jets laden with evaporating drops
NASA Technical Reports Server (NTRS)
Leboissetier, A.; Okong'o, N.; Bellan, J.
2004-01-01
LES of a circular jet laden with evaporating liquid drops are conducted to assess computational-drop modeling and three different SGS-flux models: the Scale Similarity model (SSC), using a constant coefficient calibrated on a temporal mixing layer DNS database, and dynamic-coefficient Gradient and Smagorinsky models.
Sosso, Gabriele C; Miceli, Giacomo; Caravati, Sebastiano; Giberti, Federico; Behler, Jörg; Bernasconi, Marco
2013-12-19
Phase change materials are of great interest as active layers in rewritable optical disks and novel electronic nonvolatile memories. These applications rest on a fast and reversible transformation between the amorphous and crystalline phases upon heating, taking place on the nanosecond time scale. In this work, we investigate the microscopic origin of the fast crystallization process by means of large-scale molecular dynamics simulations of the phase change compound GeTe. To this end, we use an interatomic potential generated from a Neural Network fitting of a large database of ab initio energies. We demonstrate that in the temperature range of the programming protocols of the electronic memories (500-700 K), nucleation of the crystal in the supercooled liquid is not rate-limiting. In this temperature range, the growth of supercritical nuclei is very fast because of a large atomic mobility, which is, in turn, the consequence of the high fragility of the supercooled liquid and the associated breakdown of the Stokes-Einstein relation between viscosity and diffusivity.
NASA Astrophysics Data System (ADS)
Aliseda, Alberto; Bourgoin, Mickael; Eswirp Collaboration
2014-11-01
We present preliminary results from a recent grid turbulence experiment conducted at the ONERA wind tunnel in Modane, France. The ESWIRP Collaboration was conceived to probe the smallest scales of a canonical turbulent flow at very high Reynolds numbers. To achieve this, the largest scales of the turbulence need to be extremely big so that, even with the large separation of scales, the smallest scales remain well above the spatial and temporal resolution of the instruments. The ONERA wind tunnel in Modane (8 m diameter test section) was chosen as a practical limit on the biggest large scales achievable in a laboratory setting. A giant inflatable grid (M = 0.8 m) was conceived to induce slowly-decaying homogeneous isotropic turbulence in a large region of the test section, with minimal structural risk. An international team of researchers collected hot wire anemometry, ultrasound anemometry, resonant cantilever anemometry, fast pitot tube anemometry, cold wire thermometry, and high-speed particle tracking data of this canonical turbulent flow. While analysis of this large database, which will become publicly available over the next 2 years, has only started, the Taylor-scale Reynolds number is estimated to be between 400 and 800, with Kolmogorov scales as large as a few mm. The ESWIRP Collaboration is formed by an international team of scientists to investigate experimentally the smallest scales of turbulence; it was funded by the European Union to take advantage of the largest wind tunnel in Europe for fundamental research.
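For reference, the two scales quoted in the abstract have standard definitions in homogeneous isotropic turbulence; the following is textbook material, not specific to this experiment.

```latex
% Kolmogorov length \eta, set by the kinematic viscosity \nu and the mean
% dissipation rate \varepsilon, and the Taylor-microscale Reynolds number
% R_\lambda, built on the rms velocity fluctuation u' and the Taylor scale
% \lambda (isotropic-turbulence estimate).
\begin{equation}
\eta = \left(\frac{\nu^{3}}{\varepsilon}\right)^{1/4},
\qquad
R_{\lambda} = \frac{u'\,\lambda}{\nu},
\qquad
\lambda = u'\sqrt{\frac{15\,\nu}{\varepsilon}}.
\end{equation}
```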
Optical components damage parameters database system
NASA Astrophysics Data System (ADS)
Tao, Yizheng; Li, Xinglan; Jin, Yuquan; Xie, Dongmei; Tang, Dingyong
2012-10-01
Optical components are key elements of large-scale laser devices, and their load (damage) capacity directly determines the output capability of the device; this capacity depends on many factors. By digitizing the various factors that affect load capacity into an optical component damage parameters database, a scientific data basis is provided for assessing the load capacity of optical components. Using a business-process and model-driven approach, a component damage parameter information model and database system were established. Application results show that the system meets the business-process and data-management requirements of optical component damage testing; its component parameters are flexible and configurable, and the system is simple and easy to use, improving the efficiency of optical component damage testing.
Scaling Semantic Graph Databases in Size and Performance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morari, Alessandro; Castellana, Vito G.; Villa, Oreste
In this paper we present SGEM, a full software system for accelerating large-scale semantic graph databases on commodity clusters. Unlike current approaches, SGEM addresses semantic graph databases by employing only graph methods at all levels of the stack. On one hand, this allows exploiting the space efficiency of graph data structures and the inherent parallelism of graph algorithms. These features adapt well to the increasing system memory and core counts of modern commodity clusters. On the other hand, these systems are optimized for regular computation and batched data transfers, while graph methods usually are irregular and generate fine-grained data accesses with poor spatial and temporal locality. Our framework comprises a SPARQL to data-parallel C compiler, a library of parallel graph methods, and a custom, multithreaded runtime system. We introduce our stack, motivate its advantages with respect to other solutions, and show how we solved the challenges posed by irregular behaviors. We present the results of our software stack on the Berlin SPARQL benchmarks with datasets up to 10 billion triples (a triple corresponds to a graph edge), demonstrating scaling in dataset size and in performance as more nodes are added to the cluster.
The EMBL nucleotide sequence database
Stoesser, Guenter; Baker, Wendy; van den Broek, Alexandra; Camon, Evelyn; Garcia-Pastor, Maria; Kanz, Carola; Kulikova, Tamara; Lombard, Vincent; Lopez, Rodrigo; Parkinson, Helen; Redaschi, Nicole; Sterk, Peter; Stoehr, Peter; Tuli, Mary Ann
2001-01-01
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the European Bioinformatics Institute (EBI) in an international collaboration with the DNA Data Bank of Japan (DDBJ) and GenBank at the NCBI (USA). Data is exchanged amongst the collaborating databases on a daily basis. The major contributors to the EMBL database are individual authors and genome project groups. Webin is the preferred web-based submission system for individual submitters, whilst automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via ftp, email and World Wide Web interfaces. EBI’s Sequence Retrieval System (SRS), a network browser for databanks in molecular biology, integrates and links the main nucleotide and protein databases plus many specialized databases. For sequence similarity searching a variety of tools (e.g. Blitz, Fasta, BLAST) are available which allow external users to compare their own sequences against the latest data in the EMBL Nucleotide Sequence Database and SWISS-PROT. PMID:11125039
Efficient frequent pattern mining algorithm based on node sets in cloud computing environment
NASA Astrophysics Data System (ADS)
Billa, V. N. Vinay Kumar; Lakshmanna, K.; Rajesh, K.; Reddy, M. Praveen Kumar; Nagaraja, G.; Sudheer, K.
2017-11-01
The ultimate goal of data mining is to discover hidden information, useful for decision making, in the large databases collected by an organization. Data mining involves many tasks; mining frequent itemsets is one of the most important in the case of transactional databases. Transactional databases hold data at very large scale, and mining them consumes physical memory and time in proportion to the size of the database. A frequent pattern mining algorithm is considered efficient only if it consumes little memory and time to mine the frequent itemsets from a given large database. With these points in mind, we propose a system that mines frequent itemsets in a way optimized for memory and time by using cloud computing to parallelize the process, with the application provided as a service. The complete framework uses a proven efficient algorithm called the FIN algorithm, which works on Nodesets and a pre-order coding (POC) tree. To evaluate the performance of the system, we conducted experiments comparing the efficiency of the same algorithm applied in a standalone manner and in a cloud computing environment on a real data set of traffic accidents. The results show that the memory consumption and execution time of the proposed system are much lower than those of the standalone system.
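A minimal Apriori-style sketch of frequent-itemset mining, for orientation only; the FIN algorithm described above instead uses Nodesets built from a pre-order coding (POC) tree, which avoids repeated database scans. The transactions and the support threshold here are toy assumptions.

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
]
min_support = 3  # absolute support threshold

def frequent_itemsets(transactions, min_support):
    # Frequent 1-itemsets from a single pass over the database.
    counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]) for i, c in counts.items() if c >= min_support}
    result, k = set(frequent), 2
    while frequent:
        # Candidate generation: unions of frequent (k-1)-itemsets of size k.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Support counting requires rescanning the transactions (FIN avoids this).
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) >= min_support}
        result |= frequent
        k += 1
    return result

for itemset in sorted(frequent_itemsets(transactions, min_support), key=len):
    print(set(itemset))   # {a}, {b}, {c}, {a, c}, {b, c}
```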
Klein, Brennan J; Li, Zhi; Durgin, Frank H
2016-04-01
What is the natural reference frame for seeing large-scale spatial scenes in locomotor action space? Prior studies indicate an asymmetric angular expansion in perceived direction in large-scale environments: Angular elevation relative to the horizon is perceptually exaggerated by a factor of 1.5, whereas azimuthal direction is exaggerated by a factor of about 1.25. Here participants made angular and spatial judgments when upright or on their sides to dissociate egocentric from allocentric reference frames. In Experiment 1, it was found that body orientation did not affect the magnitude of the up-down exaggeration of direction, suggesting that the relevant orientation reference frame for this directional bias is allocentric rather than egocentric. In Experiment 2, the comparison of large-scale horizontal and vertical extents was somewhat affected by viewer orientation, but only to the extent necessitated by the classic (5%) horizontal-vertical illusion (HVI) that is known to be retinotopic. Large-scale vertical extents continued to appear much larger than horizontal ground extents when observers lay sideways. When the visual world was reoriented in Experiment 3, the bias remained tied to the ground-based allocentric reference frame. The allocentric HVI is quantitatively consistent with differential angular exaggerations previously measured for elevation and azimuth in locomotor space. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
New Zealand's National Landslide Database
NASA Astrophysics Data System (ADS)
Rosser, B.; Dellow, S.; Haubrook, S.; Glassey, P.
2016-12-01
Since 1780, landslides have caused an average of about 3 deaths a year in New Zealand and have cost the economy an average of at least NZ$250M/a (0.1% of GDP). To understand the risk posed by landslide hazards to society, a thorough knowledge of where, when and why different types of landslides occur is vital. The main objective for establishing the database was to provide a centralised, national-scale, publicly available database to collate landslide information that could be used for landslide hazard and risk assessment. Design of a national landslide database for New Zealand required consideration of both existing landslide data, stored in a variety of digital formats, and future data yet to be collected. Pre-existing databases were developed and populated with data reflecting the needs of the landslide or hazard project and the database structures of the time. Bringing these data into a single unified database required a new structure capable of storing and delivering data at a variety of scales and accuracies, and with different attributes. A "unified data model" was developed to enable the database to hold old and new landslide data irrespective of scale and method of capture. The database contains information on landslide locations and, where available: 1) the timing of landslides and the events that may have triggered them; 2) the type of landslide movement; 3) the volume and area; 4) the source and debris tail; and 5) the impacts caused by the landslide. Information from a variety of sources, including aerial photographs (and other remotely sensed data), field reconnaissance and media accounts, has been collated and is presented for each landslide along with metadata describing the data sources and quality. There are currently nearly 19,000 landslide records in the database, including point locations, polygons of landslide source and deposit areas, and linear features. Several large datasets are awaiting upload, which will bring the total number of landslides to over 100,000. The geospatial database is publicly available via the Internet. Software components, including the underlying database (PostGIS), web map server (GeoServer) and web application, use open-source software. The hope is that others will add relevant information to the database as well as download the data contained in it.
Historical reconstructions of California wildfires vary by data source
Syphard, Alexandra D.; Keeley, Jon E.
2016-01-01
Historical data are essential for understanding how fire activity responds to different drivers. It is important that the source of data is commensurate with the spatial and temporal scale of the question addressed, but fire history databases are derived from different sources with different restrictions. In California, a frequently used fire history dataset is the State of California Fire and Resource Assessment Program (FRAP) fire history database, which circumscribes fire perimeters at a relatively fine scale. It includes large fires on both state and federal lands but only covers fires that were mapped or had other spatially explicit data. A different database is the state and federal governments’ annual reports of all fires. They are more complete than the FRAP database but are only spatially explicit to the level of county (California Department of Forestry and Fire Protection – Cal Fire) or forest (United States Forest Service – USFS). We found substantial differences between the FRAP database and the annual summaries, with the largest and most consistent discrepancy being in fire frequency. The FRAP database missed the majority of fires and is thus a poor indicator of fire frequency or indicators of ignition sources. The FRAP database is also deficient in area burned, especially before 1950. Even in contemporary records, the huge number of smaller fires not included in the FRAP database account for substantial cumulative differences in area burned. Wildfires in California account for nearly half of the western United States fire suppression budget. Therefore, the conclusions about data discrepancies and the implications for fire research are of broad importance.
NASA Astrophysics Data System (ADS)
Ramage, K.; Desbois, M.; Eymard, L.
2004-12-01
The African Monsoon Multidisciplinary Analysis (AMMA) project is a French initiative that aims to identify and analyse in detail the multidisciplinary, multi-scale processes that lead to a better understanding of the physical mechanisms linked to the African Monsoon. The main components of the African Monsoon are atmospheric dynamics, the continental water cycle, atmospheric chemistry, and oceanic and continental surface conditions. Satellites contribute to various objectives of the project, both for process analysis and for large-scale, long-term studies: some series of satellites (METEOSAT, NOAA, etc.) have been flown for more than 20 years, ensuring good-quality monitoring of some characteristics of the West African atmosphere and surface. Moreover, several recent missions and several planned ones will strongly improve and complement this survey. The AMMA project offers an opportunity to develop the exploitation of satellite data and to foster collaboration between specialist and non-specialist users. For this purpose, databases are being developed to collect all past and future satellite data related to the African Monsoon. It will then be possible to compare different types of data at different resolutions, and to validate satellite data against in situ measurements or numerical simulations. The AMMA-SAT database's main goal is to offer the AMMA scientific community easy access to satellite data. The database contains geophysical products estimated from operational or research algorithms, covering the different components of the AMMA project. Nevertheless, the choice has been made to group data by pertinent scale rather than by theme. To this end, five regions of interest were defined for extracting the data: an area covering the tropical Atlantic and Africa for large-scale studies, an area covering West Africa for mesoscale studies, and three local areas surrounding sites of in situ observations. Within each of these regions, satellite data are projected on a regular grid with a spatial resolution compatible with the spatial variability of the geophysical parameter. Data are stored in NetCDF files to facilitate their use. Satellite products can be selected using several spatial and temporal criteria and ordered through a web interface developed in PHP-MySQL. More common means of access are also available, such as direct FTP or NFS access for identified users. A Live Access Server allows quick visualization of the data, and a metadata catalogue based on the Directory Interchange Format manages the documentation of each satellite product. The database is currently under development, but some products are already available; the database will be complete by the end of 2005.
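A minimal sketch of working with a gridded NetCDF product of the kind described above, using xarray; the file name, variable name, and coordinate bounds are hypothetical, but any CF-style product gridded per region would be handled the same way.

```python
import xarray as xr

ds = xr.open_dataset("amma_sat_product.nc")        # hypothetical product file

# Subset a West Africa mesoscale window for July-September 2004.
subset = ds["brightness_temperature"].sel(
    lat=slice(5.0, 20.0),
    lon=slice(-20.0, 10.0),
    time=slice("2004-07-01", "2004-09-30"),
)
monthly_mean = subset.resample(time="MS").mean()   # quick-look monthly aggregate
print(monthly_mean)
```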
ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining.
Huan, Tianxiao; Sivachenko, Andrey Y; Harrison, Scott H; Chen, Jake Y
2008-08-12
New systems biology studies require researchers to understand how interplay among myriads of biomolecular entities is orchestrated in order to achieve high-level cellular and physiological functions. Many software tools have been developed in the past decade to help researchers visually navigate large networks of biomolecular interactions with built-in template-based query capabilities. To further advance researchers' ability to interrogate global physiological states of cells through multi-scale visual network explorations, new visualization software tools still need to be developed to empower the analysis. A robust visual data analysis platform driven by database management systems to perform bi-directional data processing-to-visualizations with declarative querying capabilities is needed. We developed ProteoLens as a JAVA-based visual analytic software tool for creating, annotating and exploring multi-scale biological networks. It supports direct database connectivity to either Oracle or PostgreSQL database tables/views, on which SQL statements using both Data Definition Languages (DDL) and Data Manipulation languages (DML) may be specified. The robust query languages embedded directly within the visualization software help users to bring their network data into a visualization context for annotation and exploration. ProteoLens supports graph/network represented data in standard Graph Modeling Language (GML) formats, and this enables interoperation with a wide range of other visual layout tools. The architectural design of ProteoLens enables the de-coupling of complex network data visualization tasks into two distinct phases: 1) creating network data association rules, which are mapping rules between network node IDs or edge IDs and data attributes such as functional annotations, expression levels, scores, synonyms, descriptions etc; 2) applying network data association rules to build the network and perform the visual annotation of graph nodes and edges according to associated data values. We demonstrated the advantages of these new capabilities through three biological network visualization case studies: human disease association network, drug-target interaction network and protein-peptide mapping network. The architectural design of ProteoLens makes it suitable for bioinformatics expert data analysts who are experienced with relational database management to perform large-scale integrated network visual explorations. ProteoLens is a promising visual analytic platform that will facilitate knowledge discoveries in future network and systems biology studies.
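A minimal sketch, using networkx rather than ProteoLens itself, of the two-phase idea described above: first define an association rule mapping node IDs to data attributes, then apply it to derive visual annotations on the graph. The gene names and expression values are toy assumptions.

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([("TP53", "MDM2"), ("TP53", "BRCA1"), ("BRCA1", "RAD51")])

# Phase 1: association rule = mapping from node ID to a data attribute
# (in ProteoLens this would come from an SQL query against a database view).
expression = {"TP53": 2.3, "MDM2": -1.1, "BRCA1": 0.4, "RAD51": 1.8}

# Phase 2: apply the rule, turning data values into node annotations.
for node in g.nodes:
    level = expression.get(node, 0.0)
    g.nodes[node]["expression"] = level
    g.nodes[node]["color"] = "red" if level > 0 else "blue"

print(g.nodes(data=True))
```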
Martin, Tiphaine; Sherman, David J; Durrens, Pascal
2011-01-01
The Génolevures online database (URL: http://www.genolevures.org) stores and provides the data and results obtained by the Génolevures Consortium through several campaigns of genome annotation of the yeasts in the Saccharomycotina subphylum (hemiascomycetes). This database is dedicated to large-scale comparison of these genomes, storing not only the different chromosomal elements detected in the sequences, but also the logical relations between them. The database is divided into a public part, accessible to anyone through Internet, and a private part where the Consortium members make genome annotations with our Magus annotation system; this system is used to annotate several related genomes in parallel. The public database is widely consulted and offers structured data, organized using a REST web site architecture that allows for automated requests. The implementation of the database, as well as its associated tools and methods, is evolving to cope with the influx of genome sequences produced by Next Generation Sequencing (NGS). Copyright © 2011 Académie des sciences. Published by Elsevier SAS. All rights reserved.
Universal scaling function in discrete time asymmetric exclusion processes
NASA Astrophysics Data System (ADS)
Chia, Nicholas; Bundschuh, Ralf
2005-03-01
In the universality class of one-dimensional Kardar-Parisi-Zhang (KPZ) surface growth, Derrida and Lebowitz conjectured the universality not only of the scaling exponents, but of an entire scaling function. Since Derrida and Lebowitz's original publication, this universality has been verified for a variety of continuous-time systems in the KPZ universality class. We study the Derrida-Lebowitz scaling function for multi-particle versions of the discrete-time Asymmetric Exclusion Process. We find that in this discrete-time system the Derrida-Lebowitz scaling function not only properly characterizes the large-system-size limit, but even accurately describes surprisingly small systems. These results have immediate applications in searching biological sequence databases.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Borghesi, Giulio; Bellan, Josette, E-mail: josette.bellan@jpl.nasa.gov; Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California 91109-8099
2015-03-15
A Direct Numerical Simulation (DNS) database was created representing mixing of species under high-pressure conditions. The configuration considered is that of a temporally evolving mixing layer. The database was examined and analyzed for the purpose of modeling some of the unclosed terms that appear in the Large Eddy Simulation (LES) equations. Several metrics are used to understand the LES modeling requirements. First, a statistical analysis of the DNS-database large-scale flow structures was performed to provide a metric for probing the accuracy of the proposed LES models as the flow fields obtained from accurate LESs should contain structures of morphology statistically similar to those observed in the filtered-and-coarsened DNS (FC-DNS) fields. To characterize the morphology of the large-scales structures, the Minkowski functionals of the iso-surfaces were evaluated for two different fields: the second-invariant of the rate of deformation tensor and the irreversible entropy production rate. To remove the presence of the small flow scales, both of these fields were computed using the FC-DNS solutions. It was found that the large-scale structures of the irreversible entropy production rate exhibit higher morphological complexity than those of the second invariant of the rate of deformation tensor, indicating that the burden of modeling will be on recovering the thermodynamic fields. Second, to evaluate the physical effects which must be modeled at the subfilter scale, an a priori analysis was conducted. This a priori analysis, conducted in the coarse-grid LES regime, revealed that standard closures for the filtered pressure, the filtered heat flux, and the filtered species mass fluxes, in which a filtered function of a variable is equal to the function of the filtered variable, may no longer be valid for the high-pressure flows considered in this study. The terms requiring modeling are the filtered pressure, the filtered heat flux, the filtered pressure work, and the filtered species mass fluxes. Improved models were developed based on a scale-similarity approach and were found to perform considerably better than the classical ones. These improved models were also assessed in an a posteriori study. Different combinations of the standard models and the improved ones were tested. At the relatively small Reynolds numbers achievable in DNS and at the relatively small filter widths used here, the standard models for the filtered pressure, the filtered heat flux, and the filtered species fluxes were found to yield accurate results for the morphology of the large-scale structures present in the flow. Analysis of the temporal evolution of several volume-averaged quantities representative of the mixing layer growth, and of the cross-stream variation of homogeneous-plane averages and second-order correlations, as well as of visualizations, indicated that the models performed equivalently for the conditions of the simulations. The expectation is that at the much larger Reynolds numbers and much larger filter widths used in practical applications, the improved models will have much more accurate performance than the standard one.
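For orientation, the generic Bardina-type scale-similarity closure is sketched below; this is the textbook form of the approach the improved models build on, quoted as an illustration, while the paper's actual models are tailored to the filtered pressure, heat flux, pressure work, and species fluxes.

```latex
% Scale-similarity closure: the subfilter contribution of a filtered nonlinear
% term f(\phi) is approximated by re-applying the filter (overbar) to the
% resolved field, with C_ss a similarity coefficient of order unity.
\begin{equation}
\overline{f(\phi)} - f(\bar{\phi})
\;\approx\;
C_{ss}\left[\,\overline{f(\bar{\phi})} - f\bigl(\bar{\bar{\phi}}\bigr)\right].
\end{equation}
```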
Integration of a neuroimaging processing pipeline into a pan-canadian computing grid
NASA Astrophysics Data System (ADS)
Lavoie-Courchesne, S.; Rioux, P.; Chouinard-Decorte, F.; Sherif, T.; Rousseau, M.-E.; Das, S.; Adalat, R.; Doyon, J.; Craddock, C.; Margulies, D.; Chu, C.; Lyttelton, O.; Evans, A. C.; Bellec, P.
2012-02-01
The ethos of the neuroimaging field is quickly moving towards the open sharing of resources, including both imaging databases and processing tools. As a neuroimaging database represents a large volume of datasets and as neuroimaging processing pipelines are composed of heterogeneous, computationally intensive tools, such open sharing raises specific computational challenges. This motivates the design of novel dedicated computing infrastructures. This paper describes an interface between PSOM, a code-oriented pipeline development framework, and CBRAIN, a web-oriented platform for grid computing. This interface was used to integrate a PSOM-compliant pipeline for preprocessing of structural and functional magnetic resonance imaging into CBRAIN. We further tested the capacity of our infrastructure to handle a real large-scale project. A neuroimaging database including close to 1000 subjects was preprocessed using our interface and publicly released to help the participants of the ADHD-200 international competition. This successful experiment demonstrated that our integrated grid-computing platform is a powerful solution for high-throughput pipeline analysis in the field of neuroimaging.
ERIC Educational Resources Information Center
Chudagr, Amita; Luschei, Thomas F.
2016-01-01
The objective of this commentary is to call attention to the feasibility and importance of large-scale, systematic, quantitative analysis in international and comparative education research. We contend that although many existing databases are under- or unutilized in quantitative international-comparative research, these resources present the…
Data Intensive Systems (DIS) Benchmark Performance Summary
2003-08-01
models assumed by today's conventional architectures. Such applications include model-based Automatic Target Recognition (ATR), synthetic aperture radar (SAR) codes, large-scale dynamic databases/battlefield integration, dynamic sensor-based processing, high-speed cryptanalysis, high-speed distributed interactive and data-intensive simulations, and data-oriented problems characterized by pointer-based and other highly irregular data structures
Very Large Scale Distributed Information Processing Systems
1991-09-27
Using National Education Longitudinal Data Sets in School Counseling Research
ERIC Educational Resources Information Center
Bryan, Julia A.; Day-Vines, Norma L.; Holcomb-McCoy, Cheryl; Moore-Thomas, Cheryl
2010-01-01
National longitudinal databases hold much promise for school counseling researchers. Several of the more frequently used data sets, possible professional implications, and strategies for acquiring training in the use of large-scale national data sets are described. A 6-step process for conducting research with the data sets is explicated:…
Similarity-based modeling in large-scale prediction of drug-drug interactions.
Vilar, Santiago; Uriarte, Eugenio; Santana, Lourdes; Lorberbaum, Tal; Hripcsak, George; Friedman, Carol; Tatonetti, Nicholas P
2014-09-01
Drug-drug interactions (DDIs) are a major cause of adverse drug effects and a public health concern, as they increase hospital care expenses and reduce patients' quality of life. DDI detection is, therefore, an important objective in patient safety, one whose pursuit affects drug development and pharmacovigilance. In this article, we describe a protocol applicable on a large scale to predict novel DDIs based on similarity of drug interaction candidates to drugs involved in established DDIs. The method integrates a reference standard database of known DDIs with drug similarity information extracted from different sources, such as 2D and 3D molecular structure, interaction profile, target and side-effect similarities. The method is interpretable in that it generates drug interaction candidates that are traceable to pharmacological or clinical effects. We describe a protocol with applications in patient safety and preclinical toxicity screening. The time frame to implement this protocol is 5-7 h, with additional time potentially necessary, depending on the complexity of the reference standard DDI database and the similarity measures implemented.
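A minimal sketch of the structural-similarity component of such a protocol, using RDKit; the drugs, SMILES, and threshold are illustrative assumptions, not the paper's reference standard. A new interaction is proposed when a candidate drug is structurally similar to one partner of an established DDI.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = {
    "aspirin":        "CC(=O)Oc1ccccc1C(=O)O",
    "salicylic_acid": "O=C(O)c1ccccc1O",       # candidate, similar to aspirin
    "warfarin":       "CC(=O)CC(c1ccccc1)C1=C(O)c2ccccc2OC1=O",
}
fps = {name: AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048)
       for name, s in smiles.items()}

established_ddis = [("aspirin", "warfarin")]   # toy reference standard
THRESHOLD = 0.3                                # illustrative cutoff

for a, b in established_ddis:
    for cand in smiles:
        if cand in (a, b):
            continue
        sim = DataStructs.TanimotoSimilarity(fps[cand], fps[a])
        call = "candidate DDI" if sim >= THRESHOLD else "no call"
        print(f"{cand} - {b}: similarity to {a} = {sim:.2f} ({call})")
```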
StructRNAfinder: an automated pipeline and web server for RNA families prediction.
Arias-Carrasco, Raúl; Vásquez-Morán, Yessenia; Nakaya, Helder I; Maracaja-Coutinho, Vinicius
2018-02-17
The function of many noncoding RNAs (ncRNAs) depends upon their secondary structures. Over the last decades, several methodologies have been developed to predict such structures or to use them to functionally annotate RNAs into RNA families. However, to fully perform this analysis, researchers must utilize multiple tools and constantly parse and process several intermediate files, which makes the large-scale prediction and annotation of RNAs a daunting task even for researchers with good computational or bioinformatics skills. We present an automated pipeline named StructRNAfinder that predicts and annotates RNA families in transcript or genome sequences. This single tool not only displays the sequence/structural consensus alignments for each RNA family according to the Rfam database, but also provides a taxonomic overview for each assigned functional RNA. Moreover, we implemented a user-friendly web service that allows researchers to upload their own nucleotide sequences in order to perform the whole analysis. Finally, we provide a stand-alone version of StructRNAfinder for use in large-scale projects. The tool was developed under the GNU General Public License (GPLv3) and is freely available at http://structrnafinder.integrativebioinformatics.me. The main advantage of StructRNAfinder lies in its large-scale processing and its integration of the data obtained by each tool and database employed along the workflow; the several files generated are displayed in user-friendly reports, useful for downstream analyses and data exploration.
Leaf optical properties shed light on foliar trait variability at individual to global scales
NASA Astrophysics Data System (ADS)
Shiklomanov, A. N.; Serbin, S.; Dietze, M.
2016-12-01
Recent syntheses of large trait databases have contributed immensely to our understanding of drivers of plant function at the global scale. However, the global trade-offs revealed by such syntheses, such as the trade-off between leaf productivity and resilience (i.e. "leaf economics spectrum"), are often absent at smaller scales and fail to correlate with actual functional limitations. An improved understanding of how traits vary within communities, species, and individuals is critical to accurate representations of vegetation ecophysiology and ecological dynamics in ecosystem models. Spectral data from both field observations and remote sensing platforms present a potentially rich and widely available source of information on plant traits. In particular, the inversion of physically-based radiative transfer models (RTMs) is an effective and general method for estimating plant traits from spectral measurements. Here, we apply Bayesian inversion of the PROSPECT leaf RTM to a large database of field spectra and plant traits spanning tropical, temperate, and boreal forests, agricultural plots, arid shrublands, and tundra to identify dominant sources of variability and characterize trade-offs in plant functional traits. By leveraging such a large and diverse dataset, we re-calibrate the empirical absorption coefficients underlying the PROSPECT model and expand its scope to include additional leaf biochemical components, namely leaf nitrogen content. Our work provides a key methodological contribution as a physically-based retrieval of leaf nitrogen from remote sensing observations, and provides substantial insights about trait trade-offs related to plant acclimation, adaptation, and community assembly.
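A minimal sketch of Bayesian RTM inversion by Metropolis sampling, in pure NumPy; `prospect_forward` is a hypothetical stand-in for the real PROSPECT model, and the parameters, priors, and noise level are all illustrative. The structure (propose leaf parameters, simulate a spectrum, accept by likelihood ratio) is what the approach described above relies on.

```python
import numpy as np

rng = np.random.default_rng(1)
wl = np.linspace(400, 2500, 211)             # wavelengths, nm

def prospect_forward(chlorophyll, water):    # placeholder forward model
    return np.exp(-chlorophyll * 1e-3 * (wl / 1000)) * np.exp(-water * (wl / 2500))

observed = prospect_forward(40.0, 0.012) + rng.normal(0, 0.01, wl.size)
sigma = 0.01                                 # assumed measurement noise

def log_post(theta):
    chl, wat = theta
    if not (0 < chl < 120 and 0 < wat < 0.08):   # flat priors on leaf traits
        return -np.inf
    resid = observed - prospect_forward(chl, wat)
    return -0.5 * np.sum((resid / sigma) ** 2)

theta, samples = np.array([60.0, 0.03]), []
for _ in range(20000):
    prop = theta + rng.normal(0, [1.0, 0.001])   # random-walk proposal
    if np.log(rng.random()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)
print(np.mean(samples[5000:], axis=0))       # posterior mean trait estimates
```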
Sachem: a chemical cartridge for high-performance substructure search.
Kratochvíl, Miroslav; Vondrášek, Jiří; Galgonek, Jakub
2018-05-23
Structure search is one of the valuable capabilities of small-molecule databases. Fingerprint-based screening methods are usually employed to enhance the search performance by reducing the number of calls to the verification procedure. In substructure search, fingerprints are designed to capture important structural aspects of the molecule to aid the decision about whether the molecule contains a given substructure. Currently available cartridges typically provide acceptable search performance for processing user queries, but do not scale satisfactorily with dataset size. We present Sachem, a new open-source chemical cartridge that implements two substructure search methods: The first is a performance-oriented reimplementation of substructure indexing based on the OrChem fingerprint, and the second is a novel method that employs newly designed fingerprints stored in inverted indices. We assessed the performance of both methods on small, medium, and large datasets containing 1, 10, and 94 million compounds, respectively. Comparison of Sachem with other freely available cartridges revealed improvements in overall performance, scaling potential and screen-out efficiency. The Sachem cartridge allows efficient substructure searches in databases of all sizes. The sublinear performance scaling of the second method and the ability to efficiently query large amounts of pre-extracted information may together open the door to new applications for substructure searches.
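A minimal sketch, in pure Python with invented bit positions, of the inverted-index screening idea described above: each fingerprint bit maps to the molecules containing it, a query's candidates are the intersection of the posting lists for its bits, and only those candidates then go to the (not shown) exact subgraph-isomorphism verification.

```python
from collections import defaultdict

db_fingerprints = {
    "mol1": {3, 17, 42},
    "mol2": {3, 17, 42, 77},
    "mol3": {5, 17},
}

# Inverted index: fingerprint bit -> posting list of molecule IDs.
inverted = defaultdict(set)
for mol, bits in db_fingerprints.items():
    for bit in bits:
        inverted[bit].add(mol)

def screen(query_bits):
    """Molecules whose fingerprints contain all query bits (superset screen)."""
    lists = sorted((inverted[b] for b in query_bits), key=len)  # rarest bit first
    candidates = set(lists[0])
    for postings in lists[1:]:
        candidates &= postings
    return candidates

print(screen({3, 17}))   # {'mol1', 'mol2'} pass the screen; verification follows
```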
Lo, Yu-Chen; Senese, Silvia; Li, Chien-Ming; Hu, Qiyang; Huang, Yong; Damoiseaux, Robert; Torres, Jorge Z.
2015-01-01
Target identification is one of the most critical steps following cell-based phenotypic chemical screens aimed at identifying compounds with potential uses in cell biology and for developing novel disease therapies. Current in silico target identification methods, including chemical similarity database searches, are limited to single or sequential ligand analyses, which have limited capability for accurately deconvolving large numbers of compounds with diverse chemical structures. Here, we present CSNAP (Chemical Similarity Network Analysis Pulldown), a new computational target identification method that utilizes chemical similarity networks for large-scale chemotype (consensus chemical pattern) recognition and drug target profiling. Our benchmark study showed that CSNAP can achieve an overall higher accuracy (>80%) of target prediction with respect to representative chemotypes in large (>200) compound sets, in comparison to the SEA approach (60–70%). Additionally, CSNAP is capable of integrating with biological knowledge-based databases (Uniprot, GO) and high-throughput biology platforms (proteomic, genetic, etc.) for system-wide drug target validation. To demonstrate the utility of the CSNAP approach, we combined CSNAP's target prediction with experimental ligand evaluation to identify the major mitotic targets of hit compounds from a cell-based chemical screen, and we highlight novel compounds targeting microtubules, an important cancer therapeutic target. The CSNAP method is freely available and can be accessed from the CSNAP web server (http://services.mbi.ucla.edu/CSNAP/). PMID:25826798
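A minimal sketch of the similarity-network idea follows, assuming each compound is reduced to a set of structural features; CSNAP's actual pipeline uses 2D fingerprints and annotates network clusters with known ligand-target data to infer targets by consensus.

```python
# Sketch: build a chemical similarity network and read off clusters.
import itertools
import networkx as nx

features = {
    "cmpd-A": {"ring", "amide", "halogen"},
    "cmpd-B": {"ring", "amide"},
    "cmpd-C": {"ester", "alkyne"},
}

def tanimoto(a: set, b: set) -> float:
    # Tanimoto coefficient on feature sets.
    return len(a & b) / len(a | b)

G = nx.Graph()
G.add_nodes_from(features)
for u, v in itertools.combinations(features, 2):
    if tanimoto(features[u], features[v]) >= 0.5:  # similarity cutoff
        G.add_edge(u, v)

# Connected components approximate chemotype clusters.
print([sorted(c) for c in nx.connected_components(G)])
```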
A generic method for improving the spatial interoperability of medical and ecological databases.
Ghenassia, A; Beuscart, J B; Ficheur, G; Occelli, F; Babykina, E; Chazard, E; Genin, M
2017-10-03
The availability of big data in healthcare and the intensive development of data reuse and georeferencing have opened up perspectives for health spatial analysis. However, fine-scale spatial studies of ecological and medical databases are limited by the change-of-support problem and thus by a lack of spatial unit interoperability. The use of spatial disaggregation methods to solve this problem introduces errors into the spatial estimations. Here, we present a generic, two-step method for merging medical and ecological databases that avoids the use of spatial disaggregation methods while maximizing the spatial resolution. Firstly, a mapping table is created after one or more transition matrices have been defined. The latter link the spatial units of the original databases to the spatial units of the final database. Secondly, the mapping table is validated by (1) comparing the covariates contained in the two original databases, and (2) checking the spatial validity with a spatial continuity criterion and a spatial resolution index. We used our novel method to merge a medical database (the French national diagnosis-related group database, containing 5644 spatial units) with an ecological database (produced by the French National Institute of Statistics and Economic Studies, and containing 36,594 spatial units). The mapping table yielded 5632 final spatial units. The mapping table's validity was evaluated by comparing the number of births in the medical and ecological databases in each final spatial unit. The median [interquartile range] relative difference was 2.3% [0; 5.7]. The spatial continuity criterion was low (2.4%), and the spatial resolution index was greater than for most French administrative areas. Our innovative approach improves interoperability between medical and ecological databases and facilitates fine-scale spatial analyses. We have shown that disaggregation models and large aggregation techniques are not necessarily the best ways to tackle the change-of-support problem.
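A minimal sketch of the mapping-table idea, assuming the spatial units of the two databases are linked through a transition table; the unit codes and variables below are illustrative, not those of the French databases in the study.

```python
# Sketch: aggregate finer ecological units up to shared final units,
# avoiding any disaggregation of the medical counts.
import pandas as pd

medical = pd.DataFrame({"med_unit": ["M1", "M2"], "births": [120, 80]})
transition = pd.DataFrame({
    "eco_unit": ["E1", "E2", "E3"],
    "med_unit": ["M1", "M1", "M2"],   # several fine units per coarse unit
})
ecological = pd.DataFrame({"eco_unit": ["E1", "E2", "E3"],
                           "income": [21000, 19500, 23000]})

merged = (ecological.merge(transition, on="eco_unit")
                    .groupby("med_unit", as_index=False)["income"].mean()
                    .merge(medical, on="med_unit"))
print(merged)
```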
Haugum, Mona; Danielsen, Kirsten; Iversen, Hilde Hestad; Bjertnaes, Oyvind
2014-12-01
An important goal for national and large-scale surveys of user experiences is quality improvement. However, large-scale surveys are normally conducted by a professional external surveyor, creating an institutionalized division between the measurement of user experiences and the quality work that is performed locally. The aim of this study was to identify and describe scientific studies related to the use of national and large-scale surveys of user experiences in local quality work. Ovid EMBASE, Ovid MEDLINE, Ovid PsycINFO and the Cochrane Database of Systematic Reviews were searched for scientific publications on user experiences and satisfaction that addressed the extent to which data from national and other large-scale user-experience surveys are used for local quality work in the health services. Themes of interest were identified and a narrative analysis was undertaken. Thirteen publications were included; all differed substantially in several characteristics. The results show that large-scale surveys of user experiences are used in local quality work. The types of follow-up activity varied considerably, from conducting a follow-up analysis of user experience survey data to information sharing and more systematic efforts to use the data as a basis for improving the quality of care. This review shows that large-scale surveys of user experiences are used in local quality work. However, there is a need for more, better and standardized research in this field. The considerable variation in follow-up activities points to the need for systematic guidance on how to use data in local quality work. © The Author 2014. Published by Oxford University Press in association with the International Society for Quality in Health Care; all rights reserved.
Statistical Downscaling in Multi-dimensional Wave Climate Forecast
NASA Astrophysics Data System (ADS)
Camus, P.; Méndez, F. J.; Medina, R.; Losada, I. J.; Cofiño, A. S.; Gutiérrez, J. M.
2009-04-01
Wave climate at a particular site is defined by the statistical distribution of sea state parameters, such as significant wave height, mean wave period, mean wave direction, wind velocity, wind direction and storm surge. Nowadays, long-term time series of these parameters are available from reanalysis databases obtained by numerical models. The Self-Organizing Map (SOM) technique is applied to characterize multi-dimensional wave climate, obtaining the relevant "wave types" spanning the historical variability. This technique summarizes the multiple dimensions of wave climate in terms of a set of clusters projected onto a low-dimensional lattice with a spatial organization, providing Probability Density Functions (PDFs) on the lattice. On the other hand, wind and storm surge depend on the instantaneous local large-scale sea level pressure (SLP) fields, while waves depend on the recent history of these fields (say, 1 to 5 days). Thus, these variables are associated with large-scale atmospheric circulation patterns. In this work, a nearest-neighbors analog method is used to predict monthly multi-dimensional wave climate. This method establishes relationships between large-scale atmospheric circulation patterns from numerical models (SLP fields as predictors) and local wave databases of observations (monthly wave climate SOM PDFs as predictands) to set up statistical models. A wave reanalysis database, developed by Puertos del Estado (Ministerio de Fomento), is considered as the historical time series of local variables. The simultaneous SLP fields calculated by the NCEP atmospheric reanalysis are used as predictors. Several configurations with different sea level pressure grid sizes and temporal resolutions are compared to obtain the optimal statistical model that best represents the monthly wave climate at a particular site. In this work we examine the potential skill of this downscaling approach under perfect-model conditions, but we also analyze the suitability of this methodology for seasonal forecasting and for long-term climate change scenario projections of wave climate.
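A minimal sketch of the nearest-neighbors analog step, assuming SLP fields are flattened into predictor vectors and each historical month carries an observed wave-climate summary; all data here are synthetic.

```python
# Sketch: predict local wave climate from the closest historical
# large-scale SLP analogs.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
slp_history = rng.normal(size=(240, 50))      # 20 years of monthly SLP fields
wave_history = rng.gamma(2.0, 1.0, size=240)  # e.g. monthly mean Hs

model = NearestNeighbors(n_neighbors=5).fit(slp_history)
new_slp = rng.normal(size=(1, 50))            # field to downscale
_, idx = model.kneighbors(new_slp)

# Predict the local wave climate as the average over analog months.
print("predicted Hs:", wave_history[idx[0]].mean())
```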
Evaluation of Tsunami Run-Up on Coastal Areas at Regional Scale
NASA Astrophysics Data System (ADS)
González, M.; Aniel-Quiroga, Í.; Gutiérrez, O.
2017-12-01
Tsunami hazard assessment is tackled by means of numerical simulations, yielding as a result the areas flooded inland by the tsunami wave. Some input data are required, such as high-resolution topobathymetry of the study area and the earthquake focal mechanism parameters. The computational cost of these kinds of simulations is still excessive. An important restriction for the elaboration of large-scale maps at national or regional scale is the reconstruction of high-resolution topobathymetry in the coastal zone. An alternative and traditional method consists of applying empirical-analytical formulations to calculate run-up at several coastal profiles (e.g., Synolakis, 1987), combined with numerical simulations offshore that do not include coastal inundation. In this case, the numerical simulations are faster, but limitations are added because the coastal bathymetric profiles are very simply idealized. In this work, we present a complementary methodology based on a hybrid numerical model, formed by two models that were coupled ad hoc for this work: a non-linear shallow water equations model (NLSWE) for the offshore part of the propagation and a Volume of Fluid model (VOF) for the areas near the coast and inland, applying each numerical scheme where it best reproduces the tsunami wave. The run-up of a tsunami scenario is obtained by applying the coupled model to an ad hoc numerical flume. To design this methodology, hundreds of worldwide topobathymetric profiles were parameterized using 5 parameters (2 depths and 3 slopes). In addition, tsunami waves were also parameterized by their height and period. As an application of the numerical flume methodology, the coastal parameterized profiles and tsunami waves were combined, by means of numerical simulations in the numerical flume, to build a populated database of run-up calculations. The result is a tsunami run-up database that considers real profile shapes, realistic tsunami waves, and optimized numerical simulations. This database allows the run-up of any new tsunami wave to be calculated quickly by interpolation on the database, based on the tsunami wave characteristics provided as an output of the NLSWE model along the coast in a large-scale domain (regional or national scale).
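A minimal sketch of querying such a precomputed run-up database by interpolation, here reduced to two wave parameters for a single profile class; the run-up values are synthetic toys, not outputs of the coupled NLSWE-VOF model.

```python
# Sketch: interpolate run-up for a new (height, period) pair from a
# grid of precomputed simulations.
import numpy as np
from scipy.interpolate import griddata

# Precomputed simulations for one profile class: (H, T) -> run-up.
H, T = np.meshgrid(np.linspace(1, 10, 10), np.linspace(300, 3600, 12))
points = np.column_stack([H.ravel(), T.ravel()])
runup = 0.8 * points[:, 0] ** 0.9 * (points[:, 1] / 600) ** 0.1  # toy values

# A new tsunami wave, e.g. taken from NLSWE model output at the coast:
query = np.array([[4.2, 1500.0]])
print("interpolated run-up:", griddata(points, runup, query, method="linear"))
```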
Dankar, Fida K; Ptitsyn, Andrey; Dankar, Samar K
2018-04-10
Contemporary biomedical databases include a wide range of information types from various observational and instrumental sources. Among the most important features that unite biomedical databases across the field are high volume of information and high potential to cause damage through data corruption, loss of performance, and loss of patient privacy. Thus, issues of data governance and privacy protection are essential for the construction of data depositories for biomedical research and healthcare. In this paper, we discuss the various challenges of data governance in the context of population genome projects, along with best practices and current research efforts, through the steps of data collection, storage, sharing, analysis, and knowledge dissemination.
Gu, Xun; Wang, Yufeng; Gu, Jianying
2002-06-01
The classical (two-round) hypothesis of vertebrate genome duplication proposes two successive whole-genome duplication(s) (polyploidizations) predating the origin of fishes, a view now being seriously challenged. As the debate largely concerns the relative merits of the 'big-bang mode' theory (large-scale duplication) and the 'continuous mode' theory (constant creation by small-scale duplications), we tested whether a significant proportion of paralogous genes in the contemporary human genome was indeed generated in the early stage of vertebrate evolution. After an extensive search of major databases, we dated 1,739 gene duplication events from the phylogenetic analysis of 749 vertebrate gene families. We found a pattern characterized by two waves (I, II) and an ancient component. Wave I represents a recent gene family expansion by tandem or segmental duplications, whereas wave II, a rapid paralogous gene increase in the early stage of vertebrate evolution, supports the idea of genome duplication(s) (the big-bang mode). Further analysis indicated that large- and small-scale gene duplications both make a significant contribution during the early stage of vertebrate evolution to build the current hierarchy of the human proteome.
A Matter of Time: Faster Percolator Analysis via Efficient SVM Learning for Large-Scale Proteomics.
Halloran, John T; Rocke, David M
2018-05-04
Percolator is an important tool for greatly improving the results of a database search and subsequent downstream analysis. Using support vector machines (SVMs), Percolator recalibrates peptide-spectrum matches based on the learned decision boundary between targets and decoys. To improve analysis time for large-scale data sets, we update Percolator's SVM learning engine through software and algorithmic optimizations rather than heuristic approaches that necessitate the careful study of their impact on learned parameters across different search settings and data sets. We show that by optimizing Percolator's original learning algorithm, l2-SVM-MFN, large-scale SVM learning requires only about a third of the original runtime. Furthermore, we show that by employing the widely used Trust Region Newton (TRON) algorithm instead of l2-SVM-MFN, large-scale Percolator SVM learning is reduced to only about a fifth of the original runtime. Importantly, these speedups only affect the speed at which Percolator converges to a global solution and do not alter recalibration performance. The upgraded versions of both l2-SVM-MFN and TRON are optimized within the Percolator codebase for multithreaded and single-thread use and are available under Apache license at bitbucket.org/jthalloran/percolator_upgrade .
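A minimal sketch of target-decoy rescoring with a linear SVM, in the spirit of Percolator's approach; the data are synthetic and scikit-learn's LinearSVC stands in for the optimized l2-SVM-MFN and TRON solvers discussed here.

```python
# Sketch: learn a decision boundary between target and decoy PSMs and
# use it to recalibrate target scores.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
targets = rng.normal(1.0, 1.0, size=(1000, 5))   # PSM feature vectors
decoys = rng.normal(-1.0, 1.0, size=(1000, 5))
X = np.vstack([targets, decoys])
y = np.array([1] * 1000 + [-1] * 1000)

clf = LinearSVC(C=0.01).fit(X, y)
scores = clf.decision_function(X[:1000])  # recalibrated target scores
print("top target score:", scores.max())
```

Percolator's full procedure is semi-supervised, iterating between FDR-filtered positive sets and retraining; the sketch shows only a single supervised pass.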
Smith, Steven M.; Neilson, Ryan T.; Giles, Stuart A.
2015-01-01
Government-sponsored, national-scale, soil and sediment geochemical databases are used to estimate regional and local background concentrations for environmental issues, identify possible anthropogenic contamination, estimate mineral endowment, explore for new mineral deposits, evaluate nutrient levels for agriculture, and establish concentration relationships with human or animal health. Because of these different uses, it is difficult for any single database to accommodate all the needs of each client. Smith et al. (2013, p. 168) reviewed six national-scale soil and sediment geochemical databases for the United States (U.S.) and, for each, evaluated “its appropriateness as a national-scale geochemical database and its usefulness for national-scale geochemical mapping.” Each of the evaluated databases has strengths and weaknesses that were listed in that review. Two of these U.S. national-scale geochemical databases are similar in their sample media and collection protocols but have different strengths—primarily sampling density and analytical consistency. This project was implemented to determine whether those databases could be merged to produce a combined dataset that could be used for mineral resource assessments. The utility of the merged database was tested to see whether mapped distributions could identify metalliferous black shales at a national scale.
The thermodynamic scale of inorganic crystalline metastability
Sun, Wenhao; Dacek, Stephen T.; Ong, Shyue Ping; Hautier, Geoffroy; Jain, Anubhav; Richards, William D.; Gamst, Anthony C.; Persson, Kristin A.; Ceder, Gerbrand
2016-01-01
The space of metastable materials offers promising new design opportunities for next-generation technological materials, such as complex oxides, semiconductors, pharmaceuticals, steels, and beyond. Although metastable phases are ubiquitous in both nature and technology, only a heuristic understanding of their underlying thermodynamics exists. We report a large-scale data-mining study of the Materials Project, a high-throughput database of density functional theory–calculated energetics of Inorganic Crystal Structure Database structures, to explicitly quantify the thermodynamic scale of metastability for 29,902 observed inorganic crystalline phases. We reveal the influence of chemistry and composition on the accessible thermodynamic range of crystalline metastability for polymorphic and phase-separating compounds, yielding new physical insights that can guide the design of novel metastable materials. We further assert that not all low-energy metastable compounds can necessarily be synthesized, and propose a principle of ‘remnant metastability’—that observable metastable crystalline phases are generally remnants of thermodynamic conditions where they were once the lowest free-energy phase. PMID:28138514
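A minimal sketch of the core data-mining step, assuming per-phase DFT energies are tabulated; for polymorphs, metastability is measured here as energy above the lowest-energy structure of the same composition (the study itself works with the convex hull across compositions), and all numbers are illustrative.

```python
# Sketch: tabulate each polymorph's energy above its composition's
# ground state, in meV/atom.
import pandas as pd

phases = pd.DataFrame({
    "composition": ["TiO2", "TiO2", "TiO2", "SiO2", "SiO2"],
    "structure": ["rutile", "anatase", "brookite", "quartz", "cristobalite"],
    "energy_eV_per_atom": [-9.74, -9.71, -9.70, -7.95, -7.92],  # toy values
})
phases["e_above_gs_meV"] = (
    phases.groupby("composition")["energy_eV_per_atom"]
          .transform(lambda e: (e - e.min()) * 1000)
)
print(phases)
```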
Asymmetric author-topic model for knowledge discovering of big data in toxicogenomics.
Chung, Ming-Hua; Wang, Yuping; Tang, Hailin; Zou, Wen; Basinger, John; Xu, Xiaowei; Tong, Weida
2015-01-01
The advancement of high-throughput screening technologies facilitates the generation of massive amounts of biological data, a big-data phenomenon in biomedical science. Yet researchers still rely heavily on keyword searches and/or literature review to navigate the databases, and analyses are often done at rather small scale. As a result, the rich information of a database has not been fully utilized, particularly the information embedded in the interactions between data points, which is largely ignored and buried. For the past 10 years, probabilistic topic modeling has been recognized as an effective machine learning algorithm for annotating the hidden thematic structure of massive collections of documents. The analogy between a text corpus and large-scale genomic data enables the application of text mining tools, like probabilistic topic models, to explore hidden patterns of genomic data and, by extension, altered biological functions. In this paper, we developed a generalized probabilistic topic model to analyze a toxicogenomics dataset consisting of a large number of gene expression profiles from rat livers treated with drugs at multiple doses and time points. We discovered hidden patterns in gene expression associated with the effects of dose and time point of treatment. Finally, we illustrated the ability of our model to identify evidence of potential reduction of animal use.
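A minimal sketch of topic modeling on expression data, assuming each treatment condition is a "document" and each gene a "word" whose count reflects discretized expression; scikit-learn's standard LDA stands in for the generalized asymmetric model, and the counts are synthetic.

```python
# Sketch: fit LDA to a condition-by-gene count matrix and inspect the
# genes that dominate one latent "topic".
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(3)
counts = rng.poisson(2.0, size=(60, 500))  # 60 conditions x 500 genes

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)
topic_mix = lda.transform(counts)          # condition-by-topic weights
top_genes = lda.components_[0].argsort()[::-1][:10]
print("top genes for topic 0:", top_genes)
```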
Thakur, Shalabh; Guttman, David S
2016-06-30
Comparative analysis of whole genome sequence data from closely related prokaryotic species or strains is becoming an increasingly important and accessible approach for addressing both fundamental and applied biological questions. While there are a number of excellent tools developed for performing this task, most scale poorly when faced with hundreds of genome sequences, and many require extensive manual curation. We have developed a de novo genome analysis pipeline (DeNoGAP) for the automated, iterative and high-throughput analysis of data from comparative genomics projects involving hundreds of whole genome sequences. The pipeline is designed to perform reference-assisted and de novo gene prediction, homolog protein family assignment, ortholog prediction, functional annotation, and pan-genome analysis using a range of proven tools and databases. While most existing methods scale quadratically with the number of genomes, since they rely on pairwise comparisons among predicted protein sequences, DeNoGAP scales linearly, since the homology assignment is based on iteratively refined hidden Markov models. This iterative clustering strategy enables DeNoGAP to handle a very large number of genomes using minimal computational resources. Moreover, the modular structure of the pipeline permits easy updates as new analysis programs become available. DeNoGAP integrates bioinformatics tools and databases for comparative analysis of a large number of genomes. The pipeline offers tools and algorithms for annotation and analysis of completed and draft genome sequences. The pipeline is developed using Perl, BioPerl and SQLite on Ubuntu Linux version 12.04 LTS. Currently, the software package includes a script for the automated installation of necessary external programs on Ubuntu Linux; however, the pipeline should also be compatible with other Linux and Unix systems once the necessary external programs are installed. DeNoGAP is freely available at https://sourceforge.net/projects/denogap/ .
Lightweight genome viewer: portable software for browsing genomics data in its chromosomal context
Faith, Jeremiah J; Olson, Andrew J; Gardner, Timothy S; Sachidanandam, Ravi
2007-01-01
Background Lightweight genome viewer (lwgv) is a web-based tool for visualization of sequence annotations in their chromosomal context. It performs most of the functions of larger genome browsers, while relying on standard flat-file formats and bypassing the database needs of most visualization tools. Visualization as an aid to discovery requires display of novel data in conjunction with static annotations in their chromosomal context. With database-based systems, displaying dynamic results requires temporary tables that need to be tracked for removal. Results lwgv simplifies the visualization of user-generated results on a local computer. The dynamic results of these analyses are written to transient files, which can import static content from a more permanent file. lwgv is currently used in many different applications, from whole genome browsers to single-gene RNAi design visualization, demonstrating its applicability in a large variety of contexts and scales. Conclusion lwgv provides a lightweight alternative to large genome browsers for visualizing biological annotations and dynamic analyses in their chromosomal context. It is particularly suited for applications ranging from short sequences to medium-sized genomes when the creation and maintenance of a large software and database infrastructure is not necessary or desired. PMID:17877794
BIG: a large-scale data integration tool for renal physiology.
Zhao, Yue; Yang, Chin-Rang; Raghuram, Viswanathan; Parulekar, Jaya; Knepper, Mark A
2016-10-01
Due to recent advances in high-throughput techniques, we and others have generated multiple proteomic and transcriptomic databases to describe and quantify gene expression, protein abundance, or cellular signaling on the scale of the whole genome/proteome in kidney cells. The existence of so much data from diverse sources raises the following question: "How can researchers find information efficiently for a given gene product over all of these data sets without searching each data set individually?" This is the type of problem that has motivated the "Big-Data" revolution in Data Science, which has driven progress in fields such as marketing. Here we present an online Big-Data tool called BIG (Biological Information Gatherer) that allows users to submit a single online query to obtain all relevant information from all indexed databases. BIG is accessible at http://big.nhlbi.nih.gov/.
Coincident scales of forest feedback on climate and conservation in a diversity hot spot
Webb, Thomas J; Gaston, Kevin J; Hannah, Lee; Ian Woodward, F
2005-01-01
The dynamic relationship between vegetation and climate is now widely acknowledged. Climate influences the distribution of vegetation; and through a number of feedback mechanisms vegetation affects climate. This implies that land-use changes such as deforestation will have climatic consequences. However, the spatial scales at which such feedbacks occur remain largely unknown. Here, we use a large database of precipitation and tree cover records for an area of the biodiversity-rich Atlantic forest region in south eastern Brazil to investigate the forest–rainfall feedback at a range of spatial scales from ca 10¹–10⁴ km². We show that the strength of the feedback increases up to scales of at least 10³ km², with the climate at a particular locality influenced by the pattern of landcover extending over a large area. Thus, smaller forest fragments, even if well protected, may suffer degradation due to the climate responding to land-use change in the surrounding area. Atlantic forest vertebrate taxa also require large areas of forest to support viable populations. Areas of forest of ca 10³ km² would be large enough to support such populations at the same time as minimizing the risk of climatic feedbacks resulting from deforestation. PMID:16608697
Administrative Databases in Orthopaedic Research: Pearls and Pitfalls of Big Data.
Patel, Alpesh A; Singh, Kern; Nunley, Ryan M; Minhas, Shobhit V
2016-03-01
The drive for evidence-based decision-making has highlighted the shortcomings of traditional orthopaedic literature. Although high-quality, prospective, randomized studies in surgery are the benchmark in orthopaedic literature, they are often limited by size, scope, cost, time, and ethical concerns and may not be generalizable to larger populations. Given these restrictions, there is a growing trend toward the use of large administrative databases to investigate orthopaedic outcomes. These datasets afford the opportunity to identify large numbers of patients across a broad spectrum of comorbidities, providing information regarding disparities in care and outcomes, preoperative risk stratification parameters for perioperative morbidity and mortality, and national epidemiologic rates and trends. Although these databases are powerful in their reach, potential problems include administrative data that are at risk of clerical inaccuracies, recording bias secondary to financial incentives, temporal changes in billing codes, a lack of numerous clinically relevant variables and orthopaedic-specific outcomes, and the absolute requirement of an experienced epidemiologist and/or statistician when evaluating results and controlling for confounders. Despite these drawbacks, administrative database studies are fundamental and powerful tools for assessing outcomes on a national scale and will likely be of substantial assistance in the future of orthopaedic research.
JEnsembl: a version-aware Java API to Ensembl data systems.
Paterson, Trevor; Law, Andy
2012-11-01
The Ensembl Project provides release-specific Perl APIs for efficient high-level programmatic access to data stored in the various Ensembl database schemas. Although Perl scripts are perfectly suited to processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications or for embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-orientated development of new bioinformatics tools with which to access, analyse and visualize Ensembl data. The JEnsembl API implementation provides basic data retrieval and manipulation functionality from the Core, Compara and Variation databases for all species in Ensembl and EnsemblGenomes and is a platform for the development of a richer API to Ensembl data sources. The JEnsembl architecture uses a text-based configuration module to provide evolving, versioned mappings from database schemas to code objects. A single installation of the JEnsembl API can therefore simultaneously and transparently connect to current and previous database instances (such as those in the public archive), thus facilitating better analysis repeatability and allowing 'through time' comparative analyses to be performed. Project development, released code libraries, Maven repository and documentation are hosted at SourceForge (http://jensembl.sourceforge.net).
CellLineNavigator: a workbench for cancer cell line analysis
Krupp, Markus; Itzel, Timo; Maass, Thorsten; Hildebrandt, Andreas; Galle, Peter R.; Teufel, Andreas
2013-01-01
The CellLineNavigator database, freely available at http://www.medicalgenomics.org/celllinenavigator, is a web-based workbench for large-scale comparisons of a large collection of diverse cell lines. It aims to support experimental design in the fields of genomics, systems biology and translational biomedical research. Currently, this compendium holds genome-wide expression profiles of 317 different cancer cell lines, categorized into 57 different pathological states and 28 individual tissues. To enlarge the scope of CellLineNavigator, the database was furthermore closely linked to commonly used bioinformatics databases and knowledge repositories. To ensure easy data access and searchability, a simple data interface and an intuitive query interface were implemented. They allow the user to explore and filter gene expression, focusing on pathological or physiological conditions. For a more complex search, the advanced query interface may be used to query for (i) differentially expressed genes; (ii) pathological or physiological conditions; or (iii) gene names or functional attributes, such as Kyoto Encyclopaedia of Genes and Genomes pathway maps. These queries may also be combined. Finally, CellLineNavigator allows additional advanced analysis of differentially regulated genes by a direct link to the Database for Annotation, Visualization and Integrated Discovery (DAVID) Bioinformatics Resources. PMID:23118487
Protocol for developing a Database of Zoonotic disease Research in India (DoZooRI).
Chatterjee, Pranab; Bhaumik, Soumyadeep; Chauhan, Abhimanyu Singh; Kakkar, Manish
2017-12-10
Zoonotic and emerging infectious diseases (EIDs) represent a public health threat that has been acknowledged only recently, although they have been on the rise for the past several decades. On average, one pathogen has emerged or re-emerged on a global scale every year since the Second World War. Low/middle-income countries such as India bear a significant burden of zoonotic and EIDs. We propose that the creation of a database of published, peer-reviewed research will open up avenues for evidence-based policymaking for targeted prevention and control of zoonoses. A large-scale systematic mapping of the published peer-reviewed research conducted in India will be undertaken. All published research will be included in the database, without quality screening, to broaden the scope of included studies. Structured search strategies will be developed for priority zoonotic diseases (leptospirosis, rabies, anthrax, brucellosis, cysticercosis, salmonellosis, bovine tuberculosis, Japanese encephalitis and rickettsial infections), and multiple databases will be searched for studies conducted in India. The database will be managed and hosted on a cloud-based platform called Rayyan. Individual studies will be tagged based on key preidentified parameters (disease, study design, study type, location, randomisation status and interventions, host involvement and others, as applicable). The database will incorporate already published studies, obviating the need for additional ethical clearances. The database will be made available online, and in collaboration with multisectoral teams, domains of enquiry will be identified and subsequent research questions will be raised. The database will be queried for these and the resulting evidence will be analysed and published in peer-reviewed journals. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Transitioning to a new nursing home: one organization's experience.
O'Brien, Kelli; Welsh, Darlene; Lundrigan, Elaine; Doyle, Anne
2013-01-01
Restructuring of long-term care in Western Health, a regional health authority within Newfoundland and Labrador, created a unique opportunity to study the widespread impacts of the transition. Staff and long-term-care residents were relocated from a variety of settings to a newly constructed facility. A plan was developed to assess the impact of relocation on staff, residents, and families. Indicators included fall rates, medication errors, complaints, media database, sick leave, overtime, injuries, and staff and family satisfaction. This article reports on the findings and lessons learned from an organizational perspective with such a large-scale transition. Some of the key findings included the necessity of premove and postmove strategies to minimize negative impacts, ongoing communication and involvement in decision making during transitions, tracking of key indicators, recognition from management regarding increased workload and stress experienced by staff, engagement of residents and families throughout the transition, and assessing the timing of large-scale relocations. These findings would be of interest to health care managers and leadership team in organizations planning large-scale changes.
[Advances in the research of application of artificial intelligence in burn field].
Li, H H; Bao, Z X; Liu, X B; Zhu, S H
2018-04-20
Artificial intelligence has become able, to some extent, to learn from and make judgments on large-scale data automatically. Building on large burn databases and deep learning, artificial intelligence can assist burn surgeons in evaluating burn surface area, diagnosing burn depth, guiding fluid resuscitation during the shock stage, and predicting prognosis, with high accuracy. With the development of the technology, artificial intelligence can provide more accurate information for burn surgeons when making clinical diagnosis and treatment strategies.
Remote visualization and scale analysis of large turbulence datasets
NASA Astrophysics Data System (ADS)
Livescu, D.; Pulido, J.; Burns, R.; Canada, C.; Ahrens, J.; Hamann, B.
2015-12-01
Accurate simulations of turbulent flows require solving all the dynamically relevant scales of motions. This technique, called Direct Numerical Simulation, has been successfully applied to a variety of simple flows; however, the large-scale flows encountered in Geophysical Fluid Dynamics (GFD) would require meshes outside the range of the most powerful supercomputers for the foreseeable future. Nevertheless, the current generation of petascale computers has enabled unprecedented simulations of many types of turbulent flows which focus on various GFD aspects, from the idealized configurations extensively studied in the past to more complex flows closer to the practical applications. The pace at which such simulations are performed only continues to increase; however, the simulations themselves are restricted to a small number of groups with access to large computational platforms. Yet the petabytes of turbulence data offer almost limitless information on many different aspects of the flow, from the hierarchy of turbulence moments, spectra and correlations, to structure-functions, geometrical properties, etc. The ability to share such datasets with other groups can significantly reduce the time to analyze the data, help the creative process and increase the pace of discovery. Using the largest DOE supercomputing platforms, we have performed some of the biggest turbulence simulations to date, in various configurations, addressing specific aspects of turbulence production and mixing mechanisms. Until recently, the visualization and analysis of such datasets was restricted by access to large supercomputers. The public Johns Hopkins Turbulence database simplifies the access to multi-Terabyte turbulence datasets and facilitates turbulence analysis through the use of commodity hardware. First, one of our datasets, which is part of the database, will be described and then a framework that adds high-speed visualization and wavelet support for multi-resolution analysis of turbulence will be highlighted. The addition of wavelet support reduces the latency and bandwidth requirements for visualization, allowing for many concurrent users, and enables new types of analyses, including scale decomposition and coherent feature extraction.
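A minimal sketch of the kind of wavelet-based scale decomposition such support enables, with synthetic noise standing in for a velocity slice; the real database serves coefficients server-side to reduce latency and bandwidth.

```python
# Sketch: three-level 2D wavelet decomposition of a turbulence field
# and the energy carried at each scale.
import numpy as np
import pywt

field = np.random.default_rng(4).normal(size=(256, 256))

coeffs = pywt.wavedec2(field, "db4", level=3)

# Energy per scale: coarse approximation first, then finer detail bands.
energies = [np.sum(coeffs[0] ** 2)]
for details in coeffs[1:]:
    energies.append(sum(np.sum(d ** 2) for d in details))
print("energy by scale (coarse to fine):", energies)
```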
Developing Data Systems To Support the Analysis and Development of Large-Scale, On-Line Assessment.
ERIC Educational Resources Information Center
Yu, Chong Ho
Today many data warehousing systems are data rich, but information poor. Extracting useful information from an ocean of data to support administrative, policy, and instructional decisions becomes a major challenge to both database designers and measurement specialists. This paper focuses on the development of a data processing system that…
ERIC Educational Resources Information Center
Hsieh, Feng-Jui; Law, Chiu-Keung; Shy, Haw-Yaw; Wang, Ting-Ying; Hsieh, Chia-Jui; Tang, Shu-Jyh
2011-01-01
The Teacher Education and Development Study in Mathematics, sponsored by the International Association for the Evaluation of Educational Achievement, is the first data-based study about mathematics teacher education with large-scale samples; this article is based on its data but develops a stand-alone conceptual framework to investigate the…
ERIC Educational Resources Information Center
Hsaieh, Hsiao-Chin; Yang, Chia-Ling
2014-01-01
While access to higher education has reached gender parity in Taiwan, the phenomenon of gender segregation and stratification by fields of study and by division of labor persist. In this article, we trace the historical evolution of Taiwan's education system and data using large-scale educational databases to analyze the association of…
Height-diameter allometry of tropical forest trees
T.R. Feldpausch; L. Banin; O.L. Phillips; T.R. Baker; S.L. Lewis; C.A. Quesada; K. Affum-Baffoe; E.J.M.M. Arets; N.J. Berry; M. Bird; E.S. Brondizio; P de Camargo; J. Chave; G. Djagbletey; T.F. Domingues; M. Drescher; P.M. Fearnside; M.B. Franca; N.M. Fyllas; G. Lopez-Gonzalez; A. Hladik; N. Higuchi; M.O. Hunter; Y. Iida; K.A. Salim; A.R. Kassim; M. Keller; J. Kemp; D.A. King; J.C. Lovett; B.S. Marimon; B.H. Marimon-Junior; E. Lenza; A.R. Marshall; D.J. Metcalfe; E.T.A. Mitchard; E.F. Moran; B.W. Nelson; R. Nilus; E.M. Nogueira; M. Palace; S. Patiño; K.S.-H. Peh; M.T. Raventos; J.M. Reitsma; G. Saiz; F. Schrodt; B. Sonke; H.E. Taedoumg; S. Tan; L. White; H. Woll; J. Lloyd
2011-01-01
Tropical tree height-diameter (H:D) relationships may vary by forest type and region making large-scale estimates of above-ground biomass subject to bias if they ignore these differences in stem allometry. We have therefore developed a new global tropical forest database consisting of 39 955 concurrent H and D measurements encompassing 283 sites in 22 tropical...
Obesity, High-Calorie Food Intake, and Academic Achievement Trends among U.S. School Children
ERIC Educational Resources Information Center
Li, Jian; O'Connell, Ann A.
2012-01-01
The authors investigated children's self-reported high-calorie food intake in Grade 5 and its relationship to trends in obesity status and academic achievement over the first 6 years of school. They used 3-level hierarchical linear models in the large-scale database (the Early Childhood Longitudinal Study--Kindergarten Cohort). Findings indicated…
Benson, Dennis A.; Karsch-Mizrachi, Ilene; Lipman, David J.; Ostell, James; Wheeler, David L.
2007-01-01
GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage (www.ncbi.nlm.nih.gov). PMID:17202161
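A minimal sketch of programmatic GenBank retrieval through Entrez, using Biopython's Entrez module as one common client (the abstract itself does not prescribe a client); the accession below is a real example record, and NCBI asks callers to identify themselves by email.

```python
# Sketch: fetch one GenBank flat-file record over the Entrez E-utilities.
from Bio import Entrez

Entrez.email = "you@example.org"  # replace with a real contact address
handle = Entrez.efetch(db="nucleotide", id="U49845",
                       rettype="gb", retmode="text")
record_text = handle.read()
handle.close()
print(record_text[:300])  # header of the GenBank record
```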
A Computational Chemistry Database for Semiconductor Processing
NASA Technical Reports Server (NTRS)
Jaffe, R.; Meyyappan, M.; Arnold, J. O. (Technical Monitor)
1998-01-01
The concept of a 'virtual reactor' or 'virtual prototyping' has received much attention recently in the semiconductor industry. Commercial codes to simulate thermal CVD and plasma processes have become available to aid in equipment and process design efforts. The virtual prototyping effort would go nowhere if codes did not come with a reliable database of the chemical and physical properties of the gases involved in semiconductor processing. Commercial code vendors have no capability to generate such a database and instead leave the task of finding whatever is needed to the user. While individual investigations of interesting chemical systems continue at universities, there has not been any large-scale effort to create a database. In this presentation, we outline our efforts in this area, which focus on the following five areas: 1) thermal CVD reaction mechanisms and rate constants; 2) thermochemical properties; 3) transport properties; 4) electron-molecule collision cross sections; and 5) gas-surface interactions.
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L
2008-01-01
GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 260 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
Australia's continental-scale acoustic tracking database and its automated quality control process
NASA Astrophysics Data System (ADS)
Hoenner, Xavier; Huveneers, Charlie; Steckenreuter, Andre; Simpfendorfer, Colin; Tattersall, Katherine; Jaine, Fabrice; Atkins, Natalia; Babcock, Russ; Brodie, Stephanie; Burgess, Jonathan; Campbell, Hamish; Heupel, Michelle; Pasquer, Benedicte; Proctor, Roger; Taylor, Matthew D.; Udyawer, Vinay; Harcourt, Robert
2018-01-01
Our ability to predict species responses to environmental changes relies on accurate records of animal movement patterns. Continental-scale acoustic telemetry networks are increasingly being established worldwide, producing large volumes of information-rich geospatial data. During the last decade, the Integrated Marine Observing System's Animal Tracking Facility (IMOS ATF) established a permanent array of acoustic receivers around Australia. Simultaneously, IMOS developed a centralised national database to foster collaborative research across the user community and quantify individual behaviour across a broad range of taxa. Here we present the database and quality control procedures developed to collate 49.6 million valid detections from 1891 receiving stations. This dataset consists of detections for 3,777 tags deployed on 117 marine species, with distances travelled ranging from a few to thousands of kilometres. Connectivity between regions was only made possible by the joint contribution of IMOS infrastructure and researcher-funded receivers. This dataset constitutes a valuable resource facilitating meta-analysis of animal movement, distributions, and habitat use, and is important for relating species distribution shifts with environmental covariates.
A Comparison of Global Indexing Schemes to Facilitate Earth Science Data Management
NASA Astrophysics Data System (ADS)
Griessbaum, N.; Frew, J.; Rilee, M. L.; Kuo, K. S.
2017-12-01
Recent advances in database technology have led to systems optimized for managing petabyte-scale multidimensional arrays. These array databases are a good fit for subsets of the Earth's surface that can be projected into a rectangular coordinate system with acceptable geometric fidelity. However, for global analyses, array databases must address the same distortions and discontinuities that apply to map projections in general. The array database SciDB supports enormous databases spread across thousands of computing nodes. Additionally, the following SciDB characteristics are particularly germane to the coordinate system problem: SciDB efficiently stores and manipulates sparse (i.e. mostly empty) arrays. SciDB arrays have 64-bit indexes. SciDB supports user-defined data types, functions, and operators. We have implemented two geospatial indexing schemes in SciDB. The simplest uses two array dimensions to represent longitude and latitude. For representation as 64-bit integers, the coordinates are multiplied by a scale factor large enough to yield an appropriate Earth surface resolution (e.g., a scale factor of 100,000 yields a resolution of approximately 1 m at the equator). Aside from the longitudinal discontinuity, the principal disadvantage of this scheme is its fixed scale factor. The second scheme uses a single array dimension to represent the bit-codes for locations in a hierarchical triangular mesh (HTM) coordinate system. An HTM maps the Earth's surface onto an octahedron, and then recursively subdivides each triangular face to the desired resolution. Earth surface locations are represented as the concatenation of an octahedron face code and a quadtree code within the face. Unlike our integerized lat-lon scheme, the HTM allows objects of different sizes (e.g., pixels with differing resolutions) to be represented in the same indexing scheme. We present an evaluation of the relative utility of these two schemes for managing and analyzing MODIS swath data.
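A minimal sketch of the integerized lat-lon scheme described above: coordinates scaled to integers that can serve as sparse-array dimensions, with the fixed scale factor the text notes as the scheme's main drawback.

```python
# Sketch: round-trip between geographic coordinates and integer array
# indexes at a fixed resolution.
SCALE = 100_000  # ~1 m at the equator, as in the abstract

def to_index(lon: float, lat: float) -> tuple[int, int]:
    return int(round(lon * SCALE)), int(round(lat * SCALE))

def from_index(i: int, j: int) -> tuple[float, float]:
    return i / SCALE, j / SCALE

idx = to_index(-119.8489, 34.4140)   # an illustrative pixel centre
print(idx, "->", from_index(*idx))
```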
Evolving from bioinformatics in-the-small to bioinformatics in-the-large.
Parker, D Stott; Gorlick, Michael M; Lee, Christopher J
2003-01-01
We argue the significance of a fundamental shift in bioinformatics, from in-the-small to in-the-large. Adopting a large-scale perspective is a way to manage the problems endemic to the world of the small: constellations of incompatible tools for which the effort required to assemble an integrated system exceeds the perceived benefit of the integration. Where bioinformatics in-the-small is about data and tools, bioinformatics in-the-large is about metadata and dependencies. Dependencies represent the complexities of large-scale integration, including the requirements and assumptions governing the composition of tools. The popular make utility is a very effective system for defining and maintaining simple dependencies, and it offers a number of insights about the essence of bioinformatics in-the-large. Keeping an in-the-large perspective has been very useful to us in large bioinformatics projects. We give two fairly different examples, and extract lessons from them showing how it has helped. These examples both suggest the benefit of explicitly defining and managing knowledge flows and knowledge maps (which represent metadata regarding types, flows, and dependencies), and also suggest approaches for developing bioinformatics database systems. Generally, we argue that large-scale engineering principles can be successfully adapted from disciplines such as software engineering and data management, and that having an in-the-large perspective will be a key advantage in the next phase of bioinformatics development.
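A minimal sketch of the make-style dependency idea the authors invoke, transcribed into Python: rebuild a product only when one of its declared dependencies is newer; the targets and actions are illustrative.

```python
# Sketch: a tiny make-like resolver over declared dependencies.
import os

rules = {
    # target: (dependencies, rebuild action)
    "library.fasta": ((), lambda: None),  # primary data, never rebuilt
    "search.out": (("library.fasta",), lambda: print("rerun search")),
    "report.txt": (("search.out",), lambda: print("regenerate report")),
}

def build(target: str) -> None:
    deps, action = rules[target]
    for dep in deps:
        build(dep)  # bring dependencies up to date first
    # Rebuild if the target is missing or older than any dependency.
    if deps and any(not os.path.exists(target)
                    or os.path.getmtime(dep) > os.path.getmtime(target)
                    for dep in deps):
        action()

build("report.txt")
```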
A data model and database for high-resolution pathology analytical image informatics.
Wang, Fusheng; Kong, Jun; Cooper, Lee; Pan, Tony; Kurc, Tahsin; Chen, Wenjin; Sharma, Ashish; Niedermayr, Cristobal; Oh, Tae W; Brat, Daniel; Farris, Alton B; Foran, David J; Saltz, Joel
2011-01-01
The systematic analysis of imaged pathology specimens often results in a vast amount of morphological information at both the cellular and sub-cellular scales. While microscopy scanners and computerized analysis are capable of capturing and analyzing data rapidly, microscopy image data remain underutilized in research and clinical settings. One major obstacle that tends to reduce wider adoption of these new technologies throughout the clinical and scientific communities is the challenge of managing, querying, and integrating the vast amounts of data resulting from the analysis of large digital pathology datasets. This paper presents a data model, which addresses these challenges, and demonstrates its implementation in a relational database system. This paper describes a data model, referred to as Pathology Analytic Imaging Standards (PAIS), and a database implementation, which are designed to support the data management and query requirements of detailed characterization of micro-anatomic morphology through many interrelated analysis pipelines on whole-slide images and tissue microarrays (TMAs). (1) Development of a data model capable of efficiently representing and storing virtual slide related image, annotation, markup, and feature information. (2) Development of a database, based on the data model, capable of supporting queries for data retrieval based on analysis and image metadata, queries for comparison of results from different analyses, and spatial queries on segmented regions, features, and classified objects. The work described in this paper is motivated by the challenges associated with characterization of micro-scale features for comparative and correlative analyses involving whole-slide tissue images and TMAs. Technologies for digitizing tissues have advanced significantly in the past decade. Slide scanners are capable of producing high-magnification, high-resolution images from whole slides and TMAs within several minutes. Hence, it is becoming increasingly feasible for basic, clinical, and translational research studies to produce thousands of whole-slide images. Systematic analysis of these large datasets requires efficient data management support for representing and indexing results from hundreds of interrelated analyses generating very large volumes of quantifications such as shape and texture and of classifications of the quantified features. We have designed a data model and a database to address the data management requirements of detailed characterization of micro-anatomic morphology through many interrelated analysis pipelines. The data model represents virtual slide related image, annotation, markup and feature information. The database supports a wide range of metadata and spatial queries on images, annotations, markups, and features. We currently have three databases running on a Dell PowerEdge T410 server with the CentOS 5.5 Linux operating system. The database server is IBM DB2 Enterprise Edition 9.7.2. The set of databases consists of 1) a TMA database containing image analysis results from 4740 cases of breast cancer, with 641 MB storage size; 2) an algorithm validation database, which stores markups and annotations from two segmentation algorithms and two parameter sets on 18 selected slides, with 66 GB storage size; and 3) an in silico brain tumor study database comprising results from 307 TCGA slides, with 365 GB storage size. The latter two databases also contain human-generated annotations and markups for regions and nuclei.
Modeling and managing pathology image analysis results in a database provides immediate benefits for the value and usability of data in a research study. The database provides powerful query capabilities, which are otherwise difficult or cumbersome to support with other approaches such as programming languages. Standardized, semantically annotated data representations and interfaces also make it possible to share image data and analysis results more efficiently.
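A minimal sketch of the kind of metadata-plus-spatial query such a database supports, using SQLite and bounding boxes in place of the full PAIS model and its DB2 implementation; table and column names are illustrative.

```python
# Sketch: store segmentation markups with bounding boxes and compare two
# algorithms' results inside a query region of one slide.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE markup (
    id INTEGER PRIMARY KEY, image_id TEXT, algorithm TEXT,
    min_x REAL, min_y REAL, max_x REAL, max_y REAL, area REAL)""")
con.executemany(
    "INSERT INTO markup VALUES (?,?,?,?,?,?,?,?)",
    [(1, "slide-7", "seg-v1", 10, 10, 40, 35, 612.0),
     (2, "slide-7", "seg-v2", 12, 11, 41, 36, 598.5),
     (3, "slide-7", "seg-v1", 300, 420, 330, 450, 702.3)])

rows = con.execute("""SELECT algorithm, COUNT(*), AVG(area) FROM markup
                      WHERE image_id = 'slide-7'
                        AND min_x >= 0 AND max_x <= 100
                        AND min_y >= 0 AND max_y <= 100
                      GROUP BY algorithm""").fetchall()
print(rows)
```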
Accounting for rainfall spatial variability in the prediction of flash floods
NASA Astrophysics Data System (ADS)
Saharia, Manabendra; Kirstetter, Pierre-Emmanuel; Gourley, Jonathan J.; Hong, Yang; Vergara, Humberto; Flamig, Zachary L.
2017-04-01
Flash floods are a particularly damaging natural hazard worldwide in terms of both fatalities and property damage. In the United States, the lack of a comprehensive database that catalogues information related to flash flood timing, location, causative rainfall, and basin geomorphology has hindered broad characterization studies. First, a representative and long archive of more than 15,000 flooding events during 2002-2011 is used to analyze the spatial and temporal variability of flash floods. We also derive a large number of spatially distributed geomorphological and climatological parameters, such as basin area, mean annual precipitation, and basin slope, to identify static basin characteristics that influence flood response. For the same period, the National Severe Storms Laboratory (NSSL) has produced a decadal archive of Multi-Radar/Multi-Sensor (MRMS) radar-only precipitation rates at 1-km spatial resolution and 5-min temporal resolution. This provides an unprecedented opportunity to analyze the impact of event-level precipitation variability on flooding using a big data approach. To analyze the impact of sub-basin scale rainfall spatial variability on flooding, indices such as the first and second scaled moments of rainfall, the horizontal gap, and the vertical gap are computed from the MRMS dataset. Finally, flooding characteristics such as rise time, lag time, and peak discharge are linked to the derived geomorphologic, climatologic, and rainfall indices to identify basin characteristics that drive flash floods. The database has been subjected to rigorous quality control by accounting for radar beam height and the percentage of snow in basins. So far, studies involving rainfall variability indices have only been performed on a case-study basis, and a large-scale approach is expected to provide deeper insight into how sub-basin scale precipitation variability affects flooding. Finally, these findings are validated using the National Weather Service storm reports and a historical flood fatalities database. This analysis framework will serve as a baseline for evaluating distributed hydrologic model simulations such as the Flooded Locations And Simulated Hydrographs Project (FLASH) (http://flash.ou.edu).
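A minimal sketch of the scaled-moment indices mentioned above, assuming the commonly used Zoccatelli-style definitions (the paper's exact formulation may differ): the first scaled moment compares the rainfall-weighted mean flow distance to the basin mean, and the second does the same for the variance of flow distance.

```python
# Illustrative scaled moments of rainfall over a basin; not the authors' code.
import numpy as np

def scaled_moments(rain, dist):
    """rain: rainfall depth per grid cell; dist: flow distance to the outlet."""
    rain = np.asarray(rain, float)
    dist = np.asarray(dist, float)
    w = rain / rain.sum()                      # rainfall weights
    wmean = (w * dist).sum()                   # rainfall-weighted mean distance
    delta1 = wmean / dist.mean()
    delta2 = (w * (dist - wmean) ** 2).sum() / dist.var()
    return delta1, delta2

rain = np.array([0.0, 1.0, 3.0, 6.0])          # storm totals per cell (mm)
dist = np.array([1.0, 2.0, 3.0, 4.0])          # km to outlet
print(scaled_moments(rain, dist))              # delta1 > 1: rain far from outlet
```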
Toward the automated generation of genome-scale metabolic networks in the SEED.
DeJongh, Matthew; Formsma, Kevin; Boillot, Paul; Gould, John; Rycenga, Matthew; Best, Aaron
2007-04-26
Current methods for the automated generation of genome-scale metabolic networks focus on genome annotation and preliminary biochemical reaction network assembly, but do not adequately address the process of identifying and filling gaps in the reaction network, and verifying that the network is suitable for systems level analysis. Thus, current methods are only sufficient for generating draft-quality networks, and refinement of the reaction network is still largely a manual, labor-intensive process. We have developed a method for generating genome-scale metabolic networks that produces substantially complete reaction networks, suitable for systems level analysis. Our method partitions the reaction space of central and intermediary metabolism into discrete, interconnected components that can be assembled and verified in isolation from each other, and then integrated and verified at the level of their interconnectivity. We have developed a database of components that are common across organisms, and have created tools for automatically assembling appropriate components for a particular organism based on the metabolic pathways encoded in the organism's genome. This focuses manual efforts on that portion of an organism's metabolism that is not yet represented in the database. We have demonstrated the efficacy of our method by reverse-engineering and automatically regenerating the reaction network from a published genome-scale metabolic model for Staphylococcus aureus. Additionally, we have verified that our method capitalizes on the database of common reaction network components created for S. aureus, by using these components to generate substantially complete reconstructions of the reaction networks from three other published metabolic models (Escherichia coli, Helicobacter pylori, and Lactococcus lactis). We have implemented our tools and database within the SEED, an open-source software environment for comparative genome annotation and analysis. Our method sets the stage for the automated generation of substantially complete metabolic networks for over 400 complete genome sequences currently in the SEED. With each genome that is processed using our tools, the database of common components grows to cover more of the diversity of metabolic pathways. This increases the likelihood that components of reaction networks for subsequently processed genomes can be retrieved from the database, rather than assembled and verified manually.
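One step of the gap identification described above can be illustrated with a minimal sketch: flagging dead-end metabolites that are consumed but never produced (or vice versa) as candidate gaps. The reaction encoding here is illustrative and is not the SEED's internal format.

```python
# Hedged sketch of dead-end detection in a draft metabolic reaction network.
def find_gaps(reactions):
    produced, consumed = set(), set()
    for substrates, products in reactions:
        consumed.update(substrates)
        produced.update(products)
    never_produced = consumed - produced   # need an upstream reaction/transport
    never_consumed = produced - consumed   # need a downstream reaction/sink
    return never_produced, never_consumed

rxns = [
    ({"glc", "atp"}, {"g6p", "adp"}),      # toy hexokinase
    ({"g6p"}, {"f6p"}),                    # toy isomerase
]
print(find_gaps(rxns))  # ({'glc', 'atp'}, {'f6p', 'adp'})
```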
A Study of the Efficiency of Spatial Indexing Methods Applied to Large Astronomical Databases
NASA Astrophysics Data System (ADS)
Donaldson, Tom; Berriman, G. Bruce; Good, John; Shiao, Bernie
2018-01-01
Spatial indexing of astronomical databases generally uses quadrature methods, which partition the sky into cells used to create an index (usually a B-tree) written as a database column. We report the results of a study to compare the performance of two common indexing methods, HTM and HEALPix, on Solaris and Windows database servers installed with a PostgreSQL database, and on a Windows server installed with MS SQL Server. The indexing was applied to the 2MASS All-Sky Catalog and to the Hubble Source Catalog. On each server, the study compared indexing performance by submitting 1 million queries at each index level with random sky positions and random cone-search radii, computed on a logarithmic scale between 1 arcsec and 1 degree, and measuring the time to complete the query and write the output. These simulated queries, intended to model realistic use patterns, were run in a uniform way on many combinations of indexing method and indexing level. The query times in all simulations are strongly I/O-bound and are linear with the number of records returned for large numbers of sources. There are, however, considerable differences between simulations, which reveal that hardware I/O throughput is a more important factor in managing the performance of a DBMS than the choice of indexing scheme. The choice of index itself is relatively unimportant: for comparable index levels, the performance is consistent within the scatter of the timings. At small index levels (large cells; e.g. level 4, cell size 3.7 deg), there is large scatter in the timings because of wide variations in the number of sources found in the cells. At larger index levels, performance improves and scatter decreases, but the improvement at level 8 (cell size 14 arcmin) and higher is masked to some extent by the timing scatter caused by the range of query sizes. At very high levels (e.g. 20; 0.0004 arcsec), the granularity of the cells becomes so high that a large number of extraneous empty cells begin to degrade performance. Thus, for the use patterns studied here, the database performance is not critically dependent on the exact choice of index or level.
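The indexing pattern described above can be sketched in a few lines, assuming the healpy library: assign each source a HEALPix cell ID (the column a B-tree would index), then answer a cone search by listing candidate cells and filtering. The catalog values here are toy stand-ins.

```python
# Sketch of HEALPix cell indexing plus a cone search; illustrative only.
import numpy as np
import healpy as hp

nside = 2**8                                   # cell size roughly 14 arcmin
ra  = np.array([10.0, 10.2, 50.0])             # degrees
dec = np.array([20.0, 20.1, -30.0])
cell = hp.ang2pix(nside, ra, dec, lonlat=True) # the indexed column

# Cone search: 0.5 deg around (RA, Dec) = (10, 20).
vec = hp.ang2vec(10.0, 20.0, lonlat=True)
candidates = hp.query_disc(nside, vec, np.radians(0.5), inclusive=True)
mask = np.isin(cell, candidates)               # stand-in for the index scan
print(ra[mask], dec[mask])                     # an exact radius test would follow
```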
MGIS: managing banana (Musa spp.) genetic resources information and high-throughput genotyping data
Guignon, V.; Sempere, G.; Sardos, J.; Hueber, Y.; Duvergey, H.; Andrieu, A.; Chase, R.; Jenny, C.; Hazekamp, T.; Irish, B.; Jelali, K.; Adeka, J.; Ayala-Silva, T.; Chao, C.P.; Daniells, J.; Dowiya, B.; Effa effa, B.; Gueco, L.; Herradura, L.; Ibobondji, L.; Kempenaers, E.; Kilangi, J.; Muhangi, S.; Ngo Xuan, P.; Paofa, J.; Pavis, C.; Thiemele, D.; Tossou, C.; Sandoval, J.; Sutanto, A.; Vangu Paka, G.; Yi, G.; Van den houwe, I.; Roux, N.
2017-01-01
Unraveling the genetic diversity held in genebanks on a large scale is underway, due to advances in next-generation sequencing (NGS)-based technologies that produce high-density genetic markers for a large number of samples at low cost. Genebank users should be in a position to identify and select germplasm from the global genepool based on a combination of passport, genotypic and phenotypic data. To facilitate this, a new generation of information systems is being designed to efficiently handle data and link it with other external resources such as genome or breeding databases. The Musa Germplasm Information System (MGIS), the database for global ex situ-held banana genetic resources, has been developed to address those needs in a user-friendly way. In developing MGIS, we selected a generic database schema (Chado), the robust content management system Drupal for the user interface, and Tripal, a set of Drupal modules which links the Chado schema to Drupal. MGIS allows germplasm collection examination, accession browsing, advanced search functions, and germplasm orders. Additionally, we developed unique graphical interfaces to compare accessions and to explore them based on their taxonomic information. Accession-based data have been enriched with publications, genotyping studies and associated genotyping datasets reporting on germplasm use. Finally, an interoperability layer has been implemented to facilitate the link with complementary databases like the Banana Genome Hub and the MusaBase breeding database. Database URL: https://www.crop-diversity.org/mgis/ PMID:29220435
Akiyama, Kenji; Kurotani, Atsushi; Iida, Kei; Kuromori, Takashi; Shinozaki, Kazuo; Sakurai, Tetsuya
2014-01-01
Arabidopsis thaliana is one of the most popular experimental plants. However, only 40% of its genes have at least one experimental Gene Ontology (GO) annotation assigned. Systematic observation of mutant phenotypes is an important technique for elucidating gene functions. Indeed, several large-scale phenotypic analyses have been performed and have generated phenotypic data sets from many Arabidopsis mutant lines and overexpressing lines, which are freely available online. Because each Arabidopsis mutant line database describes phenotypes in its own way, differences in the structured term sets used by each database make it difficult to compare data sets and impossible to search across databases. Therefore, we obtained publicly available information for a total of 66,209 Arabidopsis mutant lines, including loss-of-function (RATM and TARAPPER) and gain-of-function (AtFOX and OsFOX) lines, and integrated the phenotype data by mapping the descriptions onto Plant Ontology (PO) and Phenotypic Quality Ontology (PATO) terms. This approach made it possible to manage the four different phenotype databases as one large data set. Here, we report a publicly accessible web-based database, the RIKEN Arabidopsis Genome Encyclopedia II (RARGE II; http://rarge-v2.psc.riken.jp/), in which all of the data described in this study are included. Using the database, we demonstrated consistency (in terms of protein function) with a previous study and identified the presumed function of an unknown gene. We provide examples of AT1G21600, which is a subunit in the plastid-encoded RNA polymerase complex, and AT5G56980, which is related to the jasmonic acid signaling pathway.
Peng, Le; Zhang, Chao; Zhou, Lan; Zuo, Hong-Xia; He, Xiao-Kuo; Niu, Yu-Ming
2018-04-01
To investigate the effectiveness of traditional manual acupuncture combined with rehabilitation therapy versus rehabilitation therapy alone for shoulder-hand syndrome after stroke. PubMed, EMBASE, the Cochrane Library, Chinese Biomedicine Database, China National Knowledge Infrastructure, VIP Information Database, Wan Fang Database and reference lists of the eligible studies were searched up to July 2017 for relevant studies. Randomized controlled trials that compared the combined effects of traditional manual acupuncture and rehabilitation therapy to rehabilitation therapy alone for shoulder-hand syndrome after stroke were included. Two reviewers independently screened the searched records, extracted the data and assessed risk of bias of the included studies. The treatment effect sizes were pooled in a meta-analysis using RevMan 5.3 software. A total of 20 studies involving 1918 participants were included in this study. Compared to rehabilitation therapy alone, the combined therapy significantly reduced pain on the visual analogue scale and improved limb movement on the Fugl-Meyer Assessment scale and the performance of activities of daily living (ADL) on the Barthel Index scale or Modified Barthel Index scale. Of these, the visual analogue scale score changes were significantly higher (mean difference = 1.49, 95% confidence interval = 1.15-1.82, P < 0.00001) favoring the combined therapy after treatment, with severe heterogeneity (I² = 71%, P = 0.0005). Current evidence suggests that traditional manual acupuncture integrated with rehabilitation therapy is more effective in alleviating pain, improving limb movement and ADL. However, considering the relatively low quality of available evidence, further rigorously designed and large-scale randomized controlled trials are needed to confirm the results.
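For readers unfamiliar with the pooled statistics quoted above, the following sketch (illustrative, not RevMan's implementation) computes an inverse-variance pooled mean difference under the DerSimonian-Laird random-effects model, together with the I² heterogeneity statistic; the study values are invented.

```python
# Toy random-effects meta-analysis of mean differences.
import numpy as np

def pool_md(md, se):
    md, se = np.asarray(md, float), np.asarray(se, float)
    w = 1 / se**2                                    # fixed-effect weights
    mu_fixed = (w * md).sum() / w.sum()
    q = (w * (md - mu_fixed) ** 2).sum()             # Cochran's Q
    df = len(md) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    tau2 = max(0.0, (q - df) / (w.sum() - (w**2).sum() / w.sum()))
    wr = 1 / (se**2 + tau2)                          # random-effects weights
    est = (wr * md).sum() / wr.sum()
    se_est = np.sqrt(1 / wr.sum())
    return est, (est - 1.96 * se_est, est + 1.96 * se_est), i2

print(pool_md([1.2, 1.8, 1.5], [0.2, 0.3, 0.25]))    # estimate, 95% CI, I^2
```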
Automatic location of L/H transition times for physical studies with a large statistical basis
NASA Astrophysics Data System (ADS)
González, S.; Vega, J.; Murari, A.; Pereira, A.; Dormido-Canto, S.; Ramírez, J. M.; JET-EFDA contributors
2012-06-01
Completely automatic techniques to estimate and validate L/H transition times can be essential in L/H transition analyses. The generation of databases with hundreds of transition times and without human intervention is an important step to accomplish (a) L/H transition physics analysis, (b) validation of L/H theoretical models and (c) creation of L/H scaling laws. An entirely unattended methodology is presented in this paper to build large databases of transition times in JET using time series. The proposed technique has been applied to a dataset of 551 JET discharges between campaigns C21 and C26. For discharges that show a clear signature in the time series, the transition time is located using the localization properties of the wavelet transform. This prediction is accurate, with an uncertainty interval of ±3.2 ms. Discharges without a clear pattern in the time series are handled by an L/H mode classifier built from the discharges with a clear signature. In this case, the estimation error shows a distribution with mean and standard deviation of 27.9 ms and 37.62 ms, respectively. Two different regression methods have been applied to the measurements acquired at the transition times identified by the automatic system. The obtained scaling laws for the threshold power are not significantly different from those obtained using the data at the transition times determined manually by the experts. The automatic methods allow performing physical studies with a large number of discharges, showing, for example, that there are statistically different types of transitions characterized by different scaling laws.
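A toy sketch of wavelet-based transition location, assuming the PyWavelets library; the JET signals and the exact wavelet used by the authors are not reproduced here. A derivative-of-Gaussian wavelet responds strongly to a step, so summing the transform magnitude over scales and taking the maximum localizes the transition.

```python
# Locate an abrupt transition in a noisy time series with a CWT; illustrative.
import numpy as np
import pywt

t = np.linspace(0, 1, 1000)
sig = np.where(t < 0.6, 1.0, 2.0) + 0.05 * np.random.randn(t.size)  # step at 0.6

coefs, _ = pywt.cwt(sig, scales=np.arange(1, 32), wavelet="gaus1")
edge_response = np.abs(coefs).sum(axis=0)        # sum |W(s, t)| over scales
print("estimated transition at t =", t[edge_response.argmax()])     # ~0.6
```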
Virtual Systems Pharmacology (ViSP) software for simulation from mechanistic systems-level models
Ermakov, Sergey; Forster, Peter; Pagidala, Jyotsna; Miladinov, Marko; Wang, Albert; Baillie, Rebecca; Bartlett, Derek; Reed, Mike; Leil, Tarek A.
2014-01-01
Multiple software programs are available for designing and running large-scale system-level pharmacology models used in the drug development process. Depending on the problem, scientists may be forced to use several modeling tools, which can increase model development time, IT costs and so on. Therefore, it is desirable to have a single platform that allows setting up and running large-scale simulations for models that have been developed with different modeling tools. We developed a workflow and a software platform in which a model file is compiled into a self-contained executable that is no longer dependent on the software that was used to create the model. At the same time, the full model specifics are preserved by presenting all model parameters as input parameters for the executable. This platform was implemented as a model-agnostic, therapeutic-area-agnostic, web-based application with a database back-end that can be used to configure, manage and execute large-scale simulations for multiple models by multiple users. The user interface is designed to be easily configurable to reflect the specifics of the model and the user's particular needs, and the back-end database has been implemented to store and manage all aspects of the system, such as Models, Virtual Patients, User Interface Settings, and Results. The platform can be adapted and deployed on an existing cluster or cloud computing environment. Its use was demonstrated with a metabolic disease systems pharmacology model that simulates the effects of two antidiabetic drugs, metformin and fasiglifam, in type 2 diabetes mellitus patients. PMID:25374542
Cutaneous lichen planus: A systematic review of treatments.
Fazel, Nasim
2015-06-01
Various treatment modalities are available for cutaneous lichen planus. PubMed, EMBASE, the Cochrane Database of Systematic Reviews, the Cochrane Central Register of Controlled Trials, the Database of Abstracts of Reviews of Effects, and the Health Technology Assessment Database were searched for all systematic reviews and randomized controlled trials related to cutaneous lichen planus. Two systematic reviews and nine relevant randomized controlled trials were identified. Acitretin, griseofulvin, hydroxychloroquine and narrow-band ultraviolet B are demonstrated to be effective in the treatment of cutaneous lichen planus. Sulfasalazine is effective, but has an unfavorable safety profile. KH1060, a vitamin D analogue, is not beneficial in the management of cutaneous lichen planus. Evidence from large-scale randomized trials demonstrating the safety and efficacy of many other treatment modalities used to treat cutaneous lichen planus is simply not available.
Harb, Omar S; Roos, David S
2015-01-01
Over the past 20 years, advances in high-throughput biological techniques and the availability of computational resources including fast Internet access have resulted in an explosion of large genome-scale data sets ("big data"). While such data are readily available for download, personal use, and analysis from a variety of repositories, such analysis often requires computational skills that are not widely available. As a result, a number of databases have emerged to provide scientists with online tools enabling the interrogation of data without the need for sophisticated computational skills beyond basic knowledge of Internet browser use. This chapter focuses on the Eukaryotic Pathogen Databases (EuPathDB: http://eupathdb.org) Bioinformatic Resource Center (BRC) and illustrates some of the available tools and methods.
Large-scale quantitative analysis of painting arts.
Kim, Daniel; Son, Seung-Woo; Jeong, Hawoong
2014-12-11
Scientists have made efforts to understand the beauty of painting art in their own languages. As digital image acquisition of painting arts has made rapid progress, researchers have come to a point where it is possible to perform statistical analysis of a large-scale database of artistic paintings to build a bridge between art and science. Using digital image processing techniques, we investigate three quantitative measures of images: the usage of individual colors, the variety of colors, and the roughness of the brightness. We found a difference in color usage between classical paintings and photographs, and a significantly low color variety in the medieval period. Interestingly, moreover, the increase in the roughness exponent as painting techniques such as chiaroscuro and sfumato advanced is consistent with historical circumstances.
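A rough illustration of one such measure, not the authors' pipeline: a roughness exponent for an image's brightness field can be estimated from the slope of its radially averaged Fourier power spectrum (the binning and fit range below are simplifying assumptions).

```python
# Estimate a roughness exponent from the power-spectrum slope; illustrative.
import numpy as np

def roughness_exponent(brightness):
    f = np.fft.fftshift(np.fft.fft2(brightness))
    power = np.abs(f) ** 2
    ny, nx = brightness.shape
    y, x = np.indices((ny, nx))
    r = np.hypot(x - nx // 2, y - ny // 2).astype(int)
    # Radially averaged spectrum: mean power in each integer-wavenumber bin.
    radial = np.bincount(r.ravel(), power.ravel()) / np.bincount(r.ravel())
    k = np.arange(1, min(nx, ny) // 2)            # usable wavenumbers
    slope, _ = np.polyfit(np.log(k), np.log(radial[k]), 1)
    return -slope                                  # steeper spectrum = smoother

img = np.random.rand(128, 128)                     # stand-in for a painting scan
print(roughness_exponent(img))                     # ~0 for white noise
```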
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoon Lee, Sang; Hong, Tianzhen; Sawaya, Geof
The paper presents a method and process to establish a database of energy efficiency performance (DEEP) to enable quick and accurate assessment of energy retrofits of commercial buildings. DEEP was compiled from the results of about 35 million EnergyPlus simulations. DEEP provides energy savings for screening and evaluation of retrofit measures targeting small and medium-sized office and retail buildings in California. The prototype building models are developed for a comprehensive assessment of building energy performance based on DOE commercial reference buildings and the California DEER prototype buildings. The prototype buildings represent seven building types across six vintages of construction and 16 California climate zones. DEEP uses these prototypes to evaluate the energy performance of about 100 energy conservation measures covering envelope, lighting, heating, ventilation, air-conditioning, plug loads, and domestic hot water. DEEP consists of the energy simulation results for individual retrofit measures as well as packages of measures, to consider interactive effects between multiple measures. The large-scale EnergyPlus simulations are being conducted on the supercomputers at the National Energy Research Scientific Computing Center of Lawrence Berkeley National Laboratory. The pre-simulation database is part of an ongoing project to develop a web-based retrofit toolkit for small and medium-sized commercial buildings in California, which provides real-time energy retrofit feedback by querying DEEP for recommended measures, estimated energy savings and financial payback period based on users' decision criteria of maximizing energy savings, energy cost savings, carbon reduction, or payback of investment. The pre-simulated database and associated comprehensive measure analysis enhance the ability to perform assessments of retrofits that reduce energy use for small and medium buildings, whose owners typically do not have the resources to conduct a costly building energy audit. DEEP will be migrated into DEnCity, DOE's Energy City, which integrates large-scale energy data into a multi-purpose, open, and dynamic database leveraging diverse sources of existing simulation data.
Addition of a breeding database in the Genome Database for Rosaceae
Evans, Kate; Jung, Sook; Lee, Taein; Brutcher, Lisa; Cho, Ilhyung; Peace, Cameron; Main, Dorrie
2013-01-01
Breeding programs produce large datasets that require efficient management systems to keep track of performance, pedigree, geographical and image-based data. With the development of DNA-based screening technologies, more breeding programs perform genotyping in addition to phenotyping for performance evaluation. The integration of breeding data with other genomic and genetic data is instrumental for the refinement of marker-assisted breeding tools, enhances genetic understanding of important crop traits and maximizes access and utility by crop breeders and allied scientists. Development of new infrastructure in the Genome Database for Rosaceae (GDR) was designed and implemented to enable secure and efficient storage, management and analysis of large datasets from the Washington State University apple breeding program and subsequently expanded to fit datasets from other Rosaceae breeders. The infrastructure was built using the software Chado and Drupal, making use of the Natural Diversity module to accommodate large-scale phenotypic and genotypic data. Breeders can search accessions within the GDR to identify individuals with specific trait combinations. Results from Search by Parentage lists individuals with parents in common and results from Individual Variety pages link to all data available on each chosen individual including pedigree, phenotypic and genotypic information. Genotypic data are searchable by markers and alleles; results are linked to other pages in the GDR to enable the user to access tools such as GBrowse and CMap. This breeding database provides users with the opportunity to search datasets in a fully targeted manner and retrieve and compare performance data from multiple selections, years and sites, and to output the data needed for variety release publications and patent applications. The breeding database facilitates efficient program management. Storing publicly available breeding data in a database together with genomic and genetic data will further accelerate the cross-utilization of diverse data types by researchers from various disciplines. Database URL: http://www.rosaceae.org/breeders_toolbox PMID:24247530
BIG: a large-scale data integration tool for renal physiology
Zhao, Yue; Yang, Chin-Rang; Raghuram, Viswanathan; Parulekar, Jaya
2016-01-01
Due to recent advances in high-throughput techniques, we and others have generated multiple proteomic and transcriptomic databases to describe and quantify gene expression, protein abundance, or cellular signaling on the scale of the whole genome/proteome in kidney cells. The existence of so much data from diverse sources raises the following question: “How can researchers find information efficiently for a given gene product over all of these data sets without searching each data set individually?” This is the type of problem that has motivated the “Big-Data” revolution in Data Science, which has driven progress in fields such as marketing. Here we present an online Big-Data tool called BIG (Biological Information Gatherer) that allows users to submit a single online query to obtain all relevant information from all indexed databases. BIG is accessible at http://big.nhlbi.nih.gov/. PMID:27279488
CELL5M: A geospatial database of agricultural indicators for Africa South of the Sahara.
Koo, Jawoo; Cox, Cindy M; Bacou, Melanie; Azzarri, Carlo; Guo, Zhe; Wood-Sichra, Ulrike; Gong, Queenie; You, Liangzhi
2016-01-01
Recent progress in large-scale georeferenced data collection is widening opportunities for combining multi-disciplinary datasets from biophysical to socioeconomic domains, advancing our analytical and modeling capacity. Granular spatial datasets provide critical information necessary for decision makers to identify target areas, assess baseline conditions, prioritize investment options, set goals and targets and monitor impacts. However, key challenges in reconciling data across themes, scales and borders restrict our capacity to produce global and regional maps and time series. This paper provides an overview of the structure and coverage of CELL5M, an open-access database of geospatial indicators at 5 arc-minute grid resolution, and introduces a range of analytical applications and use cases. CELL5M covers a wide set of agriculture-relevant domains for all countries in Africa South of the Sahara and supports our understanding of the multi-dimensional spatial variability inherent in farming landscapes throughout the region.
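A minimal sketch of the 5 arc-minute gridding that CELL5M implies: mapping a longitude/latitude pair to a grid row/column and a single integer cell ID. The ID convention below is illustrative, not CELL5M's actual scheme.

```python
# Map coordinates onto a global 5 arc-minute grid (4320 x 2160 cells).
CELL = 5 / 60  # 5 arc-minutes in degrees

def cell_id(lon, lat):
    col = int((lon + 180) / CELL)          # 0 .. 4319, from the antimeridian
    row = int((90 - lat) / CELL)           # 0 .. 2159, from the north pole
    return row * 4320 + col

print(cell_id(36.82, -1.29))               # a point near Nairobi
```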
Leaf optical properties shed light on foliar trait variability at individual to global scales
NASA Astrophysics Data System (ADS)
Shiklomanov, A. N.; Serbin, S.; Dietze, M.
2017-12-01
Recent syntheses of large trait databases have contributed immensely to our understanding of drivers of plant function at the global scale. However, the global trade-offs revealed by such syntheses, such as the trade-off between leaf productivity and resilience (i.e. "leaf economics spectrum"), are often absent at smaller scales and fail to correlate with actual functional limitations. An improved understanding of how traits vary among communities, species, and individuals is critical to accurate representations of vegetation ecophysiology and ecological dynamics in ecosystem models. Spectral data from both field observations and remote sensing platforms present a rich and widely available source of information on plant traits. Here, we apply Bayesian inversion of the PROSPECT leaf radiative transfer model to a large global database of over 60,000 field spectra and plant traits to (1) comprehensively assess the accuracy of leaf trait estimation using PROSPECT spectral inversion; (2) investigate the correlations between optical traits estimable from PROSPECT and other important foliar traits such as nitrogen and lignin concentrations; and (3) identify dominant sources of variability and characterize trade-offs in optical and non-optical foliar traits. Our work provides a key methodological contribution by validating physically-based retrieval of plant traits from remote sensing observations, and provides insights about trait trade-offs related to plant acclimation, adaptation, and community assembly.
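The inversion loop described above can be sketched schematically. The forward model `prospect` below is a placeholder: the real PROSPECT model maps leaf traits (e.g., chlorophyll, water, and dry matter contents) to reflectance spectra, and the priors, likelihood, and sampler settings here are simplifying assumptions rather than the authors' configuration.

```python
# Schematic Metropolis sampler for radiative-transfer model inversion.
import numpy as np

def prospect(params, wl):                  # hypothetical stand-in forward model
    a, b = params
    return a * np.exp(-b * wl / wl.max())

def invert(observed, wl, n_iter=5000, step=0.02, sigma=0.01):
    theta = np.array([0.5, 1.0])           # initial trait guess
    def loglik(p):
        return -0.5 * np.sum((prospect(p, wl) - observed) ** 2) / sigma**2
    samples, ll = [], loglik(theta)
    for _ in range(n_iter):                # random-walk Metropolis updates
        prop = theta + step * np.random.randn(2)
        ll_prop = loglik(prop)
        if np.log(np.random.rand()) < ll_prop - ll:
            theta, ll = prop, ll_prop
        samples.append(theta.copy())
    return np.array(samples)

wl = np.linspace(400, 2500, 50)            # wavelengths (nm)
obs = prospect([0.4, 1.2], wl) + 0.01 * np.random.randn(wl.size)
print(invert(obs, wl).mean(axis=0))        # posterior mean near [0.4, 1.2]
```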
ERIC Educational Resources Information Center
Ip, Edward H.; Leung, Phillip; Johnson, Joseph
2004-01-01
We describe the design and implementation of a web-based statistical program--the Interactive Profiler (IP). The prototypical program, developed in Java, was motivated by the need for the general public to query against data collected from the National Assessment of Educational Progress (NAEP), a large-scale US survey of the academic state of…
Radiocarbon Dating the Anthropocene
NASA Astrophysics Data System (ADS)
Chaput, M. A.; Gajewski, K. J.
2015-12-01
The Anthropocene has no agreed start date, since current suggestions for its beginning range from Pre-Industrial times to the Industrial Revolution, and from the mid-twentieth century to the future. To set the boundary of the Anthropocene in geological time, we must first understand when, how and to what extent humans began altering the Earth system. One aspect of this involves reconstructing the effects of prehistoric human activity on the physical landscape. However, for global reconstructions of land use and land cover change to be more accurately interpreted in the context of human interaction with the landscape, large-scale spatio-temporal demographic changes in prehistoric populations must be known. Estimates of the relative number of prehistoric humans in different regions of the world and at different moments in time are needed. To this end, we analyze a dataset of radiocarbon dates from the Canadian Archaeological Radiocarbon Database (CARD), the Palaeolithic Database of Europe and the AustArch Database of Australia, as well as published dates from South America. This is the first time such a large quantity of dates (approximately 60,000) has been mapped and studied at a global scale. Initial results from the analysis of temporal frequency distributions of calibrated radiocarbon dates, assumed to be proportional to population density, will be discussed. The utility of radiocarbon dates in studies of the Anthropocene will be evaluated, and potential links between population density and changes in atmospheric greenhouse gas concentrations, climate, migration patterns and fire frequency will be considered.
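A sketch of the usual way such temporal frequency distributions are built: a summed probability distribution (SPD) over calibrated dates. Each date below is idealized as a Gaussian on the calendar axis; real work would pass lab ages through a calibration curve (e.g., via a package such as iosacal), and the dates here are invented.

```python
# Toy summed probability distribution from calibrated radiocarbon dates.
import numpy as np

years = np.arange(0, 15001)                       # calendar grid (cal yr BP)

def spd(dates, errors):
    total = np.zeros_like(years, dtype=float)
    for mu, sd in zip(dates, errors):
        p = np.exp(-0.5 * ((years - mu) / sd) ** 2)
        total += p / p.sum()                      # each date contributes mass 1
    return total

dates  = [1200, 1250, 4000, 4100, 9000]           # cal yr BP (invented)
errors = [40, 60, 80, 80, 120]
curve = spd(dates, errors)
print("densest period:", years[curve.argmax()], "cal yr BP")
```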
Metaproteomics as a Complementary Approach to Gut Microbiota in Health and Disease
NASA Astrophysics Data System (ADS)
Petriz, Bernardo A.; Franco, Octávio L.
2017-01-01
Classic studies on phylotype profiling are limited to the identification of microbial constituents, where information is lacking about the molecular interaction of these bacterial communities with the host genome and the possible outcomes in host biology. A range of OMICs approaches have provided great progress linking the microbiota to health and disease. However, the investigation of this context through proteomic mass spectrometry-based tools is still being improved. Therefore, metaproteomics or community proteogenomics has emerged as a complementary approach to metagenomic data, as a field in proteomics aiming to perform large-scale characterization of proteins from environmental microbiota such as the human gut. The advances in molecular separation methods coupled with mass spectrometry (e.g. LC-MS/MS) and proteome bioinformatics have been fundamental in these novel large-scale metaproteomic studies, which have further been performed in a wide range of samples including soil, plant and human environments. Metaproteomic studies will make major progress if a comprehensive database covering the genes and expressed proteins from all gut microbial species is developed. To this end, we here present some of the main limitations of metaproteomic studies in complex microbiota environments such as the gut, also addressing the up-to-date pipelines in sample preparation prior to fractionation/separation and mass spectrometry analysis. In addition, a novel approach to the limitations of metagenomic databases is also discussed. Finally, prospects are addressed regarding the application of metaproteomic analysis using a unified host-microbiome gene database and other meta-OMICs platforms.
Large Scale Analyses and Visualization of Adaptive Amino Acid Changes Projects.
Vázquez, Noé; Vieira, Cristina P; Amorim, Bárbara S R; Torres, André; López-Fernández, Hugo; Fdez-Riverola, Florentino; Sousa, José L R; Reboiro-Jato, Miguel; Vieira, Jorge
2018-03-01
When changes at few amino acid sites are the target of selection, adaptive amino acid changes in protein sequences can be identified using maximum-likelihood methods based on models of codon substitution (such as codeml). Although such methods have been employed numerous times using a variety of different organisms, the time needed to collect the data and prepare the input files means that tens or hundreds of coding regions are usually analyzed. Nevertheless, the recent availability of flexible and easy to use computer applications that collect relevant data (such as BDBM) and infer positively selected amino acid sites (such as ADOPS), means that the entire process is easier and quicker than before. However, the lack of a batch option in ADOPS, here reported, still precludes the analysis of hundreds or thousands of sequence files. Given the interest and possibility of running such large-scale projects, we have also developed a database where ADOPS projects can be stored. Therefore, this study also presents the B+ database, which is both a data repository and a convenient interface that looks at the information contained in ADOPS projects without the need to download and unzip the corresponding ADOPS project file. The ADOPS projects available at B+ can also be downloaded, unzipped, and opened using the ADOPS graphical interface. The availability of such a database ensures results repeatability, promotes data reuse with significant savings on the time needed for preparing datasets, and effortlessly allows further exploration of the data contained in ADOPS projects.
Zamami, Yoshito; Niimura, Takahiro; Takechi, Kenshi; Imanishi, Masaki; Koyama, Toshihiro; Ishizawa, Keisuke
2017-01-01
Approximately 100,000 people suffer cardiopulmonary arrest in Japan every year, and the aging of society means that this number is expected to increase. Worldwide, approximately 100 million people develop cardiac arrest annually, making it an international issue. Although survival has improved thanks to advances in cardiopulmonary resuscitation, there is a high rate of postresuscitation encephalopathy after the return of spontaneous circulation, and the proportion of patients who can return to normal life is extremely low. Treatment for postresuscitation encephalopathy is long term, and if sequelae persist then nursing care is required, causing immeasurable economic burdens as a result of ballooning medical costs. As there is at present no drug treatment to improve postresuscitation encephalopathy as a complication of cardiopulmonary arrest, the development of novel drug treatments is desirable. In recent years, new efficacy for existing drugs used in the clinical setting has been discovered, and drug repositioning has been proposed as a strategy for developing those drugs as therapeutic agents for different diseases. This review describes a large-scale database study carried out following a discovery strategy for drug repositioning, with the objective of improving survival rates after cardiopulmonary arrest, and discusses future repositioning prospects.
Managing Large Scale Project Analysis Teams through a Web Accessible Database
NASA Technical Reports Server (NTRS)
O'Neil, Daniel A.
2008-01-01
Large scale space programs analyze thousands of requirements while mitigating safety, performance, schedule, and cost risks. These efforts involve a variety of roles with interdependent use cases and goals. For example, study managers and facilitators identify ground rules and assumptions for a collection of studies required for a program or project milestone. Task leaders derive product requirements from the ground rules and assumptions and describe activities to produce needed analytical products. Discipline specialists produce the specified products and load results into a file management system. Organizational and project managers provide the personnel and funds to conduct the tasks. Each role has responsibilities to establish information linkages and provide status reports to management. Projects conduct design and analysis cycles to refine designs to meet the requirements and implement risk mitigation plans. At the program level, integrated design and analysis cycle studies are conducted to eliminate every 'to-be-determined' and develop plans to mitigate every risk. At the agency level, strategic studies analyze different approaches to exploration architectures and campaigns. This paper describes a web-accessible database developed by NASA to coordinate and manage tasks at three organizational levels. Other topics in this paper cover integration technologies and techniques for process modeling and enterprise architectures.
NASA Astrophysics Data System (ADS)
Do, Hong; Gudmundsson, Lukas; Leonard, Michael; Westra, Seth; Senerivatne, Sonia
2017-04-01
In-situ observations of daily streamflow with global coverage are a crucial asset for understanding large-scale freshwater resources, which are an essential component of the Earth system and a prerequisite for societal development. Here we present the Global Streamflow Indices and Metadata archive (G-SIM), a collection of indices derived from more than 20,000 daily streamflow time series across the globe. These indices are designed to support global assessments of change in wet and dry extremes, and have been compiled from 12 free-to-access online databases (seven national databases and five international collections). The G-SIM archive also includes significant metadata to support detailed understanding of streamflow dynamics, including drainage-area shapefiles and many essential catchment properties such as land cover type, soil and topographic characteristics. The automated data handling and quality control procedures of the project make G-SIM a reproducible, extendible archive that can be utilised for many purposes in large-scale hydrology. Some potential applications include the identification of observational trends in hydrological extremes, the assessment of climate change impacts on streamflow regimes, and the validation of global hydrological models.
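A minimal pandas sketch of the kind of indices such an archive derives from a daily discharge series, here annual maxima (wet extremes) and annual 7-day minima (dry extremes); the series below is synthetic and the index choices are illustrative, not G-SIM's exact definitions.

```python
# Derive simple wet/dry streamflow indices from a daily series.
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", "2002-12-31", freq="D")
q = pd.Series(np.random.gamma(2.0, 5.0, idx.size), index=idx)  # fake discharge

annual_max = q.groupby(q.index.year).max()                 # annual flood peaks
q7 = q.rolling(7).mean()                                   # 7-day mean flow
annual_min7 = q7.groupby(q7.index.year).min()              # annual low flows
print(pd.DataFrame({"AMAX": annual_max, "MIN7": annual_min7}))
```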
Li, Hui; Li, Defang; Chen, Anguo; Tang, Huijuan; Li, Jianjun; Huang, Siqi
2016-01-01
Kenaf (Hibiscus cannabinus L.) is an economically important natural fiber crop grown worldwide. However, only 20 expressed sequence tags (ESTs) for kenaf are available in public databases. The aim of this study was to develop large-scale simple sequence repeat (SSR) markers to lay a solid foundation for the construction of genetic linkage maps and marker-assisted breeding in kenaf. We used Illumina paired-end sequencing technology to generate new EST sequences and MISA software to mine SSR markers. We identified 71,318 unigenes with an average length of 1143 nt and annotated these unigenes using four different protein databases. Overall, 9324 complementary pairs were designated as EST-SSR markers, and their quality was validated using 100 randomly selected SSR markers. In total, 72 primer pairs reproducibly amplified target amplicons, and 61 of these primer pairs detected significant polymorphism among 28 kenaf accessions. Thus, in this study, we have developed large-scale SSR markers for kenaf, and this new resource will facilitate the construction of genetic linkage maps and the investigation of fiber growth and development in kenaf, and will also be of value for novel gene discovery and functional genomic studies. PMID:26960153
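A toy version of what an SSR-mining step such as MISA does: scan sequences for short tandem repeats. The motif-length and repeat-count thresholds below are illustrative, not MISA's defaults.

```python
# Find simple sequence repeats (2-6 bp motifs repeated at least 5 times).
import re

SSR = re.compile(r"([ACGT]{2,6}?)\1{4,}")

def find_ssrs(seq):
    return [(m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
            for m in SSR.finditer(seq)]

unigene = "TTGCAGAGAGAGAGAGCCGTATATATATATATGGC"
print(find_ssrs(unigene))   # [(4, 'AG', 6), (19, 'TA', 6)]
```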
The Camden & Islington Research Database: Using electronic mental health records for research.
Werbeloff, Nomi; Osborn, David P J; Patel, Rashmi; Taylor, Matthew; Stewart, Robert; Broadbent, Matthew; Hayes, Joseph F
2018-01-01
Electronic health records (EHRs) are widely used in mental health services. Case registers using EHRs from secondary mental healthcare have the potential to deliver large-scale projects evaluating mental health outcomes in real-world clinical populations. We describe the Camden and Islington NHS Foundation Trust (C&I) Research Database which uses the Clinical Record Interactive Search (CRIS) tool to extract and de-identify routinely collected clinical information from a large UK provider of secondary mental healthcare, and demonstrate its capabilities to answer a clinical research question regarding time to diagnosis and treatment of bipolar disorder. The C&I Research Database contains records from 108,168 mental health patients, of which 23,538 were receiving active care. The characteristics of the patient population are compared to those of the catchment area, of London, and of England as a whole. The median time to diagnosis of bipolar disorder was 76 days (interquartile range: 17-391) and median time to treatment was 37 days (interquartile range: 5-194). Compulsory admission under the UK Mental Health Act was associated with shorter intervals to diagnosis and treatment. Prior diagnoses of other psychiatric disorders were associated with longer intervals to diagnosis, though prior diagnoses of schizophrenia and related disorders were associated with decreased time to treatment. The CRIS tool, developed by the South London and Maudsley NHS Foundation Trust (SLaM) Biomedical Research Centre (BRC), functioned very well at C&I. It is reassuring that data from different organizations deliver similar results, and that applications developed in one Trust can then be successfully deployed in another. The information can be retrieved in a quicker and more efficient fashion than more traditional methods of health research. The findings support the secondary use of EHRs for large-scale mental health research in naturalistic samples and settings investigated across large, diverse geographical areas.
Automatic initialization and quality control of large-scale cardiac MRI segmentations.
Albà, Xènia; Lekadir, Karim; Pereañez, Marco; Medrano-Gracia, Pau; Young, Alistair A; Frangi, Alejandro F
2018-01-01
Continuous advances in imaging technologies enable ever more comprehensive phenotyping of human anatomy and physiology. Concomitant reduction of imaging costs has resulted in widespread use of imaging in large clinical trials and population imaging studies. Magnetic Resonance Imaging (MRI), in particular, offers one-stop-shop multidimensional biomarkers of cardiovascular physiology and pathology. A wide range of analysis methods offer sophisticated cardiac image assessment and quantification for clinical and research studies. However, most methods have only been evaluated on relatively small databases often not accessible for open and fair benchmarking. Consequently, published performance indices are not directly comparable across studies, and their translation and scalability to large clinical trials or population imaging cohorts is uncertain. Most existing techniques still rely on considerable manual intervention for the initialization and quality control of the segmentation process, becoming prohibitive when dealing with thousands of images. The contributions of this paper are three-fold. First, we propose a fully automatic method for initializing cardiac MRI segmentation, by using image features and random forests regression to predict an initial position of the heart and key anatomical landmarks in an MRI volume. In processing a full imaging database, the technique predicts the optimal corrective displacements and positions in relation to the initial rough intersections of the long and short axis images. Second, we introduce for the first time a quality control measure capable of identifying incorrect cardiac segmentations with no visual assessment. The method uses statistical, pattern and fractal descriptors in a random forest classifier to detect failures to be corrected or removed from subsequent statistical analysis. Finally, we validate these new techniques within a full pipeline for cardiac segmentation applicable to large-scale cardiac MRI databases. The results obtained based on over 1200 cases from the Cardiac Atlas Project show the promise of fully automatic initialization and quality control for population studies.
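A conceptual sketch of the two ideas above using scikit-learn stand-ins: a random forest regressor predicting a landmark position from per-volume image features, and a random forest classifier flagging failed segmentations from shape descriptors. The features and labels are random placeholders, not the paper's descriptors.

```python
# Random-forest landmark initialization and segmentation quality control.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(0)
X_feat = rng.normal(size=(200, 16))            # per-volume image features
y_land = rng.normal(size=(200, 3))             # (x, y, z) of a landmark

locator = RandomForestRegressor(n_estimators=100).fit(X_feat, y_land)
print("predicted landmark:", locator.predict(X_feat[:1]))

X_shape = rng.normal(size=(200, 8))            # statistical/fractal descriptors
y_ok = rng.integers(0, 2, size=200)            # 1 = acceptable segmentation

qc = RandomForestClassifier(n_estimators=100).fit(X_shape, y_ok)
print("QC pass probability:", qc.predict_proba(X_shape[:1])[0, 1])
```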
Carr, T.R.; Merriam, D.F.; Bartley, J.D.
2005-01-01
Large-scale relational databases and geographic information system tools are used to integrate temperature, pressure, and water geochemistry data from numerous wells to better understand regional-scale geothermal and hydrogeological regimes of the lower Paleozoic aquifer systems in the mid-continent and to evaluate their potential for geologic CO2 sequestration. The lower Paleozoic (Cambrian to Mississippian) aquifer systems in Kansas, Missouri, and Oklahoma comprise one of the largest regional-scale saline aquifer systems in North America. Understanding the hydrologic conditions and processes of these regional-scale aquifer systems provides insight into the evolution of the various sedimentary basins, the migration of hydrocarbons out of the Anadarko and Arkoma basins, and the distribution of Arbuckle petroleum reservoirs across Kansas, and provides a basis to evaluate CO2 sequestration potential. The Cambrian and Ordovician stratigraphic units form a saline aquifer that is in hydrologic continuity with freshwater recharge from the Ozark plateau and along the Nemaha anticline. The hydrologic continuity with areas of freshwater recharge provides an explanation for the apparent underpressure in the Arbuckle Group.
Analysis on the Critical Rainfall Value for Predicting Large Scale Landslides Caused by Heavy Rainfall in Taiwan
NASA Astrophysics Data System (ADS)
Tsai, Kuang-Jung; Chiang, Jie-Lun; Lee, Ming-Hsi; Chen, Yie-Ruey
2017-04-01
Accumulated rainfall of more than 2,900 mm within three consecutive days was recorded during Typhoon Morakot in August 2009. Very serious landslides and sediment-related disasters were induced by this heavy rainfall event. The satellite image analysis project conducted by the Soil and Water Conservation Bureau after the Morakot event identified more than 10,904 landslide sites, with a total sliding area of 18,113 ha. At the same time, all severe sediment-related disaster areas were characterized based on their disaster type, scale, topography, major bedrock formations, and geologic structures during the period of extremely heavy rainfall events that occurred in southern Taiwan. The characteristics and mechanisms of large scale landslides are collected on the basis of field investigation technology integrated with GPS/GIS/RS techniques. In order to decrease the risk of large scale landslides on slope land, a strategy for slope land conservation and a critical rainfall database should be established and put into operation as soon as possible. Meanwhile, establishing a critical rainfall value for predicting large scale landslides induced by heavy rainfall has become an important issue of serious concern to the government and the people of Taiwan. The mechanism of large scale landslides, rainfall frequency analysis, sediment budget estimation, and river hydraulic analysis under the extreme climate change of the past 10 years are addressed as required issues by this research. Hopefully, all results developed from this research can be used as a warning system for predicting large scale landslides in southern Taiwan. Keywords: Heavy Rainfall, Large Scale Landslides, Critical Rainfall Value
ICA model order selection of task co-activation networks
Ray, Kimberly L.; McKay, D. Reese; Fox, Peter M.; Riedel, Michael C.; Uecker, Angela M.; Beckmann, Christian F.; Smith, Stephen M.; Fox, Peter T.; Laird, Angela R.
2013-01-01
Independent component analysis (ICA) has become a widely used method for extracting functional networks in the brain during rest and task. Historically, preferred ICA dimensionality has widely varied within the neuroimaging community, but typically varies between 20 and 100 components. This can be problematic when comparing results across multiple studies because of the impact ICA dimensionality has on the topology of its resultant components. Recent studies have demonstrated that ICA can be applied to peak activation coordinates archived in a large neuroimaging database (i.e., BrainMap Database) to yield whole-brain task-based co-activation networks. A strength of applying ICA to BrainMap data is that the vast amount of metadata in BrainMap can be used to quantitatively assess tasks and cognitive processes contributing to each component. In this study, we investigated the effect of model order on the distribution of functional properties across networks as a method for identifying the most informative decompositions of BrainMap-based ICA components. Our findings suggest dimensionality of 20 for low model order ICA to examine large-scale brain networks, and dimensionality of 70 to provide insight into how large-scale networks fractionate into sub-networks. We also provide a functional and organizational assessment of visual, motor, emotion, and interoceptive task co-activation networks as they fractionate from low to high model-orders. PMID:24339802
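The decomposition choice discussed above can be sketched with FastICA from scikit-learn, comparing a low model order (20) with a high one (70). The data matrix here is random; in the actual workflow, BrainMap peak coordinates would first be converted to modeled-activation images.

```python
# Compare ICA model orders on an experiments-by-voxels matrix; illustrative.
import numpy as np
from sklearn.decomposition import FastICA

X = np.random.rand(500, 2000)        # experiments x voxels (toy stand-in)

for n_components in (20, 70):        # low vs high model order
    ica = FastICA(n_components=n_components, max_iter=500, random_state=0)
    sources = ica.fit_transform(X)   # per-experiment loadings
    maps = ica.components_           # component spatial maps (n x voxels)
    print(n_components, maps.shape)
```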
hEIDI: An Intuitive Application Tool To Organize and Treat Large-Scale Proteomics Data.
Hesse, Anne-Marie; Dupierris, Véronique; Adam, Claire; Court, Magali; Barthe, Damien; Emadali, Anouk; Masselon, Christophe; Ferro, Myriam; Bruley, Christophe
2016-10-07
Advances in high-throughput proteomics have led to a rapid increase in the number, size, and complexity of the associated data sets. Managing and extracting reliable information from such large series of data sets require the use of dedicated software organized in a consistent pipeline to reduce, validate, exploit, and ultimately export data. The compilation of multiple mass-spectrometry-based identification and quantification results obtained in the context of a large-scale project represents a real challenge for developers of bioinformatics solutions. In response to this challenge, we developed a dedicated software suite called hEIDI to manage and combine both identifications and semiquantitative data related to multiple LC-MS/MS analyses. This paper describes how, through a user-friendly interface, hEIDI can be used to compile analyses and retrieve lists of nonredundant protein groups. Moreover, hEIDI allows direct comparison of series of analyses, on the basis of protein groups, while ensuring consistent protein inference and also computing spectral counts. hEIDI ensures that validated results are compliant with MIAPE guidelines, as all information related to samples and results is stored in appropriate databases. Thanks to the database structure, validated results generated within hEIDI can be easily exported in the PRIDE XML format for subsequent publication. hEIDI can be downloaded from http://biodev.extra.cea.fr/docs/heidi.
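A toy illustration of the spectral-counting comparison mentioned above (not hEIDI's implementation): tally the number of identified spectra per protein group in each LC-MS/MS run, then line the counts up across runs. The run names and protein-group IDs are invented.

```python
# Spectral counts per protein group, compared across two analyses.
from collections import Counter

runs = {
    "run1": ["P1", "P1", "P2", "P3", "P1"],  # protein group per identified spectrum
    "run2": ["P1", "P2", "P2", "P2"],
}
counts = {run: Counter(ids) for run, ids in runs.items()}
proteins = sorted(set().union(*counts.values()))
for p in proteins:
    print(p, [counts[r].get(p, 0) for r in runs])   # e.g. P2 [1, 3]
```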
Designing for Peta-Scale in the LSST Database
NASA Astrophysics Data System (ADS)
Kantor, J.; Axelrod, T.; Becla, J.; Cook, K.; Nikolaev, S.; Gray, J.; Plante, R.; Nieto-Santisteban, M.; Szalay, A.; Thakar, A.
2007-10-01
The Large Synoptic Survey Telescope (LSST), a proposed ground-based 8.4 m telescope with a 10 deg^2 field of view, will generate 15 TB of raw images every observing night. When calibration and processed data are added, the image archive, catalogs, and meta-data will grow by 15 PB yr^{-1} on average. The LSST Data Management System (DMS) must capture, process, store, index, replicate, and provide open access to this data. Alerts must be triggered within 30 s of data acquisition. To do this in real time at these data volumes will require advances in data management, database, and file system techniques. This paper describes the design of the LSST DMS and emphasizes features for peta-scale data. The LSST DMS will employ a combination of distributed database and file systems, with schema, partitioning, and indexing oriented for parallel operations. Image files are stored in a distributed file system with references to, and meta-data from, each file stored in the databases. The schema design supports pipeline processing, rapid ingest, and efficient query. Vertical partitioning reduces disk input/output requirements, while horizontal partitioning allows parallel data access using arrays of servers and disks. Indexing is extensive, utilizing both conventional RAM-resident indexes and column-narrow, row-deep tag tables/covering indices that are extracted from tables that contain many more attributes. The DMS Data Access Framework is encapsulated in a middleware framework to provide a uniform service interface to all framework capabilities. This framework will provide the automated work-flow, replication, and data analysis capabilities necessary to make data processing and data quality analysis feasible at this scale.
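The horizontal-partitioning idea can be sketched in a few lines. The chunking scheme and function below are illustrative assumptions, not the actual LSST partitioning design: objects are mapped to coarse sky chunks so that arrays of servers can scan chunks in parallel.

```python
# A minimal sketch of horizontal partitioning by sky position: objects are
# assigned to rectangular chunks so parallel servers can each scan a chunk.
# The chunking scheme and names are illustrative, not the LSST design.
def chunk_id(ra_deg: float, dec_deg: float, chunk_size_deg: float = 5.0) -> int:
    """Map a sky position to a coarse rectangular chunk identifier."""
    n_ra = int(360 / chunk_size_deg)
    i_ra = int(ra_deg % 360 / chunk_size_deg)
    i_dec = int((dec_deg + 90) / chunk_size_deg)
    return i_dec * n_ra + i_ra

# Route a few example objects to their chunks (and hence to servers).
objects = [(10.68, 41.27), (10.70, 41.30), (266.42, -29.01)]
for ra, dec in objects:
    print((ra, dec), "-> chunk", chunk_id(ra, dec))
```

Nearby objects land in the same chunk, so spatially constrained queries touch few chunks, while full-sky scans parallelize across all of them.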
Vermeerbergen, Lander; Van Hootegem, Geert; Benders, Jos
2017-02-01
Ongoing shortages of care workers, together with an ageing population, make it of utmost importance to increase the quality of working life in nursing homes. Since the 1970s, normalised and small-scale nursing homes have been increasingly introduced to provide care in a family and homelike environment, potentially providing a richer work life for care workers as well as improved living conditions for residents. 'Normalised' refers to the opportunities given to residents to live in a manner as close as possible to the everyday life of persons not needing care. The study purpose is to provide a synthesis and overview of empirical research comparing the quality of working life - together with related work and health outcomes - of professional care workers in normalised small-scale nursing homes as compared to conventional large-scale ones. A systematic review of qualitative and quantitative studies. A systematic literature search (April 2015) was performed using the electronic databases Pubmed, Embase, PsycInfo, CINAHL and Web of Science. References and citations were tracked to identify additional, relevant studies. We identified 825 studies in the selected databases. After checking the inclusion and exclusion criteria, nine studies were selected for review. Two additional studies were selected after reference and citation tracking. Three studies were excluded after requesting more information on the research setting. The findings from the individual studies suggest that levels of job control and job demands (all but "time pressure") are higher in normalised small-scale homes than in conventional large-scale nursing homes. Additionally, some studies suggested that social support and work motivation are higher, while risks of burnout and mental strain are lower, in normalised small-scale nursing homes. Other studies found no differences or even opposing findings. The studies reviewed showed that these inconclusive findings can be attributed to care workers in some normalised small-scale homes experiencing isolation and too high job demands in their work roles. This systematic review suggests that normalised small-scale homes are a good starting point for creating a higher quality of working life in the nursing home sector. Higher job control enables care workers to manage higher job demands in normalised small-scale homes. However, some jobs would benefit from interventions to address care workers' perceptions of too low social support and of too high job demands. More research is needed to examine strategies to enhance these working life issues in normalised small-scale settings.
Ice-Accretion Test Results for Three Large-Scale Swept-Wing Models in the NASA Icing Research Tunnel
NASA Technical Reports Server (NTRS)
Broeren, Andy P.; Potapczuk, Mark G.; Lee, Sam; Malone, Adam M.; Paul, Bernard P., Jr.; Woodard, Brian S.
2016-01-01
Icing simulation tools and computational fluid dynamics codes are reaching levels of maturity such that they are being proposed by manufacturers for use in certification of aircraft for flight in icing conditions with increasingly less reliance on natural-icing flight testing and icing-wind-tunnel testing. Sufficient high-quality data to evaluate the performance of these tools is not currently available. The objective of this work was to generate a database of ice-accretion geometry that can be used for development and validation of icing simulation tools as well as for aerodynamic testing. Three large-scale swept wing models were built and tested at the NASA Glenn Icing Research Tunnel (IRT). The models represented the Inboard (20% semispan), Midspan (64% semispan) and Outboard stations (83% semispan) of a wing based upon a 65% scale version of the Common Research Model (CRM). The IRT models utilized a hybrid design that maintained the full-scale leading-edge geometry with a truncated afterbody and flap. The models were instrumented with surface pressure taps in order to acquire sufficient aerodynamic data to verify the hybrid model design capability to simulate the full-scale wing section. A series of ice-accretion tests were conducted over a range of total temperatures from -23.8 deg C to -1.4 deg C with all other conditions held constant. The results showed the changing ice-accretion morphology from rime ice at the colder temperatures to highly 3-D scallop ice in the range of -11.2 deg C to -6.3 deg C. Warmer temperatures generated highly 3-D ice accretion with glaze ice characteristics. The results indicated that the general scallop ice morphology was similar for all three models. Icing results were documented for limited parametric variations in angle of attack, drop size and cloud liquid-water content (LWC). The effect of velocity on ice accretion was documented for the Midspan and Outboard models for a limited number of test cases. The data suggest that there are morphological characteristics of glaze and scallop ice accretion on these swept-wing models that are dependent upon the velocity. This work has resulted in a large database of ice-accretion geometry on large-scale, swept-wing models.
JEnsembl: a version-aware Java API to Ensembl data systems
Paterson, Trevor; Law, Andy
2012-01-01
Motivation: The Ensembl Project provides release-specific Perl APIs for efficient high-level programmatic access to data stored in various Ensembl database schema. Although Perl scripts are perfectly suited for processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications or for embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-orientated development of new bioinformatics tools with which to access, analyse and visualize Ensembl data. Results: The JEnsembl API implementation provides basic data retrieval and manipulation functionality from the Core, Compara and Variation databases for all species in Ensembl and EnsemblGenomes and is a platform for the development of a richer API to Ensembl data sources. The JEnsembl architecture uses a text-based configuration module to provide evolving, versioned mappings from database schema to code objects. A single installation of the JEnsembl API can therefore simultaneously and transparently connect to current and previous database instances (such as those in the public archive) thus facilitating better analysis repeatability and allowing ‘through time’ comparative analyses to be performed. Availability: Project development, released code libraries, Maven repository and documentation are hosted at SourceForge (http://jensembl.sourceforge.net). Contact: jensembl-develop@lists.sf.net, andy.law@roslin.ed.ac.uk, trevor.paterson@roslin.ed.ac.uk PMID:22945789
McCrae, Robert R; Scally, Matthew; Terracciano, Antonio; Abecasis, Gonçalo R; Costa, Paul T
2010-12-01
There is growing evidence that personality traits are affected by many genes, all of which have very small effects. As an alternative to the largely unsuccessful search for individual polymorphisms associated with personality traits, the authors identified large sets of potentially related single nucleotide polymorphisms (SNPs) and summed them to form molecular personality scales (MPSs) comprising from 4 to 2,497 SNPs. Scales were derived from two thirds of a large (N = 3,972) sample of individuals from Sardinia who completed the Revised NEO Personality Inventory (P. T. Costa, Jr., & R. R. McCrae, 1992) and were assessed in a genomewide association scan. When MPSs were correlated with the phenotype in the remaining one third of the sample, very small but significant associations were found for four of the five personality factors when the longest scales were examined. These data suggest that MPSs for Neuroticism, Openness to Experience, Agreeableness, and Conscientiousness (but not Extraversion) contain genetic information that can be refined in future studies, and the procedures described here should be applicable to other quantitative traits.
NASA Astrophysics Data System (ADS)
McGranaghan, Ryan M.; Mannucci, Anthony J.; Forsyth, Colin
2017-12-01
We explore the characteristics, controlling parameters, and relationships of multiscale field-aligned currents (FACs) using a rigorous, comprehensive, and cross-platform analysis. Our unique approach combines FAC data from the Swarm satellites and the Advanced Magnetosphere and Planetary Electrodynamics Response Experiment (AMPERE) to create a database of small-scale (˜10-150 km, <1° latitudinal width), mesoscale (˜150-250 km, 1-2° latitudinal width), and large-scale (>250 km) FACs. We examine these data for the repeatable behavior of FACs across scales (i.e., the characteristics), the dependence on the interplanetary magnetic field orientation, and the degree to which each scale "departs" from nominal large-scale specification. We retrieve new information by utilizing magnetic latitude and local time dependence, correlation analyses, and quantification of the departure of smaller from larger scales. We find that (1) FAC characteristics and dependence on controlling parameters do not map between scales in a straightforward manner, (2) relationships between FAC scales exhibit local time dependence, and (3) the dayside high-latitude region is characterized by remarkably distinct FAC behavior when analyzed at different scales, and the locations of distinction correspond to "anomalous" ionosphere-thermosphere behavior. Comparing with nominal large-scale FACs, we find that differences are characterized by a horseshoe shape, maximizing across dayside local times, and that difference magnitudes increase when smaller-scale observed FACs are considered. We suggest that both new physics and increased resolution of models are required to address the multiscale complexities. We include a summary table of our findings to provide a quick reference for differences between multiscale FACs.
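As a minimal illustration of the scale definitions quoted above, the toy classifier below bins FAC structures by latitudinal width; the thresholds follow the abstract, but the function itself is hypothetical, not the authors' processing code.

```python
# Toy sketch: classifying field-aligned current (FAC) structures by
# latitudinal width using the scale thresholds quoted in the abstract
# (<1 deg small, 1-2 deg mesoscale, wider = large scale). Illustrative only.
def fac_scale(width_deg: float) -> str:
    if width_deg < 1.0:
        return "small-scale (~10-150 km)"
    if width_deg <= 2.0:
        return "mesoscale (~150-250 km)"
    return "large-scale (>250 km)"

for w in (0.4, 1.5, 3.2):
    print(f"{w} deg latitudinal width -> {fac_scale(w)}")
```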
DOE Office of Scientific and Technical Information (OSTI.GOV)
Park, Yubin; Shankar, Mallikarjun; Park, Byung H.
Designing a database system for both efficient data management and data services has been one of the enduring challenges in the healthcare domain. In many healthcare systems, data services and data management are often viewed as two orthogonal tasks; data services refer to retrieval and analytic queries such as search, joins, statistical data extraction, and simple data mining algorithms, while data management refers to building error-tolerant and non-redundant database systems. The gap between service and management has resulted in rigid database systems and schemas that do not support effective analytics. We compose a rich graph structure from an abstracted healthcare RDBMS to illustrate how we can fill this gap in practice. We show how a healthcare graph can be automatically constructed from a normalized relational database using the proposed 3NF Equivalent Graph (3EG) transformation. We discuss a set of real world graph queries such as finding self-referrals, shared providers, and collaborative filtering, and evaluate their performance over a relational database and its 3EG-transformed graph. Experimental results show that the graph representation serves as multiple de-normalized tables, thus reducing complexity in a database and enhancing data accessibility of users. Based on this finding, we propose an ensemble framework of databases for healthcare applications.
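The core of a relational-to-graph transformation can be sketched as: rows become nodes and foreign-key references become edges. The example below uses networkx with a hypothetical two-table schema; it illustrates the idea only and is not the authors' 3EG implementation.

```python
# A hedged sketch of the idea behind a 3NF-to-graph transformation:
# rows of normalized tables become nodes, and foreign-key relationships
# become edges. Table and column names are hypothetical.
import networkx as nx

patients = [{"id": "p1"}, {"id": "p2"}]
providers = [{"id": "d1"}, {"id": "d2"}]
visits = [  # foreign keys to patient and provider
    {"id": "v1", "patient": "p1", "provider": "d1"},
    {"id": "v2", "patient": "p2", "provider": "d1"},
]

g = nx.Graph()
for row in patients + providers:
    g.add_node(row["id"])
for v in visits:
    g.add_node(v["id"])
    g.add_edge(v["id"], v["patient"])    # foreign key -> edge
    g.add_edge(v["id"], v["provider"])

# "Shared providers": provider nodes reached by more than one visit.
shared = [n for n in g.nodes if n.startswith("d") and g.degree(n) > 1]
print("providers shared across visits:", shared)
```

Queries like shared providers become short graph traversals rather than multi-way relational joins, which is the performance effect the abstract reports.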
NASA Technical Reports Server (NTRS)
Benson, Robert F.; Fainberg, Joseph; Osherovich, Vladimir; Truhlik, Vladimir; Wang, Yongli; Arbacher, Becca
2011-01-01
The latest results from an investigation to establish links between solar-wind and topside-ionospheric parameters will be presented including a case where high-latitude topside electron-density Ne(h) profiles indicated dramatic rapid changes in the scale height during the main phase of a large magnetic storm (Dst < -200 nT). These scale-height changes suggest a large heat input to the topside ionosphere at this time. The topside profiles were derived from ISIS-1 digital ionograms obtained from the NASA Space Physics Data Facility (SPDF) Coordinated Data Analysis Web (CDA Web). Solar-wind data obtained from the NASA OMNIWeb database indicated that the magnetic storm was due to a magnetic cloud. This event is one of several large magnetic storms being investigated during the interval from 1965 to 1984 when both solar-wind and digital topside ionograms, from either Alouette-2, ISIS-1, or ISIS-2, are potentially available.
Owen, Jesse; Imel, Zac E
2016-04-01
This article introduces the special section on utilizing large data sets to explore psychotherapy processes and outcomes. The increased use of technology has provided new opportunities for psychotherapy researchers. In particular, there is a rise in large databases of tens of thousands of clients. Additionally, there are new ways to pool valuable resources for meta-analytic processes. At the same time, these tools also come with limitations. These issues are introduced, along with a brief overview of the articles in the section.
The GalICS Project: Virtual Galaxies from Cosmological N-body Simulations
NASA Astrophysics Data System (ADS)
Guiderdoni, B.
The GalICS project develops extensive semi-analytic post-processing of large cosmological simulations to describe hierarchical galaxy formation. The multiwavelength statistical properties of high-redshift and local galaxies are predicted within the large-scale structures. The fake catalogs and mock images that are generated from the outputs are used for the analysis and preparation of deep surveys. The whole set of results is now available in an on-line database that can be easily queried. The GalICS project represents a first step towards a 'Virtual Observatory of virtual galaxies'.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Poliakov, Alexander; Couronne, Olivier
2002-11-04
Aligning large vertebrate genomes that are structurally complex poses a variety of problems not encountered on smaller scales. Such genomes are rich in repetitive elements and contain multiple segmental duplications, which increases the difficulty of identifying true orthologous DNA segments in alignments. The sizes of the sequences make many alignment algorithms designed for comparing single proteins extremely inefficient when processing large genomic intervals. We integrated both local and global alignment tools and developed a suite of programs for automatically aligning large vertebrate genomes and identifying conserved non-coding regions in the alignments. Our method uses the BLAT local alignment program to find anchors on the base genome to identify regions of possible homology for a query sequence. These regions are postprocessed to find the best candidates, which are then globally aligned using the AVID global alignment program. In the last step conserved non-coding segments are identified using VISTA. Our methods are fast and the resulting alignments exhibit a high degree of sensitivity, covering more than 90% of known coding exons in the human genome. The GenomeVISTA software is a suite of Perl programs that is built on a MySQL database platform. The scheduler gets control data from the database, builds a queue of jobs, and dispatches them to a PC cluster for execution. The main program, running on each node of the cluster, processes individual sequences. A Perl library acts as an interface between the database and the above programs. The use of a separate library allows the programs to function independently of the database schema. The library also improves on the standard Perl MySQL database interface package by providing auto-reconnect functionality and improved error handling.
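The anchoring step can be illustrated independently of the actual BLAT/AVID binaries. In the sketch below, local hits are merged per chromosome and the best-scoring region is selected as the anchor for global alignment; the hit format and scoring are simplified stand-ins.

```python
# A minimal sketch of the anchoring step described above: local alignment
# hits (e.g., from BLAT) are merged by target region, and the best-scoring
# candidate region is chosen for subsequent global alignment (e.g., AVID).
from collections import defaultdict

# (query_id, target_chrom, target_start, target_end, score) -- simplified
hits = [
    ("q1", "chr1", 1000, 1500, 320),
    ("q1", "chr1", 1450, 2100, 280),   # overlaps the previous hit
    ("q1", "chr7", 500,  900,  150),
]

def best_region(hits, merge_gap=1000):
    """Merge nearby hits per chromosome; return the highest-scoring region."""
    by_chrom = defaultdict(list)
    for _, chrom, start, end, score in hits:
        by_chrom[chrom].append((start, end, score))
    regions = []
    for chrom, hs in by_chrom.items():
        hs.sort()
        cur_s, cur_e, cur_score = hs[0]
        for s, e, sc in hs[1:]:
            if s <= cur_e + merge_gap:            # close enough: merge
                cur_e, cur_score = max(cur_e, e), cur_score + sc
            else:
                regions.append((chrom, cur_s, cur_e, cur_score))
                cur_s, cur_e, cur_score = s, e, sc
        regions.append((chrom, cur_s, cur_e, cur_score))
    return max(regions, key=lambda r: r[-1])

print("anchor region for global alignment:", best_region(hits))
```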
Asymmetric distances for binary embeddings.
Gordo, Albert; Perronnin, Florent; Gong, Yunchao; Lazebnik, Svetlana
2014-01-01
In large-scale query-by-example retrieval, embedding image signatures in a binary space offers two benefits: data compression and search efficiency. While most embedding algorithms binarize both query and database signatures, it has been noted that this is not strictly a requirement. Indeed, asymmetric schemes that binarize the database signatures but not the query still enjoy the same two benefits but may provide superior accuracy. In this work, we propose two general asymmetric distances that are applicable to a wide variety of embedding techniques including locality sensitive hashing (LSH), locality sensitive binary codes (LSBC), spectral hashing (SH), PCA embedding (PCAE), PCAE with random rotations (PCAE-RR), and PCAE with iterative quantization (PCAE-ITQ). We experiment on four public benchmarks containing up to 1M images and show that the proposed asymmetric distances consistently lead to large improvements over the symmetric Hamming distance for all binary embedding techniques.
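A generic numeric illustration of the asymmetric idea follows (it is not the paper's exact distance estimators): database signatures are binarized for compression while the query keeps full precision, and distances are computed between the raw query and the binary codes.

```python
# Symmetric vs. asymmetric retrieval on sign-binarized codes. The database
# is compressed to {-1,+1} codes either way; only the query treatment
# differs. This is a simple Euclidean variant for illustration.
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 64))          # real-valued signatures
query = rng.normal(size=64)

codes = np.sign(database)                       # binary codes in {-1, +1}

sym = (np.sign(query) != codes).sum(axis=1)     # symmetric: Hamming on codes
asym = np.linalg.norm(query - codes, axis=1)    # asymmetric: raw query vs codes

print("top-5 neighbors, symmetric :", np.argsort(sym)[:5])
print("top-5 neighbors, asymmetric:", np.argsort(asym)[:5])
```

The asymmetric ranking uses the query's full precision at no extra storage cost on the database side, which is the source of the accuracy gains the abstract reports.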
Miao, Zhichao; Westhof, Eric
2016-07-08
RBscore&NBench combines a web server, RBscore, and a database, NBench. RBscore predicts RNA-/DNA-binding residues in proteins and visualizes the prediction scores and features on protein structures. The scoring scheme of RBscore directly links feature values to nucleic acid binding probabilities and illustrates the nucleic acid binding energy funnel on the protein surface. To avoid biases from the choice of dataset, binding site definition and assessment metric, we compared RBscore with 18 web servers and 3 stand-alone programs on 41 datasets, which demonstrated the high and stable accuracy of RBscore. A comprehensive comparison led us to develop a benchmark database named NBench. The web server is available at: http://ahsoka.u-strasbg.fr/rbscorenbench/.
High-Performance Secure Database Access Technologies for HEP Grids
DOE Office of Scientific and Technical Information (OSTI.GOV)
Matthew Vranicar; John Weicher
2006-04-17
The Large Hadron Collider (LHC) at the CERN Laboratory will become the largest scientific instrument in the world when it starts operations in 2007. Large Scale Analysis Computer Systems (computational grids) are required to extract rare signals of new physics from petabytes of LHC detector data. In addition to file-based event data, LHC data processing applications require access to large amounts of data in relational databases: detector conditions, calibrations, etc. U.S. high energy physicists demand efficient performance of grid computing applications in LHC physics research where world-wide remote participation is vital to their success. To empower physicists with data-intensive analysis capabilities a whole hyperinfrastructure of distributed databases cross-cuts a multi-tier hierarchy of computational grids. The crosscutting allows separation of concerns across both the global environment of a federation of computational grids and the local environment of a physicist’s computer used for analysis. Very few efforts are on-going in the area of database and grid integration research. Most of these are outside of the U.S. and rely on traditional approaches to secure database access via an extraneous security layer separate from the database system core, preventing efficient data transfers. Our findings are shared by the Database Access and Integration Services Working Group of the Global Grid Forum, who states that "Research and development activities relating to the Grid have generally focused on applications where data is stored in files. However, in many scientific and commercial domains, database management systems have a central role in data storage, access, organization, authorization, etc, for numerous applications.” There is a clear opportunity for a technological breakthrough, requiring innovative steps to provide high-performance secure database access technologies for grid computing. We believe that an innovative database architecture where the secure authorization is pushed into the database engine will eliminate inefficient data transfer bottlenecks. Furthermore, traditionally separated database and security layers provide an extra vulnerability, leaving a weak clear-text password authorization as the only protection on the database core systems. Due to the legacy limitations of the systems’ security models, the allowed passwords often cannot even comply with the DOE password guideline requirements. We see an opportunity for the tight integration of the secure authorization layer with the database server engine resulting in both improved performance and improved security. Phase I has focused on the development of a proof-of-concept prototype using Argonne National Laboratory’s (ANL) Argonne Tandem-Linac Accelerator System (ATLAS) project as a test scenario. By developing a grid-security enabled version of the ATLAS project’s current relational database solution, MySQL, PIOCON Technologies aims to offer a more efficient solution to secure database access.
Continuous evolutionary change in Plio-Pleistocene mammals of eastern Africa
NASA Astrophysics Data System (ADS)
Bibi, Faysal; Kiessling, Wolfgang
2015-08-01
Much debate has revolved around the question of whether the mode of evolutionary and ecological turnover in the fossil record of African mammals was continuous or pulsed, and the degree to which faunal turnover tracked changes in global climate. Here, we assembled and analyzed large specimen databases of the fossil record of eastern African Bovidae (antelopes) and Turkana Basin large mammals. Our results indicate that speciation and extinction proceeded continuously throughout the Pliocene and Pleistocene, as did increases in the relative abundance of arid-adapted bovids, and in bovid body mass. Species durations were similar among clades with different ecological attributes. Occupancy patterns were unimodal, with long and nearly symmetrical origination and extinction phases. A single origination pulse may be present at 2.0-1.75 Ma, but besides this, there is no evidence that evolutionary or ecological changes in the eastern African record tracked rapid, 100,000-y-scale changes in global climate. Rather, eastern African large mammal evolution tracked global or regional climatic trends at long (million year) time scales, while local, basin-scale changes (e.g., tectonic or hydrographic) and biotic interactions ruled at shorter timescales.
Large-Scale Quantitative Analysis of Painting Arts
Kim, Daniel; Son, Seung-Woo; Jeong, Hawoong
2014-01-01
Scientists have made efforts to understand the beauty of painting art in their own languages. As digital image acquisition of painting arts has made rapid progress, researchers have come to a point where it is possible to perform statistical analysis of a large-scale database of artistic paintings to make a bridge between art and science. Using digital image processing techniques, we investigate three quantitative measures of images – the usage of individual colors, the variety of colors, and the roughness of the brightness. We found a difference in color usage between classical paintings and photographs, and a significantly low color variety of the medieval period. Interestingly, moreover, the increment of the roughness exponent as painting techniques such as chiaroscuro and sfumato have advanced is consistent with historical circumstances. PMID:25501877
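Simplified stand-ins for the three measures can be computed in a few lines. The definitions below (distinct quantized colors, histogram entropy, and a mean-gradient roughness proxy) are illustrative assumptions, not the paper's exact estimators.

```python
# Simplified stand-ins for the three measures named above, computed on a
# synthetic RGB image: color usage, color variety (histogram entropy), and
# a crude roughness proxy for the brightness. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(128, 128, 3))       # stand-in "painting"

# 1) Color usage: number of distinct quantized colors that appear.
quantized = (img // 32).reshape(-1, 3)               # 8 levels per channel
colors, counts = np.unique(quantized, axis=0, return_counts=True)
print("distinct quantized colors:", len(colors))

# 2) Color variety: Shannon entropy of the color histogram.
p = counts / counts.sum()
print("color entropy (bits):", round(float(-(p * np.log2(p)).sum()), 2))

# 3) Brightness roughness: mean absolute gradient of the luminance.
lum = img.mean(axis=2)
rough = np.abs(np.diff(lum, axis=0)).mean() + np.abs(np.diff(lum, axis=1)).mean()
print("roughness proxy:", round(float(rough), 2))
```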
Large-eddy simulation of the urban boundary layer in the MEGAPOLI Paris Plume experiment
NASA Astrophysics Data System (ADS)
Esau, Igor
2010-05-01
This study presents results from a dedicated large-eddy simulation study of the urban boundary layer during the MEGAPOLI Paris Plume field campaign. We used the LESNIC and PALM codes, the MEGAPOLI city morphology database, nudging to the observed meteorological conditions during the Paris Plume campaign, and some concentration measurements from that campaign to simulate and better understand the nature of the urban boundary layer on scales larger than the street-canyon scales. The primary attention was paid to turbulence self-organization and structure-to-surface interaction. The study was aimed at demonstrating feasibility and estimating the resources required for such research. Therefore, at this stage we neither compare the simulation with other relevant studies nor formulate theoretical conclusions.
Smirani, Rawen; Truchetet, Marie-Elise; Poursac, Nicolas; Naveau, Adrien; Schaeverbeke, Thierry; Devillard, Raphaël
2018-06-01
Oropharyngeal features are frequent and often understated in the clinical treatment guidelines for systemic sclerosis, in spite of important consequences for comfort, esthetics, nutrition and daily life. The aim of this systematic review was to assess a correlation between the oropharyngeal manifestations of systemic sclerosis and patients' health-related quality of life. A systematic search was conducted using four databases [PubMed®, Cochrane Database®, Dentistry & Oral Sciences Source®, and SCOPUS®] up to January 2018, according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses. Grey literature and hand search were also included. Study selection, risk-of-bias assessment (Newcastle-Ottawa scale) and data extraction were performed by two independent reviewers. The review protocol was registered in the PROSPERO database with the code CRD42018085994. From 375 screened studies, 6 cross-sectional studies were included in the systematic review. The total number of patients included per study ranged from 84 to 178. These studies reported a statistically significant association between oropharyngeal manifestations of systemic sclerosis (mainly assessed by maximal mouth opening and the mouth handicap in systemic sclerosis scale) and an impaired quality of life (measured by different scales). Studies were unequal concerning risk of bias, mostly because of low levels of evidence, different recruiting sources of samples, and different scales to assess quality of life. This systematic review demonstrates a correlation between oropharyngeal manifestations of systemic sclerosis and impaired quality of life, despite the low level of evidence of the included studies. Large-scale studies are needed to provide stronger evidence of this association.
Kevin M. Potter; Jeanine L. Paschke
2013-01-01
Analyzing patterns of forest pest infestations, diseases occurrences, forest declines and related biotic stress factors is necessary to monitor the health of forested ecosystems and their potential impacts on forest structure, composition, biodiversity, and species distributions (Castello and others 1995). Introduced nonnative insects and diseases, in particular, can...
Stability and Change in Interests: A Longitudinal Study of Adolescents from Grades 8 through 12
ERIC Educational Resources Information Center
Tracey, Terence J. G.; Robbins, Steven B.; Hofsess, Christy D.
2005-01-01
The patterns of RIASEC interests and academic skills were assessed longitudinally from a large-scale national database at three time points: 8th grade, 10th grade, and 12th grade. Validation and cross-validation samples of 1000 males and 1000 females in each set were used to test the pattern of these scores over time relative to mean changes,…
Statistical Literacy in Data Revolution Era: Building Blocks and Instructional Dilemmas
ERIC Educational Resources Information Center
Prodromou, Theodosia; Dunne, Tim
2017-01-01
The data revolution has given citizens access to enormous large-scale open databases. In order to take into account the full complexity of data, we have to change the way we think about the nature of data and its availability, the ways in which it is displayed and used, and the skills that are required for its interpretation. Substantial…
NASA Technical Reports Server (NTRS)
Saeed, M.; Lieu, C.; Raber, G.; Mark, R. G.
2002-01-01
Development and evaluation of Intensive Care Unit (ICU) decision-support systems would be greatly facilitated by the availability of a large-scale ICU patient database. Following our previous efforts with the MIMIC (Multi-parameter Intelligent Monitoring for Intensive Care) Database, we have leveraged advances in networking and storage technologies to develop a far more massive temporal database, MIMIC II. MIMIC II is an ongoing effort: data is continuously and prospectively archived from all ICU patients in our hospital. MIMIC II now consists of over 800 ICU patient records including over 120 gigabytes of data and is growing. A customized archiving system was used to store continuously up to four waveforms and 30 different parameters from ICU patient monitors. An integrated user-friendly relational database was developed for browsing of patients' clinical information (lab results, fluid balance, medications, nurses' progress notes). Based upon its unprecedented size and scope, MIMIC II will prove to be an important resource for intelligent patient monitoring research, and will support efforts in medical data mining and knowledge-discovery.
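The kind of relational browsing described above can be sketched with an in-memory SQLite database. The schema below is a hypothetical, much-simplified stand-in for MIMIC II's actual tables.

```python
# A sketch of relational browsing of ICU data, using an in-memory SQLite
# database with a hypothetical, much-simplified schema; MIMIC II itself is
# far larger and uses its own schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, icu TEXT);
CREATE TABLE labs (patient_id INTEGER, ts TEXT, test TEXT, value REAL);
INSERT INTO patients VALUES (1, 'MICU'), (2, 'CCU');
INSERT INTO labs VALUES
  (1, '2002-01-01 06:00', 'lactate', 3.1),
  (1, '2002-01-01 12:00', 'lactate', 2.2),
  (2, '2002-01-01 07:30', 'lactate', 1.1);
""")

# Example clinical query: most recent lactate per patient.
rows = con.execute("""
  SELECT p.patient_id, p.icu, l.test, l.value, MAX(l.ts)
  FROM patients p JOIN labs l ON l.patient_id = p.patient_id
  GROUP BY p.patient_id
""").fetchall()
print(rows)
```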
NASA Astrophysics Data System (ADS)
Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George
2017-10-01
In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata are presented. To develop the multi-source geospatial database we used a Relational Database Management System (RDBMS) on a Structured Query Language (SQL) server, which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information were linked (e.g. location, date, and instrument). Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database, where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter- and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.
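The relational model can be sketched as a small set of linked tables. The schema below is a hypothetical simplification (table and column names are invented) of the flight/instrument/image associations the abstract describes.

```python
# A minimal sketch of the relational model described above: acquisition
# flights, instruments, and image products linked by foreign keys.
# Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE flight (flight_id INTEGER PRIMARY KEY, date TEXT, site TEXT);
CREATE TABLE instrument (instr_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE image (
  image_id INTEGER PRIMARY KEY,
  flight_id INTEGER REFERENCES flight(flight_id),
  instr_id  INTEGER REFERENCES instrument(instr_id),
  level TEXT,   -- e.g. 'DN', 'radiance', 'geocorrected reflectance'
  path TEXT
);
""")
con.execute("INSERT INTO flight VALUES (1, '2016-07-14', 'pipeline-ROW-3')")
con.execute("INSERT INTO instrument VALUES (1, 'hyperspectral-VNIR')")
con.execute("INSERT INTO image VALUES (1, 1, 1, 'geocorrected reflectance', '/data/img1.tif')")

# All geocorrected products for a given site:
for row in con.execute("""
  SELECT i.path FROM image i
  JOIN flight f USING (flight_id)
  WHERE f.site = 'pipeline-ROW-3' AND i.level = 'geocorrected reflectance'
"""):
    print(row)
```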
NASA Astrophysics Data System (ADS)
Madin, Joshua S.; Anderson, Kristen D.; Andreasen, Magnus Heide; Bridge, Tom C. L.; Cairns, Stephen D.; Connolly, Sean R.; Darling, Emily S.; Diaz, Marcela; Falster, Daniel S.; Franklin, Erik C.; Gates, Ruth D.; Hoogenboom, Mia O.; Huang, Danwei; Keith, Sally A.; Kosnik, Matthew A.; Kuo, Chao-Yang; Lough, Janice M.; Lovelock, Catherine E.; Luiz, Osmar; Martinelli, Julieta; Mizerek, Toni; Pandolfi, John M.; Pochon, Xavier; Pratchett, Morgan S.; Putnam, Hollie M.; Roberts, T. Edward; Stat, Michael; Wallace, Carden C.; Widman, Elizabeth; Baird, Andrew H.
2016-03-01
Trait-based approaches advance ecological and evolutionary research because traits provide a strong link to an organism’s function and fitness. Trait-based research might lead to a deeper understanding of the functions of, and services provided by, ecosystems, thereby improving management, which is vital in the current era of rapid environmental change. Coral reef scientists have long collected trait data for corals; however, these are difficult to access and often under-utilized in addressing large-scale questions. We present the Coral Trait Database initiative that aims to bring together physiological, morphological, ecological, phylogenetic and biogeographic trait information into a single repository. The database houses species- and individual-level data from published field and experimental studies alongside contextual data that provide important framing for analyses. In this data descriptor, we release data for 56 traits for 1547 species, and present a collaborative platform on which other trait data are being actively federated. Our overall goal is for the Coral Trait Database to become an open-source, community-led data clearinghouse that accelerates coral reef research.
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W
2010-01-01
GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bi-monthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI homepage: www.ncbi.nlm.nih.gov.
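For programmatic access, the Entrez system can be queried from scripts; one common route is Biopython's Bio.Entrez client (a third-party library, not part of GenBank itself). The accession below is the example used in the Biopython tutorial.

```python
# Programmatic GenBank access via the NCBI Entrez system, using
# Biopython's Bio.Entrez client. U49845 is the accession used in the
# Biopython tutorial; NCBI asks users to identify themselves by email.
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"    # required by NCBI usage policy
handle = Entrez.efetch(db="nucleotide", id="U49845",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id, record.description)
print("sequence length:", len(record.seq))
```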
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W
2009-01-01
GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank(R) staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
The MIND PALACE: A Multi-Spectral Imaging and Spectroscopy Database for Planetary Science
NASA Astrophysics Data System (ADS)
Eshelman, E.; Doloboff, I.; Hara, E. K.; Uckert, K.; Sapers, H. M.; Abbey, W.; Beegle, L. W.; Bhartia, R.
2017-12-01
The Multi-Instrument Database (MIND) is the web-based home to a well-characterized set of analytical data collected by a suite of deep-UV fluorescence/Raman instruments built at the Jet Propulsion Laboratory (JPL). Samples derive from a growing body of planetary surface analogs, mineral and microbial standards, meteorites, spacecraft materials, and other astrobiologically relevant materials. In addition to deep-UV spectroscopy, datasets stored in MIND are obtained from a variety of analytical techniques obtained over multiple spatial and spectral scales including electron microscopy, optical microscopy, infrared spectroscopy, X-ray fluorescence, and direct fluorescence imaging. Multivariate statistical analysis techniques, primarily Principal Component Analysis (PCA), are used to guide interpretation of these large multi-analytical spectral datasets. Spatial co-referencing of integrated spectral/visual maps is performed using QGIS (geographic information system software). Georeferencing techniques transform individual instrument data maps into a layered co-registered data cube for analysis across spectral and spatial scales. The body of data in MIND is intended to serve as a permanent, reliable, and expanding database of deep-UV spectroscopy datasets generated by this unique suite of JPL-based instruments on samples of broad planetary science interest.
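The PCA step used to guide interpretation can be sketched on synthetic spectra. The example below stands in for the deep-UV fluorescence/Raman data in MIND; the endmember construction is invented for illustration.

```python
# A hedged sketch of the PCA step mentioned above: reducing a matrix of
# spectra (rows = measurements, columns = spectral channels) to a few
# principal components. Synthetic mixtures stand in for the real data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_spectra, n_channels = 200, 500
endmembers = rng.random((2, n_channels))      # two latent spectral shapes
abundances = rng.random((n_spectra, 2))       # mixing proportions
spectra = abundances @ endmembers + 0.01 * rng.normal(size=(n_spectra, n_channels))

pca = PCA(n_components=5)
scores = pca.fit_transform(spectra)           # (n_spectra, 5) component scores
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
```

With two latent endmembers, nearly all variance concentrates in the first two components, which is exactly the structure PCA is used to reveal in large multi-analytical datasets.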
Development of a database for the verification of trans-ionospheric remote sensing systems
NASA Astrophysics Data System (ADS)
Leitinger, R.
2005-08-01
Remote sensing systems need verification by means of in-situ data or by means of model data. In the case of ionospheric occultation inversion, ionosphere tomography and other imaging methods on the basis of satellite-to-ground or satellite-to-satellite electron content, the availability of in-situ data with adequate spatial and temporal co-location is rare indeed. Therefore the method of choice for verification is to produce artificial electron content data with realistic properties, subject these data to the inversion/retrieval method, compare the results with model data and apply a suitable type of “goodness of fit” classification. Inter-comparison of inversion/retrieval methods should be done with sets of artificial electron contents in a “blind” (or even “double blind”) way. The setup of a relevant database for the COST 271 Action is described. One part of the database will be made available to everyone interested in testing inversion/retrieval methods. The artificial electron content data are calculated by means of large-scale models that are “modulated” in a realistic way to include smaller-scale and dynamic structures, like troughs and traveling ionospheric disturbances.
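The "modulation" of a large-scale model can be sketched directly: a smooth background profile is multiplied by a small traveling wave standing in for a traveling ionospheric disturbance. The Chapman-layer background and all parameters below are illustrative, not the COST 271 models.

```python
# A minimal sketch of modulating a background electron-density profile
# (a simple Chapman layer) with a traveling wave standing in for a
# traveling ionospheric disturbance (TID). All parameters are illustrative.
import numpy as np

def chapman(h_km, nmax=1e12, hmax=300.0, scale_h=60.0):
    """Chapman-layer electron density (m^-3) versus height (km)."""
    z = (h_km - hmax) / scale_h
    return nmax * np.exp(0.5 * (1 - z - np.exp(-z)))

def with_tid(h_km, x_km, t_s, amp=0.05, wavelength_km=200.0, period_s=1200.0):
    """Background profile times a small traveling-wave modulation."""
    phase = 2 * np.pi * (x_km / wavelength_km - t_s / period_s)
    return chapman(h_km) * (1 + amp * np.sin(phase))

heights = np.linspace(100, 1000, 10)
print(np.round(with_tid(heights, x_km=0.0, t_s=0.0) / 1e11, 2))
```

Integrating such perturbed profiles along ray paths yields artificial electron content with realistic small-scale structure, which can then be fed to an inversion method for verification.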
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L
2007-01-01
GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage (www.ncbi.nlm.nih.gov).
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L
2005-01-01
GenBank is a comprehensive database that contains publicly available DNA sequences for more than 165,000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in the UK and the DNA Data Bank of Japan helps to ensure worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov.
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L
2006-01-01
GenBank (R) is a comprehensive database that contains publicly available DNA sequences for more than 205 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the Web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at www.ncbi.nlm.nih.gov.
A psycholinguistic database for traditional Chinese character naming.
Chang, Ya-Ning; Hsu, Chun-Hsien; Tsai, Jie-Li; Chen, Chien-Liang; Lee, Chia-Ying
2016-03-01
In this study, we aimed to provide a large-scale set of psycholinguistic norms for 3,314 traditional Chinese characters, along with their naming reaction times (RTs), collected from 140 Chinese speakers. The lexical and semantic variables in the database include frequency, regularity, familiarity, consistency, number of strokes, homophone density, semantic ambiguity rating, phonetic combinability, semantic combinability, and the number of disyllabic compound words formed by a character. Multiple regression analyses were conducted to examine the predictive powers of these variables for the naming RTs. The results demonstrated that these variables could account for a significant portion of variance (55.8%) in the naming RTs. An additional multiple regression analysis was conducted to demonstrate the effects of consistency and character frequency. Overall, the regression results were consistent with the findings of previous studies on Chinese character naming. This database should be useful for research into Chinese language processing, Chinese education, or cross-linguistic comparisons. The database can be accessed via an online inquiry system (http://ball.ling.sinica.edu.tw/namingdatabase/index.html).
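The reported regression analysis is straightforward to emulate. The sketch below fits an ordinary least squares model to synthetic reaction times with a few of the abstract's predictors; the data and coefficients are invented for illustration.

```python
# A sketch of the kind of multiple regression reported above: predicting
# naming reaction times from lexical variables. Synthetic data stands in
# for the 3,314-character norms; variable names follow the abstract.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 3314
log_frequency = rng.normal(size=n)
consistency = rng.normal(size=n)
strokes = rng.normal(size=n)
# Synthetic RTs with frequency and consistency effects plus noise.
rt = (600 - 25 * log_frequency - 15 * consistency + 5 * strokes
      + rng.normal(scale=40, size=n))

X = sm.add_constant(np.column_stack([log_frequency, consistency, strokes]))
model = sm.OLS(rt, X).fit()
print("R-squared:", round(model.rsquared, 3))
print("coefficients:", np.round(model.params, 2))
```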
EDULISS: a small-molecule database with data-mining and pharmacophore searching capabilities
Hsin, Kun-Yi; Morgan, Hugh P.; Shave, Steven R.; Hinton, Andrew C.; Taylor, Paul; Walkinshaw, Malcolm D.
2011-01-01
We present the relational database EDULISS (EDinburgh University Ligand Selection System), which stores structural, physicochemical and pharmacophoric properties of small molecules. The database comprises a collection of over 4 million commercially available compounds from 28 different suppliers. A user-friendly web-based interface for EDULISS (available at http://eduliss.bch.ed.ac.uk/) has been established providing a number of data-mining possibilities. For each compound a single 3D conformer is stored along with over 1600 calculated descriptor values (molecular properties). A very efficient method for unique compound recognition, especially for a large-scale database, is demonstrated by making use of small subgroups of the descriptors. Many of the shape and distance descriptors are held as pre-calculated bit strings, permitting fast and efficient similarity and pharmacophore searches which can be used to identify families of related compounds for biological testing. Two ligand searching applications are given to demonstrate how EDULISS can be used to extract families of molecules with selected structural and biophysical features. PMID:21051336
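Pre-computed bit strings support fast similarity screening because set operations reduce to bitwise logic. The sketch below computes Tanimoto similarity on integer bitmasks; the fingerprints are made up, and the encoding is an assumption rather than EDULISS's actual format.

```python
# Bit-string similarity screening: descriptors stored as bitmasks compared
# with the Tanimoto coefficient. Fingerprints here are invented.
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity between two fingerprints stored as int bitmasks."""
    inter = bin(a & b).count("1")
    union = bin(a | b).count("1")
    return inter / union if union else 0.0

query = 0b1011001110001101
library = {"cmpd_A": 0b1011001010001100,
           "cmpd_B": 0b0100110001110010}
for name, fp in library.items():
    print(name, round(tanimoto(query, fp), 3))
```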
SPSmart: adapting population based SNP genotype databases for fast and comprehensive web access.
Amigo, Jorge; Salas, Antonio; Phillips, Christopher; Carracedo, Angel
2008-10-10
In the last five years large online resources of human variability have appeared, notably HapMap, Perlegen and the CEPH foundation. These databases of genotypes with population information act as catalogues of human diversity, and are widely used as reference sources for population genetics studies. Although many useful conclusions may be extracted by querying databases individually, the lack of flexibility for combining data from within and between each database does not allow the calculation of key population variability statistics. We have developed a novel tool for accessing and combining large-scale genomic databases of single nucleotide polymorphisms (SNPs) in widespread use in human population genetics: SPSmart (SNPs for Population Studies). A fast pipeline creates and maintains a data mart from the most commonly accessed databases of genotypes containing population information: data is mined, summarized into the standard statistical reference indices, and stored into a relational database that currently handles as many as 4 × 10^9 genotypes and that can be easily extended to new database initiatives. We have also built a web interface to the data mart that allows the browsing of underlying data indexed by population and the combining of populations, allowing intuitive and straightforward comparison of population groups. All the information served is optimized for web display, and most of the computations are already pre-processed in the data mart to speed up the data browsing and any computational treatment requested. In practice, SPSmart allows populations to be combined into user-defined groups, while multiple databases can be accessed and compared in a few simple steps from a single query. It performs the queries rapidly and gives straightforward graphical summaries of SNP population variability through visual inspection of allele frequencies outlined in standard pie-chart format. In addition, full numerical description of the data is output in statistical results panels that include common population genetics metrics such as heterozygosity, Fst and In.
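Two of the reference statistics named above have compact textbook forms, sketched below for a biallelic SNP: expected heterozygosity 2p(1-p) per population and Fst = 1 - Hs/Ht across populations. This is not SPSmart's exact implementation.

```python
# Textbook formulas for two of the statistics named above, computed from
# biallelic allele frequencies per population. Illustrative only.
def expected_het(p: float) -> float:
    """Expected heterozygosity for a biallelic SNP with allele frequency p."""
    return 2 * p * (1 - p)

def fst(freqs):
    """Fst = 1 - Hs/Ht, with equal weights across populations."""
    hs = sum(expected_het(p) for p in freqs) / len(freqs)
    p_bar = sum(freqs) / len(freqs)
    ht = expected_het(p_bar)
    return 1 - hs / ht if ht else 0.0

populations = [0.15, 0.40, 0.75]   # one SNP's allele frequency in 3 groups
print("per-pop heterozygosity:", [round(expected_het(p), 3) for p in populations])
print("Fst:", round(fst(populations), 3))
```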
BrEPS 2.0: Optimization of sequence pattern prediction for enzyme annotation.
Dudek, Christian-Alexander; Dannheim, Henning; Schomburg, Dietmar
2017-01-01
The prediction of gene functions is crucial for a large number of different life science areas. Faster high throughput sequencing techniques generate more and larger datasets. The manual annotation by classical wet-lab experiments is not suitable for these large amounts of data. We showed earlier that the automatic sequence pattern-based BrEPS protocol, based on manually curated sequences, can be used for the prediction of enzymatic functions of genes. The growing sequence databases provide the opportunity for more reliable patterns, but are also a challenge for the implementation of automatic protocols. We reimplemented and optimized the BrEPS pattern generation to be applicable for larger datasets in an acceptable timescale. The primary improvement of the new BrEPS protocol is the enhanced data selection step. Manually curated annotations from Swiss-Prot are used as a reliable source for function prediction of enzymes observed on the protein level. The pool of sequences is extended by highly similar sequences from TrEMBL and Swiss-Prot. This allows us to restrict the selection of Swiss-Prot entries, without losing the diversity of sequences needed to generate significant patterns. Additionally, a supporting pattern type was introduced by extending the patterns at semi-conserved positions with highly similar amino acids. Extended patterns have an increased complexity, increasing the chance to match more sequences without losing the essential structural information of the pattern. To enhance the usability of the database, we introduced enzyme function prediction based on consensus EC numbers and IUBMB enzyme nomenclature. BrEPS is part of the Braunschweig Enzyme Database (BRENDA) and is available on a completely redesigned website and as a download. The database can be downloaded and used with the BrEPScmd command line tool for large-scale sequence analysis. The BrEPS website and downloads for the database creation tool, command line tool and database are freely accessible at http://breps.tu-bs.de.
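A toy version of sequence-pattern generation conveys the idea of conserved and semi-conserved positions. In the sketch below, fully conserved alignment columns become literal residues and semi-conserved columns become character classes; the real BrEPS protocol is considerably more sophisticated.

```python
# Toy sequence-pattern generation from aligned sequences: conserved columns
# become literal residues, semi-conserved columns become character classes.
# This illustrates the idea only, not the BrEPS protocol itself.
import re

aligned = ["GDSAGGL", "GDSTGGL", "GDSSGGI"]   # toy alignment, equal length

pattern = ""
for column in zip(*aligned):
    residues = set(column)
    if len(residues) == 1:                    # fully conserved position
        pattern += column[0]
    elif len(residues) <= 3:                  # semi-conserved: character class
        pattern += "[" + "".join(sorted(residues)) + "]"
    else:                                     # variable position: wildcard
        pattern += "."

print("derived pattern:", pattern)            # e.g. GDS[AST]GG[IL]
print("matches novel sequence:", bool(re.fullmatch(pattern, "GDSAGGI")))
```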
NASA Astrophysics Data System (ADS)
Gatto, Francesca; Katsanevakis, Stelios; Vandekerkhove, Jochen; Zenetos, Argyro; Cardoso, Ana Cristina
2013-06-01
Europe is severely affected by alien invasions, which impact biodiversity, ecosystem services, economy, and human health. A large number of national, regional, and global online databases provide information on the distribution, pathways of introduction, and impacts of alien species. The sufficiency and efficiency of the current online information systems to assist European policy on alien species were investigated by a comparative analysis of occurrence data across 43 online databases. Large differences among databases were found, which are partially explained by variations in their taxonomical, environmental, and geographical scopes, but also by the variable efforts for continuous updates and by inconsistencies in the definition of "alien" or "invasive" species. No single database covered all European environments, countries, and taxonomic groups. In many European countries national databases do not exist, which greatly affects the quality of reported information. To be operational and useful to scientists, managers, and policy makers, online information systems need to be regularly updated through continuous monitoring on a country or regional level. We propose the creation of a network of online interoperable web services through which information in distributed resources can be accessed, aggregated and then used for reporting and further analysis at different geographical and political scales, as an efficient approach to increase the accessibility of information. Harmonization, standardization, conformity with international standards for nomenclature, and agreement on common definitions of alien and invasive species are among the necessary prerequisites.
Fire Detection Organizing Questions
NASA Technical Reports Server (NTRS)
2004-01-01
Verified models of fire precursor transport in low and partial gravity: a. Development of models for large-scale transport in reduced gravity. b. Validated CFD simulations of transport of fire precursors. c. Evaluation of the effect of scale on transport and reduced gravity fires. Advanced fire detection system for gaseous and particulate pre-fire and fire signatures: a. Quantification of pre-fire pyrolysis products in microgravity. b. Suite of gas and particulate sensors. c. Reduced gravity evaluation of candidate detector technologies. d. Reduced gravity verification of advanced fire detection system. e. Validated database of fire and pre-fire signatures in low and partial gravity.
Usaj, Matej; Tan, Yizhao; Wang, Wen; VanderSluis, Benjamin; Zou, Albert; Myers, Chad L.; Costanzo, Michael; Andrews, Brenda; Boone, Charles
2017-01-01
Providing access to quantitative genomic data is key to ensure large-scale data validation and promote new discoveries. TheCellMap.org serves as a central repository for storing and analyzing quantitative genetic interaction data produced by genome-scale Synthetic Genetic Array (SGA) experiments with the budding yeast Saccharomyces cerevisiae. In particular, TheCellMap.org allows users to easily access, visualize, explore, and functionally annotate genetic interactions, or to extract and reorganize subnetworks, using data-driven network layouts in an intuitive and interactive manner. PMID:28325812
NASA Astrophysics Data System (ADS)
Nyitrai, Daniel; Martinho, Filipe; Dolbeth, Marina; Rito, João; Pardal, Miguel A.
2013-12-01
Large-scale and local climate patterns are known to influence several aspects of the life cycle of marine fish. In this paper, we used a 9-year database (2003-2011) to analyse the populations of two estuarine resident fishes, Pomatoschistus microps and Pomatoschistus minutus, in order to determine their relationships with varying environmental stressors operating over local and large scales. This study was performed in the Mondego estuary, Portugal. Firstly, the variations in abundance, growth, population structure and secondary production were evaluated. These species appeared in high densities in the beginning of the study period, with subsequent occasional high annual density peaks, while their secondary production was lower in dry years. The relationships between yearly fish abundance and the environmental variables were evaluated separately for both species using Spearman correlation analysis, considering the yearly abundance peaks for the whole population, juveniles and adults. Among the local climate patterns, precipitation, river runoff, salinity and temperature were used in the analyses, and the North Atlantic Oscillation (NAO) index and sea surface temperature (SST) were tested as large-scale factors. For P. microps, precipitation and NAO were the significant factors explaining the abundance of the whole population, as well as of the juveniles and adults. For P. minutus, river runoff was the significant predictor for the whole population, juveniles and adults. The results for both species suggest a differential influence of climate patterns on the various life cycle stages, also confirming the importance of estuarine resident fishes as indicators of changes in local and large-scale climate patterns related to global climate change.
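A minimal sketch of the correlation step described above, with invented series standing in for the yearly abundance peaks and the NAO index; scipy.stats.spearmanr is a standard implementation of Spearman's rank correlation:

```python
from scipy.stats import spearmanr

# Invented yearly values for 2003-2011; not the Mondego estuary data.
peak_density = [14.2, 9.8, 11.5, 7.1, 6.0, 8.3, 5.9, 10.4, 6.6]
nao_index    = [0.2, -0.1, 0.6, -0.9, 0.9, -0.3, -0.4, -1.6, 0.5]

rho, p_value = spearmanr(peak_density, nao_index)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```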
Database for the geologic map of the Chelan 30-minute by 60-minute quadrangle, Washington (I-1661)
Tabor, R.W.; Frizzell, V.A.; Whetten, J.T.; Waitt, R.B.; Swanson, D.A.; Byerly, G.R.; Booth, D.B.; Hetherington, M.J.; Zartman, R.E.
2006-01-01
This digital map database has been prepared by R. W. Tabor from the published Geologic map of the Chelan 30-Minute Quadrangle, Washington. Together with the accompanying text files as PDF, it provides information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The authors mapped most of the bedrock geology at 1:100,000 scale, but compiled Quaternary units at 1:24,000 scale. The Quaternary contacts and structural data have been much simplified for the 1:100,000-scale map and database. The spatial resolution (scale) of the database is 1:100,000 or smaller. This database depicts the distribution of geologic materials and structures at a regional (1:100,000) scale. The report is intended to provide geologic information for the regional study of materials properties, earthquake shaking, landslide potential, mineral hazards, seismic velocity, and earthquake faults. In addition, the report contains information and interpretations about the regional geologic history and framework. However, the regional scale of this report does not provide sufficient detail for site development purposes.
Tabor, R.W.; Frizzell, V.A.; Booth, D.B.; Waitt, R.B.
2006-01-01
This digital map database has been prepared by R.W. Tabor from the published Geologic map of the Snoqualmie Pass 30' X 60' Quadrangle, Washington. Together with the accompanying text files as PDF, it provides information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The authors mapped most of the bedrock geology at 1:100,000 scale, but compiled Quaternary units at 1:24,000 scale. The Quaternary contacts and structural data have been much simplified for the 1:100,000-scale map and database. The spatial resolution (scale) of the database is 1:100,000 or smaller. This database depicts the distribution of geologic materials and structures at a regional (1:100,000) scale. The report is intended to provide geologic information for the regional study of materials properties, earthquake shaking, landslide potential, mineral hazards, seismic velocity, and earthquake faults. In addition, the report contains information and interpretations about the regional geologic history and framework. However, the regional scale of this report does not provide sufficient detail for site development purposes.
Geologic Map of the Wenatchee 1:100,000 Quadrangle, Central Washington: A Digital Database
Tabor, R.W.; Waitt, R.B.; Frizzell, V.A.; Swanson, D.A.; Byerly, G.R.; Bentley, R.D.
2005-01-01
This digital map database has been prepared by R.W. Tabor from the published Geologic map of the Wenatchee 1:100,000 Quadrangle, Central Washington. Together with the accompanying text files as PDF, it provides information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The authors mapped most of the bedrock geology at 1:100,000 scale, but compiled Quaternary units at 1:24,000 scale. The Quaternary contacts and structural data have been much simplified for the 1:100,000-scale map and database. The spatial resolution (scale) of the database is 1:100,000 or smaller. This database depicts the distribution of geologic materials and structures at a regional (1:100,000) scale. The report is intended to provide geologic information for the regional study of materials properties, earthquake shaking, landslide potential, mineral hazards, seismic velocity, and earthquake faults. In addition, the report contains information and interpretations about the regional geologic history and framework. However, the regional scale of this report does not provide sufficient detail for site development purposes.
An Evaluation of Database Solutions to Spatial Object Association
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kumar, V S; Kurc, T; Saltz, J
2008-06-24
Object association is a common problem encountered in many applications. Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two datasets based on their positions in a common spatial coordinate system--one of the datasets may correspond to a catalog of objects observed over time in a multi-dimensional domain; the other dataset may consist of objects observed in a snapshot of the domain at a time point. The use of database management systems to solve the object association problem provides portability across different platforms and also greater flexibility. Increasing dataset sizes in today's applications, however, have made object association a data/compute-intensive problem that requires targeted optimizations for efficient execution. In this work, we investigate how database-based crossmatch algorithms can be deployed on different database system architectures and evaluate the deployments to understand the impact of architectural choices on crossmatch performance and associated trade-offs. We investigate the execution of two crossmatch algorithms on (1) a parallel database system with active disk style processing capabilities, (2) a high-throughput network database (MySQL Cluster), and (3) shared-nothing databases with replication. We have conducted our study in the context of a large-scale astronomy application with real use-case scenarios.
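A minimal sketch of a database-style crossmatch, here as a SQLite join on positions within a fixed tolerance; real deployments add zoning/partitioning and proper angular distances, none of which is shown:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE catalog  (id INTEGER, ra REAL, dec REAL);
CREATE TABLE snapshot (id INTEGER, ra REAL, dec REAL);
INSERT INTO catalog  VALUES (1, 10.001, -5.002), (2, 42.5, 17.3);
INSERT INTO snapshot VALUES (7, 10.000, -5.000), (8, 99.0, 0.0);
""")

tol = 0.01  # degrees; a real crossmatch would use true angular separation
rows = con.execute("""
SELECT c.id, s.id FROM catalog c JOIN snapshot s
ON ABS(c.ra - s.ra) < ? AND ABS(c.dec - s.dec) < ?
""", (tol, tol)).fetchall()
print(rows)   # [(1, 7)]: the matched object pair
```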
NASA Astrophysics Data System (ADS)
Ricco, George Dante
In higher education, and in engineering education in particular, changing majors is generally considered a negative event - or at least an event with negative consequences. An emergent field of study within engineering education revolves around understanding the factors and processes driving student changes of major. Of key importance to furthering the field of change-of-major research is a grasp of large-scale phenomena occurring throughout multiple systems, knowledge of previous attempts at describing such issues, and the adoption of metrics to probe them effectively. The problem posed is exacerbated by the drive in higher education institutions and among state legislatures to understand and reduce time-to-degree and student attrition. With these factors in mind, insights into large-scale processes that affect student progression are essential to evaluating the success or failure of programs. The goals of this work include describing the current educational research on switchers, identifying core concepts and stumbling blocks in my treatment of switchers, and using the Multiple Institutional Database for Investigating Engineering Longitudinal Development (MIDFIELD) to explore how those who change majors perform as a function of large-scale academic pathways within and outside the engineering context. To accomplish these goals, it was first necessary to delve into a recent history of the treatment of switchers within the literature and categorize their approaches. While three categories of papers exist in the literature concerning change of major, all three may or may not be applicable to a given database of students or even a single institution. Furthermore, while the term has been coined in the literature, no portable metric for discussing large-scale navigational flexibility exists in engineering education. What such a metric would look like will be discussed, as well as the delimitations involved. The results and subsequent discussion include a description of changes of major, how they may or may not have a deleterious effect on one's academic pathway, the special context of changes of major in the pathways of students within first-year engineering programs, including students labeled as undecided, an exploration of curricular flexibility through the construction of a novel metric, and proposed future work.
The Computing and Data Grid Approach: Infrastructure for Distributed Science Applications
NASA Technical Reports Server (NTRS)
Johnston, William E.
2002-01-01
With the advent of Grids - infrastructure for using and managing widely distributed computing and data resources in the science environment - there is now an opportunity to provide a standard, large-scale, computing, data, instrument, and collaboration environment for science that spans many different projects and provides the required infrastructure and services in a relatively uniform and supportable way. Grid technology has evolved over the past several years to provide the services and infrastructure needed for building 'virtual' systems and organizations. We argue that Grid technology provides an excellent basis for the creation of the integrated environments that can combine the resources needed to support the large-scale science projects located at multiple laboratories and universities. We present some science case studies that indicate that a paradigm shift in the process of science will come about as a result of Grids providing transparent and secure access to advanced and integrated information and technologies infrastructure: powerful computing systems, large-scale data archives, scientific instruments, and collaboration tools. These changes will be in the form of services that can be integrated with the user's work environment, and that enable uniform and highly capable access to these computers, data, and instruments, regardless of the location or exact nature of these resources. These services will integrate transient-use resources like computing systems, scientific instruments, and data caches (e.g., as they are needed to perform a simulation or analyze data from a single experiment); persistent-use resources, such as databases, data catalogues, and archives; and collaborators, whose involvement will continue for the lifetime of a project or longer. While we largely address large-scale science in this paper, Grids, particularly when combined with Web Services, will address a broad spectrum of science scenarios, both large and small scale.
Introducing the Global Fire WEather Database (GFWED)
NASA Astrophysics Data System (ADS)
Field, R. D.
2015-12-01
The Canadian Fire Weather Index (FWI) System is the most widely used fire danger rating system in the world. We have developed a global database of daily FWI System calculations beginning in 1980, called the Global Fire WEather Database (GFWED), gridded to a spatial resolution of 0.5° latitude by 2/3° longitude. Input weather data were obtained from the NASA Modern-Era Retrospective Analysis for Research and Applications (MERRA), along with two different estimates of daily precipitation from rain gauges over land. FWI System Drought Code calculations from the gridded datasets were compared to calculations from individual weather station data for a representative set of 48 stations in North, Central and South America, Europe, Russia, Southeast Asia and Australia. Gridded and station-based calculations tended to differ most at low latitudes for strictly MERRA-based calculations. Strong biases could be seen in either direction: MERRA-based DC over the Mato Grosso in Brazil reached unrealistically high values exceeding DC=1500 during the dry season, but was too low over Southeast Asia during the dry season. These biases are consistent with those previously identified in MERRA's precipitation and reinforce the need to consider alternative sources of precipitation data. GFWED is being used by researchers around the world for analyzing historical relationships between fire weather and fire activity at large scales, for identifying large-scale atmosphere-ocean controls on fire weather, and for calibration of FWI-based fire prediction models. These applications will be discussed. More information on GFWED can be found at http://data.giss.nasa.gov/impacts/gfwed/
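For readers unfamiliar with the quantity being compared, the following is a simplified sketch of the daily Drought Code (DC) update from the Canadian FWI System, after Van Wagner's published equations; the day-length factor Lf is normally month- and latitude-dependent but is hard-coded here for brevity:

```python
import math

def drought_code(dc_prev, temp_c, rain_mm, lf=1.4):
    """One daily DC update: rainfall recharge, then evapotranspiration drying."""
    if rain_mm > 2.8:                          # only "effective" rain counts
        rw = 0.83 * rain_mm - 1.27
        q = 800.0 * math.exp(-dc_prev / 400.0) + 3.937 * rw
        dc_prev = max(0.0, 400.0 * math.log(800.0 / q))
    v = max(0.0, 0.36 * (temp_c + 2.8) + lf)   # potential evapotranspiration
    return dc_prev + 0.5 * v

print(drought_code(dc_prev=300.0, temp_c=22.0, rain_mm=0.0))
```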
ClearedLeavesDB: an online database of cleared plant leaf images.
Das, Abhiram; Bucksch, Alexander; Price, Charles A; Weitz, Joshua S
2014-03-28
Leaf vein networks are critical to both the structure and function of leaves. A growing body of recent work has linked leaf vein network structure to the physiology, ecology and evolution of land plants. In the process, multiple institutions and individual researchers have assembled collections of cleared leaf specimens in which vascular bundles (veins) are rendered visible. In an effort to facilitate analysis and digitally preserve these specimens, high-resolution images are usually created, either of entire leaves or of magnified leaf subsections. In a few cases, collections of digital images of cleared leaves are available for use online. However, these collections do not share a common platform nor is there a means to digitally archive cleared leaf images held by individual researchers (in addition to those held by institutions). Hence, there is a growing need for a digital archive that enables online viewing, sharing and disseminating of cleared leaf image collections held by both institutions and individual researchers. The Cleared Leaf Image Database (ClearedLeavesDB), is an online web-based resource for a community of researchers to contribute, access and share cleared leaf images. ClearedLeavesDB leverages resources of large-scale, curated collections while enabling the aggregation of small-scale collections within the same online platform. ClearedLeavesDB is built on Drupal, an open source content management platform. It allows plant biologists to store leaf images online with corresponding meta-data, share image collections with a user community and discuss images and collections via a common forum. We provide tools to upload processed images and results to the database via a web services client application that can be downloaded from the database. We developed ClearedLeavesDB, a database focusing on cleared leaf images that combines interactions between users and data via an intuitive web interface. The web interface allows storage of large collections and integrates with leaf image analysis applications via an open application programming interface (API). The open API allows uploading of processed images and other trait data to the database, further enabling distribution and documentation of analyzed data within the community. The initial database is seeded with nearly 19,000 cleared leaf images representing over 40 GB of image data. Extensible storage and growth of the database is ensured by using the data storage resources of the iPlant Discovery Environment. ClearedLeavesDB can be accessed at http://clearedleavesdb.org.
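A hypothetical sketch of what an upload through such an open API could look like from Python; the endpoint URL and form fields below are invented for illustration, and only requests.post itself is a real library call:

```python
import requests

# Hypothetical endpoint and metadata fields; check the actual API docs.
with open("quercus_alba_cleared.png", "rb") as fh:
    resp = requests.post(
        "http://clearedleavesdb.org/api/upload",   # invented endpoint
        files={"image": fh},
        data={"species": "Quercus alba", "magnification": "5x"},
        timeout=30,
    )
print(resp.status_code)
```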
What if we took a global look?
NASA Astrophysics Data System (ADS)
Ouellet Dallaire, C.; Lehner, B.
2014-12-01
Freshwater resources are facing unprecedented pressures. In the hope of coping with these pressures, Environmental Hydrology, Freshwater Biology, and Fluvial Geomorphology have defined conceptual approaches such as "environmental flow requirements", "instream flow requirements" or the "normative flow regime" to define the flow regime appropriate to maintain a given ecological status. These advances in freshwater resources management require scientists to build bridges across disciplines. Holistic and multi-scale approaches are becoming more and more common in water sciences research. The intrinsic nature of river systems demands that these approaches account for the upstream-downstream link of watersheds. Before recent technological developments, large-scale analyses were cumbersome and, often, the necessary data were unavailable. However, new technologies, both for information collection and computing capacity, enable a high-resolution look at the global scale. For rivers around the world, this new outlook is facilitated by the hydrologically relevant geo-spatial database HydroSHEDS. This database now offers more than 24 million kilometers of rivers, some never mapped before, at the click of a fingertip. Large- and even global-scale assessments can now be used to compare rivers around the world. A river classification framework called GloRiC (Global River Classification) was developed using HydroSHEDS. This framework advocates a holistic approach to river systems by using sub-classifications drawn from six disciplines related to river sciences: hydrology, physiography and climate, geomorphology, chemistry, biology and human impact. Each of these disciplines brings complementary information on the rivers that is relevant at different scales. A first version of a global river reach classification was produced at the 500 m resolution. Variables used in the classification influence processes involved at different scales (e.g., topographic index vs. pH). However, all variables are computed at the same high spatial resolution. This way, we can have a global look at local phenomena.
Extreme Precipitation and High-Impact Landslides
NASA Technical Reports Server (NTRS)
Kirschbaum, Dalia; Adler, Robert; Huffman, George; Peters-Lidard, Christa
2012-01-01
It is well known that extreme or prolonged rainfall is the dominant trigger of landslides; however, there remain large uncertainties in characterizing the distribution of these hazards and meteorological triggers at the global scale. Researchers have evaluated the spatiotemporal distribution of extreme rainfall and landslides at local and regional scales, primarily using in situ data, yet few studies have mapped rainfall-triggered landslide distribution globally due to the dearth of landslide data and consistent precipitation information. This research uses a newly developed Global Landslide Catalog (GLC) and a 13-year satellite-based precipitation record from Tropical Rainfall Measuring Mission (TRMM) data. For the first time, these two unique products provide the foundation to quantitatively evaluate the co-occurrence of precipitation and rainfall-triggered landslides globally. The GLC, available from 2007 to the present, contains information on reported rainfall-triggered landslide events around the world using online media reports, disaster databases, etc. When evaluating this database, we observed that 2010 had a large number of high-impact landslide events relative to previous years. This study considers how variations in extreme and prolonged satellite-based rainfall are related to the distribution of landslides over the same time scales for three active landslide areas: Central America, the Himalayan Arc, and central-eastern China. Several test statistics confirm that TRMM rainfall generally scales with the observed increase in landslide reports and fatal events for 2010 and previous years over each region. These findings suggest that the co-occurrence of satellite precipitation and landslide reports may serve as a valuable indicator for characterizing the spatiotemporal distribution of landslide-prone areas in order to establish a global rainfall-triggered landslide climatology. This research also considers the sources of this extreme rainfall, citing teleconnections from ENSO as likely contributors to regional precipitation variability. This work demonstrates the potential for using satellite-based precipitation estimates to identify potentially active landslide areas at the global scale in order to improve landslide cataloging and quantify landslide triggering at daily, monthly and yearly time scales.
Kennedy, Amy E.; Khoury, Muin J.; Ioannidis, John P.A.; Brotzman, Michelle; Miller, Amy; Lane, Crystal; Lai, Gabriel Y.; Rogers, Scott D.; Harvey, Chinonye; Elena, Joanne W.; Seminara, Daniela
2017-01-01
Background We report on the establishment of a web-based Cancer Epidemiology Descriptive Cohort Database (CEDCD). The CEDCD’s goals are to enhance awareness of resources, facilitate interdisciplinary research collaborations, and support existing cohorts for the study of cancer-related outcomes. Methods Comprehensive descriptive data were collected from large cohorts established to study cancer as primary outcome using a newly developed questionnaire. These included an inventory of baseline and follow-up data, biospecimens, genomics, policies, and protocols. Additional descriptive data extracted from publicly available sources were also collected. This information was entered in a searchable and publicly accessible database. We summarized the descriptive data across cohorts and reported the characteristics of this resource. Results As of December 2015, the CEDCD includes data from 46 cohorts representing more than 6.5 million individuals (29% ethnic/racial minorities). Overall, 78% of the cohorts have collected blood at least once, 57% at multiple time points, and 46% collected tissue samples. Genotyping has been performed by 67% of the cohorts, while 46% have performed whole-genome or exome sequencing in subsets of enrolled individuals. Information on medical conditions other than cancer has been collected in more than 50% of the cohorts. More than 600,000 incident cancer cases and more than 40,000 prevalent cases are reported, with 24 cancer sites represented. Conclusions The CEDCD assembles detailed descriptive information on a large number of cancer cohorts in a searchable database. Impact Information from the CEDCD may assist the interdisciplinary research community by facilitating identification of well-established population resources and large-scale collaborative and integrative research. PMID:27439404
Computational Modeling as a Design Tool in Microelectronics Manufacturing
NASA Technical Reports Server (NTRS)
Meyyappan, Meyya; Arnold, James O. (Technical Monitor)
1997-01-01
Plans to introduce pilot lines or fabs for 300 mm processing are in progress. IC technology is simultaneously moving towards 0.25/0.18 micron. The convergence of these two trends places unprecedented, stringent demands on processes and equipment. More than ever, computational modeling is called upon to play a complementary role in equipment and process design. The pace of hardware/process development needs a matching pace in software development: an aggressive move towards developing "virtual reactors" is desirable and essential to reduce design cycles and costs. This goal has three elements: a reactor-scale model, a feature-level model, and a database of physical/chemical properties. With these elements coupled, the complete model should function as a design aid in a CAD environment. This talk aims to describe the various elements. At the reactor level, continuum, DSMC (or particle) and hybrid models will be discussed and compared using examples of plasma and thermal process simulations. In microtopography evolution, approaches such as level set methods compete with conventional geometric models. Regardless of the approach, the reliance on empiricism is to be eliminated through coupling to the reactor model and computational surface science. This coupling poses challenging issues of orders-of-magnitude variation in length and time scales. Finally, database development has fallen behind; the current situation is rapidly aggravated by the ever newer chemistries emerging to meet process metrics. The virtual reactor would be a useless concept without an accompanying reliable database that consists of: thermal reaction pathways and rate constants, electron-molecule cross sections, thermochemical properties, transport properties, and finally, surface data on the interaction of radicals, atoms and ions with various surfaces. Large-scale computational chemistry efforts are critical, as experiments alone cannot meet database needs due to the difficulties associated with such controlled experiments and costs.
The role of Natural Flood Management in managing floods in large scale basins during extreme events
NASA Astrophysics Data System (ADS)
Quinn, Paul; Owen, Gareth; ODonnell, Greg; Nicholson, Alex; Hetherington, David
2016-04-01
There is a strong evidence base showing the negative impacts of land use intensification and soil degradation in NW European river basins on hydrological response and on flood impact downstream. However, the ability to target zones of high runoff production, and the extent to which we can manage flood risk using nature-based flood management solutions, are less well known. A move to planting more trees and having less intensively farmed landscapes is part of natural flood management (NFM) solutions, and these methods suggest that flood risk can be managed in alternative and more holistic ways. So which local NFM methods should be used, where in a large-scale basin should they be deployed, and how does flow propagate to any point downstream? Generally, how much intervention is needed, and will it compromise food production systems? If we are observing record levels of rainfall and flow, for example during Storm Desmond in Dec 2015 in the North West of England, what other flood management options are really needed to complement our traditional defences in large basins for the future? In this paper we will show examples of NFM interventions in the UK that have had impacts at local-scale sites. We will demonstrate the impact of interventions at the local, sub-catchment (meso-scale) and finally the large scale. These tools include observations, process-based models and more generalised Flood Impact Models. Issues of synchronisation and the design level of protection will be debated. By reworking observed rainfall and discharge (runoff) for observed extreme events in the River Eden and River Tyne during Storm Desmond, we will show how much flood protection is needed in large-scale basins. The research will thus pose a number of key questions as to how floods may have to be managed in large-scale basins in the future. We will seek to support a method of catchment systems engineering that holds water back across the whole landscape as a major opportunity to manage water in large-scale basins in the future. The broader benefits of engineering landscapes to hold water for pollution control, sediment loss and drought minimisation will also be shown.
Salehi, Ali; Jimenez-Berni, Jose; Deery, David M; Palmer, Doug; Holland, Edward; Rozas-Larraondo, Pablo; Chapman, Scott C; Georgakopoulos, Dimitrios; Furbank, Robert T
2015-01-01
To our knowledge, there is no software or database solution that supports large volumes of biological time series sensor data efficiently and enables data visualization and analysis in real time. Existing solutions for managing data typically use unstructured file systems or relational databases. These systems are not designed to provide instantaneous response to user queries. Furthermore, they do not support rapid data analysis and visualization to enable interactive experiments. In large-scale experiments, this behaviour slows research discovery, discourages the widespread sharing and reuse of data that could otherwise inform critical decisions in a timely manner and encourage effective collaboration between groups. In this paper we present SensorDB, a web-based virtual laboratory that can manage large volumes of biological time series sensor data while supporting rapid data queries and real-time user interaction. SensorDB is sensor agnostic and uses web-based, state-of-the-art cloud and storage technologies to efficiently gather, analyse and visualize data. Collaboration and data sharing between different agencies and groups is thereby facilitated. SensorDB is available online at http://sensordb.csiro.au.
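A generic illustration (not SensorDB's actual code) of why time-ordered storage makes range queries on sensor streams fast: binary search over sorted timestamps instead of scanning flat files:

```python
import bisect

# Readings kept sorted by timestamp on ingest (invented values).
timestamps = [0, 60, 120, 180, 240, 300]          # seconds
values     = [21.1, 21.4, 21.9, 22.3, 22.0, 21.7]  # e.g. degrees C

def query_range(t0, t1):
    """Return (timestamp, value) pairs with t0 <= timestamp <= t1."""
    lo = bisect.bisect_left(timestamps, t0)
    hi = bisect.bisect_right(timestamps, t1)
    return list(zip(timestamps[lo:hi], values[lo:hi]))

print(query_range(60, 180))   # [(60, 21.4), (120, 21.9), (180, 22.3)]
```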
Setoguchi, Soko; Zhu, Ying; Jalbert, Jessica J; Williams, Lauren A; Chen, Chih-Ying
2014-05-01
Linking patient registries with administrative databases can enhance the utility of the databases for epidemiological and comparative effectiveness research. However, registries often lack direct personal identifiers, and the validity of record linkage using multiple indirect personal identifiers is not well understood. Using a large contemporary national cardiovascular device registry and 100% Medicare inpatient data, we linked hospitalization-level records. The main outcomes were the validity measures of several deterministic linkage rules using multiple indirect personal identifiers compared with rules using both direct and indirect personal identifiers. Linkage rules using 2 or 3 indirect, patient-level identifiers (ie, date of birth, sex, admission date) and hospital ID produced linkages with sensitivity of 95% and specificity of 98% compared with a gold standard linkage rule using a combination of both direct and indirect identifiers. Ours is the first large-scale study to validate the performance of deterministic linkage rules without direct personal identifiers. When linking hospitalization-level records in the absence of direct personal identifiers, provider information is necessary for successful linkage. © 2014 American Heart Association, Inc.
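A minimal sketch of a deterministic linkage rule of the kind evaluated above, matching on (date of birth, sex, admission date, hospital ID); the field names and records are invented:

```python
registry = [
    {"dob": "1941-03-02", "sex": "F", "admit": "2010-07-15", "hosp": "H12"},
]
medicare = [
    {"dob": "1941-03-02", "sex": "F", "admit": "2010-07-15", "hosp": "H12", "claim": 988},
    {"dob": "1950-01-09", "sex": "M", "admit": "2010-07-15", "hosp": "H12", "claim": 989},
]

# Composite key over the indirect identifiers plus provider information.
key = lambda r: (r["dob"], r["sex"], r["admit"], r["hosp"])
index = {key(r): r for r in medicare}   # assumes composite keys are unique

for rec in registry:
    match = index.get(key(rec))
    print("linked to claim", match["claim"] if match else None)
```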
Durbin, Kenneth R.; Tran, John C.; Zamdborg, Leonid; Sweet, Steve M. M.; Catherman, Adam D.; Lee, Ji Eun; Li, Mingxi; Kellie, John F.; Kelleher, Neil L.
2011-01-01
Applying high-throughput Top-Down MS to an entire proteome requires a yet-to-be-established model for data processing. Since Top-Down is becoming possible on a large scale, we report our latest software pipeline dedicated to capturing the full value of intact protein data in automated fashion. For intact mass detection, we combine algorithms for processing MS1 data from both isotopically resolved (FT) and charge-state resolved (ion trap) LC-MS data, which are then linked to their fragment ions for database searching using ProSight. Automated determination of human keratin and tubulin isoforms is one result. Optimized for the intricacies of whole proteins, new software modules visualize proteome-scale data based on the LC retention time and intensity of intact masses and enable selective detection of PTMs to automatically screen for acetylation, phosphorylation, and methylation. Software functionality was demonstrated using comparative LC-MS data from yeast strains in addition to human cells undergoing chemical stress. We further these advances as a key aspect of realizing Top-Down MS on a proteomic scale. PMID:20848673
Kevin M. Potter
2012-01-01
Analyzing patterns of forest pest infestation is necessary for monitoring the health of forested ecosystems because of the impacts that insects and diseases can have on forest structure, composition, biodiversity, and species distributions (Castello and others 1995). In particular, introduced nonnative insects and diseases can extensively damage the diversity, ecology...
Jennifer C. Jenkins; Richard A. Birdsey
2000-01-01
As interest grows in the role of forest growth in the carbon cycle, and as simulation models are applied to predict future forest productivity at large spatial scales, the need for reliable and field-based data for evaluation of model estimates is clear. We created estimates of potential forest biomass and annual aboveground production for the Chesapeake Bay watershed...
LLMapReduce: Multi-Level Map-Reduce for High Performance Data Analysis
2016-05-23
LLMapReduce works with several schedulers such as SLURM, Grid Engine and LSF. Keywords: LLMapReduce; map-reduce; performance; scheduler; Grid Engine; SLURM; LSF. Large-scale computing is currently dominated by four ecosystems: supercomputing, database, enterprise, and big data.
ERIC Educational Resources Information Center
Association for Education in Journalism and Mass Communication.
The Technology and the Media section of the proceedings contains the following 18 papers: "What's Wrong with This Picture?: Attitudes of Photographic Editors at Daily Newspapers and Their Tolerance toward Digital Manipulation" (Shiela Reaves); "Strategies for the Analysis of Large-Scale Databases in Computer-Assisted Investigative…
Design and Implementation of a Metadata-rich File System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ames, S; Gokhale, M B; Maltzahn, C
2010-01-19
Despite continual improvements in the performance and reliability of large scale file systems, the management of user-defined file system metadata has changed little in the past decade. The mismatch between the size and complexity of large scale data stores and their ability to organize and query their metadata has led to a de facto standard in which raw data is stored in traditional file systems, while related, application-specific metadata is stored in relational databases. This separation of data and semantic metadata requires considerable effort to maintain consistency and can result in complex, slow, and inflexible system operation. To address these problems, we have developed the Quasar File System (QFS), a metadata-rich file system in which files, user-defined attributes, and file relationships are all first class objects. In contrast to hierarchical file systems and relational databases, QFS defines a graph data model composed of files and their relationships. QFS incorporates Quasar, an XPATH-extended query language for searching the file system. Results from our QFS prototype show the effectiveness of this approach. Compared to the de facto standard, the QFS prototype shows superior ingest performance and comparable query performance on user metadata-intensive operations and superior performance on normal file metadata operations.
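A toy sketch of the graph data model idea, with files and typed relationships as first-class objects; the traversal below stands in for a real Quasar/XPATH query and does not use Quasar's syntax:

```python
# Files carry user-defined attributes; edges are typed relationships.
files = {
    "run1.dat":  {"experiment": "laserA", "quality": "good"},
    "run1.meta": {"experiment": "laserA"},
}
edges = [("run1.meta", "describes", "run1.dat")]   # (source, relation, target)

def related(target, relation):
    """All files linked to `target` by `relation`."""
    return [s for (s, r, t) in edges if t == target and r == relation]

print(related("run1.dat", "describes"))   # ['run1.meta']
```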
Ordinal feature selection for iris and palmprint recognition.
Sun, Zhenan; Wang, Libin; Tan, Tieniu
2014-09-01
Ordinal measures have been demonstrated to be an effective feature representation model for iris and palmprint recognition. However, ordinal measures are a general concept of image analysis, and numerous variants with different parameter settings, such as location, scale, and orientation, can be derived to construct a huge feature space. This paper proposes a novel optimization formulation for ordinal feature selection with successful applications to both iris and palmprint recognition. The objective function of the proposed feature selection method has two parts, i.e., the misclassification error of intra- and interclass matching samples and the weighted sparsity of ordinal feature descriptors. The feature selection therefore aims to achieve an accurate and sparse representation of ordinal measures. The optimization is subject to a number of linear inequality constraints, which require that all intra- and interclass matching pairs be well separated with a large margin. Ordinal feature selection is formulated as a linear programming (LP) problem so that a solution can be efficiently obtained even on a large-scale feature pool and training database. Extensive experimental results demonstrate that the proposed LP formulation is advantageous over existing feature selection methods, such as mRMR, ReliefF, Boosting, and Lasso for biometric recognition, reporting state-of-the-art accuracy on the CASIA and PolyU databases.
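A small sketch of this kind of LP using scipy.optimize.linprog, with invented 3-feature dissimilarity vectors; the variables are the feature weights, a decision threshold, and one slack per training pair (the real method selects from a far larger pool of ordinal-measure candidates):

```python
import numpy as np
from scipy.optimize import linprog

intra = np.array([[0.1, 0.2, 0.1], [0.2, 0.1, 0.3]])   # same-class pair dissimilarities
inter = np.array([[0.8, 0.7, 0.9], [0.6, 0.9, 0.7]])   # different-class pairs
n_feat, lam = 3, 0.1
n = len(intra) + len(inter)

# variables: [w_1..w_3, theta, xi_1..xi_n]; minimize slack + lam * sum(w)
c = np.concatenate([lam * np.ones(n_feat), [0.0], np.ones(n)])
A, b = [], []
for i, x in enumerate(intra):      # w.x - theta - xi_i <= -1  (margin below threshold)
    row = np.zeros(n_feat + 1 + n); row[:n_feat] = x; row[n_feat] = -1; row[n_feat + 1 + i] = -1
    A.append(row); b.append(-1.0)
for i, x in enumerate(inter):      # -w.x + theta - xi_i <= -1 (margin above threshold)
    row = np.zeros(n_feat + 1 + n); row[:n_feat] = -x; row[n_feat] = 1; row[n_feat + 1 + len(intra) + i] = -1
    A.append(row); b.append(-1.0)

bounds = [(0, None)] * n_feat + [(None, None)] + [(0, None)] * n
res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=bounds)
print(res.x[:n_feat])   # sparse, non-negative feature weights
```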
Proteogenomic database construction driven from large scale RNA-seq data.
Woo, Sunghee; Cha, Seong Won; Merrihew, Gennifer; He, Yupeng; Castellana, Natalie; Guest, Clark; MacCoss, Michael; Bafna, Vineet
2014-01-03
The advent of inexpensive RNA-seq technologies and other deep sequencing technologies for RNA has the promise to radically improve genomic annotation, providing information on transcribed regions and splicing events in a variety of cellular conditions. Using MS-based proteogenomics, many of these events can be confirmed directly at the protein level. However, the integration of large amounts of redundant RNA-seq data and mass spectrometry data poses a challenging problem. Our paper addresses this by constructing a compact database that contains all useful information expressed in RNA-seq reads. Applying our method to cumulative C. elegans data reduced 496.2 GB of aligned RNA-seq SAM files to 410 MB of splice graph database written in FASTA format. This corresponds to roughly 1000× compression of data size, without loss of sensitivity. We performed a proteogenomics study using the custom data set, using a completely automated pipeline, and identified a total of 4044 novel events, including 215 novel genes, 808 novel exons, 12 alternative splicings, 618 gene-boundary corrections, 245 exon-boundary changes, 938 frame shifts, 1166 reverse strands, and 42 translated UTRs. Our results highlight the usefulness of transcript + proteomic integration for improved genome annotations.
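A toy sketch of the splice-graph idea behind that compression: redundant spliced alignments collapse into exon nodes and junction edges, so the graph stays small no matter how many reads support it (coordinates invented):

```python
from collections import defaultdict

# Spliced reads as lists of (exon_start, exon_end) blocks on one chromosome.
reads = [
    [(100, 150), (300, 360)],
    [(100, 150), (300, 360)],   # redundant read: adds no new structure
    [(100, 150), (500, 540)],   # alternative junction
]

exons, junctions = set(), defaultdict(int)
for blocks in reads:
    exons.update(blocks)
    for a, b in zip(blocks, blocks[1:]):
        junctions[(a, b)] += 1   # read support per splice junction

print(sorted(exons))
print(dict(junctions))           # the graph, far smaller than the read set
```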
Geologic map of Chickasaw National Recreation Area, Murray County, Oklahoma
Blome, Charles D.; Lidke, David J.; Wahl, Ronald R.; Golab, James A.
2013-01-01
This 1:24,000-scale geologic map is a compilation of previous geologic maps and new geologic mapping of areas in and around Chickasaw National Recreation Area. The geologic map includes revisions of numerous unit contacts and faults and a number of previously “undifferentiated” rock units were subdivided in some areas. Numerous circular-shaped hills in and around Chickasaw National Recreation Area are probably the result of karst-related collapse and may represent the erosional remnants of large, exhumed sinkholes. Geospatial registration of existing, smaller scale (1:72,000- and 1:100,000-scale) geologic maps of the area and construction of an accurate Geographic Information System (GIS) database preceded 2 years of fieldwork wherein previously mapped geology (unit contacts and faults) was verified and new geologic mapping was carried out. The geologic map of Chickasaw National Recreation Area and this pamphlet include information pertaining to how the geologic units and structural features in the map area relate to the formation of the northern Arbuckle Mountains and its Arbuckle-Simpson aquifer. The development of an accurate geospatial GIS database and the use of a handheld computer in the field greatly increased both the accuracy and efficiency in producing the 1:24,000-scale geologic map.
Automation of a N-S S and C Database Generation for the Harrier in Ground Effect
NASA Technical Reports Server (NTRS)
Murman, Scott M.; Chaderjian, Neal M.; Pandya, Shishir; Kwak, Dochan (Technical Monitor)
2001-01-01
A method of automating the generation of a time-dependent, Navier-Stokes static stability and control database for the Harrier aircraft in ground effect is outlined. Reusable, lightweight components are described which allow different facets of the computational fluid dynamics simulation process to utilize a consistent interface to a remote database. These components also allow changes and customizations to be easily incorporated into the solution process to enhance performance, without relying upon third-party support. An analysis of the multi-level parallel solver OVERFLOW-MLP is presented, and the results indicate that it is feasible to utilize large numbers of processors (≈100) even with a grid system with a relatively small number of cells (≈10^6). A more detailed discussion of the simulation process, as well as refined data for the scaling of the OVERFLOW-MLP flow solver, will be included in the full paper.
Schröter, Pauline; Schroeder, Sascha
2017-12-01
With the Developmental Lexicon Project (DeveL), we present a large-scale study that was conducted to collect data on visual word recognition in German across the lifespan. A total of 800 children from Grades 1 to 6, as well as two groups of younger and older adults, participated in the study and completed a lexical decision and a naming task. We provide a database for 1,152 German words, comprising behavioral data from seven different stages of reading development, along with sublexical and lexical characteristics for all stimuli. The present article describes our motivation for this project, explains the methods we used to collect the data, and reports analyses on the reliability of our results. In addition, we explored developmental changes in three marker effects in psycholinguistic research: word length, word frequency, and orthographic similarity. The database is available online.
NASA Technical Reports Server (NTRS)
Winckelmans, G. S.; Lund, T. S.; Carati, D.; Wray, A. A.
1996-01-01
Subgrid-scale models for Large Eddy Simulation (LES) in both the velocity-pressure and the vorticity-velocity formulations were evaluated and compared in a priori tests using spectral Direct Numerical Simulation (DNS) databases of isotropic turbulence: a 128^3 DNS of forced turbulence (Re_λ = 95.8) filtered, using the sharp cutoff filter, to both 32^3 and 16^3 synthetic LES fields; a 512^3 DNS of decaying turbulence (Re_λ = 63.5) filtered to both 64^3 and 32^3 LES fields. Gaussian and top-hat filters were also used with the 128^3 database. Different LES models were evaluated for each formulation: eddy-viscosity models, hyper eddy-viscosity models, mixed models, and scale-similarity models. Correlations between exact versus modeled subgrid-scale quantities were measured at three levels: tensor (traceless), vector (solenoidal 'force'), and scalar (dissipation) levels, and for both cases of uniform and variable coefficient(s). Different choices for the 1/T scaling appearing in the eddy-viscosity were also evaluated. It was found that the models for the vorticity-velocity formulation produce higher correlations with the filtered DNS data than their counterparts in the velocity-pressure formulation. It was also found that the hyper eddy-viscosity model performs better than the eddy-viscosity model, in both formulations.
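A minimal sketch of the a priori test itself: the correlation coefficient between an "exact" filtered-DNS quantity and its modeled counterpart (random arrays stand in for the real databases):

```python
import numpy as np

rng = np.random.default_rng(0)
exact = rng.standard_normal((32, 32, 32))                 # exact SGS quantity
model = 0.4 * exact + rng.standard_normal((32, 32, 32))   # imperfect model

def correlation(a, b):
    """Pearson correlation over all grid points."""
    a, b = a.ravel() - a.mean(), b.ravel() - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

print(f"a priori correlation: {correlation(exact, model):.2f}")
```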
Fast large-scale object retrieval with binary quantization
NASA Astrophysics Data System (ADS)
Zhou, Shifu; Zeng, Dan; Shen, Wei; Zhang, Zhijiang; Tian, Qi
2015-11-01
The objective of large-scale object retrieval systems is to search for images that contain the target object in an image database. Where state-of-the-art approaches rely on global image representations to conduct searches, we consider many boxes per image as candidates to search locally in a picture. In this paper, a feature quantization algorithm called binary quantization is proposed. In binary quantization, a scale-invariant feature transform (SIFT) feature is quantized into a descriptive and discriminative bit-vector, which adapts naturally to the classic inverted file structure for box indexing. The inverted file, which stores each bit-vector together with the ID of the box containing the SIFT feature, is compact and can be loaded into main memory for efficient box indexing. We evaluate our approach on available object retrieval datasets. Experimental results demonstrate that the proposed approach is fast and achieves excellent search quality. Therefore, the proposed approach is an improvement over state-of-the-art approaches for object retrieval.
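A simplified sketch of the indexing idea: threshold a descriptor against per-dimension medians to get a bit-vector, then use it as the key of an inverted file mapping to (image, box) IDs. The paper's actual quantizer is more elaborate; median thresholding here is only a stand-in:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
descriptors = rng.random((1000, 16))      # stand-ins for 128-D SIFT features
medians = np.median(descriptors, axis=0)

def quantize(d):
    bits = (d > medians).astype(np.uint8)        # 1 bit per dimension
    return int("".join(map(str, bits)), 2)       # bit-vector as integer key

inverted = defaultdict(list)                     # bit-vector -> (image, box) ids
for i, d in enumerate(descriptors):
    inverted[quantize(d)].append(("img%d" % (i // 10), i % 10))

query = descriptors[42]
print(inverted[quantize(query)])                 # candidate boxes, O(1) lookup
```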
NASA Technical Reports Server (NTRS)
Hueschen, Richard M.
2011-01-01
A six degree-of-freedom, flat-earth dynamics, non-linear, and non-proprietary aircraft simulation was developed that is representative of a generic mid-sized twin-jet transport aircraft. The simulation was developed from a non-proprietary, publicly available, subscale twin-jet transport aircraft simulation using scaling relationships and a modified aerodynamic database. The simulation has an extended aerodynamics database with aero data outside the normal transport-operating envelope (large angle-of-attack and sideslip values). The simulation has representative transport aircraft surface actuator models with variable rate-limits and generally fixed position limits. The simulation contains a generic 40,000 lb sea level thrust engine model. The engine model is a first order dynamic model with a variable time constant that changes according to simulation conditions. The simulation provides a means for interfacing a flight control system to use the simulation sensor variables and to command the surface actuators and throttle position of the engine model.
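A minimal sketch of a first-order engine lag of the kind described, with an invented rule standing in for the condition-dependent time constant:

```python
def engine_step(thrust, cmd, dt, tau):
    """One Euler step of dT/dt = (T_cmd - T) / tau."""
    return thrust + dt * (cmd - thrust) / tau

thrust, cmd, dt = 0.0, 40000.0, 0.02            # lb, lb, s
for _ in range(500):                             # 10 s of simulated time
    tau = 1.0 if thrust < 0.5 * cmd else 2.5     # invented variable time constant
    thrust = engine_step(thrust, cmd, dt, tau)
print(round(thrust))                             # thrust approaching the command
```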
Intermittency measurement in two-dimensional bacterial turbulence
NASA Astrophysics Data System (ADS)
Qiu, Xiang; Ding, Long; Huang, Yongxiang; Chen, Ming; Lu, Zhiming; Liu, Yulu; Zhou, Quan
2016-06-01
In this paper, an experimental velocity database of a bacterial collective motion, e.g., Bacillus subtilis, in the turbulent phase with volume filling fraction 84%, provided by Professor Goldstein at Cambridge University (UK), was analyzed to emphasize the scaling behavior of this active turbulence system. This was accomplished by performing a Hilbert-based methodology analysis to retrieve the scaling property without the β-limitation. A dual-power-law behavior separated by the viscosity scale ℓ_ν was observed for the qth-order Hilbert moment L_q(k). This dual power law belongs to an inverse cascade, since the scaling range is above the injection scale R, e.g., the bacterial body length. The measured scaling exponents ζ(q) of both the small-scale (k > k_ν) and large-scale (k
Ameur, Adam; Bunikis, Ignas; Enroth, Stefan; Gyllensten, Ulf
2014-01-01
CanvasDB is an infrastructure for management and analysis of genetic variants from massively parallel sequencing (MPS) projects. The system stores SNP and indel calls in a local database, designed to handle very large datasets, to allow for rapid analysis using simple commands in R. Functional annotations are included in the system, making it suitable for direct identification of disease-causing mutations in human exome- (WES) or whole-genome sequencing (WGS) projects. The system has a built-in filtering function implemented to simultaneously take into account variant calls from all individual samples. This enables advanced comparative analysis of variant distribution between groups of samples, including detection of candidate causative mutations within family structures and genome-wide association by sequencing. In most cases, these analyses are executed within just a matter of seconds, even when there are several hundreds of samples and millions of variants in the database. We demonstrate the scalability of canvasDB by importing the individual variant calls from all 1092 individuals present in the 1000 Genomes Project into the system, over 4.4 billion SNPs and indels in total. Our results show that canvasDB makes it possible to perform advanced analyses of large-scale WGS projects on a local server. Database URL: https://github.com/UppsalaGenomeCenter/CanvasDB PMID:25281234
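A Python stand-in (the actual canvasDB interface is R) for the built-in filtering idea: keep variants carried by all affected samples and by no unaffected one; the genotype table is invented:

```python
# variant -> {sample: carrier flag}; invented calls for three samples
variants = {
    "chr1:10177:A>C": {"s1": 1, "s2": 1, "s3": 0},
    "chr2:554:G>T":   {"s1": 1, "s2": 1, "s3": 1},
}
affected, unaffected = {"s1", "s2"}, {"s3"}

hits = [v for v, g in variants.items()
        if all(g[s] for s in affected) and not any(g[s] for s in unaffected)]
print(hits)   # ['chr1:10177:A>C'] - candidate causative variants
```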
NASA Astrophysics Data System (ADS)
Miyazaki, Kazuteru; Tsuboi, Sougo; Kobayashi, Shigenobu
The purpose of reinforcement learning is, in general, to learn an optimal policy. However, in two-player games such as Othello, it is important to acquire a penalty-avoiding policy. In this paper, we focus on the formation of a penalty-avoiding policy based on the Penalty Avoiding Rational Policy Making algorithm [Miyazaki 01]. In applying it to large-scale problems, we are confronted with the curse of dimensionality. We introduce several ideas and heuristics to overcome the combinatorial explosion in large-scale problems. First, we propose an algorithm that saves memory by calculating state transitions. Second, we describe how to restrict exploration using two types of knowledge: a KIFU database and an evaluation function. We show that our learning player can always defeat the well-known Othello program KITTY.
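As a rough illustration of the penalty-avoiding idea only (not the authors' actual algorithm, which operates on rational policies rather than a simple blacklist), the sketch below records state-action pairs observed to lead to a loss and excludes them from later action selection:

```python
import random

penalty_rules = set()     # (state, action) pairs judged to lead to a penalty

def choose_action(state, legal_actions):
    safe = [a for a in legal_actions if (state, a) not in penalty_rules]
    # If every action is a known penalty, the position itself is lost anyway.
    return random.choice(safe or legal_actions)

def record_episode(trajectory, lost):
    # On a loss, blame the final decision; repeated play propagates penalties
    # backwards as earlier states run out of safe actions.
    if lost:
        penalty_rules.add(trajectory[-1])

# Toy usage in an abstract game:
record_episode([("s0", "a1"), ("s1", "a0")], lost=True)
print(choose_action("s1", ["a0", "a2"]))   # "a0" is now avoided -> "a2"
```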
NASA Astrophysics Data System (ADS)
Zhan, Aibin; Bao, Zhenmin; Hu, Xiaoli; Lu, Wei; Hu, Jingjie
2009-06-01
Microsatellite markers have become one of the most important molecular tools used in a wide range of research. Large numbers of microsatellite markers are required for whole-genome surveys in molecular ecology, quantitative genetics, and genomics. Therefore, it is essential to select versatile, low-cost, efficient, and time- and labor-saving methods for developing a large panel of microsatellite markers. In this study, we used the Zhikong scallop (Chlamys farreri) as the target species to compare the efficiency of five methods derived from three strategies for microsatellite marker development. The results showed that the strategy of constructing a small-insert genomic DNA library had poor efficiency, while the microsatellite-enriched strategy greatly improved isolation efficiency. Although the public-database mining strategy saves time and cost, it is difficult to obtain a large number of microsatellite markers this way, mainly because of the limited sequence data from non-model species deposited in public databases. Based on the results of this study, we recommend two methods for large-scale microsatellite marker development, microsatellite-enriched library construction and FIASCO-colony hybridization, both derived from the microsatellite-enriched strategy. The experimental results obtained from the Zhikong scallop also provide a reference for microsatellite marker development in other species with large genomes.
Adverse Events Associated with Prolonged Antibiotic Use
Meropol, Sharon B.; Chan, K. Arnold; Chen, Zhen; Finkelstein, Jonathan A.; Hennessy, Sean; Lautenbach, Ebbing; Platt, Richard; Schech, Stephanie D.; Shatin, Deborah; Metlay, Joshua P.
2014-01-01
Purpose The Infectious Diseases Society of America and US CDC recommend 60 days of ciprofloxacin, doxycycline or amoxicillin for anthrax prophylaxis. It is not possible to determine severe adverse drug event (ADE) risks from the few people thus far exposed to anthrax prophylaxis. This study’s objective was to estimate risks of severe ADEs associated with long-term ciprofloxacin, doxycycline and amoxicillin exposure using 3 large databases: one electronic medical record (General Practice Research Database) and two claims databases (UnitedHealthcare, HMO Research Network). Methods We included office visit, hospital admission and prescription data for 1/1/1999–6/30/2001. The exposure variable was oral antibiotic person-days (pds). The primary outcome was hospitalization during exposure with ADE diagnoses: anaphylaxis, phototoxicity, hepatotoxicity, nephrotoxicity, seizures, ventricular arrhythmia or infectious colitis. Results We randomly sampled 999,773, 1,047,496 and 1,819,004 patients from Databases A, B and C respectively. 33,183 amoxicillin, 15,250 ciprofloxacin and 50,171 doxycycline prescriptions continued ≥30 days. ADE hospitalizations during long-term exposure were not observed in Database A. ADEs during long-term amoxicillin were seen only in Database C, with 5 ADEs or 1.2 (0.4–2.7) ADEs/100,000 pds exposure. Long-term ciprofloxacin showed 3 and 4 ADEs, with 5.7 (1.2–16.6) and 3.5 (1.0–9.0) ADEs/100,000 pds in Databases B and C, respectively. Only Database B had ADEs during long-term doxycycline, with 3 ADEs or 0.9 (0.2–2.6) ADEs/100,000 pds. For most events, the incidence rate ratio comparing >28 vs. 1–28 pds exposure was <1, showing limited evidence for cumulative dose-related ADEs from long-term exposure. Conclusions Long-term amoxicillin, ciprofloxacin and doxycycline appear safe, supporting use of these medications if needed for large-scale post-exposure anthrax prophylaxis. PMID:18215001
Vivar, Juan C; Pemu, Priscilla; McPherson, Ruth; Ghosh, Sujoy
2013-08-01
Unparalleled technological advances have fueled an explosive growth in the scope and scale of biological data and have propelled life sciences into the realm of "Big Data" that cannot be managed or analyzed by conventional approaches. Big Data in the life sciences are driven primarily via a diverse collection of 'omics'-based technologies, including genomics, proteomics, metabolomics, transcriptomics, metagenomics, and lipidomics. Gene-set enrichment analysis is a powerful approach for interrogating large 'omics' datasets, leading to the identification of biological mechanisms associated with observed outcomes. While several factors influence the results from such analysis, the impact from the contents of pathway databases is often under-appreciated. Pathway databases often contain variously named pathways that overlap with one another to varying degrees. Ignoring such redundancies during pathway analysis can lead to the designation of several pathways as being significant due to high content-similarity, rather than truly independent biological mechanisms. Statistically, such dependencies also result in correlated p values and overdispersion, leading to biased results. We investigated the level of redundancies in multiple pathway databases and observed large discrepancies in the nature and extent of pathway overlap. This prompted us to develop the application, ReCiPa (Redundancy Control in Pathway Databases), to control redundancies in pathway databases based on user-defined thresholds. Analysis of genomic and genetic datasets, using ReCiPa-generated overlap-controlled versions of KEGG and Reactome pathways, led to a reduction in redundancy among the top-scoring gene-sets and allowed for the inclusion of additional gene-sets representing possibly novel biological mechanisms. Using obesity as an example, bioinformatic analysis further demonstrated that gene-sets identified from overlap-controlled pathway databases show stronger evidence of prior association to obesity compared to pathways identified from the original databases.
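The redundancy-control step can be pictured with a short sketch in the spirit of ReCiPa, though not its actual implementation: compute an overlap coefficient between gene sets and merge any pair exceeding a user-defined threshold.

```python
def overlap(a, b):
    # Overlap coefficient: shared genes relative to the smaller set.
    return len(a & b) / min(len(a), len(b))

def control_redundancy(pathways, threshold=0.9):
    merged = True
    while merged:
        merged = False
        names = list(pathways)
        for i, n1 in enumerate(names):
            for n2 in names[i + 1:]:
                if overlap(pathways[n1], pathways[n2]) > threshold:
                    pathways[n1 + "|" + n2] = pathways.pop(n1) | pathways.pop(n2)
                    merged = True
                    break
            if merged:
                break
    return pathways

demo = {"glycolysis_a": {"HK1", "PFKM", "PKM", "ALDOA"},
        "glycolysis_b": {"HK1", "PFKM", "PKM", "ENO1"},
        "tca_cycle":    {"CS", "IDH1", "SDHA"}}
print(list(control_redundancy(demo, threshold=0.7)))
```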
Scaling laws and fluctuations in the statistics of word frequencies
NASA Astrophysics Data System (ADS)
Gerlach, Martin; Altmann, Eduardo G.
2014-11-01
In this paper, we combine statistical analysis of written texts and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. The average vocabulary of an ensemble of fixed-length texts is known to scale sublinearly with the total number of words (Heaps’ law). Analyzing the fluctuations around this average in three large databases (Google-ngram, English Wikipedia, and a collection of scientific articles), we find that the standard deviation scales linearly with the average (Taylor's law), in contrast to the prediction of decaying fluctuations obtained using simple sampling arguments. We explain both scaling laws (Heaps’ and Taylor's) by modeling the usage of words as a Poisson process with a fat-tailed distribution of word frequencies (Zipf's law) and topic-dependent frequencies of individual words (as in topic models). Considering topical variations leads to quenched averages, turns the vocabulary size into a non-self-averaging quantity, and explains the empirical observations. For the numerous practical applications relying on estimates of vocabulary size, our results show that uncertainties remain large even for long texts. We show how to account for these uncertainties in measurements of the lexical richness of texts with different lengths.
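A minimal simulation conveys the mechanism: draw word counts from a Poisson process with Zipf-distributed frequencies, optionally reweighting the frequencies per realization as a crude stand-in for topic dependence. Without reweighting, relative fluctuations of the vocabulary size decay as simple sampling predicts; with it, the standard deviation stays roughly proportional to the mean, as in Taylor's law. All parameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 50_000
freq = 1.0 / np.arange(1, W + 1)          # Zipf's law: f_r ~ 1/r
freq /= freq.sum()

def vocabulary_size(n_words, topical=False):
    f = freq
    if topical:                           # quenched, topic-dependent frequencies
        w = rng.lognormal(sigma=1.0, size=W)
        f = freq * w / (freq * w).sum()
    return np.count_nonzero(rng.poisson(n_words * f))

for n in (10_000, 100_000):
    for topical in (False, True):
        v = np.array([vocabulary_size(n, topical) for _ in range(50)])
        print(f"N={n}  topical={topical}  mean V={v.mean():.0f}  std={v.std():.1f}")
```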
bioNerDS: exploring bioinformatics’ database and software use through literature mining
2013-01-01
Background Biology-focused databases and software define bioinformatics and their use is central to computational biology. In such a complex and dynamic field, it is of interest to understand what resources are available, which are used, how much they are used, and for what they are used. While scholarly literature surveys can provide some insights, large-scale computer-based approaches to identify mentions of bioinformatics databases and software from primary literature would automate systematic cataloguing, facilitate the monitoring of usage, and provide the foundations for the recovery of computational methods for analysing biological data, with the long-term aim of identifying best/common practice in different areas of biology. Results We have developed bioNerDS, a named entity recogniser for the recovery of bioinformatics databases and software from primary literature. We identify such entities with an F-measure ranging from 63% to 91% at the mention level and 63–78% at the document level, depending on corpus. Not attaining a higher F-measure is mostly due to high ambiguity in resource naming, which is compounded by the on-going introduction of new resources. To demonstrate the software, we applied bioNerDS to full-text articles from BMC Bioinformatics and Genome Biology. General mention patterns reflect the remit of these journals, highlighting BMC Bioinformatics’s emphasis on new tools and Genome Biology’s greater emphasis on data analysis. The data also illustrates some shifts in resource usage: for example, the past decade has seen R and the Gene Ontology join BLAST and GenBank as the main components in bioinformatics processing. Conclusions We demonstrate the feasibility of automatically identifying resource names on a large-scale from the scientific literature and show that the generated data can be used for exploration of bioinformatics database and software usage. For example, our results help to investigate the rate of change in resource usage and corroborate the suspicion that a vast majority of resources are created, but rarely (if ever) used thereafter. bioNerDS is available at http://bionerds.sourceforge.net/. PMID:23768135
Visualising biological data: a semantic approach to tool and database integration
Pettifer, Steve; Thorne, David; McDermott, Philip; Marsh, James; Villéger, Alice; Kell, Douglas B; Attwood, Teresa K
2009-01-01
Motivation In the biological sciences, the need to analyse vast amounts of information has become commonplace. Such large-scale analyses often involve drawing together data from a variety of different databases, held remotely on the internet or locally on in-house servers. Supporting these tasks are ad hoc collections of data-manipulation tools, scripting languages and visualisation software, which are often combined in arcane ways to create cumbersome systems that have been customised for a particular purpose, and are consequently not readily adaptable to other uses. For many day-to-day bioinformatics tasks, the sizes of current databases, and the scale of the analyses necessary, now demand increasing levels of automation; nevertheless, the unique experience and intuition of human researchers is still required to interpret the end results in any meaningful biological way. Putting humans in the loop requires tools to support real-time interaction with these vast and complex data-sets. Numerous tools do exist for this purpose, but many do not have optimal interfaces, most are effectively isolated from other tools and databases owing to incompatible data formats, and many have limited real-time performance when applied to realistically large data-sets: much of the user's cognitive capacity is therefore focused on controlling the software and manipulating esoteric file formats rather than on performing the research. Methods To confront these issues, harnessing expertise in human-computer interaction (HCI), high-performance rendering and distributed systems, and guided by bioinformaticians and end-user biologists, we are building reusable software components that, together, create a toolkit that is both architecturally sound from a computing point of view, and addresses both user and developer requirements. Key to the system's usability is its direct exploitation of semantics, which, crucially, gives individual components knowledge of their own functionality and allows them to interoperate seamlessly, removing many of the existing barriers and bottlenecks from standard bioinformatics tasks. Results The toolkit, named Utopia, is freely available from http://utopia.cs.man.ac.uk/. PMID:19534744
High throughput profile-profile based fold recognition for the entire human proteome.
McGuffin, Liam J; Smith, Richard T; Bryson, Kevin; Sørensen, Søren-Aksel; Jones, David T
2006-06-07
In order to maintain the most comprehensive structural annotation databases we must carry out regular updates for each proteome using the latest profile-profile fold recognition methods. The ability to carry out these updates on demand is necessary to keep pace with the regular updates of sequence and structure databases. Providing the highest quality structural models requires the most intensive profile-profile fold recognition methods running with the very latest available sequence databases and fold libraries. However, running these methods on such a regular basis for every sequenced proteome requires large amounts of processing power. In this paper we describe and benchmark the JYDE (Job Yield Distribution Environment) system, which is a meta-scheduler designed to work above cluster schedulers, such as Sun Grid Engine (SGE) or Condor. We demonstrate the ability of JYDE to distribute the load of genomic-scale fold recognition across multiple independent Grid domains. We use the most recent profile-profile version of our mGenTHREADER software in order to annotate the latest version of the human proteome against the latest sequence and structure databases in as short a time as possible. We show that our JYDE system is able to scale to large numbers of intensive fold recognition jobs running across several independent computer clusters. Using our JYDE system we have been able to annotate 99.9% of the protein sequences within the human proteome in less than 24 hours, by harnessing over 500 CPUs from 3 independent Grid domains. This study clearly demonstrates the feasibility of carrying out on demand high quality structural annotations for the proteomes of major eukaryotic organisms. Specifically, we have shown that it is now possible to provide complete regular updates of profile-profile based fold recognition models for entire eukaryotic proteomes, through the use of Grid middleware such as JYDE.
Chien, Pei-Shan; Tseng, Yu-Fang; Hsu, Yao-Chin; Lai, Yu-Kai; Weng, Shih-Feng
2013-08-15
Large-scale pharmaco-epidemiological studies of Chinese herbal medicine (CHM) for treatment of urticaria are few, even though clinical trials have shown that some CHM are effective. The purpose of this study was to explore the frequencies and patterns of CHM prescriptions for urticaria by analysing the population-based CHM database in Taiwan. This study was linked to and processed through the complete traditional CHM database of the National Health Insurance Research Database in Taiwan during 2009. We calculated the frequencies and patterns of CHM prescriptions used for treatment of urticaria, with the diagnosis defined as the single ICD-9 code 708. Frequent itemset mining, as applied to data mining, was used to analyse co-prescription of CHM for patients with urticaria. There were 37,386 subjects who visited traditional Chinese Medicine clinics for urticaria in Taiwan during 2009 and received a total of 95,765 CHM prescriptions. Subjects between 18 and 35 years of age comprised the largest number of those treated (32.76%). In addition, women used CHM for urticaria more frequently than men (female:male = 1.94:1). There was an average of 5.54 items prescribed in the form of either individual Chinese herbs or a formula in a single CHM prescription for urticaria. Bai-Xian-Pi (Dictamnus dasycarpus Turcz) was the most commonly prescribed single Chinese herb while Xiao-Feng San was the most commonly prescribed Chinese herbal formula. The most commonly prescribed CHM drug combination was Xiao-Feng San plus Bai-Xian-Pi while the most commonly prescribed triple drug combination was Xiao-Feng San, Bai-Xian-Pi, and Di-Fu Zi (Kochia scoparia). In view of the popularity of CHM such as Xiao-Feng San prescribed for the wind-heat pattern of urticaria in this study, a large-scale, randomized clinical trial is warranted to investigate their efficacy and safety.
Tabor, R.W.; Booth, D.B.; Vance, J.A.; Ford, A.B.
2006-01-01
This digital map database has been prepared by R.W. Tabor from the published Geologic Map of the Sauk River 30- by 60-Minute Quadrangle, Washington. Together with the accompanying PDF text files, it provides information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The authors mapped most of the bedrock geology at 1:100,000 scale, but compiled most Quaternary units at 1:24,000 scale. The Quaternary contacts and structural data have been much simplified for the 1:100,000-scale map and database. The spatial resolution (scale) of the database is 1:100,000 or smaller. This database depicts the distribution of geologic materials and structures at a regional (1:100,000) scale. The report is intended to provide geologic information for the regional study of materials properties, earthquake shaking, landslide potential, mineral hazards, seismic velocity, and earthquake faults. In addition, the report contains information and interpretations about the regional geologic history and framework. However, the regional scale of this report does not provide sufficient detail for site development purposes.
GIS-project: geodynamic globe for global monitoring of geological processes
NASA Astrophysics Data System (ADS)
Ryakhovsky, V.; Rundquist, D.; Gatinsky, Yu.; Chesalova, E.
2003-04-01
A multilayer geodynamic globe at the scale 1:10,000,000 was created at the end of the nineties in the GIS Center of the Vernadsky Museum. A special software and hardware complex was developed for its visualization, with a set of multitarget, object-oriented databases. The globe includes separate thematic covers represented by digital sets of spatial geological, geochemical, and geophysical information (maps, schemes, profiles, stratigraphic columns, organized databases, etc.). At present, the largest databases included in the globe program concern petrochemical and isotopic data on magmatic rocks of the World Ocean and large and superlarge mineral deposits. Software from the Environmental Systems Research Institute (ESRI), USA, as well as the ArcScan vectorizer, was used for digitizing covers and adapting databases (ARC/INFO 7.0, 8.0). All layers of the geoinformational project were obtained by scanning separate objects and transferring them to real geographic coordinates in an equidistant conic projection. The covers were then projected onto plane degree-system geographic coordinates. Attributive databases were formed for each thematic layer, and in the last stage all covers were combined into a single information system. Separate digital covers represent mathematical descriptions of geological objects and the relations between them, such as the Earth's altimetry, active fault systems, seismicity, etc. Principles of cartographic generalization were taken into account when compiling the covers, with projection and coordinate systems chosen to match the given scale. The globe allows us to carry out, in an interactive regime, the formation of mutually coordinated object-oriented databases and the thematic covers directly connected with them. They can be extended to the whole Earth and near-Earth space, and to the best-known parts of the divergent and convergent boundaries of the lithosphere plates. Such covers and time series reflect in diagram form the total combination and dynamics of data on geological structure, geophysical fields, seismicity, geomagnetism, composition of rock complexes, and metallogeny of different areas of the Earth's surface. They make it possible to scale, detail, and develop 3D spatial visualization. The information filling the covers can be replenished with new data, both in existing and in newly formed databases. Integrated analysis of the data allows us to refine our ideas on regularities in the development of lithosphere and mantle inhomogeneities using original technologies. It also enables us to work out 3D digital models for the geodynamic development of tectonic zones at convergent and divergent plate boundaries, with the purpose of integrated monitoring of mineral resources and of establishing correlations between seismicity, magmatic activity, and metallogeny in time-spatial coordinates. The resulting multifold geoinformation system makes it possible to perform an integrated analysis of geoinformation flows in an interactive regime and, in particular, to establish regularities in the time-spatial distribution and dynamics of the main structural units of the lithosphere, as well as to illuminate the connection between stages of their development and epochs of large and superlarge mineral deposit formation. We are now trying to use the system to predict large oil and gas concentrations in the main sedimentary basins.
The work was supported by RFBR (grants 93-07-14680, 96-07-89499, 99-07-90030, 00-15-98535, 02-07-90140) and MTC.
Lottig, Noah R.; Wagner, Tyler; Henry, Emily N.; Cheruvelil, Kendra Spence; Webster, Katherine E.; Downing, John A.; Stow, Craig A.
2014-01-01
We compiled a lake-water clarity database using publicly available citizen volunteer observations made between 1938 and 2012 across eight states in the Upper Midwest, USA. Our objectives were to determine (1) whether temporal trends in lake-water clarity existed across this large geographic area and (2) whether trends were related to the lake-specific characteristics of latitude, lake size, or time period the lake was monitored. Our database consisted of >140,000 individual Secchi observations from 3,251 lakes that we summarized per lake-year, resulting in 21,020 summer averages. Using Bayesian hierarchical modeling, we found approximately a 1% per year increase in water clarity (quantified as Secchi depth) for the entire population of lakes. On an individual lake basis, 7% of lakes showed increased water clarity and 4% showed decreased clarity. Trend direction and strength were related to latitude and median sample date. Lakes in the southern part of our study region had lower average annual summer water clarity, more negative long-term trends, and greater inter-annual variability in water clarity compared to northern lakes. Increasing trends were strongest for lakes with median sample dates earlier in the period of record (1938–2012). Our ability to identify specific mechanisms for these trends is currently hampered by the lack of a large, multi-thematic database of variables that drive water clarity (e.g., climate, land use/cover). Our results demonstrate, however, that citizen science can provide the critical monitoring data needed to address environmental questions at large spatial and long temporal scales. Collaborations among citizens, research scientists, and government agencies may be important for developing the data sources and analytical tools necessary to move toward an understanding of the factors influencing macro-scale patterns such as those shown here for lake water clarity.
Wawrzyniak, Zbigniew M; Paczesny, Daniel; Mańczuk, Marta; Zatoński, Witold A
2011-01-01
Large-scale epidemiologic studies can assess health indicators that differentiate social groups, as well as important health outcomes such as the incidence of and mortality from cancer, cardiovascular disease, and other conditions, establishing a solid knowledge base for managing the prevention of premature morbidity and mortality. This study presents advanced methods of data collection and data management, with ongoing data quality control and security, to ensure high-quality assessment of health indicators in the large epidemiologic PONS study (The Polish-Norwegian Study). The material studied is the data management design of the large-scale population study in Poland (PONS), with the managed processes applied to establishing a high-quality, solid knowledge base. The functional requirements of PONS data collection, supported by advanced web-based IT methods, resulted in high-quality, secure medical data; data quality assessment, process control, and evolution monitoring are fulfilled and shared by the IT system. Data from disparate, distributed sources of information are integrated into databases via software interfaces and archived on a secure multitasking server. The implemented solution, built on modern database technologies and a remote software/hardware structure, successfully supports the research of the large PONS study. Follow-up control of the consistency and quality of data analysis and of the processes of the PONS sub-databases shows excellent measurement properties, with data consistency of more than 99%. Through its tailored hardware/software application, the project demonstrates the positive impact of quality assurance (QA) on the quality of outcome analyses and on effective data management within a shorter time. This efficiency safeguards the quality of the epidemiological data and health indicators by eliminating common errors in research questionnaires and medical measurements.
KA-SB: from data integration to large scale reasoning
Roldán-García, María del Mar; Navas-Delgado, Ismael; Kerzazi, Amine; Chniber, Othmane; Molina-Castro, Joaquín; Aldana-Montes, José F
2009-01-01
Background The analysis of information in the biological domain is usually focused on the analysis of data from single on-line data sources. Unfortunately, studying a biological process requires having access to disperse, heterogeneous, autonomous data sources. In this context, an analysis of the information is not possible without the integration of such data. Methods KA-SB is a querying and analysis system for final users based on combining a data integration solution with a reasoner. Thus, the tool has been created with a process divided into two steps: 1) KOMF, the Khaos Ontology-based Mediator Framework, is used to retrieve information from heterogeneous and distributed databases; 2) the integrated information is crystallized in a (persistent and high performance) reasoner (DBOWL). This information can then be further analyzed (by means of querying and reasoning). Results In this paper we present a novel system that combines the use of a mediation system with the reasoning capabilities of a large scale reasoner to provide a way of finding new knowledge and of analyzing the integrated information from different databases, which is retrieved as a set of ontology instances. The tool includes a graphical query interface that shows a representation of the ontology and allows users to build queries easily by clicking on ontology concepts. Conclusion These kinds of systems (based on KOMF) will provide users with very large amounts of information (interpreted as ontology instances once retrieved), which cannot be managed using traditional main memory-based reasoners. We propose a process for creating persistent and scalable knowledgebases from sets of OWL instances obtained by integrating heterogeneous data sources with KOMF. This process has been applied to develop a demo tool, which uses the BioPax Level 3 ontology as the integration schema, and integrates UNIPROT, KEGG, CHEBI, BRENDA and SABIORK databases. PMID:19796402
High dimensional biological data retrieval optimization with NoSQL technology.
Wang, Shicai; Pandis, Ioannis; Wu, Chao; He, Sijin; Johnson, David; Emam, Ibrahim; Guitton, Florian; Guo, Yike
2014-01-01
High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries over hundreds of different patients' gene expression records in relational databases perform poorly. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase compared to MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
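The gist of the key-value layout, a composite row key that turns one gene's values across all patients into a contiguous, prefix-scannable range, can be sketched in plain Python. The key format is an assumption for illustration, not the paper's actual HBase schema.

```python
import bisect

store, keys = {}, []            # value map plus a sorted key index

def put(gene, patient, value):
    key = f"{gene}|{patient}"   # composite row key: gene first, then patient
    if key not in store:
        bisect.insort(keys, key)
    store[key] = value

def scan_prefix(prefix):
    # Range scan over sorted keys, as a BigTable-style store would perform it.
    i = bisect.bisect_left(keys, prefix)
    while i < len(keys) and keys[i].startswith(prefix):
        yield keys[i], store[keys[i]]
        i += 1

put("TP53", "patient042", 8.1)
put("TP53", "patient117", 6.9)
put("BRCA1", "patient042", 5.4)
print(list(scan_prefix("TP53|")))   # all TP53 values in one contiguous scan
```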
The Porcelain Crab Transcriptome and PCAD, the Porcelain Crab Microarray and Sequence Database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tagmount, Abderrahmane; Wang, Mei; Lindquist, Erika
2010-01-27
Background: With the emergence of a completed genome sequence of the freshwater crustacean Daphnia pulex, construction of genomic-scale sequence databases for additional crustacean sequences is important for comparative genomics and annotation. Porcelain crabs, genus Petrolisthes, have been powerful crustacean models for environmental and evolutionary physiology with respect to thermal adaptation and understanding responses of marine organisms to climate change. Here, we present a large-scale EST sequencing and cDNA microarray database project for the porcelain crab Petrolisthes cinctipes. Methodology/Principal Findings: A set of ~30K unique sequences (UniSeqs) representing ~19K clusters was generated from ~98K high-quality ESTs from a set of tissue-specific non-normalized and mixed-tissue normalized cDNA libraries from the porcelain crab Petrolisthes cinctipes. Homology for each UniSeq was assessed using BLAST, InterProScan, GO and KEGG database searches. Approximately 66% of the UniSeqs had homology in at least one of the databases. All EST and UniSeq sequences along with annotation results and coordinated cDNA microarray datasets have been made publicly accessible at the Porcelain Crab Array Database (PCAD), a feature-enriched version of the Stanford and Longhorn Array Databases. Conclusions/Significance: The EST project presented here represents the third largest sequencing effort for any crustacean, and the largest effort for any crab species. Our assembly and clustering results suggest that our porcelain crab EST data set is as diverse as the much larger EST set generated in the Daphnia pulex genome sequencing project, and thus will be an important resource to the Daphnia research community. Our homology results support the Pancrustacea hypothesis and suggest that Malacostraca may be ancestral to Branchiopoda and Hexapoda. Our results also suggest that our cDNA microarrays cover as much of the transcriptome as can reasonably be captured in EST library sequencing approaches, and thus represent a rich resource for studies of environmental genomics.
A Web-based Distributed Voluntary Computing Platform for Large Scale Hydrological Computations
NASA Astrophysics Data System (ADS)
Demir, I.; Agliamzanov, R.
2014-12-01
Distributed volunteer computing can enable researchers and scientists to form large parallel computing environments that harness the computing power of millions of computers on the Internet, and to use that power to run large-scale environmental simulations and models that serve the common good of local communities and the world. Recent developments in web technologies and standards allow client-side scripting languages to run at speeds close to native applications and to utilize the power of Graphics Processing Units (GPUs). Using a client-side scripting language like JavaScript, we have developed an open distributed computing framework that makes it easy for researchers to write their own hydrologic models and run them on volunteer computers. Website owners can easily enable their sites so that visitors can volunteer their computing resources to help run advanced hydrological models and simulations. A web-based system lets users start volunteering their computational resources within seconds, without installing any software. The framework splits model simulations into small spatial and computational units and distributes them to thousands of nodes. A relational database system is utilized for managing data connections and queue management for the distributed computing nodes. In this paper, we present a web-based distributed volunteer computing platform that enables large-scale hydrological simulations and model runs in an open and integrated environment.
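The relational queue behind such a platform can be sketched with SQLite; the schema and names are hypothetical, and a production system would need transactions so that two nodes cannot claim the same task.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tasks (
    id INTEGER PRIMARY KEY, watershed_tile TEXT,
    status TEXT DEFAULT 'pending', node TEXT)""")
con.executemany("INSERT INTO tasks (watershed_tile) VALUES (?)",
                [(f"tile_{i}",) for i in range(4)])

def claim_task(node_id):
    # Hand the next pending tile to a volunteer browser node.
    row = con.execute("SELECT id, watershed_tile FROM tasks "
                      "WHERE status='pending' LIMIT 1").fetchone()
    if row:
        con.execute("UPDATE tasks SET status='running', node=? WHERE id=?",
                    (node_id, row[0]))
    return row

print(claim_task("browser-17"))   # (1, 'tile_0')
print(claim_task("browser-42"))   # (2, 'tile_1')
```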
Private and Efficient Query Processing on Outsourced Genomic Databases.
Ghasemi, Reza; Al Aziz, Md Momin; Mohammed, Noman; Dehkordi, Massoud Hadian; Jiang, Xiaoqian
2017-09-01
Applications of genomic studies are spreading rapidly in many domains of science and technology such as healthcare, biomedical research, direct-to-consumer services, and legal and forensic. However, there are a number of obstacles that make it hard to access and process a big genomic database for these applications. First, sequencing genomic data is a time-consuming and expensive process. Second, it requires large-scale computation and storage systems to process genomic sequences. Third, genomic databases are often owned by different organizations, and thus, not available for public usage. The cloud computing paradigm can be leveraged to facilitate the creation and sharing of big genomic databases for these applications. Genomic data owners can outsource their databases to a centralized cloud server to ease access to their databases. However, data owners are reluctant to adopt this model, as it requires outsourcing the data to an untrusted cloud service provider that may cause data breaches. In this paper, we propose a privacy-preserving model for outsourcing genomic data to a cloud. The proposed model enables query processing while providing privacy protection of genomic databases. Privacy of the individuals is guaranteed by permuting and adding fake genomic records in the database. These techniques allow the cloud to evaluate count and top-k queries securely and efficiently. Experimental results demonstrate that a count and a top-k query over 40 Single Nucleotide Polymorphisms (SNPs) in a database of 20,000 records takes around 100 and 150 s, respectively.
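A toy sketch of the two ideas, permutation and fake-record padding, for a count query. In the real protocol the identifiers and genotypes stored at the cloud would also be encrypted; everything here is in the clear purely for illustration.

```python
import random

random.seed(1)
real = [(f"r{i}", g) for i, g in enumerate("AA AG GG AG AA".split())]
fake_ids = {"f0", "f1", "f2"}
fakes = [(fid, random.choice(["AA", "AG", "GG"])) for fid in sorted(fake_ids)]

outsourced = real + fakes
random.shuffle(outsourced)        # permutation hides the original record order

# The cloud evaluates the count over everything it stores...
cloud_count = sum(1 for rid, g in outsourced if g == "AG")
# ...and the data owner, who alone knows which ids are decoys, corrects it.
correction = sum(1 for rid, g in outsourced if rid in fake_ids and g == "AG")
print(cloud_count - correction)   # true count of AG carriers: 2
```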
Preparing Laboratory and Real-World EEG Data for Large-Scale Analysis: A Containerized Approach
Bigdely-Shamlo, Nima; Makeig, Scott; Robbins, Kay A.
2016-01-01
Large-scale analysis of EEG and other physiological measures promises new insights into brain processes and more accurate and robust brain–computer interface models. However, the absence of standardized vocabularies for annotating events in a machine understandable manner, the welter of collection-specific data organizations, the difficulty in moving data across processing platforms, and the unavailability of agreed-upon standards for preprocessing have prevented large-scale analyses of EEG. Here we describe a “containerized” approach and freely available tools we have developed to facilitate the process of annotating, packaging, and preprocessing EEG data collections to enable data sharing, archiving, large-scale machine learning/data mining and (meta-)analysis. The EEG Study Schema (ESS) comprises three data “Levels,” each with its own XML-document schema and file/folder convention, plus a standardized (PREP) pipeline to move raw (Data Level 1) data to a basic preprocessed state (Data Level 2) suitable for application of a large class of EEG analysis methods. Researchers can ship a study as a single unit and operate on its data using a standardized interface. ESS does not require a central database and provides all the metadata necessary to execute a wide variety of EEG processing pipelines. The primary focus of ESS is automated in-depth analysis and meta-analysis of EEG studies. However, ESS can also encapsulate meta-information for other modalities, such as eye tracking, that are increasingly used in both laboratory and real-world neuroimaging. ESS schema and tools are freely available at www.eegstudy.org and a central catalog of over 850 GB of existing data in ESS format is available at studycatalog.org. These tools and resources are part of a larger effort to enable data sharing at sufficient scale for researchers to engage in truly large-scale EEG analysis and data mining (BigEEG.org). PMID:27014048
NASA Astrophysics Data System (ADS)
Huang, Liang; Ni, Xuan; Ditto, William L.; Spano, Mark; Carney, Paul R.; Lai, Ying-Cheng
2017-01-01
We develop a framework to uncover and analyse dynamical anomalies from massive, nonlinear and non-stationary time series data. The framework consists of three steps: preprocessing of massive datasets to eliminate erroneous data segments, application of the empirical mode decomposition and Hilbert transform paradigm to obtain the fundamental components embedded in the time series at distinct time scales, and statistical/scaling analysis of the components. As a case study, we apply our framework to detecting and characterizing high-frequency oscillations (HFOs) from a big database of rat electroencephalogram recordings. We find a striking phenomenon: HFOs exhibit on-off intermittency that can be quantified by algebraic scaling laws. Our framework can be generalized to big data-related problems in other fields such as large-scale sensor data and seismic data analysis.
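The Hilbert step can be sketched with SciPy; the EMD stage is omitted, and the band limits and threshold are illustrative rather than the paper's parameters. Band-pass the signal, take the analytic-signal envelope, and threshold it to obtain an on-off trace of HFO activity.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 2000.0
t = np.arange(0, 10, 1 / fs)
eeg = np.random.randn(t.size)                         # stand-in for a recording
burst = (t > 4.0) & (t < 4.05)
eeg[burst] += 5 * np.sin(2 * np.pi * 250 * t[burst])  # injected 250 Hz HFO

sos = butter(4, [80, 500], btype="bandpass", fs=fs, output="sos")
band = sosfiltfilt(sos, eeg)
envelope = np.abs(hilbert(band))                      # instantaneous amplitude

threshold = envelope.mean() + 3 * envelope.std()
on = envelope > threshold                             # on-off intermittency trace
print(f"fraction of time 'on': {on.mean():.4f}")
```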
Part 2 of a Computational Study of a Drop-Laden Mixing Layer
NASA Technical Reports Server (NTRS)
Okongo, Nora; Bellan, Josette
2004-01-01
This second of three reports on a computational study of a mixing layer laden with evaporating liquid drops presents the evaluation of Large Eddy Simulation (LES) models. The LES models were evaluated on an existing database that had been generated using Direct Numerical Simulation (DNS). The DNS method and the database are described in the first report of this series, Part 1 of a Computational Study of a Drop-Laden Mixing Layer (NPO-30719), NASA Tech Briefs, Vol. 28, No.7 (July 2004), page 59. The LES equations, which are derived by applying a spatial filter to the DNS set, govern the evolution of the larger scales of the flow and can therefore be solved on a coarser grid. Consistent with the reduction in grid points, the DNS drops would be represented by fewer drops, called computational drops in the LES context. The LES equations contain terms that cannot be directly computed on the coarser grid and that must instead be modeled. Two types of models are necessary: (1) those for the filtered source terms representing the effects of drops on the filtered flow field and (2) those for the sub-grid scale (SGS) fluxes arising from filtering the convective terms in the DNS equations. All of the filtered-source-term models that were developed were found to overestimate the filtered source terms. For modeling the SGS fluxes, constant-coefficient Smagorinsky, gradient, and scale-similarity models were assessed and calibrated on the DNS database. The Smagorinsky model correlated poorly with the SGS fluxes, whereas the gradient and scale-similarity models were well correlated with the SGS quantities that they represented.
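The flavor of such a priori testing can be shown in one dimension: box-filter a synthetic "DNS" field, compute the exact SGS flux, and correlate it against the scale-similarity estimate. This is illustrative only, not the study's code.

```python
import numpy as np

rng = np.random.default_rng(2)
n, width = 4096, 16
u = np.cumsum(rng.standard_normal(n))   # random-walk stand-in for a velocity field
u -= u.mean()

def box_filter(f, w=width):
    return np.convolve(f, np.ones(w) / w, mode="same")

tau_exact = box_filter(u * u) - box_filter(u) ** 2          # exact SGS flux
ubar = box_filter(u)                                        # resolved field
tau_ssim = box_filter(ubar * ubar) - box_filter(ubar) ** 2  # scale-similarity model

r = np.corrcoef(tau_exact, tau_ssim)[0, 1]
print(f"scale-similarity correlation: {r:.2f}")
```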
Including the Group Quarters Population in the US Synthesized Population Database
Chasteen, Bernadette M.; Wheaton, William D.; Cooley, Philip C.; Ganapathi, Laxminarayana; Wagener, Diane K.
2011-01-01
In 2005, RTI International researchers developed methods to generate synthesized population data on US households for the US Synthesized Population Database. These data are used in agent-based modeling, which simulates large-scale social networks to test how changes in the behaviors of individuals affect the overall network. Group quarters are residences where individuals live in close proximity and interact frequently. Although the Synthesized Population Database represents the population living in households, data for the nation’s group quarters residents are not easily quantified because of US Census Bureau reporting methods designed to protect individuals’ privacy. Including group quarters population data can be an important factor in agent-based modeling because the number of residents and the frequency of their interactions are variables that directly affect modeling results. Particularly with infectious disease modeling, the increased frequency of agent interaction may increase the probability of infectious disease transmission between individuals and the probability of disease outbreaks. This report reviews our methods to synthesize data on group quarters residents to match US Census Bureau data. Our goal in developing the Group Quarters Population Database was to enable its use with RTI’s US Synthesized Population Database in the Modeling of Infectious Diseases Agent Study. PMID:21841972
A DBMS architecture for global change research
NASA Astrophysics Data System (ADS)
Hachem, Nabil I.; Gennert, Michael A.; Ward, Matthew O.
1993-08-01
The goal of this research is the design and development of an integrated system for the management of very large scientific databases, cartographic/geographic information processing, and exploratory scientific data analysis for global change research. The system will represent both spatial and temporal knowledge about natural and man-made entities on the earth's surface, following an object-oriented paradigm. A user will be able to derive, modify, and apply procedures to perform operations on the data, including comparison, derivation, prediction, validation, and visualization. This work represents an effort to extend database technology with an intrinsic class of operators that is extensible and responds to the growing needs of scientific research. Of significance is the integration of many diverse forms of data into the database, including cartography, geography, hydrography, hypsography, images, and urban planning data. Equally important is the maintenance of metadata, that is, data about the data, such as coordinate transformation parameters, map scales, and audit trails of previous processing operations. This project will impact the fields of geographical information systems and global change research as well as the database community. It will provide an integrated database management testbed for scientific research, and a testbed for the development of analysis tools to understand and predict global change.
ChlamyCyc: an integrative systems biology database and web-portal for Chlamydomonas reinhardtii.
May, Patrick; Christian, Jan-Ole; Kempa, Stefan; Walther, Dirk
2009-05-04
The unicellular green alga Chlamydomonas reinhardtii is an important eukaryotic model organism for the study of photosynthesis and plant growth. In the era of modern high-throughput technologies there is an imperative need to integrate large-scale data sets from high-throughput experimental techniques using computational methods and database resources to provide comprehensive information about the molecular and cellular organization of a single organism. In the framework of the German Systems Biology initiative GoFORSYS, a pathway database and web-portal for Chlamydomonas (ChlamyCyc) was established, which currently features about 250 metabolic pathways with associated genes, enzymes, and compound information. ChlamyCyc was assembled using an integrative approach combining the recently published genome sequence, bioinformatics methods, and experimental data from metabolomics and proteomics experiments. We analyzed and integrated a combination of primary and secondary database resources, such as existing genome annotations from JGI, EST collections, orthology information, and MapMan classification. ChlamyCyc provides a curated and integrated systems biology repository that will enable and assist in systematic studies of fundamental cellular processes in Chlamydomonas. The ChlamyCyc database and web-portal is freely available under http://chlamycyc.mpimp-golm.mpg.de.
Bockholt, Henry J.; Scully, Mark; Courtney, William; Rachakonda, Srinivas; Scott, Adam; Caprihan, Arvind; Fries, Jill; Kalyanam, Ravi; Segall, Judith M.; de la Garza, Raul; Lane, Susan; Calhoun, Vince D.
2009-01-01
A neuroinformatics (NI) system is critical to brain imaging research in order to shorten the time between study conception and results. Such an NI system must scale well when large numbers of subjects are studied. Further, when multiple sites participate in research projects, organizational issues become increasingly difficult. Optimized NI applications mitigate these problems. Additionally, NI software enables coordination across multiple studies, leveraging advantages across projects that can potentially multiply research discoveries. The web-based Mind Research Network (MRN) database system has been designed and improved through our experience with 200 research studies and 250 researchers from seven different institutions. The MRN tools permit the collection, management, reporting and efficient use of large-scale, heterogeneous data sources, e.g., multiple institutions, multiple principal investigators, multiple research programs and studies, and multimodal acquisitions. We have collected and analyzed data sets on thousands of research participants and have set up a framework to automatically analyze the data, thereby making efficient, practical data mining of this vast resource possible. This paper presents a comprehensive framework for capturing and analyzing heterogeneous neuroscience research data sources that has been fully optimized for end-users to perform novel data mining. PMID:20461147
Gibon, Thomas; Wood, Richard; Arvesen, Anders; Bergesen, Joseph D; Suh, Sangwon; Hertwich, Edgar G
2015-09-15
Climate change mitigation demands large-scale technological change on a global level and, if successfully implemented, will significantly affect how products and services are produced and consumed. In order to anticipate the life cycle environmental impacts of products under climate mitigation scenarios, we present the modeling framework of an integrated hybrid life cycle assessment model covering nine world regions. Life cycle assessment databases and multiregional input-output tables are adapted using forecasted changes in technology and resources up to 2050 under a 2 °C scenario. We call the result of this modeling "technology hybridized environmental-economic model with integrated scenarios" (THEMIS). As a case study, we apply THEMIS in an integrated environmental assessment of concentrating solar power. Life-cycle greenhouse gas emissions for this plant range from 33 to 95 g CO2 eq./kWh across different world regions in 2010, falling to 30-87 g CO2 eq./kWh in 2050. Using region-specific life cycle data thus yields results that capture substantial regional differences. More generally, these results also highlight the need for systematic life cycle frameworks that capture the actual consequences and feedback effects of large-scale policies in the long term.
Large-scale Exploration of Neuronal Morphologies Using Deep Learning and Augmented Reality.
Li, Zhongyu; Butler, Erik; Li, Kang; Lu, Aidong; Ji, Shuiwang; Zhang, Shaoting
2018-02-12
Recently released large-scale neuron morphological data have greatly facilitated research in neuroinformatics. However, the sheer volume and complexity of these data pose significant challenges for efficient and accurate neuron exploration. In this paper, we propose an effective retrieval framework to address these problems, based on frontier techniques of deep learning and binary coding. For the first time, we develop a deep learning based feature representation method for neuron morphological data: the 3D neurons are first projected into binary images, and features are then learned using an unsupervised deep neural network, i.e., stacked convolutional autoencoders (SCAEs). The deep features are subsequently fused with hand-crafted features for a more accurate representation. Because exhaustive search is usually very time-consuming in large-scale databases, we employ a novel binary coding method to compress feature vectors into short binary codes. Our framework is validated on a public data set of 58,000 neurons, showing promising retrieval precision and efficiency compared with state-of-the-art methods. In addition, we develop a novel neuron visualization program based on augmented reality (AR) techniques, which helps users explore neuron morphologies in an interactive and immersive manner.
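As an illustration of the binary-coding step, the sketch below compresses feature vectors into short binary codes via random-hyperplane hashing and retrieves neighbors by Hamming distance. This is a generic LSH-style coder standing in for the paper's method; the features, dimensions, and bit count are all assumed:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fused (deep + hand-crafted) features for a neuron database;
# in the paper these come from the SCAEs, here they are random stand-ins.
n_neurons, feat_dim, n_bits = 10000, 256, 64
features = rng.standard_normal((n_neurons, feat_dim))

# Random-hyperplane binary coding: one sign bit per projection.
hyperplanes = rng.standard_normal((feat_dim, n_bits))
codes = (features @ hyperplanes > 0).astype(np.uint8)

def hamming_search(query_code, codes, k=5):
    """Return indices of the k database items closest in Hamming distance."""
    dists = np.count_nonzero(codes != query_code, axis=1)
    return np.argsort(dists)[:k]

query = (features[42] @ hyperplanes > 0).astype(np.uint8)
print(hamming_search(query, codes))  # item 42 should rank first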
Seeing is believing: on the use of image databases for visually exploring plant organelle dynamics.
Mano, Shoji; Miwa, Tomoki; Nishikawa, Shuh-ichi; Mimura, Tetsuro; Nishimura, Mikio
2009-12-01
Organelle dynamics vary dramatically depending on cell type, developmental stage and environmental stimuli, so that various parameters, such as size, number and behavior, are required for the description of the dynamics of each organelle. Imaging techniques are superior to other techniques for describing organelle dynamics because these parameters are visually exhibited. Therefore, as the results can be seen immediately, investigators can more easily grasp organelle dynamics. At present, imaging techniques are emerging as fundamental tools in plant organelle research, and the development of new methodologies to visualize organelles and the improvement of analytical tools and equipment have allowed the large-scale generation of image and movie data. Accordingly, image databases that accumulate information on organelle dynamics are an increasingly indispensable part of modern plant organelle research. In addition, image databases are potentially rich data sources for computational analyses, as image and movie data reposited in the databases contain valuable and significant information, such as size, number, length and velocity. Computational analytical tools support image-based data mining, such as segmentation, quantification and statistical analyses, to extract biologically meaningful information from each database and combine them to construct models. In this review, we outline the image databases that are dedicated to plant organelle research and present their potential as resources for image-based computational analyses.
NASA Astrophysics Data System (ADS)
Fontaine, Alain; Sauvage, Bastien; Pétetin, Hervé; Auby, Antoine; Boulanger, Damien; Thouret, Valerie
2016-04-01
Since 1994, the IAGOS program (In-Service Aircraft for a Global Observing System, http://www.iagos.org) and its predecessor MOZAIC have produced in-situ measurements of atmospheric composition during more than 46000 commercial aircraft flights. To help analyze these observations and better understand the processes driving their evolution, we developed SOFT-IO, a modelling tool that quantifies their source/receptor links. We improved the methodology of Stohl et al. (2003), based on the FLEXPART plume dispersion model, to simulate the contributions of anthropogenic and biomass burning emissions from the ECCAD database (http://eccad.aeris-data.fr) to the carbon monoxide mixing ratio measured along each IAGOS flight. Thanks to automated processing, contributions are simulated for the last 20 days before each observation, separating the individual contributions of the different source regions. The main goal is to supply added-value products to the IAGOS database showing the geographical origin and emission type of pollutants. Using this information, it may be possible to link trends in atmospheric composition to changes in transport pathways and to the evolution of emissions. This tool could be used for statistical validation as well as for inter-comparisons of emission inventories using large amounts of data, as Lagrangian models are able to bring global-scale emissions down to a smaller scale, where they can be directly compared to the in-situ observations from the IAGOS database.
Studying the Sky/Planets Can Drown You in Images: Machine Learning Solutions at JPL/Caltech
NASA Technical Reports Server (NTRS)
Fayyad, U. M.
1995-01-01
JPL is working to develop a domain-independent system capable of small-scale object recognition in large image databases for science analysis. Two applications discussed are the cataloging of three billion sky objects in the Sky Image Cataloging and Analysis Tool (SKICAT) and the detection of possibly one million small volcanoes visible in the Magellan synthetic aperture radar images of Venus (JPL Adaptive Recognition Tool, JARTool).
Modeling and Databases for Teaching Petrology
NASA Astrophysics Data System (ADS)
Asher, P.; Dutrow, B.
2003-12-01
With the widespread availability of high-speed computers with massive storage and the ready transport capability of large amounts of data, computational and petrologic modeling and the use of databases provide new tools with which to teach petrology. Modeling can be used to gain insight into a system, predict system behavior, describe a system's processes, compare with a natural system, or simply to illustrate. These aspects result from data-driven or empirical, analytical, or numerical models, or from the concurrent examination of multiple lines of evidence. At the same time, the use of models can enhance core foundations of the geosciences by improving critical thinking skills and by reinforcing prior knowledge. However, the use of modeling to teach petrology is dictated by the level of expectation we have for students and their facility with modeling approaches. For example, do we expect students to push buttons and navigate a program, to understand the conceptual model, and/or to evaluate the results of a model? Whatever the desired level of sophistication, specific elements of design should be incorporated into a modeling exercise for effective teaching. These include, but are not limited to: use of the scientific method, use of prior knowledge, a clear statement of purpose and goals, attainable goals, a connection to the natural/actual system, a demonstration that complex heterogeneous natural systems are amenable to analysis by these techniques, and, ideally, connections to other disciplines and the larger earth system. Databases offer another avenue with which to explore petrology. Large datasets are available that allow integration of multiple lines of evidence to attack a petrologic problem or understand a petrologic process. These are collected into a database that offers a tool for exploring, organizing, and analyzing the data. For example, datasets may be geochemical, mineralogic, experimental, and/or visual in nature, covering global, regional, and local scales. These datasets provide students with access to large amounts of related data through space and time. Goals of the database working group include educating earth scientists about information systems in general, about the importance of metadata, about ways of using databases and datasets as educational tools, and about the availability of existing datasets and databases. The modeling and databases groups hope to create additional petrologic teaching tools using these aspects and invite the community to contribute to the effort.
NASA Astrophysics Data System (ADS)
Vollant, A.; Balarac, G.; Corre, C.
2017-09-01
New procedures are explored for the development of models in the context of large eddy simulation (LES) of a passive scalar. They rely on the combination of optimal estimator theory with machine-learning algorithms. The concept of the optimal estimator makes it possible to identify the most accurate set of parameters to be used when deriving a model. The model itself can then be defined by training an artificial neural network (ANN) on a database derived from filtering direct numerical simulation (DNS) results. This procedure leads to a subgrid-scale model with good structural performance, which yields LES results very close to the filtered DNS. However, this first procedure does not control the functional performance, so the model can fail when the flow configuration differs from the training database. Another procedure is then proposed, in which the model's functional form is imposed and the ANN is used only to define the model coefficients. The training step is a bi-objective optimisation controlling both structural and functional performance. The model derived from this second procedure proves to be more robust. It also provides stable LES for a turbulent plane jet flow configuration very far from the training database, although it over-estimates the mixing process in that case.
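A minimal sketch of the first procedure, training an ANN on a filtered-DNS database, might look as follows. The inputs, target, and data here are synthetic stand-ins (in the real workflow the inputs would be the parameter set selected by the optimal-estimator analysis, and the target the exact subgrid term from filtered DNS):

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Stand-in filtered-DNS database: each row holds resolved-scale quantities,
# the target is the exact subgrid scalar flux. All values here are synthetic.
n_samples = 20000
inputs = rng.standard_normal((n_samples, 4))   # e.g. gradients, strain invariants
target = inputs[:, 0] * inputs[:, 1] + 0.1 * rng.standard_normal(n_samples)

X_train, X_test, y_train, y_test = train_test_split(inputs, target, random_state=0)

# The ANN plays the role of the subgrid-scale model.
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out filtered-DNS samples:", model.score(X_test, y_test))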
Composing Data Parallel Code for a SPARQL Graph Engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castellana, Vito G.; Tumeo, Antonino; Villa, Oreste
Big data analytics processes large amounts of data to extract knowledge from them. Semantic databases are big data applications that adopt the Resource Description Framework (RDF) to structure metadata through a graph-based representation. The graph-based representation provides several benefits, such as the possibility to perform in-memory processing with large amounts of parallelism. SPARQL is a language used to perform queries on RDF-structured data through graph matching. In this paper we present a tool that automatically translates SPARQL queries into parallel graph crawling and graph matching operations. The tool also supports complex SPARQL constructs, which require more than basic graph matching for their implementation. The tool generates parallel code annotated with OpenMP pragmas for x86 shared-memory multiprocessors (SMPs). With respect to commercial database systems such as Virtuoso, our approach reduces the memory occupation due to join operations and provides higher performance. We show the scaling of the automatically generated graph-matching code on a 48-core SMP.
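For readers unfamiliar with SPARQL, the sketch below shows the graph-matching character of a query, executed here with the rdflib Python library on a toy RDF graph. The namespace and triples are invented; the paper's tool compiles such queries to parallel OpenMP code rather than interpreting them like this:

from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# A tiny RDF graph: authors and the papers they wrote.
g.add((EX.alice, RDF.type, EX.Author))
g.add((EX.alice, EX.wrote, EX.paper1))
g.add((EX.paper1, EX.title, Literal("Graph engines")))

# SPARQL is graph matching: each triple pattern below is matched against
# the data graph, and shared variables join the partial matches.
query = """
PREFIX ex: <http://example.org/>
SELECT ?author ?title WHERE {
    ?author a ex:Author .
    ?author ex:wrote ?paper .
    ?paper ex:title ?title .
}
"""
for row in g.query(query):
    print(row.author, row.title)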
DOE Office of Scientific and Technical Information (OSTI.GOV)
Christopher Slominski
2009-10-01
Archiving a large fraction of the EPICS signals within the Jefferson Lab (JLAB) accelerator control system is vital for postmortem and real-time analysis of accelerator performance. This analysis is performed on a daily basis by scientists, operators, engineers, technicians, and software developers. Archiving poses unique challenges due to the magnitude of the control system. A MySQL archiving system (Mya) was developed to scale to the needs of the control system; it currently archives 58,000 EPICS variables, updating at a rate of 11,000 events per second. In addition to the large collection rate, retrieval of the archived data must also be fast and robust. Archived data retrieval clients obtain data at a rate over 100,000 data points per second. Managing the data in a relational database provides a number of benefits. This paper describes an archiving solution that uses an open source database and standard off-the-shelf hardware to meet high-performance archiving needs. Mya has been in production at Jefferson Lab since February of 2007.
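The core archiver pattern, a single indexed event table supporting high-rate inserts and time-range retrieval, can be sketched as below. SQLite stands in for Mya's MySQL backend, and the schema, channel name, and rates are hypothetical, not Mya's actual design:

import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE event (
                  channel TEXT NOT NULL,
                  t_sec   REAL NOT NULL,   -- event timestamp
                  value   REAL NOT NULL)""")
# The composite index is what keeps time-range retrieval fast
# even at high sustained event rates.
db.execute("CREATE INDEX idx_chan_time ON event (channel, t_sec)")

# Archive a burst of update events for one EPICS-style channel.
now = time.time()
rows = [("IPM1:current", now + i * 0.1, 42.0 + i) for i in range(1000)]
db.executemany("INSERT INTO event VALUES (?, ?, ?)", rows)

# Retrieval client: fetch all points for a channel within a time window.
cur = db.execute(
    "SELECT t_sec, value FROM event "
    "WHERE channel = ? AND t_sec BETWEEN ? AND ? ORDER BY t_sec",
    ("IPM1:current", now, now + 10))
print(len(cur.fetchall()), "points retrieved")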
Using the Saccharomyces Genome Database (SGD) for analysis of genomic information
Skrzypek, Marek S.; Hirschman, Jodi
2011-01-01
Analysis of genomic data requires access to software tools that place the sequence-derived information in the context of biology. The Saccharomyces Genome Database (SGD) integrates functional information about budding yeast genes and their products with a set of analysis tools that facilitate exploring their biological details. This unit describes how the various types of functional data available at SGD can be searched, retrieved, and analyzed. Starting with the guided tour of the SGD Home page and Locus Summary page, this unit highlights how to retrieve data using YeastMine, how to visualize genomic information with GBrowse, how to explore gene expression patterns with SPELL, and how to use Gene Ontology tools to characterize large-scale datasets. PMID:21901739
Computing Properties of Hadrons, Nuclei and Nuclear Matter from Quantum Chromodynamics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Savage, Martin J.
This project was part of a coordinated software development effort that the nuclear physics lattice QCD community pursues to ensure that lattice calculations can make optimal use of present and forthcoming leadership-class and dedicated hardware, including that of the national laboratories, and to prepare for the exploitation of future computational resources in the exascale era. The UW team improved and extended software libraries used in lattice QCD calculations related to multi-nucleon systems, enhanced production running codes related to load balancing multi-nucleon production on large-scale computing platforms, developed SQLite (addressable database) interfaces to efficiently archive and analyze multi-nucleon data, and developed a Mathematica interface for the SQLite databases.
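A plausible shape for such an addressable archive, assumed here for illustration rather than taken from the project's actual schema, is a table of correlator measurements keyed by operator, gauge configuration, and time slice, with ensemble averages computed directly in SQL:

import sqlite3, random

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE correlator (
                  operator TEXT,    -- e.g. a hypothetical 'deuteron' channel
                  config   INTEGER, -- gauge-configuration id
                  t        INTEGER, -- Euclidean time slice
                  value    REAL)""")

random.seed(0)
# Synthetic exponentially decaying correlator data with noise.
rows = [("deuteron", cfg, t, 2.0 ** (-t) * (1 + 0.05 * random.gauss(0, 1)))
        for cfg in range(100) for t in range(16)]
db.executemany("INSERT INTO correlator VALUES (?, ?, ?, ?)", rows)

# Analysis directly in SQL: ensemble average of the correlator per time slice.
for t, avg in db.execute(
        "SELECT t, AVG(value) FROM correlator "
        "WHERE operator = 'deuteron' GROUP BY t ORDER BY t"):
    print(t, avg)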
Performance evaluation of redundant disk array support for transaction recovery
NASA Technical Reports Server (NTRS)
Mourad, Antoine N.; Fuchs, W. Kent; Saab, Daniel G.
1991-01-01
Redundant disk arrays provide a way of achieving rapid recovery from media failures with a relatively low storage cost for large scale data systems requiring high availability. Here, we propose a method for using redundant disk arrays to support rapid recovery from system crashes and transaction aborts in addition to their role in providing media failure recovery. A twin page scheme is used to store the parity information in the array so that the time for transaction commit processing is not degraded. Using an analytical model, we show that the proposed method achieves a significant increase in the throughput of database systems using redundant disk arrays by reducing the number of recovery operations needed to maintain the consistency of the database.
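The twin-page idea can be sketched in a few lines: each stripe keeps two parity pages and flips a pointer between them on commit, so a crash mid-update always leaves one consistent parity copy. The Python sketch below, with byte-string blocks, is a simplification that omits the paper's durability and commit-record details:

from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class Stripe:
    def __init__(self, data_blocks):
        self.data = list(data_blocks)
        parity = reduce(xor, self.data)
        self.twin = [parity, parity]   # two parity pages per stripe
        self.current = 0               # which twin holds committed parity

    def write_block(self, i, new_block):
        # New parity = old parity XOR old data XOR new data.
        new_parity = xor(xor(self.twin[self.current], self.data[i]), new_block)
        spare = 1 - self.current
        self.twin[spare] = new_parity  # write to the spare twin first
        self.data[i] = new_block
        self.current = spare           # "commit" by flipping the pointer

    def recover_block(self, i):
        # Rebuild block i from committed parity and the surviving blocks.
        others = [d for j, d in enumerate(self.data) if j != i]
        return reduce(xor, others + [self.twin[self.current]])

s = Stripe([b"\x01\x02", b"\x0f\x00", b"\x10\x10"])
s.write_block(0, b"\xff\xff")
print("block 1 recovered:", s.recover_block(1))  # b'\x0f\x00'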
[Effects of soil data and map scale on assessment of total phosphorus storage in upland soils].
Li, Heng Rong; Zhang, Li Ming; Li, Xiao di; Yu, Dong Sheng; Shi, Xue Zheng; Xing, Shi He; Chen, Han Yue
2016-06-01
Accurate assessment of total phosphorus storage in farmland soils is of great significance to sustainable agriculture and non-point source pollution control. However, previous studies have not considered the estimation errors arising from mapping scales and from databases built on different sources of soil profile data. In this study, a total of 393×10⁴ hm² of upland in the 29 counties (or cities) of North Jiangsu was taken as a case study. We analyzed how the four sources of soil profile data, namely "Soils of County", "Soils of Prefecture", "Soils of Province" and "Soils of China", and the six map scales, i.e. 1:50000, 1:250000, 1:500000, 1:1000000, 1:4000000 and 1:10000000, used in the 24 resulting soil databases affected the assessment of soil total phosphorus. Compared with the most detailed 1:50000 soil database, established with 983 upland soil profiles, the relative deviation of the estimates of soil total phosphorus density (STPD) and soil total phosphorus storage (STPS) from the other soil databases varied from 4.8% to 48.9% and from 1.6% to 48.4%, respectively. The estimates of STPD and STPS based on the 1:50000 database of "Soils of County" differed from most of the estimates based on the databases at each scale in "Soils of County" and "Soils of Prefecture", at significance levels of P<0.001 or P<0.05. Extremely significant differences (P<0.001) existed between the estimates based on the 1:50000 database of "Soils of County" and those based on the databases at each scale in "Soils of Province" and "Soils of China". This study demonstrates the importance of appropriate soil data sources and appropriate mapping scales in estimating STPS.
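Storage estimates of this kind generally combine a per-unit density with mapped areas; a plausible form, written in our own notation as an assumption rather than as the authors' stated equations, is

\[
  \mathrm{STPD}_i = \rho_i \, d_i \, c_i , \qquad
  \mathrm{STPS} = \sum_i A_i \, \mathrm{STPD}_i ,
\]

where \(\rho_i\) is the bulk density, \(d_i\) the soil depth considered, \(c_i\) the total phosphorus concentration, and \(A_i\) the mapped area of soil unit \(i\). Coarser map scales merge units with different \(c_i\) and \(\rho_i\), which is the source of the deviations quantified above.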
MEGALEX: A megastudy of visual and auditory word recognition.
Ferrand, Ludovic; Méot, Alain; Spinelli, Elsa; New, Boris; Pallier, Christophe; Bonin, Patrick; Dufau, Stéphane; Mathôt, Sebastiaan; Grainger, Jonathan
2018-06-01
Using the megastudy approach, we report a new database (MEGALEX) of visual and auditory lexical decision times and accuracy rates for tens of thousands of words. We collected visual lexical decision data for 28,466 French words and the same number of pseudowords, and auditory lexical decision data for 17,876 French words and the same number of pseudowords (synthesized tokens were used for the auditory modality). This constitutes the first large-scale database for auditory lexical decision, and the first database to enable a direct comparison of word recognition in different modalities. Different regression analyses were conducted to illustrate potential ways to exploit this megastudy database. First, we compared the proportions of variance accounted for by five word frequency measures. Second, we conducted item-level regression analyses to examine the relative importance of the lexical variables influencing performance in the different modalities (visual and auditory). Finally, we compared the similarities and differences between the two modalities. All data are freely available on our website (https://sedufau.shinyapps.io/megalex/) and are searchable at www.lexique.org, inside the Open Lexique search engine.
Wheeler, David
2007-01-01
GenBank® is a comprehensive database of publicly available DNA sequences for more than 205,000 named organisms and for more than 60,000 within the embryophyta, obtained through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Daily data exchange with the European Molecular Biology Laboratory (EMBL) in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases with taxonomy, genome, mapping, protein structure, and domain information and the biomedical journal literature through PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available through FTP. GenBank usage scenarios ranging from local analyses of the data available through FTP to online analyses supported by the NCBI Web-based tools are discussed. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov.
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W
2011-01-01
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 380,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system that integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
Extraction of land cover change information from ENVISAT-ASAR data in Chengdu Plain
NASA Astrophysics Data System (ADS)
Xu, Wenbo; Fan, Jinlong; Huang, Jianxi; Tian, Yichen; Zhang, Yong
2006-10-01
Land cover data are essential to most global change research objectives, including the assessment of current environmental conditions and the simulation of future environmental scenarios that ultimately lead to public policy development. The Chinese Academy of Sciences generated a nationwide land cover database in order to quantify and spatially characterize land use/cover changes (LUCC) in the 1990s. To maintain the reliability of the database, it must be updated regularly. However, it is difficult to obtain remote sensing data for extracting land cover change information at large scale, and optical remote sensing data are hard to acquire over the Chengdu Plain, so the objective of this research was to evaluate multitemporal ENVISAT advanced synthetic aperture radar (ASAR) data for extracting land cover change information. Based on fieldwork and the nationwide 1:100000 land cover database, the paper assesses several land cover changes in the Chengdu Plain, for example: crop to buildings, forest to buildings, and forest to bare land. The results show that ENVISAT ASAR data have great potential for extracting land cover change information.
Extracting Databases from Dark Data with DeepDive.
Zhang, Ce; Shin, Jaeho; Ré, Christopher; Cafarella, Michael; Niu, Feng
2016-01-01
DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data - scientific papers, Web classified ads, customer service notes, and so on - were instead in a relational database, it would give analysts a massive and valuable new set of "big data." DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that matches that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represent a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.
Switching Phenomena in a System with No Switches
NASA Astrophysics Data System (ADS)
Preis, Tobias; Stanley, H. Eugene
2010-02-01
It is widely believed that switching phenomena require switches, but this is actually not true. For an intriguing variety of switching phenomena in nature, the underlying complex system abruptly changes from one state to another in a highly discontinuous fashion. For example, financial market fluctuations are characterized by many abrupt switchings creating increasing trends ("bubble formation") and decreasing trends ("financial collapse"). Such switching occurs on time scales ranging from macroscopic bubbles persisting for hundreds of days to microscopic bubbles persisting only for a few seconds. We analyze a database containing 13,991,275 German DAX Future transactions recorded with a time resolution of 10 ms. For comparison, a database of 2,592,531 daily closing prices of all S&P500 stocks is used. We ask whether these ubiquitous switching phenomena have quantifiable features independent of the time horizon studied. We find striking scale-free behavior of the volatility after each switching occurs. We interpret our findings as being consistent with time-dependent collective behavior of financial market participants. We test the possible universality of our result by performing a parallel analysis of fluctuations in transaction volume and time intervals between trades. We show that these financial market switching processes have properties similar to those of phase transitions. We suggest that the well-known catastrophic bubbles that occur on large time scales, such as the most recent financial crisis, are not outliers but single dramatic representatives caused by the switching between upward and downward trends on time scales varying over nine orders of magnitude, from very large (≈10² days) down to very small (≈10 ms).
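The alignment idea, averaging volatility as a function of time elapsed after each switching point, can be sketched as follows on a synthetic random walk. The extremum detector, window length, and series are stand-ins for the DAX transaction data, chosen for brevity rather than fidelity:

import numpy as np

rng = np.random.default_rng(2)
price = np.cumsum(rng.standard_normal(100000))  # synthetic series, not DAX data

# Crude switching points: local extrema over a fixed window.
w = 50
switches = [i for i in range(w, len(price) - w)
            if price[i] == price[i - w:i + w + 1].max()
            or price[i] == price[i - w:i + w + 1].min()]

# Average volatility (absolute return) as a function of lag after a switch.
returns = np.abs(np.diff(price))
lags = np.arange(1, w)
profile = [np.mean([returns[s + lag] for s in switches if s + lag < len(returns)])
           for lag in lags]
# Scale-free behavior would appear as a straight line in log-log coordinates.
print(np.round(profile[:10], 3))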
Kennedy, Amy E; Khoury, Muin J; Ioannidis, John P A; Brotzman, Michelle; Miller, Amy; Lane, Crystal; Lai, Gabriel Y; Rogers, Scott D; Harvey, Chinonye; Elena, Joanne W; Seminara, Daniela
2016-10-01
We report on the establishment of a web-based Cancer Epidemiology Descriptive Cohort Database (CEDCD). The CEDCD's goals are to enhance awareness of resources, facilitate interdisciplinary research collaborations, and support existing cohorts for the study of cancer-related outcomes. Comprehensive descriptive data were collected from large cohorts established to study cancer as primary outcome using a newly developed questionnaire. These included an inventory of baseline and follow-up data, biospecimens, genomics, policies, and protocols. Additional descriptive data extracted from publicly available sources were also collected. This information was entered in a searchable and publicly accessible database. We summarized the descriptive data across cohorts and reported the characteristics of this resource. As of December 2015, the CEDCD includes data from 46 cohorts representing more than 6.5 million individuals (29% ethnic/racial minorities). Overall, 78% of the cohorts have collected blood at least once, 57% at multiple time points, and 46% collected tissue samples. Genotyping has been performed by 67% of the cohorts, while 46% have performed whole-genome or exome sequencing in subsets of enrolled individuals. Information on medical conditions other than cancer has been collected in more than 50% of the cohorts. More than 600,000 incident cancer cases and more than 40,000 prevalent cases are reported, with 24 cancer sites represented. The CEDCD assembles detailed descriptive information on a large number of cancer cohorts in a searchable database. Information from the CEDCD may assist the interdisciplinary research community by facilitating identification of well-established population resources and large-scale collaborative and integrative research. Cancer Epidemiol Biomarkers Prev; 25(10); 1392-401. ©2016 American Association for Cancer Research.
Fujino, Yuri; Asaoka, Ryo; Murata, Hiroshi; Miki, Atsuya; Tanito, Masaki; Mizoue, Shiro; Mori, Kazuhiko; Suzuki, Katsuyoshi; Yamashita, Takehiro; Kashiwagi, Kenji; Shoji, Nobuyuki
2016-04-01
To develop a large-scale real clinical database of glaucoma (Japanese Archive of Multicentral Databases in Glaucoma: JAMDIG) and to investigate the effect of treatment. The study included a total of 1348 eyes of 805 primary open-angle glaucoma patients with 10 visual fields (VFs) measured with 24-2 or 30-2 Humphrey Field Analyzer (HFA) and intraocular pressure (IOP) records in 10 institutes in Japan. Those with 10 reliable VFs were further identified (638 eyes of 417 patients). Mean total deviation (mTD) of the 52 test points in the 24-2 HFA VF was calculated, and the relationship between mTD progression rate and seven variables (age, mTD of baseline VF, average IOP, standard deviation (SD) of IOP, previous argon/selective laser trabeculoplasties (ALT/SLT), previous trabeculectomy, and previous trabeculotomy) was analyzed. The mTD in the initial VF was -6.9 ± 6.2 dB and the mTD progression rate was -0.26 ± 0.46 dB/year. Mean IOP during the follow-up period was 13.5 ± 2.2 mm Hg. Age and SD of IOP were related to mTD progression rate. However, in eyes with average IOP below 15 and also 13 mm Hg, only age and baseline VF mTD were related to mTD progression rate. Age and the degree of VF damage were related to future progression. Average IOP was not related to the progression rate; however, fluctuation of IOP was associated with faster progression, although this was not the case when average IOP was below 15 mm Hg.
Broberg, Craig S; Mitchell, Julie; Rehel, Silven; Grant, Andrew; Gianola, Ann; Beninato, Peter; Winter, Christiane; Verstappen, Amy; Valente, Anne Marie; Weiss, Joseph; Zaidi, Ali; Earing, Michael G; Cook, Stephen; Daniels, Curt; Webb, Gary; Khairy, Paul; Marelli, Ariane; Gurvitz, Michelle Z; Sahn, David J
2015-10-01
The adoption of electronic health records (EHR) has created an opportunity for multicenter data collection, yet the feasibility and reliability of this methodology is unknown. The aim of this study was to integrate EHR data into a homogeneous central repository specifically addressing the field of adult congenital heart disease (ACHD). Target data variables were proposed and prioritized by consensus of investigators at five target ACHD programs. Database analysts determined which variables were available within their institutions' EHR and stratified their accessibility, and results were compared between centers. Data for patients seen in a single calendar year were extracted to a uniform database and subsequently consolidated. From 415 proposed target variables, only 28 were available in discrete formats at all centers. For variables of highest priority, 16/28 (57%) were available at all four sites, but only 11% for those of high priority. Integration was neither simple nor straightforward. Coding schemes in use for congenital heart diagnoses varied and would require additional user input for accurate mapping. There was considerable variability in procedure reporting formats and medication schemes, often with center-specific modifications. Despite the challenges, the final acquisition included limited data on 2161 patients, and allowed for population analysis of race/ethnicity, defect complexity, and body morphometrics. Large-scale multicenter automated data acquisition from EHRs is feasible yet challenging. Obstacles stem from variability in data formats, coding schemes, and adoption of non-standard lists within each EHR. The success of large-scale multicenter ACHD research will require institution-specific data integration efforts. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Guillen, Reynal; Gu, D.; Holbrook, J.; Murillo, L. F.; Traweek, S.
2011-01-01
Our current research focuses on the trajectories of scientists working with large-scale databases in astronomy, following them as they strategically build their careers and digital infrastructures and make their epistemological commitments. We look specifically at how gender, ethnicity, and nationality intersect in the process of subject formation in astronomy, as well as in the process of enrolling partners for the construction of instruments and the design and implementation of large-scale databases. Work once figured as merely technical support, such as assembling data catalogs, or as graphic design, generating pleasing images for public support, has been repositioned at the core of the field. Some have argued that such databases enable a new kind of scientific inquiry based on data exploration, such as the "fourth paradigm" or "data-driven" science. Our preliminary findings, based on oral history interviews and ethnography, provide insights into meshworks of women, African-American, "Hispanic," Asian-American, and foreign-born astronomers. Our preliminary data suggest African-American men are more successful in sustaining astronomy careers than Chicano and Asian-American men. A distinctive theme in our data is the glocal character of meshworks available to and created by foreign-born women astronomers working at US facilities. Other data show that the proportion of Asian to Asian-American and foreign-born Latina/o to Chicana/o astronomers is approximately equal. Furthermore, Asians and Latinas/os are represented in significantly greater numbers than Asian Americans and Chicanas/os. Among professional astronomers in the US, each ethnic minority group numbers on the order of tens, not hundreds. Project support is provided by the NSF EAGER program to the University of California, Los Angeles under award 0956589.
Tomio, Jun; Yamana, Hayato; Matsui, Hiroki; Yamashita, Hiroyuki; Yoshiyama, Takashi; Yasunaga, Hideo
2017-11-01
Tuberculosis screening is recommended for patients with immune-mediated inflammatory diseases (IMIDs) prior to anti-tumor necrosis factor (TNF) therapy. However, adherence to the recommended practice is unknown in the current clinical setting in Japan. We used a large-scale health insurance claims database in Japan to conduct a longitudinal observational study. Of more than two million beneficiaries in the database between 2013 and 2014, we enrolled those with IMIDs aged 15-69 years who had initiated anti-TNF therapy. We defined tuberculosis screening primarily as tuberculin skin test and/or interferon-gamma release assay (TST/IGRA) within 2 months before commencing anti-TNF therapy. We analyzed the proportions of the patients who had undergone tuberculosis screening and the associations with primary disease, type of anti-TNF agent, methotrexate prescription prior to anti-TNF therapy, and treatment for latent tuberculosis infection (LTBI). Of 385 patients presumed to have initiated anti-TNF therapy, 252 (66%) had undergone tuberculosis screening by TST/IGRA (22% TST, 56% IGRA, and 12% both TST and IGRA), and 231 (60%) had undergone TST/IGRA and radiography. Patients with psoriasis tended to be more likely to undergo tuberculosis screening than those with other diseases; however, this association was not statistically significant. Treatment for LTBI was provided to 43 (11%) patients; 123 (32%) received neither TST/IGRA nor LTBI treatment. Tuberculosis screening was often not performed prior to anti-TNF therapy despite the guidelines' recommendations; thus, patients could be put at unnecessary risk of reactivation of tuberculosis. © 2017 Asia Pacific League of Associations for Rheumatology and John Wiley & Sons Australia, Ltd.
The USA-NPN Information Management System: A tool in support of phenological assessments
NASA Astrophysics Data System (ADS)
Rosemartin, A.; Vazquez, R.; Wilson, B. E.; Denny, E. G.
2009-12-01
The USA National Phenology Network (USA-NPN) serves science and society by promoting a broad understanding of plant and animal phenology and the relationships among phenological patterns and all aspects of environmental change. Data management and information sharing are central to the USA-NPN mission. The USA-NPN develops, implements, and maintains a comprehensive Information Management System (IMS) to serve the needs of the network, including the collection, storage and dissemination of phenology data, access to phenology-related information, tools for data interpretation, and communication among partners of the USA-NPN. The IMS includes components for data storage, such as the National Phenology Database (NPD), and several online user interfaces to accommodate data entry, data download, data visualization and catalog searches for phenology-related information. The IMS is governed by a set of standards to ensure security, privacy, data access, and data quality. The National Phenology Database is designed to efficiently accommodate large quantities of phenology data, to be flexible to the changing needs of the network, and to provide for quality control. The database stores phenology data from multiple sources (e.g., partner organizations, researchers and citizen observers), and provides for integration with legacy datasets. Several services will be created to provide access to the data, including reports, visualization interfaces, and web services. These services will provide integrated access to phenology and related information for scientists, decision-makers and general audiences. Phenological assessments at any scale will rely on secure and flexible information management systems for the organization and analysis of phenology data. The USA-NPN’s IMS can serve phenology assessments directly, through data management and indirectly as a model for large-scale integrated data management.
Spiegel, Paul B; Le, Phuoc; Ververs, Mija-Tesse; Salama, Peter
2007-01-01
Background The fields of expertise of natural disasters and complex emergencies (CEs) are quite distinct, with different tools for mitigation and response as well as different types of competent organizations and qualified professionals who respond. However, natural disasters and CEs can occur concurrently in the same geographic location, and epidemics can occur during or following either event. The occurrence and overlap of these three types of events have not been well studied. Methods All natural disasters, CEs and epidemics occurring within the past decade (1995–2004) that met the inclusion criteria were included. The largest 30 events in each category were based on the total number of deaths recorded. The main databases used were the Emergency Events Database for natural disasters, the Uppsala Conflict Database Program for CEs and the World Health Organization outbreaks archive for epidemics. Analysis During the past decade, 63% of the largest CEs had ≥1 epidemic compared with 23% of the largest natural disasters. Twenty-seven percent of the largest natural disasters occurred in areas with ≥1 ongoing CE while 87% of the largest CEs had ≥1 natural disaster. Conclusion Epidemics commonly occur during CEs. The data presented in this article do not support the often-repeated assertion that epidemics, especially large-scale epidemics, commonly occur following large-scale natural disasters. This observation has important policy and programmatic implications when preparing and responding to epidemics. There is an important and previously unrecognized overlap between natural disasters and CEs. Training and tools are needed to help bridge the gap between the different type of organizations and professionals who respond to natural disasters and CEs to ensure an integrated and coordinated response. PMID:17411460
Waveform Fingerprinting for Efficient Seismic Signal Detection
NASA Astrophysics Data System (ADS)
Yoon, C. E.; OReilly, O. J.; Beroza, G. C.
2013-12-01
Cross-correlating an earthquake waveform template with continuous waveform data has proven a powerful approach for detecting events missing from earthquake catalogs. If templates do not exist, it is possible to divide the waveform data into short overlapping time windows, then identify window pairs with similar waveforms. Applying these approaches to earthquake monitoring in seismic networks has tremendous potential to improve the completeness of earthquake catalogs, but because effort scales quadratically with time, it rapidly becomes computationally infeasible. We develop a fingerprinting technique to identify similar waveforms, using only a few compact features of the original data. The concept is similar to human fingerprints, which utilize key diagnostic features to identify people uniquely. Analogous audio-fingerprinting approaches have accurately and efficiently found similar audio clips within large databases; example applications include identifying songs and finding copyrighted content within YouTube videos. In order to fingerprint waveforms, we compute a spectrogram of the time series, and segment it into multiple overlapping windows (spectral images). For each spectral image, we apply a wavelet transform, and retain only the sign of the maximum magnitude wavelet coefficients. This procedure retains just the large-scale structure of the data, providing both robustness to noise and significant dimensionality reduction. Each fingerprint is a high-dimensional, sparse, binary data object that can be stored in a database without significant storage costs. Similar fingerprints within the database are efficiently searched using locality-sensitive hashing. We test this technique on waveform data from the Northern California Seismic Network that contains events not detected in the catalog. We show that this algorithm successfully identifies similar waveforms and detects uncataloged low magnitude events in addition to cataloged events, while running to completion faster than a comparison waveform autocorrelation code.
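Following the recipe described above, a compact sketch of the fingerprint extraction (spectrogram, overlapping spectral images, 2D Haar wavelet transform, signs of the top-magnitude coefficients) is given below. Window sizes, overlap, and the top-k value are illustrative guesses rather than the authors' settings, and the locality-sensitive hashing stage is omitted:

import numpy as np
from scipy.signal import spectrogram
import pywt

def fingerprints(waveform, fs=100.0, image_len=64, topk=200):
    """Binary fingerprints from overlapping spectral images."""
    f, t, spec = spectrogram(waveform, fs=fs, nperseg=64, noverlap=32)
    prints = []
    step = image_len // 2                       # 50% overlapping images
    for start in range(0, spec.shape[1] - image_len + 1, step):
        image = spec[:, start:start + image_len]
        # 2D Haar wavelet transform of the spectral image.
        cA, (cH, cV, cD) = pywt.dwt2(image, "haar")
        coeffs = np.concatenate([c.ravel() for c in (cA, cH, cV, cD)])
        # Keep only the signs of the top-k magnitude coefficients,
        # encoded as bit pairs (+1 -> 10, -1 -> 01, dropped -> 00),
        # giving a sparse binary vector robust to noise.
        fp = np.zeros(2 * coeffs.size, dtype=np.uint8)
        top = np.argsort(np.abs(coeffs))[-topk:]
        fp[2 * top] = coeffs[top] > 0
        fp[2 * top + 1] = coeffs[top] < 0
        prints.append(fp)
    return prints

rng = np.random.default_rng(3)
fps = fingerprints(rng.standard_normal(20000))  # synthetic continuous data
print(len(fps), "fingerprints,", fps[0].size, "bits each")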
NASA Astrophysics Data System (ADS)
Kucharenko, Evgeniy; Asavin, Alex
2015-04-01
Resource depletion has forced the search for new ore deposits and the reassessment of old ones; this is the main aim of metallogenic studies. Synthesizing information on the features of worked-out deposits and of emerging fields will play a key role in the future. Development of metallogenic databases is one of the most difficult tasks in the Earth sciences: a database must capture a large number of parameters describing the object of study, a mine or ore occurrence, and the majority of these parameters belong to different areas of geological knowledge. They include ore mineralogy, geochemistry, lithology of host rocks, tectonic characteristics of ore-controlling structures, geochemical parameters of ore processes, geochronological data on the age of geological formations and ore-forming processes, and others. Moreover, cartographic materials of various scales are as important as the diverse documentation and numerical information. The adopted framework for the analysis of large-scale metallogeny has several levels: 1) the ore body (usually 1:50000 or 1:100000); 2) the ore field or deposit (1:200000); 3) the ore cluster (1:500000). Researchers can vary the scheme and scale values, but at least three levels of scale describing the location and the geological structures controlling ore placement are included. Attention should be paid to the system used to describe ore deposits. A universal scheme for metallogenic information systems, and a universal algorithm for describing ore deposits, are needed: each deposit type, ore-genetic group, and ore element has its own order of importance among the features used and its own form of description, so uncertainty in the classification of a particular metallogenic object leaves the choice of description algorithm only weakly justified. Notably, the available descriptive features for different deposits, even within the same genetic group, are often neither uniform nor sufficiently detailed: a worked-out deposit is usually taken as the reference object with the most complete description, whereas a recently discovered deposit is poorly studied and has a rather limited list of informative indicators. The most pressing tasks for a metallogenic information system are: 1) searching for and summarizing the characteristics of different objects; 2) selecting the most informative groups of features; 3) revealing the links between groups of features and analyzing them with respect to the genesis of deposits. The list of tasks could be continued, but these are enough to start. Essentially, the problems mentioned leave us without a usable metallogenic deposit database: only a limited number of typical databases exist (for certain types of minerals), characterizing little more than the names of the deposits and basic indicators of their economic importance (reserves, component contents, ore types). Additional information, such as the age of host rocks or ores, or geochemical features of some geological objects, is used quite rarely, and no data are provided systematically for all objects in a database. The database of carbonatite deposits is the best developed; some relevant works should also be mentioned [Woolley & Kjarsgaard 2009; Bagdasarov et al., 2001; Burmistrov et al., 2008]. Unfortunately, such important characteristics as geological maps are not included there.
A biologically inspired neural network model to transformation invariant object recognition
NASA Astrophysics Data System (ADS)
Iftekharuddin, Khan M.; Li, Yaqin; Siddiqui, Faraz
2007-09-01
Transformation invariant image recognition has been an active research area due to its widespread applications in a variety of fields such as military operations, robotics, medical practice, geographic scene analysis, and many others. The primary goal of this research is the detection of objects in the presence of image transformations such as changes in resolution, rotation, translation, scale, and occlusion. We investigate a biologically inspired neural network (NN) model for such transformation-invariant object recognition. In a classical training-testing setup for an NN, performance depends largely on the range of transformations or orientations covered in training. An even more serious dilemma is that there may not be enough training data for successful learning, or even no training data at all. To alleviate this problem, a biologically inspired reinforcement learning (RL) approach is proposed. In this paper, the RL approach is explored for object recognition under different types of transformations such as changes in scale, size, resolution, and rotation. The RL is implemented in an adaptive critic design (ACD) framework, which approximates neuro-dynamic programming with an action network and a critic network. Two ACD algorithms, Heuristic Dynamic Programming (HDP) and Dual Heuristic dynamic Programming (DHP), are investigated to obtain transformation-invariant object recognition. The two learning algorithms are evaluated statistically using simulated transformations of images as well as a large-scale UMIST face database with pose variations. In the face database authentication case, 90° out-of-plane rotations of faces from 20 different subjects in the UMIST database are used. Our simulations show promising results for both designs for transformation-invariant object recognition and face authentication. Comparing the two algorithms, DHP outperforms HDP in learning capability, as DHP generally takes fewer steps to perform a successful recognition task. Further, the residual critic error in DHP is generally smaller than that of HDP, and DHP achieves a 100% success rate more frequently than HDP for individual objects/subjects. On the other hand, HDP is more robust than DHP in terms of success rate across the database when applied in a stochastic and uncertain environment, and the computational time involved in DHP is greater.
Using LUCAS topsoil database to estimate soil organic carbon content in local spectral libraries
NASA Astrophysics Data System (ADS)
Castaldi, Fabio; van Wesemael, Bas; Chabrillat, Sabine; Chartin, Caroline
2017-04-01
The quantification of soil organic carbon (SOC) content over large areas is mandatory for accurate soil characterization and classification, which can improve site-specific management at local or regional scale by exploiting the strong relationship between SOC and crop growth. The estimation of SOC is not only important for agricultural purposes: in recent years, increasing attention to global warming has highlighted the crucial role of the soil in the global carbon cycle. In this context, soil spectroscopy is a well consolidated and widespread method to estimate soil variables, exploiting the interaction between chromophores and electromagnetic radiation. The importance of spectroscopy in soil science is reflected by the increasing number of large soil spectral libraries collected around the world. These large libraries contain soil samples from a considerable number of pedological regions and thus from different parent materials and soil types; this heterogeneity entails, in turn, large variability in mineralogical and organic composition. In the light of the huge variability of spectral responses to SOC content and composition, a rigorous classification process is necessary to subset large spectral libraries and to avoid calibrating global models that fail to predict local variation in SOC content. In this regard, this study proposes a method to subset the European LUCAS topsoil database into soil classes using a clustering analysis based on a large number of soil properties. The LUCAS database was chosen in order to apply a standardized multivariate calibration approach valid for large areas without the need for extensive field and laboratory work to calibrate local models. Seven soil classes were detected by the clustering analysis, and the samples belonging to each class were used to calibrate class-specific partial least squares regression (PLSR) models to estimate the SOC content of three local libraries collected in Belgium (Loam Belt and Wallonia) and Luxembourg. The three local libraries consist only of spectral data (199 samples) acquired using the same protocol as the LUCAS database. SOC was estimated with good accuracy both within each local library (RMSE: 1.2-5.4 g kg⁻¹; RPD: 1.41-2.06) and for the samples of the three libraries together (RMSE: 3.9 g kg⁻¹; RPD: 2.47). The proposed approach could allow SOC to be estimated anywhere in Europe from spectra alone, without chemical laboratory analyses, exploiting the potential of the LUCAS database and class-specific PLSR models.
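The two-step calibration can be sketched with scikit-learn: cluster the large library on soil properties, then fit one PLSR of SOC on spectra per class. The data below are synthetic stand-ins for LUCAS, and class assignment of new samples is simplified (in practice, local samples lacking laboratory properties would need a spectral assignment rule):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)

# Synthetic stand-ins for the LUCAS library: soil properties used for
# clustering, reflectance spectra, and measured SOC (g kg^-1).
n, n_bands = 5000, 200
properties = rng.standard_normal((n, 6))     # e.g. texture, pH, carbonates
spectra = rng.standard_normal((n, n_bands))
soc = 20 + spectra[:, :10].sum(axis=1) + rng.standard_normal(n)

# Step 1: partition the large library into soil classes (seven, as above).
km = KMeans(n_clusters=7, n_init=10, random_state=0).fit(properties)

# Step 2: calibrate one PLSR model per class on that class's samples only.
models = {c: PLSRegression(n_components=10).fit(spectra[km.labels_ == c],
                                                soc[km.labels_ == c])
          for c in range(7)}

# Prediction: assign a sample to a class, then apply the class model.
c = int(km.predict(properties[:1])[0])
print("predicted SOC:", float(models[c].predict(spectra[:1])[0, 0]))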
Li, Caijuan; Ling, Qufei; Ge, Chen; Ye, Zhuqing; Han, Xiaofei
2015-02-25
The large-scale loach (Paramisgurnus dabryanus, Cypriniformes) is a bottom-dwelling freshwater fish species found mainly in eastern Asia. The natural germplasm resources of this important aquaculture species have recently been threatened by overfishing and artificial propagation. The objective of this study was to obtain the first functional genomic resource and candidate molecular markers for future conservation and breeding research. Illumina paired-end sequencing generated over one hundred million reads, which resulted in 71,887 assembled transcripts with an average length of 1465 bp. 42,093 (58.56%) protein-coding sequences were predicted, and 43,837 transcripts had significant matches in the NCBI non-redundant protein (Nr) database. 29,389 and 14,419 transcripts were assigned to gene ontology (GO) categories and Eukaryotic Orthologous Groups (KOG), respectively. 22,102 (31.14%) transcripts were mapped to 302 KEGG pathways. In addition, 15,106 candidate SSR markers were identified, and 11,037 pairs of PCR primers were designed. 400 randomly selected SSR primer pairs were validated, of which 364 (91%) produced PCR products. A further test with 41 loci and 20 large-scale loach specimens collected from the four largest lakes in China showed that 36 (87.8%) loci were polymorphic. The transcriptomic profile and SSR repertoire obtained in this study will facilitate population genetic studies and selective breeding of the large-scale loach in the future. Copyright © 2015. Published by Elsevier B.V.
NASA Technical Reports Server (NTRS)
1993-01-01
The purpose of the STME Main Injector Program was to enhance the technology base for the large-scale main injector-combustor system of oxygen-hydrogen booster engines in the areas of combustion efficiency, chamber heating rates, and combustion stability. The initial task of the Main Injector Program, focused on analysis and theoretical predictions using existing models, was complemented by the design, fabrication, and test at MSFC of a subscale calorimetric, 40,000-pound thrust class, axisymmetric thrust chamber operating at approximately 2,250 psi and a 7:1 expansion ratio. Test results were used to further define combustion stability bounds, combustion efficiency, and heating rates using a large injector scale similar to the Pratt & Whitney (P&W) STME main injector design configuration including the tangential entry swirl coaxial injection elements. The subscale combustion data was used to verify and refine analytical modeling simulation and extend the database range to guide the design of the large-scale system main injector. The subscale injector design incorporated fuel and oxidizer flow area control features which could be varied; this allowed testing of several design points so that the STME conditions could be bracketed. The subscale injector design also incorporated high-reliability and low-cost fabrication techniques such as a one-piece electrical discharged machined (EDMed) interpropellant plate. Both subscale and large-scale injectors incorporated outer row injector elements with scarfed tip features to allow evaluation of reduced heating rates to the combustion chamber.
Ke, Tao; Yu, Jingyin; Dong, Caihua; Mao, Han; Hua, Wei; Liu, Shengyi
2015-01-21
Oil crop seeds are important sources of fatty acids (FAs) for human and animal nutrition. Despite their importance, there is a lack of an essential bioinformatics resource on gene transcription of oil crops from a comparative perspective. In this study, we developed ocsESTdb, the first database of expressed sequence tag (EST) information on seeds of four large-scale oil crops, with an emphasis on global metabolic networks and oil accumulation metabolism and on the unigenes involved. A total of 248,522 ESTs and 106,835 unigenes were collected from the cDNA libraries of rapeseed (Brassica napus), soybean (Glycine max), sesame (Sesamum indicum) and peanut (Arachis hypogaea). These unigenes were annotated by a sequence similarity search against databases including TAIR, the NR protein database, Gene Ontology, COG, Swiss-Prot, TrEMBL and the Kyoto Encyclopedia of Genes and Genomes (KEGG). Five genome-scale metabolic networks that contain different numbers of metabolites and gene-enzyme reaction-association entries were analysed and constructed using the Cytoscape and yEd programs. Details of unigene entries, deduced amino acid sequences and putative annotations are available from our database to browse, search and download. Intuitive and graphical representations of EST/unigene sequences, functional annotations, metabolic pathways and metabolic networks are also available. ocsESTdb will be updated regularly and can be freely accessed at http://ocri-genomics.org/ocsESTdb/. It may serve as a valuable and unique resource for comparative analysis of acyl lipid synthesis and metabolism in oilseed plants. It may also provide vital insights into improving oil content in seeds of oil crop species by transcriptional reconstruction of the metabolic network.
LDSplitDB: a database for studies of meiotic recombination hotspots in MHC using human genomic data.
Guo, Jing; Chen, Hao; Yang, Peng; Lee, Yew Ti; Wu, Min; Przytycka, Teresa M; Kwoh, Chee Keong; Zheng, Jie
2018-04-20
Meiotic recombination happens during the process of meiosis, when chromosomes inherited from two parents exchange genetic materials to generate chromosomes in the gamete cells. Recombination events tend to occur in narrow genomic regions called recombination hotspots, and their dysregulation could lead to serious human diseases such as birth defects. Although the regulatory mechanism of recombination events is still unclear, DNA sequence polymorphisms have been found to play crucial roles in the regulation of recombination hotspots. To facilitate studies of the underlying mechanism, we developed a database named LDSplitDB, which provides an integrative and interactive data mining and visualization platform for genome-wide association studies of recombination hotspots. It contains the pre-computed association maps of the major histocompatibility complex (MHC) region in the 1000 Genomes Project and HapMap Phase III datasets, and a genome-scale study of the European population from the HapMap Phase II dataset. Besides the recombination profiles, related data on genes, SNPs and different types of epigenetic modifications, which could be associated with meiotic recombination, are provided for comprehensive analysis. To meet the computational requirements of the rapidly increasing population genomics data, we prepared a lookup table for 400 haplotypes, covering all possible two-locus haplotype configurations, for recombination rate estimation with the well-known LDhat algorithm. To the best of our knowledge, LDSplitDB is the first large-scale database for the association analysis of human recombination hotspots with DNA sequence polymorphisms. It provides valuable resources for discovering the mechanism of meiotic recombination hotspots. The information about the MHC in this database could help in understanding the roles of recombination in the human immune system. DATABASE URL: http://histone.scse.ntu.edu.sg/LDSplitDB.
Blake, M.C.; Jones, D.L.; Graymer, R.W.; digital database by Soule, Adam
2000-01-01
This digital map database, compiled from previously published and unpublished data, and new mapping by the authors, represents the general distribution of bedrock and surficial deposits in the mapped area. Together with the accompanying text file (mageo.txt, mageo.pdf, or mageo.ps), it provides current information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The scale of the source maps limits the spatial resolution (scale) of the database to 1:62,500 or smaller.
NASA Astrophysics Data System (ADS)
Choung, S.; Francis, A. J.; Um, W.; Choi, S.; Kim, S.; Park, J.; Kim, S.
2013-12-01
Countries that generate nuclear power face problems with the disposal of accumulated radioactive wastes. Geological disposal has been chosen in many countries, including Korea. A safety issue arises after the closure of a geological repository because microbial activities can lead to overpressure in the underground facilities through gas production. In particular, biodegradable organic materials derived from low- and intermediate-level radioactive wastes play an important role in microbial activities in the geological repository. This study performed large-scale in-situ experiments using organic wastes and groundwater, and investigated geochemical alteration and microbial activities at an early stage (~63 days) representative of the period immediately after closure of a geological repository. The geochemical alteration significantly controlled the types and populations of microorganisms. A database of this biogeochemical alteration facilitates prediction of radionuclide mobility and the establishment of remedial strategies against unpredictable accidents and hazards in the early period after repository closure.
Transcriptome sequencing and annotation of the halophytic microalga Dunaliella salina
Hong, Ling; Liu, Jun-li; Midoun, Samira Z.; Miller, Philip C.
2017-01-01
The unicellular green alga Dunaliella salina is well adapted to salt stress and contains compounds (including β-carotene and vitamins) with potential commercial value. A large transcriptome database of D. salina during the adjustment, exponential and stationary growth phases was generated using a high throughput sequencing platform. We characterized the metabolic processes in D. salina with a focus on valuable metabolites, with the aim of manipulating D. salina to achieve greater economic value in large-scale production through a bioengineering strategy. Gene expression profiles under salt stress verified using quantitative polymerase chain reaction (qPCR) implied that salt can regulate the expression of key genes. This study generated a substantial fraction of D. salina transcriptional sequences for the entire growth cycle, providing a basis for the discovery of novel genes. This first full-scale transcriptome study of D. salina establishes a foundation for further comparative genomic studies. PMID:28990374
Experience in running relational databases on clustered storage
NASA Astrophysics Data System (ADS)
Gaspar Aparicio, Ruben; Potocky, Miroslav
2015-12-01
For the past eight years, the CERN IT Database group has based its backend storage on NAS (Network-Attached Storage) architecture, providing database access via the NFS (Network File System) protocol. In the last two and a half years, our storage has evolved from a scale-up architecture to a scale-out one. This paper describes our setup and a set of functionalities providing key features to other services, such as Database on Demand [1] or the CERN Oracle backup and recovery service. It also outlines a possible path of evolution that storage for databases could follow.
Precision measurements from very-large scale aerial digital imagery.
Booth, D Terrance; Cox, Samuel E; Berryman, Robert D
2006-01-01
Resource managers need measurements of the length/width of a variety of items, including animals, logs, streams, plant canopies, man-made objects, riparian habitat, vegetation patches and other things important in resource monitoring and land inspection. These types of measurements can now be obtained easily and accurately from very large scale aerial (VLSA) imagery, having spatial resolutions as fine as 1 millimeter per pixel, by using the three new software programs described here. VLSA images have small fields of view and are used for intermittent sampling across extensive landscapes. Pixel coverage among images is influenced by small changes in airplane altitude above ground level (AGL) and orientation relative to the ground, as well as by changes in topography. These factors affect the object-to-camera distance used for image-resolution calculations. 'ImageMeasurement' offers a user-friendly interface that accounts for pixel-coverage variation among images by utilizing a database. 'LaserLOG' records and displays airplane altitude AGL, measured with a high-frequency laser rangefinder, and displays the vertical velocity. 'Merge' sorts through the large amounts of data generated by LaserLOG and matches precise airplane altitudes with camera trigger times for input to the ImageMeasurement database. We discuss applications of these tools, including error estimates. We found that measurements from aerial images (collection resolution: 5-26 mm/pixel as projected on the ground) using ImageMeasurement, LaserLOG, and Merge were accurate to centimeters, with an error of less than 10%. We recommend these software packages as a means of expanding the utility of aerial image data.
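The object-to-camera distance correction that ImageMeasurement applies reduces to the standard ground-sample-distance relation. A minimal sketch of that calculation, using hypothetical sensor parameters rather than the instrument actually flown:

    def ground_resolution_mm(agl_m, focal_length_mm, pixel_pitch_um):
        """Ground coverage of one pixel (mm/pixel) for a nadir-pointing camera.

        agl_m: airplane altitude above ground level, metres (e.g. from LaserLOG)
        focal_length_mm: lens focal length
        pixel_pitch_um: physical size of one sensor pixel, micrometres
        """
        # Similar triangles: GSD = altitude * pixel_pitch / focal_length.
        return (agl_m * 1000.0) * (pixel_pitch_um / 1000.0) / focal_length_mm

    def object_length_mm(n_pixels, agl_m, focal_length_mm, pixel_pitch_um):
        """Length of an object spanning n_pixels in the image, in mm."""
        return n_pixels * ground_resolution_mm(agl_m, focal_length_mm, pixel_pitch_um)

    # A 100 m AGL pass with a 300 mm lens and 9 um pixels gives 3 mm/pixel,
    # within the 1-26 mm/pixel VLSA range discussed above.
    print(round(ground_resolution_mm(100, 300, 9), 2))  # -> 3.0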
NASA Astrophysics Data System (ADS)
Guenther, A. B.; Duhl, T.
2011-12-01
Increasing computational resources have enabled a steady improvement in the spatial resolution used for earth system models. Land surface models and landcover distributions have kept ahead by providing higher spatial resolution than is typically used in these models. Satellite observations have played a major role in providing high-resolution landcover distributions over large regions or the entire earth surface, but ground observations are needed to calibrate these data and provide accurate inputs for models. As our ability to resolve individual landscape components improves, it is important to consider what scale is sufficient for providing inputs to earth system models. The required spatial scale depends on the processes being represented and the scientific questions being addressed. This presentation will describe the development of a contiguous U.S. landcover database using high-resolution imagery (1 to 1000 meters) and surface observations of species composition and other landcover characteristics. The database includes plant functional types and species composition and is suitable for driving land surface models (CLM and MEGAN) that predict land surface exchange of carbon, water, energy and biogenic reactive gases (e.g., isoprene, sesquiterpenes, and NO). We investigate the sensitivity of model results to landcover distributions with spatial scales ranging over six orders of magnitude (1 meter to 1,000,000 meters). The implications for predictions of regional climate and air quality will be discussed, along with recommendations for regional and global earth system modeling.
Futamura, Masaki; Leshem, Yael A; Thomas, Kim S; Nankervis, Helen; Williams, Hywel C; Simpson, Eric L
2016-02-01
Investigators often use global assessments to provide a snapshot of overall disease severity in dermatologic clinical trials. Although easy to perform, the frequency of use and the standardization of global assessments in studies of atopic dermatitis (AD) are unclear. We sought to assess the frequency, definitions, and methods of analysis of the Investigator Global Assessment in randomized controlled trials of AD. We conducted a systematic review using all published randomized controlled trials of AD treatments in the Global Resource of Eczema Trials database (2000-2014). We determined the frequency of use of global scales and their defining features. Among the 317 trials identified, 101 trials (32%) used an investigator-performed global assessment as an outcome measure. There was large variability between studies in the nomenclature, scale size, definitions, outcome description, and analysis of global assessments. Both static and dynamic scales were identified, ranging from 4- to 7-point scales. North American studies used global assessments more commonly than studies from other countries. The search was restricted to the Global Resource of Eczema Trials database. Global assessments are used frequently in studies of AD, but their complete lack of standardized definitions and implementation precludes any meaningful comparisons between studies, which in turn impedes data synthesis to inform clinical decision-making. Standardization is urgently required.
Christodoulidis, Argyrios; Hurtut, Thomas; Tahar, Houssem Ben; Cheriet, Farida
2016-09-01
Segmenting the retinal vessels from fundus images is a prerequisite for many CAD systems for the automatic detection of diabetic retinopathy lesions. So far, research efforts have concentrated mainly on the accurate localization of the large- to medium-diameter vessels. However, failure to detect the smallest vessels at the segmentation step can lead to false positive lesion detection counts in a subsequent lesion analysis stage. In this study, a new hybrid method for the segmentation of the smallest vessels is proposed. Line detection and perceptual organization techniques are combined in a multi-scale scheme. Small vessels are reconstructed from the perceptual-based approach via tracking and pixel painting. The segmentation was validated on a high-resolution fundus image database including healthy and diabetic subjects, using pixel-based as well as perceptual-based measures. The proposed method achieves an 85.06% sensitivity rate, while the original multi-scale line detection method achieves an 81.06% sensitivity rate on the corresponding images (p<0.05). The improvement in the sensitivity rate for the database is 6.47% when only the smallest vessels are considered (p<0.05). For the perceptual-based measure, the proposed method improves the detection of the vasculature by 7.8% against the original multi-scale line detection method (p<0.05).
Large-scale exploration and analysis of drug combinations.
Li, Peng; Huang, Chao; Fu, Yingxue; Wang, Jinan; Wu, Ziyin; Ru, Jinlong; Zheng, Chunli; Guo, Zihu; Chen, Xuetong; Zhou, Wei; Zhang, Wenjuan; Li, Yan; Chen, Jianxin; Lu, Aiping; Wang, Yonghua
2015-06-15
Drug combinations are a promising strategy for combating complex diseases, improving efficacy and reducing side effects. Currently, a widely studied problem in pharmacology is predicting effective drug combinations, either through empirical screening in the clinic or purely experimental trials. However, large-scale prediction of drug combinations by a systems method has rarely been considered. We report a systems pharmacology framework to predict drug combinations (PreDC) built on a computational model, termed the probability ensemble approach (PEA), for analysis of both the efficacy and the adverse effects of drug combinations. First, a Bayesian network integrated with a similarity algorithm is developed to model combinations from drug molecular and pharmacological phenotypes, and the predictions are then assessed with both clinical efficacy and adverse effects. We show that PEA can predict the combination efficacy of drugs spanning different therapeutic classes with high specificity and sensitivity (AUC = 0.90), which was further validated by independent data and new experimental assays. PEA also evaluates adverse effects (AUC = 0.95) quantitatively and detects therapeutic indications for drug combinations. Finally, the PreDC database includes 1571 known and 3269 predicted optimal combinations as well as their potential side effects and therapeutic indications. The PreDC database is available at http://sm.nwsuaf.edu.cn/lsp/predc.php.
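The abstract describes PEA only at a high level; one common way to realize such a probability ensemble is naive-Bayes fusion of per-feature likelihood ratios. The sketch below is our own illustration under that assumption, with hypothetical feature names and toy likelihoods, not the authors' published model:

    import math

    def pea_like_probability(feature_scores, likelihoods, prior=0.05):
        """Fuse per-feature similarity scores into P(combination is effective).

        feature_scores: dict feature -> observed similarity in [0, 1]; the
        feature names used below (target, atc) are hypothetical placeholders.
        likelihoods: dict feature -> function returning
        (P(score | effective), P(score | ineffective)).
        """
        log_odds = math.log(prior / (1.0 - prior))
        for feat, score in feature_scores.items():
            p_eff, p_ineff = likelihoods[feat](score)
            log_odds += math.log(p_eff / p_ineff)  # naive independence assumption
        return 1.0 / (1.0 + math.exp(-log_odds))

    def toy_likelihood(score):
        # Toy banded likelihoods: high similarity favours "effective".
        return (0.6, 0.2) if score > 0.5 else (0.2, 0.4)

    scores = {"target": 0.8, "atc": 0.7}
    print(pea_like_probability(scores, {"target": toy_likelihood, "atc": toy_likelihood}))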
Top-k similar graph matching using TraM in biological networks.
Amin, Mohammad Shafkat; Finley, Russell L; Jamil, Hasan M
2012-01-01
Many emerging database applications entail sophisticated graph-based query manipulation, predominantly evident in large-scale scientific applications. To access the information embedded in graphs, efficient graph matching tools and algorithms have become of prime importance. Although the prohibitively expensive time complexity associated with exact subgraph isomorphism techniques has limited their efficacy in the application domain, approximate yet efficient graph matching techniques have received much attention due to their pragmatic applicability. Since public domain databases are noisy and incomplete in nature, inexact graph matching techniques have proven to be more promising in terms of inferring knowledge from numerous structural data repositories. In this paper, we propose a novel technique called TraM for approximate graph matching that off-loads a significant amount of its processing onto the database, making the approach viable for large graphs. Moreover, the vector space embedding of the graphs and efficient filtration of the search space enable computation of approximate graph similarity at a throw-away cost. We annotate nodes of the query graphs by means of their global topological properties and compare them with neighborhood-biased segments of the data graph for proper matches. We have conducted experiments on several real data sets and have demonstrated the effectiveness and efficiency of the proposed method.
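The idea of annotating nodes with global topological properties and comparing them in a vector space can be illustrated compactly. The following sketch uses simple degree-based signatures and cosine similarity; it is a simplification for illustration, not TraM's actual signature scheme or database off-loading:

    import math

    def node_signature(adj, v):
        """Degree, mean neighbour degree, and neighbour-link count for node v."""
        deg = len(adj[v])
        mean_nbr = sum(len(adj[u]) for u in adj[v]) / deg if deg else 0.0
        links = sum(1 for u in adj[v] for w in adj[v] if u < w and w in adj[u])
        return (float(deg), mean_nbr, float(links))

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    # Rank data-graph nodes by signature similarity to a query node.
    query = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
    data = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    sig_q = node_signature(query, "a")
    ranked = sorted(data, key=lambda v: -cosine(sig_q, node_signature(data, v)))
    print(ranked)  # triangle nodes 1 and 2 rank ahead of the pendant node 4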
Tran, Le-Thuy T.; Brewster, Philip J.; Chidambaram, Valliammai; Hurdle, John F.
2017-01-01
This study presents a method laying the groundwork for systematically monitoring food quality and the healthfulness of consumers’ point-of-sale grocery purchases. The method automates the process of identifying United States Department of Agriculture (USDA) Food Patterns Equivalent Database (FPED) components of grocery food items. The input to the process is the compact abbreviated descriptions of food items that are similar to those appearing on the point-of-sale sales receipts of most food retailers. The FPED components of grocery food items are identified using Natural Language Processing techniques combined with a collection of food concept maps and relationships that are manually built using the USDA Food and Nutrient Database for Dietary Studies, the USDA National Nutrient Database for Standard Reference, the What We Eat In America food categories, and the hierarchical organization of food items used by many grocery stores. We have established the construct validity of the method using data from the National Health and Nutrition Examination Survey, but further evaluation of validity and reliability will require a large-scale reference standard with known grocery food quality measures. Here we evaluate the method’s utility in identifying the FPED components of grocery food items available in a large sample of retail grocery sales data (~190 million transaction records). PMID:28475153
Local structure of scalar flux in turbulent passive scalar mixing
NASA Astrophysics Data System (ADS)
Konduri, Aditya; Donzis, Diego
2012-11-01
Understanding the properties of the scalar flux is important in the study of turbulent mixing. Classical theories suggest that it depends mainly on the large-scale structures in the flow. Recent studies suggest that the mean scalar flux reaches an asymptotic value at high Peclet numbers, independent of the molecular transport properties of the fluid. A large DNS database of isotropic turbulence with passive scalars forced with a mean scalar gradient, with resolutions up to 4096^3, is used to explore the structure of the scalar flux based on the local topology of the flow. It is found that regions of small velocity gradients, where dissipation and enstrophy are small, constitute the main contribution to the scalar flux. On the other hand, regions of very small scalar gradient (and scalar dissipation) become less important to the scalar flux at high Reynolds numbers. The scaling of the scalar flux spectra is also investigated. The k^(-7/3) scaling proposed by Lumley (1964) is observed at high Reynolds numbers, but collapse is not complete. A spectral bump similar to that in the velocity spectrum is observed close to the dissipative scales. A number of features, including the height of the bump, appear to reach an asymptotic value at high Schmidt number.
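Checking Lumley's prediction amounts to compensating the measured cospectrum by k^(7/3) and looking for a plateau. A minimal sketch with synthetic data (our illustration, not the study's DNS post-processing):

    import numpy as np

    def compensated_flux_spectrum(k, E_uf):
        """Compensate a scalar-flux cospectrum by Lumley's k^(-7/3) prediction.

        A plateau in k**(7/3) * E_uf over an intermediate band of wavenumbers
        is the signature of the inertial-convective scaling discussed above.
        """
        return k ** (7.0 / 3.0) * E_uf

    # Synthetic cospectrum with an exact -7/3 range and a viscous-like cutoff.
    k = np.logspace(0, 3, 200)
    E_uf = 2.0 * k ** (-7.0 / 3.0) * np.exp(-k / 400.0)
    comp = compensated_flux_spectrum(k, E_uf)
    print(np.allclose(comp, 2.0 * np.exp(-k / 400.0)))  # True: flat up to the cutoff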
Large-scale collision cross-section profiling on a travelling wave ion mobility mass spectrometer
Lietz, Christopher B.; Yu, Qing; Li, Lingjun
2014-01-01
Ion mobility (IM) is a gas-phase electrophoretic method that separates ions according to charge and ion-neutral collision cross-section (CCS). Herein, we attempt to apply a travelling wave (TW) IM polyalanine calibration method to shotgun proteomics and create a large peptide CCS database. Mass spectrometry methods that utilize IM, such as HDMSE, often use high transmission voltages for sensitive analysis. However, polyalanine calibration has only been demonstrated with low voltage transmission used to prevent gas-phase activation. If polyalanine ions change conformation under higher transmission voltages used for HDMSE, the calibration may no longer be valid. Thus, we aimed to characterize the accuracy of calibration and CCS measurement under high transmission voltages on a TW IM instrument using the polyalanine calibration method and found that the additional error was not significant. We also evaluated the potential error introduced by liquid chromatography (LC)-HDMSE analysis, and found it to be insignificant as well, validating the calibration method. Finally, we demonstrated the utility of building a large-population peptide CCS database by investigating the effects of terminal lysine position, via LysC or LysN digestion, on the formation of two structural sub-families formed by triply charged ions. PMID:24845359
Evaluation of Smartphone Inertial Sensor Performance for Cross-Platform Mobile Applications
Kos, Anton; Tomažič, Sašo; Umek, Anton
2016-01-01
Smartphone sensors are being increasingly used in mobile applications. The performance of sensors varies considerably among different smartphone models, and the development of a cross-platform mobile application might be a very complex and demanding task. A publicly accessible resource containing real-life-situation smartphone sensor parameters could be of great help for cross-platform developers. To address this issue we have designed and implemented a pilot participatory sensing application for measuring, gathering, and analyzing smartphone sensor parameters. We start with smartphone accelerometer and gyroscope bias and noise parameters. The application database presently includes sensor parameters of more than 60 different smartphone models across different platforms. It is a modest but important start, offering information on several statistical parameters of the measured smartphone sensors and insights into their performance. The next step, a large-scale cloud-based version of the application, is already planned. The large database of smartphone sensor parameters may prove particularly useful for cross-platform developers. It may also be interesting for individual participants, who would be able to check and compare their smartphone sensors against a large number of similar or identical models. PMID:27049391
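Accelerometer bias and noise of the kind the application gathers can be estimated from a short static capture. A minimal sketch under the assumption of a face-up, stationary phone (our illustration, not the app's actual pipeline):

    import numpy as np

    def bias_and_noise(samples_ms2, g=9.80665):
        """Estimate z-axis accelerometer bias and noise from a flat, static capture.

        samples_ms2: 1-D array of z-axis readings with the phone face-up, so
        the true signal is +g; the mean offset gives bias, the spread gives noise.
        """
        bias = samples_ms2.mean() - g
        noise_rms = samples_ms2.std(ddof=1)
        return bias, noise_rms

    rng = np.random.default_rng(3)
    capture = 9.80665 + 0.12 + 0.03 * rng.normal(size=2000)  # synthetic sensor
    bias, noise = bias_and_noise(capture)
    print(f"bias={bias:.3f} m/s^2, noise_rms={noise:.3f} m/s^2")  # ~0.120, ~0.030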
NASA Astrophysics Data System (ADS)
Barbieux, Marie; Uitz, Julia; Bricaud, Annick; Organelli, Emanuele; Poteau, Antoine; Schmechtig, Catherine; Gentili, Bernard; Obolensky, Grigor; Leymarie, Edouard; Penkerc'h, Christophe; D'Ortenzio, Fabrizio; Claustre, Hervé
2018-02-01
Characterizing phytoplankton distribution and dynamics in the world's open oceans requires in situ observations over a broad range of space and time scales. In addition to temperature/salinity measurements, Biogeochemical-Argo (BGC-Argo) profiling floats are capable of autonomously observing at high-frequency bio-optical properties such as the chlorophyll fluorescence, a proxy of the chlorophyll a concentration (Chla), the particulate backscattering coefficient (bbp), a proxy of the stock of particulate organic carbon, and the light available for photosynthesis. We analyzed an unprecedented BGC-Argo database of more than 8,500 multivariable profiles collected in various oceanic conditions, from subpolar waters to subtropical gyres. Our objective is to refine previously established Chla versus bbp relationships and gain insights into the sources of vertical, seasonal, and regional variability in this relationship. Despite some regional, seasonal and vertical variations, a general covariation occurs at a global scale. We distinguish two main contrasted situations: (1) concomitant changes in Chla and bbp that correspond to actual variations in phytoplankton biomass, e.g., in subpolar regimes; (2) a decoupling between the two variables attributed to photoacclimation or changes in the relative abundance of nonalgal particles, e.g., in subtropical regimes. The variability in the bbp:Chla ratio in the surface layer appears to be essentially influenced by the type of particles and by photoacclimation processes. The large BGC-Argo database helps identifying the spatial and temporal scales at which this ratio is predominantly driven by one or the other of these two factors.
Cruz-Motta, Juan José; Miloslavich, Patricia; Palomo, Gabriela; Iken, Katrin; Konar, Brenda; Pohle, Gerhard; Trott, Tom; Benedetti-Cecchi, Lisandro; Herrera, César; Hernández, Alejandra; Sardi, Adriana; Bueno, Andrea; Castillo, Julio; Klein, Eduardo; Guerra-Castro, Edlin; Gobin, Judith; Gómez, Diana Isabel; Riosmena-Rodríguez, Rafael; Mead, Angela; Bigatti, Gregorio; Knowlton, Ann; Shirayama, Yoshihisa
2010-01-01
Assemblages associated with intertidal rocky shores were examined for large-scale distribution patterns with specific emphasis on identifying latitudinal trends of species richness and taxonomic distinctiveness. Seventy-two sites distributed around the globe were evaluated following the standardized sampling protocol of the Census of Marine Life NaGISA project (www.nagisa.coml.org). There were no clear patterns of standardized estimators of species richness along latitudinal gradients or among Large Marine Ecosystems (LMEs); however, a strong latitudinal gradient in taxonomic composition (i.e., the proportion of different taxonomic groups in a given sample) was observed. Environmental variables related to natural influences were strongly related to the distribution patterns of the assemblages on the LME scale, particularly photoperiod, sea surface temperature (SST) and rainfall. In contrast, no environmental variables directly associated with human influences (with the exception of the inorganic pollution index) were related to assemblage patterns among LMEs. Correlations of the natural assemblages with either latitudinal gradients or environmental variables were equally strong, suggesting that neither neutral models nor models based solely on environmental variables sufficiently explain spatial variation of these assemblages at a global scale. Despite the data shortcomings in this study (e.g., unbalanced sample distribution), we show the importance of generating global biological databases for use in large-scale diversity comparisons of rocky intertidal assemblages and to stimulate continued sampling and analyses. PMID:21179546
Wall Modeled Large Eddy Simulation of Airfoil Trailing Edge Noise
NASA Astrophysics Data System (ADS)
Kocheemoolayil, Joseph; Lele, Sanjiva
2014-11-01
Large eddy simulation (LES) of airfoil trailing edge noise has largely been restricted to low Reynolds numbers due to prohibitive computational cost. Wall modeled LES (WMLES) is a computationally cheaper alternative that makes full-scale Reynolds numbers relevant to large wind turbines accessible. A systematic investigation of trailing edge noise prediction using WMLES is conducted. Detailed comparisons are made with experimental data. The stress boundary condition from a wall model does not constrain the fluctuating velocity to vanish at the wall. This limitation has profound implications for trailing edge noise prediction. The simulation over-predicts the intensity of fluctuating wall pressure and far-field noise. An improved wall model formulation that minimizes the over-prediction of fluctuating wall pressure is proposed and carefully validated. The flow configurations chosen for the study are from the workshop on benchmark problems for airframe noise computations. The large eddy simulation database is used to examine the adequacy of scaling laws that quantify the dependence of trailing edge noise on Mach number, Reynolds number and angle of attack. Simplifying assumptions invoked in engineering approaches towards predicting trailing edge noise are critically evaluated. We gratefully acknowledge financial support from GE Global Research and thank Cascade Technologies Inc. for providing access to their massively-parallel large eddy simulation framework.
Assembling proteomics data as a prerequisite for the analysis of large scale experiments
Schmidt, Frank; Schmid, Monika; Thiede, Bernd; Pleißner, Klaus-Peter; Böhme, Martina; Jungblut, Peter R
2009-01-01
Background Despite the complete determination of the genome sequence of a huge number of bacteria, their proteomes remain relatively poorly defined. Besides new methods to increase the number of identified proteins, new database applications are necessary to store and present the results of large-scale proteomics experiments. Results In the present study, a database concept has been developed to address these issues and to offer complete information via a web interface. In our concept, the Oracle-based data repository system SQL-LIMS plays the central role in the proteomics workflow and was applied to the proteomes of Mycobacterium tuberculosis, Helicobacter pylori, Salmonella typhimurium and protein complexes such as the 20S proteasome. Technical operations of our proteomics labs were used as the standard for SQL-LIMS template creation. By means of a Java-based data parser, post-processed data of different approaches, such as LC/ESI-MS, MALDI-MS and 2-D gel electrophoresis (2-DE), were stored in SQL-LIMS. A minimum set of the proteomics data were transferred to our public 2D-PAGE database using a Java-based interface (Data Transfer Tool) meeting the requirements of the PEDRo standardization. Furthermore, the stored proteomics data were extractable from SQL-LIMS via XML. Conclusion The Oracle-based data repository system SQL-LIMS played the central role in the proteomics workflow concept. Technical operations of our proteomics labs were used as standards for SQL-LIMS templates. Using a Java-based parser, post-processed data of different approaches such as LC/ESI-MS, MALDI-MS, 1-DE and 2-DE were stored in SQL-LIMS. Thus, the unique data formats of different instruments were unified and stored in SQL-LIMS tables. Moreover, a unique submission identifier allowed fast access to all experimental data. This was the main advantage compared with multi-software solutions, especially if personnel fluctuations are high. Moreover, large-scale and high-throughput experiments must be managed in a comprehensive repository system such as SQL-LIMS, to query results in a systematic manner. On the other hand, these database systems are expensive and require at least one full-time administrator and a specialized lab manager. Moreover, the high technical dynamics in proteomics may cause problems in adjusting to new data formats. To summarize, SQL-LIMS met the requirements of proteomics data handling, especially in skilled processes such as gel electrophoresis or mass spectrometry, and fulfilled the PSI standardization criteria. The data transfer into a public domain via DTT facilitated validation of proteomics data. Additionally, evaluation of mass spectra by post-processing using MS-Screener improved the reliability of mass analysis and prevented storage of junk data. PMID:19166578
NASA Astrophysics Data System (ADS)
Miles, B.; Chepudira, K.; LaBar, W.
2017-12-01
The Open Geospatial Consortium (OGC) SensorThings API (STA) specification, ratified in 2016, is a next-generation open standard for enabling real-time communication of sensor data. Building on over a decade of OGC Sensor Web Enablement (SWE) standards, STA offers a rich data model that can represent a range of sensor and phenomenon types (e.g. fixed sensors sensing fixed phenomena, fixed sensors sensing moving phenomena, mobile sensors sensing fixed phenomena, and mobile sensors sensing moving phenomena) and is data agnostic. Additionally, and in contrast to previous SWE standards, STA is developer-friendly, as is evident from its convenient JSON serialization and expressive OData-based query language (with support for geospatial queries); with its Message Queue Telemetry Transport (MQTT) support, STA is also well suited to efficient real-time data publishing and discovery. All these attributes make STA potentially useful in environmental monitoring sensor networks. Here we present Kinota(TM), an open-source NoSQL implementation of OGC SensorThings for large-scale, high-resolution, real-time environmental monitoring. Kinota, which roughly stands for Knowledge from Internet of Things Analyses, relies on Cassandra as its underlying data store: a horizontally scalable, fault-tolerant, open-source database often used to store time-series data for Big Data applications (though integration with other NoSQL or relational databases is possible). With this foundation, Kinota can scale to store data from an arbitrary number of sensors collecting data every 500 milliseconds. Additionally, the Kinota architecture is very modular, allowing adopters to replace parts of the existing implementation when desirable. The architecture is also highly portable, providing the flexibility to choose among cloud providers such as Azure, Amazon, and Google. The scalable, flexible and cloud-friendly architecture of Kinota makes it ideal for use in next-generation large-scale, high-resolution, real-time environmental monitoring networks used in domains such as hydrology, geomorphology, and geophysics, as well as in management applications such as flood early warning and regulatory enforcement.
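STA's JSON/OData surface is easy to exercise directly. The sketch below reads recent observations from a SensorThings endpoint; the host name and entity IDs are hypothetical, while the entity paths and query options ($orderby, $top, $filter) follow the OGC STA specification:

    import requests

    BASE = "https://sta.example.org/v1.0"  # hypothetical Kinota/STA endpoint

    # List the Datastreams of one Thing, then pull its most recent Observations.
    datastreams = requests.get(f"{BASE}/Things(42)/Datastreams").json()["value"]
    ds_id = datastreams[0]["@iot.id"]

    obs = requests.get(
        f"{BASE}/Datastreams({ds_id})/Observations",
        params={
            "$orderby": "phenomenonTime desc",  # newest first
            "$top": 100,                        # page size
            "$filter": "result gt 0.5",         # server-side OData filter
        },
    ).json()["value"]

    for o in obs:
        print(o["phenomenonTime"], o["result"])

For push delivery, the same resources are exposed over MQTT: a client can subscribe to, e.g., v1.0/Datastreams(42)/Observations to receive new observations as they arrive.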
Sharma, Parichit; Mantri, Shrikant S
2014-01-01
The function of a newly sequenced gene can be discovered by determining its sequence homology with known proteins. BLAST is the most extensively used sequence analysis program for sequence similarity search in large databases of sequences. With the advent of next generation sequencing technologies it has now become possible to study genes and their expression at a genome-wide scale through RNA-seq and metagenome sequencing experiments. Functional annotation of all the genes is done by sequence similarity search against multiple protein databases. This annotation task is computationally very intensive and can take days to obtain complete results. The program mpiBLAST, an open-source parallelization of BLAST that achieves superlinear speedup, can be used to accelerate large-scale annotation by using supercomputers and high performance computing (HPC) clusters. Although many parallel bioinformatics applications using the Message Passing Interface (MPI) are available in the public domain, researchers are reluctant to use them due to lack of expertise in the Linux command line and relevant programming experience. With these limitations, it becomes difficult for biologists to use mpiBLAST for accelerating annotation. No web interface is available in the open-source domain for mpiBLAST. We have developed WImpiBLAST, a user-friendly open-source web interface for parallel BLAST searches. It is implemented in Struts 1.3 using a Java backbone and runs atop the open-source Apache Tomcat Server. WImpiBLAST supports script creation and job submission features and also provides a robust job management interface for system administrators. It combines script creation and modification features with job monitoring and management through the Torque resource manager on a Linux-based HPC cluster. Use case information highlights the acceleration of annotation analysis achieved by using WImpiBLAST. Here, we describe the WImpiBLAST web interface features and architecture, explain design decisions, describe workflows and provide a detailed analysis.
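The query-splitting strategy behind parallel BLAST can be imitated on a single multi-core node with standard BLAST+. The sketch below is a simplified stand-in for the Torque/mpiBLAST jobs that WImpiBLAST generates; the pre-split chunks/ directory and the database name nr_subset are assumptions for illustration:

    import subprocess
    from multiprocessing import Pool
    from pathlib import Path

    def blast_chunk(chunk_fasta):
        """Run one BLAST+ search on a query chunk; a stand-in for one cluster job."""
        out = Path(chunk_fasta).with_suffix(".tsv")
        subprocess.run(
            ["blastp", "-query", str(chunk_fasta), "-db", "nr_subset",
             "-outfmt", "6", "-out", str(out)],  # tabular output
            check=True,
        )
        return out

    if __name__ == "__main__":
        chunks = sorted(Path("chunks").glob("*.fasta"))  # pre-split query file
        with Pool(processes=8) as pool:  # one worker per core
            for result in pool.imap_unordered(blast_chunk, chunks):
                print("finished", result)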
Shibata, Natsumi; Kimura, Shinya; Hoshino, Takahiro; Takeuchi, Masato; Urushihara, Hisashi
2018-05-11
To date, few large-scale comparative effectiveness studies of influenza vaccination have been conducted in Japan, since marketing authorization for influenza vaccines in Japan has been granted based only on seroconversion and safety results in small populations during clinical trials, not on vaccine effectiveness. We evaluated the clinical effectiveness of influenza vaccination for children aged 1-15 years in Japan throughout four influenza seasons from 2010 to 2014 in a real-world setting. We conducted a cohort study using a large-scale claims database for employee health care insurance plans covering more than 3 million people, including enrollees and their dependents. Vaccination status was identified using plan records of influenza vaccination subsidies. The effectiveness of influenza vaccination in preventing influenza and its complications was evaluated. To control confounding related to influenza vaccination, odds ratios (ORs) were calculated by applying a doubly robust method using the propensity score for vaccination. The total study population throughout the four consecutive influenza seasons was over 116,000. The vaccination rate was higher in younger children and in the more recent influenza seasons. Throughout the four seasons, the estimated ORs for influenza onset were statistically significant and ranged from 0.797 to 0.894 after doubly robust adjustment. On age stratification, significant ORs were observed in younger children. Additionally, ORs for influenza complication outcomes, such as pneumonia, hospitalization with influenza and respiratory tract diseases, were significantly reduced, except for hospitalization with influenza in the 2010/2011 and 2012/2013 seasons. We confirmed the clinical effectiveness of influenza vaccination in children aged 1-15 years from the 2010/2011 to 2013/2014 influenza seasons. Influenza vaccination significantly prevented the onset of influenza and was effective in reducing its secondary complications.
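The doubly robust adjustment combines a propensity model for vaccination with outcome models, and remains consistent if either is correctly specified. A minimal augmented-inverse-probability-weighting (AIPW) sketch on synthetic data (our illustration with scikit-learn, estimating a risk difference rather than the paper's odds ratios):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def aipw_risk_difference(X, treated, outcome):
        """Doubly robust (AIPW) effect of a binary treatment on a binary outcome.

        X: (n, p) confounders (e.g. age, sex, season); treated/outcome: 0/1 arrays.
        """
        ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
        # Outcome models fit separately in each treatment arm.
        m1 = LogisticRegression(max_iter=1000).fit(X[treated == 1], outcome[treated == 1])
        m0 = LogisticRegression(max_iter=1000).fit(X[treated == 0], outcome[treated == 0])
        mu1, mu0 = m1.predict_proba(X)[:, 1], m0.predict_proba(X)[:, 1]
        # AIPW estimates of E[Y(1)] and E[Y(0)].
        y1 = mu1 + treated * (outcome - mu1) / ps
        y0 = mu0 + (1 - treated) * (outcome - mu0) / (1 - ps)
        return y1.mean() - y0.mean()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 3))
    treated = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))           # confounded uptake
    outcome = rng.binomial(1, 0.3 - 0.1 * treated + 0.05 * (X[:, 0] > 0))
    print(aipw_risk_difference(X, treated, outcome))  # close to the true -0.1 effect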
Large-eddy simulations of a forced homogeneous isotropic turbulence with polymer additives
NASA Astrophysics Data System (ADS)
Wang, Lu; Cai, Wei-Hua; Li, Feng-Chen
2014-03-01
Large-eddy simulations (LES) based on the temporal approximate deconvolution model were performed for a forced homogeneous isotropic turbulence (FHIT) with polymer additives at moderate Taylor Reynolds number. The finitely extensible nonlinear elastic model in the Peterlin approximation (FENE-P) was adopted as the constitutive equation for the filtered conformation tensor of the polymer molecules. The LES results were verified through comparisons with direct numerical simulation results. Using the LES database of the FHIT in the Newtonian fluid and polymer solution flows, the polymer effects on important quantities such as strain, vorticity, and drag reduction were studied. By extracting the vortex structures and examining the flatness factor through a high-order correlation function of the velocity derivative and wavelet analysis, it is found that the small-scale vortex structures and small-scale intermittency in the FHIT are both inhibited by the presence of the polymers. The extended self-similarity scaling law in the polymer solution flow shows no apparent difference from that in the Newtonian fluid flow over the currently simulated ranges of Reynolds and Weissenberg numbers.
Neuromorphic Hardware Architecture Using the Neural Engineering Framework for Pattern Recognition.
Wang, Runchun; Thakur, Chetan Singh; Cohen, Gregory; Hamilton, Tara Julia; Tapson, Jonathan; van Schaik, Andre
2017-06-01
We present a hardware architecture that uses the neural engineering framework (NEF) to implement large-scale neural networks on field programmable gate arrays (FPGAs) for performing massively parallel real-time pattern recognition. NEF is a framework that is capable of synthesising large-scale cognitive systems from subnetworks, and we have previously presented an FPGA implementation of the NEF that successfully performs nonlinear mathematical computations. That work was based on a compact digital neural core consisting of 64 neurons that are instantiated by a single physical neuron using a time-multiplexing approach. We have now scaled this approach up to build a pattern recognition system by combining identical neural cores. As a proof of concept, we have developed a handwritten digit recognition system using the MNIST database and achieved a recognition rate of 96.55%. The system is implemented on a state-of-the-art FPGA and can process 5.12 million digits per second. The architecture and hardware optimisations presented offer a resource-efficient means of performing high-speed, neuromorphic, massively parallel pattern recognition and classification tasks.
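The NEF's core operation, representing a value with a population of heterogeneous tuning curves and recovering it with least-squares decoders, fits in a few lines. A minimal rate-based sketch (our simplification of the framework, not the FPGA implementation; all parameters below are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(1)
    N, n_pts = 64, 200                      # one "neural core" worth of neurons
    x = np.linspace(-1, 1, n_pts)           # scalar signal to represent

    # Random heterogeneous tuning: gains, biases, preferred directions (+/-1).
    gain = rng.uniform(0.5, 2.0, N)
    bias = rng.uniform(-1.0, 1.0, N)
    enc = rng.choice([-1.0, 1.0], N)

    # Rectified-linear rate neurons: A[i, j] = max(0, gain*enc*x_j + bias).
    A = np.maximum(0.0, gain[:, None] * enc[:, None] * x[None, :] + bias[:, None])

    # Regularised least-squares decoders recover x from population activity.
    reg = 0.1 * N * np.eye(N)
    d = np.linalg.solve(A @ A.T + reg, A @ x)
    x_hat = d @ A
    print(float(np.sqrt(np.mean((x_hat - x) ** 2))))  # small reconstruction error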
NASA Astrophysics Data System (ADS)
Yulaeva, E.; Fan, Y.; Moosdorf, N.; Richard, S. M.; Bristol, S.; Peters, S. E.; Zaslavsky, I.; Ingebritsen, S.
2015-12-01
The Digital Crust EarthCube building block creates a framework for integrating disparate 3D/4D information from multiple sources into a comprehensive model of the structure and composition of the Earth's upper crust, and demonstrates the utility of this model in several research scenarios. One such scenario is the estimation of various crustal properties related to fluid dynamics (e.g. permeability and porosity) at each node of an arbitrary unstructured 3D grid, to support continental-scale numerical models of fluid flow and transport. Starting from Macrostrat, an existing 4D database of 33,903 chronostratigraphic units, and employing GeoDeepDive, a software system for extracting structured information from unstructured documents, we construct 3D gridded fields of sediment/rock porosity, permeability and geochemistry for large sedimentary basins of North America, which will be used to improve our understanding of large-scale fluid flow, chemical weathering rates, and geochemical fluxes into the ocean. In this talk, we discuss the methods, data gaps (particularly in geologically complex terrain), and various physical and geological constraints on interpolation and uncertainty estimation.
Pereira, Florbela; Latino, Diogo A. R. S.; Gaudêncio, Susana P.
2014-01-01
The comprehensive information on small molecules and their biological activities in the PubChem database allows chemoinformatic researchers to access and make use of large-scale biological activity data to improve the precision of drug profiling. A Quantitative Structure–Activity Relationship approach for classification was used to predict active/inactive compounds with respect to overall biological activity, antitumor activity and antibiotic activity, using a data set of 1804 compounds from PubChem. Using the best classification models for antibiotic and antitumor activities, a data set of marine and microbial natural products from the AntiMarin database was screened; 57 and 16 new lead compounds were proposed for antibiotic and antitumor drug design, respectively. All compounds proposed by our approach are classified as non-antibiotic and non-antitumor compounds in the AntiMarin database. Recently, several of the lead-like compounds proposed by us were reported as being active in the literature. PMID:24473174
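The classification step of such a QSAR workflow can be prototyped with binary fingerprints and a standard classifier. In the sketch below, random bit vectors stand in for real molecular descriptors, so only the workflow, not the chemistry, is meaningful:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(4)
    n, n_bits = 1804, 256                       # dataset size from the abstract
    X = rng.integers(0, 2, size=(n, n_bits))    # stand-in for binary fingerprints
    y = (X[:, :8].sum(axis=1) > 4).astype(int)  # synthetic activity label

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(scores.mean())  # cross-validated AUC on the toy data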
Kim, Joongheon; Kim, Jong-Kook
2016-01-01
This paper addresses computation procedures for estimating the impact of interference in 60 GHz IEEE 802.11ad uplink access, in order to construct a visual big-data database from randomly deployed surveillance camera sensing devices. The large-scale visual information acquired from the surveillance camera devices is used to organize the big-data database, i.e., this estimation is essential for constructing a centralized cloud-enabled surveillance database. This performance estimation study captures the interference impacts on the target cloud access points from multiple interference components generated by 60 GHz wireless transmissions from nearby surveillance camera devices to their associated cloud access points. With this uplink interference scenario, the interference impacts on the main wireless transmission from a target surveillance camera device to its associated target cloud access point are measured and estimated for a number of settings, under consideration of 60 GHz radiation characteristics and antenna radiation pattern models.
Wang, Lei; Alpert, Kathryn I.; Calhoun, Vince D.; Cobia, Derin J.; Keator, David B.; King, Margaret D.; Kogan, Alexandr; Landis, Drew; Tallis, Marcelo; Turner, Matthew D.; Potkin, Steven G.; Turner, Jessica A.; Ambite, Jose Luis
2015-01-01
SchizConnect (www.schizconnect.org) is built to address the issues of multiple data repositories in schizophrenia neuroimaging studies. It includes a level of mediation (translating across data sources) so that the user can place one query, e.g. for diffusion images from male individuals with schizophrenia, and find out from across the participating data sources how many datasets there are, as well as download the imaging and related data. The current version handles the Data Usage Agreements across different studies, as well as interpreting database-specific terminologies into a common framework. New data repositories can also be mediated to bring immediate access to existing datasets. Compared with centralized, upload-based data sharing models, SchizConnect is a unique, virtual database with a focus on schizophrenia and related disorders that can mediate live data as information is updated at each data source. It is our hope that SchizConnect can facilitate the testing of new hypotheses through aggregated datasets, promoting discovery related to the mechanisms underlying schizophrenic dysfunction. PMID:26142271
Exploring Large-Scale Cross-Correlation for Teleseismic and Regional Seismic Event Characterization
NASA Astrophysics Data System (ADS)
Dodge, Doug; Walter, William; Myers, Steve; Ford, Sean; Harris, Dave; Ruppert, Stan; Buttler, Dave; Hauk, Terri
2013-04-01
The decrease in costs of both digital storage space and computation power invites new methods of seismic data processing. At Lawrence Livermore National Laboratory (LLNL) we operate a growing research database of seismic events and waveforms for nuclear explosion monitoring and other applications. Currently the LLNL database contains several million events associated with tens of millions of waveforms at thousands of stations. We are making use of this database to explore the power of seismic waveform correlation to quantify signal similarities, to discover new events not in catalogs, and to more accurately locate events and identify source types. Building on the very efficient correlation methodologies of Harris and Dodge (2011), we computed the waveform correlation for event pairs in the LLNL database in two ways. First, we performed whole-waveform cross-correlation over seven distinct frequency bands. The correlation coefficient exceeds 0.6 for more than 40 million waveform pairs covering several hundred thousand events at more than a thousand stations. These correlations reveal clusters of mining events and aftershock sequences, which can be used to readily identify and locate events. Second, we determine relative pick times by correlating signals in time windows around distinct seismic phases. These correlated picks are then used to perform very high accuracy event relocations. We are examining the percentage of events that correlate as a function of magnitude and observing-station distance in selected high-seismicity regions. Combining these empirical results with those obtained using synthetic data, we are working to quantify relationships between correlation and event-pair separation (in epicenter and depth) as well as mechanism differences. Our exploration of these techniques on a large seismic database is in process, and we will report on our findings in more detail at the meeting.
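The core primitive in both processing modes is normalized waveform cross-correlation. A minimal sketch that recovers the correlation coefficient and relative lag of two noisy copies of the same wavelet (our illustration, not LLNL's production code):

    import numpy as np

    def max_normalized_xcorr(a, b):
        """Peak normalized cross-correlation and its lag between two waveforms."""
        a = (a - a.mean()) / (a.std() * len(a))
        b = (b - b.mean()) / b.std()
        cc = np.correlate(a, b, mode="full")
        lag = cc.argmax() - (len(b) - 1)
        return cc.max(), lag

    # Two noisy copies of the same wavelet, one delayed by 30 samples.
    rng = np.random.default_rng(2)
    t = np.arange(500)
    wavelet = np.exp(-((t - 250) ** 2) / 200.0) * np.sin(t / 3.0)
    x = wavelet + 0.02 * rng.normal(size=t.size)
    y = np.roll(wavelet, 30) + 0.02 * rng.normal(size=t.size)
    cc, lag = max_normalized_xcorr(x, y)
    print(round(cc, 2), lag)  # correlation near 1.0 at a lag of about -30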
Seismic Search Engine: A distributed database for mining large scale seismic data
NASA Astrophysics Data System (ADS)
Liu, Y.; Vaidya, S.; Kuzma, H. A.
2009-12-01
The International Monitoring System (IMS) of the CTBTO collects terabytes' worth of seismic measurements from many receiver stations situated around the earth, with the goal of detecting underground nuclear testing events and distinguishing them from other benign, but more common, events such as earthquakes and mine blasts. The International Data Center (IDC) processes and analyzes these measurements, as they are collected by the IMS, to summarize event detections in daily bulletins. Thereafter, the measurements are archived in a large-format database. Our proposed Seismic Search Engine (SSE) will provide a framework for exploration of the seismic database as well as for the development of seismic data mining algorithms. Analogous to GenBank, the annotated genetic sequence database maintained by the NIH, SSE is intended to provide public access to seismic data and a set of processing and analysis tools, along with community-generated annotations and statistical models to help interpret the data. SSE will implement queries as user-defined functions composed from standard tools and models. Each query is compiled and executed over the database internally before results are reported back to the user. Since queries are expressed with standard tools and models, users can easily reproduce published results within this framework for peer review and for making metric comparisons. As an illustration, an example query is "what are the best receiver stations in East Asia for detecting events in the Middle East?" Evaluating this query involves listing all receiver stations in East Asia, characterizing known seismic events in that region, and constructing a profile for each receiver station to determine how effective its measurements are at predicting each event. The results of this query can be used to help prioritize how data are collected, identify defective instruments, and guide future sensor placements.
Sheynkman, Gloria M.; Shortreed, Michael R.; Frey, Brian L.; Scalf, Mark; Smith, Lloyd M.
2013-01-01
Each individual carries thousands of non-synonymous single nucleotide variants (nsSNVs) in their genome, each corresponding to a single amino acid polymorphism (SAP) in the encoded proteins. It is important to be able to directly detect and quantify these variations at the protein level in order to study post-transcriptional regulation, differential allelic expression, and other important biological processes. However, such variant peptides are not generally detected in standard proteomic analyses, due to their absence from the generic databases that are employed for mass spectrometry searching. Here, we extend previous work that demonstrated the use of customized SAP databases constructed from sample-matched RNA-Seq data. We collected deep coverage RNA-Seq data from the Jurkat cell line, compiled the set of nsSNVs that are expressed, used this information to construct a customized SAP database, and searched it against deep coverage shotgun MS data obtained from the same sample. This approach enabled detection of 421 SAP peptides mapping to 395 nsSNVs. We compared these peptides to peptides identified from a large generic search database containing all known nsSNVs (dbSNP) and found that more than 70% of the SAP peptides from this dbSNP-derived search were not supported by the RNA-Seq data, and thus are likely false positives. Next, we increased the SAP coverage from the RNA-Seq derived database by utilizing multiple protease digestions, thereby increasing variant detection to 695 SAP peptides mapping to 504 nsSNV sites. These detected SAP peptides corresponded to moderate to high abundance transcripts (30+ transcripts per million, TPM). The SAP peptides included 192 allelic pairs; the relative expression levels of the two alleles were evaluated for 51 of those pairs, and found to be comparable in all cases. PMID:24175627
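Building a customized SAP database amounts to applying each expressed nsSNV to its protein and keeping the enzymatic peptide that covers the substituted residue. A minimal tryptic sketch with a hypothetical sequence and variant (our illustration, not the authors' pipeline):

    import re

    def tryptic_peptides(protein):
        """Split after K/R not followed by P (simple trypsin rule, no missed cleavages)."""
        return re.split(r"(?<=[KR])(?!P)", protein)

    def sap_peptide(protein, pos, alt):
        """Return the tryptic peptide containing a single amino acid polymorphism.

        pos is a 0-based residue index; alt is the variant residue.
        """
        variant = protein[:pos] + alt + protein[pos + 1:]
        offset = 0
        for pep in tryptic_peptides(variant):
            if offset <= pos < offset + len(pep):
                return pep
            offset += len(pep)

    # Hypothetical protein: the K->E variant removes a cleavage site, so the
    # SAP peptide spans what were two tryptic peptides in the reference.
    seq = "MAGTKLLRVSDK"
    print(sap_peptide(seq, 4, "E"))  # MAGTELLR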
Akbari, Hamed; Bilello, Michel; Da, Xiao; Davatzikos, Christos
2015-01-01
Evaluating algorithms for the inter-subject registration of brain magnetic resonance images (MRI) is a necessary task that is receiving growing attention. Existing studies evaluated image registration algorithms in specific tasks or using specific databases (e.g., only for skull-stripped images, only for single-site images, etc.). Consequently, the choice of registration algorithms seems task- and usage/parameter-dependent. Nevertheless, recent large-scale, often multi-institutional imaging-related studies create the need for, and raise the question of whether, some registration algorithms can 1) apply generally to various tasks/databases posing various challenges; 2) perform consistently well; and, while doing so, 3) require minimal or ideally no parameter tuning. In seeking answers to this question, we evaluated 12 general-purpose registration algorithms for their generality, accuracy and robustness. We fixed their parameters at values suggested by algorithm developers as reported in the literature. We tested them in 7 databases/tasks, which present one or more of 4 commonly encountered challenges: 1) inter-subject anatomical variability in skull-stripped images; 2) intensity inhomogeneity, noise and large structural differences in raw images; 3) imaging protocol and field-of-view (FOV) differences in multi-site data; and 4) missing correspondences in pathology-bearing images. In total, 7,562 registrations were performed. Registration accuracies were measured by (multi-)expert-annotated landmarks or regions of interest (ROIs). To ensure reproducibility, we used public software tools, public databases (whenever possible), and we fully disclose the parameter settings. We show evaluation results, and discuss the performances in light of the algorithms' similarity metrics, transformation models and optimization strategies. We also discuss future directions for algorithm development and evaluations. PMID:24951685
NASA Astrophysics Data System (ADS)
Barfod, Adrian A. S.; Møller, Ingelise; Christiansen, Anders V.
2016-11-01
We present a large-scale study of the petrophysical relationship between resistivities obtained from densely sampled ground-based and airborne transient electromagnetic surveys and lithological information from boreholes. The overriding aim of this study is to develop a framework for examining the resistivity-lithology relationship in a statistical manner and to apply this framework to gain a better description of the large-scale resistivity structures of the subsurface. In Denmark, very large and extensive datasets are available through the national geophysical and borehole databases, GERDA and JUPITER, respectively. In a 10 by 10 km grid, these data are compiled into histograms of resistivity versus lithology. To do this, the geophysical data are interpolated to the positions of the boreholes, which allows for a lithological categorization of the interpolated resistivity values, yielding different histograms for a set of desired lithological categories. By applying the proposed algorithm to all available boreholes and airborne and ground-based transient electromagnetic data, we build nation-wide maps of the resistivity-lithology relationships in Denmark. The presented Resistivity Atlas reveals varying patterns in the large-scale resistivity-lithology relations, reflecting geological details such as the available source material for tills. The resistivity maps also reveal a clear ambiguity in the resistivity values for different lithologies. The Resistivity Atlas is highly useful when geophysical data are to be used for geological or hydrological modeling.
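A minimal sketch of the grid-cell compilation step, assuming nearest-neighbour interpolation of the resistivity model onto borehole sample positions; the function and variable names are ours, not those of the GERDA/JUPITER processing chain.

```python
# Illustrative sketch of the resistivity-lithology compilation: interpolate
# resistivity model values to borehole sample positions, then bin them into
# per-lithology histograms. Nearest-neighbour lookup is an assumption.
import numpy as np
from scipy.spatial import cKDTree

def lithology_histograms(res_xyz, res_values, borehole_xyz, lithologies,
                         bins=np.logspace(0, 3, 31)):
    """Return a {lithology: histogram} dict of log-spaced resistivity counts."""
    tree = cKDTree(res_xyz)                    # positions of model cells
    _, idx = tree.query(borehole_xyz)          # nearest model cell per sample
    rho_at_boreholes = res_values[idx]
    hists = {}
    for lith in np.unique(lithologies):
        mask = lithologies == lith
        hists[lith], _ = np.histogram(rho_at_boreholes[mask], bins=bins)
    return hists
```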
NASA Technical Reports Server (NTRS)
Hemsch, Michael J.
2016-01-01
Recently a very large (739 runs) collection of high-fidelity RANS CFD solutions was obtained for Space Launch System ascent aerodynamics for the vehicle to be used for the first exploratory (unmanned) mission (EM-1). The extensive computations, at full-scale conditions, were originally developed to obtain detailed line and protuberance loads and surface pressures for venting analyses. The line loads were eventually integrated for comparison of the resulting forces and moments to the database that was derived from wind tunnel tests conducted at sub-scale conditions. The comparisons presented herein cover the ranges 0.5 ≤ M∞ ≤ 5, −6° ≤ α ≤ 6°, and −6° ≤ β ≤ 6°. For detailed comparisons, slender-body-theory-based component build-up aero models from missile aerodynamics are used. The differences in the model fit coefficients are shown to be relatively small except for the low supersonic Mach number range, 1.1 ≤ M∞ ≤ 2.0. The analysis is intended to support process improvement and development of uncertainty models.
Database for the geologic map of the Mount Baker 30- by 60-minute quadrangle, Washington (I-2660)
Tabor, R.W.; Haugerud, R.A.; Hildreth, Wes; Brown, E.H.
2006-01-01
This digital map database has been prepared by R.W. Tabor from the published Geologic map of the Mount Baker 30- by 60-Minute Quadrangle, Washington. Together with the accompanying text files in PDF format, it provides information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The authors mapped most of the geology at 1:100,000. The Quaternary contacts and structural data have been much simplified for the 1:100,000-scale map and database. The spatial resolution (scale) of the database is 1:100,000 or smaller. This database depicts the distribution of geologic materials and structures at a regional (1:100,000) scale. The report is intended to provide geologic information for the regional study of materials properties, earthquake shaking, landslide potential, mineral hazards, seismic velocity, and earthquake faults. In addition, the report contains information and interpretations about the regional geologic history and framework. However, the regional scale of this report does not provide sufficient detail for site development purposes.
Harrigan, Robert L; Yvernault, Benjamin C; Boyd, Brian D; Damon, Stephen M; Gibney, Kyla David; Conrad, Benjamin N; Phillips, Nicholas S; Rogers, Baxter P; Gao, Yurui; Landman, Bennett A
2016-01-01
The Vanderbilt University Institute for Imaging Science (VUIIS) Center for Computational Imaging (CCI) has developed a database built on XNAT housing over a quarter of a million scans. The database provides a framework for (1) rapid prototyping, (2) large-scale batch processing of images and (3) scalable project management. The system uses the web-based interfaces of XNAT and REDCap to allow for graphical interaction. A Python middleware layer, the Distributed Automation for XNAT (DAX) package, distributes computation across the Vanderbilt Advanced Computing Center for Research and Education high-performance computing center. All software is made available in open source for use in combining portable batch scripting (PBS) grids and XNAT servers. Copyright © 2015 Elsevier Inc. All rights reserved.
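The batch-distribution idea can be illustrated generically. The sketch below writes and submits a PBS job script per scan with qsub; the processing command, job names, and resource requests are placeholders, and this is not the DAX package's actual interface.

```python
# Generic illustration of distributing per-scan jobs on a PBS grid.
# 'process_scan' is a placeholder command, not a real tool.
import subprocess
import tempfile
import textwrap

def submit_pbs_job(scan_id: str, walltime: str = "02:00:00") -> None:
    script = textwrap.dedent(f"""\
        #!/bin/bash
        #PBS -N proc_{scan_id}
        #PBS -l walltime={walltime}
        process_scan --id {scan_id}   # placeholder processing command
        """)
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script)
        path = f.name
    subprocess.run(["qsub", path], check=True)   # hand the script to PBS
```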
Experts' perceptions on the entrepreneurial framework conditions
NASA Astrophysics Data System (ADS)
Correia, Aldina; e Silva, Eliana Costa; Lopes, I. Cristina; Braga, Alexandra; Braga, Vitor
2017-11-01
The Global Entrepreneurship Monitor is a large-scale database for internationally comparative entrepreneurship research. This database includes information on more than 100 countries concerning several aspects of entrepreneurship activities, perceptions, conditions, and national and regional policy, among others, drawn from two main sources of primary data: the Adult Population Survey and the National Expert Survey. In the present work the National Expert Survey datasets for 2011, 2012 and 2013 are analyzed with the purpose of studying the effects of different types of entrepreneurship expert specialization on the perceptions about the Entrepreneurial Framework Conditions (EFCs). The results of the multivariate analysis of variance for the 2013 data show significant differences in the experts' perceptions when compared with the 2011 and 2012 surveys. In the 2013 data, entrepreneur experts are less favorable toward the EFCs than most of the other experts.
Large-Scale Spatial Distribution Patterns of Gastropod Assemblages in Rocky Shores
Miloslavich, Patricia; Cruz-Motta, Juan José; Klein, Eduardo; Iken, Katrin; Weinberger, Vanessa; Konar, Brenda; Trott, Tom; Pohle, Gerhard; Bigatti, Gregorio; Benedetti-Cecchi, Lisandro; Shirayama, Yoshihisa; Mead, Angela; Palomo, Gabriela; Ortiz, Manuel; Gobin, Judith; Sardi, Adriana; Díaz, Juan Manuel; Knowlton, Ann; Wong, Melisa; Peralta, Ana C.
2013-01-01
Gastropod assemblages from nearshore rocky habitats were studied over large spatial scales to (1) describe broad-scale patterns in assemblage composition, including patterns by feeding modes, (2) identify latitudinal pattern of biodiversity, i.e., richness and abundance of gastropods and/or regional hotspots, and (3) identify potential environmental and anthropogenic drivers of these assemblages. Gastropods were sampled from 45 sites distributed within 12 Large Marine Ecosystem regions (LME) following the NaGISA (Natural Geography in Shore Areas) standard protocol (www.nagisa.coml.org). A total of 393 gastropod taxa from 87 families were collected. Eight of these families (9.2%) appeared in four or more different LMEs. Among these, the Littorinidae was the most widely distributed (8 LMEs) followed by the Trochidae and the Columbellidae (6 LMEs). In all regions, assemblages were dominated by few species, the most diverse and abundant of which were herbivores. No latitudinal gradients were evident in relation to species richness or densities among sampling sites. Highest diversity was found in the Mediterranean and in the Gulf of Alaska, while highest densities were found at different latitudes and represented by few species within one genus (e.g. Afrolittorina in the Agulhas Current, Littorina in the Scotian Shelf, and Lacuna in the Gulf of Alaska). No significant correlation was found between species composition and environmental variables (r≤0.355, p>0.05). Contributing variables to this low correlation included invasive species, inorganic pollution, SST anomalies, and chlorophyll-a anomalies. Despite data limitations in this study which restrict conclusions in a global context, this work represents the first effort to sample gastropod biodiversity on rocky shores using a standardized protocol across a wide scale. Our results will generate more work to build global databases allowing for large-scale diversity comparisons of rocky intertidal assemblages. PMID:23967204
Yamamoto, Naoki; Suzuki, Tomohiro; Kobayashi, Masaaki; Dohra, Hideo; Sasaki, Yohei; Hirai, Hirofumi; Yokoyama, Koji; Kawagishi, Hirokazu; Yano, Kentaro
2014-12-03
The angel's wing oyster mushroom (Pleurocybella porrigens, Sugihiratake) is a well-known delicacy. However, its potential risk in acute encephalopathy was recently revealed by a food poisoning incident. To disclose the genes underlying the accident and provide mechanistic insight, we seek to develop an information infrastructure containing omics data. In our previous work, we sequenced the genome and transcriptome using next-generation sequencing techniques. The next step in achieving our goal is to develop a web database to facilitate the efficient mining of large-scale omics data and identification of genes specifically expressed in the mushroom. This paper introduces a web database A-WINGS (http://bioinf.mind.meiji.ac.jp/a-wings/) that provides integrated genomic and transcriptomic information for the angel's wing oyster mushroom. The database contains structure and functional annotations of transcripts and gene expressions. Functional annotations contain information on homologous sequences from NCBI nr and UniProt, Gene Ontology, and KEGG Orthology. Digital gene expression profiles were derived from RNA sequencing (RNA-seq) analysis in the fruiting bodies and mycelia. The omics information stored in the database is freely accessible through interactive and graphical interfaces by search functions that include 'GO TREE VIEW' browsing, keyword searches, and BLAST searches. The A-WINGS database will accelerate omics studies on specific aspects of the angel's wing oyster mushroom and the family Tricholomataceae.
Meta-analysis on Macropore Flow Velocity in Soils
NASA Astrophysics Data System (ADS)
Liu, D.; Gao, M.; Li, H. Y.; Chen, X.; Leung, L. R.
2017-12-01
Macropore flow is ubiquitous in soils and an important hydrologic process that is not well explained using traditional hydrologic theories. Macropore Flow Velocity (MFV) is an important parameter used to describe macropore flow and quantify its effects on runoff generation and solute transport. However, the dominant factors controlling MFV are still poorly understood and the typical ranges of MFV measured in the field are not clearly defined. To address these issues, we conducted a meta-analysis based on a database created from 246 experiments on MFV collected from 76 journal articles. For a fair comparison, a conceptually unified definition of MFV is introduced to convert the MFV measured with different approaches and at various scales, including soil core, field, trench or hillslope scales. The potential controlling factors of MFV considered include scale, travel distance, hydrologic conditions, site factors, macropore morphologies, soil texture, and land use. The results show that MFV is about 2-3 orders of magnitude larger than the corresponding values of saturated hydraulic conductivity. MFV is much larger at the trench and hillslope scales than at the field profile and soil core scales and shows a significant positive correlation with travel distance. Generally, higher irrigation intensity tends to trigger faster MFV, especially at the field profile scale, where MFV and irrigation intensity have a significant positive correlation. At the trench and hillslope scales, the presence of large macropores (diameter > 10 mm) is a key factor determining MFV. The geometric mean of MFV for sites with large macropores was found to be about 8 times larger than that for sites without large macropores. For sites with large macropores, MFV increases with macropore diameter. However, no noticeable difference in MFV has been observed among different soil textures and land uses. Comparing the existing equations used to describe MFV, the Poiseuille equation significantly overestimated the observed values, while the Manning-type equations generate reasonable values. The insights from this study will shed light on future field campaigns and modeling of macropore flow.
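For orientation, the two classes of equations mentioned can be written in their textbook forms; the exact variants used across the 76 surveyed articles may differ.

```latex
% Hagen-Poiseuille mean velocity for gravity-driven laminar flow in a
% vertical cylindrical macropore of diameter d (fluid density rho,
% dynamic viscosity mu):
v_{\mathrm{P}} = \frac{\rho\, g\, d^{2}}{32\,\mu}
% Manning-type relation with hydraulic radius R, energy slope S and
% roughness coefficient n:
v_{\mathrm{M}} = \frac{1}{n}\, R^{2/3}\, S^{1/2}
```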
SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters.
Wang, Chunlin; Lefkowitz, Elliot J
2004-10-28
Large-scale sequence comparison is a powerful tool for biological inference in modern molecular biology. Comparing new sequences to those in annotated databases is a useful source of functional and structural information about these sequences. Using software such as the basic local alignment search tool (BLAST) or HMMPFAM to identify statistically significant matches between newly sequenced segments of genetic material and those in databases is an important task for most molecular biologists. Searching algorithms are intrinsically slow and data-intensive, especially in light of the rapid growth of biological sequence databases due to the emergence of high throughput DNA sequencing techniques. Thus, traditional bioinformatics tools are impractical on PCs and even on dedicated UNIX servers. To take advantage of larger databases and more reliable methods, high performance computation becomes necessary. We describe the implementation of SS-Wrapper (Similarity Search Wrapper), a package of wrapper applications that can parallelize similarity search applications on a Linux cluster. Our wrapper utilizes a query segmentation-search (QS-search) approach to parallelize sequence database search applications. It takes into consideration load balancing between each node on the cluster to maximize resource usage. QS-search is designed to wrap many different search tools, such as BLAST and HMMPFAM using the same interface. This implementation does not alter the original program, so newly obtained programs and program updates should be accommodated easily. Benchmark experiments using QS-search to optimize BLAST and HMMPFAM showed that QS-search accelerated the performance of these programs almost linearly in proportion to the number of CPUs used. We have also implemented a wrapper that utilizes a database segmentation approach (DS-BLAST) that provides a complementary solution for BLAST searches when the database is too large to fit into the memory of a single node. Used together, QS-search and DS-BLAST provide a flexible solution to adapt sequential similarity searching applications in high performance computing environments. Their ease of use and their ability to wrap a variety of database search programs provide an analytical architecture to assist both the seasoned bioinformaticist and the wet-bench biologist.
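The query-segmentation idea is easy to sketch: split a multi-FASTA query into chunks and run one search process per chunk before the outputs are merged. The toy below assumes the modern BLAST+ blastp command line (the paper itself predates BLAST+ and wraps the tools through its own interface), and it launches local processes where SS-Wrapper would dispatch cluster jobs.

```python
# Toy illustration of query segmentation (QS-search style), not SS-Wrapper's
# actual code. Assumes the BLAST+ 'blastp' executable is on PATH.
import subprocess
from pathlib import Path

def split_fasta(fasta_path, n_chunks, out_dir):
    """Round-robin the records of a multi-FASTA file into n chunk files."""
    records = Path(fasta_path).read_text().split(">")[1:]
    paths = []
    for i in range(n_chunks):
        p = Path(out_dir) / f"chunk_{i}.fasta"
        p.write_text("".join(">" + r for r in records[i::n_chunks]))
        paths.append(p)
    return paths

def qs_search(fasta_path, db, n_chunks, out_dir):
    """Run one blastp per chunk in parallel; a cluster would use scheduler jobs."""
    procs = [subprocess.Popen(["blastp", "-query", str(p), "-db", db,
                               "-out", str(p.with_suffix(".out"))])
             for p in split_fasta(fasta_path, n_chunks, out_dir)]
    for proc in procs:
        proc.wait()
```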
Chaplin, Beth; Meloni, Seema; Eisen, Geoffrey; Jolayemi, Toyin; Banigbe, Bolanle; Adeola, Juliette; Wen, Craig; Reyes Nieva, Harry; Chang, Charlotte; Okonkwo, Prosper; Kanki, Phyllis
2015-01-01
The implementation of PEPFAR programs in resource-limited settings was accompanied by the need to document patient care on a scale unprecedented in environments where paper-based records were the norm. We describe the development of an electronic medical records system (EMRS) put in place at the beginning of a large HIV/AIDS care and treatment program in Nigeria. Databases were created to record laboratory results, medications prescribed and dispensed, and clinical assessments, using a relational database program. A collection of stand-alone files recorded different elements of patient care, linked together by utilities that aggregated data on national standard indicators, assessed patient care for quality improvement, tracked patients requiring follow-up, generated counts of ART regimens dispensed, and provided 'snapshots' of a patient's response to treatment. A secure server was used to store patient files for backup and transfer. By February 2012, when the program transitioned to local in-country management by APIN, the EMRS was used in 33 hospitals across the country, with 4,947,433 adult, pediatric and PMTCT records that had been created and that remained available for use in patient care. Ongoing trainings for data managers, along with an iterative process of implementing changes to the databases and forms based on user feedback, were needed. As the program scaled up and the volume of laboratory tests increased, results were produced in a digital format, wherever possible, that could be automatically transferred to the EMRS. Many larger clinics began to link some or all of the databases to local area networks, making them available to a larger group of staff members, or providing the ability to enter information simultaneously where needed. The EMRS improved patient care, enabled efficient reporting to the Government of Nigeria and to U.S. funding agencies, and allowed program managers and staff to conduct quality control audits. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Torgerson, Carinna M; Quinn, Catherine; Dinov, Ivo; Liu, Zhizhong; Petrosyan, Petros; Pelphrey, Kevin; Haselgrove, Christian; Kennedy, David N; Toga, Arthur W; Van Horn, John Darrell
2015-03-01
Under the umbrella of the National Database for Clinical Trials (NDCT) related to mental illnesses, the National Database for Autism Research (NDAR) seeks to gather, curate, and make openly available neuroimaging data from NIH-funded studies of autism spectrum disorder (ASD). NDAR has recently made its database accessible through the LONI Pipeline workflow design and execution environment to enable large-scale analyses of cortical architecture and function via local, cluster, or "cloud"-based computing resources. This presents a unique opportunity to overcome many of the customary limitations to fostering biomedical neuroimaging as a science of discovery. Providing open access to primary neuroimaging data, workflow methods, and high-performance computing will increase uniformity in data collection protocols, encourage greater reliability of published data and replication of results, and broaden the range of researchers now able to perform larger studies than ever before. To illustrate the use of NDAR and LONI Pipeline for several common neuroimaging processing steps and analyses, this paper presents example workflows useful for ASD neuroimaging researchers seeking to begin using this valuable combination of online data and computational resources. We discuss the utility of such database and workflow processing interactivity as a motivation for the sharing of additional primary data in ASD research and elsewhere.
Li, J L; Deng, H; Lai, D B; Xu, F; Chen, J; Gao, G; Recker, R R; Deng, H W
2001-07-01
To efficiently manipulate large amounts of genotype data generated with fluorescently labeled dinucleotide markers, we developed a Microsoft Access-based database management system. The system offers several advantages. First, it accommodates the dynamic nature of the accumulation of genotype data during the genotyping process; some data need to be confirmed or replaced by repeated lab procedures. With the system, raw genotype data can be imported easily and continuously and incorporated into the database during a genotyping process that may continue over an extended period of time in large projects. Second, almost all of the procedures are automatic, including auto-comparison of the raw data read by different technicians from the same gel, auto-adjustment among the allele fragment-size data from cross-runs or cross-platforms, auto-binning of alleles, and auto-compilation of genotype data for suitable programs to perform inheritance checks in pedigrees. Third, the system provides functions to track electrophoresis gel files to locate gel or sample sources for any resultant genotype data, which is extremely helpful for double-checking the consistency of raw and final data and for directing repeat experiments. In addition, the user-friendly graphical interface renders the processing of large amounts of data much less labor-intensive. Furthermore, the system has built-in mechanisms to detect some genotyping errors and to assess the quality of genotype data, which are then summarized in automatically generated statistical reports. The system can easily handle >500,000 genotype data entries, a number more than sufficient for typical whole-genome linkage studies. The modules and programs we developed can be extended to other database platforms, such as Microsoft SQL Server, if the capability to handle still greater quantities of genotype data simultaneously is desired.
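As one concrete illustration of the automation described, allele auto-binning can be sketched as snapping measured fragment sizes onto a 2-bp ladder for a dinucleotide marker. The anchoring rule and tolerance below are our assumptions, not the published algorithm.

```python
# Hypothetical sketch of allele auto-binning for a dinucleotide (2-bp) marker:
# snap noisy fragment-size reads onto a 2-bp ladder; None means "no call".
import numpy as np

def bin_alleles(fragment_sizes, repeat=2, tol=0.6):
    sizes = np.asarray(fragment_sizes, dtype=float)
    ref = sizes.min()                   # anchor the ladder at the smallest allele
    ladder = ref + np.round((sizes - ref) / repeat) * repeat
    return [int(round(l)) if abs(s - l) <= tol else None
            for s, l in zip(sizes, ladder)]

print(bin_alleles([150.1, 152.0, 149.8, 153.4, 151.0]))
# -> [150, 152, 150, 154, None]; 151.0 falls between ladder rungs
```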
Melloy, Patricia G
2015-01-01
A two-part laboratory exercise was developed to enhance classroom instruction on the significance of p53 mutations in cancer development. Students were asked to mine key information from an international database of p53 genetic changes related to cancer, the IARC TP53 database. Using this database, students designed several data mining activities to look at the changes in the p53 gene from a number of perspectives, including potential cancer-causing agents leading to particular changes and the prevalence of certain p53 variations in certain cancers. In addition, students gained a global perspective on cancer prevalence in different parts of the world. Students learned how to use the database in the first part of the exercise, and then used that knowledge to search particular cancers and cancer-causing agents of their choosing in the second part of the exercise. Students also connected the information gathered from the p53 exercise to a previous laboratory exercise looking at risk factors for cancer development. The goal of the experience was to increase student knowledge of the link between p53 genetic variation and cancer. Students also were able to walk a similar path through the website as a cancer researcher using the database to enhance bench work-based experiments with complementary large-scale database p53 variation information. © 2014 The International Union of Biochemistry and Molecular Biology.
Spatial adaptive sampling in multiscale simulation
NASA Astrophysics Data System (ADS)
Rouet-Leduc, Bertrand; Barros, Kipton; Cieren, Emmanuel; Elango, Venmugil; Junghans, Christoph; Lookman, Turab; Mohd-Yusof, Jamaludin; Pavel, Robert S.; Rivera, Axel Y.; Roehm, Dominic; McPherson, Allen L.; Germann, Timothy C.
2014-07-01
In a common approach to multiscale simulation, an incomplete set of macroscale equations must be supplemented with constitutive data provided by fine-scale simulation. Collecting statistics from these fine-scale simulations is typically the overwhelming computational cost. We reduce this cost by interpolating the results of fine-scale simulation over the spatial domain of the macro-solver. Unlike previous adaptive sampling strategies, we do not interpolate on the potentially very high dimensional space of inputs to the fine-scale simulation. Our approach is local in space and time, avoids the need for a central database, and is designed to parallelize well on large computer clusters. To demonstrate our method, we simulate one-dimensional elastodynamic shock propagation using the Heterogeneous Multiscale Method (HMM); we find that spatial adaptive sampling requires only ≈ 50 × N^0.14 fine-scale simulations to reconstruct the stress field at all N grid points. Related multiscale approaches, such as Equation Free methods, may also benefit from spatial adaptive sampling.
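A one-dimensional caricature of the sampling loop, under our own simplifying assumptions: a distance-based acceptance rule instead of the paper's error model, and linear interpolation instead of its reconstruction scheme.

```python
# 1-D caricature of spatial adaptive sampling: reuse nearby fine-scale
# results when close enough, otherwise run the expensive fine-scale model.
import numpy as np

def sample_field(x_grid, fine_model, max_gap=0.05):
    xs, ys = [x_grid[0]], [fine_model(x_grid[0])]   # seed with one evaluation
    out = np.empty_like(x_grid)
    for i, x in enumerate(x_grid):
        if min(abs(np.asarray(xs) - x)) > max_gap:  # no sample nearby:
            xs.append(x)
            ys.append(fine_model(x))                # fall back to fine scale
        order = np.argsort(xs)
        out[i] = np.interp(x, np.asarray(xs)[order], np.asarray(ys)[order])
    return out

field = sample_field(np.linspace(0.0, 1.0, 200), fine_model=np.sin)
```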
Discriminative Hierarchical K-Means Tree for Large-Scale Image Classification.
Chen, Shizhi; Yang, Xiaodong; Tian, Yingli
2015-09-01
A key challenge in large-scale image classification is how to achieve efficiency in terms of both computation and memory without compromising classification accuracy. The learning-based classifiers achieve the state-of-the-art accuracies, but have been criticized for computational complexity that grows linearly with the number of classes. The nonparametric nearest neighbor (NN)-based classifiers naturally handle large numbers of categories, but incur prohibitively expensive computation and memory costs. In this brief, we present a novel classification scheme, i.e., the discriminative hierarchical K-means tree (D-HKTree), which combines the advantages of both learning-based and NN-based classifiers. The complexity of the D-HKTree only grows sublinearly with the number of categories, which is much better than the recent hierarchical support vector machines-based methods. The memory requirement is an order of magnitude less than the recent Naïve Bayesian NN-based approaches. The proposed D-HKTree classification scheme is evaluated on several challenging benchmark databases and achieves the state-of-the-art accuracies, with significantly lower computation cost and memory requirements.
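The data structure behind the method is easy to sketch. Below is a plain, non-discriminative hierarchical k-means tree: queries descend one branch per level, so lookup cost grows sublinearly with dataset size. The discriminative node re-weighting of the published method is omitted, and all parameters are illustrative.

```python
# Schematic hierarchical k-means tree (without the discriminative part of
# D-HKTree). y must contain non-negative integer class labels.
import numpy as np
from sklearn.cluster import KMeans

class HKTree:
    def __init__(self, X, y, k=4, min_size=20, depth=0, max_depth=6):
        self.leaf = len(X) <= min_size or depth >= max_depth
        if self.leaf:
            # majority class of the leaf; -1 marks an empty branch
            self.label = int(np.bincount(y).argmax()) if len(y) else -1
            return
        self.km = KMeans(n_clusters=k, n_init=4).fit(X)
        self.children = [HKTree(X[self.km.labels_ == c], y[self.km.labels_ == c],
                                k, min_size, depth + 1, max_depth)
                         for c in range(k)]

    def classify(self, x):
        node = self
        while not node.leaf:                       # one branch per level
            node = node.children[node.km.predict(x.reshape(1, -1))[0]]
        return node.label
```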
Towards building high performance medical image management system for clinical trials
NASA Astrophysics Data System (ADS)
Wang, Fusheng; Lee, Rubao; Zhang, Xiaodong; Saltz, Joel
2011-03-01
Medical image based biomarkers are being established for therapeutic cancer clinical trials, where image assessment is among the essential tasks. Large scale image assessment is often performed by a large group of experts by retrieving images from a centralized image repository to workstations to markup and annotate images. In such an environment, it is critical to provide a high performance image management system that supports efficient concurrent image retrievals in a distributed environment. There are several major challenges: high throughput of large scale image data over the Internet from the server for multiple concurrent client users, efficient communication protocols for transporting data, and effective management of versioning of data for audit trails. We study the major bottlenecks for such a system, and propose and evaluate a solution that uses hybrid image storage with solid state drives and hard disk drives, RESTful Web Services based protocols for exchanging image data, and a database based versioning scheme for efficient archiving of image revision history. Our experiments show promising results for our methods, and our work provides a guideline for building enterprise level high performance medical image management systems.
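One plausible reading of a database-based versioning scheme is an append-only version table keyed by (image_id, version), so earlier annotations are never overwritten and the audit trail is preserved. The schema and names below are illustrative, not the system's actual design.

```python
# Sketch of append-only annotation versioning for audit trails.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE annotation_versions (
    image_id TEXT, version INTEGER, author TEXT,
    created REAL, payload BLOB,
    PRIMARY KEY (image_id, version))""")

def save_annotation(image_id, author, payload):
    """Insert a new row with the next version number; nothing is overwritten."""
    cur = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM annotation_versions "
        "WHERE image_id = ?", (image_id,))
    version = cur.fetchone()[0]
    conn.execute("INSERT INTO annotation_versions VALUES (?,?,?,?,?)",
                 (image_id, version, author, time.time(), payload))
    return version

save_annotation("scan_001", "reader_a", b"lesion outline v1")
save_annotation("scan_001", "reader_b", b"lesion outline v2")
```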
Cross-lingual neighborhood effects in generalized lexical decision and natural reading.
Dirix, Nicolas; Cop, Uschi; Drieghe, Denis; Duyck, Wouter
2017-06-01
The present study assessed intra- and cross-lingual neighborhood effects, using both a generalized lexical decision task and an analysis of a large-scale bilingual eye-tracking corpus (Cop, Dirix, Drieghe, & Duyck, 2016). Using new neighborhood density and frequency measures, the general lexical decision task yielded an inhibitory cross-lingual neighborhood density effect on reading times of second language words, replicating van Heuven, Dijkstra, and Grainger (1998). Reaction times for native language words were not influenced by neighborhood density or frequency but error rates showed cross-lingual neighborhood effects depending on target word frequency. The large-scale eye movement corpus confirmed effects of cross-lingual neighborhood on natural reading, even though participants were reading a novel in a unilingual context. Especially second language reading and to a lesser extent native language reading were influenced by lexical candidates from the nontarget language, although these effects in natural reading were largely facilitatory. These results offer strong and direct support for bilingual word recognition models that assume language-independent lexical access. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Analysis of the Appropriateness of the Use of Peltier Cells as Energy Sources.
Hájovský, Radovan; Pieš, Martin; Richtár, Lukáš
2016-05-25
The article describes the possibilities of using Peltier cells as an energy source to power telemetry units, which are used in large-scale monitoring systems as central units that collect data from sensors, process them, and send them to the database server. The article describes the various experiments that were carried out, their progress, and their results. Based on the evaluated experiments, the paper also discusses the suitability of various types of Peltier cells depending on the temperature difference between the cold and hot sides.
Evolution of the Tropical Cyclone Integrated Data Exchange And Analysis System (TC-IDEAS)
NASA Technical Reports Server (NTRS)
Turk, J.; Chao, Y.; Haddad, Z.; Hristova-Veleva, S.; Knosp, B.; Lambrigtsen, B.; Li, P.; Licata, S.; Poulsen, W.; Su, H.;
2010-01-01
The Tropical Cyclone Integrated Data Exchange and Analysis System (TC-IDEAS) is being jointly developed by the Jet Propulsion Laboratory (JPL) and the Marshall Space Flight Center (MSFC) as part of NASA's Hurricane Science Research Program. The long-term goal is to create a comprehensive tropical cyclone database of satellite and airborne observations, in-situ measurements and model simulations containing parameters that pertain to the thermodynamic and microphysical structure of the storms; the air-sea interaction processes; and the large-scale environment.
Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna J; Inzé, Dirk; Van de Peer, Yves
2013-03-01
Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.
Brawer, Peter A; Martielli, Richard; Pye, Patrice L; Manwaring, Jamie; Tierney, Anna
2010-06-01
The primary care health setting is in crisis. Increasing demand for services, with dwindling numbers of providers, has resulted in decreased access and decreased satisfaction for both patients and providers. Moreover, the overwhelming majority of primary care visits are for behavioral and mental health concerns rather than issues of a purely medical etiology. Integrated-collaborative models of health care delivery offer possible solutions to this crisis. The purpose of this article is to review the existing data available after 2 years of the St. Louis Initiative for Integrated Care Excellence (SLI(2)CE), an example of integrated-collaborative care on a large-scale model within a regional Veterans Affairs Health Care System. There is clear evidence that the SLI(2)CE initiative rather dramatically increased access to health care and modified primary care practitioners' willingness to address mental health issues within the primary care setting. In addition, the data suggest strong fidelity to a model of integrated-collaborative care which has been successful in the past. Integrated-collaborative care offers unique advantages over the traditional view and practice of medical care. Through careful implementation and practice, success is possible on a large scale. PsycINFO Database Record (c) 2010 APA, all rights reserved.
Jaeger, Sébastien; Thieffry, Denis
2017-01-01
Transcription factor (TF) databases contain multitudes of binding motifs (TFBMs) from various sources, from which non-redundant collections are derived by manual curation. The advent of high-throughput methods stimulated the production of novel collections with increasing numbers of motifs. Meta-databases, built by merging these collections, contain redundant versions, because available tools are not suited to automatically identify and explore biologically relevant clusters among thousands of motifs. Motif discovery from genome-scale data sets (e.g. ChIP-seq) also produces redundant motifs, hampering the interpretation of results. We present matrix-clustering, a versatile tool that clusters similar TFBMs into multiple trees, and automatically creates non-redundant TFBM collections. A feature unique to matrix-clustering is its dynamic visualisation of aligned TFBMs, and its capability to simultaneously treat multiple collections from various sources. We demonstrate that matrix-clustering considerably simplifies the interpretation of combined results from multiple motif discovery tools, and highlights biologically relevant variations of similar motifs. We also ran a large-scale application to cluster ∼11 000 motifs from 24 entire databases, showing that matrix-clustering correctly groups motifs belonging to the same TF families, and drastically reduced motif redundancy. matrix-clustering is integrated within the RSAT suite (http://rsat.eu/), accessible through a user-friendly web interface or command-line for its integration in pipelines. PMID:28591841
Introducing GFWED: The Global Fire Weather Database
NASA Technical Reports Server (NTRS)
Field, R. D.; Spessa, A. C.; Aziz, N. A.; Camia, A.; Cantin, A.; Carr, R.; de Groot, W. J.; Dowdy, A. J.; Flannigan, M. D.; Manomaiphiboon, K.;
2015-01-01
The Canadian Forest Fire Weather Index (FWI) System is the most widely used fire danger rating system in the world. We have developed a global database of daily FWI System calculations, beginning in 1980, called the Global Fire WEather Database (GFWED), gridded to a spatial resolution of 0.5° latitude by 2/3° longitude. Input weather data were obtained from the NASA Modern Era Retrospective-Analysis for Research and Applications (MERRA), along with two different estimates of daily precipitation from rain gauges over land. FWI System Drought Code (DC) calculations from the gridded data sets were compared to calculations from individual weather station data for a representative set of 48 stations in North, Central and South America, Europe, Russia, Southeast Asia and Australia. Agreement between gridded calculations and the station-based calculations tended to differ most at low latitudes for strictly MERRA-based calculations. Strong biases could be seen in either direction: MERRA DC over the Mato Grosso in Brazil reached unrealistically high values exceeding 1500 during the dry season, but was too low over Southeast Asia during the dry season. These biases are consistent with those previously identified in MERRA's precipitation, and they reinforce the need to consider alternative sources of precipitation data. GFWED can be used for analyzing historical relationships between fire weather and fire activity at continental and global scales, for identifying large-scale atmosphere-ocean controls on fire weather, and for calibration of FWI-based fire prediction models.
ArrayBridge: Interweaving declarative array processing with high-performance computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xing, Haoyuan; Floratos, Sofoklis; Blanas, Spyros
Scientists are increasingly turning to datacenter-scale computers to produce and analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC kernels even for the most mundane queries. This impedance mismatch has been partly attributed to the cumbersome data loading process; in response, the database community has proposed in situ mechanisms to access data in scientific file formats. Scientists, however, desire more than a passive access method that reads arrays from files. This paper describes ArrayBridge, a bi-directional array view mechanism for scientific file formats, that aims to make declarative array manipulations interoperable with imperative file-centric analyses. Our prototype implementation of ArrayBridge uses HDF5 as the underlying array storage library and seamlessly integrates into the SciDB open-source array database system. In addition to fast querying over external array objects, ArrayBridge produces arrays in the HDF5 file format just as easily as it can read from it. ArrayBridge also supports time travel queries from imperative kernels through the unmodified HDF5 API, and automatically deduplicates between array versions for space efficiency. Our extensive performance evaluation in NERSC, a large-scale scientific computing facility, shows that ArrayBridge exhibits statistically indistinguishable performance and I/O scalability to the native SciDB storage engine.
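The in situ access pattern that motivates ArrayBridge can be shown with plain h5py: a consumer slices a tile straight out of an HDF5 dataset with no bulk-loading step. The dataset name and sizes here are made up for the demonstration.

```python
# In situ HDF5 access with h5py: slice a subset without loading the array.
import h5py
import numpy as np

with h5py.File("simulation.h5", "w") as f:          # create a demo file
    f.create_dataset("temperature", data=np.random.rand(1000, 1000))

with h5py.File("simulation.h5", "r") as f:
    tile = f["temperature"][100:200, 300:400]       # reads only this tile
    print(tile.mean())
```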
Data management for community research projects: A JGOFS case study
NASA Technical Reports Server (NTRS)
Lowry, Roy K.
1992-01-01
Since the mid 1980s, much of the marine science research effort in the United Kingdom has been focused into large scale collaborative projects involving public sector laboratories and university departments, termed Community Research Projects. Two of these, the Biogeochemical Ocean Flux Study (BOFS) and the North Sea Project incorporated large scale data collection to underpin multidisciplinary modeling efforts. The challenge of providing project data sets to support the science was met by a small team within the British Oceanographic Data Centre (BODC) operating as a topical data center. The role of the data center was to both work up the data from the ship's sensors and to combine these data with sample measurements into online databases. The working up of the data was achieved by a unique symbiosis between data center staff and project scientists. The project management, programming and data processing skills of the data center were combined with the oceanographic experience of the project communities to develop a system which has produced quality controlled, calibrated data sets from 49 research cruises in 3.5 years of operation. The data center resources required to achieve this were modest and far outweighed by the time liberated in the scientific community by the removal of the data processing burden. Two online project databases have been assembled containing a very high proportion of the data collected. As these are under the control of BODC their long term availability as part of the UK national data archive is assured. The success of the topical data center model for UK Community Research Project data management has been founded upon the strong working relationships forged between the data center and project scientists. These can only be established by frequent personal contact and hence the relatively small size of the UK has been a critical factor. However, projects covering a larger, even international scale could be successfully supported by a network of topical data centers managing online databases which are interconnected by object oriented distributed data management systems over wide area networks.
CoryneRegNet 4.0 – A reference database for corynebacterial gene regulatory networks
Baumbach, Jan
2007-01-01
Background Detailed information on DNA-binding transcription factors (the key players in the regulation of gene expression) and on transcriptional regulatory interactions of microorganisms deduced from literature-derived knowledge, computer predictions and global DNA microarray hybridization experiments, has opened the way for the genome-wide analysis of transcriptional regulatory networks. The large-scale reconstruction of these networks allows the in silico analysis of cell behavior in response to changing environmental conditions. We previously published CoryneRegNet, an ontology-based data warehouse of corynebacterial transcription factors and regulatory networks. Initially, it was designed to provide methods for the analysis and visualization of the gene regulatory network of Corynebacterium glutamicum. Results Now we introduce CoryneRegNet release 4.0, which integrates data on the gene regulatory networks of 4 corynebacteria, 2 mycobacteria and the model organism Escherichia coli K12. As the previous versions, CoryneRegNet provides a web-based user interface to access the database content, to allow various queries, and to support the reconstruction, analysis and visualization of regulatory networks at different hierarchical levels. In this article, we present the further improved database content of CoryneRegNet along with novel analysis features. The network visualization feature GraphVis now allows the inter-species comparisons of reconstructed gene regulatory networks and the projection of gene expression levels onto that networks. Therefore, we added stimulon data directly into the database, but also provide Web Service access to the DNA microarray analysis platform EMMA. Additionally, CoryneRegNet now provides a SOAP based Web Service server, which can easily be consumed by other bioinformatics software systems. Stimulons (imported from the database, or uploaded by the user) can be analyzed in the context of known transcriptional regulatory networks to predict putative contradictions or further gene regulatory interactions. Furthermore, it integrates protein clusters by means of heuristically solving the weighted graph cluster editing problem. In addition, it provides Web Service based access to up to date gene annotation data from GenDB. Conclusion The release 4.0 of CoryneRegNet is a comprehensive system for the integrated analysis of procaryotic gene regulatory networks. It is a versatile systems biology platform to support the efficient and large-scale analysis of transcriptional regulation of gene expression in microorganisms. It is publicly available at . PMID:17986320
MAGA, a new database of gas natural emissions: a collaborative web environment for collecting data.
NASA Astrophysics Data System (ADS)
Cardellini, Carlo; Chiodini, Giovanni; Frigeri, Alessandro; Bagnato, Emanuela; Frondini, Francesco; Aiuppa, Alessandro
2014-05-01
The data on volcanic and non-volcanic gas emissions available online are, as of today, incomplete and, most importantly, fragmentary. Hence, there is a need for common frameworks to aggregate available data, in order to characterize and quantify the phenomena at various scales. A new and detailed web database (MAGA: MApping GAs emissions) has been developed, and recently improved, to collect data on carbon degassing from volcanic and non-volcanic environments. The MAGA database allows researchers to insert data interactively and dynamically into a spatially referenced relational database management system, as well as to extract data. MAGA kicked off with the database set-up and with the ingestion into the database of data from: i) a literature survey of publications on volcanic gas fluxes, including data on active crater degassing, diffuse soil degassing and fumaroles both from dormant closed-conduit volcanoes (e.g., Vulcano, Phlegrean Fields, Santorini, Nisyros, Teide, etc.) and open-vent volcanoes (e.g., Etna, Stromboli, etc.) in the Mediterranean area and the Azores, and ii) the revision and update of the Googas database on non-volcanic emissions of the Italian territory (Chiodini et al., 2008), in the framework of the Deep Earth Carbon Degassing (DECADE) research initiative of the Deep Carbon Observatory (DCO). For each geo-located gas emission site, the database holds images and descriptions of the site and of the emission type (e.g., diffuse emission, plume, fumarole, etc.), gas chemical-isotopic composition (when available), gas temperature and gas flux magnitudes. Gas sampling, analysis and flux measurement methods are also reported, together with references and contacts for researchers expert on each site. In this phase data can be accessed on the network from a web interface, and a data-driven web service, through which software clients can request data directly from the database, is planned to be implemented shortly. This way Geographical Information Systems (GIS) and Virtual Globes (e.g., Google Earth) could easily access the database, and data could be exchanged with other databases. At the moment the database includes: i) more than 1000 flux data points on volcanic plume degassing from the Etna and Stromboli volcanoes; ii) data from ~30 sites of diffuse soil degassing from the Neapolitan volcanoes, Azores, Canary Islands, Etna, Stromboli, and Vulcano Island, plus several data on fumarolic emissions (~7 sites) with CO2 fluxes; and iii) data from ~270 non-volcanic gas emission sites in Italy. We believe the MAGA database is an important starting point for developing a large-scale, expandable database aimed to excite, inspire, and encourage participation among researchers. In addition, the possibility to archive location and qualitative information for gas emission sites not yet investigated could stimulate the scientific community toward future research and will provide an indication of the current uncertainty in global estimates of deep carbon fluxes.
Rousselet, Jérôme; Imbert, Charles-Edouard; Dekri, Anissa; Garcia, Jacques; Goussard, Francis; Vincent, Bruno; Denux, Olivier; Robinet, Christelle; Dorkeld, Franck; Roques, Alain; Rossi, Jean-Pierre
2013-01-01
Mapping species spatial distributions using spatial inference and prediction requires a lot of data. Occurrence data are generally not easily available from the literature and are very time-consuming to collect in the field. For that reason, we designed a survey to explore to what extent large-scale databases such as Google Maps and Google Street View could be used to derive valid occurrence data. We worked with the Pine Processionary Moth (PPM) Thaumetopoea pityocampa because the larvae of that moth build silk nests that are easily visible. The presence of the species at one location can therefore be inferred from visual records derived from the panoramic views available from Google Street View. We designed a standardized procedure allowing us to evaluate the presence of the PPM on a sampling grid covering the landscape under study. The outputs were compared to field data. We investigated two landscapes using grids of different extent and mesh size. Data derived from Google Street View were highly similar to field data in the large-scale analysis based on a square grid with a mesh of 16 km (96% of matching records). Using a 2 km mesh size led to a strong divergence between field and Google-derived data (46% of matching records). We conclude that the Google database might provide useful occurrence data for mapping the distribution of species whose presence can be visually evaluated, such as the PPM. However, the accuracy of the output strongly depends on the spatial scales considered and on the sampling grid used. Other factors, such as the coverage of the Google Street View network with regard to sampling grid size and the spatial distribution of host trees with regard to the road network, may also be determinant.
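The matching-records comparison can be sketched as follows, assuming point coordinates in kilometres and treating a grid cell as "present" when it contains at least one nest record; this is our reconstruction, not the authors' code.

```python
# Sketch of the grid-based comparison of Google-derived vs. field records:
# fraction of cells where the two presence/absence rasters agree.
import numpy as np

def matching_rate(gsv_points, field_points, extent, mesh_km):
    """extent = (xmin, xmax, ymin, ymax) in km; returns fraction of agreeing cells."""
    xmin, xmax, ymin, ymax = extent
    xbins = np.arange(xmin, xmax + mesh_km, mesh_km)
    ybins = np.arange(ymin, ymax + mesh_km, mesh_km)

    def presence(points):
        pts = np.asarray(points, dtype=float)
        h, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=[xbins, ybins])
        return h > 0                      # cell is "present" if any record falls in it

    agree = presence(gsv_points) == presence(field_points)
    return agree.mean()
```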
Crowd-Sourcing with K-12 citizen scientists: The Continuing Evolution of the GLOBE Program
NASA Astrophysics Data System (ADS)
Murphy, T.; Wegner, K.; Andersen, T. J.
2016-12-01
Twenty years ago, the Internet was still in its infancy, citizen science was a relatively unknown term, and the idea of a global citizen science database was unheard of. Then the Global Learning and Observations to Benefit the Environment (GLOBE) Program was proposed, and this all changed. GLOBE was one of the first K-12 citizen science programs on a global scale. An initial large-scale ramp-up of the program was followed by the establishment of a network of partners in countries and within the U.S. Now in the 21st century, the program has over 50 protocols in atmosphere, biosphere, hydrosphere and pedosphere, almost 140 million measurements in the database, a visualization system, collaborations with NASA satellite mission scientists (GPM, SMAP) and other scientists, as well as research projects by GLOBE students. As technology changed over the past two decades, it was integrated into the program's outreach efforts to existing and new members, with the result that the program now has a strong social media presence. In 2016, a new app was launched which opened up GLOBE and data entry to citizen scientists of all ages. The app is aimed at fresh audiences beyond the traditional GLOBE K-12 community. Groups targeted include scouting organizations, museums, 4H, science learning centers, retirement communities, etc., to broaden participation in the program and increase the amount of data available to students and scientists. Through the 20 years of GLOBE, lessons have been learned about managing this type of large-scale program, using technology to enhance and improve the experience for members, and increasing community involvement in the program.
NASA Astrophysics Data System (ADS)
Cheng, Tao; Rivard, Benoit; Sánchez-Azofeifa, Arturo G.; Féret, Jean-Baptiste; Jacquemoud, Stéphane; Ustin, Susan L.
2014-01-01
Leaf mass per area (LMA), the ratio of leaf dry mass to leaf area, is a trait of central importance to the understanding of plant light capture and carbon gain. It can be estimated from leaf reflectance spectroscopy in the infrared region, by making use of information about the absorption features of dry matter. This study reports on the application of continuous wavelet analysis (CWA) to the estimation of LMA across a wide range of plant species. We compiled a large database of leaf reflectance spectra acquired within the framework of three independent measurement campaigns (ANGERS, LOPEX and PANAMA) and generated a simulated database using the PROSPECT leaf optical properties model. CWA was applied to the measured and simulated databases to extract wavelet features that correlate with LMA. These features were assessed in terms of predictive capability and robustness while transferring predictive models from the simulated database to the measured database. The assessment was also conducted with two existing spectral indices, namely the Normalized Dry Matter Index (NDMI) and the Normalized Difference index for LMA (NDLMA). Five common wavelet features were determined from the two databases, which showed significant correlations with LMA (R2: 0.51-0.82, p < 0.0001). The best robustness (R2 = 0.74, RMSE = 18.97 g/m2 and Bias = 0.12 g/m2) was obtained using a combination of two low-scale features (1639 nm, scale 4) and (2133 nm, scale 5), the first being predominantly important. The transferability of the wavelet-based predictive model to the whole measured database was either better than or comparable to those based on spectral indices. Additionally, only the wavelet-based model showed consistent predictive capabilities among the three measured data sets. In comparison, the models based on spectral indices were sensitive to site-specific data sets. Integrating the NDLMA spectral index and the two robust wavelet features improved the LMA prediction. One of the bands used by this spectral index, 1368 nm, was located in a strong atmospheric water absorption region and replacing it with the next available band (1340 nm) led to lower predictive accuracies. However, the two wavelet features were not affected by data quality in the atmospheric absorption regions and therefore showed potential for canopy-level investigations. The wavelet approach provides a different perspective into spectral responses to LMA variation than the traditional spectral indices and holds greater promise for implementation with airborne or spaceborne imaging spectroscopy data for mapping canopy foliar dry biomass.
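A hedged sketch of the feature-extraction step with PyWavelets: decompose a reflectance spectrum by a continuous wavelet transform and read off the coefficients at the two robust features reported above. Mapping "scale 4" and "scale 5" to dyadic scales 2^4 and 2^5, the Mexican hat mother wavelet, and 1 nm sampling from 350-2500 nm are all our assumptions, not the study's stated configuration.

```python
# Assumed sketch of continuous wavelet analysis (CWA) feature extraction.
import numpy as np
import pywt

def wavelet_features(reflectance, wavelengths,
                     features=((1639, 4), (2133, 5))):
    """Return CWT coefficients at (band_nm, dyadic_scale) feature positions."""
    scales = [2 ** s for _, s in features]
    coeffs, _ = pywt.cwt(reflectance, scales=scales, wavelet="mexh")
    return [coeffs[i, np.argmin(np.abs(np.asarray(wavelengths) - band))]
            for i, (band, _) in enumerate(features)]

wl = np.arange(350, 2501)                       # 1 nm sampling grid
spectrum = np.exp(-((wl - 1650) / 400.0) ** 2)  # synthetic reflectance spectrum
print(wavelet_features(spectrum, wl))
```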
Dvornyk, Volodymyr; Long, Ji-Rong; Xiong, Dong-Hai; Liu, Peng-Yuan; Zhao, Lan-Juan; Shen, Hui; Zhang, Yuan-Yuan; Liu, Yong-Jun; Rocha-Sanchez, Sonia; Xiao, Peng; Recker, Robert R; Deng, Hong-Wen
2004-01-01
Background: Public SNP databases are frequently used to choose SNPs for candidate genes in association and linkage studies of complex disorders. However, their utility for such studies of diseases with an ethnic-dependent background has never been evaluated. Results: To estimate the accuracy and completeness of public SNP databases, we analyzed the allele frequencies of 41 SNPs in 10 candidate genes for obesity and/or osteoporosis in a large American-Caucasian sample (1,873 individuals from 405 nuclear families) by PCR-invader assay. We compared our results with those from the databases and other published studies. Of the 41 SNPs, 8 were monomorphic in our sample. Twelve were reported for the first time for Caucasians, and the other 29 SNPs in our sample essentially confirmed the respective allele frequencies for Caucasians in the databases and previous studies. The comparison of our data with other ethnic groups showed significant differentiation between the three major world ethnic groups at some SNPs (Caucasians and Africans differed at 3 of the 18 shared SNPs, and Caucasians and Asians differed at 13 of the 22 shared SNPs). This genetic differentiation may have important implications for studying the well-known ethnic differences in the prevalence of obesity and osteoporosis, and complex disorders in general. Conclusion: A comparative analysis of the SNP data of the candidate genes obtained in the present study, as well as those retrieved from the public domain, suggests that the databases may currently have serious limitations for studying complex disorders with an ethnic-dependent background, owing to the incomplete and uneven representation of the candidate SNPs in the databases for the major ethnic groups. This conclusion underscores the pressing need for large-scale, accurate characterization of these SNPs in different ethnic groups. PMID:15113403
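The between-group comparisons reported above boil down to testing allele-count tables between population samples. Here is a minimal sketch using a chi-square contingency test; the counts are invented, and whether the study applied this exact test is an assumption.

```python
from scipy.stats import chi2_contingency

def allele_freq_test(counts_pop1, counts_pop2):
    """Chi-square test for allele-frequency differentiation at one SNP.

    counts_pop*: (count of allele A, count of allele a) per sample,
    i.e. 2N chromosomes for a population sample of N individuals.
    """
    chi2, p, dof, _expected = chi2_contingency([counts_pop1, counts_pop2])
    return chi2, p

# Invented allele counts for two population samples at one SNP
chi2, p = allele_freq_test((620, 380), (430, 570))
print(f"chi2={chi2:.1f}, p={p:.2e}")
```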
NASA Astrophysics Data System (ADS)
Schrodt, Franziska; Shan, Hanhuai; Fazayeli, Farideh; Karpatne, Anuj; Kattge, Jens; Banerjee, Arindam; Reichstein, Markus; Reich, Peter
2013-04-01
With the advent of remotely sensed data and coordinated efforts to create global databases, the ecological community has become progressively more data-intensive. However, in contrast to other disciplines, statistical methods for handling these large data sets, especially the gaps inherent to them, are lacking. Widely used theoretical approaches, for example model averaging based on Akaike's information criterion (AIC), are sensitive to missing values. Yet the most common way of handling sparse matrices, the deletion of cases with missing data (complete case analysis), is known to severely reduce statistical power and to induce biased parameter estimates. To address these issues, we present novel approaches to gap filling in large ecological data sets using matrix factorization techniques. Factorization-based matrix completion was developed in a recommender-system context and has since been widely used to impute missing data in fields outside the ecological community. Here, we evaluate the effectiveness of probabilistic matrix factorization for imputing missing data in ecological matrices using two imputation techniques. Hierarchical Probabilistic Matrix Factorization (HPMF) effectively incorporates hierarchical phylogenetic information (phylogenetic group, family, genus, species and individual plant) into the trait imputation. Advanced Hierarchical Probabilistic Matrix Factorization (aHPMF), on the other hand, includes climate and soil information in the matrix factorization by regressing the environmental variables against the residuals of the HPMF. One unique opportunity opened up by aHPMF is out-of-sample prediction, where traits can be predicted for specific species at locations different from those sampled in the past. This has potentially far-reaching consequences for the study of global-scale plant functional trait patterns. We test the accuracy and effectiveness of HPMF and aHPMF in filling sparse matrices using the TRY database of plant functional traits (http://www.try-db.org). TRY is one of the largest global compilations of plant trait databases (750 traits of 1 million plants), encompassing data on morphological, anatomical, biochemical, phenological and physiological features of plants. However, despite its unprecedented coverage, the TRY database is still very sparse, severely limiting joint trait analyses. Plant traits are the key to understanding how plants as primary producers adjust to changes in environmental conditions and in turn influence them. Forming the basis for Dynamic Global Vegetation Models (DGVMs), plant traits are also fundamental in global change studies for predicting future ecosystem changes. It is thus imperative that missing data be imputed as accurately and precisely as possible. In this study, we show the advantages and disadvantages of applying probabilistic matrix factorization techniques that incorporate hierarchical and environmental information for the prediction of missing plant traits, as compared to conventional imputation techniques such as the complete case and mean approaches. We discuss the implications of using gap-filled data, as opposed to the above-mentioned conventional techniques, for global-scale studies of plant functional trait-environment relationships, using examples of out-of-sample predictions of foliar nitrogen across several species' ranges and biomes.
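To make the factorization idea concrete, here is a minimal, hypothetical sketch of plain (non-hierarchical) probabilistic matrix factorization used as a gap filler: gradient descent on the observed entries of a species-by-trait matrix with L2-regularized latent factors, i.e. the MAP estimate under Gaussian priors. It omits the phylogenetic hierarchy and environmental regression that distinguish HPMF and aHPMF, and all names, dimensions, and hyperparameters are assumptions.

```python
import numpy as np

def pmf_impute(X, rank=3, lr=0.01, reg=0.1, epochs=2000, seed=0):
    """Impute missing entries (NaN) of X via low-rank matrix factorization.

    Gradient descent minimizes squared error on observed entries plus an
    L2 penalty on the latent factors (the MAP view of probabilistic
    matrix factorization with Gaussian priors).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    mask = ~np.isnan(X)                       # observed entries only
    U = 0.1 * rng.standard_normal((n, rank))  # row (species) factors
    V = 0.1 * rng.standard_normal((m, rank))  # column (trait) factors
    Xf = np.where(mask, X, 0.0)
    for _ in range(epochs):
        E = mask * (Xf - U @ V.T)             # residuals on observed cells
        U_new = U + lr * (E @ V - reg * U)
        V = V + lr * (E.T @ U - reg * V)
        U = U_new
    return np.where(mask, X, U @ V.T)         # keep observations, fill gaps

# Toy species-by-trait matrix with gaps (values invented)
X = np.array([[1.0, 2.0, np.nan, 4.0],
              [1.1, np.nan, 3.1, 4.2],
              [np.nan, 2.2, 3.0, np.nan],
              [0.9, 1.9, np.nan, 3.9],
              [1.0, 2.1, 3.2, 4.1]])
print(pmf_impute(X).round(2))
```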
Accounting for Rainfall Spatial Variability in Prediction of Flash Floods
NASA Astrophysics Data System (ADS)
Saharia, M.; Kirstetter, P. E.; Gourley, J. J.; Hong, Y.; Vergara, H. J.
2016-12-01
Flash floods are a particularly damaging natural hazard worldwide in terms of both fatalities and property damage. In the United States, the lack of a comprehensive database cataloguing information on flash flood timing, location, causative rainfall, and basin geomorphology has hindered broad characterization studies. First, a representative, decade-long archive of more than 20,000 flooding events during 2002-2011 is used to analyze the spatial and temporal variability of flash floods. We also derive a large number of spatially distributed geomorphological and climatological parameters, such as basin area, mean annual precipitation, and basin slope, to identify static basin characteristics that influence flood response. For the same period, the National Severe Storms Laboratory (NSSL) has produced a decadal archive of Multi-Radar/Multi-Sensor (MRMS) radar-only precipitation rates at 1-km spatial resolution and 5-min temporal resolution. This provides an unprecedented opportunity to analyze the impact of event-level precipitation variability on flooding using a big-data approach. To analyze the impact of sub-basin-scale rainfall spatial variability on flooding, indices such as the first and second scaled moments of rainfall and the horizontal and vertical gaps are computed from the MRMS dataset. Flooding characteristics such as rise time, lag time, and peak discharge are then linked to the derived geomorphologic, climatologic, and rainfall indices to identify the basin characteristics that drive flash floods. Next, the model is used to predict flash flooding characteristics over the continental U.S., specifically over regions poorly covered by hydrological observations. So far, studies involving rainfall variability indices have only been performed on a case-study basis, and a large-scale approach is expected to provide deeper insight into how sub-basin-scale precipitation variability affects flooding. Finally, these findings are validated using National Weather Service storm reports and a historical flood fatalities database. This analysis framework will serve as a baseline for evaluating distributed hydrologic model simulations such as the Flooded Locations And Simulated Hydrographs Project (FLASH) (http://flash.ou.edu).
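The first and second scaled moments of rainfall mentioned above are commonly defined, following spatial-moments formulations in the flash flood literature, as the rainfall-weighted mean and variance of flow distance to the outlet, each normalized by its basin-wide counterpart. A minimal sketch under that assumption, with an invented basin and storm (whether the study uses exactly these definitions is an assumption):

```python
import numpy as np

def scaled_rainfall_moments(rain, flow_dist):
    """First and second scaled moments of rainfall over a basin.

    rain      : rainfall accumulation per grid cell
    flow_dist : flow distance from each cell to the basin outlet

    delta1 ~ 1 means rainfall centred at the basin's mean flow distance;
    delta1 > 1 means rainfall concentrated far from the outlet. delta2
    compares the spread of rainfall along the flow network with the
    basin's own spread of flow distances.
    """
    r = rain.ravel().astype(float)
    d = flow_dist.ravel().astype(float)
    w = r / r.sum()                       # rainfall weights over the basin
    d_rain = (w * d).sum()                # rainfall-weighted mean distance
    delta1 = d_rain / d.mean()
    delta2 = (w * (d - d_rain) ** 2).sum() / d.var()
    return delta1, delta2

# Invented basin (flow distances in km) and storm (rainfall in mm)
rng = np.random.default_rng(2)
flow_dist = rng.uniform(0.0, 30.0, size=(50, 50))
rain = rng.gamma(2.0, 5.0, size=(50, 50))
print(scaled_rainfall_moments(rain, flow_dist))
```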
Extracting Databases from Dark Data with DeepDive
Zhang, Ce; Shin, Jaeho; Ré, Christopher; Cafarella, Michael; Niu, Feng
2016-01-01
DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but cannot be exploited by standard relational tools. If the information in dark data (scientific papers, Web classified ads, customer service notes, and so on) were instead in a relational database, it would give analysts a massive and valuable new set of "big data." DeepDive is distinctive among information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that matches that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle, made possible by several core innovations in probabilistic training and inference. PMID:28316365
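Extraction systems of this kind typically bootstrap their probabilistic training with weak supervision from an existing knowledge base. Purely as a generic illustration of that distant-supervision idea (this is not DeepDive's actual API; every name and datum below is invented):

```python
# Known facts from an existing knowledge base (invented)
known_pairs = {("ACME Corp", "Springfield")}

# Candidate relation mentions extracted from text (invented)
candidates = [
    ("ACME Corp", "Springfield", "ACME Corp opened offices in Springfield."),
    ("ACME Corp", "Shelbyville", "ACME Corp denied plans for Shelbyville."),
]

def weak_label(cand):
    """Mark a candidate (entity1, entity2, sentence) as a positive
    training example if the entity pair already appears in the KB."""
    e1, e2, _sentence = cand
    return 1 if (e1, e2) in known_pairs else 0

labels = [weak_label(c) for c in candidates]
print(labels)  # [1, 0] -> noisy training data for a relation classifier
```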
NASA Astrophysics Data System (ADS)
Fiore, Sandro; Williams, Dean; Aloisio, Giovanni
2016-04-01
In many scientific domains, such as climate, data are often n-dimensional and require tools that support specialized data types and primitives to be properly stored, accessed, analysed and visualized. Moreover, new challenges arise in large-scale scenarios and eco-systems where petabytes (PB) of data may be available and data may be distributed and/or replicated (e.g., the Earth System Grid Federation (ESGF) serving the Coupled Model Intercomparison Project, Phase 5 (CMIP5) experiment, which provides access to 2.5 PB of data for the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report (AR5)). Most of the tools currently available for scientific data analysis in the climate domain fail at large scale because they: (1) are desktop-based and need the data locally; (2) are sequential, and so do not benefit from available multicore/parallel machines; (3) do not provide declarative languages to express scientific data analysis tasks; (4) are domain-specific, which ties their adoption to a specific domain; and (5) do not provide workflow support to enable the definition of complex "experiments". The Ophidia project aims to address most of these challenges by providing a big data analytics framework for eScience. Ophidia provides declarative, server-side, and parallel data analysis, jointly with an internal storage model able to deal efficiently with multidimensional data and a hierarchical data organization to manage large data volumes ("datacubes"). The project relies on a strong background in high-performance database management and OLAP systems to manage large scientific data sets. It also provides native workflow management support for defining processing chains and workflows with tens to hundreds of data analytics operators to build real scientific use cases. With regard to interoperability, the talk will present the contributions made to both the RDA Working Group on Array Databases and the Earth System Grid Federation (ESGF) Compute Working Team. Also highlighted will be the results of large-scale climate model intercomparison data analysis experiments that were: (1) defined in the context of the EU H2020 INDIGO-DataCloud project; (2) implemented in a real geographically distributed environment involving the CMCC (Italy) and LLNL (US) sites; (3) run with Ophidia as the server-side, parallel analytics engine; and (4) applied to real CMIP5 data sets available through ESGF.
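The datacube operations described, subsetting followed by reduction along a dimension, can be mimicked in a few lines of numpy. This is a toy sketch of the processing model only, not Ophidia's actual operators or client; the cube's dimensions and values are invented.

```python
import numpy as np

# Hypothetical climate datacube: (time, lat, lon), e.g. 10 years of
# monthly means on a coarse grid (dimensions and values invented).
time, lat, lon = 120, 180, 360
cube = 15 + 10 * np.random.default_rng(0).standard_normal((time, lat, lon))

# "Subset" operator: select a spatial region and a time window.
region = cube[0:12, 60:120, 100:200]      # first year, one lat/lon box

# "Reduce" operator: collapse a dimension with an aggregate, the
# datacube analogue of an OLAP roll-up.
annual_mean_map = region.mean(axis=0)     # (lat, lon) map of annual means
time_series = region.mean(axis=(1, 2))    # area-mean monthly series

print(annual_mean_map.shape, time_series.shape)
```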
Why bottom-up taxonomies are unlikely to satisfy the quest for a definitive taxonomy of situations.
Reis, Harry T
2018-03-01
The recent advent of methods for large-scale data collection has provided an unprecedented opportunity for researchers who seek to develop a taxonomy of situations. Parrigon, Woo, Tay, and Wang's (2017) CAPTIONs model is the latest such effort. In this comment, I argue that although bottom-up approaches of this sort have clear value, they are unlikely to provide the sort of definitive, comprehensive, and theoretically integrative taxonomy that the field wants and needs. In large part, this is because bottom-up taxonomies represent what is common about situations and not what is theoretically important and influential about them. (PsycINFO Database Record (c) 2018 APA, all rights reserved).
EuroPhenome and EMPReSS: online mouse phenotyping resource
Mallon, Ann-Marie; Blake, Andrew; Hancock, John M.
2008-01-01
EuroPhenome (http://www.europhenome.org) and EMPReSS (http://empress.har.mrc.ac.uk/) form an integrated resource to provide access to data and procedures for mouse phenotyping. EMPReSS describes 96 Standard Operating Procedures for mouse phenotyping. EuroPhenome contains data resulting from carrying out EMPReSS protocols on four inbred laboratory mouse strains. As well as web interfaces, both resources support web services to enable integration with other mouse phenotyping and functional genetics resources, and are committed to initiatives to improve integration of mouse phenotype databases. EuroPhenome will be the repository for a recently initiated effort to carry out large-scale phenotyping on a large number of knockout mouse lines (EUMODIC). PMID:17905814
Panigrahi, Priyabrata; Jere, Abhay; Anamika, Krishanpal
2018-01-01
Gene fusion is a chromosomal rearrangement event that plays a significant role in cancer due to the oncogenic potential of the chimeric proteins generated through fusions. At present, many databases in the public domain provide detailed information about known gene fusion events and their functional roles. Existing gene fusion detection tools, based on the analysis of transcriptomics data, usually report a large number of fusion genes as potential candidates, which may be known fusions, novel fusions, or false positives. Manual annotation of these putative fusions is time-consuming. We have developed a web platform, FusionHub, which acts as an integrated search engine interfacing various fusion gene databases and simplifies large-scale annotation of fusion genes in a seamless way. In addition, FusionHub provides three ways of visualizing fusion events: a circular view, a domain architecture view, and a network view. The design of potential siRNA molecules through an ensemble method is another utility integrated into FusionHub that could aid siRNA-based targeted therapy. FusionHub is freely available at https://fusionhub.persistent.co.in.
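The kind of cross-database annotation such a platform automates can be sketched as a join on normalized gene-pair keys. A minimal, hypothetical illustration follows; the mini-exports, source names, and key format are invented, and real fusion databases have far richer schemas.

```python
from collections import defaultdict

def annotate(candidates, databases):
    """Tag each candidate gene fusion with the source databases reporting it.

    databases: {source_name: iterable of (gene5, gene3) pairs}. Keys are
    normalized to upper-case "5PRIME--3PRIME" strings so the same event
    matches across differently formatted exports.
    """
    known = defaultdict(set)
    for source, pairs in databases.items():
        for g5, g3 in pairs:
            known[f"{g5.upper()}--{g3.upper()}"].add(source)
    return {c.upper(): sorted(known.get(c.upper(), {"unreported"}))
            for c in candidates}

# Hypothetical mini-exports from two fusion databases (contents invented)
dbs = {
    "dbA": [("BCR", "ABL1"), ("EML4", "ALK")],
    "dbB": [("bcr", "abl1"), ("TMPRSS2", "ERG")],
}
print(annotate(["BCR--ABL1", "FOO--BAR"], dbs))
# {'BCR--ABL1': ['dbA', 'dbB'], 'FOO--BAR': ['unreported']}
```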