Sample records for large-scale chemoinformatics database

  1. A Chemoinformatics Approach to the Discovery of Lead-Like Molecules from Marine and Microbial Sources En Route to Antitumor and Antibiotic Drugs

    PubMed Central

    Pereira, Florbela; Latino, Diogo A. R. S.; Gaudêncio, Susana P.

    2014-01-01

    The comprehensive information of small molecules and their biological activities in the PubChem database allows chemoinformatic researchers to access and make use of large-scale biological activity data to improve the precision of drug profiling. A Quantitative Structure–Activity Relationship approach, for classification, was used for the prediction of active/inactive compounds relative to overall biological activity, antitumor and antibiotic activities using a data set of 1804 compounds from PubChem. Using the best classification models for antibiotic and antitumor activities, a data set of marine and microbial natural products from the AntiMarin database was screened; 57 and 16 new lead compounds for antibiotic and antitumor drug design were proposed, respectively. All compounds proposed by our approach are classified as non-antibiotic and non-antitumor compounds in the AntiMarin database. Recently several of the lead-like compounds proposed by us were reported as being active in the literature. PMID:24473174
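
    A minimal, hedged sketch of the kind of fingerprint-based active/inactive classification described above; the SMILES strings, labels and model settings are invented for illustration and are not the paper's PubChem/AntiMarin data or its actual models (assumes RDKit and scikit-learn):

    ```python
    # Toy QSAR-style classifier: Morgan fingerprints + random forest.
    # All data here are invented examples, not the study's data set.
    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from sklearn.ensemble import RandomForestClassifier

    def morgan_fp(smiles, radius=2, n_bits=2048):
        """Return a Morgan (ECFP-like) fingerprint as a numpy array."""
        mol = Chem.MolFromSmiles(smiles)
        bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(bv, arr)
        return arr

    # Toy training data: (SMILES, label) with 1 = active, 0 = inactive.
    train = [("CCO", 0), ("c1ccccc1O", 1),
             ("CC(=O)Oc1ccccc1C(=O)O", 1), ("CCCCCC", 0)]
    X = np.array([morgan_fp(s) for s, _ in train])
    y = np.array([label for _, label in train])

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Probability that a new (toy) candidate is "active".
    print(clf.predict_proba([morgan_fp("COc1ccccc1O")]))
    ```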

  2. Speeding Up Chemical Searches Using the Inverted Index: the Convergence of Chemoinformatics and Text Search Methods

    PubMed Central

    Nasr, Ramzi; Vernica, Rares; Li, Chen; Baldi, Pierre

    2012-01-01

    In ligand-based screening, retrosynthesis, and other chemoinformatics applications, one often seeks to search large databases of molecules in order to retrieve molecules that are similar to a given query. With the expanding size of molecular databases, the efficiency and scalability of data structures and algorithms for chemical searches are becoming increasingly important. Remarkably, both the chemoinformatics and information retrieval communities have converged on similar solutions whereby molecules or documents are represented by binary vectors, or fingerprints, indexing their substructures such as labeled paths for molecules and n-grams for text, with the same Jaccard-Tanimoto similarity measure. As a result, similarity search methods from one field can be adapted to the other. Here we adapt recent, state-of-the-art, inverted index methods from information retrieval to speed up similarity searches in chemoinformatics. Our results show a several-fold speed-up improvement over previous methods for both threshold searches and top-K searches. We also provide a mathematical analysis that allows one to predict the level of pruning achieved by the inverted index approach, and validate the quality of these predictions through simulation experiments. All results can be replicated using data freely downloadable from http://cdb.ics.uci.edu/. PMID:22462644
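
    A toy sketch of the inverted-index idea for Tanimoto threshold searches: molecules are sets of "on" fingerprint bits and the index maps each bit to the molecules containing it, so only molecules sharing at least one bit with the query are ever scored. This illustrates only the candidate-generation part; the paper's method adds tighter similarity bounds, and the fingerprints below are invented:

    ```python
    from collections import defaultdict

    def tanimoto(a, b):
        """Jaccard-Tanimoto similarity of two sets of fingerprint bits."""
        inter = len(a & b)
        return inter / (len(a) + len(b) - inter)

    fingerprints = {                       # molecule id -> set of on-bits
        "mol1": {1, 4, 7, 9},
        "mol2": {1, 4, 8},
        "mol3": {2, 3, 5},
    }

    index = defaultdict(set)               # bit -> ids of molecules with that bit
    for mol_id, bits in fingerprints.items():
        for bit in bits:
            index[bit].add(mol_id)

    def threshold_search(query_bits, t):
        """Return {molecule: similarity} for molecules with Tanimoto >= t."""
        candidates = set().union(*(index[b] for b in query_bits if b in index))
        scored = {m: tanimoto(query_bits, fingerprints[m]) for m in candidates}
        return {m: s for m, s in scored.items() if s >= t}

    print(threshold_search({1, 4, 9}, t=0.5))   # mol3 is never even scored
    ```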

  3. Role of Open Source Tools and Resources in Virtual Screening for Drug Discovery.

    PubMed

    Karthikeyan, Muthukumarasamy; Vyas, Renu

    2015-01-01

    Advancements in chemoinformatics research, in parallel with the availability of high-performance computing platforms, have made the handling of large-scale, multi-dimensional scientific data for high-throughput drug discovery easier. In this study we have explored publicly available molecular databases with the help of open-source-based, integrated in-house molecular informatics tools for virtual screening. The virtual screening literature of the past decade has been extensively investigated and thoroughly analyzed to reveal interesting patterns with respect to the drug, target, scaffold and disease space. The review also focuses on integrated chemoinformatics tools capable of harvesting chemical data from textual literature information and transforming it into truly computable chemical structures, identifying unique fragments and scaffolds from a class of compounds, automatically generating focused virtual libraries, computing molecular descriptors for structure-activity relationship studies, and applying the conventional filters used in lead discovery along with in-house developed exhaustive PTC (Pharmacophore, Toxicophores and Chemophores) filters and machine learning tools for the design of potential disease-specific inhibitors. A case study on kinase inhibitors is provided as an example.

  4. Experimental and Chemoinformatics Study of Tautomerism in a Database of Commercially Available Screening Samples

    PubMed Central

    2016-01-01

    We investigated how many cases of the same chemical sold as different products (at possibly different prices) occurred in a prototypical large aggregated database and simultaneously tested the tautomerism definitions in the chemoinformatics toolkit CACTVS. We applied the standard CACTVS tautomeric transforms plus a set of recently developed ring–chain transforms to the Aldrich Market Select (AMS) database of 6 million screening samples and building blocks. In 30 000 cases, two or more AMS products were found to be just different tautomeric forms of the same compound. We purchased and analyzed 166 such tautomer pairs and triplets by 1H and 13C NMR to determine whether the CACTVS transforms accurately predicted what is the same “stuff in the bottle”. Essentially all prototropic transforms with examples in the AMS were confirmed. Some of the ring–chain transforms were found to be too “aggressive”, i.e. to equate structures with one another that were different compounds. PMID:27669079
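
    A rough sketch of tautomer-aware duplicate detection in the same spirit, using RDKit's tautomer canonicalizer as a stand-in for the CACTVS transforms actually used in the study; the two catalogue entries are an invented 2-hydroxypyridine/2-pyridone pair:

    ```python
    from collections import defaultdict
    from rdkit import Chem
    from rdkit.Chem.MolStandardize import rdMolStandardize

    enumerator = rdMolStandardize.TautomerEnumerator()

    def canonical_tautomer_smiles(smiles):
        """Map a structure to the SMILES of its canonical tautomer."""
        mol = Chem.MolFromSmiles(smiles)
        return Chem.MolToSmiles(enumerator.Canonicalize(mol))

    # Hypothetical catalogue entries: 2-hydroxypyridine vs. 2-pyridone.
    catalogue = {"PROD-A": "Oc1ccccn1", "PROD-B": "O=c1cccc[nH]1"}

    groups = defaultdict(list)
    for product_id, smi in catalogue.items():
        groups[canonical_tautomer_smiles(smi)].append(product_id)

    # Entries grouped together are tautomeric forms of the same compound.
    print({k: v for k, v in groups.items() if len(v) > 1})
    ```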

  5. Molecular similarity and diversity in chemoinformatics: from theory to applications.

    PubMed

    Maldonado, Ana G; Doucet, J P; Petitjean, Michel; Fan, Bo-Tao

    2006-02-01

    This review is dedicated to a survey on molecular similarity and diversity. Key findings reported in recent investigations are selectively highlighted and summarized. Although this overview is mainly centered on chemoinformatics, applications in other areas (pharmaceutical and medicinal chemistry, combinatorial chemistry, chemical database management, etc.) are also introduced. The approaches used to define and describe the concepts of molecular similarity and diversity in the context of chemoinformatics are discussed in the first part of this review. We introduce, in the second and third parts, the descriptions and analyses of different methods and techniques. Finally, current applications and problems are enumerated and discussed in the last part.

  6. Combining chemoinformatics with bioinformatics: in silico prediction of bacterial flavor-forming pathways by a chemical systems biology approach "reverse pathway engineering".

    PubMed

    Liu, Mengjin; Bienfait, Bruno; Sacher, Oliver; Gasteiger, Johann; Siezen, Roland J; Nauta, Arjen; Geurts, Jan M W

    2014-01-01

    The incompleteness of genome-scale metabolic models is a major bottleneck for systems biology approaches, which are based on large numbers of metabolites as identified and quantified by metabolomics. Many of the revealed secondary metabolites and/or their derivatives, such as flavor compounds, are non-essential in metabolism, and many of their synthesis pathways are unknown. In this study, we describe a novel approach, Reverse Pathway Engineering (RPE), which combines chemoinformatics and bioinformatics analyses, to predict the "missing links" between compounds of interest and their possible metabolic precursors by providing plausible chemical and/or enzymatic reactions. We demonstrate the added-value of the approach by using flavor-forming pathways in lactic acid bacteria (LAB) as an example. Established metabolic routes leading to the formation of flavor compounds from leucine were successfully replicated. Novel reactions involved in flavor formation, i.e. the conversion of alpha-hydroxy-isocaproate to 3-methylbutanoic acid and the synthesis of dimethyl sulfide, as well as the involved enzymes were successfully predicted. These new insights into the flavor-formation mechanisms in LAB can have a significant impact on improving the control of aroma formation in fermented food products. Since the input reaction databases and compounds are highly flexible, the RPE approach can be easily extended to a broad spectrum of applications, amongst others health/disease biomarker discovery as well as synthetic biology.

  7. Quantifying the Relationships among Drug Classes

    PubMed Central

    Hert, Jérôme; Keiser, Michael J.; Irwin, John J.; Oprea, Tudor I.; Shoichet, Brian K.

    2009-01-01

    The similarity of drug targets is typically measured using sequence or structural information. Here, we consider chemo-centric approaches that measure target similarity on the basis of their ligands, asking how chemoinformatics similarities differ from those derived bioinformatically, how stable the ligand networks are to changes in chemoinformatics metrics, and which network is the most reliable for prediction of pharmacology. We calculated the similarities between hundreds of drug targets and their ligands and mapped the relationship between them in a formal network. Bioinformatics networks were based on the BLAST similarity between sequences, while chemoinformatics networks were based on the ligand-set similarities calculated with either the Similarity Ensemble Approach (SEA) or a method derived from Bayesian statistics. By multiple criteria, bioinformatics and chemoinformatics networks differed substantially, and only occasionally did a high sequence similarity correspond to a high ligand-set similarity. In contrast, the chemoinformatics networks were stable to the method used to calculate the ligand-set similarities and to the chemical representation of the ligands. Also, the chemoinformatics networks were more natural and more organized, by network theory, than their bioinformatics counterparts: ligand-based networks were found to be small-world and broad-scale. PMID:18335977

  8. Mining collections of compounds with Screening Assistant 2

    PubMed Central

    2012-01-01

    Background: High-throughput screening assays have become the starting point of many drug discovery programs for large pharmaceutical companies as well as academic organisations. Despite the increasing throughput of screening technologies, the almost infinite chemical space remains out of reach, calling for tools dedicated to the analysis and selection of the compound collections intended to be screened. Results: We present Screening Assistant 2 (SA2), an open-source JAVA software dedicated to the storage and analysis of small to very large chemical libraries. SA2 stores unique molecules in a MySQL database, and encapsulates several chemoinformatics methods, among which: providers management, interactive visualisation, scaffold analysis, diverse subset creation, descriptors calculation, sub-structure / SMART search, similarity search and filtering. We illustrate the use of SA2 by analysing the composition of a database of 15 million compounds collected from 73 providers, in terms of scaffolds, frameworks, and undesired properties as defined by recently proposed HTS SMARTS filters. We also show how the software can be used to create diverse libraries based on existing ones. Conclusions: Screening Assistant 2 is a user-friendly, open-source software that can be used to manage collections of compounds and perform simple to advanced chemoinformatics analyses. Its modular design and growing documentation facilitate the addition of new functionalities, calling for contributions from the community. The software can be downloaded at http://sa2.sourceforge.net/. PMID:23327565
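
    A much-simplified sketch of the "unique molecules in a relational database" idea; SA2 itself is a Java application with a richer MySQL schema, so the SQLite tables, column names and compounds below are assumptions for illustration only:

    ```python
    import sqlite3
    from rdkit import Chem

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE molecule (id INTEGER PRIMARY KEY, can_smiles TEXT UNIQUE);
    CREATE TABLE provider_entry (
        provider TEXT, catalogue_id TEXT, molecule_id INTEGER,
        FOREIGN KEY (molecule_id) REFERENCES molecule(id)
    );
    """)

    def register(provider, catalogue_id, smiles):
        """Store the molecule once (keyed on canonical SMILES) and link the entry."""
        can = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))
        conn.execute("INSERT OR IGNORE INTO molecule (can_smiles) VALUES (?)", (can,))
        mol_id = conn.execute(
            "SELECT id FROM molecule WHERE can_smiles = ?", (can,)).fetchone()[0]
        conn.execute("INSERT INTO provider_entry VALUES (?, ?, ?)",
                     (provider, catalogue_id, mol_id))

    register("VendorA", "A-001", "c1ccccc1C(=O)O")   # benzoic acid
    register("VendorB", "B-042", "OC(=O)c1ccccc1")   # same compound, different SMILES
    print(conn.execute("SELECT COUNT(*) FROM molecule").fetchone())  # one unique molecule
    ```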

  9. Mining collections of compounds with Screening Assistant 2.

    PubMed

    Guilloux, Vincent Le; Arrault, Alban; Colliandre, Lionel; Bourg, Stéphane; Vayer, Philippe; Morin-Allory, Luc

    2012-08-31

    High-throughput screening assays have become the starting point of many drug discovery programs for large pharmaceutical companies as well as academic organisations. Despite the increasing throughput of screening technologies, the almost infinite chemical space remains out of reach, calling for tools dedicated to the analysis and selection of the compound collections intended to be screened. We present Screening Assistant 2 (SA2), an open-source JAVA software dedicated to the storage and analysis of small to very large chemical libraries. SA2 stores unique molecules in a MySQL database, and encapsulates several chemoinformatics methods, among which: providers management, interactive visualisation, scaffold analysis, diverse subset creation, descriptors calculation, sub-structure / SMART search, similarity search and filtering. We illustrate the use of SA2 by analysing the composition of a database of 15 million compounds collected from 73 providers, in terms of scaffolds, frameworks, and undesired properties as defined by recently proposed HTS SMARTS filters. We also show how the software can be used to create diverse libraries based on existing ones. Screening Assistant 2 is a user-friendly, open-source software that can be used to manage collections of compounds and perform simple to advanced chemoinformatics analyses. Its modular design and growing documentation facilitate the addition of new functionalities, calling for contributions from the community. The software can be downloaded at http://sa2.sourceforge.net/.

  10. ChemoPy: freely available python package for computational biology and chemoinformatics.

    PubMed

    Cao, Dong-Sheng; Xu, Qing-Song; Hu, Qian-Nan; Liang, Yi-Zeng

    2013-04-15

    Molecular representation for small molecules has been routinely used in QSAR/SAR, virtual screening, database search, ranking, drug ADME/T prediction and other drug discovery processes. To facilitate extensive studies of drug molecules, we developed a freely available, open-source python package called chemoinformatics in python (ChemoPy) for calculating the commonly used structural and physicochemical features. It computes 16 drug feature groups composed of 19 descriptors that include 1135 descriptor values. In addition, it provides seven types of molecular fingerprint systems for drug molecules, including topological fingerprints, electro-topological state (E-state) fingerprints, MACCS keys, FP4 keys, atom pairs fingerprints, topological torsion fingerprints and Morgan/circular fingerprints. By applying a semi-empirical quantum chemistry program MOPAC, ChemoPy can also compute a large number of 3D molecular descriptors conveniently. The python package, ChemoPy, is freely available via http://code.google.com/p/pychem/downloads/list, and it runs on Linux and MS-Windows. Supplementary data are available at Bioinformatics online.
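
    An illustration of the kinds of descriptors and fingerprints such packages compute, done here with RDKit rather than ChemoPy itself (ChemoPy's own API is not reproduced); the example molecule is caffeine:

    ```python
    from rdkit import Chem
    from rdkit.Chem import AllChem, Descriptors, MACCSkeys

    mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")  # caffeine

    # A few common 2D physicochemical descriptors.
    features = {
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "TPSA": Descriptors.TPSA(mol),
        "NumHAcceptors": Descriptors.NumHAcceptors(mol),
    }
    print(features)

    # Two widely used fingerprint systems: MACCS keys and Morgan/circular bits.
    maccs = MACCSkeys.GenMACCSKeys(mol)                           # 167 bits
    morgan = AllChem.GetMorganFingerprintAsBitVect(mol, 2, 1024)  # ECFP4-like
    print(maccs.GetNumOnBits(), morgan.GetNumOnBits())
    ```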

  11. Accurate prediction of personalized olfactory perception from large-scale chemoinformatic features.

    PubMed

    Li, Hongyang; Panwar, Bharat; Omenn, Gilbert S; Guan, Yuanfang

    2018-02-01

    The olfactory stimulus-percept problem has been studied for more than a century, yet it is still hard to precisely predict the odor given the large-scale chemoinformatic features of an odorant molecule. A major challenge is that the perceived qualities vary greatly among individuals due to different genetic and cultural backgrounds. Moreover, the combinatorial interactions between multiple odorant receptors and diverse molecules significantly complicate the olfaction prediction. Many attempts have been made to establish structure-odor relationships for intensity and pleasantness, but no models are available to predict the personalized multi-odor attributes of molecules. In this study, we describe our winning algorithm for predicting individual and population perceptual responses to various odorants in the DREAM Olfaction Prediction Challenge. We find that a random forest model consisting of multiple decision trees is well suited to this prediction problem, given the large feature spaces and high variability of perceptual ratings among individuals. Integrating both population and individual perceptions into our model effectively reduces the influence of noise and outliers. By analyzing the importance of each chemical feature, we find that a small set of low- and nondegenerative features is sufficient for accurate prediction. Our random forest model successfully predicts personalized odor attributes of structurally diverse molecules. This model together with the top discriminative features has the potential to extend our understanding of olfactory perception mechanisms and provide an alternative for rational odorant design.
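
    A generic sketch of that modelling setup (a random forest over chemoinformatic features predicting a perceptual rating); the feature matrix and ratings are random stand-ins, not the DREAM challenge data or the winning pipeline:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.random((200, 50))    # 200 molecules x 50 chemical features (toy data)
    y = rng.random(200)          # e.g. a mean "pleasantness" rating per molecule

    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

    # Feature importances hint at which chemical features drive the prediction.
    top = np.argsort(model.feature_importances_)[::-1][:5]
    print("top feature indices:", top)
    print("prediction for one molecule:", model.predict(X[:1]))
    ```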

  12. FilTer BaSe: A web accessible chemical database for small compound libraries.

    PubMed

    Kolte, Baban S; Londhe, Sanjay R; Solanki, Bhushan R; Gacche, Rajesh N; Meshram, Rohan J

    2018-03-01

    Finding novel chemical agents for targeting disease-associated drug targets often requires screening of a large number of new chemical libraries. In silico methods are generally implemented at the initial stages for virtual screening. Filtering of such compound libraries on physicochemical and substructural grounds is done to ensure elimination of compounds with undesired chemical properties. The filtering procedure is redundant and time consuming, and requires skilled bioinformatics/computational manpower along with high-end software involving huge capital investment, which forms a major obstacle in drug discovery projects in an academic setup. We present an open-source resource, FilTer BaSe, a chemoinformatics platform (http://bioinfo.net.in/filterbase/) that hosts fully filtered, ready-to-use compound libraries of workable size. The resource also hosts a database that enables efficient searching of the chemical space of around 348,000 compounds on the basis of physicochemical and substructure properties. The ready-to-use compound libraries and database presented here are expected to lend a helping hand to new drug developers and medicinal chemists. Copyright © 2017 Elsevier Inc. All rights reserved.
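
    A minimal sketch of the kind of physicochemical pre-filter such libraries are built with, here a textbook Lipinski-style rule-of-five check via RDKit; the cutoffs and compounds are generic assumptions, not FilTer BaSe's exact settings:

    ```python
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    def passes_rule_of_five(smiles):
        """Generic Lipinski-style filter (textbook cutoffs, for illustration)."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return False
        return (Descriptors.MolWt(mol) <= 500
                and Descriptors.MolLogP(mol) <= 5
                and Descriptors.NumHDonors(mol) <= 5
                and Descriptors.NumHAcceptors(mol) <= 10)

    library = ["CC(=O)Oc1ccccc1C(=O)O",              # aspirin: passes
               "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"]     # C30 alkane: fails on logP
    print([smi for smi in library if passes_rule_of_five(smi)])
    ```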

  13. Videoconferencing and other distance education techniques in chemoinformatics teaching and research at Indiana University.

    PubMed

    Wild, David J; Wiggins, Gary D

    2006-01-01

    At a time when the demand for people with expertise in chemoinformatics is increasing, there is still only a very small number of academic institutions that offer chemoinformatics-related classes and degrees. The distance education (DE) approach allows both learning and research to be carried out at multiple geographic locations and institutions, thus leveraging the few educational offerings that are available. In this paper, distance education techniques and technologies (with emphasis on videoconferencing) are reviewed, and examples of how they are used to increase the accessibility of chemoinformatics education and research at the Indiana University School of Informatics are presented.

  14. Follow up: Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer.

    PubMed

    Hu, Ye; Bajorath, Jürgen

    2014-01-01

    In 2012, we reported 30 compound data sets and/or programs developed in our laboratory in a data article and made them freely available to the scientific community to support chemoinformatics and computational medicinal chemistry applications. These data sets and computational tools were provided for download from our website. Since publication of this data article, we have generated 13 new data sets with which we further extend our collection of publicly available data and tools. Due to changes in web servers and website architectures, data accessibility has recently been limited at times. Therefore, we have also transferred our data sets and tools to a public repository to ensure full and stable accessibility. To aid in data selection, we have classified the data sets according to scientific subject areas. Herein, we describe new data sets, introduce the data organization scheme, summarize the database content and provide detailed access information in ZENODO (doi: 10.5281/zenodo.8451 and doi:10.5281/zenodo.8455).

  15. Soft Sensors: Chemoinformatic Model for Efficient Control and Operation in Chemical Plants.

    PubMed

    Funatsu, Kimito

    2016-12-01

    A soft sensor is a statistical model that serves as an essential tool for controlling pharmaceutical, chemical and industrial plants. I introduce soft sensors, their roles, applications and problems, and research examples such as adaptive soft sensors, database monitoring and efficient process control. The use of soft sensors enables chemical industrial plants to be operated more effectively and stably. © 2016 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
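
    A toy soft-sensor sketch: a statistical model (here partial least squares) that infers a hard-to-measure quality variable from easy-to-measure process variables; the process data are simulated, not from a real plant:

    ```python
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 8))      # e.g. temperatures, flows, pressures
    y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=500)   # latent "quality"

    soft_sensor = PLSRegression(n_components=3).fit(X, y)

    # Online use: estimate product quality from the latest sensor readings.
    latest_readings = X[-1:]
    print("estimated quality:", soft_sensor.predict(latest_readings))
    ```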

  16. TOXICO-CHEMINFORMATICS AND QSAR MODELING OF ...

    EPA Pesticide Factsheets

    This abstract concludes that QSAR approaches combined with toxico-chemoinformatics descriptors can enhance predictive toxicology models.

  17. Chemical graphs, molecular matrices and topological indices in chemoinformatics and quantitative structure-activity relationships.

    PubMed

    Ivanciuc, Ovidiu

    2013-06-01

    Chemical and molecular graphs have fundamental applications in chemoinformatics, quantitative structure–property relationships (QSPR), quantitative structure-activity relationships (QSAR), virtual screening of chemical libraries, and computational drug design. Chemoinformatics applications of graphs include chemical structure representation and coding, database search and retrieval, and physicochemical property prediction. QSPR, QSAR and virtual screening are based on the structure-property principle, which states that the physicochemical and biological properties of chemical compounds can be predicted from their chemical structure. Such structure-property correlations are usually developed from topological indices and fingerprints computed from the molecular graph and from molecular descriptors computed from the three-dimensional chemical structure. We present here a selection of the most important graph descriptors and topological indices, including molecular matrices, graph spectra, spectral moments, graph polynomials, and vertex topological indices. These graph descriptors are used to define several topological indices based on molecular connectivity, graph distance, reciprocal distance, distance-degree, distance-valency, spectra, polynomials, and information theory concepts. The molecular descriptors and topological indices can be developed with a more general approach, based on molecular graph operators, which define a family of graph indices related by a common formula. Graph descriptors and topological indices for molecules containing heteroatoms and multiple bonds are computed with weighting schemes based on atomic properties, such as the atomic number, covalent radius, or electronegativity. The correlation in QSPR and QSAR models can be improved by optimizing some parameters in the formula of topological indices, as demonstrated for structural descriptors based on atomic connectivity and graph distance.
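
    As one concrete example of the graph-distance-based indices surveyed above, a minimal RDKit computation of the Wiener index (the sum of all pairwise topological distances); the molecule, n-pentane, is just an example:

    ```python
    from rdkit import Chem

    def wiener_index(smiles):
        """Wiener index: half the sum of the topological distance matrix."""
        mol = Chem.MolFromSmiles(smiles)
        dist = Chem.GetDistanceMatrix(mol)   # bond-count distances between atoms
        return int(dist.sum() / 2)           # each pair is counted twice

    print(wiener_index("CCCCC"))   # n-pentane -> 20
    ```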

  18. Blocked inverted indices for exact clustering of large chemical spaces.

    PubMed

    Thiel, Philipp; Sach-Peltason, Lisa; Ottmann, Christian; Kohlbacher, Oliver

    2014-09-22

    The calculation of pairwise compound similarities based on fingerprints is one of the fundamental tasks in chemoinformatics. Methods for efficient calculation of compound similarities are of the utmost importance for various applications like similarity searching or library clustering. With the increasing size of public compound databases, exact clustering of these databases is desirable, but often computationally prohibitively expensive. We present an optimized inverted index algorithm for the calculation of all pairwise similarities on 2D fingerprints of a given data set. In contrast to other algorithms, it neither requires GPU computing nor yields a stochastic approximation of the clustering. The algorithm has been designed to work well with multicore architectures and shows excellent parallel speedup. As an application example of this algorithm, we implemented a deterministic clustering application, which has been designed to decompose virtual libraries comprising tens of millions of compounds in a short time on current hardware. Our results show that our implementation achieves more than 400 million Tanimoto similarity calculations per second on a common desktop CPU. Deterministic clustering of the available chemical space thus can be done on modern multicore machines within a few days.
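
    For contrast, a naive all-pairs Tanimoto sketch, i.e. the quadratic baseline that algorithms like the one above are designed to accelerate, vectorized with NumPy on a small random fingerprint matrix (invented data):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    fps = rng.random((1000, 1024)) < 0.05    # 1000 random sparse binary fingerprints

    counts = fps.sum(axis=1)                                 # per-molecule bit counts
    inter = fps.astype(np.int32) @ fps.T.astype(np.int32)    # pairwise intersections
    union = counts[:, None] + counts[None, :] - inter
    tanimoto = inter / np.maximum(union, 1)                  # avoid division by zero

    # e.g. count pairs above a 0.35 threshold, excluding self-pairs on the diagonal
    i, j = np.where(np.triu(tanimoto, k=1) >= 0.35)
    print(len(i), "similar pairs")
    ```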

  19. Software and resources for computational medicinal chemistry

    PubMed Central

    Liao, Chenzhong; Sitzmann, Markus; Pugliese, Angelo; Nicklaus, Marc C

    2011-01-01

    Computer-aided drug design plays a vital role in drug discovery and development and has become an indispensable tool in the pharmaceutical industry. Computational medicinal chemists can take advantage of all kinds of software and resources in the computer-aided drug design field for the purposes of discovering and optimizing biologically active compounds. This article reviews software and other resources related to computer-aided drug design approaches, putting particular emphasis on structure-based drug design, ligand-based drug design, chemical databases and chemoinformatics tools. PMID:21707404

  20. Prioritizing Chemicals for Risk Assessment Using Chemoinformatics: Examples from the IARC Monographs on Pesticides.

    PubMed

    Guha, Neela; Guyton, Kathryn Z; Loomis, Dana; Barupal, Dinesh Kumar

    2016-12-01

    Identifying cancer hazards is the first step towards cancer prevention. The International Agency for Research on Cancer (IARC) Monographs Programme, which has evaluated nearly 1,000 agents for their carcinogenic potential since 1971, typically selects agents for hazard identification on the basis of public nominations, expert advice, published data on carcinogenicity, and public health importance. Here, we present a novel and complementary strategy for identifying agents for hazard evaluation using chemoinformatics, database integration, and automated text mining. To inform selection among a broad range of pesticides nominated for evaluation, we identified and screened nearly 6,000 relevant chemical structures, after which we systematically compiled information on 980 pesticides, creating network maps that allowed cluster visualization by chemical similarity, pesticide class, and publicly available information concerning cancer epidemiology, cancer bioassays, and carcinogenic mechanisms. For the IARC Monograph meetings that took place in March and June 2015, this approach supported high-priority evaluation of glyphosate, malathion, parathion, tetrachlorvinphos, diazinon, p,p'-dichlorodiphenyltrichloroethane (DDT), lindane, and 2,4-dichlorophenoxyacetic acid (2,4-D). This systematic approach, accounting for chemical similarity and overlaying multiple data sources, can be used by risk assessors as well as by researchers to systematize, inform, and increase efficiency in selecting and prioritizing agents for hazard identification, risk assessment, regulation, or further investigation. This approach could be extended to an array of outcomes and agents, including occupational carcinogens, drugs, and foods. Citation: Guha N, Guyton KZ, Loomis D, Barupal DK. 2016. Prioritizing chemicals for risk assessment using chemoinformatics: examples from the IARC Monographs on Pesticides. Environ Health Perspect 124:1823-1829; http://dx.doi.org/10.1289/EHP186.

  1. Prioritizing Chemicals for Risk Assessment Using Chemoinformatics: Examples from the IARC Monographs on Pesticides

    PubMed Central

    Guha, Neela; Guyton, Kathryn Z.; Loomis, Dana; Barupal, Dinesh Kumar

    2016-01-01

    Background: Identifying cancer hazards is the first step towards cancer prevention. The International Agency for Research on Cancer (IARC) Monographs Programme, which has evaluated nearly 1,000 agents for their carcinogenic potential since 1971, typically selects agents for hazard identification on the basis of public nominations, expert advice, published data on carcinogenicity, and public health importance. Objectives: Here, we present a novel and complementary strategy for identifying agents for hazard evaluation using chemoinformatics, database integration, and automated text mining. Discussion: To inform selection among a broad range of pesticides nominated for evaluation, we identified and screened nearly 6,000 relevant chemical structures, after which we systematically compiled information on 980 pesticides, creating network maps that allowed cluster visualization by chemical similarity, pesticide class, and publicly available information concerning cancer epidemiology, cancer bioassays, and carcinogenic mechanisms. For the IARC Monograph meetings that took place in March and June 2015, this approach supported high-priority evaluation of glyphosate, malathion, parathion, tetrachlorvinphos, diazinon, p,p′-dichlorodiphenyltrichloroethane (DDT), lindane, and 2,4-dichlorophenoxyacetic acid (2,4-D). Conclusions: This systematic approach, accounting for chemical similarity and overlaying multiple data sources, can be used by risk assessors as well as by researchers to systematize, inform, and increase efficiency in selecting and prioritizing agents for hazard identification, risk assessment, regulation, or further investigation. This approach could be extended to an array of outcomes and agents, including occupational carcinogens, drugs, and foods. Citation: Guha N, Guyton KZ, Loomis D, Barupal DK. 2016. Prioritizing chemicals for risk assessment using chemoinformatics: examples from the IARC Monographs on Pesticides. Environ Health Perspect 124:1823–1829; http://dx.doi.org/10.1289/EHP186 PMID:27164621

  2. Systematic search for benzimidazole compounds and derivatives with antileishmanial effects.

    PubMed

    Sánchez-Salgado, Juan Carlos; Bilbao-Ramos, Pablo; Dea-Ayuela, María Auxiliadora; Hernández-Luis, Francisco; Bolás-Fernández, Francisco; Medina-Franco, José L; Rojas-Aguirre, Yareli

    2018-05-10

    Leishmaniasis is a neglected tropical disease that currently affects 12 million people, and over 1 billion people are at risk of infection. Current chemotherapeutic approaches used to treat this disease are unsatisfactory, and the limitations of these drugs highlight the necessity to develop treatments with improved efficacy and safety. To inform the rational design and development of more efficient therapies, the present study reports a chemoinformatic approach using the ChEMBL database to retrieve benzimidazole as a target scaffold. Our analysis revealed that a limited number of studies had investigated the antileishmanial effects of benzimidazoles. Among this limited number, L. major was the species most commonly used to evaluate the antileishmanial effects of these compounds, whereas L. amazonensis and L. braziliensis were used least often in the reported studies. The antileishmanial activities of benzimidazole derivatives were notably variable, a fact that may depend on the substitution pattern of the scaffold. In addition, we investigated the effects of a benzimidazole derivative on promastigotes and amastigotes of L. infantum and L. amazonensis using a novel fluorometric method. Significant antileishmanial effects were observed on both species, with L. amazonensis being the most sensitive. To the best of our knowledge, this chemoinformatic analysis represents the first attempt to determine the relevance of benzimidazole scaffolds for antileishmanial drug discovery using the ChEMBL database. The present findings will provide relevant information for future structure-activity relationship studies and for the investigation of benzimidazole-derived drugs as potential treatments for leishmaniasis.
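
    A small sketch of the scaffold-retrieval step described above: flagging compounds that contain a benzimidazole core with an RDKit substructure search; the SMARTS query and the toy compound list are assumptions for illustration, not the study's ChEMBL workflow:

    ```python
    from rdkit import Chem

    benzimidazole = Chem.MolFromSmarts("c1ccc2ncnc2c1")   # benzimidazole core query

    compounds = {
        "albendazole-like": "CCCSc1ccc2[nH]c(NC(=O)OC)nc2c1",
        "aspirin":          "CC(=O)Oc1ccccc1C(=O)O",
    }

    for name, smi in compounds.items():
        mol = Chem.MolFromSmiles(smi)
        print(name, mol.HasSubstructMatch(benzimidazole))
    ```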

  3. Drug-Like Protein–Protein Interaction Modulators: Challenges and Opportunities for Drug Discovery and Chemical Biology

    PubMed Central

    Villoutreix, Bruno O; Kuenemann, Melaine A; Poyet, Jean-Luc; Bruzzoni-Giovanelli, Heriberto; Labbé, Céline; Lagorce, David; Sperandio, Olivier; Miteva, Maria A

    2014-01-01

    Fundamental processes in living cells are largely controlled by macromolecular interactions and among them, protein–protein interactions (PPIs) have a critical role, while their dysregulation can contribute to the pathogenesis of numerous diseases. Although PPIs were already considered attractive pharmaceutical targets some years ago, they have thus far been largely unexploited for therapeutic interventions with low-molecular-weight compounds. Several limiting factors, from technological hurdles to conceptual barriers, are known, which, taken together, explain why research in this area has been relatively slow. Over the last decade, however, the scientific community has challenged the dogma and become more enthusiastic about the modulation of PPIs with small drug-like molecules. In fact, several success stories were reported, both at the preclinical and clinical stages. In this review article, written for the 2014 International Summer School in Chemoinformatics (Strasbourg, France), we discuss in silico tools (essentially post 2012) and databases that can assist the design of low-molecular-weight PPI modulators (these tools can be found at www.vls3d.com). We first introduce the field of protein–protein interaction research, discuss key challenges and comment on recently reported in silico packages, protocols and databases dedicated to PPIs. Then, we illustrate how in silico methods can be used and combined with experimental work to identify PPI modulators. PMID:25254076

  4. Combined use of computational chemistry and chemoinformatics methods for chemical discovery

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sugimoto, Manabu, E-mail: sugimoto@kumamoto-u.ac.jp; Institute for Molecular Science, 38 Nishigo-Naka, Myodaiji, Okazaki 444-8585; CREST, Japan Science and Technology Agency, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012

    2015-12-31

    Data analysis of the numerical data from computational chemistry calculations is carried out to obtain knowledge-based information on molecules. A molecular database is developed to systematically store chemical, electronic-structure, and knowledge-based information. The database is used to find molecules related to the keyword “cancer”. Then the electronic-structure calculations are performed to quantitatively evaluate the quantum chemical similarity of the molecules. Among the 377 compounds registered in the database, 24 molecules are found to be “cancer”-related. This set of molecules includes both carcinogens and anticancer drugs. The quantum chemical similarity analysis, which is carried out by using numerical results of the density-functional theory calculations, shows that, when some energy spectra are referred to, carcinogens are reasonably distinguished from the anticancer drugs. Therefore these spectral properties are considered as important measures for classification.

  5. Kekule.js: An Open Source JavaScript Chemoinformatics Toolkit.

    PubMed

    Jiang, Chen; Jin, Xi; Dong, Ying; Chen, Ming

    2016-06-27

    Kekule.js is an open-source, object-oriented JavaScript toolkit for chemoinformatics. It provides methods for many common tasks in molecular informatics, including chemical data input/output (I/O), two- and three-dimensional (2D/3D) rendering of chemical structure, stereo identification, ring perception, structure comparison, and substructure search. Encapsulated widgets to display and edit chemical structures directly in web context are also supplied. Developed with web standards, the toolkit is ideal for building chemoinformatics applications over the Internet. Moreover, it is highly platform-independent and can also be used in desktop or mobile environments. Some initial applications, such as plugins for inputting chemical structures on the web and uses in chemistry education, have been developed based on the toolkit.

  6. Inner and Outer Recursive Neural Networks for Chemoinformatics Applications.

    PubMed

    Urban, Gregor; Subrahmanya, Niranjan; Baldi, Pierre

    2018-02-26

    Deep learning methods applied to problems in chemoinformatics often require the use of recursive neural networks to handle data with graphical structure and variable size. We present a useful classification of recursive neural network approaches into two classes, the inner and outer approach. The inner approach uses recursion inside the underlying graph, to essentially "crawl" the edges of the graph, while the outer approach uses recursion outside the underlying graph, to aggregate information over progressively longer distances in an orthogonal direction. We illustrate the inner and outer approaches on several examples. More importantly, we provide open-source implementations [available at www.github.com/Chemoinformatics/InnerOuterRNN and cdb.ics.uci.edu ] for both approaches in Tensorflow which can be used in combination with training data to produce efficient models for predicting the physical, chemical, and biological properties of small molecules.

  7. Simultaneous virtual prediction of anti-Escherichia coli activities and ADMET profiles: A chemoinformatic complementary approach for high-throughput screening.

    PubMed

    Speck-Planche, Alejandro; Cordeiro, M N D S

    2014-02-10

    Escherichia coli remains one of the principal pathogens that cause nosocomial infections, medical conditions that are increasingly common in healthcare facilities. E. coli is intrinsically resistant to many antibiotics, and multidrug-resistant strains have emerged recently. Chemoinformatics has been a great ally of experimental methodologies such as high-throughput screening, playing an important role in the discovery of effective antibacterial agents. However, there is no approach that can design safer anti-E. coli agents, because of the multifactorial nature and complexity of bacterial diseases and the lack of desirable ADMET (absorption, distribution, metabolism, elimination, and toxicity) profiles as a major cause of disapproval of drugs. In this work, we introduce the first multitasking model based on quantitative-structure biological effect relationships (mtk-QSBER) for simultaneous virtual prediction of anti-E. coli activities and ADMET properties of drugs and/or chemicals under many experimental conditions. The mtk-QSBER model was developed from a large and heterogeneous data set of more than 37800 cases, exhibiting overall accuracies of >95% in both training and prediction (validation) sets. The utility of our mtk-QSBER model was demonstrated by performing virtual prediction of properties for the investigational drug avarofloxacin (AVX) under 260 different experimental conditions. Results converged with the experimental evidence, confirming the remarkable anti-E. coli activities and safety of AVX. Predictions also showed that our mtk-QSBER model can be a promising computational tool for virtual screening of desirable anti-E. coli agents, and this chemoinformatic approach could be extended to the search for safer drugs with defined pharmacological activities.

  8. Machine learning methods in chemoinformatics

    PubMed Central

    Mitchell, John B O

    2014-01-01

    Machine learning algorithms are generally developed in computer science or adjacent disciplines and find their way into chemical modeling by a process of diffusion. Though particular machine learning methods are popular in chemoinformatics and quantitative structure–activity relationships (QSAR), many others exist in the technical literature. This discussion is methods-based and focused on some algorithms that chemoinformatics researchers frequently use. It makes no claim to be exhaustive. We concentrate on methods for supervised learning, predicting the unknown property values of a test set of instances, usually molecules, based on the known values for a training set. Particularly relevant approaches include Artificial Neural Networks, Random Forest, Support Vector Machine, k-Nearest Neighbors and naïve Bayes classifiers. How to cite this article: WIREs Comput Mol Sci 2014, 4:468–481. doi:10.1002/wcms.1183 PMID:25285160
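
    A tiny sketch of one of the supervised methods named above: a k-nearest-neighbours classifier using the Jaccard (Tanimoto) distance on binary fingerprints; fingerprints and labels are random stand-ins:

    ```python
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(42)
    X_train = rng.random((100, 256)) < 0.1      # toy binary fingerprints
    y_train = rng.integers(0, 2, size=100)      # active / inactive labels

    knn = KNeighborsClassifier(n_neighbors=5, metric="jaccard", algorithm="brute")
    knn.fit(X_train, y_train)

    X_test = rng.random((3, 256)) < 0.1
    print(knn.predict(X_test))
    ```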

  9. Analysis of commercial and public bioactivity databases.

    PubMed

    Tiikkainen, Pekka; Franke, Lutz

    2012-02-27

    Activity data for small molecules are invaluable in chemoinformatics. Various bioactivity databases exist containing detailed information of target proteins and quantitative binding data for small molecules extracted from journals and patents. In the current work, we have merged several public and commercial bioactivity databases into one bioactivity metabase. The molecular presentation, target information, and activity data of the vendor databases were standardized. The main motivation of the work was to create a single relational database which allows fast and simple data retrieval by in-house scientists. Second, we wanted to know the amount of overlap between databases by commercial and public vendors to see whether the former contain data complementing the latter. Third, we quantified the degree of inconsistency between data sources by comparing data points derived from the same scientific article cited by more than one vendor. We found that each data source contains unique data which is due to different scientific articles cited by the vendors. When comparing data derived from the same article we found that inconsistencies between the vendors are common. In conclusion, using databases of different vendors is still useful since the data overlap is not complete. It should be noted that this can be partially explained by the inconsistencies and errors in the source data.
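
    A sketch of the kind of merge-and-compare step described above: activity records from two fictional vendors are combined, and data points citing the same article, compound and target are checked for inconsistent values (the column names and the 0.5 log-unit threshold are assumptions):

    ```python
    import pandas as pd

    vendor_a = pd.DataFrame({
        "compound": ["CHEM1", "CHEM2"], "target": ["P12345", "P12345"],
        "pIC50": [7.1, 6.0], "source_doi": ["10.1000/abc", "10.1000/abc"],
    })
    vendor_b = pd.DataFrame({
        "compound": ["CHEM1", "CHEM3"], "target": ["P12345", "P67890"],
        "pIC50": [7.9, 5.5], "source_doi": ["10.1000/abc", "10.1000/xyz"],
    })

    merged = pd.concat([vendor_a.assign(vendor="A"), vendor_b.assign(vendor="B")])

    # Flag data points from the same article/compound/target whose reported
    # values disagree by more than 0.5 log units across vendors.
    spread = merged.groupby(["source_doi", "compound", "target"])["pIC50"].agg(["min", "max"])
    print(spread[spread["max"] - spread["min"] > 0.5])
    ```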

  10. Enumeration of Ring–Chain Tautomers Based on SMIRKS Rules

    PubMed Central

    2015-01-01

    A compound exhibits (prototropic) tautomerism if it can be represented by two or more structures that are related by a formal intramolecular movement of a hydrogen atom from one heavy atom position to another. When the movement of the proton is accompanied by the opening or closing of a ring it is called ring–chain tautomerism. This type of tautomerism is well observed in carbohydrates, but it also occurs in other molecules such as warfarin. In this work, we present an approach that allows for the generation of all ring–chain tautomers of a given chemical structure. Based on Baldwin’s Rules estimating the likelihood of ring closure reactions to occur, we have defined a set of transform rules covering the majority of ring–chain tautomerism cases. The rules automatically detect substructures in a given compound that can undergo a ring–chain tautomeric transformation. Each transformation is encoded in SMIRKS line notation. All work was implemented in the chemoinformatics toolkit CACTVS. We report on the application of our ring–chain tautomerism rules to a large database of commercially available screening samples in order to identify ring–chain tautomers. PMID:25158156
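
    The paper's ring-chain rules are encoded as SMIRKS in CACTVS; as a rough illustration of the general mechanism (encoding a tautomeric transform as a reaction SMARTS and applying it), here is a toy prototropic keto-enol rule run with RDKit. It is not one of the paper's ring-chain rules:

    ```python
    from rdkit import Chem
    from rdkit.Chem import AllChem

    # Toy keto -> enol transform (explicit product bond orders override the
    # reactant bond orders; hydrogen counts are fixed up by sanitization).
    rxn = AllChem.ReactionFromSmarts("[CX4!H0:1][C:2]=[O:3]>>[C:1]=[C:2]-[O:3]")

    ketone = Chem.MolFromSmiles("CC(C)=O")        # acetone
    enols = set()
    for (product,) in rxn.RunReactants((ketone,)):
        Chem.SanitizeMol(product)
        enols.add(Chem.MolToSmiles(product))

    print(enols)   # both symmetric matches give the same enol, prop-1-en-2-ol
    ```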

  11. Molecular scaffold analysis of natural products databases in the public domain.

    PubMed

    Yongye, Austin B; Waddell, Jacob; Medina-Franco, José L

    2012-11-01

    Natural products represent important sources of bioactive compounds in drug discovery efforts. In this work, we compiled five natural products databases available in the public domain and performed a comprehensive chemoinformatic analysis focused on the content and diversity of the scaffolds with an overview of the diversity based on molecular fingerprints. The natural products databases were compared with each other and with a set of molecules obtained from in-house combinatorial libraries, and with a general screening commercial library. It was found that publicly available natural products databases have different scaffold diversity. In contrast to the common concept that larger libraries have the largest scaffold diversity, the largest natural products collection analyzed in this work was not the most diverse. The general screening library showed, overall, the highest scaffold diversity. However, considering the most frequent scaffolds, the general reference library was the least diverse. In general, natural products databases in the public domain showed low molecule overlap. In addition to benzene and acyclic compounds, flavones, coumarins, and flavanones were identified as the most frequent molecular scaffolds across the different natural products collections. The results of this work have direct implications in the computational and experimental screening of natural product databases for drug discovery. © 2012 John Wiley & Sons A/S.
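
    A sketch of the scaffold-extraction step underlying such analyses: Bemis-Murcko scaffolds computed with RDKit and counted across a toy three-compound collection (the compounds are illustrative, not the analyzed databases):

    ```python
    from collections import Counter
    from rdkit.Chem.Scaffolds import MurckoScaffold

    collection = [
        "O=C1C=C(c2ccccc2)Oc2ccccc21",          # flavone
        "O=C1C=Cc2ccccc2O1",                    # coumarin
        "O=C1C=C(c2ccc(O)cc2)Oc2ccccc21",       # a hydroxylated flavone analogue
    ]

    scaffolds = Counter(
        MurckoScaffold.MurckoScaffoldSmiles(smiles=smi) for smi in collection)

    # The hydroxylated analogue should collapse onto the flavone scaffold.
    print(scaffolds.most_common())
    ```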

  12. FTree query construction for virtual screening: a statistical analysis.

    PubMed

    Gerlach, Christof; Broughton, Howard; Zaliani, Andrea

    2008-02-01

    FTrees (FT) is a known chemoinformatic tool able to condense molecular descriptions into a graph object and to search for actives in large databases using graph similarity. The query graph is classically derived from a known active molecule, or a set of actives, for which a similar compound has to be found. Recently, FT similarity has been extended to fragment space, widening its capabilities. If a user were able to build a knowledge-based FT query from information other than a known active structure, the similarity search could be combined with other, normally separate, fields like de-novo design or pharmacophore searches. With this aim in mind, we performed a comprehensive analysis of several databases in terms of FT description and provide a basic statistical analysis of the FT spaces so far at hand. Vendors' catalogue collections and MDDR as a source of potential or known "actives", respectively, have been used. With the results reported herein, a set of ranges, mean values and standard deviations for several query parameters are presented in order to set a reference guide for the users. Applications on how to use this information in FT query building are also provided, using a newly built 3D-pharmacophore from 57 5HT-1F agonists and a published one which was used for virtual screening for tRNA-guanine transglycosylase (TGT) inhibitors.

  13. FTree query construction for virtual screening: a statistical analysis

    NASA Astrophysics Data System (ADS)

    Gerlach, Christof; Broughton, Howard; Zaliani, Andrea

    2008-02-01

    FTrees (FT) is a known chemoinformatic tool able to condense molecular descriptions into a graph object and to search for actives in large databases using graph similarity. The query graph is classically derived from a known active molecule, or a set of actives, for which a similar compound has to be found. Recently, FT similarity has been extended to fragment space, widening its capabilities. If a user were able to build a knowledge-based FT query from information other than a known active structure, the similarity search could be combined with other, normally separate, fields like de-novo design or pharmacophore searches. With this aim in mind, we performed a comprehensive analysis of several databases in terms of FT description and provide a basic statistical analysis of the FT spaces so far at hand. Vendors' catalogue collections and MDDR as a source of potential or known "actives", respectively, have been used. With the results reported herein, a set of ranges, mean values and standard deviations for several query parameters are presented in order to set a reference guide for the users. Applications on how to use this information in FT query building are also provided, using a newly built 3D-pharmacophore from 57 5HT-1F agonists and a published one which was used for virtual screening for tRNA-guanine transglycosylase (TGT) inhibitors.

  14. Sesquiterpene lactone mix as a diagnostic tool for Asteraceae allergic contact dermatitis: chemical explanation for its poor performance and Sesquiterpene lactone mix II as a proposed improvement.

    PubMed

    Jacob, Mathias; Brinkmann, Jürgen; Schmidt, Thomas J

    2012-05-01

    Two preparations are currently in use for the diagnosis of allergic contact dermatitis caused by Asteraceae: (i) Sesquiterpene lactone (SL) mix [three pure sesquiterpene lactones (STLs)], whose use has been questioned, owing to an insufficient rate of true-positive results; and (ii) Compositae mix, consisting of five Asteraceae extracts, which is problematic because of lack of standardization and questionable reproducibility. To analyse the reasons for the narrow sensitivity of SL mix from a chemoinformatic point of view, and to propose a solution by rational selection of alternative constituents for a new SL mix II covering a broader cohort of allergic patients. Structural and biological information on allergenic STLs was retrieved from databases and the literature, and molecular modelling and chemoinformatic computations were performed. An explanation for the insufficient hit rate of SL mix is that the three constituents possess extremely similar molecular structures/properties and do not represent well the structural diversity of allergenic STLs. STLs that are known as constituents of Compositae mix plants show a much wider diversity, which explains the higher positive rate. On the basis of their positions in chemical property space, a new collection of STLs that more evenly cover the overall structural diversity spectrum is proposed. SL mix II is likely to detect a larger number of patients sensitized to Asteraceae. © 2012 John Wiley & Sons A/S.

  15. Potential Impact and Study Considerations of Metabolomics in Cardiovascular Health and Disease: A Scientific Statement From the American Heart Association.

    PubMed

    Cheng, Susan; Shah, Svati H; Corwin, Elizabeth J; Fiehn, Oliver; Fitzgerald, Robert L; Gerszten, Robert E; Illig, Thomas; Rhee, Eugene P; Srinivas, Pothur R; Wang, Thomas J; Jain, Mohit

    2017-04-01

    Through the measure of thousands of small-molecule metabolites in diverse biological systems, metabolomics now offers the potential for new insights into the factors that contribute to complex human diseases such as cardiovascular disease. Targeted metabolomics methods have already identified new molecular markers and metabolomic signatures of cardiovascular disease risk (including branched-chain amino acids, select unsaturated lipid species, and trimethylamine-N-oxide), thus in effect linking diverse exposures such as those from dietary intake and the microbiota with cardiometabolic traits. As technologies for metabolomics continue to evolve, the depth and breadth of small-molecule metabolite profiling in complex systems continue to advance rapidly, along with prospects for ongoing discovery. Current challenges facing the field of metabolomics include scaling throughput and technical capacity for metabolomics approaches, bioinformatic and chemoinformatic tools for handling large-scale metabolomics data, methods for elucidating the biochemical structure and function of novel metabolites, and strategies for determining the true clinical relevance of metabolites observed in association with cardiovascular disease outcomes. Progress made in addressing these challenges will allow metabolomics the potential to substantially affect diagnostics and therapeutics in cardiovascular medicine. © 2017 American Heart Association, Inc.

  16. Large-Scale 1:1 Computing Initiatives: An Open Access Database

    ERIC Educational Resources Information Center

    Richardson, Jayson W.; McLeod, Scott; Flora, Kevin; Sauers, Nick J.; Kannan, Sathiamoorthy; Sincar, Mehmet

    2013-01-01

    This article details the spread and scope of large-scale 1:1 computing initiatives around the world. What follows is a review of the existing literature around 1:1 programs followed by a description of the large-scale 1:1 database. Main findings include: 1) the XO and the Classmate PC dominate large-scale 1:1 initiatives; 2) if professional…

  17. Large scale study of multiple-molecule queries

    PubMed Central

    2009-01-01

    Background: In ligand-based screening, as well as in other chemoinformatics applications, one seeks to effectively search large repositories of molecules in order to retrieve molecules that are similar typically to a single molecule lead. However, in some case, multiple molecules from the same family are available to seed the query and search for other members of the same family. Multiple-molecule query methods have been less studied than single-molecule query methods. Furthermore, the previous studies have relied on proprietary data and sometimes have not used proper cross-validation methods to assess the results. In contrast, here we develop and compare multiple-molecule query methods using several large publicly available data sets and background. We also create a framework based on a strict cross-validation protocol to allow unbiased benchmarking for direct comparison in future studies across several performance metrics. Results: Fourteen different multiple-molecule query methods were defined and benchmarked using: (1) 41 publicly available data sets of related molecules with similar biological activity; and (2) publicly available background data sets consisting of up to 175,000 molecules randomly extracted from the ChemDB database and other sources. Eight of the fourteen methods were parameter free, and six of them fit one or two free parameters to the data using a careful cross-validation protocol. All the methods were assessed and compared for their ability to retrieve members of the same family against the background data set by using several performance metrics including the Area Under the Accumulation Curve (AUAC), Area Under the Curve (AUC), F1-measure, and BEDROC metrics. Consistent with the previous literature, the best parameter-free methods are the MAX-SIM and MIN-RANK methods, which score a molecule to a family by the maximum similarity, or minimum ranking, obtained across the family. One new parameterized method introduced in this study and two previously defined methods, the Exponential Tanimoto Discriminant (ETD), the Tanimoto Power Discriminant (TPD), and the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data. Conclusion: Fourteen methods for multiple-molecule querying of chemical databases, including novel methods, (ETD) and (TPD), are validated using publicly available data sets, standard cross-validation protocols, and established metrics. The best results are obtained with ETD, TPD, BKD, MAX-SIM, and MIN-RANK. These results can be replicated and compared with the results of future studies using data freely downloadable from http://cdb.ics.uci.edu/. PMID:20298525
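
    A toy sketch of the two best parameter-free schemes named above: MAX-SIM scores a database molecule by its maximum Tanimoto similarity to any query-family member, MIN-RANK by the best (minimum) rank it achieves over the per-member ranked lists; the bit-set fingerprints are invented:

    ```python
    def tanimoto(a, b):
        inter = len(a & b)
        return inter / (len(a) + len(b) - inter) if (a or b) else 0.0

    family = [{1, 2, 3, 8}, {2, 3, 4}, {3, 4, 5, 9}]            # query molecules
    database = {"d1": {1, 2, 3}, "d2": {4, 5, 9}, "d3": {6, 7}}

    # MAX-SIM: best similarity to any family member.
    max_sim = {m: max(tanimoto(q, fp) for q in family) for m, fp in database.items()}

    # MIN-RANK: best rank achieved over the per-query-member rankings.
    ranks = {m: [] for m in database}
    for q in family:
        ordered = sorted(database, key=lambda m: tanimoto(q, database[m]), reverse=True)
        for rank, m in enumerate(ordered, start=1):
            ranks[m].append(rank)
    min_rank = {m: min(r) for m, r in ranks.items()}

    print(sorted(max_sim.items(), key=lambda kv: -kv[1]))
    print(sorted(min_rank.items(), key=lambda kv: kv[1]))
    ```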

  18. Using relational databases for improved sequence similarity searching and large-scale genomic analyses.

    PubMed

    Mackey, Aaron J; Pearson, William R

    2004-10-01

    Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. Relational databases are essential for management and analysis of large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large-scale genomic analyses of homology-related data. This unit describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, how to extend and use seqdb_demo for the storage of sequence similarity search results and making use of various kinds of stored search results to address aspects of comparative genomic analysis.

  19. Sequential Application of Ligand and Structure Based Modeling Approaches to Index Chemicals for Their hH4R Antagonism

    PubMed Central

    Basile, Livia; Milardi, Danilo; Zeidan, Mouhammed; Raiyn, Jamal; Guccione, Salvatore; Rayan, Anwar

    2014-01-01

    The human histamine H4 receptor (hH4R), a member of the G-protein coupled receptors (GPCR) family, is an increasingly attractive drug target. It plays a key role in many cell pathways and many hH4R ligands are studied for the treatment of several inflammatory, allergic and autoimmune disorders, as well as for analgesic activity. Due to the challenging difficulties in the experimental elucidation of hH4R structure, virtual screening campaigns are normally run on homology based models. However, a wealth of information about the chemical properties of GPCR ligands has also accumulated over the last few years and an appropriate combination of this ligand-based knowledge with structure-based molecular modeling studies emerges as a promising strategy for computer-assisted drug design. Here, two chemoinformatics techniques, the Intelligent Learning Engine (ILE) and Iterative Stochastic Elimination (ISE) approach, were used to index chemicals for their hH4R bioactivity. An application of the prediction model to an external test set composed of more than 160 hH4R antagonists picked from the chEMBL database gave an enrichment factor of 16.4. A virtual high throughput screening on the ZINC database was carried out, picking ∼4000 chemicals highly indexed as H4R antagonists' candidates. Next, a series of 3D models of hH4R were generated by molecular modeling and molecular dynamics simulations performed in fully atomistic lipid membranes. The efficacy of the hH4R 3D models in discriminating between actives and non-actives was checked and the 3D model with the best performance was chosen for further docking studies performed on the focused library. The output of these docking studies was a consensus library of 11 highly active scored drug candidates. Our findings suggest that a sequential combination of ligand-based chemoinformatics approaches with structure-based ones has the potential to improve the success rate in discovering new biologically active GPCR drugs and increase the enrichment factors in a synergistic manner. PMID:25330207

  20. Sequential application of ligand and structure based modeling approaches to index chemicals for their hH4R antagonism.

    PubMed

    Pappalardo, Matteo; Shachaf, Nir; Basile, Livia; Milardi, Danilo; Zeidan, Mouhammed; Raiyn, Jamal; Guccione, Salvatore; Rayan, Anwar

    2014-01-01

    The human histamine H4 receptor (hH4R), a member of the G-protein coupled receptor (GPCR) family, is an increasingly attractive drug target. It plays a key role in many cell pathways, and many hH4R ligands are studied for the treatment of several inflammatory, allergic and autoimmune disorders, as well as for analgesic activity. Due to the challenging difficulties in the experimental elucidation of the hH4R structure, virtual screening campaigns are normally run on homology-based models. However, a wealth of information about the chemical properties of GPCR ligands has also accumulated over the last few years, and an appropriate combination of this ligand-based knowledge with structure-based molecular modeling emerges as a promising strategy for computer-assisted drug design. Here, two chemoinformatics techniques, the Intelligent Learning Engine (ILE) and the Iterative Stochastic Elimination (ISE) approach, were used to index chemicals for their hH4R bioactivity. Application of the prediction model to an external test set composed of more than 160 hH4R antagonists picked from the ChEMBL database gave an enrichment factor of 16.4. A virtual high-throughput screening of the ZINC database was carried out, picking ∼4000 chemicals highly indexed as hH4R antagonist candidates. Next, a series of 3D models of hH4R were generated by molecular modeling and molecular dynamics simulations performed in fully atomistic lipid membranes. The efficacy of the hH4R 3D models in discriminating between actives and non-actives was checked, and the 3D model with the best performance was chosen for further docking studies performed on the focused library. The output of these docking studies was a consensus library of 11 highly scored drug candidates. Our findings suggest that a sequential combination of ligand-based chemoinformatics approaches with structure-based ones has the potential to improve the success rate in discovering new biologically active GPCR drugs and increase the enrichment factors in a synergistic manner.

  1. Design and implementation of a distributed large-scale spatial database system based on J2EE

    NASA Astrophysics Data System (ADS)

    Gong, Jianya; Chen, Nengcheng; Zhu, Xinyan; Zhang, Xia

    2003-03-01

    With the increasing maturity of distributed object technology, CORBA, .NET and EJB are widely used in the traditional IT field. However, the theory and practice of distributed spatial databases need further improvement because of the contradictions between large-scale spatial data and limited network bandwidth, and between transitory sessions and long transaction processing. Differences and trends among CORBA, .NET and EJB are discussed in detail; afterwards, the concept, architecture and characteristics of a distributed large-scale seamless spatial database system based on J2EE are presented, comprising a GIS client application, a web server, a GIS application server and a spatial data server. Moreover, the design and implementation of the GIS client application components based on JavaBeans, the GIS engine based on servlets, and the GIS application server based on GIS Enterprise JavaBeans (containing session beans and entity beans) are explained. In addition, experiments on the relation between spatial data volume and response time under different conditions were conducted, which prove that a distributed spatial database system based on J2EE can be used to manage, distribute and share large-scale spatial data on the Internet. Lastly, a distributed large-scale seamless image database based on the Internet is presented.

  2. Prioritization of anti-malarial hits from nature: chemo-informatic profiling of natural products with in vitro antiplasmodial activities and currently registered anti-malarial drugs.

    PubMed

    Egieyeh, Samuel Ayodele; Syce, James; Malan, Sarel F; Christoffels, Alan

    2016-01-29

    A large number of natural products have shown in vitro antiplasmodial activities. Early identification and prioritization of those natural products with potential for a novel mechanism of action, desirable pharmacokinetics and likelihood of development into drugs is advantageous. Chemo-informatic profiling of these natural products was conducted and compared to that of currently registered anti-malarial drugs (CRAD). Natural products with in vitro antiplasmodial activities (NAA) were compiled from various sources. These natural products were sub-divided into four groups based on inhibitory concentration (IC50). Key molecular descriptors and physicochemical properties were computed for these compounds, and analysis of variance was used to assess statistical significance among the sets of compounds. Molecular similarity analysis, estimation of drug-likeness, in silico pharmacokinetic profiling, and exploration of the structure-activity landscape were also carried out on these sets of compounds. A total of 1040 natural products were selected and a total of 13 molecular descriptors were analysed. Significant differences were observed among the sub-groups of NAA and CRAD for at least 11 of the molecular descriptors, including the number of hydrogen bond donors and acceptors, molecular weight, polar and hydrophobic surface areas, chiral centres, oxygen and nitrogen atoms, and shape index. The remaining molecular descriptors, including clogP, number of rotatable bonds and number of aromatic rings, did not show any significant difference when comparing the two compound sets. Molecular similarity and chemical space analysis identified natural products that were structurally diverse from CRAD. Prediction of the pharmacokinetic properties and drug-likeness of these natural products identified over 50% with desirable drug-like properties. Nearly 70% of all natural products were identified as potentially promiscuous compounds. Structure-activity landscape analysis highlighted compound pairs that form 'activity cliffs'. In all, prioritization strategies for the NAA were proposed. Chemo-informatic profiling of NAA and CRAD has produced a wealth of information that may guide decisions and facilitate anti-malarial drug development from natural products. Articulation of this information within an interactive data-mining environment led to a prioritized list of NAA.
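
    The descriptor-level profiling described above can be sketched with the open-source RDKit toolkit (an assumption made for illustration; the paper does not state which software was used). The two SMILES below are illustrative structures, not molecules drawn from the study's data set.

      from rdkit import Chem
      from rdkit.Chem import Descriptors

      smiles = {
          "chloroquine (a CRAD, flat)":       "CCN(CC)CCCC(C)NC1=CC=NC2=CC(=CC=C21)Cl",
          "cinchona alkaloid (illustrative)": "COc1ccc2nccc(C(O)C3CC4CCN3CC4C=C)c2c1",
      }

      for name, smi in smiles.items():
          mol = Chem.MolFromSmiles(smi)
          if mol is None:
              continue  # skip structures that fail to parse
          print(name,
                round(Descriptors.MolWt(mol), 1),        # molecular weight
                Descriptors.NumHDonors(mol),             # hydrogen bond donors
                Descriptors.NumHAcceptors(mol),          # hydrogen bond acceptors
                round(Descriptors.TPSA(mol), 1),         # topological polar surface area
                round(Descriptors.MolLogP(mol), 2))      # calculated logP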

  3. Large Scale Landslide Database System Established for the Reservoirs in Southern Taiwan

    NASA Astrophysics Data System (ADS)

    Tsai, Tsai-Tsung; Tsai, Kuang-Jung; Shieh, Chjeng-Lun

    2017-04-01

    Typhoon Morakot, which seriously struck southern Taiwan, awakened public awareness of large-scale landslide disasters. Large-scale landslide disasters produce large quantities of sediment that negatively affect the operating functions of reservoirs. In order to reduce the risk of these disasters within the study area, the establishment of a database for hazard mitigation and disaster prevention is necessary. Real-time data and extensive archives of engineering data, environmental information, photos, and video not only help people make appropriate decisions, but also raise the major concern of how such material should be processed and turned into added value. The study defined basic data formats and standards from the various types of data collected about these reservoirs and then provided a management platform based on those formats and standards. Meanwhile, for practicality and convenience, the large-scale landslide disaster database system was built with the ability to both provide and receive information, so that users can access it from different types of devices. Because information technology evolves extremely quickly, even the most modern system may become outdated at any time; in order to provide long-term service, the system therefore allows user-defined data formats and standards as well as a user-defined system structure. The system established in this study is based on the HTML5 standard and uses responsive web design, which makes it easy for users to operate and further develop this large-scale landslide disaster database system.

  4. Using SQL Databases for Sequence Similarity Searching and Analysis.

    PubMed

    Pearson, William R; Mackey, Aaron J

    2017-09-13

    Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc.

  5. Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information

    NASA Astrophysics Data System (ADS)

    Sushko, Iurii; Novotarskyi, Sergii; Körner, Robert; Pandey, Anil Kumar; Rupp, Matthias; Teetz, Wolfram; Brandmaier, Stefan; Abdelaziz, Ahmed; Prokopenko, Volodymyr V.; Tanchuk, Vsevolod Y.; Todeschini, Roberto; Varnek, Alexandre; Marcou, Gilles; Ertl, Peter; Potemkin, Vladimir; Grishina, Maria; Gasteiger, Johann; Schwab, Christof; Baskin, Igor I.; Palyulin, Vladimir A.; Radchenko, Eugene V.; Welsh, William J.; Kholodovych, Vladyslav; Chekmarev, Dmitriy; Cherkasov, Artem; Aires-de-Sousa, Joao; Zhang, Qing-You; Bender, Andreas; Nigsch, Florian; Patiny, Luc; Williams, Antony; Tkachenko, Valery; Tetko, Igor V.

    2011-06-01

    The Online Chemical Modeling Environment (OCHEM) is a web-based platform that aims to automate and simplify the typical steps required for QSAR modeling. The platform consists of two major subsystems: the database of experimental measurements and the modeling framework. A user-contributed database contains a set of tools for easy input, search and modification of thousands of records. The OCHEM database is based on the wiki principle and focuses primarily on the quality and verifiability of the data. The database is tightly integrated with the modeling framework, which supports all the steps required to create a predictive model: data search, calculation and selection of a vast variety of molecular descriptors, application of machine learning methods, validation, analysis of the model and assessment of the applicability domain. Compared to other similar systems, OCHEM is not intended to re-implement existing tools or models but rather to invite the original authors to contribute their results, make them publicly available, share them with other users and become members of the growing research community. Our intention is to make OCHEM a widely used platform for performing QSPR/QSAR studies online and sharing them with other users on the Web. The ultimate goal of OCHEM is to collect all possible chemoinformatics tools within one simple, reliable and user-friendly resource. OCHEM is free for web users and is available online at http://www.ochem.eu.
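
    A minimal sketch of the descriptor-to-model-to-validation loop that a platform such as OCHEM automates is shown below (synthetic data stand in for real descriptors and measurements; scikit-learn is assumed purely for illustration).

      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(0)
      X = rng.normal(size=(300, 32))     # 300 molecules x 32 calculated descriptors
      # Synthetic active/inactive labels driven by two of the descriptors plus noise.
      y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)

      model = RandomForestClassifier(n_estimators=200, random_state=0)
      # Five-fold cross-validation as a basic estimate of predictive performance.
      acc = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean()
      print(f"cross-validated balanced accuracy: {acc:.2f}")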

  6. Using Large-Scale Databases in Evaluation: Advances, Opportunities, and Challenges

    ERIC Educational Resources Information Center

    Penuel, William R.; Means, Barbara

    2011-01-01

    Major advances in the number, capabilities, and quality of state, national, and transnational databases have opened up new opportunities for evaluators. Both large-scale data sets collected for administrative purposes and those collected by other researchers can provide data for a variety of evaluation-related activities. These include (a)…

  7. Design and Development of ChemInfoCloud: An Integrated Cloud Enabled Platform for Virtual Screening.

    PubMed

    Karthikeyan, Muthukumarasamy; Pandit, Deepak; Bhavasar, Arvind; Vyas, Renu

    2015-01-01

    The power of cloud computing and distributed computing has been harnessed to handle the vast and heterogeneous data that must be processed in any virtual screening protocol. A cloud computing platform, ChemInfoCloud, was built and integrated with several chemoinformatics and bioinformatics tools. The robust engine performs the core chemoinformatics tasks of lead generation, lead optimisation and property prediction in a fast and efficient manner. It also provides bioinformatics functionality, including sequence alignment, active-site pose prediction and protein-ligand docking. Text mining, NMR chemical shift (1H, 13C) prediction and reaction fingerprint generation modules for efficient lead discovery are also implemented in this platform. We have developed an integrated problem-solving cloud environment for virtual screening studies that also provides workflow management, better usability and interaction with end users using container-based virtualization (OpenVZ).

  8. Orthographic and Phonological Neighborhood Databases across Multiple Languages.

    PubMed

    Marian, Viorica

    2017-01-01

    The increased globalization of science and technology and the growing number of bilinguals and multilinguals in the world have made research with multiple languages a mainstay for scholars who study human function and especially those who focus on language, cognition, and the brain. Such research can benefit from large-scale databases and online resources that describe and measure lexical, phonological, orthographic, and semantic information. The present paper discusses currently-available resources and underscores the need for tools that enable measurements both within and across multiple languages. A general review of language databases is followed by a targeted introduction to databases of orthographic and phonological neighborhoods. A specific focus on CLEARPOND illustrates how databases can be used to assess and compare neighborhood information across languages, to develop research materials, and to provide insight into broad questions about language. As an example of how using large-scale databases can answer questions about language, a closer look at neighborhood effects on lexical access reveals that not only orthographic, but also phonological neighborhoods can influence visual lexical access both within and across languages. We conclude that capitalizing upon large-scale linguistic databases can advance, refine, and accelerate scientific discoveries about the human linguistic capacity.
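
    One concrete quantity that neighborhood databases such as CLEARPOND tabulate is the orthographic neighborhood size of a word. The sketch below (a toy lexicon and the classic one-letter-substitution definition of a neighbor; CLEARPOND's own corpora and neighbor definitions are richer) shows how such counts are computed.

      def is_substitution_neighbor(w1: str, w2: str) -> bool:
          """True if the words have equal length and differ at exactly one position."""
          return len(w1) == len(w2) and sum(a != b for a, b in zip(w1, w2)) == 1

      lexicon = ["cat", "bat", "hat", "cot", "cast", "dog", "cog"]

      def neighborhood(word: str, words: list) -> list:
          """All orthographic neighbors of `word` found in `words`."""
          return [w for w in words if is_substitution_neighbor(word, w)]

      print(neighborhood("cat", lexicon))   # ['bat', 'hat', 'cot']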

  9. In silico genotoxicity of coumarins: application of the Phenol-Explorer food database to functional food science.

    PubMed

    Guardado Yordi, E; Matos, M J; Pérez Martínez, A; Tornes, A C; Santana, L; Molina, E; Uriarte, E

    2017-08-01

    Coumarins are a group of phytochemicals that may be beneficial or harmful to health depending on their type and dosage and the matrix that contains them. Some of these compounds have been proven to display pro-oxidant and clastogenic activities. Therefore, in the current work, we have studied the coumarins that are present in food sources extracted from the Phenol-Explorer database in order to predict their clastogenic activity and identify the structure-activity relationships and genotoxic structural alerts using alternative methods in the field of computational toxicology. It was necessary to compile information on the type and amount of coumarins in different food sources through the analysis of databases of food composition available online. A virtual screening using a clastogenic model and different software, such as MODESLAB, ChemDraw and STATISTIC, was performed. As a result, a table of food composition was prepared and qualitative information from this data was extracted. The virtual screening showed that the esterified substituents inactivate molecules, while the methoxyl and hydroxyl substituents contribute to their activity and constitute, together with the basic structures of the studied subclasses, clastogenic structural alerts. Chemical subclasses of simple coumarins and furocoumarins were classified as active (xanthotoxin, isopimpinellin, esculin, scopoletin, scopolin and bergapten). In silico genotoxicity was mainly predicted for coumarins found in beer, sherry, dried parsley, fresh parsley and raw celery stalks. The results obtained can be interesting for the future design of functional foods and dietary supplements. These studies constitute a reference for the genotoxic chemoinformatic analysis of bioactive compounds present in databases of food composition.

  10. sc-PDB: an annotated database of druggable binding sites from the Protein Data Bank.

    PubMed

    Kellenberger, Esther; Muller, Pascal; Schalon, Claire; Bret, Guillaume; Foata, Nicolas; Rognan, Didier

    2006-01-01

    The sc-PDB is a collection of 6 415 three-dimensional structures of binding sites found in the Protein Data Bank (PDB). Binding sites were extracted from all high-resolution crystal structures in which a complex between a protein cavity and a small-molecular-weight ligand could be identified. Importantly, ligands are considered from a pharmacological and not a structural point of view. Therefore, solvents, detergents, and most metal ions are not stored in the sc-PDB. Ligands are classified into four main categories: nucleotides (< 4-mer), peptides (< 9-mer), cofactors, and organic compounds. The corresponding binding site is formed by all protein residues (including amino acids, cofactors, and important metal ions) with at least one atom within 6.5 angstroms of any ligand atom. The database was carefully annotated by browsing several protein databases (PDB, UniProt, and GO) and storing, for every sc-PDB entry, the following features: protein name, function, source, domain and mutations, ligand name, and structure. The repository of ligands has also been archived by diversity analysis of molecular scaffolds, and several chemoinformatics descriptors were computed to better understand the chemical space covered by stored ligands. The sc-PDB may be used for several purposes: (i) screening a collection of binding sites for predicting the most likely target(s) of any ligand, (ii) analyzing the molecular similarity between different cavities, and (iii) deriving rules that describe the relationship between ligand pharmacophoric points and active-site properties. The database is periodically updated and accessible on the web at http://bioinfo-pharma.u-strasbg.fr/scPDB/.
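
    The 6.5-angstrom rule used above to define a binding site can be reproduced with standard structural bioinformatics tooling. The sketch below uses Biopython (an assumption made for illustration; it is not necessarily the software used to build sc-PDB) and a placeholder PDB file name.

      from Bio.PDB import PDBParser, NeighborSearch

      structure = PDBParser(QUIET=True).get_structure("complex", "complex.pdb")  # placeholder file
      model = structure[0]

      protein_atoms, ligand_atoms = [], []
      for residue in model.get_residues():
          het_flag = residue.id[0]   # " " = standard residue, "W" = water, "H_xxx" = hetero group
          if het_flag == " ":
              protein_atoms.extend(residue.get_atoms())
          elif het_flag.startswith("H_"):
              ligand_atoms.extend(residue.get_atoms())

      ns = NeighborSearch(protein_atoms)
      site = set()
      for atom in ligand_atoms:
          # level="R" returns whole residues having at least one atom inside the 6.5 A sphere.
          for res in ns.search(atom.coord, 6.5, level="R"):
              site.add((res.get_resname(), res.id[1]))

      print(sorted(site, key=lambda r: r[1]))   # binding-site residues by residue number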

  11. [Privacy and public benefit in using large scale health databases].

    PubMed

    Yamamoto, Ryuichi

    2014-01-01

    In Japan, large-scale health databases have been constructed over the past few years, such as the national claims and health checkup database (NDB) and the Japanese Sentinel project. However, there are legal issues in striking an adequate balance between privacy and public benefit when using such databases. The NDB operates under the act on health care for the elderly, but this act says nothing about using the database for the general public benefit. Therefore, researchers who use this database are forced to devote considerable attention to anonymization and information security, which may hinder the research work itself. The Japanese Sentinel project is a national project for detecting adverse drug reactions using large-scale distributed clinical databases of large hospitals. Although patients give broad consent in advance for such uses for the public good, the use of insufficiently anonymized data is still under discussion. Generally speaking, research conducted for public benefit will not infringe patients' privacy, but vague and complex legislative requirements for personal data protection may hinder such research. Medical science does not progress without clinical information; therefore, legislation that is simple and clear for both researchers and patients is strongly required. In Japan, a specific act for balancing privacy and public benefit is now under discussion. The author recommends that researchers, including those in the field of pharmacology, pay attention to, participate in the discussion of, and make suggestions for such acts and regulations.

  12. Simultaneous modeling of antimycobacterial activities and ADMET profiles: a chemoinformatic approach to medicinal chemistry.

    PubMed

    Speck-Planche, Alejandro; Cordeiro, M N D S

    2013-01-01

    Mycobacteria represent a group of pathogens which cause serious diseases in mammals, including the lethal tuberculosis (Mycobacterium tuberculosis). Beyond the mortality of this community-acquired and nosocomial disease, other mycobacteria may cause similar infections, acting as dangerous opportunistic pathogens. Additionally, resistant strains belonging to Mycobacterium spp. have emerged. Thus, the design of novel antimycobacterial agents is a challenge for the scientific community. In this sense, chemoinformatics has played a vital role in drug discovery, helping to rationalize chemical synthesis, as well as the evaluation of pharmacological and ADMET (absorption, distribution, metabolism, excretion, toxicity) profiles in both medicinal and pharmaceutical chemistry. Until now, there has been no in silico methodology able to assess antimycobacterial activity and ADMET properties at the same time. This work introduces the first multitasking model based on quantitative structure-biological effect relationships (mtk-QSBER) for the simultaneous prediction of antimycobacterial activities and ADMET profiles of drugs/chemicals under diverse experimental conditions. The mtk-QSBER model was constructed using a large and heterogeneous dataset of compounds (more than 34600 cases), displaying accuracies higher than 90% in both training and prediction sets. To illustrate the utility of the present model, several molecular fragments were selected and their contributions to different biological effects were calculated and analyzed. Also, many properties of the investigational drug TMC-207 were predicted. The results confirmed that TMC-207 is a promising antimycobacterial drug, and this study demonstrates that the present mtk-QSBER model can be used for the virtual screening of safer antimycobacterial agents.

  13. Ice Accretion Test Results for Three Large-Scale Swept-Wing Models in the NASA Icing Research Tunnel

    NASA Technical Reports Server (NTRS)

    Broeren, Andy; Potapczuk, Mark; Lee, Sam; Malone, Adam; Paul, Ben; Woodard, Brian

    2016-01-01

    The design and certification of modern transport airplanes for flight in icing conditions increasingly relies on three-dimensional numerical simulation tools for ice accretion prediction. There is currently no publicly available, high-quality ice accretion database upon which to evaluate the performance of icing simulation tools for large-scale swept wings that are representative of modern commercial transport airplanes. The purpose of this presentation is to present the results of a series of icing wind tunnel test campaigns whose aim was to provide an ice accretion database for large-scale, swept wings.

  14. Computational exploration of the chemical structure space of possible reverse tricarboxylic acid cycle constituents.

    PubMed

    Meringer, Markus; Cleaves, H James

    2017-12-13

    The reverse tricarboxylic acid (rTCA) cycle has been explored from various standpoints as an idealized primordial metabolic cycle. Its simplicity and apparent ubiquity in diverse organisms across the tree of life have been used to argue for its antiquity and its optimality. In 2000 it was proposed that chemoinformatics approaches support some of these views. Specifically, defined queries of the Beilstein database showed that the molecules of the rTCA cycle are heavily represented in such compound databases. We explore here the chemical structure "space", i.e. the set of organic compounds which possess some minimal set of defining characteristics, of the rTCA cycle's intermediates using an exhaustive structure generation method. The rTCA's chemical space, as defined by the original criteria and explored by our method, is some six to seven times larger than originally considered. Although each assumption about what makes the rTCA cycle special limits the possible generative outcomes, there are many unrealized compounds which fulfill these criteria. That these compounds are unrealized could be due to evolutionary frozen accidents or to optimization, though this optimization may also be for systems-level reasons, e.g., the way the pathway and its elements interface with other aspects of metabolism.

  15. Inroads to predict in vivo toxicology-an introduction to the eTOX Project.

    PubMed

    Briggs, Katharine; Cases, Montserrat; Heard, David J; Pastor, Manuel; Pognan, François; Sanz, Ferran; Schwab, Christof H; Steger-Hartmann, Thomas; Sutter, Andreas; Watson, David K; Wichard, Jörg D

    2012-01-01

    There is a widespread awareness that the wealth of preclinical toxicity data that the pharmaceutical industry has generated in recent decades is not exploited as efficiently as it could be. Enhanced data availability for compound comparison ("read-across"), or for data mining to build predictive tools, should lead to a more efficient drug development process and contribute to the reduction of animal use (3Rs principle). In order to achieve these goals, a consortium approach, grouping a number of relevant partners, is required. The eTOX ("electronic toxicity") consortium represents such a project and is a public-private partnership within the framework of the European Innovative Medicines Initiative (IMI). The project aims at the development of in silico prediction systems for organ and in vivo toxicity. The backbone of the project will be a database consisting of preclinical toxicity data for drug compounds or candidates, extracted from previously unpublished legacy reports from thirteen European and European-operation-based pharmaceutical companies. The database will be enhanced by the incorporation of publicly available, high-quality toxicology data. Seven academic institutes and five small-to-medium-sized enterprises (SMEs) contribute their expertise in data gathering, database curation, data mining, chemoinformatics and predictive systems development. The outcome of the project will be a predictive system contributing to early potential hazard identification and risk assessment during the drug development process. The concept and strategy of the eTOX project are described here, together with current achievements and future deliverables.

  16. Large-scale annotation of small-molecule libraries using public databases.

    PubMed

    Zhou, Yingyao; Zhou, Bin; Chen, Kaisheng; Yan, S Frank; King, Frederick J; Jiang, Shumei; Winzeler, Elizabeth A

    2007-01-01

    While many large publicly accessible databases provide excellent annotation for biological macromolecules, the same is not true for small chemical compounds. Commercial data sources likewise lack an annotation interface for large numbers of compounds and tend to be too costly to be widely available to biomedical researchers. Therefore, using annotation information for the selection of lead compounds from a modern-day high-throughput screening (HTS) campaign presently occurs only on a very limited scale. The recent rapid expansion of the NIH PubChem database provides an opportunity to link existing biological databases with compound catalogs and provides relevant information that could potentially improve the information garnered from large-scale screening efforts. Using the 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) as a model, we determined that approximately 4% of the library contained compounds with potential annotation in databases such as PubChem and the World Drug Index (WDI), as well as related databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and ChemIDplus. Furthermore, exact structure match analysis showed that 32% of GNF compounds can be linked to third-party databases via PubChem. We also showed that annotations such as MeSH (medical subject headings) terms can be applied to in-house HTS databases to identify signature biological inhibition profiles of interest as well as to expedite the assay validation process. The automated annotation of thousands of screening hits in batch is becoming feasible and has the potential to play an essential role in the hit-to-lead decision-making process.
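
    The exact-structure linking step described above can be illustrated with PubChem's public PUG REST interface. The sketch below (the query SMILES is illustrative, the endpoint follows the publicly documented PUG REST layout, and error handling is minimal) retrieves the PubChem compound identifiers matching a structure.

      import requests

      smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin, as an illustrative query structure
      url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/"
             f"compound/smiles/{smiles}/cids/JSON")

      resp = requests.get(url, timeout=30)
      resp.raise_for_status()
      cids = resp.json().get("IdentifierList", {}).get("CID", [])
      print("PubChem CIDs matching the query structure:", cids)

    In a batch setting, the returned CIDs can then be joined against annotation sources such as MeSH terms or WDI entries.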

  17. Application of Large-Scale Database-Based Online Modeling to Plant State Long-Term Estimation

    NASA Astrophysics Data System (ADS)

    Ogawa, Masatoshi; Ogai, Harutoshi

    Recently, attention has been drawn to local modeling techniques based on a new idea called “Just-In-Time (JIT) modeling”. To apply JIT modeling online to a large database, “Large-scale database-based Online Modeling (LOM)” has been proposed. LOM is a technique that makes the retrieval of neighboring data more efficient by using both “stepwise selection” and quantization. In order to predict the long-term state of the plant without using future data of the manipulated variables, an Extended Sequential Prediction method of LOM (ESP-LOM) has been proposed. In this paper, LOM and ESP-LOM are introduced.
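
    The core of the JIT/local-modeling idea, retrieving neighbors of the current query from a historical database and fitting a small model on just those records, can be sketched as follows (synthetic data; the sketch deliberately omits LOM's stepwise selection and quantization steps).

      import numpy as np
      from sklearn.neighbors import NearestNeighbors
      from sklearn.linear_model import LinearRegression

      rng = np.random.default_rng(1)
      X_db = rng.uniform(-3, 3, size=(5000, 2))                  # stored plant-state records
      y_db = np.sin(X_db[:, 0]) + 0.5 * X_db[:, 1] ** 2          # stored output variable

      nn = NearestNeighbors(n_neighbors=50).fit(X_db)            # neighbor-retrieval structure

      def jit_predict(x_query: np.ndarray) -> float:
          """Fit a local linear model on the 50 nearest stored records and predict."""
          _, idx = nn.kneighbors(x_query.reshape(1, -1))
          local = LinearRegression().fit(X_db[idx[0]], y_db[idx[0]])
          return float(local.predict(x_query.reshape(1, -1))[0])

      print(jit_predict(np.array([0.5, -1.0])))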

  18. Chemoinformatic expedition of the chemical space of fungal products.

    PubMed

    González-Medina, Mariana; Prieto-Martínez, Fernando D; Naveja, J Jesús; Méndez-Lucio, Oscar; El-Elimat, Tamam; Pearce, Cedric J; Oberlies, Nicholas H; Figueroa, Mario; Medina-Franco, José L

    2016-08-01

    Fungi are valuable resources for bioactive secondary metabolites. However, the chemical space of fungal secondary metabolites has been studied only on a limited basis. Herein, we report a comprehensive chemoinformatic analysis of a unique set of 207 fungal metabolites isolated and characterized in a USA National Cancer Institute funded drug discovery project. Comparison of the molecular complexity of the 207 fungal metabolites with approved anticancer and non-anticancer drugs, compounds in clinical studies, general screening compounds and molecules Generally Recognized as Safe revealed that fungal metabolites have a high degree of complexity. Molecular fingerprints showed that fungal metabolites are as structurally diverse as other natural products and have, in general, drug-like physicochemical properties. Fungal products represent promising candidates to expand the medicinally relevant chemical space. This work is a significant expansion of an analysis reported years ago for a smaller set of compounds from filamentous fungi (less than half of the number included in the present work) using different structural properties.

  19. Combining bioinformatics, chemoinformatics and experimental approaches to design chemical probes: Applications in the field of blood coagulation.

    PubMed

    Villoutreix, B O

    2016-07-01

    Bioinformatics and chemoinformatics approaches contribute to the discovery of novel targets, chemical probes, hits, leads and medicinal drugs. A vast repertoire of computational methods has indeed been reported over the years, and in this review I briefly introduce some concepts and approaches, namely the analysis of potential therapeutic target binding pockets, the preparation of compound collections and virtual screening. An example of application is provided for two proteins acting in the blood coagulation system. Overall, in silico methods have been shown to improve R&D productivity in both academic settings and the private sector when they are integrated in a rational manner with experimental approaches. However, integration of tools and pluridisciplinarity are seldom achieved. Efforts should be made in this direction, as pluridisciplinarity and a true acknowledgment of all the contributing actors along the value chain could enhance innovation and reduce skyrocketing costs. Copyright © 2016 Académie Nationale de Pharmacie. Published by Elsevier Masson SAS. All rights reserved.

  20. Drug repurposing: translational pharmacology, chemistry, computers and the clinic.

    PubMed

    Issa, Naiem T; Byers, Stephen W; Dakshanamurthy, Sivanesan

    2013-01-01

    The process of discovering a pharmacological compound that elicits a desired clinical effect with minimal side effects is a challenge. Prior to the advent of high-performance computing and large-scale screening technologies, drug discovery was largely a serendipitous endeavor, as in the case of thalidomide for erythema nodosum leprosum or cancer drugs in general derived from flora located in far-reaching geographic locations. More recently, de novo drug discovery has become a more rationalized process where drug-target-effect hypotheses are formulated on the basis of already known compounds/protein targets and their structures. Although this approach is hypothesis-driven, the actual success has been very low, contributing to the soaring costs of research and development as well as the diminished pharmaceutical pipeline in the United States. In this review, we discuss the evolution in computational pharmacology as the next generation of successful drug discovery and implementation in the clinic where high-performance computing (HPC) is used to generate and validate drug-target-effect hypotheses completely in silico. The use of HPC would decrease development time and errors while increasing productivity prior to in vitro, animal and human testing. We highlight approaches in chemoinformatics, bioinformatics as well as network biopharmacology to illustrate potential avenues from which to design clinically efficacious drugs. We further discuss the implications of combining these approaches into an integrative methodology for high-accuracy computational predictions within the context of drug repositioning for the efficient streamlining of currently approved drugs back into clinical trials for possible new indications.

  1. Iris indexing based on local intensity order pattern

    NASA Astrophysics Data System (ADS)

    Emerich, Simina; Malutan, Raul; Crisan, Septimiu; Lefkovits, Laszlo

    2017-03-01

    In recent years, iris biometric systems have increased in popularity and have proven capable of handling large-scale databases. The main advantages of these systems are accuracy and reliability. A proper classification of iris patterns is expected to reduce the matching time in huge databases. This paper presents an iris indexing technique based on the Local Intensity Order Pattern. The performance of the present approach is evaluated on the UPOL database and is compared with other recent systems designed for iris indexing. The results illustrate the potential of the proposed method for large-scale iris identification.

  2. Intra-reach headwater fish assemblage structure

    USGS Publications Warehouse

    McKenna, James E.

    2017-01-01

    Large-scale conservation efforts can take advantage of modern large databases and regional modeling and assessment methods. However, these broad-scale efforts often assume uniform average habitat conditions and/or species assemblages within stream reaches.

  3. Use of chemoinformatics tools- nuts and bolts; Challenges in their regulatory application

    EPA Science Inventory

    Cheminformatics spans a continuum of components from data storage to uncovering new insights that are useful for different decision making contexts. It covers the input, retrieval of data, the manipulation and integration of data through to the visualisation and analysis to trans...

  4. Development of a database system for mapping insertional mutations onto the mouse genome with large-scale experimental data

    PubMed Central

    2009-01-01

    Background: Insertional mutagenesis is an effective method for functional genomic studies in various organisms. It can rapidly generate easily tractable mutations. A large-scale insertional mutagenesis with the piggyBac (PB) transposon is currently being performed in mice at the Institute of Developmental Biology and Molecular Medicine (IDM), Fudan University in Shanghai, China. This project is carried out via collaborations among multiple groups overseeing interconnected experimental steps and generates a large volume of experimental data continuously. Therefore, the project calls for an efficient database system for recording, management, statistical analysis, and information exchange. Results: This paper presents a database application called MP-PBmice (insertional mutation mapping system of the PB Mutagenesis Information Center), which was developed to serve the on-going large-scale PB insertional mutagenesis project. The lightweight enterprise-level development framework Struts-Spring-Hibernate is used here to ensure effective and flexible support for the application. The MP-PBmice database system has three major features: strict access control, efficient workflow control, and good expandability. It supports the collaboration among different groups that enter data and exchange information on a daily basis, and is capable of providing real-time progress reports for the whole project. MP-PBmice can be easily adapted for other large-scale insertional mutation mapping projects, and the source code of this software is freely available at http://www.idmshanghai.cn/PBmice. Conclusion: MP-PBmice is a web-based application for large-scale insertional mutation mapping onto the mouse genome, implemented with the widely used Struts-Spring-Hibernate framework. This system is already in use by the on-going genome-wide PB insertional mutation mapping project at IDM, Fudan University. PMID:19958505

  5. A Review of Stellar Abundance Databases and the Hypatia Catalog Database

    NASA Astrophysics Data System (ADS)

    Hinkel, Natalie Rose

    2018-01-01

    The astronomical community is interested in elements from lithium to thorium, from solar twins to peculiarities of stellar evolution, because they give insight into different regimes of star formation and evolution. However, while some trends between elements and other stellar or planetary properties are well known, many other trends are not as obvious and are a point of conflict. For example, stars that host giant planets are found to be consistently enriched in iron, but the same cannot be definitively said for any other element. Therefore, it is time to take advantage of large stellar abundance databases in order to better understand not only the large-scale patterns, but also the more subtle, small-scale trends within the data. In this overview to the special session, I will present a review of large stellar abundance databases that are both currently available (e.g. RAVE, APOGEE) and soon to be online (e.g. Gaia-ESO, GALAH). Additionally, I will discuss the Hypatia Catalog Database (www.hypatiacatalog.com), which includes abundances from individual literature sources that observed stars within 150 pc. The Hypatia Catalog currently contains 72 elements as measured within ~6000 stars, with a total of ~240,000 unique abundance determinations. The online database offers a variety of solar normalizations, stellar properties, and planetary properties (where applicable) that can all be viewed through multiple interactive plotting interfaces as well as in a tabular format. By analyzing stellar abundances for large populations of stars and from a variety of different perspectives, a wealth of information can be revealed on both large and small scales.

  6. Tomato functional genomics database (TFGD): a comprehensive collection and analysis package for tomato functional genomics

    USDA-ARS's Scientific Manuscript database

    Tomato Functional Genomics Database (TFGD; http://ted.bti.cornell.edu) provides a comprehensive systems biology resource to store, mine, analyze, visualize and integrate large-scale tomato functional genomics datasets. The database is expanded from the previously described Tomato Expression Database...

  7. DEXTER: Disease-Expression Relation Extraction from Text.

    PubMed

    Gupta, Samir; Dingerdissen, Hayley; Ross, Karen E; Hu, Yu; Wu, Cathy H; Mazumder, Raja; Vijay-Shanker, K

    2018-01-01

    Gene expression levels affect biological processes and play a key role in many diseases. Characterizing expression profiles is useful for clinical research, as well as for the diagnostics and prognostics of diseases. There are currently several high-quality databases that capture gene expression information, obtained mostly from large-scale studies, such as microarray and next-generation sequencing technologies, in the context of disease. The scientific literature is another rich source of information on gene expression-disease relationships that have been captured not only from large-scale studies but also from thousands of small-scale studies. Expression information obtained from literature through manual curation can extend expression databases. While many of the existing databases include information from literature, they are limited by the time-consuming nature of manual curation and have difficulty keeping up with the explosion of publications in the biomedical field. In this work, we describe an automated text-mining tool, Disease-Expression Relation Extraction from Text (DEXTER), to extract information from literature on gene and microRNA expression in the context of disease. One of the motivations in developing DEXTER was to extend the BioXpress database, a cancer-focused gene expression database that includes data derived from large-scale experiments and manual curation of publications. The literature-based portion of BioXpress lags significantly behind the expression information obtained from large-scale studies and can benefit from our text-mined results. We have conducted two different evaluations to measure the accuracy of our text-mining tool and achieved average F-scores of 88.51% and 81.81% for the two evaluations, respectively. Also, to demonstrate the ability to extract rich expression information in different disease-related scenarios, we used DEXTER to extract information on differential expression for 2024 genes in lung cancer, 115 glycosyltransferases in 62 cancers and 826 microRNAs in 171 cancers. All extractions using DEXTER are integrated in the literature-based portion of BioXpress. Database URL: http://biotm.cis.udel.edu/DEXTER.

  8. Scaling up the diversity-resilience relationship with trait databases and remote sensing data: the recovery of productivity after wildfire.

    PubMed

    Spasojevic, Marko J; Bahlai, Christie A; Bradley, Bethany A; Butterfield, Bradley J; Tuanmu, Mao-Ning; Sistla, Seeta; Wiederholt, Ruscena; Suding, Katharine N

    2016-04-01

    Understanding the mechanisms underlying ecosystem resilience - why some systems have an irreversible response to disturbances while others recover - is critical for conserving biodiversity and ecosystem function in the face of global change. Despite the widespread acceptance of a positive relationship between biodiversity and resilience, empirical evidence for this relationship remains fairly limited in scope and localized in scale. Assessing resilience at the large landscape and regional scales most relevant to land management and conservation practices has been limited by the ability to measure both diversity and resilience over large spatial scales. Here, we combined tools used in large-scale studies of biodiversity (remote sensing and trait databases) with theoretical advances developed from small-scale experiments to ask whether the functional diversity within a range of woodland and forest ecosystems influences the recovery of productivity after wildfires across the four-corner region of the United States. We additionally asked how environmental variation (topography, macroclimate) across this geographic region influences such resilience, either directly or indirectly via changes in functional diversity. Using path analysis, we found that functional diversity in regeneration traits (fire tolerance, fire resistance, resprout ability) was a stronger predictor of the recovery of productivity after wildfire than the functional diversity of seed mass or species richness. Moreover, slope, elevation, and aspect either directly or indirectly influenced the recovery of productivity, likely via their effect on microclimate, while macroclimate had no direct or indirect effects. Our study provides some of the first direct empirical evidence for functional diversity increasing resilience at large spatial scales. Our approach highlights the power of combining theory based on local-scale studies with tools used in studies at large spatial scales and trait databases to understand pressing environmental issues. © 2015 John Wiley & Sons Ltd.

  9. The EpiSLI Database: A Publicly Available Database on Speech and Language

    ERIC Educational Resources Information Center

    Tomblin, J. Bruce

    2010-01-01

    Purpose: This article describes a database that was created in the process of conducting a large-scale epidemiologic study of specific language impairment (SLI). As such, this database will be referred to as the EpiSLI database. Children with SLI have unexpected and unexplained difficulties learning and using spoken language. Although there is no…

  10. Pattern-based, multi-scale segmentation and regionalization of EOSD land cover

    NASA Astrophysics Data System (ADS)

    Niesterowicz, Jacek; Stepinski, Tomasz F.

    2017-10-01

    The Earth Observation for Sustainable Development of Forests (EOSD) map is a 25 m resolution thematic map of Canadian forests. Because of its large spatial extent and relatively high resolution, the EOSD is difficult to analyze using standard GIS methods. In this paper we propose multi-scale segmentation and regionalization of the EOSD as new methods for analyzing it on large spatial scales. Segments, which we refer to as forest land units (FLUs), are delineated as tracts of forest characterized by cohesive patterns of EOSD categories; we delineated from 727 to 91,885 FLUs within the spatial extent of the EOSD, depending on the selected scale of a pattern. The pattern of EOSD categories within each FLU is described by 1037 landscape metrics. A shapefile containing the boundaries of all FLUs, together with an attribute table listing the landscape metrics, makes up an SQL-searchable spatial database providing detailed information on the composition and pattern of land cover types in Canadian forest. The shapefile format and an extensive attribute table covering the entire legend of the EOSD are designed to facilitate a broad range of investigations in which assessment of the composition and pattern of forest over large areas is needed. We calculated four such databases using different spatial scales of pattern. We illustrate the use of the FLU database by producing forest regionalization maps of two Canadian provinces, Quebec and Ontario. Such maps capture the broad-scale variability of forest at the spatial scale of the entire province. We also demonstrate how the FLU database can be used to map the variability of landscape metrics, and thus the character of the landscape, over the entire extent of Canada.

  11. Relational Databases: A Transparent Framework for Encouraging Biology Students to Think Informatically

    ERIC Educational Resources Information Center

    Rice, Michael; Gladstone, William; Weir, Michael

    2004-01-01

    We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a…

  12. Multiresource inventories incorporating GIS, GPS, and database management systems

    Treesearch

    Loukas G. Arvanitis; Balaji Ramachandran; Daniel P. Brackett; Hesham Abd-El Rasol; Xuesong Du

    2000-01-01

    Large-scale natural resource inventories generate enormous data sets. Their effective handling requires a sophisticated database management system. Such a system must be robust enough to efficiently store large amounts of data and flexible enough to allow users to manipulate a wide variety of information. In a pilot project, related to a multiresource inventory of the...

  13. Pharmacometabolomics Informs Quantitative Radiomics for Glioblastoma Diagnostic Innovation.

    PubMed

    Katsila, Theodora; Matsoukas, Minos-Timotheos; Patrinos, George P; Kardamakis, Dimitrios

    2017-08-01

    Applications of omics systems biology technologies have enormous promise for radiology and diagnostics in surgical fields. In this context, the emerging fields of radiomics (a systems scale approach to radiology using a host of technologies, including omics) and pharmacometabolomics (use of metabolomics for patient and disease stratification and guiding precision medicine) offer much synergy for diagnostic innovation in surgery, particularly in neurosurgery. This synthesis of omics fields and applications is timely because diagnostic accuracy in central nervous system tumors still challenges decision-making. Considering the vast heterogeneity in brain tumors, disease phenotypes, and interindividual variability in surgical and chemotherapy outcomes, we believe that diagnostic accuracy can be markedly improved by quantitative radiomics coupled to pharmacometabolomics and related health information technologies while optimizing economic costs of traditional diagnostics. In this expert review, we present an innovation analysis on a systems-level multi-omics approach toward diagnostic accuracy in central nervous system tumors. For this, we suggest that glioblastomas serve as a useful application paradigm. We performed a literature search on PubMed for articles published in English between 2006 and 2016. We used the search terms "radiomics," "glioblastoma," "biomarkers," "pharmacogenomics," "pharmacometabolomics," "pharmacometabonomics/pharmacometabolomics," "collaborative informatics," and "precision medicine." A list of the top 4 insights we derived from this literature analysis is presented in this study. For example, we found that (i) tumor grading needs to be better refined, (ii) diagnostic precision should be improved, (iii) standardization in radiomics is lacking, and (iv) quantitative radiomics needs to prove clinical implementation. We conclude with an interdisciplinary call to the metabolomics, pharmacy/pharmacology, radiology, and surgery communities that pharmacometabolomics coupled to information technologies (chemoinformatics tools, databases, collaborative systems) can inform quantitative radiomics, thus translating Big Data and information growth to knowledge growth, rational drug development and diagnostics innovation for glioblastomas, and possibly in other brain tumors.

  14. Predicting human olfactory perception from chemical features of odor molecules.

    PubMed

    Keller, Andreas; Gerkin, Richard C; Guan, Yuanfang; Dhurandhar, Amit; Turu, Gabor; Szalai, Bence; Mainland, Joel D; Ihara, Yusuke; Yu, Chung Wen; Wolfinger, Russ; Vens, Celine; Schietgat, Leander; De Grave, Kurt; Norel, Raquel; Stolovitzky, Gustavo; Cecchi, Guillermo A; Vosshall, Leslie B; Meyer, Pablo

    2017-02-24

    It is still not possible to predict whether a given molecule will have a perceived odor or what olfactory percept it will produce. We therefore organized the crowd-sourced DREAM Olfaction Prediction Challenge. Using a large olfactory psychophysical data set, teams developed machine-learning algorithms to predict sensory attributes of molecules based on their chemoinformatic features. The resulting models accurately predicted odor intensity and pleasantness and also successfully predicted 8 among 19 rated semantic descriptors ("garlic," "fish," "sweet," "fruit," "burnt," "spices," "flower," and "sour"). Regularized linear models performed nearly as well as random forest-based ones, with a predictive accuracy that closely approaches a key theoretical limit. These models help to predict the perceptual qualities of virtually any molecule with high accuracy and also reverse-engineer the smell of a molecule. Copyright © 2017, American Association for the Advancement of Science.
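
    The abstract's central modeling comparison, a regularized linear model versus a random forest trained on chemoinformatic features to predict a continuous perceptual rating, can be sketched as follows (synthetic stand-in data; scikit-learn is assumed purely for illustration).

      import numpy as np
      from sklearn.linear_model import Ridge
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(7)
      X = rng.normal(size=(400, 100))        # 400 molecules x 100 chemoinformatic features
      # Synthetic "pleasantness" rating driven by a handful of the features plus noise.
      y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=400)

      models = [("ridge regression", Ridge(alpha=1.0)),
                ("random forest", RandomForestRegressor(n_estimators=200, random_state=0))]
      for name, model in models:
          r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
          print(f"{name}: cross-validated R^2 = {r2:.2f}")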

  15. Task Effects on Linguistic Complexity and Accuracy: A Large-Scale Learner Corpus Analysis Employing Natural Language Processing Techniques

    ERIC Educational Resources Information Center

    Alexopoulou, Theodora; Michel, Marije; Murakami, Akira; Meurers, Detmar

    2017-01-01

    Large-scale learner corpora collected from online language learning platforms, such as the EF-Cambridge Open Language Database (EFCAMDAT), provide opportunities to analyze learner data at an unprecedented scale. However, interpreting the learner language in such corpora requires a precise understanding of tasks: How does the prompt and input of a…

  16. PrenDB, a Substrate Prediction Database to Enable Biocatalytic Use of Prenyltransferases.

    PubMed

    Gunera, Jakub; Kindinger, Florian; Li, Shu-Ming; Kolb, Peter

    2017-03-10

    Prenyltransferases of the dimethylallyltryptophan synthase (DMATS) superfamily catalyze the attachment of prenyl or prenyl-like moieties to diverse acceptor compounds. These acceptor molecules are generally aromatic in nature and mostly indole or indole-like. Their catalytic transformation represents a major skeletal diversification step in the biosynthesis of secondary metabolites, including the indole alkaloids. DMATS enzymes thus contribute significantly to the biological and pharmacological diversity of small molecule metabolites. Understanding the substrate specificity of these enzymes could create opportunities for their biocatalytic use in preparing complex synthetic scaffolds. However, there has been no framework to achieve this in a rational way. Here, we report a chemoinformatic pipeline to enable prenyltransferase substrate prediction. We systematically catalogued 32 unique prenyltransferases and 167 unique substrates to create possible reaction matrices and compiled these data into a browsable database named PrenDB. We then used a newly developed algorithm based on molecular fragmentation to automatically extract reactive chemical epitopes. The analysis of the collected data sheds light on the thus far explored substrate space of DMATS enzymes. To assess the predictive performance of our virtual reaction extraction tool, 38 potential substrates were tested as prenyl acceptors in assays with three prenyltransferases, and we were able to detect turnover in >55% of the cases. The database, PrenDB (www.kolblab.org/prendb.php), enables the prediction of potential substrates for chemoenzymatic synthesis through substructure similarity and virtual chemical transformation techniques. It aims at making prenyltransferases and their highly regio- and stereoselective reactions accessible to the research community for integration in synthetic work flows. © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.
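
    A simple flavor of the substructure-based reasoning such a pipeline relies on is shown below: before any finer similarity scoring, candidate acceptors can be pre-filtered for the indole core that most DMATS substrates share. The sketch uses RDKit (an assumption; it is not necessarily the toolkit behind PrenDB) and illustrative SMILES, and it does not implement the fragmentation algorithm described in the paper.

      from rdkit import Chem

      indole = Chem.MolFromSmarts("c1ccc2[nH]ccc2c1")    # indole substructure query

      candidates = {
          "tryptophan": "NC(Cc1c[nH]c2ccccc12)C(=O)O",   # indole-bearing acceptor
          "tyrosine":   "NC(Cc1ccc(O)cc1)C(=O)O",        # phenol, no indole core
      }

      for name, smi in candidates.items():
          mol = Chem.MolFromSmiles(smi)
          hit = mol is not None and mol.HasSubstructMatch(indole)
          print(name, "contains the indole core:", hit)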

  17. LSD: Large Survey Database framework

    NASA Astrophysics Data System (ADS)

    Juric, Mario

    2012-09-01

    The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures.

  18. Active Exploration of Large 3D Model Repositories.

    PubMed

    Gao, Lin; Cao, Yan-Pei; Lai, Yu-Kun; Huang, Hao-Zhi; Kobbelt, Leif; Hu, Shi-Min

    2015-12-01

    With broader availability of large-scale 3D model repositories, the need for efficient and effective exploration becomes more and more urgent. Existing model retrieval techniques do not scale well with the size of the database since often a large number of very similar objects are returned for a query, and the possibilities to refine the search are quite limited. We propose an interactive approach where the user feeds an active learning procedure by labeling either entire models or parts of them as "like" or "dislike" such that the system can automatically update an active set of recommended models. To provide an intuitive user interface, candidate models are presented based on their estimated relevance for the current query. From the methodological point of view, our main contribution is to exploit not only the similarity between a query and the database models but also the similarities among the database models themselves. We achieve this by an offline pre-processing stage, where global and local shape descriptors are computed for each model and a sparse distance metric is derived that can be evaluated efficiently even for very large databases. We demonstrate the effectiveness of our method by interactively exploring a repository containing over 100 K models.

  19. Fast Updating National Geo-Spatial Databases with High Resolution Imagery: China's Methodology and Experience

    NASA Astrophysics Data System (ADS)

    Chen, J.; Wang, D.; Zhao, R. L.; Zhang, H.; Liao, A.; Jiu, J.

    2014-04-01

    Geospatial databases are an irreplaceable national asset of immense importance. Their up-to-dateness, that is, their consistency with respect to the real world, plays a critical role in their value and applications. Continuously updating map databases at the 1:50,000 scale is a massive and difficult task for large countries covering more than several million square kilometers. This paper presents the research and technological development carried out to support national map updating at the 1:50,000 scale in China, including the development of updating models and methods, production tools and systems for large-scale and rapid updating, and the design and implementation of a continuous updating workflow. Many data sources had to be used and integrated into a high-accuracy, quality-checked product, which in turn required up-to-date techniques for image matching, semantic integration, generalization, database management and conflict resolution. Specific software tools and packages were designed and developed to support large-scale updating production with high-resolution imagery and large-scale data generalization, including map generalization, GIS-supported change interpretation from imagery, DEM interpolation, image matching-based orthophoto generation, and data control at different levels. A national 1:50,000 database updating strategy and its production workflow were designed, comprising a full-coverage updating pattern characterized by all-element topographic data modeling, change detection in all related areas and whole-process data quality control, a series of technical production specifications, and a network of updating production units in different geographic places in the country.

  20. Data-based discharge extrapolation: estimating annual discharge for a partially gauged large river basin from its small sub-basins

    NASA Astrophysics Data System (ADS)

    Gong, L.

    2013-12-01

    Large-scale hydrological models and land surface models are by far the only tools for assessing future water resources in climate change impact studies. These models estimate discharge with large uncertainties, owing to the complex interaction between climate and hydrology, the limited quality and availability of data, and model uncertainties. A new, purely data-based scale-extrapolation method is proposed to estimate water resources for a large basin solely from selected small sub-basins, which are typically two orders of magnitude smaller than the large basin. These small sub-basins contain sufficient information, not only on climate and land surface but also on hydrological characteristics, for the large basin. In the Baltic Sea drainage basin, the best discharge estimation for the gauged area was achieved with sub-basins that cover 2-4% of the gauged area. Multiple sets of sub-basins exist that resemble the climate and hydrology of the basin equally well; these sets estimate annual discharge for the gauged area consistently well, with a 5% average error. The scale-extrapolation method is completely data-based and therefore does not force any modelling error into the prediction. The multiple predictions are expected to bracket the inherent variations and uncertainties of the climate and hydrology of the basin. The method can be applied to both un-gauged basins and un-gauged periods, with uncertainty estimation.

  1. Large-scale database searching using tandem mass spectra: looking up the answer in the back of the book.

    PubMed

    Sadygov, Rovshan G; Cociorva, Daniel; Yates, John R

    2004-12-01

    Database searching is an essential element of large-scale proteomics. Because these methods are widely used, it is important to understand the rationale of the algorithms. Most algorithms are based on concepts first developed in SEQUEST and PeptideSearch. Four basic approaches are used to determine a match between a spectrum and sequence: descriptive, interpretative, stochastic and probability-based matching. We review the basic concepts used by most search algorithms, the computational modeling of peptide identification and current challenges and limitations of this approach for protein identification.

  2. Predicting activity approach based on new atoms similarity kernel function.

    PubMed

    Abu El-Atta, Ahmed H; Moussa, M I; Hassanien, Aboul Ella

    2015-07-01

    Drug design is a costly, long-term process, and new techniques are needed to reduce the time and cost of drug discovery. The field of chemoinformatics applies informational techniques from computer science, such as machine learning and graph theory, to infer properties of chemical compounds, such as toxicity or biological activity, by analyzing their molecular structure (the molecular graph). There is therefore an increasing need for algorithms that analyze and classify graph data to predict the activity of molecules. Kernel methods provide a powerful framework that combines machine learning with graph-theoretic techniques, and they have led to impressive performance in several chemoinformatics problems such as biological activity prediction. This paper presents a new kernel-function-based approach to the activity prediction problem for chemical compounds. First, every atom is encoded according to its neighbors; these codes are then used to establish relationships between atoms, and the relationships between atoms are in turn used to compute the similarity between chemical compounds. The proposed approach was compared with many other classification methods and shows competitive accuracy. Copyright © 2015 Elsevier Inc. All rights reserved.
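
    As a rough illustration of this kind of neighbor-based atom encoding and compound similarity (a minimal sketch, not the authors' exact kernel), the following Python snippet uses RDKit to label each atom by its element and the sorted elements of its neighbors, then scores two molecules by the overlap of their atom-code multisets. The SMILES strings, the encoding scheme and the overlap score are illustrative assumptions.

```python
# Minimal sketch of a neighbor-based atom encoding and a simple
# compound-similarity score; assumes RDKit is installed.
from collections import Counter
from rdkit import Chem

def atom_codes(smiles: str) -> Counter:
    """Encode each atom by its element plus the sorted elements of its neighbors."""
    mol = Chem.MolFromSmiles(smiles)
    codes = Counter()
    for atom in mol.GetAtoms():
        neighbors = sorted(n.GetSymbol() for n in atom.GetNeighbors())
        codes[(atom.GetSymbol(), tuple(neighbors))] += 1
    return codes

def overlap_similarity(smiles_a: str, smiles_b: str) -> float:
    """Fraction of shared atom codes (a crude stand-in for a graph kernel)."""
    a, b = atom_codes(smiles_a), atom_codes(smiles_b)
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 0.0

if __name__ == "__main__":
    # Hypothetical example: aspirin vs. salicylic acid.
    print(overlap_similarity("CC(=O)Oc1ccccc1C(=O)O", "Oc1ccccc1C(=O)O"))
```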

  3. Digital geomorphological landslide hazard mapping of the Alpago area, Italy

    NASA Astrophysics Data System (ADS)

    van Westen, Cees J.; Soeters, Rob; Sijmons, Koert

    Large-scale geomorphological maps of mountainous areas are traditionally made using complex symbol-based legends. They can serve as excellent "geomorphological databases", from which an experienced geomorphologist can extract a large amount of information for hazard mapping. However, these maps are not designed to be used in combination with a GIS, due to their complex cartographic structure. In this paper, two methods are presented for digital geomorphological mapping at large scales using GIS and digital cartographic software. The methods are applied to an area with a complex geomorphological setting in the Borsoia catchment, located in the Alpago region near Belluno in the Italian Alps. The GIS database set-up is presented with an overview of the data layers that have been generated and how they are interrelated. The GIS database was also converted into a paper map using a digital cartographic package; the resulting large-scale geomorphological hazard map is attached. The resulting GIS database and cartographic product can be used to analyse the hazard type and hazard degree for each polygon, and to find the reasons for the hazard classification.

  4. Mappability of drug-like space: towards a polypharmacologically competent map of drug-relevant compounds

    NASA Astrophysics Data System (ADS)

    Sidorov, Pavel; Gaspar, Helena; Marcou, Gilles; Varnek, Alexandre; Horvath, Dragos

    2015-12-01

    Intuitive, visual rendering—mapping—of high-dimensional chemical spaces (CS), is an important topic in chemoinformatics. Such maps were so far dedicated to specific compound collections—either limited series of known activities, or large, even exhaustive enumerations of molecules, but without associated property data. Typically, they were challenged to answer some classification problem with respect to those same molecules, admired for their aesthetical virtues and then forgotten—because they were set-specific constructs. This work wishes to address the question whether a general, compound set-independent map can be generated, and the claim of "universality" quantitatively justified, with respect to all the structure-activity information available so far—or, more realistically, an exploitable but significant fraction thereof. The "universal" CS map is expected to project molecules from the initial CS into a lower-dimensional space that is neighborhood behavior-compliant with respect to a large panel of ligand properties. Such map should be able to discriminate actives from inactives, or even support quantitative neighborhood-based, parameter-free property prediction (regression) models, for a wide panel of targets and target families. It should be polypharmacologically competent, without requiring any target-specific parameter fitting. This work describes an evolutionary growth procedure of such maps, based on generative topographic mapping, followed by the validation of their polypharmacological competence. Validation was achieved with respect to a maximum of exploitable structure-activity information, covering all of Homo sapiens proteins of the ChEMBL database, antiparasitic and antiviral data, etc. Five evolved maps satisfactorily solved hundreds of activity-based ligand classification challenges for targets, and even in vivo properties independent from training data. They also stood chemogenomics-related challenges, as cumulated responsibility vectors obtained by mapping of target-specific ligand collections were shown to represent validated target descriptors, complying with currently accepted target classification in biology. Therefore, they represent, in our opinion, a robust and well documented answer to the key question "What is a good CS map?"

  5. Mappability of drug-like space: towards a polypharmacologically competent map of drug-relevant compounds.

    PubMed

    Sidorov, Pavel; Gaspar, Helena; Marcou, Gilles; Varnek, Alexandre; Horvath, Dragos

    2015-12-01

    Intuitive, visual rendering--mapping--of high-dimensional chemical spaces (CS), is an important topic in chemoinformatics. Such maps were so far dedicated to specific compound collections--either limited series of known activities, or large, even exhaustive enumerations of molecules, but without associated property data. Typically, they were challenged to answer some classification problem with respect to those same molecules, admired for their aesthetical virtues and then forgotten--because they were set-specific constructs. This work wishes to address the question whether a general, compound set-independent map can be generated, and the claim of "universality" quantitatively justified, with respect to all the structure-activity information available so far--or, more realistically, an exploitable but significant fraction thereof. The "universal" CS map is expected to project molecules from the initial CS into a lower-dimensional space that is neighborhood behavior-compliant with respect to a large panel of ligand properties. Such map should be able to discriminate actives from inactives, or even support quantitative neighborhood-based, parameter-free property prediction (regression) models, for a wide panel of targets and target families. It should be polypharmacologically competent, without requiring any target-specific parameter fitting. This work describes an evolutionary growth procedure of such maps, based on generative topographic mapping, followed by the validation of their polypharmacological competence. Validation was achieved with respect to a maximum of exploitable structure-activity information, covering all of Homo sapiens proteins of the ChEMBL database, antiparasitic and antiviral data, etc. Five evolved maps satisfactorily solved hundreds of activity-based ligand classification challenges for targets, and even in vivo properties independent from training data. They also stood chemogenomics-related challenges, as cumulated responsibility vectors obtained by mapping of target-specific ligand collections were shown to represent validated target descriptors, complying with currently accepted target classification in biology. Therefore, they represent, in our opinion, a robust and well documented answer to the key question "What is a good CS map?"

  6. High Performance Semantic Factoring of Giga-Scale Semantic Graph Databases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joslyn, Cliff A.; Adolf, Robert D.; Al-Saffar, Sinan

    2010-10-04

    As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture, and present the results of our deploying that for the analysis of the Billion Triple dataset with respect to its semantic factors.

  7. Large-Scale medical image analytics: Recent methodologies, applications and Future directions.

    PubMed

    Zhang, Shaoting; Metaxas, Dimitris

    2016-10-01

    Despite the ever-increasing amount and complexity of annotated medical image data, the development of large-scale medical image analysis algorithms has not kept pace with the need for methods that bridge the semantic gap between images and diagnoses. The goal of this position paper is to discuss and explore innovative and large-scale data science techniques in medical image analytics, which will benefit clinical decision-making and facilitate efficient medical data management. In particular, we advocate that image retrieval systems should be scaled up significantly, to a point at which interactive systems become effective for knowledge discovery in potentially large databases of medical images. For clinical relevance, such systems should return results in real time, incorporate expert feedback, and be able to cope with the size, quality, and variety of the medical images and their associated metadata for a particular domain. The design, development, and testing of such a framework can significantly impact interactive mining in medical image databases that are growing rapidly in size and complexity, and enable novel methods of analysis at much larger scales in an efficient, integrated fashion. Copyright © 2016. Published by Elsevier B.V.

  8. Content Is King: Databases Preserve the Collective Information of Science.

    PubMed

    Yates, John R

    2018-04-01

    Databases store sequence information experimentally gathered to create resources that further science. In the last 20 years databases have become critical components of fields like proteomics where they provide the basis for large-scale and high-throughput proteomic informatics. Amos Bairoch, winner of the Association of Biomolecular Resource Facilities Frederick Sanger Award, has created some of the important databases proteomic research depends upon for accurate interpretation of data.

  9. Rational drug design for anti-cancer chemotherapy: multi-target QSAR models for the in silico discovery of anti-colorectal cancer agents.

    PubMed

    Speck-Planche, Alejandro; Kleandrova, Valeria V; Luan, Feng; Cordeiro, M Natália D S

    2012-08-01

    The discovery of new and more potent anti-cancer agents constitutes one of the most active fields of research in chemotherapy. Colorectal cancer (CRC) is one of the most studied cancers because of its high prevalence and number of deaths. In the current pharmaceutical design of more efficient anti-CRC drugs, methodologies based on chemoinformatics have played a decisive role, including Quantitative Structure-Activity Relationship (QSAR) techniques. However, until now there has been no methodology able to predict the anti-CRC activity of compounds against more than one CRC cell line, which should constitute the principal goal. In an attempt to overcome this problem, we develop here the first multi-target (mt) approach for the virtual screening and rational in silico discovery of anti-CRC agents against ten cell lines. Two mt-QSAR classification models were constructed using a large and heterogeneous database of compounds. The first model was based on linear discriminant analysis (mt-QSAR-LDA) employing fragment-based descriptors, while the second model was obtained using artificial neural networks (mt-QSAR-ANN) with global 2D descriptors. Both models correctly classified more than 90% of active and inactive compounds in training and prediction sets. Some fragments were extracted from the molecules and their contributions to anti-CRC activity were calculated using the mt-QSAR-LDA model. Several fragments were identified as potential substructural features responsible for the anti-CRC activity, and new molecules designed from the fragments with positive contributions were suggested and correctly predicted by the two models as possible potent and versatile anti-CRC agents. Copyright © 2012 Elsevier Ltd. All rights reserved.
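
    As a minimal sketch of the kind of descriptor-based classification model described above (not the authors' actual mt-QSAR workflow), the snippet below trains a linear discriminant analysis classifier on a hypothetical matrix of molecular descriptors with active/inactive labels using scikit-learn; the descriptor matrix, labels and train/test split are placeholder assumptions.

```python
# Minimal LDA-based active/inactive classification sketch with scikit-learn.
# The descriptor matrix X and labels y are synthetic placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                  # 500 compounds x 20 descriptors (placeholder)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic active/inactive labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearDiscriminantAnalysis()
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```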

  10. Combinatorial phenotypic screen uncovers unrecognized family of extended thiourea inhibitors with copper-dependent anti-staphylococcal activity.

    PubMed

    Dalecki, Alex G; Malalasekera, Aruni P; Schaaf, Kaitlyn; Kutsch, Olaf; Bossmann, Stefan H; Wolschendorf, Frank

    2016-04-01

    The continuous rise of multi-drug resistant pathogenic bacteria has become a significant challenge for the health care system. In particular, novel drugs to treat infections of methicillin-resistant Staphylococcus aureus strains (MRSA) are needed, but traditional drug discovery campaigns have largely failed to deliver clinically suitable antibiotics. More than simply new drugs, new drug discovery approaches are needed to combat bacterial resistance. The recently described phenomenon of copper-dependent inhibitors has galvanized research exploring the use of metal-coordinating molecules to harness copper's natural antibacterial properties for therapeutic purposes. Here, we describe the results of the first concerted screening effort to identify copper-dependent inhibitors of Staphylococcus aureus. A standard library of 10 000 compounds was assayed for anti-staphylococcal activity, with hits defined as those compounds with a strict copper-dependent inhibitory activity. A total of 53 copper-dependent hit molecules were uncovered, similar to the copper independent hit rate of a traditionally executed campaign conducted in parallel on the same library. Most prominent was a hit family with an extended thiourea core structure, termed the NNSN motif. This motif resulted in copper-dependent and copper-specific S. aureus inhibition, while simultaneously being well tolerated by eukaryotic cells. Importantly, we could demonstrate that copper binding by the NNSN motif is highly unusual and likely responsible for the promising biological qualities of these compounds. A subsequent chemoinformatic meta-analysis of the ChEMBL chemical database confirmed the NNSNs as an unrecognized staphylococcal inhibitor, despite the family's presence in many chemical screening libraries. Thus, our copper-biased screen has proven able to discover inhibitors within previously screened libraries, offering a mechanism to reinvigorate exhausted molecular collections.

  11. Comparison of the Frontier Distributed Database Caching System to NoSQL Databases

    NASA Astrophysics Data System (ADS)

    Dykstra, Dave

    2012-12-01

    One of the main attractions of non-relational “NoSQL” databases is their ability to scale to large numbers of readers, including readers spread over a wide area. The Frontier distributed database caching system, used in production by the Large Hadron Collider CMS and ATLAS detector projects for Conditions data, is based on traditional SQL databases but also adds high scalability and the ability to be distributed over a wide-area for an important subset of applications. This paper compares the major characteristics of the two different approaches and identifies the criteria for choosing which approach to prefer over the other. It also compares in some detail the NoSQL databases used by CMS and ATLAS: MongoDB, CouchDB, HBase, and Cassandra.

  12. Comparison of the Frontier Distributed Database Caching System to NoSQL Databases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dykstra, Dave

    One of the main attractions of non-relational NoSQL databases is their ability to scale to large numbers of readers, including readers spread over a wide area. The Frontier distributed database caching system, used in production by the Large Hadron Collider CMS and ATLAS detector projects for Conditions data, is based on traditional SQL databases but also adds high scalability and the ability to be distributed over a wide-area for an important subset of applications. This paper compares the major characteristics of the two different approaches and identifies the criteria for choosing which approach to prefer over the other. It also compares in some detail the NoSQL databases used by CMS and ATLAS: MongoDB, CouchDB, HBase, and Cassandra.

  13. Large-scale feature searches of collections of medical imagery

    NASA Astrophysics Data System (ADS)

    Hedgcock, Marcus W.; Karshat, Walter B.; Levitt, Tod S.; Vosky, D. N.

    1993-09-01

    Large scale feature searches of accumulated collections of medical imagery are required for multiple purposes, including clinical studies, administrative planning, epidemiology, teaching, quality improvement, and research. To perform a feature search of large collections of medical imagery, one can either search text descriptors of the imagery in the collection (usually the interpretation), or (if the imagery is in digital format) the imagery itself. At our institution, text interpretations of medical imagery are all available in our VA Hospital Information System. These are downloaded daily into an off-line computer. The text descriptors of most medical imagery are usually formatted as free text, and so require a user friendly database search tool to make searches quick and easy for any user to design and execute. We are tailoring such a database search tool (Liveview), developed by one of the authors (Karshat). To further facilitate search construction, we are constructing (from our accumulated interpretation data) a dictionary of medical and radiological terms and synonyms. If the imagery database is digital, the imagery which the search discovers is easily retrieved from the computer archive. We describe our database search user interface, with examples, and compare the efficacy of computer assisted imagery searches from a clinical text database with manual searches. Our initial work on direct feature searches of digital medical imagery is outlined.
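
    As a small illustration of the kind of free-text search over imaging interpretations that such a tool supports (a generic sketch, not the Liveview tool itself), the following Python snippet indexes a few hypothetical report texts with SQLite's FTS5 full-text search extension and runs a keyword query; it assumes the local SQLite build includes FTS5, as most do.

```python
# Generic sketch of free-text search over imaging reports using SQLite FTS5.
# Not the Liveview tool; assumes the bundled SQLite was built with FTS5.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE reports USING fts5(accession, body)")
conn.executemany(
    "INSERT INTO reports (accession, body) VALUES (?, ?)",
    [
        ("R001", "Chest radiograph shows a right lower lobe pneumonia."),
        ("R002", "No acute cardiopulmonary abnormality."),
        ("R003", "CT abdomen demonstrates a hepatic lesion, possible metastasis."),
    ],
)

# Find all reports mentioning pneumonia or metastasis.
for accession, body in conn.execute(
        "SELECT accession, body FROM reports WHERE reports MATCH ?",
        ("pneumonia OR metastasis",)):
    print(accession, "->", body)
```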

  14. Open source database of images DEIMOS: extension for large-scale subjective image quality assessment

    NASA Astrophysics Data System (ADS)

    Vítek, Stanislav

    2014-09-01

    DEIMOS (Database of Images: Open Source) is an open-source database of images and video sequences for testing, verification and comparison of various image and/or video processing techniques such as compression, reconstruction and enhancement. This paper describes an extension of the database that allows large-scale, web-based subjective image quality assessment. The extension implements both an administrative and a client interface. The proposed system is aimed mainly at mobile communication devices and takes advantage of HTML5 technology, so participants do not need to install any application and the assessment can be performed in a web browser. The assessment campaign administrator can select images from the large database and then apply rules defined by various test procedure recommendations. The standard test procedures may be fully customized and saved as a template. Alternatively, the administrator can define a custom test using images from the pool and other components, such as evaluation forms and ongoing questionnaires. The image sequence is delivered to the online client, e.g. a smartphone or tablet, as a fully automated assessment sequence, or the viewer can decide on the timing of the assessment if required. Environmental data and viewing conditions (e.g. illumination, vibrations, GPS coordinates, etc.) may be collected and subsequently analyzed.

  15. Molecular property diagnostic suite (MPDS): Development of disease-specific open source web portals for drug discovery.

    PubMed

    Nagamani, S; Gaur, A S; Tanneeru, K; Muneeswaran, G; Madugula, S S; Consortium, Mpds; Druzhilovskiy, D; Poroikov, V V; Sastry, G N

    2017-11-01

    Molecular property diagnostic suite (MPDS) is a Galaxy-based open source drug discovery and development platform. MPDS web portals are designed for several diseases, such as tuberculosis, diabetes mellitus, and other metabolic disorders, and are specifically aimed at evaluating and estimating the drug-likeness of a given molecule. MPDS consists of three modules, namely data libraries, data processing, and data analysis tools, which are configured and interconnected to assist drug discovery for specific diseases. The data library module encompasses vast information on chemical space, wherein the MPDS compound library comprises 110.31 million unique molecules generated from public domain databases. Every molecule is assigned a unique ID and card, which provides complete information for the molecule. Some of the modules in MPDS are specific to the diseases, while others are non-specific. Importantly, a suitably altered protocol can be generated effectively for another disease-specific MPDS web portal by modifying some of the modules. Thus, the MPDS suite of web portals shows great promise to emerge as disease-specific portals of great value, integrating chemoinformatics, bioinformatics, molecular modelling, and structure- and analogue-based drug discovery approaches.

  16. Chemoinformatic Analysis of Combinatorial Libraries, Drugs, Natural Products and Molecular Libraries Small Molecule Repository

    PubMed Central

    Singh, Narender; Guha, Rajarshi; Giulianotti, Marc; Pinilla, Clemencia; Houghten, Richard; Medina-Franco, Jose L.

    2009-01-01

    A multiple-criteria approach is presented and used to perform a comparative analysis of four recently developed combinatorial libraries against drugs, the Molecular Libraries Small Molecule Repository (MLSMR) and natural products. The compound databases were assessed in terms of physicochemical properties, scaffolds and fingerprints. The approach enables the analysis of property space coverage, degree of overlap between collections, scaffold and structural diversity, and overall structural novelty. The degree of overlap between combinatorial libraries and drugs was assessed using the R-NN curve methodology, which measures the density of chemical space around a query molecule embedded in the chemical space of a target collection. The combinatorial libraries studied in this work exhibit scaffolds that were not observed in the drug, MLSMR and natural products collections. The fingerprint-based comparisons indicate that these combinatorial libraries are structurally different from current drugs. The R-NN curve methodology revealed that a proportion of molecules in the combinatorial libraries are located within the property space of the drugs. However, the R-NN analysis also showed that there are a significant number of molecules in several combinatorial libraries that are located in sparse regions of the drug space. PMID:19301827
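
    A rough illustration of the idea behind an R-NN-style density curve (a sketch of the general concept, not the authors' exact implementation): for a query compound embedded in a reference collection, count how many reference compounds fall within a radius R as R grows, which traces how densely populated the surrounding chemical space is. The synthetic descriptor vectors and the Euclidean distance below are illustrative assumptions.

```python
# Sketch of an R-NN-style density curve: number of reference compounds
# within radius R of a query, for increasing R. Descriptors are synthetic.
import numpy as np

def rnn_curve(query: np.ndarray, reference: np.ndarray, radii: np.ndarray) -> np.ndarray:
    """Return, for each radius R, the count of reference points within R of the query."""
    distances = np.linalg.norm(reference - query, axis=1)
    return np.array([(distances <= r).sum() for r in radii])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    reference = rng.normal(size=(1000, 10))   # 1000 compounds x 10 descriptors (placeholder)
    query = rng.normal(size=10)
    radii = np.linspace(0.5, 6.0, 12)
    print(dict(zip(radii.round(2), rnn_curve(query, reference, radii))))
```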

  17. Large-scale silviculture experiments of western Oregon and Washington.

    Treesearch

    Nathan J. Poage; Paul D. Anderson

    2007-01-01

    We review 12 large-scale silviculture experiments (LSSEs) in western Washington and Oregon with which the Pacific Northwest Research Station of the USDA Forest Service is substantially involved. We compiled and arrayed information about the LSSEs as a series of matrices in a relational database, which is included on the compact disc published with this report and...

  18. Intelligent Interfaces for Mining Large-Scale RNAi-HCS Image Databases

    PubMed Central

    Lin, Chen; Mak, Wayne; Hong, Pengyu; Sepp, Katharine; Perrimon, Norbert

    2010-01-01

    Recently, High-content screening (HCS) has been combined with RNA interference (RNAi) to become an essential image-based high-throughput method for studying genes and biological networks through RNAi-induced cellular phenotype analyses. However, a genome-wide RNAi-HCS screen typically generates tens of thousands of images, most of which remain uncategorized due to the inadequacies of existing HCS image analysis tools. Until now, it still requires highly trained scientists to browse a prohibitively large RNAi-HCS image database and produce only a handful of qualitative results regarding cellular morphological phenotypes. For this reason we have developed intelligent interfaces to facilitate the application of the HCS technology in biomedical research. Our new interfaces empower biologists with computational power not only to effectively and efficiently explore large-scale RNAi-HCS image databases, but also to apply their knowledge and experience to interactive mining of cellular phenotypes using Content-Based Image Retrieval (CBIR) with Relevance Feedback (RF) techniques. PMID:21278820

  19. Use of electronic healthcare records in large-scale simple randomized trials at the point of care for the documentation of value-based medicine.

    PubMed

    van Staa, T-P; Klungel, O; Smeeth, L

    2014-06-01

    A solid foundation of evidence of the effects of an intervention is a prerequisite of evidence-based medicine. The best source of such evidence is considered to be randomized trials, which are able to avoid confounding. However, they may not always estimate effectiveness in clinical practice. Databases that collate anonymized electronic health records (EHRs) from different clinical centres have been widely used for many years in observational studies. Randomized point-of-care trials have been initiated recently to recruit and follow patients using the data from EHR databases. In this review, we describe how EHR databases can be used for conducting large-scale simple trials and discuss the advantages and disadvantages of their use. © 2014 The Association for the Publication of the Journal of Internal Medicine.

  20. High performance semantic factoring of giga-scale semantic graph databases.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    al-Saffar, Sinan; Adolf, Bob; Haglin, David

    2010-10-01

    As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture, and present the results of our deploying that for the analysis of the Billion Triple dataset with respect to its semantic factors, including basic properties, connected components, namespace interaction, and typed paths.

  1. Uniform standards for genome databases in forest and fruit trees

    USDA-ARS?s Scientific Manuscript database

    TreeGenes and tfGDR serve the international forestry and fruit tree genomics research communities, respectively. These databases hold similar sequence data and provide resources for the submission and recovery of this information in order to enable comparative genomics research. Large-scale genotype...

  2. Very large database of lipids: rationale and design.

    PubMed

    Martin, Seth S; Blaha, Michael J; Toth, Peter P; Joshi, Parag H; McEvoy, John W; Ahmed, Haitham M; Elshazly, Mohamed B; Swiger, Kristopher J; Michos, Erin D; Kwiterovich, Peter O; Kulkarni, Krishnaji R; Chimera, Joseph; Cannon, Christopher P; Blumenthal, Roger S; Jones, Steven R

    2013-11-01

    Blood lipids have major cardiovascular and public health implications. Lipid-lowering drugs are prescribed based in part on categorization of patients into normal or abnormal lipid metabolism, yet relatively little emphasis has been placed on: (1) the accuracy of current lipid measures used in clinical practice, (2) the reliability of current categorizations of dyslipidemia states, and (3) the relationship of advanced lipid characterization to other cardiovascular disease biomarkers. To these ends, we developed the Very Large Database of Lipids (NCT01698489), an ongoing database protocol that harnesses deidentified data from the daily operations of a commercial lipid laboratory. The database includes individuals who were referred for clinical purposes for a Vertical Auto Profile (Atherotech Inc., Birmingham, AL), which directly measures cholesterol concentrations of low-density lipoprotein, very low-density lipoprotein, intermediate-density lipoprotein, high-density lipoprotein, their subclasses, and lipoprotein(a). Individual Very Large Database of Lipids studies, ranging from studies of measurement accuracy, to dyslipidemia categorization, to biomarker associations, to characterization of rare lipid disorders, are investigator-initiated and utilize peer-reviewed statistical analysis plans to address a priori hypotheses/aims. In the first database harvest (Very Large Database of Lipids 1.0) from 2009 to 2011, there were 1 340 614 adult and 10 294 pediatric patients; the adult sample had a median age of 59 years (interquartile range, 49-70 years) with even representation by sex. Lipid distributions closely matched those from the population-representative National Health and Nutrition Examination Survey. The second harvest of the database (Very Large Database of Lipids 2.0) is underway. Overall, the Very Large Database of Lipids database provides an opportunity for collaboration and new knowledge generation through careful examination of granular lipid data on a large scale. © 2013 Wiley Periodicals, Inc.

  3. Exploring Natural Products from the Biodiversity of Pakistan for Computational Drug Discovery Studies: Collection, Optimization, Design and Development of A Chemical Database (ChemDP).

    PubMed

    Mirza, Shaher Bano; Bokhari, Habib; Fatmi, Muhammad Qaiser

    2015-01-01

    Pakistan possesses a rich and vast source of natural products (NPs). Some of these secondary metabolites have been identified as potent therapeutic agents. However, the medicinal usage of most of these compounds has not yet been fully explored. The discovery of new NP scaffolds as inhibitors of particular enzymes or receptors using advanced computational drug discovery approaches is also limited by the unavailability of accurate 3D structures of NPs. An organized database incorporating all relevant information can therefore facilitate exploration of the medicinal importance of metabolites from Pakistani biodiversity. The Chemical Database of Pakistan (ChemDP; release 01) is a fully referenced, evolving, web-based virtual database designed and developed to introduce natural products (NPs) and their derivatives from the biodiversity of Pakistan to the global scientific community. The prime aim is to provide quality structures of compounds, with relevant information, for computer-aided drug discovery studies. For this purpose, over 1000 NPs have been identified from more than 400 published articles, for which 2D and 3D molecular structures have been generated with a special focus on their stereochemistry, where applicable. The PM7 semiempirical quantum chemistry method has been used to energy-optimize the 3D structures of NPs. The 2D and 3D structures can be downloaded as .sdf, .mol, .sybyl, .mol2 and .pdb files, formats readable by many chemoinformatics/bioinformatics software packages. Each entry in ChemDP contains over 100 data fields representing various molecular, biological, physico-chemical and pharmacological properties, which have been properly documented in the database for end users. These pieces of information have been either manually extracted from the literature or computationally calculated using various computational tools. Cross-referencing to a major data repository, ChemSpider, has been made available for overlapping compounds. An Android application of ChemDP is available at its website. ChemDP is freely accessible at www.chemdp.com.

  4. Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing

    PubMed Central

    2013-01-01

    Background: A large-scale, highly accurate, machine-understandable drug-disease treatment relationship knowledge base is important for computational approaches to drug repurposing. The large body of published biomedical research articles and clinical case reports available on MEDLINE is a rich source of FDA-approved drug-disease indication as well as drug-repurposing knowledge that is crucial for applying FDA-approved drugs to new diseases. However, much of this information is buried in free text and not captured in any existing databases. The goal of this study is to extract a large number of accurate drug-disease treatment pairs from published literature. Results: In this study, we developed a simple but highly accurate pattern-learning approach to extract treatment-specific drug-disease pairs from 20 million biomedical abstracts available on MEDLINE. We extracted a total of 34,305 unique drug-disease treatment pairs, the majority of which are not included in existing structured databases. Our algorithm achieved a precision of 0.904 and a recall of 0.131 in extracting all pairs, and a precision of 0.904 and a recall of 0.842 in extracting frequent pairs. In addition, we have shown that the extracted pairs strongly correlate with both drug target genes and therapeutic classes and therefore may have high potential in drug discovery. Conclusions: We demonstrated that our simple pattern-learning relationship extraction algorithm is able to accurately extract many drug-disease pairs from the free text of biomedical literature that are not captured in structured databases. The large-scale, accurate, machine-understandable drug-disease treatment knowledge base resulting from our study, in combination with pairs from structured databases, will have high potential in computational drug repurposing tasks. PMID:23742147
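
    As a toy illustration of pattern-based extraction of (drug, disease) treatment pairs from free text (the patterns and example sentences below are illustrative guesses, not the patterns learned in the study), the following Python sketch applies two regular-expression templates to sample sentences:

```python
# Toy pattern-based extraction of (drug, disease) treatment pairs.
# The patterns and sentences are illustrative, not those learned in the study.
import re

PATTERNS = [
    re.compile(r"(?P<drug>[A-Z][\w-]+) (?:for|in) the treatment of (?P<disease>[\w\s-]+?)[\.,]"),
    re.compile(r"(?P<disease>[\w\s-]+?) (?:was|were) treated with (?P<drug>[A-Z][\w-]+)"),
]

def extract_pairs(text: str):
    """Return the set of (drug, disease) pairs matched by any pattern."""
    pairs = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            pairs.add((match.group("drug").strip(), match.group("disease").strip()))
    return pairs

if __name__ == "__main__":
    text = ("Metformin for the treatment of type 2 diabetes. "
            "Rheumatoid arthritis was treated with Methotrexate in 12 patients.")
    print(extract_pairs(text))
```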

  5. Post-16 Physics and Chemistry Uptake: Combining Large-Scale Secondary Analysis with In-Depth Qualitative Methods

    ERIC Educational Resources Information Center

    Hampden-Thompson, Gillian; Lubben, Fred; Bennett, Judith

    2011-01-01

    Quantitative secondary analysis of large-scale data can be combined with in-depth qualitative methods. In this paper, we discuss the role of this combined methods approach in examining the uptake of physics and chemistry in post compulsory schooling for students in England. The secondary data analysis of the National Pupil Database (NPD) served…

  6. Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency.

    PubMed

    Aniceto, Rodrigo; Xavier, Rene; Guimarães, Valeria; Hondo, Fernanda; Holanda, Maristela; Walter, Maria Emilia; Lifschitz, Sérgio

    2015-01-01

    Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them refers to the management of massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. Finding an alternative to the frequently considered relational database model becomes a compelling task. Other data models may be more effective when dealing with a very large amount of nonconventional data, especially for writing and retrieving operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB.
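
    A minimal sketch of persisting sequencing reads in Cassandra from Python using the DataStax cassandra-driver; the keyspace name, table schema and local contact point are assumptions for illustration, and the paper's actual data model may differ.

```python
# Minimal sketch: persist sequencing reads in Cassandra via the DataStax driver.
# Keyspace, table layout and contact point are assumptions for illustration.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # assumes a local Cassandra node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS genomics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS genomics.reads (
        read_id text PRIMARY KEY,
        sequence text,
        quality text
    )
""")

insert = session.prepare(
    "INSERT INTO genomics.reads (read_id, sequence, quality) VALUES (?, ?, ?)")
session.execute(insert, ("read_0001", "ACGTACGTGATTACA", "IIIIIIIIIIIIIII"))

row = session.execute(
    "SELECT sequence FROM genomics.reads WHERE read_id = %s", ("read_0001",)).one()
print(row.sequence)

cluster.shutdown()
```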

  7. The T1R2/T1R3 sweet receptor and TRPM5 ion channel taste targets with therapeutic potential.

    PubMed

    Sprous, Dennis; Palmer, Kyle R

    2010-01-01

    Taste signaling is a critical determinant of ingestive behaviors and thereby linked to obesity and related metabolic dysfunctions. Recent evidence of taste signaling pathways in the gut suggests the link to be more direct, raising the possibility that taste receptor systems could be regarded as therapeutic targets. T1R2/T1R3, the G protein coupled receptor that mediates sweet taste, and the TRPM5 ion channel have been the focus of discovery programs seeking novel compounds that could be useful in modifying taste. We review in this chapter the hypothesis of gastrointestinal taste signaling and discuss the potential for T1R2/T1R3 and TRPM5 as targets of therapeutic intervention in obesity and diabetes. Critical to the development of a drug discovery program is the creation of libraries that enhance the likelihood of identifying novel compounds that modulate the target of interest. We advocate a computer-based chemoinformatic approach for assembling natural and synthetic compound libraries as well as for supporting optimization of structure activity relationships. Strategies for discovering modulators of T1R2/T1R3 and TRPM5 using methods of chemoinformatics are presented herein. Copyright 2010 Elsevier Inc. All rights reserved.

  8. Bio-AIMS Collection of Chemoinformatics Web Tools based on Molecular Graph Information and Artificial Intelligence Models.

    PubMed

    Munteanu, Cristian R; Gonzalez-Diaz, Humberto; Garcia, Rafael; Loza, Mabel; Pazos, Alejandro

    2015-01-01

    Encoding molecular information into molecular descriptors is the first step of in silico chemoinformatics methods in drug design. Machine learning methods provide a means of finding prediction models for specific biological properties of molecules. These models connect molecular structure information, such as atom connectivity (molecular graphs) or physical-chemical properties of an atom or group of atoms, to the molecular activity (Quantitative Structure - Activity Relationship, QSAR). Due to the complexity of proteins, predicting their activity is a complicated task and the interpretation of the models is more difficult. The current review presents a series of 11 prediction models for proteins, implemented as free web tools on an Artificial Intelligence Model Server in Biosciences, Bio-AIMS (http://bio-aims.udc.es/TargetPred.php). Six tools predict protein activity, two models evaluate drug - protein target interactions and the other three calculate protein - protein interactions. The input information is based on the protein 3D structure for nine models, the 1D peptide amino acid sequence for three tools and drug SMILES formulas for two servers. The molecular graph descriptor-based machine learning models could be useful tools for in silico screening of new peptides/proteins as future drug targets for specific treatments.

  9. Advancing the large-scale CCS database for metabolomics and lipidomics at the machine-learning era.

    PubMed

    Zhou, Zhiwei; Tu, Jia; Zhu, Zheng-Jiang

    2018-02-01

    Metabolomics and lipidomics aim to comprehensively measure the dynamic changes of all metabolites and lipids present in biological systems. The use of ion mobility-mass spectrometry (IM-MS) for metabolomics and lipidomics has facilitated the separation and identification of metabolites and lipids in complex biological samples. The collision cross-section (CCS) value derived from IM-MS is a valuable physicochemical property for the unambiguous identification of metabolites and lipids. However, CCS values obtained from experimental measurement and computational modeling are only available in limited numbers, which significantly restricts the application of IM-MS. In this review, we discuss the recently developed machine-learning based prediction approach, which can efficiently generate precise CCS databases on a large scale. We also highlight the applications of CCS databases to support metabolomics and lipidomics. Copyright © 2017 Elsevier Ltd. All rights reserved.
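
    A minimal sketch of the general idea behind machine-learning CCS prediction (not the specific model discussed in the review): train a regressor that maps molecular descriptors to CCS values and evaluate it on held-out molecules. The random forest model, the descriptor matrix and the synthetic CCS values below are placeholder assumptions.

```python
# Sketch of descriptor-to-CCS regression with a random forest.
# Descriptors and CCS values are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import median_absolute_error

rng = np.random.default_rng(42)
X = rng.normal(size=(800, 15))                                          # 800 molecules x 15 descriptors
ccs = 150 + 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=2, size=800)  # synthetic CCS values (A^2)

X_train, X_test, y_train, y_test = train_test_split(X, ccs, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("median absolute error (A^2):", median_absolute_error(y_test, model.predict(X_test)))
```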

  10. Competitive code-based fast palmprint identification using a set of cover trees

    NASA Astrophysics Data System (ADS)

    Yue, Feng; Zuo, Wangmeng; Zhang, David; Wang, Kuanquan

    2009-06-01

    A palmprint identification system recognizes a query palmprint image by searching for its nearest neighbor from among all the templates in a database. When applied to a large-scale identification system, it is often necessary to speed up the nearest-neighbor searching process. We use competitive code, which has very fast feature extraction and matching speed, for palmprint identification. To speed up the identification process, we extend the cover tree method and propose to use a set of cover trees to facilitate fast and accurate nearest-neighbor searching. We can use the cover tree method because, as we show, the angular distance used in competitive code can be decomposed into a set of metrics. Using the Hong Kong PolyU palmprint database (version 2) and a large-scale palmprint database, our experimental results show that the proposed method searches for nearest neighbors faster than brute-force searching.

  11. Validation of a common data model for active safety surveillance research

    PubMed Central

    Ryan, Patrick B; Reich, Christian G; Hartzema, Abraham G; Stang, Paul E

    2011-01-01

    Objective: Systematic analysis of observational medical databases for active safety surveillance is hindered by the variation in data models and coding systems. Data analysts often find robust clinical data models difficult to understand and ill suited to support their analytic approaches. Further, some models do not facilitate the computations required for systematic analysis across many interventions and outcomes for large datasets. Translating the data from these idiosyncratic data models to a common data model (CDM) could facilitate both the analysts' understanding and the suitability for large-scale systematic analysis. In addition to facilitating analysis, a suitable CDM has to faithfully represent the source observational database. Before beginning to use the Observational Medical Outcomes Partnership (OMOP) CDM and a related dictionary of standardized terminologies for a study of large-scale systematic active safety surveillance, the authors validated the model's suitability for this use by example. Validation by example: To validate the OMOP CDM, the model was instantiated into a relational database, data from 10 different observational healthcare databases were loaded into separate instances, a comprehensive array of analytic methods that operate on the data model was created, and these methods were executed against the databases to measure performance. Conclusion: There was acceptable representation of the data from 10 observational databases in the OMOP CDM using the standardized terminologies selected, and a range of analytic methods was developed and executed with sufficient performance to be useful for active safety surveillance. PMID:22037893

  12. Computational Thermochemistry: Scale Factor Databases and Scale Factors for Vibrational Frequencies Obtained from Electronic Model Chemistries.

    PubMed

    Alecu, I M; Zheng, Jingjing; Zhao, Yan; Truhlar, Donald G

    2010-09-14

    Optimized scale factors for calculating vibrational harmonic and fundamental frequencies and zero-point energies have been determined for 145 electronic model chemistries, including 119 based on approximate functionals depending on occupied orbitals, 19 based on single-level wave function theory, three based on the neglect-of-diatomic-differential-overlap, two based on doubly hybrid density functional theory, and two based on multicoefficient correlation methods. Forty of the scale factors are obtained from large databases, which are also used to derive two universal scale factor ratios that can be used to interconvert between scale factors optimized for various properties, enabling the derivation of three key scale factors at the effort of optimizing only one of them. A reduced scale factor optimization model is formulated in order to further reduce the cost of optimizing scale factors, and the reduced model is illustrated by using it to obtain 105 additional scale factors. Using root-mean-square errors from the values in the large databases, we find that scaling reduces errors in zero-point energies by a factor of 2.3 and errors in fundamental vibrational frequencies by a factor of 3.0, but it reduces errors in harmonic vibrational frequencies by only a factor of 1.3. It is shown that, upon scaling, the balanced multicoefficient correlation method based on coupled cluster theory with single and double excitations (BMC-CCSD) can lead to very accurate predictions of vibrational frequencies. With a polarized, minimally augmented basis set, the density functionals with zero-point energy scale factors closest to unity are MPWLYP1M (1.009), τHCTHhyb (0.989), BB95 (1.012), BLYP (1.013), BP86 (1.014), B3LYP (0.986), MPW3LYP (0.986), and VSXC (0.986).
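
    To make the role of a scale factor concrete (a worked sketch using a small set of hypothetical harmonic frequencies), the harmonic zero-point energy is half the sum of the harmonic vibrational frequencies, ZPE = (1/2) Σ hcν̃_i, and the scaled value is obtained by multiplying by the optimized factor; the B3LYP ZPE scale factor of 0.986 quoted above is used here.

```python
# Worked sketch: harmonic zero-point energy from frequencies, then scaling.
# The frequencies are hypothetical; 0.986 is the B3LYP ZPE scale factor quoted above.
CM1_TO_KCAL_PER_MOL = 1.0 / 349.755   # 1 kcal/mol = 349.755 cm^-1

def zero_point_energy(frequencies_cm1, scale_factor=1.0):
    """ZPE = (1/2) * sum of harmonic frequencies, returned in kcal/mol, optionally scaled."""
    zpe_cm1 = 0.5 * sum(frequencies_cm1)
    return scale_factor * zpe_cm1 * CM1_TO_KCAL_PER_MOL

# Hypothetical harmonic frequencies (cm^-1) for a small triatomic molecule.
freqs = [1650.0, 3800.0, 3900.0]
print("unscaled ZPE:", round(zero_point_energy(freqs), 2), "kcal/mol")
print("scaled ZPE:  ", round(zero_point_energy(freqs, scale_factor=0.986), 2), "kcal/mol")
```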

  13. WikiPEATia - a web based platform for assembling peatland data through ‘crowd sourcing’

    NASA Astrophysics Data System (ADS)

    Wisser, D.; Glidden, S.; Fieseher, C.; Treat, C. C.; Routhier, M.; Frolking, S. E.

    2009-12-01

    The Earth System Science community is realizing that peatlands are an important and unique terrestrial ecosystem that has not yet been well integrated into large-scale earth system analyses. A major hurdle is the lack of accessible geospatial data on peatland distribution, coupled with data on peatland properties (e.g., vegetation composition, peat depth, basal dates, soil chemistry, peatland class) at the global scale. These data are, however, available at the local scale. Although a comprehensive global database on peatlands probably lags behind similar data on more economically important ecosystems such as forests, grasslands and croplands, a large amount of field data has been collected over the past several decades. A few efforts have been made to map peatlands at large scales, but existing data have not been assembled into a single, publicly accessible geospatial database, or do not depict data with the level of detail needed by the Earth System Science community. A global peatland database would contribute to advances in a number of research fields such as hydrology, vegetation and ecosystem modeling, permafrost modeling, and earth system modeling. We present a Web 2.0 approach that uses state-of-the-art webserver and innovative online mapping technologies and is designed to create such a global database through 'crowd-sourcing'. Primary functions of the online system include form-driven textual user input of peatland research metadata, spatial input of peatland areas via a mapping interface, database editing and querying capabilities, as well as advanced visualization and data analysis tools. WikiPEATia provides an integrated information technology platform for assembling, integrating, and posting peatland-related geospatial datasets, and it facilitates and encourages research community involvement. A successful effort will make existing peatland data much more useful to the research community, and will help to identify significant data gaps.

  14. Integrated Database And Knowledge Base For Genomic Prospective Cohort Study In Tohoku Medical Megabank Toward Personalized Prevention And Medicine.

    PubMed

    Ogishima, Soichi; Takai, Takako; Shimokawa, Kazuro; Nagaie, Satoshi; Tanaka, Hiroshi; Nakaya, Jun

    2015-01-01

    The Tohoku Medical Megabank project is a national project for the revitalization of the disaster area in the Tohoku region struck by the Great East Japan Earthquake, and it has conducted a large-scale prospective genome-cohort study. Along with the prospective genome-cohort study, we have developed an integrated database and knowledge base which will be a key database for realizing personalized prevention and medicine.

  15. The HUPO PSI's molecular interaction format--a community standard for the representation of protein interaction data.

    PubMed

    Hermjakob, Henning; Montecchi-Palazzi, Luisa; Bader, Gary; Wojcik, Jérôme; Salwinski, Lukasz; Ceol, Arnaud; Moore, Susan; Orchard, Sandra; Sarkans, Ugis; von Mering, Christian; Roechert, Bernd; Poux, Sylvain; Jung, Eva; Mersch, Henning; Kersey, Paul; Lappe, Michael; Li, Yixue; Zeng, Rong; Rana, Debashis; Nikolski, Macha; Husi, Holger; Brun, Christine; Shanker, K; Grant, Seth G N; Sander, Chris; Bork, Peer; Zhu, Weimin; Pandey, Akhilesh; Brazma, Alvis; Jacq, Bernard; Vidal, Marc; Sherman, David; Legrain, Pierre; Cesareni, Gianni; Xenarios, Ioannis; Eisenberg, David; Steipe, Boris; Hogue, Chris; Apweiler, Rolf

    2004-02-01

    A major goal of proteomics is the complete description of the protein interaction network underlying cell physiology. A large number of small scale and, more recently, large-scale experiments have contributed to expanding our understanding of the nature of the interaction network. However, the necessary data integration across experiments is currently hampered by the fragmentation of publicly available protein interaction data, which exists in different formats in databases, on authors' websites or sometimes only in print publications. Here, we propose a community standard data model for the representation and exchange of protein interaction data. This data model has been jointly developed by members of the Proteomics Standards Initiative (PSI), a work group of the Human Proteome Organization (HUPO), and is supported by major protein interaction data providers, in particular the Biomolecular Interaction Network Database (BIND), Cellzome (Heidelberg, Germany), the Database of Interacting Proteins (DIP), Dana Farber Cancer Institute (Boston, MA, USA), the Human Protein Reference Database (HPRD), Hybrigenics (Paris, France), the European Bioinformatics Institute's (EMBL-EBI, Hinxton, UK) IntAct, the Molecular Interactions (MINT, Rome, Italy) database, the Protein-Protein Interaction Database (PPID, Edinburgh, UK) and the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING, EMBL, Heidelberg, Germany).

  16. Building a multi-scaled geospatial temporal ecology database from disparate data sources: fostering open science and data reuse.

    PubMed

    Soranno, Patricia A; Bissell, Edward G; Cheruvelil, Kendra S; Christel, Samuel T; Collins, Sarah M; Fergus, C Emi; Filstrup, Christopher T; Lapierre, Jean-Francois; Lottig, Noah R; Oliver, Samantha K; Scott, Caren E; Smith, Nicole J; Stopyak, Scott; Yuan, Shuai; Bremigan, Mary Tate; Downing, John A; Gries, Corinna; Henry, Emily N; Skaff, Nick K; Stanley, Emily H; Stow, Craig A; Tan, Pang-Ning; Wagner, Tyler; Webster, Katherine E

    2015-01-01

    Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km(2)). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database. Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

  17. Building a multi-scaled geospatial temporal ecology database from disparate data sources: Fostering open science through data reuse

    USGS Publications Warehouse

    Soranno, Patricia A.; Bissell, E.G.; Cheruvelil, Kendra S.; Christel, Samuel T.; Collins, Sarah M.; Fergus, C. Emi; Filstrup, Christopher T.; Lapierre, Jean-Francois; Lotting, Noah R.; Oliver, Samantha K.; Scott, Caren E.; Smith, Nicole J.; Stopyak, Scott; Yuan, Shuai; Bremigan, Mary Tate; Downing, John A.; Gries, Corinna; Henry, Emily N.; Skaff, Nick K.; Stanley, Emily H.; Stow, Craig A.; Tan, Pang-Ning; Wagner, Tyler; Webster, Katherine E.

    2015-01-01

    Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km2). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database. Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

  18. Scale out databases for CERN use cases

    NASA Astrophysics Data System (ADS)

    Baranowski, Zbigniew; Grzybek, Maciej; Canali, Luca; Lanza Garcia, Daniel; Surdy, Kacper

    2015-12-01

    Data generation rates are expected to grow very fast for some database workloads going into LHC run 2 and beyond. In particular this is expected for data coming from controls, logging and monitoring systems. Storing, administering and accessing big data sets in a relational database system can quickly become a very hard technical challenge, as the size of the active data set and the number of concurrent users increase. Scale-out database technologies are a rapidly developing set of solutions for deploying and managing very large data warehouses on commodity hardware and with open source software. In this paper we will describe the architecture and tests on database systems based on Hadoop and the Cloudera Impala engine. We will discuss the results of our tests, including tests of data loading and integration with existing data sources and in particular with relational databases. We will report on query performance tests done with various data sets of interest at CERN, notably data from the accelerator log database.

  19. Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency

    PubMed Central

    Aniceto, Rodrigo; Xavier, Rene; Guimarães, Valeria; Hondo, Fernanda; Holanda, Maristela; Walter, Maria Emilia; Lifschitz, Sérgio

    2015-01-01

    Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them refers to management of massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. To find an alternative to the frequently considered relational database model becomes a compelling task. Other data models may be more effective when dealing with a very large amount of nonconventional data, especially for writing and retrieving operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB. PMID:26558254
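
    The record above motivates a wide-row, partition-per-sample layout for sequence data in Cassandra. The following Python fragment is a minimal sketch of such a layout using the DataStax cassandra-driver; the keyspace name, table schema, field choices, and the local contact point 127.0.0.1 are illustrative assumptions, not the schema benchmarked in the paper.

        # Illustrative sketch: persisting sequence reads in Cassandra (not the paper's schema).
        # Assumes a local Cassandra node and the DataStax driver: pip install cassandra-driver
        from cassandra.cluster import Cluster

        cluster = Cluster(["127.0.0.1"])          # assumed contact point
        session = cluster.connect()

        session.execute("""
            CREATE KEYSPACE IF NOT EXISTS genomics
            WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
        """)
        session.set_keyspace("genomics")

        # Wide-row layout: one partition per (sample, chromosome), clustered by position,
        # so range scans over a genomic region stay within a single partition.
        session.execute("""
            CREATE TABLE IF NOT EXISTS reads (
                sample_id text,
                chromosome text,
                position bigint,
                read_id text,
                sequence text,
                quality text,
                PRIMARY KEY ((sample_id, chromosome), position, read_id)
            )
        """)

        insert = session.prepare(
            "INSERT INTO reads (sample_id, chromosome, position, read_id, sequence, quality) "
            "VALUES (?, ?, ?, ?, ?, ?)"
        )
        session.execute(insert, ("S001", "chr1", 15210, "r0001", "ACGTACGTAC", "IIIIIIIIII"))

        # Region query: all reads for sample S001 on chr1 between two coordinates.
        rows = session.execute(
            "SELECT read_id, position, sequence FROM reads "
            "WHERE sample_id = %s AND chromosome = %s AND position >= %s AND position <= %s",
            ("S001", "chr1", 15000, 16000),
        )
        for row in rows:
            print(row.read_id, row.position, row.sequence)

        cluster.shutdown()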

  20. Search for 5'-leader regulatory RNA structures based on gene annotation aided by the RiboGap database.

    PubMed

    Naghdi, Mohammad Reza; Smail, Katia; Wang, Joy X; Wade, Fallou; Breaker, Ronald R; Perreault, Jonathan

    2017-03-15

    The discovery of noncoding RNAs (ncRNAs) and their importance for gene regulation led us to develop bioinformatics tools to pursue the discovery of novel ncRNAs. Finding ncRNAs de novo is challenging, first due to the difficulty of retrieving large numbers of sequences for given gene activities, and second due to exponential demands on calculation needed for comparative genomics on a large scale. Recently, several tools for the prediction of conserved RNA secondary structure were developed, but many of them are not designed to uncover new ncRNAs, or are too slow for conducting analyses on a large scale. Here we present various approaches using the database RiboGap as a primary tool for finding known ncRNAs and for uncovering simple sequence motifs with regulatory roles. This database also can be used to easily extract intergenic sequences of eubacteria and archaea to find conserved RNA structures upstream of given genes. We also show how to extend analysis further to choose the best candidate ncRNAs for experimental validation. Copyright © 2017 Elsevier Inc. All rights reserved.

  1. Design of a decentralized reusable research database architecture to support data acquisition in large research projects.

    PubMed

    Iavindrasana, Jimison; Depeursinge, Adrien; Ruch, Patrick; Spahni, Stéphane; Geissbuhler, Antoine; Müller, Henning

    2007-01-01

    The diagnostic and therapeutic processes, as well as the development of new treatments, are hindered by the fragmentation of the information that underlies them. In a multi-institutional research study database, the clinical information system (CIS) contains the primary data input. A substantial part of the budget of large-scale clinical studies is often spent on data creation and maintenance. The objective of this work is to design a decentralized, scalable, reusable database architecture with lower maintenance costs for managing and integrating the distributed heterogeneous data required as the basis for a large-scale research project. Technical and legal aspects are taken into account based on various use case scenarios. The architecture contains four layers: data storage and access, decentralized at their production source; a connector acting as a proxy between the CIS and the external world; an information mediator serving as a data access point; and the client side. The proposed design will be implemented in six clinical centers participating in the @neurIST project as part of a larger system for data integration and reuse in aneurysm treatment.

  2. Multiple Object Retrieval in Image Databases Using Hierarchical Segmentation Tree

    ERIC Educational Resources Information Center

    Chen, Wei-Bang

    2012-01-01

    The purpose of this research is to develop a new visual information analysis, representation, and retrieval framework for automatic discovery of salient objects of user's interest in large-scale image databases. In particular, this dissertation describes a content-based image retrieval framework which supports multiple-object retrieval. The…

  3. Visual Attention Modeling for Stereoscopic Video: A Benchmark and Computational Model.

    PubMed

    Fang, Yuming; Zhang, Chi; Li, Jing; Lei, Jianjun; Perreira Da Silva, Matthieu; Le Callet, Patrick

    2017-10-01

    In this paper, we investigate the visual attention modeling for stereoscopic video from the following two aspects. First, we build one large-scale eye tracking database as the benchmark of visual attention modeling for stereoscopic video. The database includes 47 video sequences and their corresponding eye fixation data. Second, we propose a novel computational model of visual attention for stereoscopic video based on Gestalt theory. In the proposed model, we extract the low-level features, including luminance, color, texture, and depth, from discrete cosine transform coefficients, which are used to calculate feature contrast for the spatial saliency computation. The temporal saliency is calculated by the motion contrast from the planar and depth motion features in the stereoscopic video sequences. The final saliency is estimated by fusing the spatial and temporal saliency with uncertainty weighting, which is estimated by the laws of proximity, continuity, and common fate in Gestalt theory. Experimental results show that the proposed method outperforms the state-of-the-art stereoscopic video saliency detection models on our built large-scale eye tracking database and one other database (DML-ITRACK-3D).
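
    The fusion step described above (combining spatial and temporal saliency with uncertainty weighting) can be illustrated with a simplified NumPy sketch. The Gestalt-based uncertainty estimation of the paper is replaced here by assumed per-map uncertainty values, so this is only a schematic of the weighting arithmetic, not the published model.

        # Simplified sketch of uncertainty-weighted saliency fusion (not the paper's full model).
        import numpy as np

        def fuse_saliency(spatial, temporal, u_spatial, u_temporal):
            """Fuse two saliency maps; weights are inversely proportional to uncertainty."""
            w_s, w_t = 1.0 / u_spatial, 1.0 / u_temporal
            fused = (w_s * spatial + w_t * temporal) / (w_s + w_t)
            # Normalize to [0, 1] for display/comparison.
            return (fused - fused.min()) / (fused.max() - fused.min() + 1e-12)

        rng = np.random.default_rng(0)
        spatial_map = rng.random((64, 64))    # stand-in for feature-contrast saliency
        temporal_map = rng.random((64, 64))   # stand-in for motion-contrast saliency

        # Assumed uncertainties: here the temporal cue is treated as less reliable.
        fused_map = fuse_saliency(spatial_map, temporal_map, u_spatial=0.2, u_temporal=0.5)
        print(fused_map.shape, float(fused_map.mean()))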

  4. Temporal and Fine-Grained Pedestrian Action Recognition on Driving Recorder Database

    PubMed Central

    Satoh, Yutaka; Aoki, Yoshimitsu; Oikawa, Shoko; Matsui, Yasuhiro

    2018-01-01

    The paper presents the emerging issue of fine-grained pedestrian action recognition, which supports advanced pre-crash safety by estimating a pedestrian's intention in advance. Fine-grained pedestrian actions include visually slight differences (e.g., walking straight and crossing), which are difficult to distinguish from each other. Fine-grained action recognition is expected to enable pedestrian intention estimation for helpful advanced driver-assistance systems (ADAS). The following difficulties must be addressed to achieve fine-grained and accurate pedestrian action recognition: (i) to analyze the fine-grained motion of a pedestrian appearing in a vehicle-mounted drive recorder, a method is needed that describes the subtle changes in motion characteristics occurring over a short time; (ii) even when the background moves greatly due to the driving of the vehicle, changes in the subtle motion of the pedestrian must be detected; (iii) collecting large-scale fine-grained actions is very difficult, so a relatively small database must suffice. We show how to learn an effective recognition model with only a small-scale database. We thoroughly evaluated several types of configurations to explore an effective approach to fine-grained pedestrian action recognition without a large-scale database. Moreover, two different datasets were collected in order to raise the issue. Finally, our proposal attained 91.01% on the National Traffic Science and Environment Laboratory database (NTSEL) and 53.23% on the near-miss driving recorder database (NDRDB), improving by +8.28% and +6.53% over baseline two-stream fusion convnets. PMID:29461473

  5. The statistical power to detect cross-scale interactions at macroscales

    USGS Publications Warehouse

    Wagner, Tyler; Fergus, C. Emi; Stow, Craig A.; Cheruvelil, Kendra S.; Soranno, Patricia A.

    2016-01-01

    Macroscale studies of ecological phenomena are increasingly common because stressors such as climate and land-use change operate at large spatial and temporal scales. Cross-scale interactions (CSIs), where ecological processes operating at one spatial or temporal scale interact with processes operating at another scale, have been documented in a variety of ecosystems and contribute to complex system dynamics. However, studies investigating CSIs are often dependent on compiling multiple data sets from different sources to create multithematic, multiscaled data sets, which results in structurally complex, and sometimes incomplete data sets. The statistical power to detect CSIs needs to be evaluated because of their importance and the challenge of quantifying CSIs using data sets with complex structures and missing observations. We studied this problem using a spatially hierarchical model that measures CSIs between regional agriculture and its effects on the relationship between lake nutrients and lake productivity. We used an existing large multithematic, multiscaled database, LAke multiscaled GeOSpatial, and temporal database (LAGOS), to parameterize the power analysis simulations. We found that the power to detect CSIs was more strongly related to the number of regions in the study rather than the number of lakes nested within each region. CSI power analyses will not only help ecologists design large-scale studies aimed at detecting CSIs, but will also focus attention on CSI effect sizes and the degree to which they are ecologically relevant and detectable with large data sets.
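
    As a rough illustration of the kind of simulation-based power analysis described above, the sketch below generates lakes nested within regions, lets a region-level covariate modify the lake-level nutrient–productivity slope (the cross-scale interaction), fits a random-intercept model with statsmodels, and counts how often the interaction is detected. The effect sizes, sample sizes, and model form are assumptions for illustration, not the LAGOS-parameterized model from the study.

        # Illustrative power simulation for a cross-scale interaction (CSI); parameters are assumed.
        import warnings
        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        def simulate_power(n_regions=30, lakes_per_region=20, csi_effect=0.15,
                           n_sims=100, alpha=0.05, seed=1):
            rng = np.random.default_rng(seed)
            detections = 0
            for _ in range(n_sims):
                region = np.repeat(np.arange(n_regions), lakes_per_region)
                agri = np.repeat(rng.normal(size=n_regions), lakes_per_region)   # region-level driver
                nutrient = rng.normal(size=n_regions * lakes_per_region)         # lake-level predictor
                region_intercept = np.repeat(rng.normal(scale=0.5, size=n_regions), lakes_per_region)
                # Productivity: the nutrient slope is modified by regional agriculture (the CSI term).
                y = (1.0 + region_intercept
                     + (0.5 + csi_effect * agri) * nutrient
                     + 0.3 * agri
                     + rng.normal(scale=1.0, size=nutrient.size))
                df = pd.DataFrame({"y": y, "nutrient": nutrient, "agri": agri, "region": region})
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    fit = smf.mixedlm("y ~ nutrient * agri", df, groups=df["region"]).fit()
                if fit.pvalues["nutrient:agri"] < alpha:
                    detections += 1
            return detections / n_sims

        # Power tends to respond more to the number of regions than to lakes per region.
        print("30 regions x 20 lakes:", simulate_power(30, 20))
        print("10 regions x 60 lakes:", simulate_power(10, 60))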

  6. The Sequenced Angiosperm Genomes and Genome Databases.

    PubMed

    Chen, Fei; Dong, Wei; Zhang, Jiawei; Guo, Xinyue; Chen, Junhao; Wang, Zhengjia; Lin, Zhenguo; Tang, Haibao; Zhang, Liangsheng

    2018-01-01

    Angiosperms, the flowering plants, provide essential resources for human life, such as food, energy, oxygen, and materials. They have also shaped the evolution of humans, animals, and the planet Earth. Despite the numerous advances in genome reports and sequencing technologies, no review covers all of the released angiosperm genomes and the genome databases available for data sharing. Based on the rapid advances and innovations in database construction over the last few years, here we provide a comprehensive review of three major types of angiosperm genome databases: databases for a single species, for a specific angiosperm clade, and for multiple angiosperm species. The scope, tools, and data of each type of database and their features are concisely discussed. Genome databases for a single species or a clade of species are especially popular with specific groups of researchers, while a regularly updated comprehensive database is more powerful for addressing major scientific questions at the genome scale. Considering the low coverage of flowering plants in any available database, we propose the construction of a comprehensive database to facilitate large-scale comparative studies of angiosperm genomes and to promote collaborative studies of important questions in plant biology.

  7. The Sequenced Angiosperm Genomes and Genome Databases

    PubMed Central

    Chen, Fei; Dong, Wei; Zhang, Jiawei; Guo, Xinyue; Chen, Junhao; Wang, Zhengjia; Lin, Zhenguo; Tang, Haibao; Zhang, Liangsheng

    2018-01-01

    Angiosperms, the flowering plants, provide essential resources for human life, such as food, energy, oxygen, and materials. They have also shaped the evolution of humans, animals, and the planet Earth. Despite the numerous advances in genome reports and sequencing technologies, no review covers all of the released angiosperm genomes and the genome databases available for data sharing. Based on the rapid advances and innovations in database construction over the last few years, here we provide a comprehensive review of three major types of angiosperm genome databases: databases for a single species, for a specific angiosperm clade, and for multiple angiosperm species. The scope, tools, and data of each type of database and their features are concisely discussed. Genome databases for a single species or a clade of species are especially popular with specific groups of researchers, while a regularly updated comprehensive database is more powerful for addressing major scientific questions at the genome scale. Considering the low coverage of flowering plants in any available database, we propose the construction of a comprehensive database to facilitate large-scale comparative studies of angiosperm genomes and to promote collaborative studies of important questions in plant biology. PMID:29706973

  8. EPA'S LANDSCAPE SCIENCES RESEARCH: NUTRIENT POLLUTION, FLOODING, AND HABITAT

    EPA Science Inventory

    There is a growing need to understand the pattern of landscape change at regional scales and to determine how such changes affect environmental values. Key to conducting these assessments is the development of land-cover databases that permit large-scale analyses, such as an exam...

  9. Identification of Small-Molecule Frequent Hitters of Glutathione S-Transferase-Glutathione Interaction.

    PubMed

    Brenke, Jara K; Salmina, Elena S; Ringelstetter, Larissa; Dornauer, Scarlett; Kuzikov, Maria; Rothenaigner, Ina; Schorpp, Kenji; Giehler, Fabian; Gopalakrishnan, Jay; Kieser, Arnd; Gul, Sheraz; Tetko, Igor V; Hadian, Kamyar

    2016-07-01

    In high-throughput screening (HTS) campaigns, the binding of glutathione S-transferase (GST) to glutathione (GSH) is used for detection of GST-tagged proteins in protein-protein interactions or enzyme assays. However, many false-positives, so-called frequent hitters (FH), arise that either prevent GST/GSH interaction or interfere with assay signal generation or detection. To identify GST-FH compounds, we analyzed the data of five independent AlphaScreen-based screening campaigns to classify compounds that inhibit the GST/GSH interaction. We identified 53 compounds affecting GST/GSH binding but not influencing His-tag/Ni(2+)-NTA interaction and general AlphaScreen signals. The structures of these 53 experimentally identified GST-FHs were analyzed in chemoinformatic studies to categorize substructural features that promote interference with GST/GSH binding. Here, we confirmed several existing chemoinformatic filters and more importantly extended them as well as added novel filters that specify compounds with anti-GST/GSH activity. Selected compounds were also tested using different antibody-based GST detection technologies and exhibited no interference clearly demonstrating specificity toward their GST/GSH interaction. Thus, these newly described GST-FH will further contribute to the identification of FH compounds containing promiscuous substructures. The developed filters were uploaded to the OCHEM website (http://ochem.eu) and are publicly accessible for analysis of future HTS results. © 2016 Society for Laboratory Automation and Screening.
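
    Substructure-based frequent-hitter filters of the kind uploaded to OCHEM can be expressed as SMARTS alerts and applied with RDKit. The sketch below flags compounds matching two generic, widely known alert patterns (quinone, catechol); these SMARTS and example molecules are illustrative placeholders, not the specific GST-FH filters published in the study.

        # Illustrative SMARTS-alert screen with RDKit (placeholder patterns, not the published GST-FH filters).
        from rdkit import Chem

        ALERTS = {
            "quinone": Chem.MolFromSmarts("O=C1C=CC(=O)C=C1"),
            "catechol": Chem.MolFromSmarts("c1ccc(O)c(O)c1"),
        }

        def frequent_hitter_flags(smiles):
            """Return the list of alert names matched by a compound (empty list = no alert)."""
            mol = Chem.MolFromSmiles(smiles)
            if mol is None:
                return ["unparsable"]
            return [name for name, patt in ALERTS.items() if mol.HasSubstructMatch(patt)]

        screening_set = {
            "p-benzoquinone": "O=C1C=CC(=O)C=C1",
            "catechol": "Oc1ccccc1O",
            "ethyl benzoate (control)": "CCOC(=O)c1ccccc1",
        }
        for name, smi in screening_set.items():
            print(name, "->", frequent_hitter_flags(smi) or "clean")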

  10. C-SPADE: a web-tool for interactive analysis and visualization of drug screening experiments through compound-specific bioactivity dendrograms

    PubMed Central

    Alam, Zaid; Peddinti, Gopal

    2017-01-01

    Abstract The advent of polypharmacology paradigm in drug discovery calls for novel chemoinformatic tools for analyzing compounds’ multi-targeting activities. Such tools should provide an intuitive representation of the chemical space through capturing and visualizing underlying patterns of compound similarities linked to their polypharmacological effects. Most of the existing compound-centric chemoinformatics tools lack interactive options and user interfaces that are critical for the real-time needs of chemical biologists carrying out compound screening experiments. Toward that end, we introduce C-SPADE, an open-source exploratory web-tool for interactive analysis and visualization of drug profiling assays (biochemical, cell-based or cell-free) using compound-centric similarity clustering. C-SPADE allows the users to visually map the chemical diversity of a screening panel, explore investigational compounds in terms of their similarity to the screening panel, perform polypharmacological analyses and guide drug-target interaction predictions. C-SPADE requires only the raw drug profiling data as input, and it automatically retrieves the structural information and constructs the compound clusters in real-time, thereby reducing the time required for manual analysis in drug development or repurposing applications. The web-tool provides a customizable visual workspace that can either be downloaded as figure or Newick tree file or shared as a hyperlink with other users. C-SPADE is freely available at http://cspade.fimm.fi/. PMID:28472495
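
    Compound-centric similarity clustering of the kind C-SPADE visualizes can be approximated offline with RDKit fingerprints and SciPy hierarchical clustering, as in the sketch below. The fingerprint type, linkage method, and example SMILES are assumptions chosen for illustration and are not taken from the C-SPADE implementation.

        # Offline approximation of compound-centric similarity clustering (illustrative, not C-SPADE code).
        import numpy as np
        from rdkit import Chem, DataStructs
        from rdkit.Chem import AllChem
        from scipy.cluster.hierarchy import linkage, dendrogram
        from scipy.spatial.distance import squareform

        smiles = {
            "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
            "theobromine": "Cn1cnc2c1c(=O)[nH]c(=O)n2C",
            "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
            "ibuprofen": "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
        }
        names = list(smiles)
        fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
               for s in smiles.values()]

        # Pairwise Tanimoto distance matrix (1 - similarity).
        n = len(fps)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                sim = DataStructs.TanimotoSimilarity(fps[i], fps[j])
                dist[i, j] = dist[j, i] = 1.0 - sim

        # Average-linkage clustering; the linkage matrix is what a dendrogram view would draw.
        Z = linkage(squareform(dist, checks=False), method="average")
        tree = dendrogram(Z, labels=names, no_plot=True)
        print("leaf order:", tree["ivl"])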

  11. In silico environmental chemical science: properties and processes from statistical and computational modelling

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tratnyek, Paul G.; Bylaska, Eric J.; Weber, Eric J.

    2017-01-01

    Quantitative structure–activity relationships (QSARs) have long been used in the environmental sciences. More recently, molecular modeling and chemoinformatic methods have become widespread. These methods have the potential to expand and accelerate advances in environmental chemistry because they complement observational and experimental data with “in silico” results and analysis. The opportunities and challenges that arise at the intersection between statistical and theoretical in silico methods are most apparent in the context of properties that determine the environmental fate and effects of chemical contaminants (degradation rate constants, partition coefficients, toxicities, etc.). The main example of this is the calibration of QSARs using descriptor variable data calculated from molecular modeling, which can make QSARs more useful for predicting property data that are unavailable, but also can make them more powerful tools for diagnosis of fate determining pathways and mechanisms. Emerging opportunities for “in silico environmental chemical science” are to move beyond the calculation of specific chemical properties using statistical models and toward more fully in silico models, prediction of transformation pathways and products, incorporation of environmental factors into model predictions, integration of databases and predictive models into more comprehensive and efficient tools for exposure assessment, and extending the applicability of all the above from chemicals to biologicals and materials.

  12. Imbalance in chemical space: How to facilitate the identification of protein-protein interaction inhibitors.

    PubMed

    Kuenemann, Mélaine A; Labbé, Céline M; Cerdan, Adrien H; Sperandio, Olivier

    2016-04-01

    Protein-protein interactions (PPIs) play vital roles in life and provide new opportunities for therapeutic interventions. In this large data analysis, 3,300 inhibitors of PPIs (iPPIs) were compared to 17 reference datasets of collectively ~566,000 compounds (including natural compounds, existing drugs, active compounds on conventional targets, etc.) using a chemoinformatics approach. Using this procedure, we showed that comparable classes of PPI targets can be formed using either the similarity of their ligands or the shared properties of their binding cavities, constituting a proof-of-concept that not only can binding pockets be used to group PPI targets, but that these pockets certainly condition the properties of their corresponding ligands. These results demonstrate that matching regions in both chemical space and target space can be found. Such identified classes of targets could lead to the design of PPI-class-specific chemical libraries and therefore facilitate the development of iPPIs to the stage of drug candidates.

  13. Identification and prioritization of novel anti-Wolbachia chemotypes from screening a 10,000-compound diversity library

    PubMed Central

    Johnston, Kelly L.; Cook, Darren A. N.; Berry, Neil G.; David Hong, W.; Clare, Rachel H.; Goddard, Megan; Ford, Louise; Nixon, Gemma L.; O’Neill, Paul M.; Ward, Stephen A.; Taylor, Mark J.

    2017-01-01

    Lymphatic filariasis and onchocerciasis are two important neglected tropical diseases (NTDs) that cause severe disability. Control efforts are hindered by the lack of a safe macrofilaricidal drug. Targeting the Wolbachia bacterial endosymbionts in these parasites with doxycycline leads to a macrofilaricidal outcome, but protracted treatment regimens and contraindications restrict its widespread implementation. The Anti-Wolbachia consortium aims to develop improved anti-Wolbachia drugs to overcome these barriers. We describe the first screening of a large, diverse compound library against Wolbachia. This whole-organism screen, streamlined to reduce bottlenecks, produced a hit rate of 0.5%. Chemoinformatic analysis of the top 50 hits led to the identification of six structurally diverse chemotypes, the disclosure of which could offer interesting avenues of investigation to other researchers active in this field. An example of hit-to-lead optimization is described to further demonstrate the potential of developing these high-quality hit series as safe, efficacious, and selective anti-Wolbachia macrofilaricides. PMID:28959730

  14. Possibility of Database Research as a Means of Pharmacovigilance in Japan Based on a Comparison with Sertraline Postmarketing Surveillance.

    PubMed

    Hirano, Yoko; Asami, Yuko; Kuribayashi, Kazuhiko; Kitazaki, Shigeru; Yamamoto, Yuji; Fujimoto, Yoko

    2018-05-01

    Many pharmacoepidemiologic studies using large-scale databases have recently been conducted to evaluate the safety and effectiveness of drugs in Western countries. In Japan, however, conventional methodology has been applied to postmarketing surveillance (PMS) to collect safety and effectiveness information on new drugs to meet regulatory requirements. Conventional PMS entails enormous costs and resources despite being an uncontrolled observational study method. This study is aimed at examining the possibility of database research as a more efficient pharmacovigilance approach by comparing a health care claims database and PMS with regard to the characteristics and safety profiles of sertraline-prescribed patients. The characteristics of sertraline-prescribed patients recorded in a large-scale Japanese health insurance claims database developed by MinaCare Co. Ltd. were scanned and compared with the PMS results. We also explored the possibility of detecting signals indicative of adverse reactions based on the claims database by using sequence symmetry analysis. Diabetes mellitus, hyperlipidemia, and hyperthyroidism served as exploratory events, and their detection criteria for the claims database were reported by the Pharmaceuticals and Medical Devices Agency in Japan. Most of the characteristics of sertraline-prescribed patients in the claims database did not differ markedly from those in the PMS. There was no tendency for higher risks of the exploratory events after exposure to sertraline, and this was consistent with sertraline's known safety profile. Our results support the concept of using database research as a cost-effective pharmacovigilance tool that is free of selection bias. Further investigation using database research is required to confirm our preliminary observations. Copyright © 2018. Published by Elsevier Inc.

  15. In silico polypharmacology of natural products.

    PubMed

    Fang, Jiansong; Liu, Chuang; Wang, Qi; Lin, Ping; Cheng, Feixiong

    2017-04-27

    Natural products with polypharmacological profiles have demonstrated promise as novel therapeutics for various complex diseases, including cancer. Currently, many gaps exist in our knowledge of which compounds interact with which targets, and experimentally testing all possible interactions is infeasible. Recent advances and developments of systems pharmacology and computational (in silico) approaches provide powerful tools for exploring the polypharmacological profiles of natural products. In this review, we introduce recent progresses and advances of computational tools and systems pharmacology approaches for identifying drug targets of natural products by focusing on the development of targeted cancer therapy. We survey the polypharmacological and systems immunology profiles of five representative natural products that are being considered as cancer therapies. We summarize various chemoinformatics, bioinformatics and systems biology resources for reconstructing drug-target networks of natural products. We then review currently available computational approaches and tools for prediction of drug-target interactions by focusing on five domains: target-based, ligand-based, chemogenomics-based, network-based and omics-based systems biology approaches. In addition, we describe a practical example of the application of systems pharmacology approaches by integrating the polypharmacology of natural products and large-scale cancer genomics data for the development of precision oncology under the systems biology framework. Finally, we highlight the promise of cancer immunotherapies and combination therapies that target tumor ecosystems (e.g. clones or 'selfish' sub-clones) via exploiting the immunological and inflammatory 'side' effects of natural products in the cancer post-genomics era. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  16. Consistency Analysis of Genome-Scale Models of Bacterial Metabolism: A Metamodel Approach

    PubMed Central

    Ponce-de-Leon, Miguel; Calle-Espinosa, Jorge; Peretó, Juli; Montero, Francisco

    2015-01-01

    Genome-scale metabolic models usually contain inconsistencies that manifest as blocked reactions and gap metabolites. To detect recurrent inconsistencies in metabolic models, a large-scale analysis was performed using a previously published dataset of 130 genome-scale models. The results showed that a large number of reactions (~22%) are blocked in all the models in which they are present. To unravel the nature of such inconsistencies, a metamodel was constructed by joining the 130 models into a single network. This metamodel was manually curated using the unconnected-modules approach and then used as a reference network to perform gap-filling on each individual genome-scale model. Finally, a set of 36 models that had not been considered during the construction of the metamodel was used, as a proof of concept, to extend the metamodel with new biochemical information and to assess its impact on gap-filling results. The analysis performed on the metamodel allowed us to conclude that: 1) the recurrent inconsistencies found in the models were already present in the metabolic database used during the reconstruction process; 2) inconsistencies in a metabolic database can propagate to the reconstructed models; 3) some reactions are not manifested as blocked but are active only as a consequence of certain classes of artifacts; and 4) the results of automatic gap-filling are highly dependent on the consistency and completeness of the metamodel or metabolic database used as the reference network. In conclusion, consistency analysis should be applied to metabolic databases in order to detect and fill gaps as well as to detect and remove artifacts and redundant information. PMID:26629901

  17. Future of applied watershed science at regional scales

    Treesearch

    Lee Benda; Daniel Miller; Steve Lanigan; Gordon Reeves

    2009-01-01

    Resource managers must deal increasingly with land use and conservation plans applied at large spatial scales (watersheds, landscapes, states, regions) involving multiple interacting federal agencies and stakeholders. Access to a geographically focused and application-oriented database would allow users in different locations and with different concerns to quickly...

  18. Distributed database kriging for adaptive sampling (D²KAS)

    DOE PAGES

    Roehm, Dominic; Pavel, Robert S.; Barros, Kipton; ...

    2015-03-18

    We present an adaptive sampling method supplemented by a distributed database and a prediction method for multiscale simulations using the Heterogeneous Multiscale Method. A finite-volume scheme integrates the macro-scale conservation laws for elastodynamics, which are closed by momentum and energy fluxes evaluated at the micro-scale. In the original approach, molecular dynamics (MD) simulations are launched for every macro-scale volume element. Our adaptive sampling scheme replaces a large fraction of costly micro-scale MD simulations with fast table lookup and prediction. The cloud database Redis provides the plain table lookup, and with locality aware hashing we gather input data for our prediction scheme. For the latter we use kriging, which estimates an unknown value and its uncertainty (error) at a specific location in parameter space by using weighted averages of the neighboring points. We find that our adaptive scheme significantly improves simulation performance by a factor of 2.5 to 25, while retaining high accuracy for various choices of the algorithm parameters.
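
    The prediction step described above, estimating a micro-scale response and its uncertainty from previously computed neighbors, can be sketched as simple kriging over cached samples. In the fragment below an in-memory dict stands in for the Redis table, and the Gaussian covariance model, its parameters, and the acceptance tolerance are assumptions of this illustration rather than the D²KAS configuration.

        # Simple-kriging sketch over cached micro-scale samples (dict stands in for the Redis table).
        import numpy as np

        def gaussian_cov(d, sigma2=1.0, length=0.5):
            """Assumed Gaussian covariance model between points separated by distance d."""
            return sigma2 * np.exp(-(d / length) ** 2)

        def krige(query, points, values, sigma2=1.0, length=0.5):
            """Return (estimate, variance) at `query` from neighboring (points, values)."""
            points = np.asarray(points, dtype=float)
            values = np.asarray(values, dtype=float)
            K = gaussian_cov(np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1),
                             sigma2, length)
            k_star = gaussian_cov(np.linalg.norm(points - query, axis=-1), sigma2, length)
            w = np.linalg.solve(K + 1e-10 * np.eye(len(points)), k_star)   # kriging weights
            estimate = float(w @ values)
            variance = float(sigma2 - w @ k_star)
            return estimate, variance

        # Cache of previously computed micro-scale results, keyed by macro-scale input state.
        cache = {
            (0.10, 0.20): 1.8,
            (0.15, 0.25): 2.1,
            (0.30, 0.10): 1.2,
        }
        query_state = np.array([0.12, 0.22])
        est, var = krige(query_state, list(cache.keys()), list(cache.values()))

        # Adaptive-sampling rule: accept the prediction only if its uncertainty is small enough,
        # otherwise fall back to launching the expensive micro-scale (MD) simulation.
        TOLERANCE = 0.05
        if var < TOLERANCE:
            print(f"use kriged estimate {est:.3f} (variance {var:.4f})")
        else:
            print(f"variance {var:.4f} too large; run the micro-scale simulation instead")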

  19. QSAR Modeling Using Large-Scale Databases: Case Study for HIV-1 Reverse Transcriptase Inhibitors.

    PubMed

    Tarasova, Olga A; Urusova, Aleksandra F; Filimonov, Dmitry A; Nicklaus, Marc C; Zakharov, Alexey V; Poroikov, Vladimir V

    2015-07-27

    Large-scale databases are important sources of training sets for various QSAR modeling approaches. Generally, these databases contain information extracted from different sources. This variety of sources can produce inconsistency in the data, defined as sometimes widely diverging activity results for the same compound against the same target. Because such inconsistency can reduce the accuracy of predictive models built from these data, we are addressing the question of how best to use data from publicly and commercially accessible databases to create accurate and predictive QSAR models. We investigate the suitability of commercially and publicly available databases to QSAR modeling of antiviral activity (HIV-1 reverse transcriptase (RT) inhibition). We present several methods for the creation of modeling (i.e., training and test) sets from two, either commercially or freely available, databases: Thomson Reuters Integrity and ChEMBL. We found that the typical predictivities of QSAR models obtained using these different modeling set compilation methods differ significantly from each other. The best results were obtained using training sets compiled for compounds tested using only one method and material (i.e., a specific type of biological assay). Compound sets aggregated by target only typically yielded poorly predictive models. We discuss the possibility of "mix-and-matching" assay data across aggregating databases such as ChEMBL and Integrity and their current severe limitations for this purpose. One of them is the general lack of complete and semantic/computer-parsable descriptions of assay methodology carried by these databases that would allow one to determine mix-and-matchability of result sets at the assay level.
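
    The training-set compilation strategy described above, keeping only records measured with a single assay method and flagging compounds whose reported activities diverge widely, can be prototyped with pandas, as in the sketch below. The column names, toy records, and the 1-log-unit divergence threshold are assumptions chosen for illustration, not the exact protocol of the study.

        # Sketch of modeling-set compilation from aggregated assay records (column names assumed).
        import pandas as pd

        records = pd.DataFrame([
            # compound, target, assay_type, pIC50
            ("CHEMBL1", "HIV-1 RT", "enzymatic", 7.2),
            ("CHEMBL1", "HIV-1 RT", "enzymatic", 7.4),
            ("CHEMBL1", "HIV-1 RT", "cell-based", 5.1),   # diverges from the enzymatic values
            ("CHEMBL2", "HIV-1 RT", "enzymatic", 6.0),
            ("CHEMBL3", "HIV-1 RT", "cell-based", 8.3),
        ], columns=["compound", "target", "assay_type", "pIC50"])

        # 1. Restrict to a single assay methodology before aggregating per compound.
        enzymatic = records[records["assay_type"] == "enzymatic"]

        # 2. Flag inconsistent compounds: range of reported values above an assumed 1.0 log-unit threshold.
        spread = records.groupby(["compound", "target"])["pIC50"].agg(lambda s: s.max() - s.min())
        inconsistent = spread[spread > 1.0].index.get_level_values("compound")
        print("inconsistent across assays:", sorted(set(inconsistent)))

        # 3. Final modeling set: one averaged value per compound within the chosen assay type.
        modeling_set = enzymatic.groupby(["compound", "target"], as_index=False)["pIC50"].mean()
        print(modeling_set)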

  20. GLAD: a system for developing and deploying large-scale bioinformatics grid.

    PubMed

    Teo, Yong-Meng; Wang, Xianbing; Ng, Yew-Kwong

    2005-03-01

    Grid computing is used to solve large-scale bioinformatics problems involving gigabyte-scale databases by distributing the computation across multiple platforms. Until now, developing bioinformatics grid applications has been extremely tedious: one must design and implement the component algorithms and parallelization techniques for different classes of problems and access remotely located sequence database files of varying formats across the grid. In this study, we propose a grid programming toolkit, GLAD (Grid Life sciences Applications Developer), which facilitates the development and deployment of bioinformatics applications on a grid. GLAD has been developed using ALiCE (Adaptive scaLable Internet-based Computing Engine), a Java-based grid middleware that exploits task-based parallelism. Two bioinformatics benchmark applications, distributed sequence comparison and distributed progressive multiple sequence alignment, have been developed using GLAD.
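
    Task-based parallelism for a distributed sequence comparison of the kind GLAD targets can be mimicked on a single machine with Python's standard library: the sequence database is split into chunks, each chunk-versus-query comparison becomes an independent task, and the results are merged afterwards. The scoring function below is a deliberately naive stand-in (shared k-mer counting) so the sketch stays self-contained; it is not GLAD or ALiCE code.

        # Task-parallel sequence comparison sketch (stdlib only; naive scoring, not GLAD/ALiCE code).
        from concurrent.futures import ProcessPoolExecutor

        def kmer_set(seq, k=3):
            return {seq[i:i + k] for i in range(len(seq) - k + 1)}

        def score_chunk(args):
            """One task: compare the query against every sequence in a database chunk."""
            query, chunk = args
            qk = kmer_set(query)
            return [(name, len(qk & kmer_set(seq))) for name, seq in chunk]

        def parallel_search(query, database, n_tasks=4):
            items = list(database.items())
            step = max(1, len(items) // n_tasks)
            chunks = [items[i:i + step] for i in range(0, len(items), step)]
            with ProcessPoolExecutor() as pool:
                partial = pool.map(score_chunk, [(query, c) for c in chunks])
            merged = [hit for part in partial for hit in part]
            return sorted(merged, key=lambda hit: hit[1], reverse=True)

        if __name__ == "__main__":
            db = {
                "seqA": "ACGTACGTGGTACCGT",
                "seqB": "TTTTGGGGCCCCAAAA",
                "seqC": "ACGTACGAGGTACCGA",
            }
            for name, shared in parallel_search("ACGTACGTGGTACC", db):
                print(name, "shared 3-mers:", shared)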

  1. What do data used to develop ground-motion prediction equations tell us about motions near faults?

    USGS Publications Warehouse

    Boore, David M.

    2014-01-01

    A large database of ground motions from shallow earthquakes occurring in active tectonic regions around the world, recently developed in the Pacific Earthquake Engineering Center’s NGA-West2 project, has been used to investigate what such a database can say about the properties and processes of crustal fault zones. There are a relatively small number of near-rupture records, implying that few recordings in the database are within crustal fault zones, but the records that do exist emphasize the complexity of ground-motion amplitudes and polarization close to individual faults. On average over the whole data set, however, the scaling of ground motions with magnitude at a fixed distance, and the distance dependence of the ground motions, seem to be largely consistent with simple seismological models of source scaling, path propagation effects, and local site amplification. The data show that ground motions close to large faults, as measured by elastic response spectra, tend to saturate and become essentially constant for short periods. This saturation seems to be primarily a geometrical effect, due to the increasing size of the rupture surface with magnitude, and not due to a breakdown in self similarity.

  2. What Do Data Used to Develop Ground-Motion Prediction Equations Tell Us About Motions Near Faults?

    NASA Astrophysics Data System (ADS)

    Boore, David M.

    2014-11-01

    A large database of ground motions from shallow earthquakes occurring in active tectonic regions around the world, recently developed in the Pacific Earthquake Engineering Center's NGA-West2 project, has been used to investigate what such a database can say about the properties and processes of crustal fault zones. There are a relatively small number of near-rupture records, implying that few recordings in the database are within crustal fault zones, but the records that do exist emphasize the complexity of ground-motion amplitudes and polarization close to individual faults. On average over the whole data set, however, the scaling of ground motions with magnitude at a fixed distance, and the distance dependence of the ground motions, seem to be largely consistent with simple seismological models of source scaling, path propagation effects, and local site amplification. The data show that ground motions close to large faults, as measured by elastic response spectra, tend to saturate and become essentially constant for short periods. This saturation seems to be primarily a geometrical effect, due to the increasing size of the rupture surface with magnitude, and not due to a breakdown in self similarity.

  3. HTS-DB: an online resource to publish and query data from functional genomics high-throughput siRNA screening projects.

    PubMed

    Saunders, Rebecca E; Instrell, Rachael; Rispoli, Rossella; Jiang, Ming; Howell, Michael

    2013-01-01

    High-throughput screening (HTS) uses technologies such as RNA interference to generate loss-of-function phenotypes on a genomic scale. As these technologies become more popular, many research institutes have established core facilities of expertise to deal with the challenges of large-scale HTS experiments. As the efforts of core facility screening projects come to fruition, focus has shifted towards managing the results of these experiments and making them available in a useful format that can be further mined for phenotypic discovery. The HTS-DB database provides a public view of data from screening projects undertaken by the HTS core facility at the CRUK London Research Institute. All projects and screens are described with comprehensive assay protocols, and datasets are provided with complete descriptions of analysis techniques. This format allows users to browse and search data from large-scale studies in an informative and intuitive way. It also provides a repository for additional measurements obtained from screens that were not the focus of the project, such as cell viability, and groups these data so that it can provide a gene-centric summary across several different cell lines and conditions. All datasets from our screens that can be made available can be viewed interactively and mined for further hit lists. We believe that in this format, the database provides researchers with rapid access to results of large-scale experiments that might facilitate their understanding of genes/compounds identified in their own research. DATABASE URL: http://hts.cancerresearchuk.org/db/public.

  4. Environmental Education Organizations and Programs in Texas: Identifying Patterns through a Database and Survey Approach for Establishing Frameworks for Assessment and Progress

    ERIC Educational Resources Information Center

    Lloyd-Strovas, Jenny D.; Arsuffi, Thomas L.

    2016-01-01

    We examined the diversity of environmental education (EE) in Texas, USA, by developing a framework to assess EE organizations and programs at a large scale: the Environmental Education Database of Organizations and Programs (EEDOP). This framework consisted of the following characteristics: organization/visitor demographics, pedagogy/curriculum,…

  5. Construction of a robust, large-scale, collaborative database for raw data in computational chemistry: the Collaborative Chemistry Database Tool (CCDBT).

    PubMed

    Chen, Mingyang; Stott, Amanda C; Li, Shenggang; Dixon, David A

    2012-04-01

    A robust metadata database called the Collaborative Chemistry Database Tool (CCDBT) for massive amounts of computational chemistry raw data has been designed and implemented. It performs data synchronization and simultaneously extracts the metadata. Computational chemistry data in various formats from different computing sources, software packages, and users can be parsed into uniform metadata for storage in a MySQL database. Parsing is performed by a parsing pyramid, including parsers written for different levels of data types and sets created by the parser loader after loading parser engines and configurations. Copyright © 2011 Elsevier Inc. All rights reserved.
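
    The parse-then-store pattern described above can be illustrated with a small stand-alone sketch: a regular-expression "parser" pulls a final SCF energy out of a Gaussian-style log fragment and writes a uniform metadata row into SQLite. The log snippet, field names, and use of SQLite instead of MySQL are assumptions for illustration and do not reproduce the CCDBT parser pyramid.

        # Minimal parse-and-store sketch (illustrative; not the CCDBT implementation).
        import re
        import sqlite3

        RAW_LOG = """\
         Entering Gaussian System
         SCF Done:  E(RB3LYP) =  -76.4089533123     A.U. after   9 cycles
         Normal termination of Gaussian
        """

        def parse_metadata(text, source_file):
            """Extract a uniform metadata record from one raw output; returns None if unparsable."""
            m = re.search(r"SCF Done:\s+E\(([^)]+)\)\s*=\s*(-?\d+\.\d+)", text)
            if not m:
                return None
            return {"source_file": source_file, "method": m.group(1), "scf_energy_au": float(m.group(2))}

        conn = sqlite3.connect(":memory:")
        conn.execute("""
            CREATE TABLE calc_metadata (
                source_file TEXT,
                method TEXT,
                scf_energy_au REAL
            )
        """)

        record = parse_metadata(RAW_LOG, "water_b3lyp.log")
        if record is not None:
            conn.execute("INSERT INTO calc_metadata VALUES (:source_file, :method, :scf_energy_au)", record)
            conn.commit()

        for row in conn.execute("SELECT source_file, method, scf_energy_au FROM calc_metadata"):
            print(row)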

  6. [Status of libraries and databases for natural products abroad].

    PubMed

    Zhao, Li-Mei; Tan, Ning-Hua

    2015-01-01

    Because natural products are one of the important sources for drug discovery, libraries and databases of natural products are significant for their development and research. At present, most compound libraries abroad consist of synthetic or combinatorial molecules, which makes access to natural products difficult; and because information on natural products is scattered across resources with different standards, it is difficult to construct convenient, comprehensive, and large-scale databases for natural products. This paper reviewed the status of currently accessible libraries and databases for natural products abroad and provided important information for the development of natural product libraries and databases.

  7. MouseNet database: digital management of a large-scale mutagenesis project.

    PubMed

    Pargent, W; Heffner, S; Schäble, K F; Soewarto, D; Fuchs, H; Hrabé de Angelis, M

    2000-07-01

    The Munich ENU Mouse Mutagenesis Screen is a large-scale mutant production, phenotyping, and mapping project. It encompasses two animal breeding facilities and a number of screening groups located in the general area of Munich. A central database is required to manage and process the immense amount of data generated by the mutagenesis project. This database, which we named MouseNet(c), runs on a Sybase platform and will eventually store and process all data from the entire project. In addition, the system comprises a portfolio of functions needed to support the workflow management of the core facility and the screening groups. MouseNet(c) will make all of the data available to the participating screening groups, and later to the international scientific community. MouseNet(c) will consist of three major software components: an Animal Management System (AMS), a Sample Tracking System (STS), and a Result Documentation System (RDS). MouseNet(c) provides the following major advantages: it is accessible from different client platforms via the Internet; it is a full-featured multi-user system (including access restriction and data locking mechanisms); it relies on a professional RDBMS (relational database management system) running on a UNIX server platform; and it supplies workflow functions and a variety of plausibility checks.

  8. Hierarchical Data Distribution Scheme for Peer-to-Peer Networks

    NASA Astrophysics Data System (ADS)

    Bhushan, Shashi; Dave, M.; Patel, R. B.

    2010-11-01

    In the past few years, peer-to-peer (P2P) networks have become an extremely popular mechanism for large-scale content sharing. P2P systems have focused on specific application domains (e.g., music files, video files) or on providing file-system-like capabilities. P2P is a powerful paradigm that provides a large-scale and cost-effective mechanism for data sharing, and a P2P system may be used for storing data globally. Can a conventional database be implemented on a P2P system? Successful implementations of conventional databases on P2P systems have yet to be reported. In this paper we present a mathematical model for the replication of partitions and a hierarchy-based data distribution scheme for P2P networks. We also analyze the resource utilization and throughput of the P2P system with respect to availability when a conventional database is implemented over the P2P system under a variable query rate. Simulation results show that database partitions placed on peers with a higher availability factor perform better. Degradation index, throughput, and resource utilization are the parameters evaluated with respect to the availability factor.
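
    The replication model sketched in the abstract can be illustrated with a few lines of arithmetic: if each peer holding a copy of a partition is independently available with probability p, a partition replicated on k peers is reachable with probability 1 - (1 - p)^k. The sketch below computes that quantity and the smallest replica count meeting a target availability; the independence assumption and the example numbers are illustrative, not taken from the paper's model.

        # Availability of a replicated partition under independent peer failures (illustrative model).
        import math

        def partition_availability(p, k):
            """P(at least one of k replicas is up) when each peer is up with probability p."""
            return 1.0 - (1.0 - p) ** k

        def replicas_needed(p, target):
            """Smallest k with availability >= target: k = ceil(log(1 - target) / log(1 - p))."""
            return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

        for p in (0.5, 0.7, 0.9):
            avail = partition_availability(p, k=3)
            k_min = replicas_needed(p, target=0.995)
            print(f"peer availability {p:.1f}: 3 replicas -> {avail:.4f}, "
                  f"need {k_min} replicas for 99.5%")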

  9. Icing Simulation Research Supporting the Ice-Accretion Testing of Large-Scale Swept-Wing Models

    NASA Technical Reports Server (NTRS)

    Yadlin, Yoram; Monnig, Jaime T.; Malone, Adam M.; Paul, Bernard P.

    2018-01-01

    The work summarized in this report is a continuation of NASA's Large-Scale, Swept-Wing Test Articles Fabrication; Research and Test Support for NASA IRT contract (NNC10BA05 -NNC14TA36T) performed by Boeing under the NASA Research and Technology for Aerospace Propulsion Systems (RTAPS) contract. In the study conducted under RTAPS, a series of icing tests in the Icing Research Tunnel (IRT) have been conducted to characterize ice formations on large-scale swept wings representative of modern commercial transport airplanes. The outcome of that campaign was a large database of ice-accretion geometries that can be used for subsequent aerodynamic evaluation in other experimental facilities and for validation of ice-accretion prediction codes.

  10. Computerization of Library and Information Services in Mainland China.

    ERIC Educational Resources Information Center

    Lin, Sharon Chien

    1994-01-01

    Describes two phases of the automation of library and information services in mainland China. From 1974-86, much effort was concentrated on developing computer systems, databases, online retrieval, and networking. From 1986 to the present, practical progress became possible largely because of CD-ROM technology; and large scale networking for…

  11. Neural Network Modeling of UH-60A Pilot Vibration

    NASA Technical Reports Server (NTRS)

    Kottapalli, Sesi

    2003-01-01

    Full-scale flight-test pilot floor vibration is modeled using neural networks and full-scale wind tunnel test data for low speed level flight conditions. Neural network connections between the wind tunnel test data and the three flight test pilot vibration components (vertical, lateral, and longitudinal) are studied. Two full-scale UH-60A Black Hawk databases are used. The first database is the NASA/Army UH-60A Airloads Program flight test database. The second database is the UH-60A rotor-only wind tunnel database that was acquired in the NASA Ames 80- by 120-Foot Wind Tunnel with the Large Rotor Test Apparatus (LRTA). Using neural networks, the flight-test pilot vibration is modeled using the wind tunnel rotating system hub accelerations, and separately, using the hub loads. The results show that the wind tunnel rotating system hub accelerations and the operating parameters can represent the flight test pilot vibration. The six components of the wind tunnel N/rev balance-system hub loads and the operating parameters can also represent the flight test pilot vibration. The present neural network connections can significantly increase the value of wind tunnel testing.

  12. Architectural Implications for Spatial Object Association Algorithms*

    PubMed Central

    Kumar, Vijay S.; Kurc, Tahsin; Saltz, Joel; Abdulla, Ghaleb; Kohn, Scott R.; Matarazzo, Celeste

    2013-01-01

    Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server®, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST). PMID:25692244
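
    A basic positional crossmatch of the kind benchmarked above can be written with a k-d tree over unit vectors, converting an angular tolerance into a chord-length threshold. The toy catalogs and the 1-arcsecond tolerance below are made-up examples, and the sketch ignores the distributed/parallel database aspects that the study actually evaluates.

        # Angular crossmatch of two small catalogs with a k-d tree (illustrative; single-node only).
        import numpy as np
        from scipy.spatial import cKDTree

        def radec_to_unit_vectors(ra_deg, dec_deg):
            ra, dec = np.radians(ra_deg), np.radians(dec_deg)
            return np.column_stack((np.cos(dec) * np.cos(ra), np.cos(dec) * np.sin(ra), np.sin(dec)))

        # Two toy catalogs (RA, Dec in degrees).
        cat_a = np.array([[10.00000, 20.00000], [10.50000, 20.50000], [11.00000, 21.00000]])
        cat_b = np.array([[10.00010, 20.00005], [12.00000, 22.00000]])

        tol_arcsec = 1.0
        tol_rad = np.radians(tol_arcsec / 3600.0)
        # An angular separation theta corresponds to a chord length of 2*sin(theta/2) on the unit sphere.
        chord = 2.0 * np.sin(tol_rad / 2.0)

        tree = cKDTree(radec_to_unit_vectors(cat_a[:, 0], cat_a[:, 1]))
        dist, idx = tree.query(radec_to_unit_vectors(cat_b[:, 0], cat_b[:, 1]),
                               k=1, distance_upper_bound=chord)

        for b_index, (d, a_index) in enumerate(zip(dist, idx)):
            if np.isinf(d):
                print(f"catalog-B object {b_index}: no counterpart within {tol_arcsec} arcsec")
            else:
                sep_arcsec = np.degrees(2.0 * np.arcsin(d / 2.0)) * 3600.0
                print(f"catalog-B object {b_index} matches catalog-A object {a_index} "
                      f"({sep_arcsec:.3f} arcsec)")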

  13. Large scale database scrubbing using object oriented software components.

    PubMed

    Herting, R L; Barnes, M R

    1998-01-01

    Now that case managers, quality improvement teams, and researchers use medical databases extensively, the ability to share and disseminate such databases while maintaining patient confidentiality is paramount. A process called scrubbing addresses this problem by removing personally identifying information while keeping the integrity of the medical information intact. Scrubbing entire databases, containing multiple tables, requires that the implicit relationships between data elements in different tables of the database be maintained. To address this issue we developed DBScrub, a Java program that interfaces with any JDBC compliant database and scrubs the database while maintaining the implicit relationships within it. DBScrub uses a small number of highly configurable object-oriented software components to carry out the scrubbing. We describe the structure of these software components and how they maintain the implicit relationships within the database.
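
    One standard way to keep implicit cross-table relationships intact while removing identifiers, conceptually similar to what is described above, is to replace each shared identifier with a keyed, deterministic pseudonym so that the same patient ID maps to the same surrogate in every table. The sketch below does this over a toy SQLite database; the schema, the secret-key handling, and the HMAC-based mapping are assumptions of this illustration, not details of DBScrub itself.

        # Deterministic pseudonymization across tables (illustrative; not the DBScrub implementation).
        import hashlib
        import hmac
        import sqlite3

        SECRET_KEY = b"replace-with-a-real-secret"   # assumed key management, for illustration only

        def pseudonym(identifier):
            """Same input -> same surrogate, so joins across tables still line up."""
            return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:12]

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE patients (patient_id TEXT, name TEXT, birth_date TEXT);
            CREATE TABLE visits   (visit_id INTEGER, patient_id TEXT, diagnosis TEXT);
            INSERT INTO patients VALUES ('P001', 'Alice Example', '1970-03-02');
            INSERT INTO visits   VALUES (1, 'P001', 'hypertension');
        """)

        # Scrub: drop direct identifiers, replace the shared key with its pseudonym in every table.
        for pid, in conn.execute("SELECT DISTINCT patient_id FROM patients").fetchall():
            surrogate = pseudonym(pid)
            conn.execute("UPDATE patients SET patient_id = ?, name = NULL, birth_date = NULL "
                         "WHERE patient_id = ?", (surrogate, pid))
            conn.execute("UPDATE visits SET patient_id = ? WHERE patient_id = ?", (surrogate, pid))
        conn.commit()

        # The implicit relationship survives: the join still returns the visit's diagnosis.
        print(conn.execute("""
            SELECT p.patient_id, v.diagnosis
            FROM patients p JOIN visits v ON p.patient_id = v.patient_id
        """).fetchall())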

  14. Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization

    PubMed Central

    Wei, Chih-Hsuan; Hakala, Kai; Pyysalo, Sampo; Ananiadou, Sophia; Kao, Hung-Yu; Lu, Zhiyong; Salakoski, Tapio; Van de Peer, Yves; Ginter, Filip

    2013-01-01

    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license. PMID:23613707

  15. Classification of time series patterns from complex dynamic systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schryver, J.C.; Rao, N.

    1998-07-01

    An increasing availability of high-performance computing and data storage media at decreasing cost is making possible the proliferation of large-scale numerical databases and data warehouses. Numeric warehousing enterprises on the order of hundreds of gigabytes to terabytes are a reality in many fields such as finance, retail sales, process systems monitoring, biomedical monitoring, surveillance and transportation. Large-scale databases are becoming more accessible to larger user communities through the internet, web-based applications and database connectivity. Consequently, most researchers now have access to a variety of massive datasets. This trend will probably only continue to grow over the next several years. Unfortunately, the availability of integrated tools to explore, analyze and understand the data warehoused in these archives is lagging far behind the ability to gain access to the same data. In particular, locating and identifying patterns of interest in numerical time series data is an increasingly important problem for which there are few available techniques. Temporal pattern recognition poses many interesting problems in classification, segmentation, prediction, diagnosis and anomaly detection. This research focuses on the problem of classification or characterization of numerical time series data. Highway vehicles and their drivers are examples of complex dynamic systems (CDS) which are being used by transportation agencies for field testing to generate large-scale time series datasets. Tools for effective analysis of numerical time series in databases generated by highway vehicle systems are not yet available, or have not been adapted to the target problem domain. However, analysis tools from similar domains may be adapted to the problem of classification of numerical time series data.

  16. Efficient hemodynamic event detection utilizing relational databases and wavelet analysis

    NASA Technical Reports Server (NTRS)

    Saeed, M.; Mark, R. G.

    2001-01-01

    Development of a temporal query framework for time-oriented medical databases has hitherto been a challenging problem. We describe a novel method for the detection of hemodynamic events in multiparameter trends utilizing wavelet coefficients in a MySQL relational database. Storage of the wavelet coefficients allowed for a compact representation of the trends, and provided robust descriptors for the dynamics of the parameter time series. A data model was developed to allow for simplified queries along several dimensions and time scales. Of particular importance, the data model and wavelet framework allowed for queries to be processed with minimal table-join operations. A web-based search engine was developed to allow for user-defined queries. Typical queries required between 0.01 and 0.02 seconds, with at least two orders of magnitude improvement in speed over conventional queries. This powerful and innovative structure will facilitate research on large-scale time-oriented medical databases.
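
    As a rough illustration of the approach described above, the sketch below stores wavelet coefficients of a parameter trend in a relational table and queries them by level and magnitude. It assumes a hypothetical table layout, uses SQLite and PyWavelets for self-containment (the original work used MySQL), and the event threshold is arbitrary.

      import sqlite3
      import numpy as np
      import pywt

      # Hypothetical table: one row per (signal, level, position) wavelet coefficient,
      # so events can be queried at several time scales with plain SQL.
      conn = sqlite3.connect(":memory:")
      conn.execute("""CREATE TABLE wavelet_coeffs (
                        signal_id TEXT, level INTEGER, position INTEGER, value REAL)""")

      def store_trend(signal_id, samples, wavelet="db4"):
          coeffs = pywt.wavedec(samples, wavelet, level=4)   # approximation + detail bands
          for level, band in enumerate(coeffs):
              conn.executemany(
                  "INSERT INTO wavelet_coeffs VALUES (?, ?, ?, ?)",
                  [(signal_id, level, i, float(v)) for i, v in enumerate(band)])

      # Example query: positions where coarse-scale detail coefficients are large,
      # a crude stand-in for "hemodynamic event" detection (threshold is arbitrary).
      store_trend("abp_patient_001", np.cumsum(np.random.randn(1024)))
      rows = conn.execute("""SELECT position, value FROM wavelet_coeffs
                             WHERE signal_id = ? AND level = 1 AND ABS(value) > 5.0""",
                          ("abp_patient_001",)).fetchall()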

  17. A blue carbon soil database: Tidal wetland stocks for the US National Greenhouse Gas Inventory

    NASA Astrophysics Data System (ADS)

    Feagin, R. A.; Eriksson, M.; Hinson, A.; Najjar, R. G.; Kroeger, K. D.; Herrmann, M.; Holmquist, J. R.; Windham-Myers, L.; MacDonald, G. M.; Brown, L. N.; Bianchi, T. S.

    2015-12-01

    Coastal wetlands contain large reservoirs of carbon, and in 2015 the US National Greenhouse Gas Inventory began the work of placing blue carbon within the national regulatory context. The potential value of a wetland carbon stock, in relation to its location, soon could be influential in determining governmental policy and management activities, or in stimulating market-based CO2 sequestration projects. To meet the national need for high-resolution maps, a blue carbon stock database was developed linking National Wetlands Inventory datasets with the USDA Soil Survey Geographic Database. Users of the database can identify the economic potential for carbon conservation or restoration projects within specific estuarine basins, states, wetland types, physical parameters, and land management activities. The database is geared towards both national-level assessments and local-level inquiries. Spatial analysis of the stocks shows high variance within individual estuarine basins, largely dependent on geomorphic position on the landscape, though there are continental scale trends to the carbon distribution as well. Future plans include linking this database with a sedimentary accretion database to predict carbon flux in US tidal wetlands.

  18. Robust Optical Recognition of Cursive Pashto Script Using Scale, Rotation and Location Invariant Approach

    PubMed Central

    Ahmad, Riaz; Naz, Saeeda; Afzal, Muhammad Zeshan; Amin, Sayed Hassan; Breuel, Thomas

    2015-01-01

    The presence of a large number of unique shapes called ligatures in cursive languages, along with variations due to scaling, orientation and location, provides one of the most challenging pattern recognition problems. Recognition of the large number of ligatures is often a complicated task in oriental languages such as Pashto, Urdu, Persian and Arabic. Research on cursive script recognition often ignores the fact that scaling, orientation, location and font variations are common in printed cursive text. Therefore, these variations are not included in image databases and in experimental evaluations. This research uncovers challenges faced by Arabic cursive script recognition in a holistic framework by considering Pashto as a test case, because the Pashto language has a larger alphabet set than Arabic, Persian and Urdu. A database containing 8000 images of 1000 unique ligatures having scaling, orientation and location variations is introduced. In this article, a feature space based on the scale invariant feature transform (SIFT) along with a segmentation framework has been proposed for overcoming the above mentioned challenges. The experimental results show a significantly improved performance of the proposed scheme over traditional feature extraction techniques such as principal component analysis (PCA). PMID:26368566

  19. Interactive Exploration for Continuously Expanding Neuron Databases.

    PubMed

    Li, Zhongyu; Metaxas, Dimitris N; Lu, Aidong; Zhang, Shaoting

    2017-02-15

    This paper proposes a novel framework to help biologists explore and analyze neurons based on retrieval of data from neuron morphological databases. In recent years, the continuously expanding neuron databases provide a rich source of information to associate neuronal morphologies with their functional properties. We design a coarse-to-fine framework for efficient and effective data retrieval from large-scale neuron databases. At the coarse level, for efficiency at large scale, we employ a binary coding method to compress morphological features into binary codes of tens of bits. Short binary codes allow for real-time similarity searching in Hamming space. Because the neuron databases are continuously expanding, it is inefficient to re-train the binary coding model from scratch when adding new neurons. To solve this problem, we extend binary coding with online updating schemes, which consider only the newly added neurons and update the model on-the-fly, without accessing the whole neuron databases. At the fine-grained level, we introduce domain experts/users into the framework, who can give relevance feedback on the binary coding based retrieval results. This interactive strategy can improve the retrieval performance through re-ranking the above coarse results, where we design a new similarity measure and take the feedback into account. Our framework is validated on more than 17,000 neuron cells, showing promising retrieval accuracy and efficiency. Moreover, we demonstrate its use case in assisting biologists to identify and explore unknown neurons. Copyright © 2017 Elsevier Inc. All rights reserved.
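
    The binary coding model in the paper is learned and updated online; the sketch below is only a random-hyperplane stand-in that shows how short binary codes enable Hamming-space search over morphological feature vectors. The feature dimensions, code length and data are hypothetical.

      import numpy as np

      rng = np.random.default_rng(0)

      # Morphological feature vectors for 17,000 hypothetical neurons.
      features = rng.normal(size=(17000, 64))

      # Random-hyperplane hashing: 32-bit binary codes whose Hamming distance
      # roughly preserves cosine similarity (a stand-in for the learned coding model).
      planes = rng.normal(size=(64, 32))
      codes = (features @ planes > 0).astype(np.uint8)          # shape (17000, 32)

      def hamming_search(query_code, k=10):
          """Return indices of the k neurons with the smallest Hamming distance."""
          dists = np.count_nonzero(codes != query_code, axis=1)
          return np.argsort(dists)[:k]

      neighbours = hamming_search(codes[0])   # coarse-level candidates for re-ranking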

  20. Spatial distribution of GRBs and large scale structure of the Universe

    NASA Astrophysics Data System (ADS)

    Bagoly, Zsolt; Rácz, István I.; Balázs, Lajos G.; Tóth, L. Viktor; Horváth, István

    We studied the space distribution of starburst galaxies from the Millennium XXL database at z = 0.82. We also examined the starburst distribution in the classical Millennium I simulation (De Lucia et al. (2006)) using a semi-analytical model for the genesis of the galaxies. We simulated a starburst galaxy sample with a Markov Chain Monte Carlo method. The connection between the homogeneity of the large scale structure and the distribution of starburst groups (Kofman and Shandarin (1998), Suhhonenko et al. (2011), Liivamägi et al. (2012), Park et al. (2012), Horvath et al. (2014), Horvath et al. (2015)) was also checked on a defined scale.

  1. [Adverse Effect Predictions Based on Computational Toxicology Techniques and Large-scale Databases].

    PubMed

    Uesawa, Yoshihiro

    2018-01-01

    Understanding the features of chemical structures related to the adverse effects of drugs is useful for identifying potential adverse effects of new drugs. This can be based on the limited information available from post-marketing surveillance, assessment of the potential toxicities of metabolites and illegal drugs with unclear characteristics, screening of lead compounds at the drug discovery stage, and identification of leads for the discovery of new pharmacological mechanisms. The present paper describes techniques used in computational toxicology to investigate the content of large-scale spontaneous report databases of adverse effects, and illustrates them with examples. Furthermore, volcano plotting, a new visualization method for clarifying the relationships between drugs and adverse effects via comprehensive analyses, will be introduced. These analyses may produce a great amount of data that can be applied to drug repositioning.

  2. Databases for multilevel biophysiology research available at Physiome.jp.

    PubMed

    Asai, Yoshiyuki; Abe, Takeshi; Li, Li; Oka, Hideki; Nomura, Taishin; Kitano, Hiroaki

    2015-01-01

    Physiome.jp (http://physiome.jp) is a portal site inaugurated in 2007 to support model-based research in physiome and systems biology. At Physiome.jp, several tools and databases are available to support construction of physiological, multi-hierarchical, large-scale models. There are three databases in Physiome.jp, housing mathematical models, morphological data, and time-series data. In late 2013, the site was fully renovated, and in May 2015, new functions were implemented to provide information infrastructure to support collaborative activities for developing models and performing simulations within the database framework. This article describes updates to the databases implemented since 2013, including cooperation among the three databases, interactive model browsing, user management, version management of models, management of parameter sets, and interoperability with applications.

  3. Visual Systems for Interactive Exploration and Mining of Large-Scale Neuroimaging Data Archives

    PubMed Central

    Bowman, Ian; Joshi, Shantanu H.; Van Horn, John D.

    2012-01-01

    While technological advancements in neuroimaging scanner engineering have improved the efficiency of data acquisition, electronic data capture methods will likewise significantly expedite the populating of large-scale neuroimaging databases. As they do, and as these archives grow in size, a particular challenge lies in examining and interacting with the information that these resources contain through the development of compelling, user-driven approaches for data exploration and mining. In this article, we introduce the informatics visualization for neuroimaging (INVIZIAN) framework for the graphical rendering of, and dynamic interaction with the contents of large-scale neuroimaging data sets. We describe the rationale behind INVIZIAN, detail its development, and demonstrate its usage in examining a collection of over 900 T1-anatomical magnetic resonance imaging (MRI) image volumes from across a diverse set of clinical neuroimaging studies drawn from a leading neuroimaging database. Using a collection of cortical surface metrics and means for examining brain similarity, INVIZIAN graphically displays brain surfaces as points in a coordinate space and enables classification of clusters of neuroanatomically similar MRI images and data mining. As an initial step toward addressing the need for such user-friendly tools, INVIZIAN provides a unique means to interact with large quantities of electronic brain imaging archives in ways suitable for hypothesis generation and data mining. PMID:22536181

  4. Resources for Functional Genomics Studies in Drosophila melanogaster

    PubMed Central

    Mohr, Stephanie E.; Hu, Yanhui; Kim, Kevin; Housden, Benjamin E.; Perrimon, Norbert

    2014-01-01

    Drosophila melanogaster has become a system of choice for functional genomic studies. Many resources, including online databases and software tools, are now available to support design or identification of relevant fly stocks and reagents or analysis and mining of existing functional genomic, transcriptomic, proteomic, etc. datasets. These include large community collections of fly stocks and plasmid clones, “meta” information sites like FlyBase and FlyMine, and an increasing number of more specialized reagents, databases, and online tools. Here, we introduce key resources useful to plan large-scale functional genomics studies in Drosophila and to analyze, integrate, and mine the results of those studies in ways that facilitate identification of highest-confidence results and generation of new hypotheses. We also discuss ways in which existing resources can be used and might be improved and suggest a few areas of future development that would further support large- and small-scale studies in Drosophila and facilitate use of Drosophila information by the research community more generally. PMID:24653003

  5. Factors Affecting Volunteering among Older Rural and City Dwelling Adults in Australia

    ERIC Educational Resources Information Center

    Warburton, Jeni; Stirling, Christine

    2007-01-01

    In the absence of large scale Australian studies of volunteering among older adults, this study compared the relevance of two theoretical approaches--social capital theory and sociostructural resources theory--to predict voluntary activity in relation to a large national database. The paper explores volunteering by older people (aged 55+) in order…

  6. Meta-Storms: efficient search for similar microbial communities based on a novel indexing scheme and similarity score for metagenomic data.

    PubMed

    Su, Xiaoquan; Xu, Jian; Ning, Kang

    2012-10-01

    Scientists have long sought to compare different microbial communities (also referred to as 'metagenomic samples' here) at large scale: given a set of unknown samples, find similar metagenomic samples in a large repository and examine how similar these samples are. With the metagenomic samples accumulated to date, it is possible to build a database of metagenomic samples of interest. Any metagenomic sample could then be searched against this database to find the most similar metagenomic sample(s). However, on one hand, current databases with a large number of metagenomic samples mostly serve as data repositories that offer few functionalities for analysis; on the other hand, methods to measure the similarity of metagenomic data work well only for small sets of samples by pairwise comparison. It is not yet clear how to efficiently search for metagenomic samples against a large metagenomic database. In this study, we have proposed a novel method, Meta-Storms, that can systematically and efficiently organize and search metagenomic data. It includes the following components: (i) creating a database of metagenomic samples based on their taxonomical annotations, (ii) efficient indexing of samples in the database based on a hierarchical taxonomy indexing strategy, (iii) searching for a metagenomic sample against the database with a fast scoring function based on quantitative phylogeny and (iv) managing the database by index export, index import, data insertion, data deletion and database merging. We have collected more than 1300 metagenomic datasets from the public domain and in-house facilities, and tested the Meta-Storms method on them. Our experimental results show that Meta-Storms is capable of database creation and effective searching for a large number of metagenomic samples, and it achieves accuracies similar to the current popular significance testing-based methods. The Meta-Storms method would serve as a suitable database management and search system to quickly identify similar metagenomic samples from a large pool of samples. Contact: ningkang@qibebt.ac.cn. Supplementary data are available at Bioinformatics online.
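
    Meta-Storms scores samples with a quantitative, phylogeny-aware function; the toy sketch below captures only the outermost step of comparing taxonomic abundance profiles and picking the best-scoring database sample. The sample names, taxa and abundances are invented for illustration.

      # A toy similarity score between two metagenomic samples represented as
      # relative abundances per taxon. The real Meta-Storms score is phylogeny-aware;
      # this is only the "compare two annotation profiles" skeleton.

      def overlap_score(sample_a, sample_b):
          taxa = set(sample_a) | set(sample_b)
          return sum(min(sample_a.get(t, 0.0), sample_b.get(t, 0.0)) for t in taxa)

      database = {
          "gut_001": {"Bacteroides": 0.40, "Prevotella": 0.10, "Faecalibacterium": 0.30},
          "soil_001": {"Acidobacteria": 0.50, "Actinobacteria": 0.35},
      }
      query = {"Bacteroides": 0.35, "Faecalibacterium": 0.25, "Prevotella": 0.15}

      best = max(database, key=lambda name: overlap_score(query, database[name]))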

  7. Image segmentation evaluation for very-large datasets

    NASA Astrophysics Data System (ADS)

    Reeves, Anthony P.; Liu, Shuang; Xie, Yiting

    2016-03-01

    With the advent of modern machine learning methods and fully automated image analysis, there is a need for very large image datasets having documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual markings do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes is achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to facilitate the fully automatic measurement of a number of important quantitative image biomarkers. The results indicate that we could achieve 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.
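
    One common way to implement the "quantitative segmentation evaluation" step described above is an overlap score between a new segmentation and a reference. The paper does not specify its metric, so the Dice-coefficient sketch below (with an arbitrary review threshold) is only an assumed example of how low-agreement cases could be flagged for visual review.

      import numpy as np

      def dice(mask_a, mask_b):
          """Dice overlap between two binary segmentation masks."""
          a, b = mask_a.astype(bool), mask_b.astype(bool)
          inter = np.logical_and(a, b).sum()
          denom = a.sum() + b.sum()
          return 2.0 * inter / denom if denom else 1.0

      # Flag only low-agreement cases for visual review, shrinking the inspection burden.
      def needs_review(mask_new, mask_ref, threshold=0.95):
          return dice(mask_new, mask_ref) < threshold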

  8. Epidemiological considerations for the use of databases in transfusion research: a Scandinavian perspective.

    PubMed

    Edgren, Gustaf; Hjalgrim, Henrik

    2010-11-01

    At current safety levels, with adverse events from transfusions being relatively rare, further progress in risk reductions will require large-scale investigations. Thus, truly prospective studies may prove unfeasible and other alternatives deserve consideration. In this review, we will try to give an overview of recent and historical developments in the use of blood donation and transfusion databases in research. In addition, we will go over important methodological issues. There are at least three nationwide or near-nationwide donation/transfusion databases with the possibility for long-term follow-up of donors and recipients. During the past few years, a large number of reports have been published utilizing such data sources to investigate transfusion-associated risks. In addition, numerous clinics systematically collect and use such data on a smaller scale. Combining systematically recorded donation and transfusion data with long-term health follow-up opens up exciting opportunities for transfusion medicine research. However, the correct analysis of such data requires close attention to methodological issues, especially including the indication for transfusion and reverse causality.

  9. Remote visual analysis of large turbulence databases at multiple scales

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.

  10. Remote visual analysis of large turbulence databases at multiple scales

    DOE PAGES

    Pulido, Jesus; Livescu, Daniel; Kanov, Kalin; ...

    2018-06-15

    The remote analysis and visualization of raw large turbulence datasets is challenging. Current accurate direct numerical simulations (DNS) of turbulent flows generate datasets with billions of points per time-step and several thousand time-steps per simulation. Until recently, the analysis and visualization of such datasets was restricted to scientists with access to large supercomputers. The public Johns Hopkins Turbulence database simplifies access to multi-terabyte turbulence datasets and facilitates the computation of statistics and extraction of features through the use of commodity hardware. In this paper, we present a framework designed around wavelet-based compression for high-speed visualization of large datasets and methods supporting multi-resolution analysis of turbulence. By integrating common technologies, this framework enables remote access to tools available on supercomputers and over 230 terabytes of DNS data over the Web. Finally, the database toolset is expanded by providing access to exploratory data analysis tools, such as wavelet decomposition capabilities and coherent feature extraction.

  11. How does the size and shape of local populations in China compare to general anthropometric surveys currently used for product design?

    PubMed

    Daniell, Nathan; Fraysse, François; Paul, Gunther

    2012-01-01

    Anthropometry has long been used for a range of ergonomic applications and product design. Although products are often designed for specific cohorts, anthropometric data are typically sourced from large scale surveys representative of the general population. Additionally, few data are available for emerging markets like China and India. This study measured 80 Chinese males who were representative of a specific cohort targeted for the design of a new product. Thirteen anthropometric measurements were recorded and compared to two large databases that represented a general population, a Chinese database and a Western database. Substantial differences were identified between the Chinese males measured in this study and both databases. The subjects were substantially taller, heavier and broader than subjects in the older Chinese database. However, they were still substantially smaller, lighter and thinner than Western males. Data from current Western anthropometric surveys are unlikely to accurately represent the target population for product designers and manufacturers in emerging markets like China.

  12. Architectural Implications for Spatial Object Association Algorithms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kumar, V S; Kurc, T; Saltz, J

    2009-01-29

    Spatial object association, also referred to as cross-match of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server®, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST).
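
    Independently of the database architectures evaluated above, the positional crossmatch itself can be illustrated with a small in-memory sketch: for each object in one catalog, find all objects of a second catalog within a matching radius. The catalogs, radius and planar coordinates below are hypothetical stand-ins for real sky positions.

      import numpy as np
      from scipy.spatial import cKDTree

      rng = np.random.default_rng(1)

      # Two hypothetical catalogs of (x, y) positions in a shared coordinate system.
      catalog_a = rng.uniform(0, 100, size=(50000, 2))
      catalog_b = catalog_a[:40000] + rng.normal(scale=0.01, size=(40000, 2))

      # Crossmatch: for every object in catalog A, find catalog-B objects within a radius.
      tree_b = cKDTree(catalog_b)
      matches = tree_b.query_ball_point(catalog_a, r=0.05)   # one index list per A object

      n_matched = sum(1 for m in matches if m)
      print(f"{n_matched} of {len(catalog_a)} objects have at least one counterpart")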

  13. The future of medical diagnostics: large digitized databases.

    PubMed

    Kerr, Wesley T; Lau, Edward P; Owens, Gwen E; Trefler, Aaron

    2012-09-01

    The electronic health record mandate within the American Recovery and Reinvestment Act of 2009 will have a far-reaching effect on medicine. In this article, we provide an in-depth analysis of how this mandate is expected to stimulate the production of large-scale, digitized databases of patient information. There is evidence to suggest that millions of patients and the National Institutes of Health will fully support the mining of such databases to better understand the process of diagnosing patients. This data mining likely will reaffirm and quantify known risk factors for many diagnoses. This quantification may be leveraged to further develop computer-aided diagnostic tools that weigh risk factors and provide decision support for health care providers. We expect that creation of these databases will stimulate the development of computer-aided diagnostic support tools that will become an integral part of modern medicine.

  14. Scale-Up of GRCop: From Laboratory to Rocket Engines

    NASA Technical Reports Server (NTRS)

    Ellis, David L.

    2016-01-01

    GRCop is a high temperature, high thermal conductivity copper-based series of alloys designed primarily for use in regeneratively cooled rocket engine liners. It began with laboratory-level production of a few grams of ribbon produced by chill block melt spinning and has grown to commercial-scale production of large-scale rocket engine liners. Along the way, a variety of methods of consolidating and working the alloy were examined, a database of properties was developed and a variety of commercial and government applications were considered. This talk will briefly address the basic material properties used for selection of compositions to scale up, the methods used to go from simple ribbon to rocket engines, the need to develop a suitable database, and the issues related to getting the alloy into a rocket engine or other application.

  15. FishTraits Database

    USGS Publications Warehouse

    Angermeier, Paul L.; Frimpong, Emmanuel A.

    2009-01-01

    The need for integrated and widely accessible sources of species traits data to facilitate studies of ecology, conservation, and management has motivated development of traits databases for various taxa. In spite of the increasing number of traits-based analyses of freshwater fishes in the United States, no consolidated database of traits of this group exists publicly, and much useful information on these species is documented only in obscure sources. The largely inaccessible and unconsolidated traits information makes large-scale analysis involving many fishes and/or traits particularly challenging. FishTraits is a database of >100 traits for 809 (731 native and 78 exotic) fish species found in freshwaters of the conterminous United States, including 37 native families and 145 native genera. The database contains information on four major categories of traits: (1) trophic ecology, (2) body size and reproductive ecology (life history), (3) habitat associations, and (4) salinity and temperature tolerances. Information on geographic distribution and conservation status is also included. Together, we refer to the traits, distribution, and conservation status information as attributes. Descriptions of attributes are available here. Many sources were consulted to compile attributes, including state and regional species accounts and other databases.

  16. bpRNA: large-scale automated annotation and analysis of RNA secondary structure.

    PubMed

    Danaee, Padideh; Rouches, Mason; Wiley, Michelle; Deng, Dezhong; Huang, Liang; Hendrix, David

    2018-05-09

    While RNA secondary structure prediction from sequence data has made remarkable progress, there is a need for improved strategies for annotating the features of RNA secondary structures. Here, we present bpRNA, a novel annotation tool capable of parsing RNA structures, including complex pseudoknot-containing RNAs, to yield an objective, precise, compact, unambiguous, easily-interpretable description of all loops, stems, and pseudoknots, along with the positions, sequence, and flanking base pairs of each such structural feature. We also introduce several new informative representations of RNA structure types to improve structure visualization and interpretation. We have further used bpRNA to generate a web-accessible meta-database, 'bpRNA-1m', of over 100 000 single-molecule, known secondary structures; this is both more fully and accurately annotated and over 20 times larger than existing databases. We use a subset of the database with highly similar (≥90% identical) sequences filtered out to report on statistical trends in sequence, flanking base pairs, and length. Both the bpRNA method and the bpRNA-1m database will be valuable resources, both for specific analysis of individual RNA molecules and for large-scale analyses such as updating RNA energy parameters for computational thermodynamic predictions, improving machine learning models for structure prediction, and benchmarking structure-prediction algorithms.
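
    bpRNA parses full secondary structures, including pseudoknots, into loops, stems and flanking base pairs; the sketch below handles only the simplest part of that task, recovering base-pair indices from pseudoknot-free dot-bracket notation, and is meant purely as an illustration of structure parsing.

      def dot_bracket_pairs(structure):
          """Return base-pair index tuples from a pseudoknot-free dot-bracket string."""
          stack, pairs = [], []
          for i, ch in enumerate(structure):
              if ch == "(":
                  stack.append(i)
              elif ch == ")":
                  pairs.append((stack.pop(), i))
          return sorted(pairs)

      # Example: a small hairpin; positions are 0-based.
      pairs = dot_bracket_pairs("((((....))))")
      # -> [(0, 11), (1, 10), (2, 9), (3, 8)]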

  17. Chloroplast 2010: A Database for Large-Scale Phenotypic Screening of Arabidopsis Mutants

    PubMed Central

    Lu, Yan; Savage, Linda J.; Larson, Matthew D.; Wilkerson, Curtis G.; Last, Robert L.

    2011-01-01

    Large-scale phenotypic screening presents challenges and opportunities not encountered in typical forward or reverse genetics projects. We describe a modular database and laboratory information management system that was implemented in support of the Chloroplast 2010 Project, an Arabidopsis (Arabidopsis thaliana) reverse genetics phenotypic screen of more than 5,000 mutants (http://bioinfo.bch.msu.edu/2010_LIMS; www.plastid.msu.edu). The software and laboratory work environment were designed to minimize operator error and detect systematic process errors. The database uses Ruby on Rails and Flash technologies to present complex quantitative and qualitative data and pedigree information in a flexible user interface. Examples are presented where the database was used to find opportunities for process changes that improved data quality. We also describe the use of the data-analysis tools to discover mutants defective in enzymes of leucine catabolism (heteromeric mitochondrial 3-methylcrotonyl-coenzyme A carboxylase [At1g03090 and At4g34030] and putative hydroxymethylglutaryl-coenzyme A lyase [At2g26800]) based upon a syndrome of pleiotropic seed amino acid phenotypes that resembles previously described isovaleryl coenzyme A dehydrogenase (At3g45300) mutants. In vitro assay results support the computational annotation of At2g26800 as hydroxymethylglutaryl-coenzyme A lyase. PMID:21224340

  18. Lessons Learned from Managing a Petabyte

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Becla, J

    2005-01-20

    The amount of data collected and stored by the average business doubles each year. Many commercial databases are already approaching hundreds of terabytes, and at this rate, will soon be managing petabytes. More data enables new functionality and capability, but the larger scale reveals new problems and issues hidden in "smaller" terascale environments. This paper presents some of these new problems along with implemented solutions in the framework of a petabyte dataset for a large High Energy Physics experiment. Through experience with two persistence technologies, a commercial database and a file-based approach, we expose format-independent concepts and issues prevalent at this new scale of computing.

  19. SureChEMBL: a large-scale, chemically annotated patent document database.

    PubMed

    Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P

    2016-01-04

    SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  20. A Priori Analysis of Subgrid-Scale Models for Large Eddy Simulations of Supercritical Binary-Species Mixing Layers

    NASA Technical Reports Server (NTRS)

    Okong'o, Nora; Bellan, Josette

    2005-01-01

    Models for large eddy simulation (LES) are assessed on a database obtained from direct numerical simulations (DNS) of supercritical binary-species temporal mixing layers. The analysis is performed at the DNS transitional states for heptane/nitrogen, oxygen/hydrogen and oxygen/helium mixing layers. The incorporation of simplifying assumptions that are validated on the DNS database leads to a set of LES equations that requires only models for the subgrid scale (SGS) fluxes, which arise from filtering the convective terms in the DNS equations. Constant-coefficient versions of three different models for the SGS fluxes are assessed and calibrated. The Smagorinsky SGS-flux model shows poor correlations with the SGS fluxes, while the Gradient and Similarity models have high correlations, as well as good quantitative agreement with the SGS fluxes when the calibrated coefficients are used.
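
    For reference, the constant-coefficient closures named above are commonly written as follows, in standard LES notation (overbars denote the LES filter; C_S, C_G and C_B are modeling constants). These are generic textbook forms, not the calibrated coefficient values or the exact formulations reported in the study:

      \bar{S}_{ij} = \tfrac{1}{2}\left( \frac{\partial \bar{u}_i}{\partial x_j} + \frac{\partial \bar{u}_j}{\partial x_i} \right)
      \tau_{ij}^{\mathrm{Smagorinsky}} = -2\, C_S\, \bar{\Delta}^{2}\, \lvert \bar{S} \rvert\, \bar{S}_{ij}, \qquad \lvert \bar{S} \rvert = \sqrt{2\, \bar{S}_{mn} \bar{S}_{mn}}
      \tau_{ij}^{\mathrm{Gradient}} = C_G\, \bar{\Delta}^{2}\, \frac{\partial \bar{u}_i}{\partial x_k}\, \frac{\partial \bar{u}_j}{\partial x_k}
      \tau_{ij}^{\mathrm{Similarity}} = C_B \left( \widehat{\bar{u}_i \bar{u}_j} - \hat{\bar{u}}_i\, \hat{\bar{u}}_j \right)

    Here the hat denotes an additional test-scale filter used by the scale-similarity model; in an a priori test, each modeled flux is correlated against the exact SGS flux computed by filtering the DNS fields.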

  1. SureChEMBL: a large-scale, chemically annotated patent document database

    PubMed Central

    Papadatos, George; Davies, Mark; Dedman, Nathan; Chambers, Jon; Gaulton, Anna; Siddle, James; Koks, Richard; Irvine, Sean A.; Pettersson, Joe; Goncharoff, Nicko; Hersey, Anne; Overington, John P.

    2016-01-01

    SureChEMBL is a publicly available large-scale resource containing compounds extracted from the full text, images and attachments of patent documents. The data are extracted from the patent literature according to an automated text and image-mining pipeline on a daily basis. SureChEMBL provides access to a previously unavailable, open and timely set of annotated compound-patent associations, complemented with sophisticated combined structure and keyword-based search capabilities against the compound repository and patent document corpus; given the wealth of knowledge hidden in patent documents, analysis of SureChEMBL data has immediate applications in drug discovery, medicinal chemistry and other commercial areas of chemical science. Currently, the database contains 17 million compounds extracted from 14 million patent documents. Access is available through a dedicated web-based interface and data downloads at: https://www.surechembl.org/. PMID:26582922

  2. Data-Mining Techniques in Detecting Factors Linked to Academic Achievement

    ERIC Educational Resources Information Center

    Martínez Abad, Fernando; Chaparro Caso López, Alicia A.

    2017-01-01

    In light of the emergence of statistical analysis techniques based on data mining in education sciences, and the potential they offer to detect non-trivial information in large databases, this paper presents a procedure used to detect factors linked to academic achievement in large-scale assessments. The study is based on a non-experimental,…

  3. CLAST: CUDA implemented large-scale alignment search tool.

    PubMed

    Yano, Masahiro; Mori, Hiroshi; Akiyama, Yutaka; Yamada, Takuji; Kurokawa, Ken

    2014-12-11

    Metagenomics is a powerful methodology to study microbial communities, but it is highly dependent on nucleotide sequence similarity searching against sequence databases. Metagenomic analyses with next-generation sequencing technologies produce enormous numbers of reads from microbial communities, and many reads are derived from microbes whose genomes have not yet been sequenced, limiting the usefulness of existing sequence similarity search tools. Therefore, there is a clear need for a sequence similarity search tool that can rapidly detect weak similarity in large datasets. We developed a tool, which we named CLAST (CUDA implemented large-scale alignment search tool), that enables analyses of millions of reads and thousands of reference genome sequences, and runs on NVIDIA Fermi architecture graphics processing units. CLAST has four main advantages over existing alignment tools. First, CLAST was capable of identifying sequence similarities ~80.8 times faster than BLAST and 9.6 times faster than BLAT. Second, CLAST executes global alignment as the default (local alignment is also an option), enabling CLAST to assign reads to taxonomic and functional groups based on evolutionarily distant nucleotide sequences with high accuracy. Third, CLAST does not need a preprocessed sequence database like Burrows-Wheeler Transform-based tools, and this enables CLAST to incorporate large, frequently updated sequence databases. Fourth, CLAST requires <2 GB of main memory, making it possible to run CLAST on a standard desktop computer or server node. CLAST achieved very high speed (similar to the Burrows-Wheeler Transform-based Bowtie 2 for long reads) and sensitivity (equal to BLAST, BLAT, and FR-HIT) without the need for extensive database preprocessing or a specialized computing platform. Our results demonstrate that CLAST has the potential to be one of the most powerful and realistic approaches to analyze the massive amount of sequence data from next-generation sequencing technologies.
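
    CLAST itself is a CUDA/GPU tool; the short Python sketch below only spells out the global (Needleman-Wunsch) alignment scoring that such a tool parallelizes across millions of reads, with arbitrary match/mismatch/gap weights.

      def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
          """Needleman-Wunsch global alignment score between two sequences (no traceback)."""
          prev = [j * gap for j in range(len(b) + 1)]
          for i, ca in enumerate(a, start=1):
              curr = [i * gap]
              for j, cb in enumerate(b, start=1):
                  diag = prev[j - 1] + (match if ca == cb else mismatch)
                  curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
              prev = curr
          return prev[-1]

      score = global_alignment_score("ACGTACGT", "ACGTTCGT")   # 7 matches, 1 mismatch -> 6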

  4. A rotation-translation invariant molecular descriptor of partial charges and its use in ligand-based virtual screening

    PubMed Central

    2014-01-01

    Background Measures of similarity for chemical molecules have been developed since the dawn of chemoinformatics. Molecular similarity has been measured by a variety of methods including molecular descriptor based similarity, common molecular fragments, graph matching and 3D methods such as shape matching. Similarity measures are widespread in practice and have proven to be useful in drug discovery. Because of our interest in electrostatics and high throughput ligand-based virtual screening, we sought to exploit the information contained in atomic coordinates and partial charges of a molecule. Results A new molecular descriptor based on partial charges is proposed. It uses the autocorrelation function and linear binning to encode all atoms of a molecule into two rotation-translation invariant vectors. Combined with a scoring function, the descriptor allows one to rank-order a database of compounds versus a query molecule. The proposed implementation is called ACPC (AutoCorrelation of Partial Charges) and released in open source. Extensive retrospective ligand-based virtual screening experiments were performed and compared against other methods in order to validate the method and the associated protocol. Conclusions While it is a simple method, it performed remarkably well in experiments. At an average speed of 1649 molecules per second, it reached an average median area under the curve of 0.81 on 40 different targets; hence validating the proposed protocol and implementation. PMID:24887178
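
    A minimal sketch of the descriptor idea described above: a rotation- and translation-invariant vector built by accumulating products of partial charges into distance bins. The real ACPC implementation encodes a molecule into two vectors and uses its own binning and scoring; the coordinates, charges, bin width and bin count below are illustrative assumptions.

      import numpy as np

      def charge_autocorrelation(coords, charges, bin_width=0.5, n_bins=20):
          """Rotation/translation-invariant vector: sum of q_i*q_j per distance bin."""
          descriptor = np.zeros(n_bins)
          n = len(charges)
          for i in range(n):
              for j in range(i + 1, n):
                  d = np.linalg.norm(coords[i] - coords[j])
                  b = int(d / bin_width)
                  if b < n_bins:
                      descriptor[b] += charges[i] * charges[j]
          return descriptor

      # Hypothetical 4-atom fragment: coordinates in angstroms, Gasteiger-like charges.
      coords = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [1.8, 1.0, 0.0], [3.0, 1.1, 0.2]])
      charges = np.array([-0.30, 0.15, 0.05, 0.10])
      vec = charge_autocorrelation(coords, charges)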

  5. Explorations into Chemical Reactions and Biochemical Pathways.

    PubMed

    Gasteiger, Johann

    2016-12-01

    A brief overview of the work in the research group of the present author on extracting knowledge from chemical reaction data is presented. Methods have been developed to calculate physicochemical effects at the reaction site. It is shown that these physicochemical effects can quite favourably be used to derive equations for the calculation of data on gas phase reactions and on reactions in solution such as aqueous acidity of alcohols or carboxylic acids or the hydrolysis of amides. Furthermore, it is shown that these physicochemical effects are quite effective for assigning reactions into reaction classes that correspond to chemical knowledge. Biochemical reactions constitute a particularly interesting and challenging task for increasing our understanding of living species. The BioPath.Database is a rich source of information on biochemical reactions and has been used for a variety of applications of chemical, biological, or medicinal interests. Thus, it was shown that biochemical reactions can be assigned by the physicochemical effects into classes that correspond to the classification of enzymes by the EC numbers. Furthermore, 3D models of reaction intermediates can be used for searching for novel enzyme inhibitors. It was shown in a combined application of chemoinformatics and bioinformatics that essential pathways of diseases can be uncovered. Furthermore, a study showed that bacterial flavor-forming pathways can be discovered. © 2016 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.

  6. Identifying mechanism-of-action targets for drugs and probes

    PubMed Central

    Gregori-Puigjané, Elisabet; Setola, Vincent; Hert, Jérôme; Crews, Brenda A.; Irwin, John J.; Lounkine, Eugen; Marnett, Lawrence; Roth, Bryan L.; Shoichet, Brian K.

    2012-01-01

    Notwithstanding their key roles in therapy and as biological probes, 7% of approved drugs are purported to have no known primary target, and up to 18% lack a well-defined mechanism of action. Using a chemoinformatics approach, we sought to “de-orphanize” drugs that lack primary targets. Surprisingly, targets could be easily predicted for many: Although these targets were not known to us or to the common databases, most could be confirmed by literature search, leaving only 13 Food and Drug Administration-approved drugs with unknown targets; the number of drugs without molecular targets likely is far fewer than reported. The number of worldwide drugs without reasonable molecular targets similarly dropped, from 352 (25%) to 44 (4%). Nevertheless, there remained at least seven drugs for which reasonable mechanism-of-action targets were unknown but could be predicted, including the antitussives clemastine, cloperastine, and nepinalone; the antiemetic benzquinamide; the muscle relaxant cyclobenzaprine; the analgesic nefopam; and the immunomodulator lobenzarit. For each, predicted targets were confirmed experimentally, with affinities within their physiological concentration ranges. Turning this question on its head, we next asked which drugs were specific enough to act as chemical probes. Over 100 drugs met the standard criteria for probes, and 40 did so by more stringent criteria. A chemical information approach to drug-target association can guide therapeutic development and reveal applications to probe biology, a focus of much current interest. PMID:22711801

  7. Development and in silico evaluation of large-scale metabolite identification methods using functional group detection for metabolomics

    PubMed Central

    Mitchell, Joshua M.; Fan, Teresa W.-M.; Lane, Andrew N.; Moseley, Hunter N. B.

    2014-01-01

    Large-scale identification of metabolites is key to elucidating and modeling metabolism at the systems level. Advances in metabolomics technologies, particularly ultra-high resolution mass spectrometry (MS), enable comprehensive and rapid analysis of metabolites. However, a significant barrier to meaningful data interpretation is the identification of a wide range of metabolites including unknowns and the determination of their role(s) in various metabolic networks. Chemoselective (CS) probes to tag metabolite functional groups combined with high mass accuracy provide additional structural constraints for metabolite identification and quantification. We have developed a novel algorithm, Chemically Aware Substructure Search (CASS), that efficiently detects functional groups within existing metabolite databases, allowing for combined molecular formula and functional group (from CS tagging) queries to aid in metabolite identification without a priori knowledge. Analysis of the isomeric compounds in both the Human Metabolome Database (HMDB) and KEGG Ligand demonstrated a high percentage of isomeric molecular formulae (43 and 28%, respectively), indicating the necessity for techniques such as CS-tagging. Furthermore, these two databases have only moderate overlap in molecular formulae. Thus, it is prudent to use multiple databases in metabolite assignment, since each major metabolite database represents different portions of metabolism within the biosphere. In silico analysis of various CS-tagging strategies under different conditions for adduct formation demonstrates that combined FT-MS derived molecular formulae and CS-tagging can uniquely identify up to 71% of KEGG and 37% of the combined KEGG/HMDB database vs. 41 and 17%, respectively, without adduct formation. This difference in database isomer disambiguation highlights the strength of CS-tagging for non-lipid metabolite identification. However, unique identification of complex lipids still needs additional information. PMID:25120557
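
    CASS itself searches substructures across whole metabolite databases; the RDKit sketch below only shows the elementary operation of flagging functional groups in a single structure with SMARTS patterns. The group names and patterns are illustrative choices, not the tool's actual dictionary.

      from rdkit import Chem

      # Illustrative SMARTS patterns for a few chemoselective-probe-relevant groups.
      FUNCTIONAL_GROUPS = {
          "primary_amine": Chem.MolFromSmarts("[NX3;H2][CX4]"),
          "carboxylic_acid": Chem.MolFromSmarts("C(=O)[OX2H1]"),
          "aldehyde": Chem.MolFromSmarts("[CX3H1](=O)[#6]"),
      }

      def detect_groups(smiles):
          mol = Chem.MolFromSmiles(smiles)
          return {name: len(mol.GetSubstructMatches(patt))
                  for name, patt in FUNCTIONAL_GROUPS.items()}

      # Glutamate: one primary-amine-like nitrogen and two carboxylic acid groups.
      print(detect_groups("NC(CCC(=O)O)C(=O)O"))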

  8. Large-scale Health Information Database and Privacy Protection.

    PubMed

    Yamamoto, Ryuichi

    2016-09-01

    Japan was once progressive in the digitalization of healthcare fields but unfortunately has fallen behind in terms of the secondary use of data for public interest. There has recently been a trend to establish large-scale health databases in the nation, and a conflict between data use for public interest and privacy protection has surfaced as this trend has progressed. Databases for health insurance claims or for specific health checkups and guidance services were created according to the law that aims to ensure healthcare for the elderly; however, there is no mention in the act about using these databases for public interest in general. Thus, an initiative for such use must proceed carefully and attentively. The PMDA projects that collect a large amount of medical record information from large hospitals and the health database development project that the Ministry of Health, Labour and Welfare (MHLW) is working on will soon begin to operate according to a general consensus; however, the validity of this consensus can be questioned if issues of anonymity arise. The likelihood that researchers conducting a study for public interest would intentionally invade the privacy of their subjects is slim. However, patients could develop a sense of distrust about their data being used since legal requirements are ambiguous. Nevertheless, without using patients' medical records for public interest, progress in medicine will grind to a halt. Proper legislation that is clear for both researchers and patients will therefore be highly desirable. A revision of the Act on the Protection of Personal Information is currently in progress. In reality, however, privacy is not something that laws alone can protect; it will also require guidelines and self-discipline. We now live in an information capitalization age. I will introduce the trends in legal reform regarding healthcare information and discuss some basics to help people properly face the issue of health big data and privacy protection with a sense of ownership.

  9. Large-scale Health Information Database and Privacy Protection

    PubMed Central

    YAMAMOTO, Ryuichi

    2016-01-01

    Japan was once progressive in the digitalization of healthcare fields but unfortunately has fallen behind in terms of the secondary use of data for public interest. There has recently been a trend to establish large-scale health databases in the nation, and a conflict between data use for public interest and privacy protection has surfaced as this trend has progressed. Databases for health insurance claims or for specific health checkups and guidance services were created according to the law that aims to ensure healthcare for the elderly; however, there is no mention in the act about using these databases for public interest in general. Thus, an initiative for such use must proceed carefully and attentively. The PMDA projects that collect a large amount of medical record information from large hospitals and the health database development project that the Ministry of Health, Labour and Welfare (MHLW) is working on will soon begin to operate according to a general consensus; however, the validity of this consensus can be questioned if issues of anonymity arise. The likelihood that researchers conducting a study for public interest would intentionally invade the privacy of their subjects is slim. However, patients could develop a sense of distrust about their data being used since legal requirements are ambiguous. Nevertheless, without using patients’ medical records for public interest, progress in medicine will grind to a halt. Proper legislation that is clear for both researchers and patients will therefore be highly desirable. A revision of the Act on the Protection of Personal Information is currently in progress. In reality, however, privacy is not something that laws alone can protect; it will also require guidelines and self-discipline. We now live in an information capitalization age. I will introduce the trends in legal reform regarding healthcare information and discuss some basics to help people properly face the issue of health big data and privacy protection with a sense of ownership. PMID:28299244

  10. Reduced graphs and their applications in chemoinformatics.

    PubMed

    Birchall, Kristian; Gillet, Valerie J

    2011-01-01

    Reduced graphs provide summary representations of chemical structures by collapsing groups of connected atoms into single nodes while preserving the topology of the original structures. This chapter reviews the extensive work that has been carried out on reduced graphs at The University of Sheffield and includes discussion of their application to the representation and search of Markush structures in patents, the varied approaches that have been implemented for similarity searching, their use in cluster representation, the different ways in which they have been applied to extract structure-activity relationships and their use in encoding bioisosteres.

  11. Entering new publication territory in chemoinformatics and chemical information science.

    PubMed

    Bajorath, Jürgen

    2015-01-01

    The F1000Research publishing platform offers the opportunity to launch themed article collections as a part of its dynamic publication environment. The idea of article collections is further expanded through the generation of publication channels that focus on specific scientific areas or disciplines. This editorial introduces the Chemical Information Science channel of F1000Research designed to collate high-quality publications and foster a culture of open peer review. Articles will be selected by guest editor(s) and a group of experts, the channel Editorial Board, and subjected to open peer review.

  12. Optical/IR from ground

    NASA Technical Reports Server (NTRS)

    Strom, Stephen; Sargent, Wallace L. W.; Wolff, Sidney; Ahearn, Michael F.; Angel, J. Roger; Beckwith, Steven V. W.; Carney, Bruce W.; Conti, Peter S.; Edwards, Suzan; Grasdalen, Gary

    1991-01-01

    Optical/infrared (O/IR) astronomy in the 1990's is reviewed. The following subject areas are included: research environment; science opportunities; technical development of the 1980's and opportunities for the 1990's; and ground-based O/IR astronomy outside the U.S. Recommendations are presented for: (1) large scale programs (Priority 1: a coordinated program for large O/IR telescopes); (2) medium scale programs (Priority 1: a coordinated program for high angular resolution; Priority 2: a new generation of 4-m class telescopes); (3) small scale programs (Priority 1: near-IR and optical all-sky surveys; Priority 2: a National Astrometric Facility); and (4) infrastructure issues (develop, purchase, and distribute optical CCDs and infrared arrays; a program to support large optics technology; a new generation of large filled aperture telescopes; a program to archive and disseminate astronomical databases; and a program for training new instrumentalists)

  13. Cloud-Based Distributed Control of Unmanned Systems

    DTIC Science & Technology

    2015-04-01

    during mission execution. At best, the data is saved onto hard-drives and is accessible only by the local team. Data history in a form available and...following open source technologies: GeoServer, OpenLayers, PostgreSQL, and PostGIS are chosen to implement the back-end database and server. A brief...geospatial map data. 3. PostgreSQL: An SQL-compliant object-relational database that easily scales to accommodate large amounts of data - upwards to

  14. Effect of missing data on multitask prediction methods.

    PubMed

    de la Vega de León, Antonio; Chen, Beining; Gillet, Valerie J

    2018-05-22

    There has been a growing interest in multitask prediction in chemoinformatics, helped by the increasing use of deep neural networks in this field. This technique is applied to multitarget data sets, where compounds have been tested against different targets, with the aim of developing models to predict a profile of biological activities for a given compound. However, multitarget data sets tend to be sparse; i.e., not all compound-target combinations have experimental values. There has been little research on the effect of missing data on the performance of multitask methods. We have used two complete data sets to simulate sparseness by removing data from the training set. Different models to remove the data were compared. These sparse sets were used to train two different multitask methods, deep neural networks and Macau, which is a Bayesian probabilistic matrix factorization technique. Results from both methods were remarkably similar and showed that the performance decrease caused by missing data is small at first but accelerates once large amounts of data are removed. This work provides a first approximation to assess how much data is required to produce good performance in multitask prediction exercises.
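
    The sparseness simulation described above can be pictured with a few lines of NumPy: start from a complete compound-by-target activity matrix and blank out an increasing fraction of entries. This sketch uses uniformly random removal and synthetic values; it does not reproduce the paper's specific removal models or the DNN/Macau training.

      import numpy as np

      rng = np.random.default_rng(42)

      # Complete (dense) activity matrix: rows = compounds, columns = targets.
      activities = rng.normal(loc=6.0, scale=1.0, size=(1000, 10))   # pIC50-like values

      def sparsify(matrix, fraction_missing):
          """Return a copy with a random fraction of entries set to NaN (missing)."""
          sparse = matrix.copy()
          mask = rng.random(matrix.shape) < fraction_missing
          sparse[mask] = np.nan
          return sparse

      # Training sets with increasing sparseness, mimicking the removal experiment.
      training_sets = {f: sparsify(activities, f) for f in (0.1, 0.3, 0.5, 0.7, 0.9)}
      observed = {f: np.count_nonzero(~np.isnan(m)) for f, m in training_sets.items()}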

  15. CheS-Mapper - Chemical Space Mapping and Visualization in 3D.

    PubMed

    Gütlein, Martin; Karwath, Andreas; Kramer, Stefan

    2012-03-17

    Analyzing chemical datasets is a challenging task for scientific researchers in the field of chemoinformatics. It is important, yet difficult to understand the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. In that respect, visualization tools can help to better comprehend the underlying correlations. Our recently developed 3D molecular viewer CheS-Mapper (Chemical Space Mapper) divides large datasets into clusters of similar compounds and consequently arranges them in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity, by selecting which features to employ in the process. The tool can use and calculate different kinds of features, like structural fragments as well as quantitative chemical descriptors. These features can be highlighted within CheS-Mapper, which helps the chemist to better understand patterns and regularities and relate the observations to established scientific knowledge. As a final function, the tool can also be used to select and export specific subsets of a given dataset for further analysis.
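
    The general "cluster, then embed in 3D so proximity reflects similarity" workflow can be sketched with scikit-learn on a hypothetical descriptor matrix, as below; CheS-Mapper computes its own features and uses its own clustering and embedding, so this is only a schematic analogue.

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.decomposition import PCA

      rng = np.random.default_rng(7)

      # Hypothetical descriptor matrix: 500 compounds x 16 numeric descriptors.
      descriptors = rng.normal(size=(500, 16))

      # Step 1: group similar compounds into clusters.
      labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(descriptors)

      # Step 2: embed the compounds in 3D so spatial proximity reflects similarity.
      coords_3d = PCA(n_components=3).fit_transform(descriptors)

      # coords_3d and labels together give a cluster-coloured 3D map of chemical space.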

  16. CheS-Mapper - Chemical Space Mapping and Visualization in 3D

    PubMed Central

    2012-01-01

    Analyzing chemical datasets is a challenging task for scientific researchers in the field of chemoinformatics. It is important, yet difficult, to understand the relationship between the structure of chemical compounds, their physico-chemical properties, and biological or toxic effects. In that respect, visualization tools can help to better comprehend the underlying correlations. Our recently developed 3D molecular viewer CheS-Mapper (Chemical Space Mapper) divides large datasets into clusters of similar compounds and consequently arranges them in 3D space, such that their spatial proximity reflects their similarity. The user can indirectly determine similarity by selecting which features to employ in the process. The tool can use and calculate different kinds of features, such as structural fragments as well as quantitative chemical descriptors. These features can be highlighted within CheS-Mapper, which helps the chemist to better understand patterns and regularities and to relate the observations to established scientific knowledge. As a final function, the tool can also be used to select and export specific subsets of a given dataset for further analysis. PMID:22424447

  17. ProteinInferencer: Confident protein identification and multiple experiment comparison for large scale proteomics projects.

    PubMed

    Zhang, Yaoyang; Xu, Tao; Shan, Bing; Hart, Jonathan; Aslanian, Aaron; Han, Xuemei; Zong, Nobel; Li, Haomin; Choi, Howard; Wang, Dong; Acharya, Lipi; Du, Lisa; Vogt, Peter K; Ping, Peipei; Yates, John R

    2015-11-03

    Shotgun proteomics generates valuable information from large-scale and target protein characterizations, including protein expression, protein quantification, protein post-translational modifications (PTMs), protein localization, and protein-protein interactions. Typically, peptides derived from proteolytic digestion, rather than intact proteins, are analyzed by mass spectrometers because peptides are more readily separated, ionized and fragmented. The amino acid sequences of peptides can be interpreted by matching the observed tandem mass spectra to theoretical spectra derived from a protein sequence database. Identified peptides serve as surrogates for their proteins and are often used to establish what proteins were present in the original mixture and to quantify protein abundance. Two major issues exist for assigning peptides to their originating proteins. The first is maintaining a desired false discovery rate (FDR) when comparing or combining multiple large datasets generated by shotgun analysis, and the second is properly assigning peptides to proteins when homologous proteins are present in the database. Herein we demonstrate a new computational tool, ProteinInferencer, which can be used for protein inference with both small- and large-scale data sets to produce a well-controlled protein FDR. In addition, ProteinInferencer introduces confidence scoring for individual proteins, which makes protein identifications evaluable. This article is part of a Special Issue entitled: Computational Proteomics. Copyright © 2015. Published by Elsevier B.V.

  18. Tropical Cyclone Information System

    NASA Technical Reports Server (NTRS)

    Li, P. Peggy; Knosp, Brian W.; Vu, Quoc A.; Yi, Chao; Hristova-Veleva, Svetla M.

    2009-01-01

    The JPL Tropical Cyclone Information System (TCIS) is a Web portal (http://tropicalcyclone.jpl.nasa.gov) that provides researchers with an extensive set of observed hurricane parameters together with large-scale and convection resolving model outputs. It provides a comprehensive set of high-resolution satellite (see figure), airborne, and in-situ observations in both image and data formats. Large-scale datasets depict the surrounding environmental parameters such as SST (Sea Surface Temperature) and aerosol loading. Model outputs and analysis tools are provided to evaluate model performance and compare observations from different platforms. The system pertains to the thermodynamic and microphysical structure of the storm, the air-sea interaction processes, and the larger-scale environment as depicted by ocean heat content and the aerosol loading of the environment. Currently, the TCIS is populated with satellite observations of all tropical cyclones observed globally during 2005. There is a plan to extend the database both forward in time to the present and backward to 1998. The portal is powered by a MySQL database and an Apache/Tomcat Web server on a Linux system. The interactive graphical user interface is provided by Google Maps.

  19. Large-scale mapping of hard-rock aquifer properties applied to Burkina Faso.

    PubMed

    Courtois, Nathalie; Lachassagne, Patrick; Wyns, Robert; Blanchin, Raymonde; Bougaïré, Francis D; Somé, Sylvain; Tapsoba, Aïssata

    2010-01-01

    A country-scale (1:1,000,000) methodology has been developed for hydrogeologic mapping of hard-rock aquifers (granitic and metamorphic rocks) of the type that underlie a large part of the African continent. The method is based on quantifying the "useful thickness" and hydrodynamic properties of such aquifers and uses a recent conceptual model developed for this hydrogeologic context. This model links hydrodynamic parameters (transmissivity, storativity) to lithology and the geometry of the various layers constituting a weathering profile. The country-scale hydrogeological mapping was implemented in Burkina Faso, where a recent 1:1,000,000-scale digital geological map and a database of some 16,000 water wells were used to evaluate the methodology.

  20. In-Memory Graph Databases for Web-Scale Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Castellana, Vito G.; Morari, Alessandro; Weaver, Jesse R.

    RDF databases have emerged as one of the most relevant ways of organizing, integrating, and managing exponentially growing, often heterogeneous, and not rigidly structured data for a variety of scientific and commercial fields. In this paper we discuss the solutions integrated in GEMS (Graph database Engine for Multithreaded Systems), a software framework for implementing RDF databases on commodity, distributed-memory high-performance clusters. Unlike the majority of current RDF databases, GEMS has been designed from the ground up to primarily employ graph-based methods. This is reflected in all the layers of its stack. The GEMS framework is composed of: a SPARQL-to-C++ compiler, a library of data structures and related methods to access and modify them, and a custom runtime providing lightweight software multithreading, network message aggregation and a partitioned global address space. We provide an overview of the framework, detailing its components and how they have been closely designed and customized to address issues of graph methods applied to large-scale datasets on clusters. We discuss in detail the principles that enable automatic translation of the queries (expressed in SPARQL, the query language of choice for RDF databases) to graph methods, and identify differences with respect to other RDF databases.
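    For readers unfamiliar with the query side, the sketch below shows the kind of SPARQL basic graph pattern that an RDF engine such as GEMS translates into graph traversal and matching operations. It uses rdflib in Python purely as an illustration; it is not the GEMS SPARQL-to-C++ compiler, and the tiny triple set is hypothetical:

```python
# Illustration only (rdflib in Python, not the GEMS C++ engine): a SPARQL
# basic graph pattern evaluated over a small RDF triple store. Engines like
# GEMS translate such patterns into graph traversal/matching operations.
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, EX.worksAt, EX.lab1))
g.add((EX.bob, EX.worksAt, EX.lab1))
g.add((EX.lab1, EX.locatedIn, Literal("Richland")))

# Find everyone working at a lab located in Richland.
query = """
PREFIX ex: <http://example.org/>
SELECT ?person ?lab WHERE {
    ?person ex:worksAt ?lab .
    ?lab ex:locatedIn "Richland" .
}
"""
for person, lab in g.query(query):
    print(person, lab)
```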

  1. Molecular signatures database (MSigDB) 3.0.

    PubMed

    Liberzon, Arthur; Subramanian, Aravind; Pinchback, Reid; Thorvaldsdóttir, Helga; Tamayo, Pablo; Mesirov, Jill P

    2011-06-15

    Well-annotated gene sets representing the universe of the biological processes are critical for meaningful and insightful interpretation of large-scale genomic data. The Molecular Signatures Database (MSigDB) is one of the most widely used repositories of such sets. We report the availability of a new version of the database, MSigDB 3.0, with over 6700 gene sets, a complete revision of the collection of canonical pathways and experimental signatures from publications, enhanced annotations and upgrades to the web site. MSigDB is freely available for non-commercial use at http://www.broadinstitute.org/msigdb.

  2. Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification.

    PubMed

    Mehryary, Farrokh; Kaewphan, Suwisa; Hakala, Kai; Ginter, Filip

    2016-01-01

    Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature containing more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classification produced by the unsupervised clustering method and manual annotation. The method is evaluated on the official test set of the BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of the state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not solely bound to the EVEX resource and can thus be used to improve the quality of any event extraction system or database. The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.
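    A minimal sketch of the unsupervised part of such a pipeline (not the authors' code): hierarchically cluster vectors for candidate trigger words and cut the tree into groups that could then be inspected or pruned. The word list and embeddings below are placeholders:

```python
# Minimal sketch (not the authors' pipeline): hierarchically cluster word
# vectors for candidate trigger words, then cut the tree into groups that a
# curator could inspect and prune. Embeddings here are random stand-ins.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
words = ["expression", "binding", "the", "of", "phosphorylation", "and"]
vectors = rng.normal(size=(len(words), 50))       # placeholder word embeddings

Z = linkage(vectors, method="average", metric="cosine")
groups = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters

for g in sorted(set(groups)):
    members = [w for w, lab in zip(words, groups) if lab == g]
    print(f"cluster {g}: {members}")
```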

  3. Pharmacogenomic agreement between two cancer cell line data sets.

    PubMed

    2015-12-03

    Large cancer cell line collections broadly capture the genomic diversity of human cancers and provide valuable insight into anti-cancer drug response. Here we show substantial agreement and biological consilience between drug sensitivity measurements and their associated genomic predictors from two publicly available large-scale pharmacogenomics resources: The Cancer Cell Line Encyclopedia and the Genomics of Drug Sensitivity in Cancer databases.

  4. Mining large heterogeneous data sets in drug discovery.

    PubMed

    Wild, David J

    2009-10-01

    Increasingly, effective drug discovery involves the searching and data mining of large volumes of information from many sources covering the domains of chemistry, biology and pharmacology amongst others. This has led to a proliferation of databases and data sources relevant to drug discovery. This paper provides a review of the publicly available large-scale databases relevant to drug discovery, describes the kinds of data mining approaches that can be applied to them and discusses recent work in integrative data mining that looks for associations that span multiple sources, including the use of Semantic Web techniques. The future of mining large data sets for drug discovery requires intelligent, semantic aggregation of information from all of the data sources described in this review, along with the application of advanced methods such as intelligent agents and inference engines in client applications.

  5. Transformation of social networks in the late pre-Hispanic US Southwest.

    PubMed

    Mills, Barbara J; Clark, Jeffery J; Peeples, Matthew A; Haas, W R; Roberts, John M; Hill, J Brett; Huntley, Deborah L; Borck, Lewis; Breiger, Ronald L; Clauset, Aaron; Shackley, M Steven

    2013-04-09

    The late pre-Hispanic period in the US Southwest (A.D. 1200-1450) was characterized by large-scale demographic changes, including long-distance migration and population aggregation. To reconstruct how these processes reshaped social networks, we compiled a comprehensive artifact database from major sites dating to this interval in the western Southwest. We combine social network analysis with geographic information systems approaches to reconstruct network dynamics over 250 y. We show how social networks were transformed across the region at previously undocumented spatial, temporal, and social scales. Using well-dated decorated ceramics, we track changes in network topology at 50-y intervals to show a dramatic shift in network density and settlement centrality from the northern to the southern Southwest after A.D. 1300. Both obsidian sourcing and ceramic data demonstrate that long-distance network relationships also shifted from north to south after migration. Surprisingly, social distance does not always correlate with spatial distance because of the presence of network relationships spanning long geographic distances. Our research shows how a large network in the southern Southwest grew and then collapsed, whereas networks became more fragmented in the northern Southwest but persisted. The study also illustrates how formal social network analysis may be applied to large-scale databases of material culture to illustrate multigenerational changes in network structure.

  6. Transformation of social networks in the late pre-Hispanic US Southwest

    PubMed Central

    Mills, Barbara J.; Clark, Jeffery J.; Peeples, Matthew A.; Haas, W. R.; Roberts, John M.; Hill, J. Brett; Huntley, Deborah L.; Borck, Lewis; Breiger, Ronald L.; Clauset, Aaron; Shackley, M. Steven

    2013-01-01

    The late pre-Hispanic period in the US Southwest (A.D. 1200–1450) was characterized by large-scale demographic changes, including long-distance migration and population aggregation. To reconstruct how these processes reshaped social networks, we compiled a comprehensive artifact database from major sites dating to this interval in the western Southwest. We combine social network analysis with geographic information systems approaches to reconstruct network dynamics over 250 y. We show how social networks were transformed across the region at previously undocumented spatial, temporal, and social scales. Using well-dated decorated ceramics, we track changes in network topology at 50-y intervals to show a dramatic shift in network density and settlement centrality from the northern to the southern Southwest after A.D. 1300. Both obsidian sourcing and ceramic data demonstrate that long-distance network relationships also shifted from north to south after migration. Surprisingly, social distance does not always correlate with spatial distance because of the presence of network relationships spanning long geographic distances. Our research shows how a large network in the southern Southwest grew and then collapsed, whereas networks became more fragmented in the northern Southwest but persisted. The study also illustrates how formal social network analysis may be applied to large-scale databases of material culture to illustrate multigenerational changes in network structure. PMID:23530201

  7. Relational databases: a transparent framework for encouraging biology students to think informatically.

    PubMed

    Rice, Michael; Gladstone, William; Weir, Michael

    2004-01-01

    We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a custom algorithm using Drosophila cDNA transcripts and genomic DNA and supports a set of procedures for analyzing splice-site sequence space. A generic Web interface permits the execution of the procedures with a variety of parameter settings and also supports custom structured query language queries. Moreover, new analytical procedures can be added by updating special metatables in the database without altering the Web interface. The database provides a powerful setting for students to develop informatic thinking skills.
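    A minimal sketch of the kind of relational storage and querying described above, using SQLite and a hypothetical splice-site schema rather than the Wesleyan database itself:

```python
# Minimal sketch (SQLite, hypothetical schema): storing splice-site records
# relationally and querying them with SQL, in the spirit of the database
# described above. Not the Wesleyan database itself.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE splice_site (
        id INTEGER PRIMARY KEY,
        transcript TEXT,
        site_type TEXT,        -- 'donor' or 'acceptor'
        position INTEGER,
        sequence TEXT
    )
""")
con.executemany(
    "INSERT INTO splice_site (transcript, site_type, position, sequence) VALUES (?, ?, ?, ?)",
    [("CG1234-RA", "donor", 1021, "CAGGTAAGT"),
     ("CG1234-RA", "acceptor", 2140, "TTTCAGG"),
     ("CG5678-RB", "donor", 433, "AAGGTGAGT")],
)

# Example analytical query: count donor sites per transcript.
for row in con.execute(
    "SELECT transcript, COUNT(*) FROM splice_site "
    "WHERE site_type = 'donor' GROUP BY transcript"
):
    print(row)
```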

  8. Relational Databases: A Transparent Framework for Encouraging Biology Students To Think Informatically

    PubMed Central

    2004-01-01

    We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a custom algorithm using Drosophila cDNA transcripts and genomic DNA and supports a set of procedures for analyzing splice-site sequence space. A generic Web interface permits the execution of the procedures with a variety of parameter settings and also supports custom structured query language queries. Moreover, new analytical procedures can be added by updating special metatables in the database without altering the Web interface. The database provides a powerful setting for students to develop informatic thinking skills. PMID:15592597

  9. The impact of large-scale, long-term optical surveys on pulsating star research

    NASA Astrophysics Data System (ADS)

    Soszyński, Igor

    2017-09-01

    The era of large-scale photometric variability surveys began a quarter of a century ago, when three microlensing projects - EROS, MACHO, and OGLE - started their operation. These surveys initiated a revolution in the field of variable stars and in the next years they inspired many new observational projects. Large-scale optical surveys multiplied the number of variable stars known in the Universe. The huge, homogeneous and complete catalogs of pulsating stars, such as Cepheids, RR Lyrae stars, or long-period variables, offer an unprecedented opportunity to calibrate and test the accuracy of various distance indicators, to trace the three-dimensional structure of the Milky Way and other galaxies, to discover exotic types of intrinsically variable stars, or to study previously unknown features and behaviors of pulsators. We present historical and recent findings on various types of pulsating stars obtained from the optical large-scale surveys, with particular emphasis on the OGLE project which currently offers the largest photometric database among surveys for stellar variability.

  10. Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework

    PubMed Central

    2012-01-01

    Background For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. Results We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. Conclusion The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources. PMID:23216909
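    The sketch below illustrates the MapReduce pattern such an engine builds on, in plain Python rather than Hadoop: the map phase scores (spectrum, candidate peptide) pairs, the shuffle groups them by spectrum, and the reduce phase keeps the best-scoring peptide per spectrum. The scoring function is a placeholder, not the K-score algorithm:

```python
# Pure-Python illustration of the MapReduce pattern (not Hydra's code):
# map emits (spectrum_id, (peptide, score)) pairs, shuffle groups them by
# spectrum, and reduce keeps the best-scoring peptide per spectrum.
from collections import defaultdict

spectra = {"s1": [114.1, 228.2, 341.3], "s2": [99.0, 200.1]}   # toy peak lists
peptides = ["PEPTIDE", "PROTEIN", "SAMPLER"]                   # toy candidate database

def toy_score(peaks, peptide):
    # Placeholder similarity score, not a real spectrum-matching score.
    return len(set(peptide)) / (1 + abs(len(peaks) - len(peptide)))

def map_phase():
    for sid, peaks in spectra.items():
        for pep in peptides:
            yield sid, (pep, toy_score(peaks, pep))

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {sid: max(candidates, key=lambda pv: pv[1])
            for sid, candidates in grouped.items()}

print(reduce_phase(shuffle(map_phase())))
```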

  11. Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework.

    PubMed

    Lewis, Steven; Csordas, Attila; Killcoyne, Sarah; Hermjakob, Henning; Hoopmann, Michael R; Moritz, Robert L; Deutsch, Eric W; Boyle, John

    2012-12-05

    For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.

  12. FishTraits: a database of ecological and life-history traits of freshwater fishes of the United States

    USGS Publications Warehouse

    Angermeier, Paul L.; Frimpong, Emmanuel A.

    2011-01-01

    The need for integrated and widely accessible sources of species traits data to facilitate studies of ecology, conservation, and management has motivated development of traits databases for various taxa. In spite of the increasing number of traits-based analyses of freshwater fishes in the United States, no consolidated database of traits of this group exists publicly, and much useful information on these species is documented only in obscure sources. The largely inaccessible and unconsolidated traits information makes large-scale analysis involving many fishes and/or traits particularly challenging. We have compiled a database of > 100 traits for 809 (731 native and 78 nonnative) fish species found in freshwaters of the conterminous United States, including 37 native families and 145 native genera. The database, named FishTraits, contains information on four major categories of traits: (1) trophic ecology; (2) body size, reproductive ecology, and life history; (3) habitat preferences; and (4) salinity and temperature tolerances. Information on geographic distribution and conservation status was also compiled. The database enhances many opportunities for conducting research on fish species traits and constitutes the first step toward establishing a central repository for a continually expanding set of traits of North American fishes.

  13. NVST Data Archiving System Based On FastBit NoSQL Database

    NASA Astrophysics Data System (ADS)

    Liu, Ying-bo; Wang, Feng; Ji, Kai-fan; Deng, Hui; Dai, Wei; Liang, Bo

    2014-06-01

    The New Vacuum Solar Telescope (NVST) is a 1-meter vacuum solar telescope that aims to observe the fine structures of active regions on the Sun. The main tasks of the NVST are high resolution imaging and spectral observations, including measurements of the solar magnetic field. The NVST has collected more than 20 million FITS files since it began routine observations in 2012 and produces a maximum of 120 thousand observational records (files) in a day. Given the large number of files, their effective archiving and retrieval becomes a critical and urgent problem. In this study, we implement a new data archiving system for the NVST based on the FastBit Not Only Structured Query Language (NoSQL) database. Compared to a relational database (i.e., MySQL; My Structured Query Language), the FastBit database shows distinct advantages in indexing and querying performance. In a large-scale database of 40 million records, the multi-field combined query response time of the FastBit database is about 15 times faster and fully meets the requirements of the NVST. Our study offers a new approach to massive astronomical data archiving and could contribute to the design of data management systems for other astronomical telescopes.

  14. Scaling up health knowledge at European level requires sharing integrated data: an approach for collection of database specification.

    PubMed

    Menditto, Enrica; Bolufer De Gea, Angela; Cahir, Caitriona; Marengoni, Alessandra; Riegler, Salvatore; Fico, Giuseppe; Costa, Elisio; Monaco, Alessandro; Pecorelli, Sergio; Pani, Luca; Prados-Torres, Alexandra

    2016-01-01

    Computerized health care databases have been widely described as an excellent opportunity for research. The availability of "big data" has brought about a wave of innovation in projects conducting health services research. Most of the available secondary data sources are restricted to the geographical scope of a given country and present heterogeneous structure and content. Under the umbrella of the European Innovation Partnership on Active and Healthy Ageing, collaborative work conducted by the partners of the group on "adherence to prescription and medical plans" identified the use of observational and large-population databases to monitor medication-taking behavior in the elderly. This article describes the methodology used to gather the information from available databases among the Adherence Action Group partners with the aim of improving data sharing on a European level. A total of six databases belonging to three different European countries (Spain, Republic of Ireland, and Italy) were included in the analysis. Preliminary results suggest that there are some similarities. However, these results should be applied in different contexts and European countries, supporting the idea that large European studies should be designed in order to get the most out of already available databases.

  15. Correcting Inconsistencies and Errors in Bacterial Genome Metadata Using an Automated Curation Tool in Excel (AutoCurE).

    PubMed

    Schmedes, Sarah E; King, Jonathan L; Budowle, Bruce

    2015-01-01

    Whole-genome data are invaluable for large-scale comparative genomic studies. Current sequencing technologies have made it feasible to sequence entire bacterial genomes with relative ease and speed, at a substantially reduced cost per nucleotide, hence cost per genome. More than 3,000 bacterial genomes have been sequenced and are available at the finished status. Publicly available genomes can be readily downloaded; however, there are challenges to verify the specific supporting data contained within the download and to identify errors and inconsistencies that may be present within the organizational data content and metadata. AutoCurE, an automated tool for bacterial genome database curation in Excel, was developed to facilitate local database curation of supporting data that accompany downloaded genomes from the National Center for Biotechnology Information. AutoCurE provides an automated approach to curate local genomic databases by flagging inconsistencies or errors, comparing the downloaded supporting data to the genome reports to verify genome names, RefSeq accession numbers, the presence of archaea, BioProject/UIDs, and sequence file descriptions. Flags are generated for nine metadata fields if there are inconsistencies between the downloaded genomes and genome reports and if erroneous or missing data are evident. AutoCurE is an easy-to-use tool for local database curation of large-scale genome data prior to downstream analyses.
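    AutoCurE itself is an Excel tool; the Python sketch below only illustrates the comparison logic described above, flagging mismatches or missing values between downloaded metadata and the corresponding genome-report entry. Field names and example values are hypothetical:

```python
# Minimal sketch (not AutoCurE itself, which is an Excel tool): flag
# inconsistencies between downloaded genome metadata and the corresponding
# genome-report entry by comparing a few fields. Field names are hypothetical.
def flag_inconsistencies(downloaded: dict, report: dict,
                         fields=("name", "refseq_accession", "bioproject")):
    flags = []
    for field in fields:
        got, expected = downloaded.get(field), report.get(field)
        if got is None or expected is None:
            flags.append(f"{field}: missing value")
        elif got != expected:
            flags.append(f"{field}: '{got}' does not match report '{expected}'")
    return flags

downloaded = {"name": "Escherichia coli K-12",
              "refseq_accession": "NC_000913.3",
              "bioproject": None}
report = {"name": "Escherichia coli str. K-12",
          "refseq_accession": "NC_000913.3",
          "bioproject": "PRJNA57779"}
print(flag_inconsistencies(downloaded, report))
```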

  16. ApoptoProteomics, an integrated database for analysis of proteomics data obtained from apoptotic cells.

    PubMed

    Arntzen, Magnus Ø; Thiede, Bernd

    2012-02-01

    Apoptosis is the most commonly described form of programmed cell death, and its dysfunction is implicated in a large number of human diseases. Many quantitative proteome analyses of apoptosis have been performed to gain insight into the proteins involved in the process. This has resulted in large and complex data sets that are difficult to evaluate. Therefore, we developed the ApoptoProteomics database for storage, browsing, and analysis of the outcome of large scale proteome analyses of apoptosis derived from human, mouse, and rat. The proteomics data of 52 publications were integrated and unified with protein annotations from UniProt-KB, the caspase substrate database homepage (CASBAH), and gene ontology. Currently, more than 2300 records of more than 1500 unique proteins are included, covering a large proportion of the core signaling pathways of apoptosis. Analysis of the data set revealed a high level of agreement between the changes in directionality reported in proteomics studies and expected apoptosis-related function, and may disclose proteins without a currently recognized involvement in apoptosis based on gene ontology. Comparison between induction of apoptosis by the intrinsic and the extrinsic apoptotic signaling pathways revealed slight differences. Furthermore, proteomics has significantly contributed to the field of apoptosis by identifying hundreds of caspase substrates. The database is available at http://apoptoproteomics.uio.no.

  17. ApoptoProteomics, an Integrated Database for Analysis of Proteomics Data Obtained from Apoptotic Cells*

    PubMed Central

    Arntzen, Magnus Ø.; Thiede, Bernd

    2012-01-01

    Apoptosis is the most commonly described form of programmed cell death, and its dysfunction is implicated in a large number of human diseases. Many quantitative proteome analyses of apoptosis have been performed to gain insight into the proteins involved in the process. This has resulted in large and complex data sets that are difficult to evaluate. Therefore, we developed the ApoptoProteomics database for storage, browsing, and analysis of the outcome of large scale proteome analyses of apoptosis derived from human, mouse, and rat. The proteomics data of 52 publications were integrated and unified with protein annotations from UniProt-KB, the caspase substrate database homepage (CASBAH), and gene ontology. Currently, more than 2300 records of more than 1500 unique proteins are included, covering a large proportion of the core signaling pathways of apoptosis. Analysis of the data set revealed a high level of agreement between the changes in directionality reported in proteomics studies and expected apoptosis-related function, and may disclose proteins without a currently recognized involvement in apoptosis based on gene ontology. Comparison between induction of apoptosis by the intrinsic and the extrinsic apoptotic signaling pathways revealed slight differences. Furthermore, proteomics has significantly contributed to the field of apoptosis by identifying hundreds of caspase substrates. The database is available at http://apoptoproteomics.uio.no. PMID:22067098

  18. A global, open-source database of flood protection standards

    NASA Astrophysics Data System (ADS)

    Scussolini, Paolo; Aerts, Jeroen; Jongman, Brenden; Bouwer, Laurens; Winsemius, Hessel; de Moel, Hans; Ward, Philip

    2016-04-01

    Accurate flood risk estimation is pivotal in that it enables risk-informed policies in disaster risk reduction, as emphasized in the recent Sendai Framework for Disaster Risk Reduction. To improve our understanding of flood risk, models are now capable of providing actionable risk information on the (sub)global scale. Still, the accuracy of their results is greatly limited by the lack of information on the flood protection standards actually in place, and researchers thus make large assumptions about the extent of protection. With our work we propose a first global, open-source database of FLOod PROtection Standards, FLOPROS, covering a range of spatial scales. FLOPROS is structured in three layers of information and merges them into one consistent database: 1) the Design layer contains empirical information about the standard of protection presently in place; 2) the Policy layer contains intended protection standards from normative documents; 3) the Model layer uses a validated numerical approach to calculate protection standards for areas not covered in the other layers. The FLOPROS database can be used for more accurate risk assessment exercises across scales. As the database should be continually updated to reflect new interventions, we invite researchers and practitioners to contribute information. Further, we look for partners within the risk community to participate in additional strategies to improve the amount and accuracy of information contained in this first version of FLOPROS.
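    A minimal sketch of the layered-merge idea (hypothetical data structures, not the FLOPROS schema; it assumes the empirical Design layer takes precedence over Policy, with the Model layer filling remaining gaps):

```python
# Minimal sketch (hypothetical structure, not the FLOPROS schema): merge the
# three information layers with a simple precedence rule. Precedence order is
# an assumption: Design first, then Policy, then modelled values.
def merged_protection(design, policy, model):
    """Return one protection standard (return period in years) per region."""
    merged = {}
    for region in set(design) | set(policy) | set(model):
        # Assumes protection standards are positive return periods, so 'or'
        # safely falls through missing (None) entries.
        merged[region] = design.get(region) or policy.get(region) or model.get(region)
    return merged

design = {"NL-South": 1250}                 # empirical design standards
policy = {"NL-South": 10000, "Jakarta": 30} # intended standards from documents
model = {"Jakarta": 18, "Accra": 5}         # modelled estimates for uncovered areas
print(merged_protection(design, policy, model))
```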

  19. Using population mixtures to optimize the utility of genomic databases: linkage disequilibrium and association study design in India.

    PubMed

    Pemberton, T J; Jakobsson, M; Conrad, D F; Coop, G; Wall, J D; Pritchard, J K; Patel, P I; Rosenberg, N A

    2008-07-01

    When performing association studies in populations that have not been the focus of large-scale investigations of haplotype variation, it is often helpful to rely on genomic databases in other populations for study design and analysis - such as in the selection of tag SNPs and in the imputation of missing genotypes. One way of improving the use of these databases is to rely on a mixture of database samples that is similar to the population of interest, rather than using the single most similar database sample. We demonstrate the effectiveness of the mixture approach in the application of African, European, and East Asian HapMap samples for tag SNP selection in populations from India, a genetically intermediate region underrepresented in genomic studies of haplotype variation.

  20. Towards a New Assessment of Urban Areas from Local to Global Scales

    NASA Astrophysics Data System (ADS)

    Bhaduri, B. L.; Roy Chowdhury, P. K.; McKee, J.; Weaver, J.; Bright, E.; Weber, E.

    2015-12-01

    Since the early 2000s, starting with NASA MODIS, satellite-based remote sensing has facilitated collection of imagery with medium spatial resolution but high temporal resolution (daily). This trend continues with an increasing number of sensors and data products. Increasing spatial and temporal resolutions of remotely sensed data archives, from both public and commercial sources, have significantly enhanced the quality of mapping and change data products. However, even with automation of such analysis on evolving computing platforms, rates of data processing have been suboptimal, largely because of the ever-increasing pixel-to-processor ratio coupled with limitations of the computing architectures. Novel approaches utilizing spatiotemporal data mining techniques and computational architectures have emerged that demonstrate the potential for sustained and geographically scalable landscape monitoring to become operational. We exemplify this challenge with two broad research initiatives on High Performance Geocomputation at Oak Ridge National Laboratory: (a) mapping global settlement distribution; (b) developing national critical infrastructure databases. Our present effort, on large GPU-based architectures, to exploit high resolution (1 m or less) satellite and airborne imagery for extracting settlements at global scale is yielding understanding of human settlement patterns and urban areas at unprecedented resolution. Comparison of such urban land cover databases with existing national and global land cover products, at various geographic scales in selected parts of the world, is revealing intriguing patterns and insights for urban assessment. Early results, from the USA, Taiwan, and Egypt, indicate closer agreements (5-10%) in urban area assessments among databases at larger, aggregated geographic extents. However, spatial variability at local scales could be significantly different (over 50% disagreement).

  1. Empirical performance of the self-controlled case series design: lessons for developing a risk identification and analysis system.

    PubMed

    Suchard, Marc A; Zorych, Ivan; Simpson, Shawn E; Schuemie, Martijn J; Ryan, Patrick B; Madigan, David

    2013-10-01

    The self-controlled case series (SCCS) offers potential as a statistical method for risk identification involving medical products from large-scale observational healthcare data. However, analytic design choices remain in encoding the longitudinal health records into the SCCS framework, and its risk identification performance across real-world databases is unknown. To evaluate the performance of SCCS and its design choices as a tool for risk identification in observational healthcare data, we examined the risk identification performance of SCCS across five design choices using 399 drug-health outcome pairs in five real observational databases (four administrative claims and one electronic health records). In these databases, the pairs involve 165 positive controls and 234 negative controls. We also consider several synthetic databases with known relative risks between drug-outcome pairs. We evaluate risk identification performance through estimating the area under the receiver-operator characteristic curve (AUC), and bias and coverage probability in the synthetic examples. The SCCS achieves strong predictive performance. Twelve of the twenty health outcome-database scenarios return AUCs >0.75 across all drugs. Including all adverse events instead of just the first per patient and applying a multivariate adjustment for concomitant drug use are the most important design choices. However, the SCCS as applied here returns relative risk point estimates biased towards the null value of 1 with low coverage probability. The SCCS, recently extended to apply a multivariate adjustment for concomitant drug use, offers promise as a statistical tool for risk identification in large-scale observational healthcare databases. Poor estimator calibration dampens enthusiasm, but ongoing work should correct this shortcoming.
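    A minimal sketch of the evaluation step (not the study's pipeline): given a risk score per drug-outcome control pair, the AUC summarizes how well positive controls are ranked above negative controls. The control counts follow the abstract, but the scores are simulated:

```python
# Minimal sketch (not the study's pipeline): compute the AUC over positive and
# negative drug-outcome control pairs once a method has scored each pair.
# The scores below are simulated for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
labels = np.array([1] * 165 + [0] * 234)               # positive vs negative controls
scores = np.where(labels == 1,
                  rng.normal(1.0, 1.0, labels.size),   # simulated estimates for positives
                  rng.normal(0.0, 1.0, labels.size))   # and for negatives

print(f"AUC = {roc_auc_score(labels, scores):.3f}")
```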

  2. GenoMycDB: a database for comparative analysis of mycobacterial genes and genomes.

    PubMed

    Catanho, Marcos; Mascarenhas, Daniel; Degrave, Wim; Miranda, Antonio Basílio de

    2006-03-31

    Several databases and computational tools have been created with the aim of organizing, integrating and analyzing the wealth of information generated by large-scale sequencing projects of mycobacterial genomes and those of other organisms. However, with very few exceptions, these databases and tools do not allow for massive and/or dynamic comparison of these data. GenoMycDB (http://www.dbbm.fiocruz.br/GenoMycDB) is a relational database built for large-scale comparative analyses of completely sequenced mycobacterial genomes, based on their predicted protein content. Its central structure is composed of the results obtained after pair-wise sequence alignments among all the predicted proteins coded by the genomes of six mycobacteria: Mycobacterium tuberculosis (strains H37Rv and CDC1551), M. bovis AF2122/97, M. avium subsp. paratuberculosis K10, M. leprae TN, and M. smegmatis MC2 155. The database stores the computed similarity parameters of every aligned pair, providing for each protein sequence the predicted subcellular localization, the assigned cluster of orthologous groups, the features of the corresponding gene, and links to several important databases. Tables containing pairs or groups of potential homologs between selected species/strains can be produced dynamically by user-defined criteria, based on one or multiple sequence similarity parameters. In addition, searches can be restricted according to the predicted subcellular localization of the protein, the DNA strand of the corresponding gene and/or the description of the protein. Massive data search and/or retrieval are available, and different ways of exporting the result are offered. GenoMycDB provides an on-line resource for the functional classification of mycobacterial proteins as well as for the analysis of genome structure, organization, and evolution.

  3. DBMap: a TreeMap-based framework for data navigation and visualization of brain research registry

    NASA Astrophysics Data System (ADS)

    Zhang, Ming; Zhang, Hong; Tjandra, Donny; Wong, Stephen T. C.

    2003-05-01

    The purpose of this study is to investigate and apply a new, intuitive and space-conscious visualization framework to facilitate efficient data presentation and exploration of large-scale data warehouses. We have implemented the DBMap framework for the UCSF Brain Research Registry. Such a utility would help medical specialists and clinical researchers better explore and evaluate the many attributes organized in the brain research registry. The current UCSF Brain Research Registry consists of a federation of disease-oriented database modules, including Epilepsy, Brain Tumor, Intracerebral Hemorrhage, and CJD (Creutzfeldt-Jakob disease). These database modules organize large volumes of imaging and non-imaging data to support Web-based clinical research. While the data warehouse supports general information retrieval and analysis, it lacks an effective way to visualize and present the voluminous and complex data stored. This study investigates whether the TreeMap algorithm can be adapted to display and navigate a categorical biomedical data warehouse or registry. TreeMap is a space-constrained graphical representation of large hierarchical data sets, mapped to a matrix of rectangles whose size and color represent database fields of interest. It allows the display of a large amount of numerical and categorical information in the limited real estate of a computer screen with an intuitive user interface. The paper describes DBMap, the proposed new data visualization framework for large biomedical databases. Built upon XML, Java and JDBC technologies, the prototype system includes a set of software modules that reside in the application server tier and provide interfaces to the back-end database tier and the front-end Web tier of the brain registry.
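    A minimal sketch of the treemap idea using a simple slice-and-dice layout (not the squarified variant or DBMap's Java implementation): a rectangle is split into sub-rectangles whose areas are proportional to the values being displayed; nested calls would handle deeper hierarchy levels. The module record counts are hypothetical:

```python
# Minimal sketch of the treemap idea (slice-and-dice layout, not DBMap's code):
# split a rectangle into sub-rectangles whose areas are proportional to the
# values being displayed. Record counts per module are hypothetical.
def slice_and_dice(items, x, y, w, h, vertical=True):
    """items: list of (label, value). Returns a list of (label, rect) tuples,
    where rect = (x, y, width, height)."""
    total = sum(v for _, v in items)
    rects, offset = [], 0.0
    for label, value in items:
        frac = value / total
        if vertical:                       # split along the x-axis
            rect = (x + offset, y, w * frac, h)
            offset += w * frac
        else:                              # split along the y-axis
            rect = (x, y + offset, w, h * frac)
            offset += h * frac
        rects.append((label, rect))
    return rects

modules = [("Epilepsy", 420), ("Brain Tumor", 310), ("ICH", 150), ("CJD", 40)]
for label, (rx, ry, rw, rh) in slice_and_dice(modules, 0, 0, 100, 60):
    print(f"{label:12s} rect=({rx:.1f}, {ry:.1f}, {rw:.1f}, {rh:.1f})")
```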

  4. GIS applications for military operations in coastal zones

    USGS Publications Warehouse

    Fleming, S.; Jordan, T.; Madden, M.; Usery, E.L.; Welch, R.

    2009-01-01

    In order to successfully support current and future US military operations in coastal zones, geospatial information must be rapidly integrated and analyzed to meet ongoing force structure evolution and new mission directives. Coastal zones in a military-operational environment are complex regions that include sea, land and air features that demand high-volume databases of extreme detail within relatively narrow geographic corridors. Static products in the form of analog maps at varying scales traditionally have been used by military commanders and their operational planners. The rapidly changing battlefield of 21st Century warfare, however, demands dynamic mapping solutions. Commercial geographic information system (GIS) software for military-specific applications is now being developed and employed with digital databases to provide customized digital maps of variable scale, content and symbolization tailored to the unique demands of military units. Research conducted by the Center for Remote Sensing and Mapping Science at the University of Georgia demonstrated the utility of GIS-based analysis and digital map creation when developing large-scale (1:10,000) products from littoral warfare databases. The methodology employed is discussed: selection of data sources (including high resolution commercial images and Lidar), establishment of analysis/modeling parameters, conduct of vehicle mobility analysis, development of models, and generation of products (such as a continuous sea-land DEM and geo-visualization of changing shorelines with tidal levels). Based on observations and identified needs from the National Geospatial-Intelligence Agency, formerly the National Imagery and Mapping Agency, and the Department of Defense, prototype GIS models for military operations in sea, land and air environments were created from multiple data sets of a study area at US Marine Corps Base Camp Lejeune, North Carolina. Results of these models, along with methodologies for developing large-scale littoral warfare databases, aid the National Geospatial-Intelligence Agency in meeting littoral warfare analysis, modeling and map generation requirements for US military organizations. © 2008 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS).

  5. GIS applications for military operations in coastal zones

    NASA Astrophysics Data System (ADS)

    Fleming, S.; Jordan, T.; Madden, M.; Usery, E. L.; Welch, R.

    In order to successfully support current and future US military operations in coastal zones, geospatial information must be rapidly integrated and analyzed to meet ongoing force structure evolution and new mission directives. Coastal zones in a military-operational environment are complex regions that include sea, land and air features that demand high-volume databases of extreme detail within relatively narrow geographic corridors. Static products in the form of analog maps at varying scales traditionally have been used by military commanders and their operational planners. The rapidly changing battlefield of 21st Century warfare, however, demands dynamic mapping solutions. Commercial geographic information system (GIS) software for military-specific applications is now being developed and employed with digital databases to provide customized digital maps of variable scale, content and symbolization tailored to the unique demands of military units. Research conducted by the Center for Remote Sensing and Mapping Science at the University of Georgia demonstrated the utility of GIS-based analysis and digital map creation when developing large-scale (1:10,000) products from littoral warfare databases. The methodology employed is discussed: selection of data sources (including high resolution commercial images and Lidar), establishment of analysis/modeling parameters, conduct of vehicle mobility analysis, development of models, and generation of products (such as a continuous sea-land DEM and geo-visualization of changing shorelines with tidal levels). Based on observations and identified needs from the National Geospatial-Intelligence Agency, formerly the National Imagery and Mapping Agency, and the Department of Defense, prototype GIS models for military operations in sea, land and air environments were created from multiple data sets of a study area at US Marine Corps Base Camp Lejeune, North Carolina. Results of these models, along with methodologies for developing large-scale littoral warfare databases, aid the National Geospatial-Intelligence Agency in meeting littoral warfare analysis, modeling and map generation requirements for US military organizations.

  6. Overcoming Dietary Assessment Challenges in Low-Income Countries: Technological Solutions Proposed by the International Dietary Data Expansion (INDDEX) Project.

    PubMed

    Coates, Jennifer C; Colaiezzi, Brooke A; Bell, Winnie; Charrondiere, U Ruth; Leclercq, Catherine

    2017-03-16

    An increasing number of low-income countries (LICs) exhibit high rates of malnutrition coincident with rising rates of overweight and obesity. Individual-level dietary data are needed to inform effective responses, yet dietary data from large-scale surveys conducted in LICs remain extremely limited. This discussion paper first seeks to highlight the barriers to collection and use of individual-level dietary data in LICs. Second, it introduces readers to new technological developments and research initiatives to remedy this situation, led by the International Dietary Data Expansion (INDDEX) Project. Constraints to conducting large-scale dietary assessments include significant costs, time burden, technical complexity, and limited investment in dietary research infrastructure, including the necessary tools and databases required to collect individual-level dietary data in large surveys. To address existing bottlenecks, the INDDEX Project is developing a dietary assessment platform for LICs, called INDDEX24, consisting of a mobile application integrated with a web database application, which is expected to facilitate seamless data collection and processing. These tools will be subject to rigorous testing including feasibility, validation, and cost studies. To scale up dietary data collection and use in LICs, the INDDEX Project will also invest in food composition databases, an individual-level dietary data dissemination platform, and capacity development activities. Although the INDDEX Project activities are expected to improve the ability of researchers and policymakers in low-income countries to collect, process, and use dietary data, the global nutrition community is urged to commit further significant investments in order to adequately address the range and scope of challenges described in this paper.

  7. GHEP-ISFG collaborative simulated exercise for DVI/MPI: Lessons learned about large-scale profile database comparisons.

    PubMed

    Vullo, Carlos M; Romero, Magdalena; Catelli, Laura; Šakić, Mustafa; Saragoni, Victor G; Jimenez Pleguezuelos, María Jose; Romanini, Carola; Anjos Porto, Maria João; Puente Prieto, Jorge; Bofarull Castro, Alicia; Hernandez, Alexis; Farfán, María José; Prieto, Victoria; Alvarez, David; Penacino, Gustavo; Zabalza, Santiago; Hernández Bolaños, Alejandro; Miguel Manterola, Irati; Prieto, Lourdes; Parsons, Thomas

    2016-03-01

    The GHEP-ISFG Working Group has recognized the importance of assisting DNA laboratories to gain expertise in handling DVI or missing persons identification (MPI) projects, which involve the need for large-scale genetic profile comparisons. Eleven laboratories participated in a DNA matching exercise to identify victims from a hypothetical conflict with 193 missing persons. The post mortem database was comprised of 87 skeletal remain profiles from a secondary mass grave representing a minimum of 58 individuals with evidence of commingling. The reference database was represented by 286 family reference profiles with diverse pedigrees. The goal of the exercise was to correctly discover re-associations and family matches. The results of direct matching for commingled remains re-associations were correct and fully concordant among all laboratories. However, the kinship analysis for missing persons identifications showed variable results among the participants. There was a group of laboratories with correct, concordant results, but nearly half of the others showed discrepant results, exhibiting likelihood ratio differences of several orders of magnitude in some cases. Three main errors were detected: (a) some laboratories did not use the complete reference family genetic data to report the match with the remains, (b) the identity and/or non-identity hypotheses were sometimes wrongly expressed in the likelihood ratio calculations, and (c) many laboratories did not properly evaluate the prior odds for the event. The results suggest that large-scale profile comparison for DVI or MPI is a challenge for forensic genetics laboratories and that the statistical treatment of DNA matching and the Bayesian framework should be better standardized among laboratories. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
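    A small worked example of the Bayesian step the exercise found was often mishandled, combining a kinship likelihood ratio with prior odds to obtain posterior odds; the LR value is illustrative, and the prior assumes one specific missing person out of the 193 in the scenario:

```python
# Worked example (illustrative numbers only): combining a kinship likelihood
# ratio with prior odds to obtain posterior odds and a posterior probability.
def posterior_odds(likelihood_ratio: float, prior_odds: float) -> float:
    return likelihood_ratio * prior_odds

lr = 1.0e6                      # hypothetical LR from a pedigree comparison
prior = 1.0 / 193               # e.g., one specific missing person out of 193
post = posterior_odds(lr, prior)
prob = post / (1.0 + post)      # convert odds to probability
print(f"posterior odds = {post:.3e}, posterior probability = {prob:.6f}")
```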

  8. Overcoming Dietary Assessment Challenges in Low-Income Countries: Technological Solutions Proposed by the International Dietary Data Expansion (INDDEX) Project

    PubMed Central

    Coates, Jennifer C.; Colaiezzi, Brooke A.; Bell, Winnie; Charrondiere, U. Ruth; Leclercq, Catherine

    2017-01-01

    An increasing number of low-income countries (LICs) exhibit high rates of malnutrition coincident with rising rates of overweight and obesity. Individual-level dietary data are needed to inform effective responses, yet dietary data from large-scale surveys conducted in LICs remain extremely limited. This discussion paper first seeks to highlight the barriers to collection and use of individual-level dietary data in LICs. Second, it introduces readers to new technological developments and research initiatives to remedy this situation, led by the International Dietary Data Expansion (INDDEX) Project. Constraints to conducting large-scale dietary assessments include significant costs, time burden, technical complexity, and limited investment in dietary research infrastructure, including the necessary tools and databases required to collect individual-level dietary data in large surveys. To address existing bottlenecks, the INDDEX Project is developing a dietary assessment platform for LICs, called INDDEX24, consisting of a mobile application integrated with a web database application, which is expected to facilitate seamless data collection and processing. These tools will be subject to rigorous testing including feasibility, validation, and cost studies. To scale up dietary data collection and use in LICs, the INDDEX Project will also invest in food composition databases, an individual-level dietary data dissemination platform, and capacity development activities. Although the INDDEX Project activities are expected to improve the ability of researchers and policymakers in low-income countries to collect, process, and use dietary data, the global nutrition community is urged to commit further significant investments in order to adequately address the range and scope of challenges described in this paper. PMID:28300759

  9. MetReS, an Efficient Database for Genomic Applications.

    PubMed

    Vilaplana, Jordi; Alves, Rui; Solsona, Francesc; Mateo, Jordi; Teixidó, Ivan; Pifarré, Marc

    2018-02-01

    MetReS (Metabolic Reconstruction Server) is a genomic database that is shared between two software applications that address important biological problems. Biblio-MetReS is a data-mining tool that enables the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the processes of interest and their function. The main goal of this work was to identify the areas where the performance of the MetReS database could be improved and to test whether this improvement would scale to larger datasets and more complex types of analysis. The study was started with a relational database, MySQL, which is the current database server used by the applications. We also tested the performance of an alternative data-handling framework, Apache Hadoop, which is currently used for large-scale data processing. We found that this data-handling framework is likely to greatly improve the efficiency of the MetReS applications as the dataset and the processing needs increase by several orders of magnitude, as is expected to happen in the near future.

  10. Database recovery using redundant disk arrays

    NASA Technical Reports Server (NTRS)

    Mourad, Antoine N.; Fuchs, W. K.; Saab, Daniel G.

    1992-01-01

    Redundant disk arrays provide a way for achieving rapid recovery from media failures with a relatively low storage cost for large scale database systems requiring high availability. In this paper a method is proposed for using redundant disk arrays to support rapid recovery from system crashes and transaction aborts in addition to their role in providing media failure recovery. A twin page scheme is used to store the parity information in the array so that the time for transaction commit processing is not degraded. Using an analytical model, it is shown that the proposed method achieves a significant increase in the throughput of database systems using redundant disk arrays by reducing the number of recovery operations needed to maintain the consistency of the database.
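    A minimal sketch of the parity principle that underlies such arrays (the general XOR idea, not the paper's twin-page scheme): the parity block is the XOR of the data blocks, so any single lost block can be reconstructed from the survivors:

```python
# Minimal sketch of the parity idea behind redundant disk arrays (the general
# XOR principle, not the paper's twin-page scheme): the parity block is the
# XOR of the data blocks, so any single lost block can be reconstructed.
from functools import reduce

def xor_blocks(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"AAAA", b"BBBB", b"CCCC"]        # data blocks on three disks
parity = xor_blocks(data)                 # parity block on a fourth disk

# Simulate losing disk 1 and recovering its block from parity + survivors.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
print("recovered block:", recovered)
```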

  11. Recovery issues in databases using redundant disk arrays

    NASA Technical Reports Server (NTRS)

    Mourad, Antoine N.; Fuchs, W. K.; Saab, Daniel G.

    1993-01-01

    Redundant disk arrays provide a way for achieving rapid recovery from media failures with a relatively low storage cost for large scale database systems requiring high availability. In this paper we propose a method for using redundant disk arrays to support rapid recovery from system crashes and transaction aborts in addition to their role in providing media failure recovery. A twin page scheme is used to store the parity information in the array so that the time for transaction commit processing is not degraded. Using an analytical model, we show that the proposed method achieves a significant increase in the throughput of database systems using redundant disk arrays by reducing the number of recovery operations needed to maintain the consistency of the database.

  12. An evaluation of multi-probe locality sensitive hashing for computing similarities over web-scale query logs.

    PubMed

    Cormode, Graham; Dasgupta, Anirban; Goyal, Amit; Lee, Chi Hoon

    2018-01-01

    Many modern applications of AI such as web search, mobile browsing, image processing, and natural language processing rely on finding similar items from a large database of complex objects. Due to the very large scale of data involved (e.g., users' queries from commercial search engines), computing such near or nearest neighbors is a non-trivial task, as the computational cost grows significantly with the number of items. To address this challenge, we adopt Locality Sensitive Hashing (a.k.a. LSH) methods and evaluate four variants in a distributed computing environment (specifically, Hadoop). We identify several optimizations which improve performance, suitable for deployment in very large scale settings. The experimental results demonstrate that our variants of LSH achieve robust performance with better recall compared with "vanilla" LSH, even when using the same amount of space.
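
    A minimal sketch of the underlying banded MinHash LSH idea (plain LSH for Jaccard similarity over sets of query terms; this is not the multi-probe variants or the Hadoop deployment evaluated in the paper, and all names and parameters are illustrative):

        import random
        from collections import defaultdict

        def minhash_signature(item_set, seeds):
            """One MinHash value per seed; collisions approximate Jaccard similarity."""
            return tuple(min(hash((seed, x)) for x in item_set) for seed in seeds)

        def build_lsh_index(named_sets, num_hashes=32, bands=8):
            """Band the signatures so that similar sets collide in at least one bucket."""
            seeds = [random.Random(i).random() for i in range(num_hashes)]
            rows = num_hashes // bands
            buckets = defaultdict(set)
            for name, s in named_sets.items():
                sig = minhash_signature(s, seeds)
                for b in range(bands):
                    buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
            return seeds, rows, bands, buckets

        def candidates(query_set, index):
            """Return names whose signature shares at least one band with the query."""
            seeds, rows, bands, buckets = index
            sig = minhash_signature(query_set, seeds)
            hits = set()
            for b in range(bands):
                hits |= buckets.get((b, sig[b * rows:(b + 1) * rows]), set())
            return hits

        queries = {"q1": {"cheap", "flights", "paris"},
                   "q2": {"cheap", "flights", "rome"},
                   "q3": {"protein", "folding"}}
        index = build_lsh_index(queries)
        print(candidates({"cheap", "flights", "paris", "tonight"}, index))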

  13. Digital version of "Open-File Report 92-179: Geologic map of the Cow Cove Quadrangle, San Bernardino County, California"

    USGS Publications Warehouse

    Wilshire, Howard G.; Bedford, David R.; Coleman, Teresa

    2002-01-01

    3. Plottable map representations of the database at 1:24,000 scale in PostScript and Adobe PDF formats. The plottable files consist of a color geologic map derived from the spatial database, composited with a topographic base map in the form of the USGS Digital Raster Graphic for the map area. Color symbology from each of these datasets is maintained, which can cause plot file sizes to be large.

  14. A Two-Layer Least Squares Support Vector Machine Approach to Credit Risk Assessment

    NASA Astrophysics Data System (ADS)

    Liu, Jingli; Li, Jianping; Xu, Weixuan; Shi, Yong

    Least squares support vector machine (LS-SVM) is a revised version of the support vector machine (SVM) and has been proven to be a useful tool for pattern recognition. LS-SVM has excellent generalization performance and low computational cost. In this paper, we propose a new method, called the two-layer least squares support vector machine, which combines kernel principal component analysis (KPCA) and the linear programming form of the least squares support vector machine. With this method, sparseness and robustness are obtained when solving high-dimensional, large-scale database problems. A U.S. commercial credit card database is used to test the efficiency of our method, and the results proved satisfactory.
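
    The two-layer idea can be approximated with off-the-shelf components: a kernel PCA layer followed by a linear, least-squares-style classifier. The sketch below uses scikit-learn and synthetic data in place of the proprietary credit card database; it illustrates the architecture rather than the authors' exact LP formulation, and all parameter values are assumptions:

        from sklearn.datasets import make_classification
        from sklearn.decomposition import KernelPCA
        from sklearn.linear_model import RidgeClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        # synthetic stand-in for a credit card dataset (the real data is proprietary)
        X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                                   random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

        # layer 1: KPCA projects the data onto nonlinear principal components;
        # layer 2: a linear, least-squares-style classifier in the reduced space
        model = make_pipeline(StandardScaler(),
                              KernelPCA(n_components=15, kernel="rbf", gamma=0.05),
                              RidgeClassifier())
        model.fit(X_tr, y_tr)
        print("held-out accuracy:", round(model.score(X_te, y_te), 3))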

  15. A user-defined data type for the storage of time series data allowing efficient similarity screening.

    PubMed

    Sorokin, Anatoly; Selkov, Gene; Goryanin, Igor

    2012-07-16

    The volume of the experimentally measured time series data is rapidly growing, while storage solutions offering better data types than simple arrays of numbers or opaque blobs for keeping series data are sorely lacking. A number of indexing methods have been proposed to provide efficient access to time series data, but none has so far been integrated into a tried-and-proven database system. To explore the possibility of such integration, we have developed a data type for time series storage in PostgreSQL, an object-relational database system, and equipped it with an access method based on SAX (Symbolic Aggregate approXimation). This new data type has been successfully tested in a database supporting a large-scale plant gene expression experiment, and it was additionally tested on a very large set of simulated time series data. Copyright © 2011 Elsevier B.V. All rights reserved.
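
    The access method is built on SAX, which z-normalizes a series, averages it into segments, and maps each segment mean to a symbol via equiprobable Gaussian breakpoints. A minimal stand-alone sketch of that discretization (not the PostgreSQL data type itself; the segment count and alphabet are arbitrary choices):

        import numpy as np
        from scipy.stats import norm

        def sax(series, segments=8, alphabet="abcd"):
            """Symbolic Aggregate approXimation of a one-dimensional series."""
            x = np.asarray(series, dtype=float)
            x = (x - x.mean()) / (x.std() + 1e-12)            # z-normalize
            x = x[: len(x) // segments * segments]            # trim to a multiple
            paa = x.reshape(segments, -1).mean(axis=1)        # piecewise aggregate means
            # breakpoints that are equiprobable under a standard normal distribution
            cuts = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
            return "".join(alphabet[np.searchsorted(cuts, v)] for v in paa)

        print(sax(np.sin(np.linspace(0, 2 * np.pi, 64))))     # short symbolic word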

  16. Massive parallelization of serial inference algorithms for a complex generalized linear model

    PubMed Central

    Suchard, Marc A.; Simpson, Shawn E.; Zorych, Ivan; Ryan, Patrick; Madigan, David

    2014-01-01

    Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. In this paper we show how high-performance statistical computation, including graphics processing units, relatively inexpensive highly parallel computing devices, can enable complex methods in large databases. We focus on optimization and massive parallelization of cyclic coordinate descent approaches to fit a conditioned generalized linear model involving tens of millions of observations and thousands of predictors in a Bayesian context. We find orders-of-magnitude improvement in overall run-time. Coordinate descent approaches are ubiquitous in high-dimensional statistics and the algorithms we propose open up exciting new methodological possibilities with the potential to significantly improve drug safety. PMID:25328363
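
    The serial kernel being parallelized is cyclic coordinate descent, which updates one regression coefficient at a time while maintaining a running residual. A minimal sketch for an L2-penalized linear model follows (a toy stand-in, not the conditioned Bayesian GLM or the GPU kernels of the paper; the data are simulated):

        import numpy as np

        def coordinate_descent_ridge(X, y, lam=1.0, sweeps=100):
            """Cyclic coordinate descent for min ||y - X b||^2 + lam ||b||^2."""
            p = X.shape[1]
            beta = np.zeros(p)
            residual = y - X @ beta
            col_sq = (X ** 2).sum(axis=0)
            for _ in range(sweeps):
                for j in range(p):                    # one coordinate at a time
                    residual += X[:, j] * beta[j]     # remove coordinate j's contribution
                    beta[j] = X[:, j] @ residual / (col_sq[j] + lam)
                    residual -= X[:, j] * beta[j]     # put the updated contribution back
            return beta

        rng = np.random.default_rng(0)
        X = rng.standard_normal((500, 20))
        true_beta = rng.standard_normal(20)
        y = X @ true_beta + 0.1 * rng.standard_normal(500)
        print(np.allclose(coordinate_descent_ridge(X, y, lam=1e-3), true_beta, atol=0.05))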

  17. PREPping Students for Authentic Science

    ERIC Educational Resources Information Center

    Dolan, Erin L.; Lally, David J.; Brooks, Eric; Tax, Frans E.

    2008-01-01

    In this article, the authors describe a large-scale research collaboration, the Partnership for Research and Education in Plants (PREP), which has capitalized on publicly available databases that contain massive amounts of biological information; stock centers that house and distribute inexpensive organisms with different genotypes; and the…

  18. Covariant Evolutionary Event Analysis for Base Interaction Prediction Using a Relational Database Management System for RNA.

    PubMed

    Xu, Weijia; Ozer, Stuart; Gutell, Robin R

    2009-01-01

    With an increasingly large number of sequences properly aligned, comparative sequence analysis can accurately identify not only common structures formed by standard base pairing but also new types of structural elements and constraints. However, traditional methods are too computationally expensive to perform well on large-scale alignments and are less effective with sequences from diversified phylogenetic classifications. We propose a new approach that utilizes coevolutional rates among pairs of nucleotide positions using phylogenetic and evolutionary relationships of the organisms of the aligned sequences. With a novel data schema to manage relevant information within a relational database, our method, implemented with Microsoft SQL Server 2005, showed 90% sensitivity in identifying base pair interactions among 16S ribosomal RNA sequences from Bacteria, at a scale 40 times larger than a previous study and with 50% better sensitivity. The results also indicated covariation signals for a few sets of cross-strand base stacking pairs in secondary structure helices, and other subtle constraints in the RNA structure.

  19. Covariant Evolutionary Event Analysis for Base Interaction Prediction Using a Relational Database Management System for RNA

    PubMed Central

    Xu, Weijia; Ozer, Stuart; Gutell, Robin R.

    2010-01-01

    With an increasingly large number of sequences properly aligned, comparative sequence analysis can accurately identify not only common structures formed by standard base pairing but also new types of structural elements and constraints. However, traditional methods are too computationally expensive to perform well on large-scale alignments and are less effective with sequences from diversified phylogenetic classifications. We propose a new approach that utilizes coevolutional rates among pairs of nucleotide positions using phylogenetic and evolutionary relationships of the organisms of the aligned sequences. With a novel data schema to manage relevant information within a relational database, our method, implemented with Microsoft SQL Server 2005, showed 90% sensitivity in identifying base pair interactions among 16S ribosomal RNA sequences from Bacteria, at a scale 40 times larger than a previous study and with 50% better sensitivity. The results also indicated covariation signals for a few sets of cross-strand base stacking pairs in secondary structure helices, and other subtle constraints in the RNA structure. PMID:20502534
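
    At its simplest, covariation between alignment columns can be scored with mutual information; the method above additionally weights candidate pairs by phylogenetic and evolutionary events and runs inside a relational database, neither of which is reproduced in this small sketch (the toy alignment is invented):

        from collections import Counter
        from math import log2

        def mutual_information(col_i, col_j):
            """MI between two alignment columns (sequences of bases; gaps skipped)."""
            pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
            n = len(pairs)
            pi = Counter(a for a, _ in pairs)
            pj = Counter(b for _, b in pairs)
            pij = Counter(pairs)
            return sum(c / n * log2((c / n) / (pi[a] / n * pj[b] / n))
                       for (a, b), c in pij.items())

        # toy alignment in which columns 0 and 4 covary like a Watson-Crick pair
        alignment = ["GCAUC", "GCAUC", "ACAUU", "ACAUU", "GCGUC"]
        columns = list(zip(*alignment))
        print(round(mutual_information(columns[0], columns[4]), 3))   # high MI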

  20. Large Survey Database: A Distributed Framework for Storage and Analysis of Large Datasets

    NASA Astrophysics Data System (ADS)

    Juric, Mario

    2011-01-01

    The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures. An LSD database consists of a set of vertically and horizontally partitioned tables, physically stored as compressed HDF5 files. Vertically, we partition the tables into groups of related columns ('column groups'), storing together logically related data (e.g., astrometry, photometry). Horizontally, the tables are partitioned into partially overlapping ``cells'' by position in space (lon, lat) and time (t). This organization allows for fast lookups based on spatial and temporal coordinates, as well as data and task distribution. The design was inspired by the success of Google BigTable (Chang et al., 2006). Our programming model is a pipelined extension of MapReduce (Dean and Ghemawat, 2004). An SQL-like query language is used to access data. For complex tasks, map-reduce ``kernels'' that operate on query results on a per-cell basis can be written, with the framework taking care of scheduling and execution. The combination leverages users' familiarity with SQL, while offering a fully distributed computing environment. LSD adds little overhead compared to direct Python file I/O. In tests, we swept through 1.1 billion rows of PanSTARRS+SDSS data (220 GB) in less than 15 minutes on a dual-CPU machine. In a cluster environment, we achieved bandwidths of 17 Gbit/s (I/O limited). Based on current experience, we believe LSD should scale to be useful for analysis and storage of LSST-scale datasets. It can be downloaded from http://mwscience.net/lsd.

  1. Unified Access Architecture for Large-Scale Scientific Datasets

    NASA Astrophysics Data System (ADS)

    Karna, Risav

    2014-05-01

    Data-intensive sciences have to deploy diverse large scale database technologies for data analytics as scientists now deal with much larger data volumes than ever before. While array databases have bridged many gaps between the needs of data-intensive research fields and DBMS technologies (Zhang 2011), invocation of the other big data tools accompanying these databases is still manual and separate from the database management interface. We identify this as an architectural challenge that will increasingly complicate the user's workflow owing to the growing number of useful but isolated and niche database tools. Such use of data analysis tools in effect leaves the burden on the user to synchronize the results from other data manipulation and analysis tools with the database management system. To this end, we propose a unified access interface for using big data tools within a large-scale scientific array database, using the database queries themselves to embed foreign routines belonging to the big data tools. Such an invocation of foreign data manipulation routines inside a query into a database can be made possible through a user-defined function (UDF). UDFs that allow such levels of freedom as to call modules from another language and interface back and forth between the query body and the side-loaded functions would be needed for this purpose. For the purpose of this research we attempt the coupling of four widely used tools, Hadoop (hadoop1), Matlab (matlab1), R (r1) and ScaLAPACK (scalapack1), with the UDF feature of rasdaman (Baumann 98), an array-based data manager, to investigate this concept. The native array data model used by an array-based data manager provides compact data storage and high performance operations on ordered data such as spatial data, temporal data, and matrix-based data for linear algebra operations (scidbusr1). Performance issues arising from the coupling of tools with different paradigms, niche functionalities, separate processes and output data formats have been anticipated and considered during the design of the unified architecture. The research focuses on the feasibility of the designed coupling mechanism and the evaluation of the efficiency and benefits of our proposed unified access architecture. Zhang 2011: Zhang, Ying and Kersten, Martin and Ivanova, Milena and Nes, Niels, SciQL: Bridging the Gap Between Science and Relational DBMS, Proceedings of the 15th Symposium on International Database Engineering Applications, 2011. Baumann 98: Baumann, P., Dehmel, A., Furtado, P., Ritsch, R., Widmann, N., "The Multidimensional Database System RasDaMan", SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, 1998. hadoop1: hadoop.apache.org, "Hadoop", http://hadoop.apache.org/, [Online; accessed 12-Jan-2014]. scalapack1: netlib.org/scalapack, "ScaLAPACK", http://www.netlib.org/scalapack, [Online; accessed 12-Jan-2014]. r1: r-project.org, "R", http://www.r-project.org/, [Online; accessed 12-Jan-2014]. matlab1: mathworks.com, "Matlab Documentation", http://www.mathworks.de/de/help/matlab/, [Online; accessed 12-Jan-2014]. scidbusr1: scidb.org, "SciDB User's Guide", http://scidb.org/HTMLmanual/13.6/scidb_ug, [Online; accessed 01-Dec-2013].
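
    The core mechanism, calling a foreign routine from inside a database query through a UDF, can be illustrated with SQLite's Python bindings (used here only as a stand-in; rasdaman's UDF interface and the coupled tools are not shown, and the table, values, and threshold are made up):

        import sqlite3
        import statistics

        # a "foreign routine" (here just Python's statistics logic) exposed as a UDF
        def zscore_outlier(value, mean, stdev):
            return int(abs(value - mean) > 2 * stdev)

        con = sqlite3.connect(":memory:")
        con.execute("CREATE TABLE obs (pixel INTEGER, flux REAL)")
        con.executemany("INSERT INTO obs VALUES (?, ?)",
                        list(enumerate([1.0, 1.1, 0.9, 1.2, 9.9, 1.0])))

        flux = [row[0] for row in con.execute("SELECT flux FROM obs")]
        m, s = statistics.mean(flux), statistics.stdev(flux)

        # register the routine so it can be invoked from inside the query itself
        con.create_function("is_outlier", 3, zscore_outlier)
        rows = con.execute("SELECT pixel, flux FROM obs WHERE is_outlier(flux, ?, ?)", (m, s))
        print(list(rows))   # the anomalous pixel is returned directly by the query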

  2. ANN multiscale model of anti-HIV drugs activity vs AIDS prevalence in the US at county level based on information indices of molecular graphs and social networks.

    PubMed

    González-Díaz, Humberto; Herrera-Ibatá, Diana María; Duardo-Sánchez, Aliuska; Munteanu, Cristian R; Orbegozo-Medina, Ricardo Alfredo; Pazos, Alejandro

    2014-03-24

    This work is aimed at describing the workflow for a methodology that combines chemoinformatics and pharmacoepidemiology methods and at reporting the first predictive model developed with this methodology. The new model is able to predict complex networks of AIDS prevalence in the US counties, taking into consideration the social determinants and activity/structure of anti-HIV drugs in preclinical assays. We trained different Artificial Neural Networks (ANNs) using as input information indices of social networks and molecular graphs. We used a Shannon information index based on the Gini coefficient to quantify the effect of income inequality in the social network. We obtained the data on AIDS prevalence and the Gini coefficient from the AIDSVu database of Emory University. We also used the Balaban information indices to quantify changes in the chemical structure of anti-HIV drugs. We obtained the data on anti-HIV drug activity and structure (SMILE codes) from the ChEMBL database. Last, we used Box-Jenkins moving average operators to quantify information about the deviations of drugs with respect to data subsets of reference (targets, organisms, experimental parameters, protocols). The best model found was a Linear Neural Network (LNN) with values of Accuracy, Specificity, and Sensitivity above 0.76 and AUROC > 0.80 in training and external validation series. This model generates a complex network of AIDS prevalence in the US at county level with respect to the preclinical activity of anti-HIV drugs in preclinical assays. To train/validate the model and predict the complex network we needed to analyze 43,249 data points including values of AIDS prevalence in 2,310 counties in the US vs ChEMBL results for 21,582 unique drugs, 9 viral or human protein targets, 4,856 protocols, and 10 possible experimental measures.
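
    One of the social-network inputs, the Gini coefficient of income inequality, has a compact closed form on sorted data. A minimal sketch follows (the income vectors are invented; the Shannon-type index built on top of it in the paper is not reproduced):

        import numpy as np

        def gini(incomes):
            """Gini coefficient via the mean absolute difference formulation."""
            x = np.sort(np.asarray(incomes, dtype=float))
            n = x.size
            # G = sum_i (2i - n - 1) * x_i / (n * sum(x)), with i = 1..n on sorted data
            return ((2 * np.arange(1, n + 1) - n - 1) * x).sum() / (n * x.sum())

        print(round(gini([10, 10, 10, 10]), 3))   # 0.0  -> perfect equality
        print(round(gini([0, 0, 0, 100]), 3))     # 0.75 -> strong inequality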

  3. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Roehm, Dominic; Pavel, Robert S.; Barros, Kipton

    We present an adaptive sampling method supplemented by a distributed database and a prediction method for multiscale simulations using the Heterogeneous Multiscale Method. A finite-volume scheme integrates the macro-scale conservation laws for elastodynamics, which are closed by momentum and energy fluxes evaluated at the micro-scale. In the original approach, molecular dynamics (MD) simulations are launched for every macro-scale volume element. Our adaptive sampling scheme replaces a large fraction of costly micro-scale MD simulations with fast table lookup and prediction. The cloud database Redis provides the plain table lookup, and with locality-aware hashing we gather input data for our prediction scheme. For the latter we use kriging, which estimates an unknown value and its uncertainty (error) at a specific location in parameter space by using weighted averages of the neighboring points. We find that our adaptive scheme significantly improves simulation performance by a factor of 2.5 to 25, while retaining high accuracy for various choices of the algorithm parameters.
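
    The prediction step, kriging, estimates a value and its uncertainty from weighted averages of neighbouring samples. A minimal simple-kriging sketch with a squared-exponential covariance follows (the Redis lookup and locality-aware hashing are omitted, and the sample points and length scale are invented):

        import numpy as np

        def rbf(a, b, length=1.0):
            """Squared-exponential covariance between two point sets."""
            d = a[:, None, :] - b[None, :, :]
            return np.exp(-0.5 * (d ** 2).sum(-1) / length ** 2)

        def simple_krige(X, y, x_star, length=1.0, nugget=1e-8):
            """Return the kriged estimate and its variance at the query points."""
            K = rbf(X, X, length) + nugget * np.eye(len(X))
            k = rbf(X, x_star, length)
            w = np.linalg.solve(K, k)                  # kriging weights
            mean = w.T @ y
            var = 1.0 - np.einsum("ij,ij->j", k, w)    # prior variance is 1 for this RBF
            return mean, var

        # neighbouring samples in parameter space (e.g. strain components) -> flux value
        X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
        y = np.array([0.0, 1.0, 1.0, 2.0])
        m, v = simple_krige(X, y, np.array([[0.5, 0.5]]))
        print(m, v)   # interpolated estimate with a small predictive variance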

  4. Large scale validation of the M5L lung CAD on heterogeneous CT datasets.

    PubMed

    Torres, E Lopez; Fiorina, E; Pennazio, F; Peroni, C; Saletta, M; Camarlinghi, N; Fantacci, M E; Cerello, P

    2015-04-01

    M5L, a fully automated computer-aided detection (CAD) system for the detection and segmentation of lung nodules in thoracic computed tomography (CT), is presented and validated on several image datasets. M5L is the combination of two independent subsystems, based on the Channeler Ant Model as a segmentation tool [lung channeler ant model (lungCAM)] and on the voxel-based neural approach. The lungCAM was upgraded with a scan equalization module and a new procedure to recover the nodules connected to other lung structures; its classification module, which makes use of a feed-forward neural network, is based on a small number of features (13), so as to minimize the risk of poor generalization, which is possible given the large difference between the sizes of the training and testing datasets, which contain 94 and 1019 CTs, respectively. The lungCAM (standalone) and M5L (combined) performance was extensively tested on 1043 CT scans from three independent datasets, including a detailed analysis of the full Lung Image Database Consortium/Image Database Resource Initiative database, which has not yet been reported in the literature. The lungCAM and M5L performance is consistent across the databases, with a sensitivity of about 70% and 80%, respectively, at eight false positive findings per scan, despite the variable annotation criteria and acquisition and reconstruction conditions. A reduced sensitivity is found for subtle nodules and ground-glass opacity (GGO) structures. A comparison with other CAD systems is also presented. The M5L performance on a large and heterogeneous dataset is stable and satisfactory, although the development of a dedicated module for GGO detection, as well as an iterative optimization of the training procedure, could further improve it. The main aim of the present study was accomplished: M5L results do not deteriorate as the dataset size increases, making it a candidate for supporting radiologists in large-scale screening and clinical programs.

  5. A Priori Subgrid Analysis of Temporal Mixing Layers with Evaporating Droplets

    NASA Technical Reports Server (NTRS)

    Okongo, Nora; Bellan, Josette

    1999-01-01

    Subgrid analysis of a transitional temporal mixing layer with evaporating droplets has been performed using three sets of results from a Direct Numerical Simulation (DNS) database, with Reynolds numbers (based on initial vorticity thickness) as large as 600 and with droplet mass loadings as large as 0.5. In the DNS, the gas phase is computed using a Eulerian formulation, with Lagrangian droplet tracking. The Large Eddy Simulation (LES) equations corresponding to the DNS are first derived, and key assumptions made in deriving them are then confirmed by computing the terms using the DNS database. Since LES of this flow requires the computation of unfiltered gas-phase variables at droplet locations from filtered gas-phase variables at the grid points, it is proposed to model these by assuming the gas-phase variables to be the sum of the filtered variables and a correction based on the filtered standard deviation; this correction is then computed from the Subgrid Scale (SGS) standard deviation. This model predicts the unfiltered variables at droplet locations considerably better than simply interpolating the filtered variables. Three methods are investigated for modeling the SGS standard deviation: the Smagorinsky approach, the Gradient model and the Scale-Similarity formulation. When the proportionality constant inherent in the SGS models is properly calculated, the Gradient and Scale-Similarity methods give results in excellent agreement with the DNS.

  6. INVENTORY AND CLASSIFICATION OF GREAT LAKES COASTAL WETLANDS FOR MONITORING AND ASSESSMENT AT LARGE SPATIAL SCALES

    EPA Science Inventory

    Monitoring aquatic resources for regional assessments requires an accurate and comprehensive inventory of the resource and useful classification of ecosystem similarities. Our research effort to create an electronic database and work with various ways to classify coastal wetlands...

  7. Learning Deep Representations for Ground to Aerial Geolocalization (Open Access)

    DTIC Science & Technology

    2015-10-15

    proposed approach, Where-CNN, is inspired by deep learning success in face verification and achieves significant improvements over traditional hand-crafted features and existing deep features learned from other large-scale databases. We show the effectiveness of Where-CNN in finding matches

  8. Health-Terrain: Visualizing Large Scale Health Data

    DTIC Science & Technology

    2014-12-01

    systems can only be realized if the quality of emerging large medical databases can be characterized and the meaning of the data understood. For this... Designed and tested an evaluation procedure for a health data visualization system. This visualization framework offers a real-time, web-based solution... rule is shown in the table, with the quality measures of each rule including the support, confidence, Laplace, Gain, p-s, lift and Conviction.

  9. Deep learning with non-medical training used for chest pathology identification

    NASA Astrophysics Data System (ADS)

    Bar, Yaniv; Diamant, Idit; Wolf, Lior; Greenspan, Hayit

    2015-03-01

    In this work, we examine the strength of deep learning approaches for pathology detection in chest radiograph data. Convolutional neural networks (CNN) deep architecture classification approaches have gained popularity due to their ability to learn mid and high level image representations. We explore the ability of a CNN to identify different types of pathologies in chest x-ray images. Moreover, since very large training sets are generally not available in the medical domain, we explore the feasibility of using a deep learning approach based on non-medical learning. We tested our algorithm on a dataset of 93 images. We use a CNN that was trained with ImageNet, a well-known large scale nonmedical image database. The best performance was achieved using a combination of features extracted from the CNN and a set of low-level features. We obtained an area under curve (AUC) of 0.93 for Right Pleural Effusion detection, 0.89 for Enlarged heart detection and 0.79 for classification between healthy and abnormal chest x-ray, where all pathologies are combined into one large class. This is a first-of-its-kind experiment that shows that deep learning with large scale non-medical image databases may be sufficient for general medical image recognition tasks.
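
    The reported pipeline extracts features from an ImageNet-trained CNN and concatenates them with low-level features before a conventional classifier. A sketch of that pattern using a ResNet-18 backbone from torchvision as a stand-in (the original work predates this architecture, so treat the backbone choice, preprocessing, and the histogram feature as assumptions):

        import numpy as np
        import torch
        from PIL import Image
        from torchvision import models, transforms

        # ImageNet-trained backbone used purely as a fixed feature extractor (torchvision >= 0.13)
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = torch.nn.Identity()            # expose the 512-d pooled features
        backbone.eval()

        preprocess = transforms.Compose([
            transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ])

        def features(path):
            """CNN features concatenated with a simple low-level intensity histogram."""
            img = Image.open(path).convert("RGB")
            with torch.no_grad():
                deep = backbone(preprocess(img).unsqueeze(0)).squeeze(0).numpy()
            low = np.histogram(np.asarray(img.convert("L")), bins=32, range=(0, 255))[0]
            return np.concatenate([deep, low / low.sum()])

        # the combined vectors would then feed a conventional classifier, e.g. an SVM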

  10. Using Chemoinformatics, Bioinformatics, and Bioassay to Predict and Explain the Antibacterial Activity of Nonantibiotic Food and Drug Administration Drugs.

    PubMed

    Kahlous, Nour Aldin; Bawarish, Muhammad Al Mohdi; Sarhan, Muhammad Arabi; Küpper, Manfred; Hasaba, Ali; Rajab, Mazen

    2017-04-01

    The discovery of new and effective antibiotics is a major issue facing scientists today. Fortunately, the development of computer science offers new methods to address this issue. In this study, a set of computer software was used to predict the antibacterial activity of nonantibiotic Food and Drug Administration (FDA)-approved drugs, and to explain their action by possible binding to well-known bacterial protein targets, along with testing their antibacterial activity against Gram-positive and Gram-negative bacteria. A three-dimensional virtual screening method that relies on chemical and shape similarity was applied using rapid overlay of chemical structures (ROCS) software to select candidate compounds from the FDA-approved drugs database that share similarity with 17 known antibiotics. Then, to check their antibacterial activity, a disk diffusion test was applied to Staphylococcus aureus and Escherichia coli. Finally, a protein docking method was applied using HYBRID software to predict the binding of each active candidate to the target receptor of its similar antibiotic. Of the 1,991 drugs that were screened, 34 were selected, and among them 10 drugs showed antibacterial activity; the activities of drotaverine and metoclopramide had not been previously reported. Furthermore, the docking process predicted that diclofenac, drotaverine, (S)-flurbiprofen, (S)-ibuprofen, and indomethacin could bind to the protein targets of their similar antibiotics. Nevertheless, their antibacterial activities are weak compared with those of their similar antibiotics, and could be potentiated further by performing chemical modifications on their structures.

  11. Combination of 2D/3D ligand-based similarity search in rapid virtual screening from multimillion compound repositories. Selection and biological evaluation of potential PDE4 and PDE5 inhibitors.

    PubMed

    Dobi, Krisztina; Hajdú, István; Flachner, Beáta; Fabó, Gabriella; Szaszkó, Mária; Bognár, Melinda; Magyar, Csaba; Simon, István; Szisz, Dániel; Lőrincz, Zsolt; Cseh, Sándor; Dormán, György

    2014-05-28

    Rapid in silico selection of target-focused libraries from commercial repositories is an attractive and cost-effective approach. If structures of active compounds are available, a rapid 2D similarity search can be performed on multimillion-compound databases, but the generated library requires further focusing by various 2D/3D chemoinformatics tools. We report here a combination of the 2D approach with a ligand-based 3D method (Screen3D) which applies flexible matching to align reference and target compounds in a dynamic manner and thus to assess their structural and conformational similarity. In the first case study we compared the 2D and 3D similarity scores on an existing dataset derived from the biological evaluation of a PDE5-focused library. Based on the obtained similarity metrics, a fusion score was proposed. The fusion score was applied to refine the 2D similarity search in a second case study, where we aimed at selecting and evaluating a PDE4B-focused library. The application of this fused 2D/3D similarity measure led to an increase of the hit rate from 8.5% (1st round, 47% inhibition at 10 µM) to 28.5% (2nd round, 50% inhibition at 10 µM), and the best two hits had 53 nM inhibitory activities.
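
    The 2D leg of such a workflow can be sketched with open-source RDKit Morgan fingerprints and Tanimoto similarity; the 3D score would come from a tool such as Screen3D, and the equal-weight fusion rule below is an assumption rather than the score defined in the paper (the SMILES strings are placeholders):

        from rdkit import Chem
        from rdkit.Chem import AllChem, DataStructs

        def morgan_fp(smiles, radius=2, n_bits=2048):
            return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles),
                                                         radius, nBits=n_bits)

        def tanimoto_2d(query_smiles, target_smiles):
            return DataStructs.TanimotoSimilarity(morgan_fp(query_smiles),
                                                  morgan_fp(target_smiles))

        def fused_score(score_2d, score_3d, w=0.5):
            """Toy fusion of 2D and 3D similarity scores (weights are an assumption)."""
            return w * score_2d + (1 - w) * score_3d

        reference = "CCO"                 # placeholder SMILES for a known active
        candidate = "CCN"                 # placeholder SMILES for a library member
        s2d = tanimoto_2d(reference, candidate)
        print(round(fused_score(s2d, score_3d=0.7), 3))   # 3D score supplied externally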

  12. Scaling up health knowledge at European level requires sharing integrated data: an approach for collection of database specification

    PubMed Central

    Menditto, Enrica; Bolufer De Gea, Angela; Cahir, Caitriona; Marengoni, Alessandra; Riegler, Salvatore; Fico, Giuseppe; Costa, Elisio; Monaco, Alessandro; Pecorelli, Sergio; Pani, Luca; Prados-Torres, Alexandra

    2016-01-01

    Computerized health care databases have been widely described as an excellent opportunity for research. The availability of “big data” has brought about a wave of innovation in projects conducting health services research. Most of the available secondary data sources are restricted to the geographical scope of a given country and present heterogeneous structure and content. Under the umbrella of the European Innovation Partnership on Active and Healthy Ageing, collaborative work conducted by the partners of the group on “adherence to prescription and medical plans” identified the use of observational and large-population databases to monitor medication-taking behavior in the elderly. This article describes the methodology used to gather the information from available databases among the Adherence Action Group partners with the aim of improving data sharing on a European level. A total of six databases belonging to three different European countries (Spain, Republic of Ireland, and Italy) were included in the analysis. Preliminary results suggest that there are some similarities. However, these results should be applied in different contexts and European countries, supporting the idea that large European studies should be designed in order to get the most out of already available databases. PMID:27358570

  13. Real-time terrain storage generation from multiple sensors towards mobile robot operation interface.

    PubMed

    Song, Wei; Cho, Seoungjae; Xi, Yulong; Cho, Kyungeun; Um, Kyhyun

    2014-01-01

    A mobile robot mounted with multiple sensors is used to rapidly collect 3D point clouds and video images so as to allow accurate terrain modeling. In this study, we develop a real-time terrain storage generation and representation system including a nonground point database (PDB), ground mesh database (MDB), and texture database (TDB). A voxel-based flag map is proposed for incrementally registering large-scale point clouds in a terrain model in real time. We quantize the 3D point clouds into 3D grids of the flag map as a comparative table in order to remove the redundant points. We integrate the large-scale 3D point clouds into a nonground PDB and a node-based terrain mesh using the CPU. Subsequently, we program a graphics processing unit (GPU) to generate the TDB by mapping the triangles in the terrain mesh onto the captured video images. Finally, we produce a nonground voxel map and a ground textured mesh as a terrain reconstruction result. Our proposed methods were tested in an outdoor environment. Our results show that the proposed system was able to rapidly generate terrain storage and provide high resolution terrain representation for mobile mapping services and a graphical user interface between remote operators and mobile robots.

  14. Real-Time Terrain Storage Generation from Multiple Sensors towards Mobile Robot Operation Interface

    PubMed Central

    Cho, Seoungjae; Xi, Yulong; Cho, Kyungeun

    2014-01-01

    A mobile robot mounted with multiple sensors is used to rapidly collect 3D point clouds and video images so as to allow accurate terrain modeling. In this study, we develop a real-time terrain storage generation and representation system including a nonground point database (PDB), ground mesh database (MDB), and texture database (TDB). A voxel-based flag map is proposed for incrementally registering large-scale point clouds in a terrain model in real time. We quantize the 3D point clouds into 3D grids of the flag map as a comparative table in order to remove the redundant points. We integrate the large-scale 3D point clouds into a nonground PDB and a node-based terrain mesh using the CPU. Subsequently, we program a graphics processing unit (GPU) to generate the TDB by mapping the triangles in the terrain mesh onto the captured video images. Finally, we produce a nonground voxel map and a ground textured mesh as a terrain reconstruction result. Our proposed methods were tested in an outdoor environment. Our results show that the proposed system was able to rapidly generate terrain storage and provide high resolution terrain representation for mobile mapping services and a graphical user interface between remote operators and mobile robots. PMID:25101321
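
    The flag-map idea described in the two records above amounts to quantizing incoming points onto a voxel grid and registering each voxel only once across incremental scans. A minimal NumPy sketch (the voxel size and point clouds are invented; the GPU texture mapping is not shown):

        import numpy as np

        def voxel_downsample(points, voxel_size=0.05, flag_map=None):
            """Keep at most one point per occupied voxel, across incremental scans."""
            if flag_map is None:
                flag_map = set()                       # voxels already registered
            keys = np.floor(points / voxel_size).astype(np.int64)
            kept = []
            for point, key in zip(points, map(tuple, keys)):
                if key not in flag_map:                # first hit on this voxel wins
                    flag_map.add(key)
                    kept.append(point)
            return np.array(kept), flag_map

        rng = np.random.default_rng(1)
        scan1 = rng.uniform(0, 1, size=(10000, 3))
        scan2 = rng.uniform(0, 1, size=(10000, 3))     # a later, overlapping scan
        kept1, flags = voxel_downsample(scan1)
        kept2, flags = voxel_downsample(scan2, flag_map=flags)
        print(len(kept1), len(kept2))                  # the second scan adds fewer points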

  15. Bioclipse: an open source workbench for chemo- and bioinformatics.

    PubMed

    Spjuth, Ola; Helmus, Tobias; Willighagen, Egon L; Kuhn, Stefan; Eklund, Martin; Wagener, Johannes; Murray-Rust, Peter; Steinbeck, Christoph; Wikberg, Jarl E S

    2007-02-22

    There is a need for software applications that provide users with a complete and extensible toolkit for chemo- and bioinformatics accessible from a single workbench. Commercial packages are expensive and closed source, hence they do not allow end users to modify algorithms and add custom functionality. Existing open source projects are more focused on providing a framework for integrating existing, separately installed bioinformatics packages, rather than providing user-friendly interfaces. No open source chemoinformatics workbench has previously been published, and no successful attempts have been made to integrate chemo- and bioinformatics into a single framework. Bioclipse is an advanced workbench for resources in chemo- and bioinformatics, such as molecules, proteins, sequences, spectra, and scripts. It provides 2D-editing, 3D-visualization, file format conversion, calculation of chemical properties, and much more; all fully integrated into a user-friendly desktop application. Editing supports standard functions such as cut and paste, drag and drop, and undo/redo. Bioclipse is written in Java and based on the Eclipse Rich Client Platform with a state-of-the-art plugin architecture. This gives Bioclipse an advantage over other systems as it can easily be extended with functionality in any desired direction. Bioclipse is a powerful workbench for bio- and chemoinformatics as well as an advanced integration platform. The rich functionality, intuitive user interface, and powerful plugin architecture make Bioclipse the most advanced and user-friendly open source workbench for chemo- and bioinformatics. Bioclipse is released under Eclipse Public License (EPL), an open source license which sets no constraints on external plugin licensing; it is totally open for both open source plugins as well as commercial ones. Bioclipse is freely available at http://www.bioclipse.net.

  16. Chemometric classification of morphologically similar Umbelliferae medicinal herbs by DART-TOF-MS fingerprint.

    PubMed

    Lee, Sang Min; Kim, Hye-Jin; Jang, Young Pyo

    2012-01-01

    It takes many years of specialized training to gain expertise in the organoleptic classification of botanical raw materials and, even for those experts, discrimination among Umbelliferae medicinal herbs remains an intricate challenge due to their morphological similarity. The aim of this study was to develop a new chemometric classification method using direct analysis in real time-time of flight-mass spectrometry (DART-TOF-MS) fingerprinting for Umbelliferae medicinal herbs and to provide a platform for its application to the discrimination of other herbal medicines. Angelica tenuissima, Angelica gigas, Angelica dahurica and Cnidium officinale were chosen for this study and ten samples of each species were purchased from various Korean markets. DART-TOF-MS was employed on powdered raw materials to obtain a chemical fingerprint of each sample, and the orthogonal partial-least squares method in discriminant analysis (OPLS-DA) was used for multivariate analysis. All samples of the collected species were successfully discriminated from each other according to their characteristic DART-TOF-MS fingerprint. Decursin (or decursinol angelate) and byakangelicol were identified as marker molecules for Angelica gigas and A. dahurica, respectively. Using the OPLS method for discriminant analysis, Angelica tenuissima and Cnidium officinale were clearly separated into two groups. Angelica tenuissima was characterised by the presence of ligustilide and unidentified molecular ions of m/z 239 and 283, while senkyunolide A together with signals at m/z 387 and 389 were the marker compounds for Cnidium officinale. DART-TOF-MS fingerprinting combined with chemoinformatic tools results in a powerful method for the classification of morphologically similar Umbelliferae medicinal herbs and quality control of medicinal herbal products, including the extracts of these crude drugs. Copyright © 2012 John Wiley & Sons, Ltd.
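
    The discriminant-analysis step can be approximated with ordinary PLS-DA in scikit-learn by fitting one-hot class targets to binned spectral intensities; this is an accessible stand-in for the OPLS-DA used in the study, and the data below are synthetic:

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.model_selection import cross_val_predict

        rng = np.random.default_rng(0)
        class_means = rng.normal(0.0, 1.0, size=(4, 50))   # four species, 50 m/z bins
        X = np.vstack([rng.normal(class_means[k], 1.0, size=(10, 50)) for k in range(4)])
        species = np.repeat([0, 1, 2, 3], 10)
        Y = np.eye(4)[species]                             # one-hot targets for PLS-DA

        pls = PLSRegression(n_components=3)
        scores = cross_val_predict(pls, X, Y, cv=5)        # cross-validated class scores
        print("CV accuracy:", (scores.argmax(axis=1) == species).mean())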

  17. QuBiLS-MIDAS: a parallel free-software for molecular descriptors computation based on multilinear algebraic maps.

    PubMed

    García-Jacas, César R; Marrero-Ponce, Yovani; Acevedo-Martínez, Liesner; Barigye, Stephen J; Valdés-Martiní, José R; Contreras-Torres, Ernesto

    2014-07-05

    The present report introduces the QuBiLS-MIDAS software belonging to the ToMoCoMD-CARDD suite for the calculation of three-dimensional molecular descriptors (MDs) based on the two-linear (bilinear), three-linear, and four-linear (multilinear or N-linear) algebraic forms. Thus, it is unique software that computes these tensor-based indices. These descriptors establish relations among two, three, and four atoms by using several (dis-)similarity metrics or multimetrics, matrix transformations, cutoffs, local calculations and aggregation operators. The theoretical background of these N-linear indices is also presented. The QuBiLS-MIDAS software was developed in the Java programming language and employs the Chemistry Development Kit library for the manipulation of the chemical structures and the calculation of the atomic properties. This software is composed of a user-friendly desktop interface and an Application Programming Interface library. The former was created to simplify the configuration of the different options of the MDs, whereas the library was designed to allow its easy integration into other software for chemoinformatics applications. This program provides functionalities for data cleaning tasks and for batch processing of the molecular indices. In addition, it offers parallel calculation of the MDs through the use of all available processors in current computers. The complexity studies of the main algorithms demonstrate that they were implemented efficiently with respect to their trivial implementations. Lastly, the performance tests reveal that the software scales suitably as the number of processors increases. Therefore, the QuBiLS-MIDAS software constitutes a useful application for the computation of the molecular indices based on N-linear algebraic maps and it can be used freely to perform chemoinformatics studies. Copyright © 2014 Wiley Periodicals, Inc.
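
    The algebraic core of such indices is a bilinear form x^T M y, where M encodes atom-atom relations and x, y are atomic property vectors. A toy sketch of that form (the relation matrix and property values are illustrative, not QuBiLS-MIDAS output):

        import numpy as np

        def bilinear_index(M, x, y):
            """Two-linear (bilinear) algebraic form x^T M y over atomic properties."""
            return float(x @ M @ y)

        # toy 4-atom molecule: adjacency-style relation matrix with self-relations
        M = np.array([[1, 1, 0, 0],
                      [1, 1, 1, 0],
                      [0, 1, 1, 1],
                      [0, 0, 1, 1]], dtype=float)

        electronegativity = np.array([2.55, 2.55, 3.44, 2.20])   # e.g. C, C, O, H
        polarizability    = np.array([1.76, 1.76, 0.80, 0.67])   # approximate values

        print(round(bilinear_index(M, electronegativity, polarizability), 3))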

  18. Characterizing the Response of Composite Panels to a Pyroshock Induced Environment Using Design of Experiments Methodology

    NASA Technical Reports Server (NTRS)

    Parsons, David S.; Ordway, David; Johnson, Kenneth

    2013-01-01

    This experimental study seeks to quantify the impact various composite parameters have on the structural response of a composite structure in a pyroshock environment. The prediction of an aerospace structure's response to pyroshock induced loading is largely dependent on empirical databases created from collections of development and flight test data. While there is significant structural response data due to pyroshock induced loading for metallic structures, there is much less data available for composite structures. One challenge of developing a composite pyroshock response database as well as empirical prediction methods for composite structures is the large number of parameters associated with composite materials. This experimental study uses data from a test series planned using design of experiments (DOE) methods. Statistical analysis methods are then used to identify which composite material parameters most greatly influence a flat composite panel's structural response to pyroshock induced loading. The parameters considered are panel thickness, type of ply, ply orientation, and pyroshock level induced into the panel. The results of this test will aid in future large scale testing by eliminating insignificant parameters as well as aid in the development of empirical scaling methods for composite structures' response to pyroshock induced loading.

  19. Characterizing the Response of Composite Panels to a Pyroshock Induced Environment using Design of Experiments Methodology

    NASA Technical Reports Server (NTRS)

    Parsons, David S.; Ordway, David O.; Johnson, Kenneth L.

    2013-01-01

    This experimental study seeks to quantify the impact various composite parameters have on the structural response of a composite structure in a pyroshock environment. The prediction of an aerospace structure's response to pyroshock induced loading is largely dependent on empirical databases created from collections of development and flight test data. While there is significant structural response data due to pyroshock induced loading for metallic structures, there is much less data available for composite structures. One challenge of developing a composite pyroshock response database as well as empirical prediction methods for composite structures is the large number of parameters associated with composite materials. This experimental study uses data from a test series planned using design of experiments (DOE) methods. Statistical analysis methods are then used to identify which composite material parameters most greatly influence a flat composite panel's structural response to pyroshock induced loading. The parameters considered are panel thickness, type of ply, ply orientation, and pyroshock level induced into the panel. The results of this test will aid in future large scale testing by eliminating insignificant parameters as well as aid in the development of empirical scaling methods for composite structures' response to pyroshock induced loading.

  20. Privacy-preserving search for chemical compound databases.

    PubMed

    Shimizu, Kana; Nuida, Koji; Arai, Hiromi; Mitsunari, Shigeo; Attrapadung, Nuttapong; Hamada, Michiaki; Tsuda, Koji; Hirokawa, Takatsugu; Sakuma, Jun; Hanaoka, Goichiro; Asai, Kiyoshi

    2015-01-01

    Searching for similar compounds in a database is the most important process for in-silico drug screening. Since a query compound is an important starting point for the new drug, a query holder, who is afraid of the query being monitored by the database server, usually downloads all the records in the database and uses them in a closed network. However, a serious dilemma arises when the database holder also wants to output no information except for the search results, and such a dilemma prevents the use of many important data resources. In order to overcome this dilemma, we developed a novel cryptographic protocol that enables database searching while keeping both the query holder's privacy and database holder's privacy. Generally, the application of cryptographic techniques to practical problems is difficult because versatile techniques are computationally expensive while computationally inexpensive techniques can perform only trivial computation tasks. In this study, our protocol is successfully built only from an additive-homomorphic cryptosystem, which allows only addition performed on encrypted values but is computationally efficient compared with versatile techniques such as general purpose multi-party computation. In an experiment searching ChEMBL, which consists of more than 1,200,000 compounds, the proposed method was 36,900 times faster in CPU time and 12,000 times as efficient in communication size compared with general purpose multi-party computation. We proposed a novel privacy-preserving protocol for searching chemical compound databases. The proposed method, easily scaling for large-scale databases, may help to accelerate drug discovery research by making full use of unused but valuable data that includes sensitive information.

  1. Privacy-preserving search for chemical compound databases

    PubMed Central

    2015-01-01

    Background: Searching for similar compounds in a database is the most important process for in-silico drug screening. Since a query compound is an important starting point for the new drug, a query holder, who is afraid of the query being monitored by the database server, usually downloads all the records in the database and uses them in a closed network. However, a serious dilemma arises when the database holder also wants to output no information except for the search results, and such a dilemma prevents the use of many important data resources. Results: In order to overcome this dilemma, we developed a novel cryptographic protocol that enables database searching while keeping both the query holder's privacy and database holder's privacy. Generally, the application of cryptographic techniques to practical problems is difficult because versatile techniques are computationally expensive while computationally inexpensive techniques can perform only trivial computation tasks. In this study, our protocol is successfully built only from an additive-homomorphic cryptosystem, which allows only addition performed on encrypted values but is computationally efficient compared with versatile techniques such as general purpose multi-party computation. In an experiment searching ChEMBL, which consists of more than 1,200,000 compounds, the proposed method was 36,900 times faster in CPU time and 12,000 times as efficient in communication size compared with general purpose multi-party computation. Conclusion: We proposed a novel privacy-preserving protocol for searching chemical compound databases. The proposed method, easily scaling for large-scale databases, may help to accelerate drug discovery research by making full use of unused but valuable data that includes sensitive information. PMID:26678650
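
    The additive homomorphism this relies on can be demonstrated with the python-paillier package (phe): the database side combines encrypted query bits without decrypting them, and only the query holder can read the aggregate. This illustrates the primitive only, not the authors' full protocol, and the fingerprints below are made up:

        from phe import paillier   # pip install phe

        # the query holder generates a key pair and encrypts its fingerprint bitwise
        public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)
        query_bits = [1, 0, 1, 1, 0, 1, 0, 0]
        encrypted_query = [public_key.encrypt(b) for b in query_bits]

        # the database holder computes the encrypted bit overlap (Tanimoto numerator)
        # for one of its compounds, without ever seeing the plaintext query
        db_bits = [1, 1, 1, 0, 0, 1, 0, 1]
        encrypted_overlap = sum(c for c, b in zip(encrypted_query, db_bits) if b)

        # only the query holder can decrypt the result
        print(private_key.decrypt(encrypted_overlap))   # 3 common on-bits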

  2. Inter-annual and decadal changes in teleconnections drive continental-scale synchronization of tree reproduction.

    PubMed

    Ascoli, Davide; Vacchiano, Giorgio; Turco, Marco; Conedera, Marco; Drobyshev, Igor; Maringer, Janet; Motta, Renzo; Hacket-Pain, Andrew

    2017-12-20

    Climate teleconnections drive highly variable and synchronous seed production (masting) over large scales. Disentangling the effect of high-frequency (inter-annual variation) from low-frequency (decadal trends) components of climate oscillations will improve our understanding of masting as an ecosystem process. Using century-long observations on masting (the MASTREE database) and data on the Northern Atlantic Oscillation (NAO), we show that in the last 60 years both high-frequency summer and spring NAO, and low-frequency winter NAO components are highly correlated to continent-wide masting in European beech and Norway spruce. Relationships are weaker (non-stationary) in the early twentieth century. This finding improves our understanding on how climate variation affects large-scale synchronization of tree masting. Moreover, it supports the connection between proximate and ultimate causes of masting: indeed, large-scale features of atmospheric circulation coherently drive cues and resources for masting, as well as its evolutionary drivers, such as pollination efficiency, abundance of seed dispersers, and natural disturbance regimes.

  3. Advanced Model for Extreme Lift and Improved Aeroacoustics (AMELIA)

    NASA Technical Reports Server (NTRS)

    Lichtwardt, Jonathan; Paciano, Eric; Jameson, Tina; Fong, Robert; Marshall, David

    2012-01-01

    With the very recent advent of NASA's Environmentally Responsible Aviation Project (ERA), which is dedicated to designing aircraft that will reduce the impact of aviation on the environment, there is a need for research and development of methodologies to minimize fuel burn and emissions and to reduce community noise produced by regional airliners. ERA tackles airframe technology, propulsion technology, and vehicle systems integration to meet performance objectives in the time frame for the aircraft to be at a Technology Readiness Level (TRL) of 4-6 by the year 2020 (deemed N+2). The preceding project that investigated similar goals to ERA was NASA's Subsonic Fixed Wing (SFW) project. SFW focused on conducting research to improve prediction methods and technologies that will produce lower noise, lower emissions, and higher performing subsonic aircraft for the Next Generation Air Transportation System. The work described in this investigation was performed under NASA Research Announcement (NRA) contract #NNL07AA55C, funded by Subsonic Fixed Wing. The project started in 2007 with a specific goal of conducting a large-scale wind tunnel test along with the development of new and improved predictive codes for the advanced powered-lift concepts. Many of the predictive codes were incorporated to refine the wind tunnel model outer mold line design. The goal of the large-scale wind tunnel test was to investigate powered-lift technologies and provide an experimental database to validate current and future modeling techniques. The powered-lift concept investigated was a Circulation Control (CC) wing in conjunction with over-the-wing mounted engines, which entrain the exhaust to further increase the lift generated by CC technologies alone. The NRA was a five-year effort; during the first year the objective was to select and refine CESTOL concepts and then to complete a preliminary design of a large-scale wind tunnel model for the large-scale test. During the second, third, and fourth years the large-scale wind tunnel model design would be completed and the model manufactured and calibrated. During the fifth year the large-scale wind tunnel test was conducted. This technical memo describes all phases of the Advanced Model for Extreme Lift and Improved Aeroacoustics (AMELIA) project and provides a brief summary of the background and modeling efforts involved in the NRA. The conceptual designs considered for this project and the decision process for the selected configuration adapted for a wind tunnel model are briefly discussed. The internal configuration of AMELIA and the internal measurements chosen to satisfy the requirement of obtaining a database of experimental data for future computational model validations are also described. The external experimental techniques that were employed during the test, along with the large-scale wind tunnel test facility, are covered in great detail. Experimental measurements in the database include forces and moments, surface pressure distributions, local skin friction measurements, boundary and shear layer velocity profiles, far-field acoustic data, and noise signatures from turbofan propulsion simulators. Results for the circulation control performance, the over-the-wing mounted engines, and the combined configuration are also discussed in detail.

  4. Quadratic integrand double-hybrid made spin-component-scaled

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Brémond, Éric, E-mail: eric.bremond@iit.it; Savarese, Marika; Sancho-García, Juan C.

    2016-03-28

    We propose two analytical expressions aiming to rationalize the spin-component-scaled (SCS) and spin-opposite-scaled (SOS) schemes for double-hybrid exchange-correlation density functionals. Their performance is extensively tested within the framework of the nonempirical quadratic integrand double-hybrid (QIDH) model on energetic properties included in the very large GMTKN30 benchmark database, and on structural properties of semirigid medium-sized organic compounds. The SOS variant is revealed as a less computationally demanding alternative that reaches the accuracy of the original QIDH model without losing any theoretical background.
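
    The generic partition that SCS and SOS schemes apply to the correlation energy can be written as follows (a standard textbook form; the specific coefficients proposed in the paper are not quoted here):

        E_c^{\mathrm{SCS}} = c_{\mathrm{OS}}\, E_c^{\mathrm{OS}} + c_{\mathrm{SS}}\, E_c^{\mathrm{SS}},
        \qquad
        E_c^{\mathrm{SOS}} = c_{\mathrm{OS}}\, E_c^{\mathrm{OS}} \quad (c_{\mathrm{SS}} = 0)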

  5. An evaluation of multi-probe locality sensitive hashing for computing similarities over web-scale query logs

    PubMed Central

    2018-01-01

    Many modern applications of AI such as web search, mobile browsing, image processing, and natural language processing rely on finding similar items from a large database of complex objects. Due to the very large scale of data involved (e.g., users’ queries from commercial search engines), computing such near or nearest neighbors is a non-trivial task, as the computational cost grows significantly with the number of items. To address this challenge, we adopt Locality Sensitive Hashing (a.k.a. LSH) methods and evaluate four variants in a distributed computing environment (specifically, Hadoop). We identify several optimizations which improve performance, suitable for deployment in very large scale settings. The experimental results demonstrate that our variants of LSH achieve robust performance with better recall compared with “vanilla” LSH, even when using the same amount of space. PMID:29346410

  6. MycoDB, a global database of plant response to mycorrhizal fungi.

    PubMed

    Chaudhary, V Bala; Rúa, Megan A; Antoninka, Anita; Bever, James D; Cannon, Jeffery; Craig, Ashley; Duchicela, Jessica; Frame, Alicia; Gardes, Monique; Gehring, Catherine; Ha, Michelle; Hart, Miranda; Hopkins, Jacob; Ji, Baoming; Johnson, Nancy Collins; Kaonongbua, Wittaya; Karst, Justine; Koide, Roger T; Lamit, Louis J; Meadow, James; Milligan, Brook G; Moore, John C; Pendergast, Thomas H; Piculell, Bridget; Ramsby, Blake; Simard, Suzanne; Shrestha, Shubha; Umbanhowar, James; Viechtbauer, Wolfgang; Walters, Lawrence; Wilson, Gail W T; Zee, Peter C; Hoeksema, Jason D

    2016-05-10

    Plants form belowground associations with mycorrhizal fungi in one of the most common symbioses on Earth. However, few large-scale generalizations exist for the structure and function of mycorrhizal symbioses, as the nature of this relationship varies from mutualistic to parasitic and is largely context-dependent. We announce the public release of MycoDB, a database of 4,010 studies (from 438 unique publications) to aid in multi-factor meta-analyses elucidating the ecological and evolutionary context in which mycorrhizal fungi alter plant productivity. Over 10 years with nearly 80 collaborators, we compiled data on the response of plant biomass to mycorrhizal fungal inoculation, including meta-analysis metrics and 24 additional explanatory variables that describe the biotic and abiotic context of each study. We also include phylogenetic trees for all plants and fungi in the database. To our knowledge, MycoDB is the largest ecological meta-analysis database. We aim to share these data to highlight significant gaps in mycorrhizal research and encourage synthesis to explore the ecological and evolutionary generalities that govern mycorrhizal functioning in ecosystems.

  7. MycoDB, a global database of plant response to mycorrhizal fungi

    PubMed Central

    Chaudhary, V. Bala; Rúa, Megan A.; Antoninka, Anita; Bever, James D.; Cannon, Jeffery; Craig, Ashley; Duchicela, Jessica; Frame, Alicia; Gardes, Monique; Gehring, Catherine; Ha, Michelle; Hart, Miranda; Hopkins, Jacob; Ji, Baoming; Johnson, Nancy Collins; Kaonongbua, Wittaya; Karst, Justine; Koide, Roger T.; Lamit, Louis J.; Meadow, James; Milligan, Brook G.; Moore, John C.; Pendergast IV, Thomas H.; Piculell, Bridget; Ramsby, Blake; Simard, Suzanne; Shrestha, Shubha; Umbanhowar, James; Viechtbauer, Wolfgang; Walters, Lawrence; Wilson, Gail W. T.; Zee, Peter C.; Hoeksema, Jason D.

    2016-01-01

    Plants form belowground associations with mycorrhizal fungi in one of the most common symbioses on Earth. However, few large-scale generalizations exist for the structure and function of mycorrhizal symbioses, as the nature of this relationship varies from mutualistic to parasitic and is largely context-dependent. We announce the public release of MycoDB, a database of 4,010 studies (from 438 unique publications) to aid in multi-factor meta-analyses elucidating the ecological and evolutionary context in which mycorrhizal fungi alter plant productivity. Over 10 years with nearly 80 collaborators, we compiled data on the response of plant biomass to mycorrhizal fungal inoculation, including meta-analysis metrics and 24 additional explanatory variables that describe the biotic and abiotic context of each study. We also include phylogenetic trees for all plants and fungi in the database. To our knowledge, MycoDB is the largest ecological meta-analysis database. We aim to share these data to highlight significant gaps in mycorrhizal research and encourage synthesis to explore the ecological and evolutionary generalities that govern mycorrhizal functioning in ecosystems. PMID:27163938

  8. MycoDB, a global database of plant response to mycorrhizal fungi

    NASA Astrophysics Data System (ADS)

    Chaudhary, V. Bala; Rúa, Megan A.; Antoninka, Anita; Bever, James D.; Cannon, Jeffery; Craig, Ashley; Duchicela, Jessica; Frame, Alicia; Gardes, Monique; Gehring, Catherine; Ha, Michelle; Hart, Miranda; Hopkins, Jacob; Ji, Baoming; Johnson, Nancy Collins; Kaonongbua, Wittaya; Karst, Justine; Koide, Roger T.; Lamit, Louis J.; Meadow, James; Milligan, Brook G.; Moore, John C.; Pendergast, Thomas H., IV; Piculell, Bridget; Ramsby, Blake; Simard, Suzanne; Shrestha, Shubha; Umbanhowar, James; Viechtbauer, Wolfgang; Walters, Lawrence; Wilson, Gail W. T.; Zee, Peter C.; Hoeksema, Jason D.

    2016-05-01

    Plants form belowground associations with mycorrhizal fungi in one of the most common symbioses on Earth. However, few large-scale generalizations exist for the structure and function of mycorrhizal symbioses, as the nature of this relationship varies from mutualistic to parasitic and is largely context-dependent. We announce the public release of MycoDB, a database of 4,010 studies (from 438 unique publications) to aid in multi-factor meta-analyses elucidating the ecological and evolutionary context in which mycorrhizal fungi alter plant productivity. Over 10 years with nearly 80 collaborators, we compiled data on the response of plant biomass to mycorrhizal fungal inoculation, including meta-analysis metrics and 24 additional explanatory variables that describe the biotic and abiotic context of each study. We also include phylogenetic trees for all plants and fungi in the database. To our knowledge, MycoDB is the largest ecological meta-analysis database. We aim to share these data to highlight significant gaps in mycorrhizal research and encourage synthesis to explore the ecological and evolutionary generalities that govern mycorrhizal functioning in ecosystems.

  9. NREL Supercomputer Tackles Grid Challenges | News | NREL

    Science.gov Websites

    "Big data" — including imagery and large-scale simulation data that go beyond traditional database processes — figures prominently in NREL's grid research. The Peregrine supercomputer supports this work, and collaboration is hard-wired into the ESIF's core. (Photo captions by Dennis Schroeder, NREL, omitted.)

  10. Freight Demand Characteristics and Mode Choice: An Analysis of the Results of Modeling with Disaggregate Revealed Preference Data

    DOT National Transportation Integrated Search

    1999-12-01

    This paper analyzes the freight demand characteristics that drive modal choice by means of a large scale, national, disaggregate revealed preference database for shippers in France in 1988, using a nested logit. Particular attention is given to priva...

  11. The Starkey habitat database for ungulate research: construction, documentation, and use.

    Treesearch

    Mary M. Rowland; Priscilla K. Coe; Rosemary J. Stussy; [and others].

    1998-01-01

    The Starkey Project, a large-scale, multidisciplinary research venture, began in 1987 in the Starkey Experimental Forest and Range in northeast Oregon. Researchers are studying effects of forest management on interactions and habitat use of mule deer (Odocoileus hemionus hemionus), elk (Cervus elaphus nelsoni), and cattle. A...

  12. Large Eddy Simulation of jets laden with evaporating drops

    NASA Technical Reports Server (NTRS)

    Leboissetier, A.; Okong'o, N.; Bellan, J.

    2004-01-01

    Large Eddy Simulations (LES) of a circular jet laden with evaporating liquid drops are conducted to assess computational-drop modeling and three different SGS-flux models: the Scale Similarity model (SSC), which uses a constant coefficient calibrated on a temporal mixing layer DNS database, and the dynamic-coefficient Gradient and Smagorinsky models.

  13. Fast Crystallization of the Phase Change Compound GeTe by Large-Scale Molecular Dynamics Simulations.

    PubMed

    Sosso, Gabriele C; Miceli, Giacomo; Caravati, Sebastiano; Giberti, Federico; Behler, Jörg; Bernasconi, Marco

    2013-12-19

    Phase change materials are of great interest as active layers in rewritable optical disks and novel electronic nonvolatile memories. These applications rest on a fast and reversible transformation between the amorphous and crystalline phases upon heating, taking place on the nanosecond time scale. In this work, we investigate the microscopic origin of the fast crystallization process by means of large-scale molecular dynamics simulations of the phase change compound GeTe. To this end, we use an interatomic potential generated from a Neural Network fitting of a large database of ab initio energies. We demonstrate that in the temperature range of the programming protocols of the electronic memories (500-700 K), nucleation of the crystal in the supercooled liquid is not rate-limiting. In this temperature range, the growth of supercritical nuclei is very fast because of a large atomic mobility, which is, in turn, the consequence of the high fragility of the supercooled liquid and the associated breakdown of the Stokes-Einstein relation between viscosity and diffusivity.

  14. Experimental Study of Homogeneous Isotropic Slowly-Decaying Turbulence in Giant Grid-Wind Tunnel Set Up

    NASA Astrophysics Data System (ADS)

    Aliseda, Alberto; Bourgoin, Mickael; Eswirp Collaboration

    2014-11-01

    We present preliminary results from a recent grid turbulence experiment conducted at the ONERA wind tunnel in Modane, France. The ESWIRP Collaboration was conceived to probe the smallest scales of a canonical turbulent flow with very high Reynolds numbers. To achieve this, the largest scales of the turbulence need to be extremely big so that, even with the large separation of scales, the smallest scales would be well above the spatial and temporal resolution of the instruments. The ONERA wind tunnel in Modane (8 m-diameter test section) was chosen as a limit of the biggest large scales achievable in a laboratory setting. A giant inflatable grid (M = 0.8 m) was conceived to induce slowly-decaying homogeneous isotropic turbulence in a large region of the test section, with minimal structural risk. An international team of researchers collected hot wire anemometry, ultrasound anemometry, resonant cantilever anemometry, fast Pitot tube anemometry, cold wire thermometry and high-speed particle tracking data of this canonical turbulent flow. While analysis of this large database, which will become publicly available over the next 2 years, has only started, the Taylor-scale Reynolds number is estimated to be between 400 and 800, with Kolmogorov scales as large as a few mm. The ESWIRP Collaboration is formed by an international team of scientists to investigate experimentally the smallest scales of turbulence. It was funded by the European Union to take advantage of the largest wind tunnel in Europe for fundamental research.

  15. Optical components damage parameters database system

    NASA Astrophysics Data System (ADS)

    Tao, Yizheng; Li, Xinglan; Jin, Yuquan; Xie, Dongmei; Tang, Dingyong

    2012-10-01

    Optical components are key to large-scale laser devices: their load capacity directly determines the device's output capability and depends on many factors. By digitizing the factors that affect load capacity into an optical-component damage-parameter database, the system provides a scientific data basis for assessing component load capacity. Using business-process analysis and a model-driven approach, a component damage-parameter information model and database system were established. Application of the system shows that it meets the business-process and data-management requirements of optical-component damage testing, that component parameters are flexible and configurable, and that the system is simple to use and improves the efficiency of optical-component damage testing.

  16. Scaling Semantic Graph Databases in Size and Performance

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Morari, Alessandro; Castellana, Vito G.; Villa, Oreste

    In this paper we present SGEM, a full software system for accelerating large-scale semantic graph databases on commodity clusters. Unlike current approaches, SGEM addresses semantic graph databases by employing only graph methods at all levels of the stack. On one hand, this allows exploiting the space efficiency of graph data structures and the inherent parallelism of graph algorithms. These features adapt well to the increasing system memory and core counts of modern commodity clusters. On the other hand, however, these systems are optimized for regular computation and batched data transfers, while graph methods usually are irregular and generate fine-grained data accesses with poor spatial and temporal locality. Our framework comprises a SPARQL to data parallel C compiler, a library of parallel graph methods and a custom, multithreaded runtime system. We introduce our stack, motivate its advantages with respect to other solutions and show how we solved the challenges posed by irregular behaviors. We present the results of our software stack on the Berlin SPARQL benchmarks with datasets up to 10 billion triples (a triple corresponds to a graph edge), demonstrating scaling in dataset size and in performance as more nodes are added to the cluster.

  17. The EMBL nucleotide sequence database

    PubMed Central

    Stoesser, Guenter; Baker, Wendy; van den Broek, Alexandra; Camon, Evelyn; Garcia-Pastor, Maria; Kanz, Carola; Kulikova, Tamara; Lombard, Vincent; Lopez, Rodrigo; Parkinson, Helen; Redaschi, Nicole; Sterk, Peter; Stoehr, Peter; Tuli, Mary Ann

    2001-01-01

    The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the European Bioinformatics Institute (EBI) in an international collaboration with the DNA Data Bank of Japan (DDBJ) and GenBank at the NCBI (USA). Data is exchanged amongst the collaborating databases on a daily basis. The major contributors to the EMBL database are individual authors and genome project groups. Webin is the preferred web-based submission system for individual submitters, whilst automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via ftp, email and World Wide Web interfaces. EBI’s Sequence Retrieval System (SRS), a network browser for databanks in molecular biology, integrates and links the main nucleotide and protein databases plus many specialized databases. For sequence similarity searching a variety of tools (e.g. Blitz, Fasta, BLAST) are available which allow external users to compare their own sequences against the latest data in the EMBL Nucleotide Sequence Database and SWISS-PROT. PMID:11125039

  18. Efficient frequent pattern mining algorithm based on node sets in cloud computing environment

    NASA Astrophysics Data System (ADS)

    Billa, V. N. Vinay Kumar; Lakshmanna, K.; Rajesh, K.; Reddy, M. Praveen Kumar; Nagaraja, G.; Sudheer, K.

    2017-11-01

    The ultimate goal of data mining is to extract hidden, decision-relevant information from the large databases collected by an organization, a process that involves many tasks. Mining frequent itemsets is one of the most important tasks for transactional databases. These databases hold data at very large scale, so mining them consumes physical memory and time in proportion to database size; a frequent-pattern mining algorithm is efficient only if it keeps both low. With these points in mind, we propose a system that mines frequent itemsets with optimized memory and time by using cloud computing to parallelize the process and deliver the application as a service. The complete framework uses FIN, a proven, efficient algorithm that works on Nodesets and a POC (pre-order coding) tree. To evaluate the performance of the system, we compare the efficiency of the same algorithm run in a standalone manner and in the cloud computing environment on a real-world traffic-accidents data set. The results show that memory consumption and execution time in the proposed system are much lower than in the standalone system.
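
    As a rough illustration of the mining task itself (not of the FIN/Nodeset algorithm or the cloud deployment described above), the sketch below counts frequent itemsets in a handful of toy transactions with a straightforward Apriori-style level-wise pass; the transaction contents and support threshold are invented.

```python
# Minimal sketch of frequent-itemset mining on a transactional dataset.
# This is a simple Apriori-style level-wise count, not the Nodeset/POC-tree
# FIN algorithm described above; transactions are toy data.
from itertools import combinations
from collections import Counter

transactions = [
    {"wet_road", "night", "speeding"},
    {"wet_road", "night"},
    {"speeding", "night"},
    {"wet_road", "speeding", "night"},
]
min_support = 3  # absolute support threshold

def frequent_itemsets(transactions, min_support):
    # level 1: frequent single items
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]) for i, c in counts.items() if c >= min_support}
    result, k = {}, 1
    while current:
        for itemset in current:
            result[itemset] = sum(1 for t in transactions if itemset <= t)
        # candidate generation: join k-itemsets into (k+1)-itemsets, then prune
        k += 1
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = {c for c in candidates
                   if sum(1 for t in transactions if c <= t) >= min_support}
    return result

for itemset, support in sorted(frequent_itemsets(transactions, min_support).items(),
                               key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```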

  19. Large perceptual distortions of locomotor action space occur in ground-based coordinates: Angular expansion and the large-scale horizontal-vertical illusion.

    PubMed

    Klein, Brennan J; Li, Zhi; Durgin, Frank H

    2016-04-01

    What is the natural reference frame for seeing large-scale spatial scenes in locomotor action space? Prior studies indicate an asymmetric angular expansion in perceived direction in large-scale environments: Angular elevation relative to the horizon is perceptually exaggerated by a factor of 1.5, whereas azimuthal direction is exaggerated by a factor of about 1.25. Here participants made angular and spatial judgments when upright or on their sides to dissociate egocentric from allocentric reference frames. In Experiment 1, it was found that body orientation did not affect the magnitude of the up-down exaggeration of direction, suggesting that the relevant orientation reference frame for this directional bias is allocentric rather than egocentric. In Experiment 2, the comparison of large-scale horizontal and vertical extents was somewhat affected by viewer orientation, but only to the extent necessitated by the classic (5%) horizontal-vertical illusion (HVI) that is known to be retinotopic. Large-scale vertical extents continued to appear much larger than horizontal ground extents when observers lay sideways. When the visual world was reoriented in Experiment 3, the bias remained tied to the ground-based allocentric reference frame. The allocentric HVI is quantitatively consistent with differential angular exaggerations previously measured for elevation and azimuth in locomotor space. (PsycINFO Database Record (c) 2016 APA, all rights reserved).

  20. New Zealand's National Landslide Database

    NASA Astrophysics Data System (ADS)

    Rosser, B.; Dellow, S.; Haubrook, S.; Glassey, P.

    2016-12-01

    Since 1780, landslides have caused an average of about 3 deaths a year in New Zealand and have cost the economy an average of at least NZ$250M/a (0.1% GDP). To understand the risk posed by landslide hazards to society, a thorough knowledge of where, when and why different types of landslides occur is vital. The main objective for establishing the database was to provide a centralised national-scale, publically available database to collate landslide information that could be used for landslide hazard and risk assessment. Design of a national landslide database for New Zealand required consideration of both existing landslide data stored in a variety of digital formats, and future data, yet to be collected. Pre-existing databases were developed and populated with data reflecting the needs of the landslide or hazard project, and the database structures of the time. Bringing these data into a single unified database required a new structure capable of storing and delivering data at a variety of scales and accuracy and with different attributes. A "unified data model" was developed to enable the database to hold old and new landslide data irrespective of scale and method of capture. The database contains information on landslide locations and where available: 1) the timing of landslides and the events that may have triggered them; 2) the type of landslide movement; 3) the volume and area; 4) the source and debris tail; and 5) the impacts caused by the landslide. Information from a variety of sources including aerial photographs (and other remotely sensed data), field reconnaissance and media accounts has been collated and is presented for each landslide along with metadata describing the data sources and quality. There are currently nearly 19,000 landslide records in the database that include point locations, polygons of landslide source and deposit areas, and linear features. Several large datasets are awaiting upload which will bring the total number of landslides to over 100,000. The geo-spatial database is publicly available via the Internet. Software components, including the underlying database (PostGIS), Web Map Server (GeoServer) and web application use open-source software. The hope is that others will add relevant information to the database as well as download the data contained in it.

  1. Historical reconstructions of California wildfires vary by data source

    USGS Publications Warehouse

    Syphard, Alexandra D.; Keeley, Jon E.

    2016-01-01

    Historical data are essential for understanding how fire activity responds to different drivers. It is important that the source of data is commensurate with the spatial and temporal scale of the question addressed, but fire history databases are derived from different sources with different restrictions. In California, a frequently used fire history dataset is the State of California Fire and Resource Assessment Program (FRAP) fire history database, which circumscribes fire perimeters at a relatively fine scale. It includes large fires on both state and federal lands but only covers fires that were mapped or had other spatially explicit data. A different database is the state and federal governments’ annual reports of all fires. They are more complete than the FRAP database but are only spatially explicit to the level of county (California Department of Forestry and Fire Protection – Cal Fire) or forest (United States Forest Service – USFS). We found substantial differences between the FRAP database and the annual summaries, with the largest and most consistent discrepancy being in fire frequency. The FRAP database missed the majority of fires and is thus a poor indicator of fire frequency or indicators of ignition sources. The FRAP database is also deficient in area burned, especially before 1950. Even in contemporary records, the huge number of smaller fires not included in the FRAP database account for substantial cumulative differences in area burned. Wildfires in California account for nearly half of the western United States fire suppression budget. Therefore, the conclusions about data discrepancies and the implications for fire research are of broad importance.

  2. The Amma-Sat Database

    NASA Astrophysics Data System (ADS)

    Ramage, K.; Desbois, M.; Eymard, L.

    2004-12-01

    The African Monsoon Multidisciplinary Analysis project is a French initiative which aims at identifying and analysing in detail the multidisciplinary, multi-scale processes that lead to a better understanding of the physical mechanisms linked to the African Monsoon. The main components of the African Monsoon are: Atmospheric Dynamics, the Continental Water Cycle, Atmospheric Chemistry, Oceanic and Continental Surface Conditions. Satellites contribute to various objectives of the project, both for process analysis and for large-scale, long-term studies: some series of satellites (METEOSAT, NOAA, ...) have been flown for more than 20 years, ensuring good-quality monitoring of some West African atmosphere and surface characteristics. Moreover, several recent missions and projects will strongly improve and complement this survey. The AMMA project offers an opportunity to develop the exploitation of satellite data and to foster collaboration between specialist and non-specialist users. For this purpose, databases are being developed to collect all past and future satellite data related to the African Monsoon. It will then be possible to compare different types of data at different resolutions, and to validate satellite data against in situ measurements or numerical simulations. The AMMA-SAT database's main goal is to offer easy access to satellite data for the AMMA scientific community. The database contains geophysical products estimated from operational or research algorithms and covering the different components of the AMMA project. Nevertheless, the choice has been made to group data within pertinent scales rather than by theme. For this purpose, five regions of interest were defined to extract the data: an area covering the Tropical Atlantic and Africa for large-scale studies, an area covering West Africa for mesoscale studies, and three local areas surrounding sites of in situ observations. Within each of these regions satellite data are projected on a regular grid with a spatial resolution compatible with the spatial variability of the geophysical parameter. Data are stored in NetCDF files to facilitate their use. Satellite products can be selected using several spatial and temporal criteria and ordered through a web interface developed in PHP-MySQL. More common means of access are also available, such as direct FTP or NFS access for identified users. A Live Access Server allows quick visualization of the data. A meta-data catalogue based on the Directory Interchange Format manages the documentation of each satellite product. The database is currently under development, but some products are already available. The database will be complete by the end of 2005.
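
    A minimal sketch of the kind of gridded access such a database is meant to support is shown below, assuming the data are delivered as NetCDF and read with xarray; the file name, variable name, coordinate names and regional bounds are all assumptions, not details of the actual AMMA-SAT products.

```python
# Minimal sketch of regional access to a gridded NetCDF product of the kind
# described above. File name, variable name and coordinates are hypothetical.
import xarray as xr

ds = xr.open_dataset("amma_sat_product.nc")       # hypothetical product file

# West Africa window for mesoscale studies (bounds are illustrative)
west_africa = ds.sel(lat=slice(0, 25), lon=slice(-20, 20))

# monthly mean of a hypothetical geophysical variable on the regular grid
monthly = west_africa["precipitation"].resample(time="1MS").mean()
monthly.to_netcdf("west_africa_monthly.nc")
```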

  3. ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining.

    PubMed

    Huan, Tianxiao; Sivachenko, Andrey Y; Harrison, Scott H; Chen, Jake Y

    2008-08-12

    New systems biology studies require researchers to understand how interplay among myriads of biomolecular entities is orchestrated in order to achieve high-level cellular and physiological functions. Many software tools have been developed in the past decade to help researchers visually navigate large networks of biomolecular interactions with built-in template-based query capabilities. To further advance researchers' ability to interrogate global physiological states of cells through multi-scale visual network explorations, new visualization software tools still need to be developed to empower the analysis. A robust visual data analysis platform driven by database management systems to perform bi-directional data processing-to-visualizations with declarative querying capabilities is needed. We developed ProteoLens as a JAVA-based visual analytic software tool for creating, annotating and exploring multi-scale biological networks. It supports direct database connectivity to either Oracle or PostgreSQL database tables/views, on which SQL statements using both Data Definition Languages (DDL) and Data Manipulation languages (DML) may be specified. The robust query languages embedded directly within the visualization software help users to bring their network data into a visualization context for annotation and exploration. ProteoLens supports graph/network represented data in standard Graph Modeling Language (GML) formats, and this enables interoperation with a wide range of other visual layout tools. The architectural design of ProteoLens enables the de-coupling of complex network data visualization tasks into two distinct phases: 1) creating network data association rules, which are mapping rules between network node IDs or edge IDs and data attributes such as functional annotations, expression levels, scores, synonyms, descriptions etc; 2) applying network data association rules to build the network and perform the visual annotation of graph nodes and edges according to associated data values. We demonstrated the advantages of these new capabilities through three biological network visualization case studies: human disease association network, drug-target interaction network and protein-peptide mapping network. The architectural design of ProteoLens makes it suitable for bioinformatics expert data analysts who are experienced with relational database management to perform large-scale integrated network visual explorations. ProteoLens is a promising visual analytic platform that will facilitate knowledge discoveries in future network and systems biology studies.
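
    The two-phase design described above can be illustrated with a small stand-in (this is not ProteoLens itself, which is a Java tool working against Oracle/PostgreSQL): an association rule is essentially a query mapping node or edge IDs to data attributes, which are then applied to the network. The table names, columns and values below are hypothetical, with sqlite3 and networkx standing in for the real back ends.

```python
# Minimal sketch of the two-phase idea: (1) association rules expressed as
# database queries mapping IDs to attributes, (2) annotation of the network.
import sqlite3
import networkx as nx

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE interactions (a TEXT, b TEXT, score REAL);
    CREATE TABLE annotations  (node TEXT, expression REAL);
    INSERT INTO interactions VALUES ('TP53','MDM2',0.9), ('TP53','EP300',0.7);
    INSERT INTO annotations  VALUES ('TP53',2.1), ('MDM2',0.4), ('EP300',1.3);
""")

# phase 1: association rules as plain SQL queries
edges = conn.execute("SELECT a, b, score FROM interactions").fetchall()
attrs = dict(conn.execute("SELECT node, expression FROM annotations").fetchall())

# phase 2: build the network and attach the associated data values
G = nx.Graph()
G.add_weighted_edges_from(edges, weight="score")
nx.set_node_attributes(G, attrs, name="expression")

print(G.nodes(data=True))
print(G.edges(data=True))
```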

  4. The Génolevures database.

    PubMed

    Martin, Tiphaine; Sherman, David J; Durrens, Pascal

    2011-01-01

    The Génolevures online database (URL: http://www.genolevures.org) stores and provides the data and results obtained by the Génolevures Consortium through several campaigns of genome annotation of the yeasts in the Saccharomycotina subphylum (hemiascomycetes). This database is dedicated to large-scale comparison of these genomes, storing not only the different chromosomal elements detected in the sequences, but also the logical relations between them. The database is divided into a public part, accessible to anyone through Internet, and a private part where the Consortium members make genome annotations with our Magus annotation system; this system is used to annotate several related genomes in parallel. The public database is widely consulted and offers structured data, organized using a REST web site architecture that allows for automated requests. The implementation of the database, as well as its associated tools and methods, is evolving to cope with the influx of genome sequences produced by Next Generation Sequencing (NGS). Copyright © 2011 Académie des sciences. Published by Elsevier SAS. All rights reserved.

  5. Structural systems pharmacology: a new frontier in discovering novel drug targets.

    PubMed

    Tan, Hepan; Ge, Xiaoxia; Xie, Lei

    2013-08-01

    The modern target-based drug discovery process, characterized by the one-drug-one-gene paradigm, has been of limited success. In contrast, phenotype-based screening produces thousands of active compounds but gives no hint as to what their molecular targets are or which ones merit further research. This presents a question: What is a suitable target for an efficient and safe drug? In this paper, we argue that target selection should take into account the proteome-wide energetic and kinetic landscape of drug-target interactions, as well as their cellular and organismal consequences. We propose a new paradigm of structural systems pharmacology to deconvolute the molecular targets of successful drugs as well as to identify druggable targets and their drug-like binders. Here we face two major challenges in structural systems pharmacology: How do we characterize and analyze the structural and energetic origins of drug-target interactions on a proteome scale? How do we correlate the dynamic molecular interactions to their in vivo activity? We will review recent advances in developing new computational tools for biophysics, bioinformatics, chemoinformatics, and systems biology related to the identification of genome-wide target profiles. We believe that the integration of these tools will realize structural systems pharmacology, enabling us to both efficiently develop effective therapeutics for complex diseases and combat drug resistance.

  6. Universal scaling function in discrete time asymmetric exclusion processes

    NASA Astrophysics Data System (ADS)

    Chia, Nicholas; Bundschuh, Ralf

    2005-03-01

    In the universality class of the one dimensional Kardar-Parisi-Zhang surface growth, Derrida and Lebowitz conjectured the universality of not only the scaling exponents, but of an entire scaling function. Since Derrida and Lebowitz' original publication this universality has been verified for a variety of continuous time systems in the KPZ universality class. We study the Derrida-Lebowitz scaling function for multi-particle versions of the discrete time Asymmetric Exclusion Process. We find that in this discrete time system the Derrida-Lebowitz scaling function not only properly characterizes the large system size limit, but even accurately describes surprisingly small systems. These results have immediate applications in searching biological sequence databases.
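
    For readers unfamiliar with the model, the sketch below simulates a discrete-time totally asymmetric exclusion process with parallel update on a ring and measures the mean particle current; it only illustrates the dynamics studied above, not the large-deviation calculation that yields the Derrida-Lebowitz scaling function. System size, density and hop probability are arbitrary choices.

```python
# Minimal sketch of a discrete-time TASEP on a ring with parallel update.
import numpy as np

rng = np.random.default_rng(0)
L, density, p_hop, steps = 100, 0.5, 0.5, 10_000

sites = np.zeros(L, dtype=bool)
sites[rng.choice(L, int(density * L), replace=False)] = True

current = 0
for _ in range(steps):
    right = np.roll(sites, -1)                    # occupancy of the site ahead
    movable = sites & ~right                      # particle with empty site ahead
    hops = movable & (rng.random(L) < p_hop)      # parallel update with prob. p_hop
    sites = (sites & ~hops) | np.roll(hops, 1)    # move the chosen particles
    current += hops.sum()

print("average current per site per step:", current / (L * steps))
```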

  7. A priori and a posteriori investigations for developing large eddy simulations of multi-species turbulent mixing under high-pressure conditions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Borghesi, Giulio; Bellan, Josette, E-mail: josette.bellan@jpl.nasa.gov; Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California 91109-8099

    2015-03-15

    A Direct Numerical Simulation (DNS) database was created representing mixing of species under high-pressure conditions. The configuration considered is that of a temporally evolving mixing layer. The database was examined and analyzed for the purpose of modeling some of the unclosed terms that appear in the Large Eddy Simulation (LES) equations. Several metrics are used to understand the LES modeling requirements. First, a statistical analysis of the DNS-database large-scale flow structures was performed to provide a metric for probing the accuracy of the proposed LES models as the flow fields obtained from accurate LESs should contain structures of morphology statistically similar to those observed in the filtered-and-coarsened DNS (FC-DNS) fields. To characterize the morphology of the large-scale structures, the Minkowski functionals of the iso-surfaces were evaluated for two different fields: the second-invariant of the rate of deformation tensor and the irreversible entropy production rate. To remove the presence of the small flow scales, both of these fields were computed using the FC-DNS solutions. It was found that the large-scale structures of the irreversible entropy production rate exhibit higher morphological complexity than those of the second invariant of the rate of deformation tensor, indicating that the burden of modeling will be on recovering the thermodynamic fields. Second, to evaluate the physical effects which must be modeled at the subfilter scale, an a priori analysis was conducted. This a priori analysis, conducted in the coarse-grid LES regime, revealed that standard closures for the filtered pressure, the filtered heat flux, and the filtered species mass fluxes, in which a filtered function of a variable is equal to the function of the filtered variable, may no longer be valid for the high-pressure flows considered in this study. The terms requiring modeling are the filtered pressure, the filtered heat flux, the filtered pressure work, and the filtered species mass fluxes. Improved models were developed based on a scale-similarity approach and were found to perform considerably better than the classical ones. These improved models were also assessed in an a posteriori study. Different combinations of the standard models and the improved ones were tested. At the relatively small Reynolds numbers achievable in DNS and at the relatively small filter widths used here, the standard models for the filtered pressure, the filtered heat flux, and the filtered species fluxes were found to yield accurate results for the morphology of the large-scale structures present in the flow. Analysis of the temporal evolution of several volume-averaged quantities representative of the mixing layer growth, and of the cross-stream variation of homogeneous-plane averages and second-order correlations, as well as of visualizations, indicated that the models performed equivalently for the conditions of the simulations. The expectation is that at the much larger Reynolds numbers and much larger filter widths used in practical applications, the improved models will have much more accurate performance than the standard ones.

  8. Integration of a neuroimaging processing pipeline into a pan-canadian computing grid

    NASA Astrophysics Data System (ADS)

    Lavoie-Courchesne, S.; Rioux, P.; Chouinard-Decorte, F.; Sherif, T.; Rousseau, M.-E.; Das, S.; Adalat, R.; Doyon, J.; Craddock, C.; Margulies, D.; Chu, C.; Lyttelton, O.; Evans, A. C.; Bellec, P.

    2012-02-01

    The ethos of the neuroimaging field is quickly moving towards the open sharing of resources, including both imaging databases and processing tools. As a neuroimaging database represents a large volume of datasets and as neuroimaging processing pipelines are composed of heterogeneous, computationally intensive tools, such open sharing raises specific computational challenges. This motivates the design of novel dedicated computing infrastructures. This paper describes an interface between PSOM, a code-oriented pipeline development framework, and CBRAIN, a web-oriented platform for grid computing. This interface was used to integrate a PSOM-compliant pipeline for preprocessing of structural and functional magnetic resonance imaging into CBRAIN. We further tested the capacity of our infrastructure to handle a real large-scale project. A neuroimaging database including close to 1000 subjects was preprocessed using our interface and publicly released to help the participants of the ADHD-200 international competition. This successful experiment demonstrated that our integrated grid-computing platform is a powerful solution for high-throughput pipeline analysis in the field of neuroimaging.

  9. The Untapped Promise of Secondary Data Sets in International and Comparative Education Policy Research

    ERIC Educational Resources Information Center

    Chudagr, Amita; Luschei, Thomas F.

    2016-01-01

    The objective of this commentary is to call attention to the feasibility and importance of large-scale, systematic, quantitative analysis in international and comparative education research. We contend that although many existing databases are under- or unutilized in quantitative international-comparative research, these resources present the…

  10. Data Intensive Systems (DIS) Benchmark Performance Summary

    DTIC Science & Technology

    2003-08-01

    models assumed by today's conventional architectures. Such applications include model-based Automatic Target Recognition (ATR), synthetic aperture radar (SAR) codes, large scale dynamic databases/battlefield integration, dynamic sensor-based processing, high-speed cryptanalysis, high speed ... distributed interactive and data intensive simulations, data-oriented problems characterized by pointer-based and other highly irregular data structures

  11. Very Large Scale Distributed Information Processing Systems

    DTIC Science & Technology

    1991-09-27

    USENIX Conference Proceedings, pp. 31-43. USENIX, February 1988. [KLA90] Michael L. Kazar, Bruce W. Leverett, Owen T. Anderson, Vasilis Apostolides, Beth ...

  12. Using National Education Longitudinal Data Sets in School Counseling Research

    ERIC Educational Resources Information Center

    Bryan, Julia A.; Day-Vines, Norma L.; Holcomb-McCoy, Cheryl; Moore-Thomas, Cheryl

    2010-01-01

    National longitudinal databases hold much promise for school counseling researchers. Several of the more frequently used data sets, possible professional implications, and strategies for acquiring training in the use of large-scale national data sets are described. A 6-step process for conducting research with the data sets is explicated:…

  13. Similarity-based modeling in large-scale prediction of drug-drug interactions.

    PubMed

    Vilar, Santiago; Uriarte, Eugenio; Santana, Lourdes; Lorberbaum, Tal; Hripcsak, George; Friedman, Carol; Tatonetti, Nicholas P

    2014-09-01

    Drug-drug interactions (DDIs) are a major cause of adverse drug effects and a public health concern, as they increase hospital care expenses and reduce patients' quality of life. DDI detection is, therefore, an important objective in patient safety, one whose pursuit affects drug development and pharmacovigilance. In this article, we describe a protocol applicable on a large scale to predict novel DDIs based on similarity of drug interaction candidates to drugs involved in established DDIs. The method integrates a reference standard database of known DDIs with drug similarity information extracted from different sources, such as 2D and 3D molecular structure, interaction profile, target and side-effect similarities. The method is interpretable in that it generates drug interaction candidates that are traceable to pharmacological or clinical effects. We describe a protocol with applications in patient safety and preclinical toxicity screening. The time frame to implement this protocol is 5-7 h, with additional time potentially necessary, depending on the complexity of the reference standard DDI database and the similarity measures implemented.
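
    The core similarity step of such a protocol can be sketched as below: a candidate pair is scored by how closely its two drugs resemble the members of known interacting pairs under the Jaccard-Tanimoto measure. The fingerprints and the reference DDI list are toy data, and the published protocol combines several similarity sources (3D, targets, side effects) that are omitted here.

```python
# Minimal sketch of similarity-based DDI candidate scoring with toy data.
import numpy as np

def tanimoto(a, b):
    """Jaccard-Tanimoto similarity of two binary fingerprints."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

fingerprints = {                      # hypothetical 8-bit fingerprints
    "drugA": [1, 0, 1, 1, 0, 0, 1, 0],
    "drugB": [1, 0, 1, 0, 0, 0, 1, 0],
    "drugC": [0, 1, 0, 0, 1, 1, 0, 1],
    "drugD": [0, 1, 0, 0, 1, 0, 0, 1],
}
known_ddis = [("drugA", "drugC")]     # reference standard of established DDIs

def ddi_score(x, y):
    # candidate (x, y) inherits evidence from known pair (a, b) when x ~ a and y ~ b
    best = 0.0
    for a, b in known_ddis:
        for u, v in ((a, b), (b, a)):
            s = min(tanimoto(fingerprints[x], fingerprints[u]),
                    tanimoto(fingerprints[y], fingerprints[v]))
            best = max(best, s)
    return best

print("drugB-drugD candidate DDI score:", round(ddi_score("drugB", "drugD"), 3))
```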

  14. StructRNAfinder: an automated pipeline and web server for RNA families prediction.

    PubMed

    Arias-Carrasco, Raúl; Vásquez-Morán, Yessenia; Nakaya, Helder I; Maracaja-Coutinho, Vinicius

    2018-02-17

    The function of many noncoding RNAs (ncRNAs) depends upon their secondary structures. Over the last decades, several methodologies have been developed to predict such structures or to use them to functionally annotate RNAs into RNA families. However, to fully perform this analysis, researchers must use multiple tools, which requires constant parsing and processing of several intermediate files. This makes the large-scale prediction and annotation of RNAs a daunting task even for researchers with good computational or bioinformatics skills. We present an automated pipeline named StructRNAfinder that predicts and annotates RNA families in transcript or genome sequences. This single tool not only displays the sequence/structural consensus alignments for each RNA family, according to the Rfam database, but also provides a taxonomic overview for each assigned functional RNA. Moreover, we implemented a user-friendly web service that allows researchers to upload their own nucleotide sequences in order to perform the whole analysis. Finally, we provide a stand-alone version of StructRNAfinder to be used in large-scale projects. The tool was developed under the GNU General Public License (GPLv3) and is freely available at http://structrnafinder.integrativebioinformatics.me. The main advantage of StructRNAfinder lies in its large-scale processing and integration of the data obtained by each tool and database employed along the workflow; the many files generated are assembled into user-friendly reports, useful for downstream analyses and data exploration.

  15. Leaf optical properties shed light on foliar trait variability at individual to global scales

    NASA Astrophysics Data System (ADS)

    Shiklomanov, A. N.; Serbin, S.; Dietze, M.

    2016-12-01

    Recent syntheses of large trait databases have contributed immensely to our understanding of drivers of plant function at the global scale. However, the global trade-offs revealed by such syntheses, such as the trade-off between leaf productivity and resilience (i.e. "leaf economics spectrum"), are often absent at smaller scales and fail to correlate with actual functional limitations. An improved understanding of how traits vary within communities, species, and individuals is critical to accurate representations of vegetation ecophysiology and ecological dynamics in ecosystem models. Spectral data from both field observations and remote sensing platforms present a potentially rich and widely available source of information on plant traits. In particular, the inversion of physically-based radiative transfer models (RTMs) is an effective and general method for estimating plant traits from spectral measurements. Here, we apply Bayesian inversion of the PROSPECT leaf RTM to a large database of field spectra and plant traits spanning tropical, temperate, and boreal forests, agricultural plots, arid shrublands, and tundra to identify dominant sources of variability and characterize trade-offs in plant functional traits. By leveraging such a large and diverse dataset, we re-calibrate the empirical absorption coefficients underlying the PROSPECT model and expand its scope to include additional leaf biochemical components, namely leaf nitrogen content. Our work provides a key methodological contribution as a physically-based retrieval of leaf nitrogen from remote sensing observations, and provides substantial insights about trait trade-offs related to plant acclimation, adaptation, and community assembly.

  16. Sachem: a chemical cartridge for high-performance substructure search.

    PubMed

    Kratochvíl, Miroslav; Vondrášek, Jiří; Galgonek, Jakub

    2018-05-23

    Structure search is one of the valuable capabilities of small-molecule databases. Fingerprint-based screening methods are usually employed to enhance the search performance by reducing the number of calls to the verification procedure. In substructure search, fingerprints are designed to capture important structural aspects of the molecule to aid the decision about whether the molecule contains a given substructure. Currently available cartridges typically provide acceptable search performance for processing user queries, but do not scale satisfactorily with dataset size. We present Sachem, a new open-source chemical cartridge that implements two substructure search methods: The first is a performance-oriented reimplementation of substructure indexing based on the OrChem fingerprint, and the second is a novel method that employs newly designed fingerprints stored in inverted indices. We assessed the performance of both methods on small, medium, and large datasets containing 1, 10, and 94 million compounds, respectively. Comparison of Sachem with other freely available cartridges revealed improvements in overall performance, scaling potential and screen-out efficiency. The Sachem cartridge allows efficient substructure searches in databases of all sizes. The sublinear performance scaling of the second method and the ability to efficiently query large amounts of pre-extracted information may together open the door to new applications for substructure searches.
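
    The inverted-index screening idea can be illustrated with a toy in-memory version (not Sachem's actual on-disk structures): each fingerprint bit maps to the set of molecules that contain it, and a substructure query intersects the postings lists of its bits before any expensive verification, as sketched below with invented molecule IDs and bits.

```python
# Minimal sketch of fingerprint screening with an inverted index.
# Fingerprints are represented as sets of "on" bits; data are toy values.
from collections import defaultdict

molecules = {                       # hypothetical molecule -> fingerprint bits
    "mol1": {3, 7, 11},
    "mol2": {3, 7},
    "mol3": {2, 7, 11},
}

# build the inverted index: bit -> ids of molecules having that bit set
index = defaultdict(set)
for mol_id, bits in molecules.items():
    for bit in bits:
        index[bit].add(mol_id)

def screen(query_bits):
    """Molecules whose fingerprints contain every bit of the query substructure."""
    postings = sorted((index.get(b, set()) for b in query_bits), key=len)
    candidates = set(postings[0]) if postings else set(molecules)
    for p in postings[1:]:
        candidates &= p
    return candidates               # still needs exact subgraph verification

print(screen({3, 7}))   # mol1 and mol2 survive screening; mol3 is screened out
```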

  17. Large-Scale Chemical Similarity Networks for Target Profiling of Compounds Identified in Cell-Based Chemical Screens

    PubMed Central

    Lo, Yu-Chen; Senese, Silvia; Li, Chien-Ming; Hu, Qiyang; Huang, Yong; Damoiseaux, Robert; Torres, Jorge Z.

    2015-01-01

    Target identification is one of the most critical steps following cell-based phenotypic chemical screens aimed at identifying compounds with potential uses in cell biology and for developing novel disease therapies. Current in silico target identification methods, including chemical similarity database searches, are limited to single or sequential ligand analysis that have limited capabilities for accurate deconvolution of a large number of compounds with diverse chemical structures. Here, we present CSNAP (Chemical Similarity Network Analysis Pulldown), a new computational target identification method that utilizes chemical similarity networks for large-scale chemotype (consensus chemical pattern) recognition and drug target profiling. Our benchmark study showed that CSNAP can achieve an overall higher accuracy (>80%) of target prediction with respect to representative chemotypes in large (>200) compound sets, in comparison to the SEA approach (60–70%). Additionally, CSNAP is capable of integrating with biological knowledge-based databases (Uniprot, GO) and high-throughput biology platforms (proteomic, genetic, etc) for system-wise drug target validation. To demonstrate the utility of the CSNAP approach, we combined CSNAP's target prediction with experimental ligand evaluation to identify the major mitotic targets of hit compounds from a cell-based chemical screen and we highlight novel compounds targeting microtubules, an important cancer therapeutic target. The CSNAP method is freely available and can be accessed from the CSNAP web server (http://services.mbi.ucla.edu/CSNAP/). PMID:25826798
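
    A stripped-down version of the network step is sketched below: compounds become nodes, edges connect pairs whose fingerprint Tanimoto similarity exceeds a threshold, and connected components serve as consensus chemotypes. The fingerprints, the threshold and the omission of the subsequent target-annotation step are all simplifications relative to CSNAP.

```python
# Minimal sketch of a chemical similarity network with chemotype clusters.
from itertools import combinations
import networkx as nx

fps = {                                   # hypothetical fingerprints as bit sets
    "cmpd1": {1, 4, 9}, "cmpd2": {1, 4, 8, 9},
    "cmpd3": {2, 5, 7}, "cmpd4": {2, 5},
}
threshold = 0.5

def tanimoto(a, b):
    return len(a & b) / len(a | b)

G = nx.Graph()
G.add_nodes_from(fps)
for (u, fu), (v, fv) in combinations(fps.items(), 2):
    s = tanimoto(fu, fv)
    if s >= threshold:
        G.add_edge(u, v, similarity=s)

# each connected component is treated as one consensus chemotype
for i, component in enumerate(nx.connected_components(G), 1):
    print(f"chemotype {i}: {sorted(component)}")
```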

  18. A generic method for improving the spatial interoperability of medical and ecological databases.

    PubMed

    Ghenassia, A; Beuscart, J B; Ficheur, G; Occelli, F; Babykina, E; Chazard, E; Genin, M

    2017-10-03

    The availability of big data in healthcare and the intensive development of data reuse and georeferencing have opened up perspectives for health spatial analysis. However, fine-scale spatial studies of ecological and medical databases are limited by the change of support problem and thus a lack of spatial unit interoperability. The use of spatial disaggregation methods to solve this problem introduces errors into the spatial estimations. Here, we present a generic, two-step method for merging medical and ecological databases that avoids the use of spatial disaggregation methods, while maximizing the spatial resolution. Firstly, a mapping table is created after one or more transition matrices have been defined. The latter link the spatial units of the original databases to the spatial units of the final database. Secondly, the mapping table is validated by (1) comparing the covariates contained in the two original databases, and (2) checking the spatial validity with a spatial continuity criterion and a spatial resolution index. We used our novel method to merge a medical database (the French national diagnosis-related group database, containing 5644 spatial units) with an ecological database (produced by the French National Institute of Statistics and Economic Studies, and containing 36,594 spatial units). The mapping table yielded 5632 final spatial units. The mapping table's validity was evaluated by comparing the number of births in the medical database and the ecological database in each final spatial unit. The median [interquartile range] relative difference was 2.3% [0; 5.7]. The spatial continuity criterion was low (2.4%), and the spatial resolution index was greater than for most French administrative areas. Our innovative approach improves interoperability between medical and ecological databases and facilitates fine-scale spatial analyses. We have shown that disaggregation models and large aggregation techniques are not necessarily the best ways to tackle the change of support problem.
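
    The sketch below illustrates the two steps with pandas on invented unit codes and birth counts: a long-format mapping table plays the role of the transition matrices, and validation compares a covariate aggregated from both sources within each final spatial unit, as in the median relative difference reported above.

```python
# Minimal sketch of mapping-table construction and validation with toy data.
import pandas as pd

# transition matrices expressed as long-format mapping tables
medical_map = pd.DataFrame({"med_unit": ["M1", "M2", "M3"],
                            "final_unit": ["F1", "F1", "F2"]})
eco_map = pd.DataFrame({"eco_unit": ["E1", "E2", "E3", "E4"],
                        "final_unit": ["F1", "F1", "F2", "F2"]})

births_medical = pd.DataFrame({"med_unit": ["M1", "M2", "M3"],
                               "births": [120, 80, 200]})
births_eco = pd.DataFrame({"eco_unit": ["E1", "E2", "E3", "E4"],
                           "births": [115, 82, 101, 95]})

# step 1: carry each source's counts onto the final spatial units
med = births_medical.merge(medical_map).groupby("final_unit")["births"].sum()
eco = births_eco.merge(eco_map).groupby("final_unit")["births"].sum()

# step 2: relative difference per final spatial unit (the study reports its median)
rel_diff = ((med - eco).abs() / med * 100).rename("relative_difference_%")
print(pd.concat([med.rename("medical"), eco.rename("ecological"), rel_diff], axis=1))
```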

  19. The use of data from national and other large-scale user experience surveys in local quality work: a systematic review.

    PubMed

    Haugum, Mona; Danielsen, Kirsten; Iversen, Hilde Hestad; Bjertnaes, Oyvind

    2014-12-01

    An important goal for national and large-scale surveys of user experiences is quality improvement. However, large-scale surveys are normally conducted by a professional external surveyor, creating an institutionalized division between the measurement of user experiences and the quality work that is performed locally. The aim of this study was to identify and describe scientific studies related to the use of national and large-scale surveys of user experiences in local quality work. Ovid EMBASE, Ovid MEDLINE, Ovid PsycINFO and the Cochrane Database of Systematic Reviews. Scientific publications about user experiences and satisfaction about the extent to which data from national and other large-scale user experience surveys are used for local quality work in the health services. Themes of interest were identified and a narrative analysis was undertaken. Thirteen publications were included, all differed substantially in several characteristics. The results show that large-scale surveys of user experiences are used in local quality work. The types of follow-up activity varied considerably from conducting a follow-up analysis of user experience survey data to information sharing and more-systematic efforts to use the data as a basis for improving the quality of care. This review shows that large-scale surveys of user experiences are used in local quality work. However, there is a need for more, better and standardized research in this field. The considerable variation in follow-up activities points to the need for systematic guidance on how to use data in local quality work. © The Author 2014. Published by Oxford University Press in association with the International Society for Quality in Health Care; all rights reserved.

  20. Statistical Downscaling in Multi-dimensional Wave Climate Forecast

    NASA Astrophysics Data System (ADS)

    Camus, P.; Méndez, F. J.; Medina, R.; Losada, I. J.; Cofiño, A. S.; Gutiérrez, J. M.

    2009-04-01

    Wave climate at a particular site is defined by the statistical distribution of sea state parameters, such as significant wave height, mean wave period, mean wave direction, wind velocity, wind direction and storm surge. Nowadays, long-term time series of these parameters are available from reanalysis databases obtained by numerical models. The Self-Organizing Map (SOM) technique is applied to characterize multi-dimensional wave climate, obtaining the relevant "wave types" spanning the historical variability. This technique summarizes the multiple dimensions of wave climate in terms of a set of clusters projected onto a low-dimensional lattice with a spatial organization, providing Probability Density Functions (PDFs) on the lattice. On the other hand, wind and storm surge depend on instantaneous local large-scale sea level pressure (SLP) fields, while waves depend on the recent history of these fields (say, 1 to 5 days). Thus, these variables are associated with large-scale atmospheric circulation patterns. In this work, a nearest-neighbors analog method is used to predict monthly multi-dimensional wave climate. This method establishes relationships between the large-scale atmospheric circulation patterns from numerical models (SLP fields as predictors) and local wave databases of observations (monthly wave climate SOM PDFs as predictand) to set up statistical models. A wave reanalysis database, developed by Puertos del Estado (Ministerio de Fomento), is used as the historical time series of local variables. The simultaneous SLP fields calculated by the NCEP atmospheric reanalysis are used as predictors. Several applications with different sizes of the sea level pressure grid and different temporal resolutions are compared to obtain the statistical model that best represents the monthly wave climate at a particular site. In this work we examine the potential skill of this downscaling approach under perfect-model conditions, but we will also analyze the suitability of this methodology for seasonal forecasting and for long-term climate-change scenario projections of wave climate.
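
    A minimal sketch of the nearest-neighbour analog step is given below: for a new monthly SLP field, the k most similar historical fields are found and their wave-climate descriptors averaged. The predictor and predictand arrays are random placeholders rather than NCEP or Puertos del Estado data, and the choice of k and of Euclidean distance are assumptions.

```python
# Minimal sketch of the nearest-neighbour analog prediction step with toy data.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n_months, n_gridpoints, n_clusters, k = 240, 500, 25, 10

slp_fields = rng.normal(size=(n_months, n_gridpoints))          # flattened SLP maps
wave_pdfs = rng.dirichlet(np.ones(n_clusters), size=n_months)   # monthly SOM PDFs

nn = NearestNeighbors(n_neighbors=k).fit(slp_fields)

new_slp = rng.normal(size=(1, n_gridpoints))                    # month to downscale
_, idx = nn.kneighbors(new_slp)
predicted_pdf = wave_pdfs[idx[0]].mean(axis=0)                  # analog average

print("predicted wave-type probabilities sum to", predicted_pdf.sum().round(3))
```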

  1. Evaluation of Tsunami Run-Up on Coastal Areas at Regional Scale

    NASA Astrophysics Data System (ADS)

    González, M.; Aniel-Quiroga, Í.; Gutiérrez, O.

    2017-12-01

    Tsunami hazard assessment is tackled by means of numerical simulations, giving as a result the areas flooded inland by the tsunami wave. This requires input data such as the high-resolution topobathymetry of the study area, the earthquake focal-mechanism parameters, etc. The computational cost of these simulations is still excessive. An important restriction for the elaboration of large-scale maps at national or regional scale is the reconstruction of high-resolution topobathymetry in the coastal zone. An alternative and traditional method consists of the application of empirical-analytical formulations to calculate run-up at several coastal profiles (e.g. Synolakis, 1987), combined with numerical simulations offshore that do not include coastal inundation. In this case, the numerical simulations are faster, but limitations are added because the coastal bathymetric profiles are very simply idealized. In this work, we present a complementary methodology based on a hybrid numerical model, formed by two models that were coupled ad hoc for this work: a non-linear shallow water equations model (NLSWE) for the offshore part of the propagation and a Volume of Fluid model (VOF) for the areas near the coast and inland, applying each numerical scheme where it better reproduces the tsunami wave. The run-up of a tsunami scenario is obtained by applying the coupled model to an ad-hoc numerical flume. To design this methodology, hundreds of worldwide topobathymetric profiles have been parameterized, using 5 parameters (2 depths and 3 slopes). In addition, tsunami waves have also been parameterized by their height and period. As an application of the numerical flume methodology, the parameterized coastal profiles and tsunami waves have been combined to build a populated database of run-up calculations. The combination was tackled by means of numerical simulations in the numerical flume. The result is a tsunami run-up database that considers realistic profile shapes, realistic tsunami waves, and optimized numerical simulations. This database allows the calculation of the run-up of any new tsunami wave by interpolation on the database, in a short period of time, based on the tsunami wave characteristics provided as output of the NLSWE model along the coast of a large-scale domain (regional or national scale).
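
    Once such a run-up database exists, querying it amounts to interpolation in the parameter space, as sketched below with SciPy; the parameterization is reduced to three dimensions for brevity (the study uses five profile parameters plus the wave parameters), and the synthetic "database" is a stand-in response surface, not output of the coupled NLSWE/VOF flume.

```python
# Minimal sketch of querying a precomputed run-up database by interpolation.
# All values are placeholders, not results from the study.
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(2)

# precomputed database: (slope, H, T) -> run-up
params = np.column_stack([rng.uniform(0.01, 0.1, 500),   # beach slope
                          rng.uniform(0.5, 8.0, 500),    # wave height H (m)
                          rng.uniform(300, 1800, 500)])  # wave period T (s)
runup = 2.0 * params[:, 1] * (params[:, 0] ** 0.25)      # stand-in response surface

interp = LinearNDInterpolator(params, runup)

# new tsunami wave at a coastal point, as delivered by a regional NLSWE run
new_case = np.array([[0.05, 3.0, 900.0]])
print("interpolated run-up estimate (m):", float(interp(new_case)[0]))
```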

  2. The development of large-scale de-identified biomedical databases in the age of genomics-principles and challenges.

    PubMed

    Dankar, Fida K; Ptitsyn, Andrey; Dankar, Samar K

    2018-04-10

    Contemporary biomedical databases include a wide range of information types from various observational and instrumental sources. Among the most important features that unite biomedical databases across the field are high volume of information and high potential to cause damage through data corruption, loss of performance, and loss of patient privacy. Thus, issues of data governance and privacy protection are essential for the construction of data depositories for biomedical research and healthcare. In this paper, we discuss various challenges of data governance in the context of population genome projects. The various challenges along with best practices and current research efforts are discussed through the steps of data collection, storage, sharing, analysis, and knowledge dissemination.

  3. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution.

    PubMed

    Gu, Xun; Wang, Yufeng; Gu, Jianying

    2002-06-01

    The classical (two-round) hypothesis of vertebrate genome duplication proposes two successive whole-genome duplication(s) (polyploidizations) predating the origin of fishes, a view now being seriously challenged. As the debate largely concerns the relative merits of the 'big-bang mode' theory (large-scale duplication) and the 'continuous mode' theory (constant creation by small-scale duplications), we tested whether a significant proportion of paralogous genes in the contemporary human genome was indeed generated in the early stage of vertebrate evolution. After an extensive search of major databases, we dated 1,739 gene duplication events from the phylogenetic analysis of 749 vertebrate gene families. We found a pattern characterized by two waves (I, II) and an ancient component. Wave I represents a recent gene family expansion by tandem or segmental duplications, whereas wave II, a rapid paralogous gene increase in the early stage of vertebrate evolution, supports the idea of genome duplication(s) (the big-bang mode). Further analysis indicated that large- and small-scale gene duplications both make a significant contribution during the early stage of vertebrate evolution to build the current hierarchy of the human proteome.

  4. A Matter of Time: Faster Percolator Analysis via Efficient SVM Learning for Large-Scale Proteomics.

    PubMed

    Halloran, John T; Rocke, David M

    2018-05-04

    Percolator is an important tool for greatly improving the results of a database search and subsequent downstream analysis. Using support vector machines (SVMs), Percolator recalibrates peptide-spectrum matches based on the learned decision boundary between targets and decoys. To improve analysis time for large-scale data sets, we update Percolator's SVM learning engine through software and algorithmic optimizations rather than heuristic approaches that necessitate the careful study of their impact on learned parameters across different search settings and data sets. We show that by optimizing Percolator's original learning algorithm, l2-SVM-MFN, large-scale SVM learning requires only about a third of the original runtime. Furthermore, we show that by employing the widely used Trust Region Newton (TRON) algorithm instead of l2-SVM-MFN, large-scale Percolator SVM learning is reduced to only about a fifth of the original runtime. Importantly, these speedups only affect the speed at which Percolator converges to a global solution and do not alter recalibration performance. The upgraded versions of both l2-SVM-MFN and TRON are optimized within the Percolator codebase for multithreaded and single-thread use and are available under the Apache license at bitbucket.org/jthalloran/percolator_upgrade.
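
    The learning task itself can be sketched with scikit-learn's liblinear-backed LinearSVC as a stand-in for the l2-SVM-MFN and TRON solvers inside Percolator; the PSM feature matrix below is synthetic, and a real run would add the cross-validation scheme Percolator uses rather than training and scoring on the same data.

```python
# Minimal sketch of target/decoy rescoring with a linear SVM (not Percolator).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
n_targets, n_decoys, n_features = 5000, 5000, 20

# decoys drawn from a null distribution, targets from a shifted mixture
decoys = rng.normal(0.0, 1.0, size=(n_decoys, n_features))
targets = np.vstack([rng.normal(0.8, 1.0, size=(n_targets // 2, n_features)),
                     rng.normal(0.0, 1.0, size=(n_targets // 2, n_features))])
X = np.vstack([targets, decoys])
y = np.concatenate([np.ones(n_targets), -np.ones(n_decoys)])

svm = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
rescored = svm.decision_function(X)          # recalibrated PSM scores

print("top-scoring PSMs that are targets:",
      int((y[np.argsort(-rescored)[:1000]] == 1).sum()), "/ 1000")
```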

  5. Levelling and merging of two discrete national-scale geochemical databases: A case study showing the surficial expression of metalliferous black shales

    USGS Publications Warehouse

    Smith, Steven M.; Neilson, Ryan T.; Giles, Stuart A.

    2015-01-01

    Government-sponsored, national-scale, soil and sediment geochemical databases are used to estimate regional and local background concentrations for environmental issues, identify possible anthropogenic contamination, estimate mineral endowment, explore for new mineral deposits, evaluate nutrient levels for agriculture, and establish concentration relationships with human or animal health. Because of these different uses, it is difficult for any single database to accommodate all the needs of each client. Smith et al. (2013, p. 168) reviewed six national-scale soil and sediment geochemical databases for the United States (U.S.) and, for each, evaluated “its appropriateness as a national-scale geochemical database and its usefulness for national-scale geochemical mapping.” Each of the evaluated databases has strengths and weaknesses that were listed in that review. Two of these U.S. national-scale geochemical databases are similar in their sample media and collection protocols but have different strengths—primarily sampling density and analytical consistency. This project was implemented to determine whether those databases could be merged to produce a combined dataset that could be used for mineral resource assessments. The utility of the merged database was tested to see whether mapped distributions could identify metalliferous black shales at a national scale.

  6. The thermodynamic scale of inorganic crystalline metastability

    PubMed Central

    Sun, Wenhao; Dacek, Stephen T.; Ong, Shyue Ping; Hautier, Geoffroy; Jain, Anubhav; Richards, William D.; Gamst, Anthony C.; Persson, Kristin A.; Ceder, Gerbrand

    2016-01-01

    The space of metastable materials offers promising new design opportunities for next-generation technological materials, such as complex oxides, semiconductors, pharmaceuticals, steels, and beyond. Although metastable phases are ubiquitous in both nature and technology, only a heuristic understanding of their underlying thermodynamics exists. We report a large-scale data-mining study of the Materials Project, a high-throughput database of density functional theory–calculated energetics of Inorganic Crystal Structure Database structures, to explicitly quantify the thermodynamic scale of metastability for 29,902 observed inorganic crystalline phases. We reveal the influence of chemistry and composition on the accessible thermodynamic range of crystalline metastability for polymorphic and phase-separating compounds, yielding new physical insights that can guide the design of novel metastable materials. We further assert that not all low-energy metastable compounds can necessarily be synthesized, and propose a principle of ‘remnant metastability’—that observable metastable crystalline phases are generally remnants of thermodynamic conditions where they were once the lowest free-energy phase. PMID:28138514
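
    The quantity being data-mined here, the energy above the convex hull, can be illustrated with a toy calculation. The sketch below is a simplified stand-in for the study's Materials Project analysis: it uses made-up formation energies for a hypothetical binary A-B system and evaluates how far each phase sits above the lower convex hull of formation energy versus composition.

    ```python
    # Minimal sketch: energy above the convex hull for a hypothetical binary A-B
    # system, illustrating the "thermodynamic scale of metastability" idea.
    # Compositions are the fraction of B; formation energies are in eV/atom and
    # are made-up values, not Materials Project data.
    import itertools

    phases = {                       # name: (x_B, formation energy in eV/atom)
        "A":       (0.00,  0.000),
        "A3B":     (0.25, -0.310),
        "AB":      (0.50, -0.250),
        "AB_poly": (0.50, -0.180),   # a metastable polymorph at the same composition
        "AB3":     (0.75, -0.150),
        "B":       (1.00,  0.000),
    }

    def hull_energy(x, points):
        """Lower convex-hull energy at composition x (minimum over bracketing pairs)."""
        best = float("inf")
        for (x1, e1), (x2, e2) in itertools.combinations_with_replacement(points, 2):
            lo, hi = min(x1, x2), max(x1, x2)
            if lo <= x <= hi:
                t = 0.0 if hi == lo else (x - lo) / (hi - lo)
                e_lo, e_hi = (e1, e2) if x1 <= x2 else (e2, e1)
                best = min(best, (1 - t) * e_lo + t * e_hi)
        return best

    points = list(phases.values())
    for name, (x, e) in phases.items():
        e_hull = hull_energy(x, points)
        print(f"{name:8s} E_above_hull = {e - e_hull:+.3f} eV/atom")
    ```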

  7. Asymmetric author-topic model for knowledge discovering of big data in toxicogenomics.

    PubMed

    Chung, Ming-Hua; Wang, Yuping; Tang, Hailin; Zou, Wen; Basinger, John; Xu, Xiaowei; Tong, Weida

    2015-01-01

    The advancement of high-throughput screening technologies facilitates the generation of massive amounts of biological data, a big data phenomenon in biomedical science. Yet, researchers still heavily rely on keyword search and/or literature review to navigate the databases, and analyses are often done at a rather small scale. As a result, the rich information of a database has not been fully utilized, particularly the information embedded in the interactions between data points, which is largely ignored and buried. For the past 10 years, probabilistic topic modeling has been recognized as an effective machine learning algorithm to annotate the hidden thematic structure of massive collections of documents. The analogy between a text corpus and large-scale genomic data enables the application of text mining tools, like probabilistic topic models, to explore hidden patterns in genomic data and, by extension, altered biological functions. In this paper, we developed a generalized probabilistic topic model to analyze a toxicogenomics dataset that consists of a large number of gene expression profiles from rat livers treated with drugs at multiple doses and time points. We discovered hidden patterns in gene expression associated with the effects of dose and time point of treatment. Finally, we illustrated the ability of our model to identify evidence supporting a potential reduction in animal use.
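
    A generic flavour of this idea can be sketched with a standard (symmetric) latent Dirichlet allocation model, treating each treated sample (a dose/time-point combination) as a "document" and genes as "words". This is only an illustration on synthetic counts, not the asymmetric author-topic model developed in the paper.

    ```python
    # Minimal sketch: standard LDA on a gene-expression-like count matrix,
    # with samples (dose/time-point treatments) as documents and genes as words.
    # A generic stand-in for the asymmetric author-topic model; data are synthetic.
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(42)
    n_samples, n_genes, n_topics = 60, 500, 5
    counts = rng.poisson(lam=3.0, size=(n_samples, n_genes))  # pseudo "read counts"

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(counts)   # per-sample topic proportions
    topic_gene = lda.components_            # per-topic gene weights

    # Top-weighted genes define each hidden "expression theme".
    top = np.argsort(-topic_gene, axis=1)[:, :10]
    for k, genes in enumerate(top):
        print(f"topic {k}: top gene indices {genes.tolist()}")
    ```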

  8. A De-Novo Genome Analysis Pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies.

    PubMed

    Thakur, Shalabh; Guttman, David S

    2016-06-30

    Comparative analysis of whole genome sequence data from closely related prokaryotic species or strains is becoming an increasingly important and accessible approach for addressing both fundamental and applied biological questions. While a number of excellent tools have been developed for performing this task, most scale poorly when faced with hundreds of genome sequences, and many require extensive manual curation. We have developed a de-novo genome analysis pipeline (DeNoGAP) for the automated, iterative and high-throughput analysis of data from comparative genomics projects involving hundreds of whole genome sequences. The pipeline is designed to perform reference-assisted and de novo gene prediction, homolog protein family assignment, ortholog prediction, functional annotation, and pan-genome analysis using a range of proven tools and databases. While most existing methods scale quadratically with the number of genomes since they rely on pairwise comparisons among predicted protein sequences, DeNoGAP scales linearly since the homology assignment is based on iteratively refined hidden Markov models. This iterative clustering strategy enables DeNoGAP to handle a very large number of genomes using minimal computational resources. Moreover, the modular structure of the pipeline permits easy updates as new analysis programs become available. DeNoGAP integrates bioinformatics tools and databases for comparative analysis of a large number of genomes. The pipeline offers tools and algorithms for annotation and analysis of completed and draft genome sequences. The pipeline was developed using Perl, BioPerl and SQLite on Ubuntu Linux version 12.04 LTS. Currently, the software package is accompanied by a script for automated installation of the necessary external programs on Ubuntu Linux; however, the pipeline should also be compatible with other Linux and Unix systems once the necessary external programs are installed. DeNoGAP is freely available at https://sourceforge.net/projects/denogap/.

  9. Lightweight genome viewer: portable software for browsing genomics data in its chromosomal context

    PubMed Central

    Faith, Jeremiah J; Olson, Andrew J; Gardner, Timothy S; Sachidanandam, Ravi

    2007-01-01

    Background: Lightweight genome viewer (lwgv) is a web-based tool for visualization of sequence annotations in their chromosomal context. It performs most of the functions of larger genome browsers, while relying on standard flat-file formats and bypassing the database needs of most visualization tools. Visualization as an aid to discovery requires display of novel data in conjunction with static annotations in their chromosomal context. With database-based systems, displaying dynamic results requires temporary tables that need to be tracked for removal. Results: lwgv simplifies the visualization of user-generated results on a local computer. The dynamic results of these analyses are written to transient files, which can import static content from a more permanent file. lwgv is currently used in many different applications, from whole genome browsers to single-gene RNAi design visualization, demonstrating its applicability in a large variety of contexts and scales. Conclusion: lwgv provides a lightweight alternative to large genome browsers for visualizing biological annotations and dynamic analyses in their chromosomal context. It is particularly suited for applications ranging from short sequences to medium-sized genomes when the creation and maintenance of a large software and database infrastructure is not necessary or desired. PMID:17877794

  10. Lightweight genome viewer: portable software for browsing genomics data in its chromosomal context.

    PubMed

    Faith, Jeremiah J; Olson, Andrew J; Gardner, Timothy S; Sachidanandam, Ravi

    2007-09-18

    Lightweight genome viewer (lwgv) is a web-based tool for visualization of sequence annotations in their chromosomal context. It performs most of the functions of larger genome browsers, while relying on standard flat-file formats and bypassing the database needs of most visualization tools. Visualization as an aid to discovery requires display of novel data in conjunction with static annotations in their chromosomal context. With database-based systems, displaying dynamic results requires temporary tables that need to be tracked for removal. lwgv simplifies the visualization of user-generated results on a local computer. The dynamic results of these analyses are written to transient files, which can import static content from a more permanent file. lwgv is currently used in many different applications, from whole genome browsers to single-gene RNAi design visualization, demonstrating its applicability in a large variety of contexts and scales. lwgv provides a lightweight alternative to large genome browsers for visualizing biological annotations and dynamic analyses in their chromosomal context. It is particularly suited for applications ranging from short sequences to medium-sized genomes when the creation and maintenance of a large software and database infrastructure is not necessary or desired.

  11. BIG: a large-scale data integration tool for renal physiology.

    PubMed

    Zhao, Yue; Yang, Chin-Rang; Raghuram, Viswanathan; Parulekar, Jaya; Knepper, Mark A

    2016-10-01

    Due to recent advances in high-throughput techniques, we and others have generated multiple proteomic and transcriptomic databases to describe and quantify gene expression, protein abundance, or cellular signaling on the scale of the whole genome/proteome in kidney cells. The existence of so much data from diverse sources raises the following question: "How can researchers find information efficiently for a given gene product over all of these data sets without searching each data set individually?" This is the type of problem that has motivated the "Big-Data" revolution in Data Science, which has driven progress in fields such as marketing. Here we present an online Big-Data tool called BIG (Biological Information Gatherer) that allows users to submit a single online query to obtain all relevant information from all indexed databases. BIG is accessible at http://big.nhlbi.nih.gov/.

  12. Coincident scales of forest feedback on climate and conservation in a diversity hot spot

    PubMed Central

    Webb, Thomas J; Gaston, Kevin J; Hannah, Lee; Ian Woodward, F

    2005-01-01

    The dynamic relationship between vegetation and climate is now widely acknowledged. Climate influences the distribution of vegetation; and through a number of feedback mechanisms vegetation affects climate. This implies that land-use changes such as deforestation will have climatic consequences. However, the spatial scales at which such feedbacks occur remain largely unknown. Here, we use a large database of precipitation and tree cover records for an area of the biodiversity-rich Atlantic forest region in south eastern Brazil to investigate the forest–rainfall feedback at a range of spatial scales from ca 10^1–10^4 km2. We show that the strength of the feedback increases up to scales of at least 10^3 km2, with the climate at a particular locality influenced by the pattern of landcover extending over a large area. Thus, smaller forest fragments, even if well protected, may suffer degradation due to the climate responding to land-use change in the surrounding area. Atlantic forest vertebrate taxa also require large areas of forest to support viable populations. Areas of forest of ca 10^3 km2 would be large enough to support such populations at the same time as minimizing the risk of climatic feedbacks resulting from deforestation. PMID:16608697

  13. Coincident scales of forest feedback on climate and conservation in a diversity hot spot.

    PubMed

    Webb, Thomas J; Gaston, Kevin J; Hannah, Lee; Ian Woodward, F

    2006-03-22

    The dynamic relationship between vegetation and climate is now widely acknowledged. Climate influences the distribution of vegetation; and through a number of feedback mechanisms vegetation affects climate. This implies that land-use changes such as deforestation will have climatic consequences. However, the spatial scales at which such feedbacks occur remain largely unknown. Here, we use a large database of precipitation and tree cover records for an area of the biodiversity-rich Atlantic forest region in south eastern Brazil to investigate the forest-rainfall feedback at a range of spatial scales from ca 10^1-10^4 km2. We show that the strength of the feedback increases up to scales of at least 10^3 km2, with the climate at a particular locality influenced by the pattern of landcover extending over a large area. Thus, smaller forest fragments, even if well protected, may suffer degradation due to the climate responding to land-use change in the surrounding area. Atlantic forest vertebrate taxa also require large areas of forest to support viable populations. Areas of forest of ca 10^3 km2 would be large enough to support such populations at the same time as minimizing the risk of climatic feedbacks resulting from deforestation.

  14. Administrative Databases in Orthopaedic Research: Pearls and Pitfalls of Big Data.

    PubMed

    Patel, Alpesh A; Singh, Kern; Nunley, Ryan M; Minhas, Shobhit V

    2016-03-01

    The drive for evidence-based decision-making has highlighted the shortcomings of traditional orthopaedic literature. Although high-quality, prospective, randomized studies in surgery are the benchmark in orthopaedic literature, they are often limited by size, scope, cost, time, and ethical concerns and may not be generalizable to larger populations. Given these restrictions, there is a growing trend toward the use of large administrative databases to investigate orthopaedic outcomes. These datasets afford the opportunity to identify large numbers of patients across a broad spectrum of comorbidities, providing information regarding disparities in care and outcomes, preoperative risk stratification parameters for perioperative morbidity and mortality, and national epidemiologic rates and trends. Although these databases are powerful in terms of their impact, potential problems include administrative data that are at risk of clerical inaccuracies, recording bias secondary to financial incentives, temporal changes in billing codes, a lack of numerous clinically relevant variables and orthopaedic-specific outcomes, and the absolute requirement of an experienced epidemiologist and/or statistician when evaluating results and controlling for confounders. Despite these drawbacks, administrative database studies are fundamental and powerful tools in assessing outcomes on a national scale and will likely be of substantial assistance in the future of orthopaedic research.

  15. JEnsembl: a version-aware Java API to Ensembl data systems.

    PubMed

    Paterson, Trevor; Law, Andy

    2012-11-01

    The Ensembl Project provides release-specific Perl APIs for efficient high-level programmatic access to data stored in various Ensembl database schema. Although Perl scripts are perfectly suited for processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications or for embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-oriented development of new bioinformatics tools with which to access, analyse and visualize Ensembl data. The JEnsembl API implementation provides basic data retrieval and manipulation functionality from the Core, Compara and Variation databases for all species in Ensembl and EnsemblGenomes and is a platform for the development of a richer API to Ensembl datasources. The JEnsembl architecture uses a text-based configuration module to provide evolving, versioned mappings from database schema to code objects. A single installation of the JEnsembl API can therefore simultaneously and transparently connect to current and previous database instances (such as those in the public archive), thus facilitating better analysis repeatability and allowing 'through time' comparative analyses to be performed. Project development, released code libraries, Maven repository and documentation are hosted at SourceForge (http://jensembl.sourceforge.net).

  16. CellLineNavigator: a workbench for cancer cell line analysis

    PubMed Central

    Krupp, Markus; Itzel, Timo; Maass, Thorsten; Hildebrandt, Andreas; Galle, Peter R.; Teufel, Andreas

    2013-01-01

    The CellLineNavigator database, freely available at http://www.medicalgenomics.org/celllinenavigator, is a web-based workbench for large-scale comparisons of a large collection of diverse cell lines. It aims to support experimental design in the fields of genomics, systems biology and translational biomedical research. Currently, this compendium holds genome-wide expression profiles of 317 different cancer cell lines, categorized into 57 different pathological states and 28 individual tissues. To enlarge the scope of CellLineNavigator, the database was furthermore closely linked to commonly used bioinformatics databases and knowledge repositories. To ensure easy data access and searchability, a simple data interface and an intuitive querying interface were implemented. It allows the user to explore and filter gene expression, focusing on pathological or physiological conditions. For a more complex search, the advanced query interface may be used to query for (i) differentially expressed genes; (ii) pathological or physiological conditions; or (iii) gene names or functional attributes, such as Kyoto Encyclopaedia of Genes and Genomes pathway maps. These queries may also be combined. Finally, CellLineNavigator allows additional advanced analysis of differentially regulated genes by a direct link to the Database for Annotation, Visualization and Integrated Discovery (DAVID) Bioinformatics Resources. PMID:23118487

  17. The Proximal Lilly Collection: Mapping, Exploring and Exploiting Feasible Chemical Space.

    PubMed

    Nicolaou, Christos A; Watson, Ian A; Hu, Hong; Wang, Jibo

    2016-07-25

    Venturing into the immensity of the small molecule universe to identify novel chemical structure is a much discussed objective of many methods proposed by the chemoinformatics community. To this end, numerous approaches using techniques from the fields of computational de novo design, virtual screening and reaction informatics, among others, have been proposed. Although in principle this objective is commendable, in practice there are several obstacles to useful exploitation of the chemical space. Prime among them are the sheer number of theoretically feasible compounds and the practical concern regarding the synthesizability of the chemical structures conceived using in silico methods. We present the Proximal Lilly Collection initiative implemented at Eli Lilly and Co. with the aims to (i) define the chemical space of small, drug-like compounds that could be synthesized using in-house resources and (ii) facilitate access to compounds in this large space for the purposes of ongoing drug discovery efforts. The implementation of PLC relies on coupling access to available synthetic knowledge and resources with chemo/reaction informatics techniques and tools developed for this purpose. We describe in detail the computational framework supporting this initiative and elaborate on the characteristics of the PLC virtual collection of compounds. As an example of the opportunities provided to drug discovery researchers by easy access to a large, realistically feasible virtual collection such as the PLC, we describe a recent application of the technology that led to the discovery of selective kinase inhibitors.
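
    Screening a virtual collection of this kind typically starts with cheap property filters. The sketch below shows a Lipinski-style drug-likeness filter with RDKit; the SMILES strings and cut-offs are illustrative only and do not reproduce the PLC's actual filtering rules.

    ```python
    # Minimal sketch: Lipinski-style drug-likeness filtering of a small virtual
    # library with RDKit. SMILES and cut-offs are illustrative, not the PLC filters.
    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors, Lipinski

    virtual_library = [
        "CC(=O)Oc1ccccc1C(=O)O",          # aspirin
        "CCN(CC)CCNC(=O)c1ccc(N)cc1",     # procainamide
        "CCCCCCCCCCCCCCCCCC(=O)O",        # stearic acid
    ]

    def passes_lipinski(mol):
        """Rule-of-five style filter: molecular weight, logP, H-bond donors/acceptors."""
        return (Descriptors.MolWt(mol) <= 500
                and Crippen.MolLogP(mol) <= 5
                and Lipinski.NumHDonors(mol) <= 5
                and Lipinski.NumHAcceptors(mol) <= 10)

    for smiles in virtual_library:
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:
            print(smiles, "PASS" if passes_lipinski(mol) else "FAIL")
    ```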

  18. Protocol for developing a Database of Zoonotic disease Research in India (DoZooRI).

    PubMed

    Chatterjee, Pranab; Bhaumik, Soumyadeep; Chauhan, Abhimanyu Singh; Kakkar, Manish

    2017-12-10

    Zoonotic and emerging infectious diseases (EIDs) represent a public health threat that has been acknowledged only recently, although they have been on the rise for the past several decades. On average, one pathogen has emerged or re-emerged on a global scale every year since the Second World War. Low/middle-income countries such as India bear a significant burden of zoonotic diseases and EIDs. We propose that the creation of a database of published, peer-reviewed research will open up avenues for evidence-based policymaking for targeted prevention and control of zoonoses. A large-scale systematic mapping of the published peer-reviewed research conducted in India will be undertaken. All published research will be included in the database, without any prejudice for quality screening, to broaden the scope of included studies. Structured search strategies will be developed for priority zoonotic diseases (leptospirosis, rabies, anthrax, brucellosis, cysticercosis, salmonellosis, bovine tuberculosis, Japanese encephalitis and rickettsial infections), and multiple databases will be searched for studies conducted in India. The database will be managed and hosted on a cloud-based platform called Rayyan. Individual studies will be tagged based on key preidentified parameters (disease, study design, study type, location, randomisation status and interventions, host involvement and others, as applicable). The database will incorporate already published studies, obviating the need for additional ethical clearances. The database will be made available online, and in collaboration with multisectoral teams, domains of enquiry will be identified and subsequent research questions will be raised. The database will be queried for these, and the resulting evidence will be analysed and published in peer-reviewed journals.

  19. Transitioning to a new nursing home: one organization's experience.

    PubMed

    O'Brien, Kelli; Welsh, Darlene; Lundrigan, Elaine; Doyle, Anne

    2013-01-01

    Restructuring of long-term care in Western Health, a regional health authority within Newfoundland and Labrador, created a unique opportunity to study the widespread impacts of the transition. Staff and long-term-care residents were relocated from a variety of settings to a newly constructed facility. A plan was developed to assess the impact of relocation on staff, residents, and families. Indicators included fall rates, medication errors, complaints, media database, sick leave, overtime, injuries, and staff and family satisfaction. This article reports on the findings and lessons learned from an organizational perspective with such a large-scale transition. Some of the key findings included the necessity of premove and postmove strategies to minimize negative impacts, ongoing communication and involvement in decision making during transitions, tracking of key indicators, recognition from management regarding increased workload and stress experienced by staff, engagement of residents and families throughout the transition, and assessing the timing of large-scale relocations. These findings would be of interest to health care managers and leadership team in organizations planning large-scale changes.

  20. [Advances in the research of application of artificial intelligence in burn field].

    PubMed

    Li, H H; Bao, Z X; Liu, X B; Zhu, S H

    2018-04-20

    Artificial intelligence is now able, to some extent, to learn from and make judgments on large-scale data automatically. Based on databases containing large amounts of burn data and on deep learning, artificial intelligence can assist burn surgeons in evaluating the burn surface area, diagnosing burn depth, guiding fluid therapy during the shock stage, and predicting prognosis, with high accuracy. As the technology develops, artificial intelligence can provide increasingly accurate information for burn surgeons to formulate clinical diagnosis and treatment strategies.

  1. Remote visualization and scale analysis of large turbulence datasets

    NASA Astrophysics Data System (ADS)

    Livescu, D.; Pulido, J.; Burns, R.; Canada, C.; Ahrens, J.; Hamann, B.

    2015-12-01

    Accurate simulations of turbulent flows require solving all the dynamically relevant scales of motions. This technique, called Direct Numerical Simulation, has been successfully applied to a variety of simple flows; however, the large-scale flows encountered in Geophysical Fluid Dynamics (GFD) would require meshes outside the range of the most powerful supercomputers for the foreseeable future. Nevertheless, the current generation of petascale computers has enabled unprecedented simulations of many types of turbulent flows which focus on various GFD aspects, from the idealized configurations extensively studied in the past to more complex flows closer to the practical applications. The pace at which such simulations are performed only continues to increase; however, the simulations themselves are restricted to a small number of groups with access to large computational platforms. Yet the petabytes of turbulence data offer almost limitless information on many different aspects of the flow, from the hierarchy of turbulence moments, spectra and correlations, to structure-functions, geometrical properties, etc. The ability to share such datasets with other groups can significantly reduce the time to analyze the data, help the creative process and increase the pace of discovery. Using the largest DOE supercomputing platforms, we have performed some of the biggest turbulence simulations to date, in various configurations, addressing specific aspects of turbulence production and mixing mechanisms. Until recently, the visualization and analysis of such datasets was restricted by access to large supercomputers. The public Johns Hopkins Turbulence database simplifies the access to multi-Terabyte turbulence datasets and facilitates turbulence analysis through the use of commodity hardware. First, one of our datasets, which is part of the database, will be described and then a framework that adds high-speed visualization and wavelet support for multi-resolution analysis of turbulence will be highlighted. The addition of wavelet support reduces the latency and bandwidth requirements for visualization, allowing for many concurrent users, and enables new types of analyses, including scale decomposition and coherent feature extraction.
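
    The wavelet-based scale decomposition mentioned above can be illustrated with PyWavelets on a small synthetic 2-D field; this is a generic client-side sketch, not the database's server-side implementation, and the field, wavelet choice ('db2') and level count are arbitrary.

    ```python
    # Minimal sketch: 2-D wavelet scale decomposition of a turbulence-like field
    # with PyWavelets, illustrating multi-resolution analysis and low-bandwidth
    # previews. Not the database's server-side implementation.
    import numpy as np
    import pywt

    rng = np.random.default_rng(1)
    field = rng.normal(size=(256, 256))          # stand-in for a velocity slice

    # Multi-level 2-D discrete wavelet transform: a coarse approximation plus
    # horizontal/vertical/diagonal detail coefficients at each scale.
    coeffs = pywt.wavedec2(field, wavelet="db2", level=4)

    # Energy contained at each scale (coarsest detail level first).
    approx, details = coeffs[0], coeffs[1:]
    print("approximation energy:", np.sum(approx ** 2))
    for lvl, (cH, cV, cD) in enumerate(details, start=1):
        energy = sum(np.sum(c ** 2) for c in (cH, cV, cD))
        print(f"detail level {lvl} energy: {energy:.1f}")

    # A low-bandwidth preview can be reconstructed from the coarse scales only.
    coarse_only = [approx] + [tuple(np.zeros_like(c) for c in d) for d in details]
    preview = pywt.waverec2(coarse_only, wavelet="db2")
    ```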

  2. QuartetS-DB: A Large-Scale Orthology Database for Prokaryotes and Eukaryotes Inferred by Evolutionary Evidence

    DTIC Science & Technology

    2012-01-01

    …particular functions and identify species that contain these proteins. For example, if users select two species, Homo sapiens and Mus musculus, and…

  3. Developing Data Systems To Support the Analysis and Development of Large-Scale, On-Line Assessment.

    ERIC Educational Resources Information Center

    Yu, Chong Ho

    Today many data warehousing systems are data rich, but information poor. Extracting useful information from an ocean of data to support administrative, policy, and instructional decisions becomes a major challenge to both database designers and measurement specialists. This paper focuses on the development of a data processing system that…

  4. Mathematics Teacher Education Quality in TEDS-M: Globalizing the Views of Future Teachers and Teacher Educators

    ERIC Educational Resources Information Center

    Hsieh, Feng-Jui; Law, Chiu-Keung; Shy, Haw-Yaw; Wang, Ting-Ying; Hsieh, Chia-Jui; Tang, Shu-Jyh

    2011-01-01

    The Teacher Education and Development Study in Mathematics, sponsored by the International Association for the Evaluation of Educational Achievement, is the first data-based study about mathematics teacher education with large-scale samples; this article is based on its data but develops a stand-alone conceptual framework to investigate the…

  5. Gender Implications in Curriculum and Entrance Exam Grouping: Institutional Factors and Their Effects

    ERIC Educational Resources Information Center

    Hsaieh, Hsiao-Chin; Yang, Chia-Ling

    2014-01-01

    While access to higher education has reached gender parity in Taiwan, the phenomenon of gender segregation and stratification by fields of study and by division of labor persists. In this article, we trace the historical evolution of Taiwan's education system and use data from large-scale educational databases to analyze the association of…

  6. Height-diameter allometry of tropical forest trees

    Treesearch

    T.R. Feldpausch; L. Banin; O.L. Phillips; T.R. Baker; S.L. Lewis; C.A. Quesada; K. Affum-Baffoe; E.J.M.M. Arets; N.J. Berry; M. Bird; E.S. Brondizio; P de Camargo; J. Chave; G. Djagbletey; T.F. Domingues; M. Drescher; P.M. Fearnside; M.B. Franca; N.M. Fyllas; G. Lopez-Gonzalez; A. Hladik; N. Higuchi; M.O. Hunter; Y. Iida; K.A. Salim; A.R. Kassim; M. Keller; J. Kemp; D.A. King; J.C. Lovett; B.S. Marimon; B.H. Marimon-Junior; E. Lenza; A.R. Marshall; D.J. Metcalfe; E.T.A. Mitchard; E.F. Moran; B.W. Nelson; R. Nilus; E.M. Nogueira; M. Palace; S. Patiño; K.S.-H. Peh; M.T. Raventos; J.M. Reitsma; G. Saiz; F. Schrodt; B. Sonke; H.E. Taedoumg; S. Tan; L. White; H. Woll; J. Lloyd

    2011-01-01

    Tropical tree height-diameter (H:D) relationships may vary by forest type and region making large-scale estimates of above-ground biomass subject to bias if they ignore these differences in stem allometry. We have therefore developed a new global tropical forest database consisting of 39 955 concurrent H and D measurements encompassing 283 sites in 22 tropical...

  7. Obesity, High-Calorie Food Intake, and Academic Achievement Trends among U.S. School Children

    ERIC Educational Resources Information Center

    Li, Jian; O'Connell, Ann A.

    2012-01-01

    The authors investigated children's self-reported high-calorie food intake in Grade 5 and its relationship to trends in obesity status and academic achievement over the first 6 years of school. They used 3-level hierarchical linear models in the large-scale database (the Early Childhood Longitudinal Study--Kindergarten Cohort). Findings indicated…

  8. GenBank

    PubMed Central

    Benson, Dennis A.; Karsch-Mizrachi, Ilene; Lipman, David J.; Ostell, James; Wheeler, David L.

    2007-01-01

    GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage (www.ncbi.nlm.nih.gov). PMID:17202161

  9. A Computational Chemistry Database for Semiconductor Processing

    NASA Technical Reports Server (NTRS)

    Jaffe, R.; Meyyappan, M.; Arnold, J. O. (Technical Monitor)

    1998-01-01

    The concept of 'virtual reactor' or 'virtual prototyping' has received much attention recently in the semiconductor industry. Commercial codes to simulate thermal CVD and plasma processes have become available to aid in equipment and process design efforts. The virtual prototyping effort would go nowhere if codes did not come with a reliable database of chemical and physical properties of the gases involved in semiconductor processing. Commercial code vendors have no capability to generate such a database and instead leave the task of finding whatever is needed to the user. While individual investigations of interesting chemical systems continue at universities, there has not been any large-scale effort to create a database. In this presentation, we outline our efforts in this area. Our effort focuses on the following five areas: (1) thermal CVD reaction mechanisms and rate constants; (2) thermochemical properties; (3) transport properties; (4) electron-molecule collision cross sections; and (5) gas-surface interactions.

  10. Generalized in vitro-in vivo relationship (IVIVR) model based on artificial neural networks

    PubMed Central

    Mendyk, Aleksander; Tuszyński, Paweł K; Polak, Sebastian; Jachowicz, Renata

    2013-01-01

    Background: The aim of this study was to develop a generalized in vitro-in vivo relationship (IVIVR) model based on in vitro dissolution profiles together with quantitative and qualitative composition of dosage formulations as covariates. Such a model would be of substantial aid in the early stages of development of a pharmaceutical formulation, when no in vivo results are yet available and it is impossible to create a classical in vitro-in vivo correlation (IVIVC)/IVIVR. Methods: Chemoinformatics software was used to compute the molecular descriptors of drug substances (i.e., active pharmaceutical ingredients) and excipients. The data were collected from the literature. Artificial neural networks were used as the modeling tool. The training process was carried out using the 10-fold cross-validation technique. Results: The database contained 93 formulations with 307 inputs initially, and was later limited to 28 in the course of a sensitivity analysis. The four best models were introduced into the artificial neural network ensemble. Complete in vivo profiles were predicted accurately for 37.6% of the formulations. Conclusion: It has been shown that artificial neural networks can be an effective predictive tool for constructing an IVIVR in an integrated generalized model for various formulations. Because IVIVC/IVIVR is classically conducted for 2–4 formulations and with a single active pharmaceutical ingredient, the approach described here is unique in that it incorporates various active pharmaceutical ingredients and dosage forms into a single model. Thus, a preliminary IVIVC/IVIVR can be available without in vivo data, which is impossible using current IVIVC/IVIVR procedures. PMID:23569360
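
    The modeling idea, a feed-forward network mapping in vitro dissolution points plus chemoinformatic descriptors to an in vivo profile, can be sketched with scikit-learn on synthetic data. The array shapes and network size below are assumptions, and this is not the authors' four-model ensemble.

    ```python
    # Minimal sketch: a feed-forward network mapping in vitro dissolution points
    # plus formulation/API descriptors to an in vivo profile, in the spirit of a
    # generalized IVIVR model. Data are synthetic; not the authors' ANN ensemble.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(7)
    n_formulations = 93
    n_dissolution_pts, n_descriptors, n_invivo_pts = 8, 20, 6

    X = np.hstack([
        np.sort(rng.uniform(0, 100, size=(n_formulations, n_dissolution_pts)), axis=1),
        rng.normal(size=(n_formulations, n_descriptors)),   # chemoinformatic descriptors
    ])
    Y = np.sort(rng.uniform(0, 100, size=(n_formulations, n_invivo_pts)), axis=1)

    model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=5000, random_state=0)
    scores = cross_val_score(model, X, Y, cv=10, scoring="r2")   # 10-fold cross-validation
    print("mean cross-validated R^2:", scores.mean())
    ```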

  11. Integrated Computational Approach for Virtual Hit Identification against Ebola Viral Proteins VP35 and VP40.

    PubMed

    Mirza, Muhammad Usman; Ikram, Nazia

    2016-10-26

    The Ebola virus (EBOV) has been recognised for nearly 40 years, with the most recent EBOV outbreak being in West Africa, where it created a humanitarian crisis. Mortalities reported up to 30 March 2016 totalled 11,307. However, up until now, EBOV drugs have been far from achieving regulatory (FDA) approval. It is therefore essential to identify parent compounds that have the potential to be developed into effective drugs. Studies on Ebola viral proteins have shown that some can elicit an immunological response in mice, and these are now considered essential components of a vaccine designed to protect against Ebola haemorrhagic fever. The current study focuses on chemoinformatic approaches to identify virtual hits against Ebola viral proteins (VP35 and VP40), including protein binding site prediction, drug-likeness, pharmacokinetic and pharmacodynamic properties, metabolic site prediction, and molecular docking. Retrospective validation was performed using a database of non-active compounds, and early enrichment of EBOV actives at different false positive rates was calculated. Homology modelling and subsequent superimposition of binding site residues on other strains of EBOV were carried out to check residual conformations, and hence to confirm the efficacy of potential compounds. As a mechanism for artefactual inhibition of proteins through non-specific compounds, virtual hits were assessed for their aggregator potential compared with previously reported aggregators. These systematic studies have indicated that a few compounds may be effective inhibitors of EBOV replication and therefore might have the potential to be developed as anti-EBOV drugs after subsequent testing and validation in experiments in vivo.
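
    Early enrichment at a fixed false-positive rate, used above for retrospective validation, can be computed from screening scores via an ROC curve. The sketch below uses synthetic scores and labels rather than the EBOV screening data.

    ```python
    # Minimal sketch: early enrichment of known actives at fixed false-positive
    # rates, computed from virtual-screening scores via an ROC curve.
    # Scores and labels are synthetic, not the EBOV screening data.
    import numpy as np
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(3)
    n_actives, n_decoys = 50, 5000
    scores = np.concatenate([rng.normal(1.0, 1.0, n_actives),    # actives score higher
                             rng.normal(0.0, 1.0, n_decoys)])
    labels = np.concatenate([np.ones(n_actives), np.zeros(n_decoys)])

    fpr, tpr, _ = roc_curve(labels, scores)
    for target_fpr in (0.005, 0.01, 0.05):
        # True-positive rate achieved at (or just below) the target false-positive rate.
        recovered = tpr[np.searchsorted(fpr, target_fpr, side="right") - 1]
        print(f"FPR {target_fpr}: fraction of actives recovered = {recovered:.2f}")
    ```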

  12. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L

    2008-01-01

    GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 260 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.

  13. GenBank

    PubMed Central

    Benson, Dennis A.; Karsch-Mizrachi, Ilene; Lipman, David J.; Ostell, James; Wheeler, David L.

    2008-01-01

    GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 260 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov PMID:18073190

  14. Australia's continental-scale acoustic tracking database and its automated quality control process

    NASA Astrophysics Data System (ADS)

    Hoenner, Xavier; Huveneers, Charlie; Steckenreuter, Andre; Simpfendorfer, Colin; Tattersall, Katherine; Jaine, Fabrice; Atkins, Natalia; Babcock, Russ; Brodie, Stephanie; Burgess, Jonathan; Campbell, Hamish; Heupel, Michelle; Pasquer, Benedicte; Proctor, Roger; Taylor, Matthew D.; Udyawer, Vinay; Harcourt, Robert

    2018-01-01

    Our ability to predict species responses to environmental changes relies on accurate records of animal movement patterns. Continental-scale acoustic telemetry networks are increasingly being established worldwide, producing large volumes of information-rich geospatial data. During the last decade, the Integrated Marine Observing System's Animal Tracking Facility (IMOS ATF) established a permanent array of acoustic receivers around Australia. Simultaneously, IMOS developed a centralised national database to foster collaborative research across the user community and quantify individual behaviour across a broad range of taxa. Here we present the database and quality control procedures developed to collate 49.6 million valid detections from 1891 receiving stations. This dataset consists of detections for 3,777 tags deployed on 117 marine species, with distances travelled ranging from a few to thousands of kilometres. Connectivity between regions was only made possible by the joint contribution of IMOS infrastructure and researcher-funded receivers. This dataset constitutes a valuable resource facilitating meta-analysis of animal movement, distributions, and habitat use, and is important for relating species distribution shifts with environmental covariates.

  15. A Comparison of Global Indexing Schemes to Facilitate Earth Science Data Management

    NASA Astrophysics Data System (ADS)

    Griessbaum, N.; Frew, J.; Rilee, M. L.; Kuo, K. S.

    2017-12-01

    Recent advances in database technology have led to systems optimized for managing petabyte-scale multidimensional arrays. These array databases are a good fit for subsets of the Earth's surface that can be projected into a rectangular coordinate system with acceptable geometric fidelity. However, for global analyses, array databases must address the same distortions and discontinuities that apply to map projections in general. The array database SciDB supports enormous databases spread across thousands of computing nodes. Additionally, the following SciDB characteristics are particularly germane to the coordinate system problem: (i) SciDB efficiently stores and manipulates sparse (i.e., mostly empty) arrays; (ii) SciDB arrays have 64-bit indexes; and (iii) SciDB supports user-defined data types, functions, and operators. We have implemented two geospatial indexing schemes in SciDB. The simplest uses two array dimensions to represent longitude and latitude. For representation as 64-bit integers, the coordinates are multiplied by a scale factor large enough to yield an appropriate Earth surface resolution (e.g., a scale factor of 100,000 yields a resolution of approximately 1 m at the equator). Aside from the longitudinal discontinuity, the principal disadvantage of this scheme is its fixed scale factor. The second scheme uses a single array dimension to represent the bit-codes for locations in a hierarchical triangular mesh (HTM) coordinate system. An HTM maps the Earth's surface onto an octahedron, and then recursively subdivides each triangular face to the desired resolution. Earth surface locations are represented as the concatenation of an octahedron face code and a quadtree code within the face. Unlike our integerized lat-lon scheme, the HTM allows objects of different sizes (e.g., pixels with differing resolutions) to be represented in the same indexing scheme. We present an evaluation of the relative utility of these two schemes for managing and analyzing MODIS swath data.
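
    The integerized latitude-longitude scheme described above amounts to scaling coordinates by a fixed factor and truncating to integer indices. The standalone sketch below (plain Python, not SciDB/AFL code) shows the forward and inverse mapping with the 100,000 scale factor mentioned in the text.

    ```python
    # Minimal sketch of the integerized lat-lon indexing scheme described above:
    # multiply coordinates by a fixed scale factor and round to integer array
    # indices. Standalone Python, not SciDB/AFL code.
    SCALE = 100_000          # ~1 m resolution at the equator (1 deg ~ 111 km)

    def latlon_to_index(lat_deg: float, lon_deg: float) -> tuple[int, int]:
        """Map (lat, lon) in degrees to a pair of non-negative integer indices."""
        # Shift into non-negative ranges so indices start at 0.
        i_lat = int(round((lat_deg + 90.0) * SCALE))
        i_lon = int(round((lon_deg + 180.0) * SCALE))
        return i_lat, i_lon

    def index_to_latlon(i_lat: int, i_lon: int) -> tuple[float, float]:
        """Inverse mapping back to degrees (to the resolution set by SCALE)."""
        return i_lat / SCALE - 90.0, i_lon / SCALE - 180.0

    idx = latlon_to_index(34.4140, -119.8489)   # an arbitrary example point
    print(idx)
    print(index_to_latlon(*idx))
    ```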

  16. Evolving from bioinformatics in-the-small to bioinformatics in-the-large.

    PubMed

    Parker, D Stott; Gorlick, Michael M; Lee, Christopher J

    2003-01-01

    We argue the significance of a fundamental shift in bioinformatics, from in-the-small to in-the-large. Adopting a large-scale perspective is a way to manage the problems endemic to the world of the small: constellations of incompatible tools for which the effort required to assemble an integrated system exceeds the perceived benefit of the integration. Where bioinformatics in-the-small is about data and tools, bioinformatics in-the-large is about metadata and dependencies. Dependencies represent the complexities of large-scale integration, including the requirements and assumptions governing the composition of tools. The popular make utility is a very effective system for defining and maintaining simple dependencies, and it offers a number of insights about the essence of bioinformatics in-the-large. Keeping an in-the-large perspective has been very useful to us in large bioinformatics projects. We give two fairly different examples, and extract lessons from them showing how it has helped. These examples both suggest the benefit of explicitly defining and managing knowledge flows and knowledge maps (which represent metadata regarding types, flows, and dependencies), and also suggest approaches for developing bioinformatics database systems. Generally, we argue that large-scale engineering principles can be successfully adapted from disciplines such as software engineering and data management, and that having an in-the-large perspective will be a key advantage in the next phase of bioinformatics development.

  17. A data model and database for high-resolution pathology analytical image informatics.

    PubMed

    Wang, Fusheng; Kong, Jun; Cooper, Lee; Pan, Tony; Kurc, Tahsin; Chen, Wenjin; Sharma, Ashish; Niedermayr, Cristobal; Oh, Tae W; Brat, Daniel; Farris, Alton B; Foran, David J; Saltz, Joel

    2011-01-01

    The systematic analysis of imaged pathology specimens often results in a vast amount of morphological information at both the cellular and sub-cellular scales. While microscopy scanners and computerized analysis are capable of capturing and analyzing data rapidly, microscopy image data remain underutilized in research and clinical settings. One major obstacle which tends to reduce wider adoption of these new technologies throughout the clinical and scientific communities is the challenge of managing, querying, and integrating the vast amounts of data resulting from the analysis of large digital pathology datasets. This paper presents a data model, which addresses these challenges, and demonstrates its implementation in a relational database system. This paper describes a data model, referred to as Pathology Analytic Imaging Standards (PAIS), and a database implementation, which are designed to support the data management and query requirements of detailed characterization of micro-anatomic morphology through many interrelated analysis pipelines on whole-slide images and tissue microarrays (TMAs). (1) Development of a data model capable of efficiently representing and storing virtual slide related image, annotation, markup, and feature information. (2) Development of a database, based on the data model, capable of supporting queries for data retrieval based on analysis and image metadata, queries for comparison of results from different analyses, and spatial queries on segmented regions, features, and classified objects. The work described in this paper is motivated by the challenges associated with characterization of micro-scale features for comparative and correlative analyses involving whole-slides tissue images and TMAs. Technologies for digitizing tissues have advanced significantly in the past decade. Slide scanners are capable of producing high-magnification, high-resolution images from whole slides and TMAs within several minutes. Hence, it is becoming increasingly feasible for basic, clinical, and translational research studies to produce thousands of whole-slide images. Systematic analysis of these large datasets requires efficient data management support for representing and indexing results from hundreds of interrelated analyses generating very large volumes of quantifications such as shape and texture and of classifications of the quantified features. We have designed a data model and a database to address the data management requirements of detailed characterization of micro-anatomic morphology through many interrelated analysis pipelines. The data model represents virtual slide related image, annotation, markup and feature information. The database supports a wide range of metadata and spatial queries on images, annotations, markups, and features. We currently have three databases running on a Dell PowerEdge T410 server with CentOS 5.5 Linux operating system. The database server is IBM DB2 Enterprise Edition 9.7.2. The set of databases consists of 1) a TMA database containing image analysis results from 4740 cases of breast cancer, with 641 MB storage size; 2) an algorithm validation database, which stores markups and annotations from two segmentation algorithms and two parameter sets on 18 selected slides, with 66 GB storage size; and 3) an in silico brain tumor study database comprising results from 307 TCGA slides, with 365 GB storage size. The latter two databases also contain human-generated annotations and markups for regions and nuclei. 
Modeling and managing pathology image analysis results in a database provide immediate benefits on the value and usability of data in a research study. The database provides powerful query capabilities, which are otherwise difficult or cumbersome to support by other approaches such as programming languages. Standardized, semantic annotated data representation and interfaces also make it possible to more efficiently share image data and analysis results.

  18. Rationalizing the chemical space of protein-protein interaction inhibitors.

    PubMed

    Sperandio, Olivier; Reynès, Christelle H; Camproux, Anne-Claude; Villoutreix, Bruno O

    2010-03-01

    Protein-protein interactions (PPIs) are one of the next major classes of therapeutic targets, although they are too intricate to tackle with standard approaches. This is due, in part, to the inadequacy of today's chemical libraries. However, the emergence of a growing number of experimentally validated inhibitors of PPIs (i-PPIs) allows drug designers to use chemoinformatics and machine learning technologies to unravel the nature of the chemical space covered by the reported compounds. Key characteristics of i-PPIs can then be revealed, highlighting the importance of specific shapes and/or aromatic bonds and enabling the design of i-PPI-enriched focused libraries and, therefore, of cost-effective screening strategies.

  19. Accounting for rainfall spatial variability in the prediction of flash floods

    NASA Astrophysics Data System (ADS)

    Saharia, Manabendra; Kirstetter, Pierre-Emmanuel; Gourley, Jonathan J.; Hong, Yang; Vergara, Humberto; Flamig, Zachary L.

    2017-04-01

    Flash floods are a particularly damaging natural hazard worldwide in terms of both fatalities and property damage. In the United States, the lack of a comprehensive database that catalogues information related to flash flood timing, location, causative rainfall, and basin geomorphology has hindered broad characterization studies. First, a representative and long archive of more than 15,000 flooding events during 2002-2011 is used to analyze the spatial and temporal variability of flash floods. We also derive a large number of spatially distributed geomorphological and climatological parameters, such as basin area, mean annual precipitation, and basin slope, to identify static basin characteristics that influence flood response. For the same period, the National Severe Storms Laboratory (NSSL) has produced a decadal archive of Multi-Radar/Multi-Sensor (MRMS) radar-only precipitation rates at 1-km spatial resolution with 5-min temporal resolution. This provides an unprecedented opportunity to analyze the impact of event-level precipitation variability on flooding using a big data approach. To analyze the impact of sub-basin scale rainfall spatial variability on flooding, indices such as the first and second scaled moments of rainfall, the horizontal gap, and the vertical gap are computed from the MRMS dataset. Finally, flooding characteristics such as rise time, lag time, and peak discharge are linked to the derived geomorphologic, climatologic, and rainfall indices to identify basin characteristics that drive flash floods. The database has been subjected to rigorous quality control by accounting for radar beam height and the percentage of snow in basins. So far, studies involving rainfall variability indices have only been performed on a case-study basis, and a large-scale approach is expected to provide a deeper insight into how sub-basin scale precipitation variability affects flooding. Finally, these findings are validated using the National Weather Service storm reports and a historical flood fatalities database. This analysis framework will serve as a baseline for evaluating distributed hydrologic model simulations such as the Flooded Locations And Simulated Hydrographs Project (FLASH) (http://flash.ou.edu).
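
    One common formulation of the first and second scaled moments of catchment rainfall is the rainfall-weighted mean and variance of flow distance to the outlet, each normalized by its basin-wide (unweighted) counterpart; the details may differ from the indices used in this study, and the arrays in the sketch below are synthetic.

    ```python
    # Minimal sketch of one common formulation of the first and second scaled
    # moments of catchment rainfall: rainfall-weighted mean and variance of the
    # flow distance to the outlet, normalized by their basin-wide counterparts.
    # This may differ in detail from the indices used in the study; data are synthetic.
    import numpy as np

    rng = np.random.default_rng(11)
    n_cells = 10_000
    flow_distance = rng.uniform(0, 50_000, n_cells)            # distance to outlet per grid cell [m]
    rainfall = rng.gamma(shape=2.0, scale=5.0, size=n_cells)   # event rainfall per cell [mm]

    w = rainfall / rainfall.sum()
    weighted_mean = np.sum(w * flow_distance)
    weighted_var = np.sum(w * (flow_distance - weighted_mean) ** 2)

    delta1 = weighted_mean / flow_distance.mean()   # >1: rainfall centred far from the outlet
    delta2 = weighted_var / flow_distance.var()     # >1: rainfall more spread out than the basin
    print(f"delta1 = {delta1:.2f}, delta2 = {delta2:.2f}")
    ```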

  20. Toward the automated generation of genome-scale metabolic networks in the SEED.

    PubMed

    DeJongh, Matthew; Formsma, Kevin; Boillot, Paul; Gould, John; Rycenga, Matthew; Best, Aaron

    2007-04-26

    Current methods for the automated generation of genome-scale metabolic networks focus on genome annotation and preliminary biochemical reaction network assembly, but do not adequately address the process of identifying and filling gaps in the reaction network, and verifying that the network is suitable for systems level analysis. Thus, current methods are only sufficient for generating draft-quality networks, and refinement of the reaction network is still largely a manual, labor-intensive process. We have developed a method for generating genome-scale metabolic networks that produces substantially complete reaction networks, suitable for systems level analysis. Our method partitions the reaction space of central and intermediary metabolism into discrete, interconnected components that can be assembled and verified in isolation from each other, and then integrated and verified at the level of their interconnectivity. We have developed a database of components that are common across organisms, and have created tools for automatically assembling appropriate components for a particular organism based on the metabolic pathways encoded in the organism's genome. This focuses manual efforts on that portion of an organism's metabolism that is not yet represented in the database. We have demonstrated the efficacy of our method by reverse-engineering and automatically regenerating the reaction network from a published genome-scale metabolic model for Staphylococcus aureus. Additionally, we have verified that our method capitalizes on the database of common reaction network components created for S. aureus, by using these components to generate substantially complete reconstructions of the reaction networks from three other published metabolic models (Escherichia coli, Helicobacter pylori, and Lactococcus lactis). We have implemented our tools and database within the SEED, an open-source software environment for comparative genome annotation and analysis. Our method sets the stage for the automated generation of substantially complete metabolic networks for over 400 complete genome sequences currently in the SEED. With each genome that is processed using our tools, the database of common components grows to cover more of the diversity of metabolic pathways. This increases the likelihood that components of reaction networks for subsequently processed genomes can be retrieved from the database, rather than assembled and verified manually.

  1. A Study of the Efficiency of Spatial Indexing Methods Applied to Large Astronomical Databases

    NASA Astrophysics Data System (ADS)

    Donaldson, Tom; Berriman, G. Bruce; Good, John; Shiao, Bernie

    2018-01-01

    Spatial indexing of astronomical databases generally uses quadrature methods, which partition the sky into cells used to create an index (usually a B-tree) written as a database column. We report the results of a study to compare the performance of two common indexing methods, HTM and HEALPix, on Solaris and Windows database servers installed with a PostgreSQL database, and a Windows Server installed with MS SQL Server. The indexing was applied to the 2MASS All-Sky Catalog and to the Hubble Source catalog. On each server, the study compared indexing performance by submitting 1 million queries at each index level with random sky positions and random cone-search radii, computed on a logarithmic scale between 1 arcsec and 1 degree, and measuring the time to complete the query and write the output. These simulated queries, intended to model realistic use patterns, were run in a uniform way on many combinations of indexing method and indexing level. The query times in all simulations are strongly I/O-bound and are linear with the number of records returned for large numbers of sources. There are, however, considerable differences between simulations, which reveal that hardware I/O throughput is a more important factor in managing the performance of a DBMS than the choice of indexing scheme. The choice of index itself is relatively unimportant: for comparable index levels, the performance is consistent within the scatter of the timings. At small index levels (large cells; e.g. level 4; cell size 3.7 deg), there is large scatter in the timings because of wide variations in the number of sources found in the cells. At larger index levels, performance improves and scatter decreases, but the improvement at level 8 (14 arcmin) and higher is masked to some extent in the timing scatter caused by the range of query sizes. At very high levels (20; 0.0004 arcsec), the granularity of the cells becomes so high that a large number of extraneous empty cells begin to degrade performance. Thus, for the use patterns studied here, the database performance is not critically dependent on the exact choices of index or level.
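
    The pruning role of such an index can be illustrated with a HEALPix cone search in healpy; the study itself implemented HTM and HEALPix indexes inside PostgreSQL and MS SQL Server, so healpy here is only a stand-in showing how a cell index limits the rows an individual query must touch before the exact angular-distance filter is applied.

    ```python
    # Minimal sketch: HEALPix-style cone-search pruning with healpy. The study
    # built its indexes inside PostgreSQL / MS SQL Server; healpy is used here
    # only to show how a cell index limits the rows a query must touch.
    import numpy as np
    import healpy as hp

    nside = 2 ** 8                                    # resolution parameter (finer cells for larger nside)
    ra_deg, dec_deg, radius_deg = 150.1, 2.2, 0.05    # an example cone search

    # Convert the cone centre to a unit vector and find all cells overlapping the
    # cone; a database would then fetch only rows whose indexed cell ID is in this
    # set before applying the exact angular-distance filter.
    vec = hp.ang2vec(np.radians(90.0 - dec_deg), np.radians(ra_deg))
    cells = hp.query_disc(nside, vec, np.radians(radius_deg), inclusive=True)

    print(f"nside={nside}: a cone of {radius_deg} deg touches {len(cells)} cells")
    print("candidate cell IDs:", cells[:10], "...")
    ```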

  2. MGIS: managing banana (Musa spp.) genetic resources information and high-throughput genotyping data

    PubMed Central

    Guignon, V.; Sempere, G.; Sardos, J.; Hueber, Y.; Duvergey, H.; Andrieu, A.; Chase, R.; Jenny, C.; Hazekamp, T.; Irish, B.; Jelali, K.; Adeka, J.; Ayala-Silva, T.; Chao, C.P.; Daniells, J.; Dowiya, B.; Effa effa, B.; Gueco, L.; Herradura, L.; Ibobondji, L.; Kempenaers, E.; Kilangi, J.; Muhangi, S.; Ngo Xuan, P.; Paofa, J.; Pavis, C.; Thiemele, D.; Tossou, C.; Sandoval, J.; Sutanto, A.; Vangu Paka, G.; Yi, G.; Van den houwe, I.; Roux, N.

    2017-01-01

    Abstract Unraveling the genetic diversity held in genebanks on a large scale is underway, due to advances in next-generation sequencing (NGS)-based technologies that produce high-density genetic markers for a large number of samples at low cost. Genebank users should be in a position to identify and select germplasm from the global genepool based on a combination of passport, genotypic and phenotypic data. To facilitate this, a new generation of information systems is being designed to efficiently handle data and link it with other external resources such as genome or breeding databases. The Musa Germplasm Information System (MGIS), the database for global ex situ-held banana genetic resources, has been developed to address those needs in a user-friendly way. In developing MGIS, we selected a generic database schema (Chado), the robust content management system Drupal for the user interface, and Tripal, a set of Drupal modules which links the Chado schema to Drupal. MGIS allows germplasm collection examination, accession browsing, advanced search functions, and germplasm orders. Additionally, we developed unique graphical interfaces to compare accessions and to explore them based on their taxonomic information. Accession-based data has been enriched with publications, genotyping studies and associated genotyping datasets reporting on germplasm use. Finally, an interoperability layer has been implemented to facilitate the link with complementary databases like the Banana Genome Hub and the MusaBase breeding database. Database URL: https://www.crop-diversity.org/mgis/ PMID:29220435

  3. RARGE II: an integrated phenotype database of Arabidopsis mutant traits using a controlled vocabulary.

    PubMed

    Akiyama, Kenji; Kurotani, Atsushi; Iida, Kei; Kuromori, Takashi; Shinozaki, Kazuo; Sakurai, Tetsuya

    2014-01-01

    Arabidopsis thaliana is one of the most popular experimental plants. However, only 40% of its genes have at least one experimental Gene Ontology (GO) annotation assigned. Systematic observation of mutant phenotypes is an important technique for elucidating gene functions. Indeed, several large-scale phenotypic analyses have been performed and have generated phenotypic data sets from many Arabidopsis mutant lines and overexpressing lines, which are freely available online. However, because each Arabidopsis mutant line database describes phenotypes in its own way, the differences in the structured term sets used by each database make it difficult to compare data sets and impossible to search across databases. Therefore, we obtained publicly available information for a total of 66,209 Arabidopsis mutant lines, including loss-of-function (RATM and TARAPPER) and gain-of-function (AtFOX and OsFOX) lines, and integrated the phenotype data by mapping the descriptions onto Plant Ontology (PO) and Phenotypic Quality Ontology (PATO) terms. This approach made it possible to manage the four different phenotype databases as one large data set. Here, we report a publicly accessible web-based database, the RIKEN Arabidopsis Genome Encyclopedia II (RARGE II; http://rarge-v2.psc.riken.jp/), in which all of the data described in this study are included. Using the database, we demonstrated consistency (in terms of protein function) with a previous study and identified the presumed function of an unknown gene. We provide examples of AT1G21600, which encodes a subunit of the plastid-encoded RNA polymerase complex, and AT5G56980, which is related to the jasmonic acid signaling pathway.

  4. Traditional manual acupuncture combined with rehabilitation therapy for shoulder hand syndrome after stroke within the Chinese healthcare system: a systematic review and meta-analysis.

    PubMed

    Peng, Le; Zhang, Chao; Zhou, Lan; Zuo, Hong-Xia; He, Xiao-Kuo; Niu, Yu-Ming

    2018-04-01

    To investigate the effectiveness of traditional manual acupuncture combined with rehabilitation therapy versus rehabilitation therapy alone for shoulder hand syndrome after stroke. PubMed, EMBASE, the Cochrane Library, Chinese Biomedicine Database, China National Knowledge Infrastructure, VIP Information Database, Wan Fang Database and reference lists of the eligible studies were searched up to July 2017 for relevant studies. Randomized controlled trials that compared the combined effects of traditional manual acupuncture and rehabilitation therapy to rehabilitation therapy alone for shoulder hand syndrome after stroke were included. Two reviewers independently screened the searched records, extracted the data and assessed risk of bias of the included studies. The treatment effect sizes were pooled in a meta-analysis using RevMan 5.3 software. A total of 20 studies involving 1918 participants were included in this study. Compared to rehabilitation therapy alone, the combined therapy significantly reduced pain on the visual analogue scale and improved limb movement on the Fugl-Meyer Assessment scale and the performance of activities of daily living (ADL) on the Barthel Index scale or Modified Barthel Index scale. Of these, the visual analogue scale score changes were significantly higher (mean difference = 1.49, 95% confidence interval = 1.15-1.82, P < 0.00001) favoring the combined therapy after treatment, with severe heterogeneity (I² = 71%, P = 0.0005). Current evidence suggests that traditional manual acupuncture integrated with rehabilitation therapy is more effective in alleviating pain, improving limb movement and ADL. However, considering the relatively low quality of available evidence, further rigorously designed and large-scale randomized controlled trials are needed to confirm the results.

  5. Automatic location of L/H transition times for physical studies with a large statistical basis

    NASA Astrophysics Data System (ADS)

    González, S.; Vega, J.; Murari, A.; Pereira, A.; Dormido-Canto, S.; Ramírez, J. M.; contributors, JET-EFDA

    2012-06-01

    Completely automatic techniques to estimate and validate L/H transition times can be essential in L/H transition analyses. The generation of databases with hundreds of transition times and without human intervention is an important step to accomplish (a) L/H transition physics analysis, (b) validation of L/H theoretical models and (c) creation of L/H scaling laws. An entirely unattended methodology is presented in this paper to build large databases of transition times in JET using time series. The proposed technique has been applied to a dataset of 551 JET discharges between campaigns C21 and C26. For discharges that show a clear signature in the time series, the transition time is located using the localization properties of the wavelet transform; this prediction is accurate, with an uncertainty interval of ±3.2 ms. Discharges without a clear pattern in the time series are handled by an L/H mode classifier trained on the discharges with a clear signature; in this case, the estimation error has a mean of 27.9 ms and a standard deviation of 37.62 ms. Two different regression methods have been applied to the measurements acquired at the transition times identified by the automatic system. The obtained scaling laws for the threshold power are not significantly different from those obtained using the data at the transition times determined manually by the experts. The automatic methods allow performing physical studies with a large number of discharges, showing, for example, that there are statistically different types of transitions characterized by different scaling laws.
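
    As a rough illustration of the wavelet-based localization step, the sketch below convolves a synthetic signal with a single-scale Haar step kernel and takes the strongest response as the candidate transition time. It is a simplified stand-in for the method above; the signal, sampling rate, and kernel width are invented.

```python
# Simplified stand-in for the wavelet-based transition locator described above:
# the response of a Haar step kernel (a single-scale Haar wavelet) peaks where
# the signal jumps, giving a candidate L/H transition time. Data are synthetic.
import numpy as np

def locate_step(signal, half_width=50):
    """Return the sample index where the Haar-like step response is strongest."""
    kernel = np.concatenate([-np.ones(half_width), np.ones(half_width)]) / (2 * half_width)
    response = np.convolve(signal, kernel[::-1], mode="same")  # correlate with the step kernel
    return int(np.argmax(np.abs(response)))

if __name__ == "__main__":
    t = np.arange(0.0, 10.0, 0.001)                  # 1 kHz sampling, 10 s window
    signal = 1.0 + 0.05 * np.random.randn(t.size)    # quiescent phase with noise
    signal[t >= 6.2] += 0.8                          # abrupt jump mimicking an L/H transition
    idx = locate_step(signal)
    print(f"estimated transition time: {t[idx]:.3f} s (true: 6.200 s)")
```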

  6. Virtual Systems Pharmacology (ViSP) software for simulation from mechanistic systems-level models.

    PubMed

    Ermakov, Sergey; Forster, Peter; Pagidala, Jyotsna; Miladinov, Marko; Wang, Albert; Baillie, Rebecca; Bartlett, Derek; Reed, Mike; Leil, Tarek A

    2014-01-01

    Multiple software programs are available for designing and running large-scale system-level pharmacology models used in the drug development process. Depending on the problem, scientists may be forced to use several modeling tools, which can increase model development time, IT costs and so on. Therefore, it is desirable to have a single platform that allows setting up and running large-scale simulations for models that have been developed with different modeling tools. We developed a workflow and a software platform in which a model file is compiled into a self-contained executable that is no longer dependent on the software that was used to create the model. At the same time, the full model specification is preserved by exposing all model parameters as input parameters for the executable. The platform was implemented as a model-agnostic, therapeutic-area-agnostic, web-based application with a database back-end that can be used to configure, manage and execute large-scale simulations for multiple models by multiple users. The user interface is designed to be easily configurable to reflect the specifics of the model and the user's particular needs, and the back-end database has been implemented to store and manage all aspects of the system, such as Models, Virtual Patients, User Interface Settings, and Results. The platform can be adapted and deployed on an existing cluster or cloud computing environment. Its use was demonstrated with a metabolic disease systems pharmacology model that simulates the effects of two antidiabetic drugs, metformin and fasiglifam, in type 2 diabetes mellitus patients.

  7. Virtual Systems Pharmacology (ViSP) software for simulation from mechanistic systems-level models

    PubMed Central

    Ermakov, Sergey; Forster, Peter; Pagidala, Jyotsna; Miladinov, Marko; Wang, Albert; Baillie, Rebecca; Bartlett, Derek; Reed, Mike; Leil, Tarek A.

    2014-01-01

    Multiple software programs are available for designing and running large-scale system-level pharmacology models used in the drug development process. Depending on the problem, scientists may be forced to use several modeling tools, which can increase model development time, IT costs and so on. Therefore, it is desirable to have a single platform that allows setting up and running large-scale simulations for models that have been developed with different modeling tools. We developed a workflow and a software platform in which a model file is compiled into a self-contained executable that is no longer dependent on the software that was used to create the model. At the same time, the full model specification is preserved by exposing all model parameters as input parameters for the executable. The platform was implemented as a model-agnostic, therapeutic-area-agnostic, web-based application with a database back-end that can be used to configure, manage and execute large-scale simulations for multiple models by multiple users. The user interface is designed to be easily configurable to reflect the specifics of the model and the user's particular needs, and the back-end database has been implemented to store and manage all aspects of the system, such as Models, Virtual Patients, User Interface Settings, and Results. The platform can be adapted and deployed on an existing cluster or cloud computing environment. Its use was demonstrated with a metabolic disease systems pharmacology model that simulates the effects of two antidiabetic drugs, metformin and fasiglifam, in type 2 diabetes mellitus patients. PMID:25374542

  8. Cutaneous lichen planus: A systematic review of treatments.

    PubMed

    Fazel, Nasim

    2015-06-01

    Various treatment modalities are available for cutaneous lichen planus. PubMed, EMBASE, Cochrane Database of Systematic Reviews, Cochrane Central Register of Controlled Trials, Database of Abstracts of Reviews of Effects, and Health Technology Assessment Database were searched for all the systematic reviews and randomized controlled trials related to cutaneous lichen planus. Two systematic reviews and nine relevant randomized controlled trials were identified. Acitretin, griseofulvin, hydroxychloroquine and narrowband ultraviolet B have been demonstrated to be effective in the treatment of cutaneous lichen planus. Sulfasalazine is effective, but has an unfavorable safety profile. KH1060, a vitamin D analogue, is not beneficial in the management of cutaneous lichen planus. Evidence from large-scale randomized trials demonstrating the safety and efficacy of many other treatment modalities used to treat cutaneous lichen planus is simply not available.

  9. The Eukaryotic Pathogen Databases: a functional genomic resource integrating data from human and veterinary parasites.

    PubMed

    Harb, Omar S; Roos, David S

    2015-01-01

    Over the past 20 years, advances in high-throughput biological techniques and the availability of computational resources including fast Internet access have resulted in an explosion of large genome-scale data sets ("big data"). While such data are readily available for download and analysis from a variety of repositories, such analyses often require computational skills that many researchers lack. As a result, a number of databases have emerged that provide scientists with online tools for interrogating data without requiring computational skills beyond basic use of a web browser. This chapter focuses on the Eukaryotic Pathogen Databases (EuPathDB: http://eupathdb.org) Bioinformatic Resource Center (BRC) and illustrates some of the available tools and methods.

  10. Large-scale quantitative analysis of painting arts.

    PubMed

    Kim, Daniel; Son, Seung-Woo; Jeong, Hawoong

    2014-12-11

    Scientists have made efforts to understand the beauty of painting art in their own languages. As digital image acquisition of painting arts has made rapid progress, researchers have come to a point where it is possible to perform statistical analysis of a large-scale database of artistic paintings to build a bridge between art and science. Using digital image processing techniques, we investigate three quantitative measures of images: the usage of individual colors, the variety of colors, and the roughness of the brightness. We found a difference in color usage between classical paintings and photographs, and a significantly low color variety in the medieval period. Moreover, the increase in the roughness exponent as painting techniques such as chiaroscuro and sfumato advanced is consistent with historical circumstances.
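
    The sketch below computes simplified proxies for two of the measures discussed above: color variety as the Shannon entropy of a coarsely quantized color histogram, and a crude brightness-roughness estimate from local luminance differences. These are not the paper's exact definitions, and the image is a synthetic stand-in (a digitized painting could be loaded with Pillow instead).

```python
# Simplified proxies (not the paper's exact measures) for two image statistics:
# color variety as the entropy of a quantized color histogram, and a crude
# brightness-roughness estimate from local luminance differences.
import numpy as np

def color_entropy(rgb, bins_per_channel=8):
    """Shannon entropy (bits) of a coarsely quantized RGB histogram."""
    q = (rgb.astype(int) // (256 // bins_per_channel)).reshape(-1, 3)
    codes = q[:, 0] * bins_per_channel**2 + q[:, 1] * bins_per_channel + q[:, 2]
    counts = np.bincount(codes, minlength=bins_per_channel**3)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def brightness_roughness(rgb):
    """RMS of horizontal luminance differences, a rough texture proxy."""
    lum = rgb.astype(float).mean(axis=2)
    return float(np.sqrt(np.mean(np.diff(lum, axis=1) ** 2)))

if __name__ == "__main__":
    # In practice the array would come from a digitized painting, e.g.
    # np.asarray(Image.open("painting.jpg").convert("RGB")) with Pillow.
    rgb = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
    print("color variety (bits):", color_entropy(rgb))
    print("brightness roughness:", brightness_roughness(rgb))
```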

  11. DEEP: A Database of Energy Efficiency Performance to Accelerate Energy Retrofitting of Commercial Buildings

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hoon Lee, Sang; Hong, Tianzhen; Sawaya, Geof

    The paper presents a method and process to establish a database of energy efficiency performance (DEEP) to enable quick and accurate assessment of energy retrofits of commercial buildings. DEEP was compiled from the results of about 35 million EnergyPlus simulations. DEEP provides energy savings estimates for screening and evaluation of retrofit measures targeting small and medium-sized office and retail buildings in California. The prototype building models are developed for a comprehensive assessment of building energy performance based on DOE commercial reference buildings and the California DEER prototype buildings. The prototype buildings represent seven building types across six construction vintages and 16 California climate zones. DEEP uses these prototypes to evaluate the energy performance of about 100 energy conservation measures covering envelope, lighting, heating, ventilation, air-conditioning, plug loads, and domestic hot water. DEEP contains the energy simulation results for individual retrofit measures as well as packages of measures, to account for interactive effects between multiple measures. The large-scale EnergyPlus simulations are being conducted on the supercomputers at the National Energy Research Scientific Computing Center of Lawrence Berkeley National Laboratory. The pre-simulated database is part of an ongoing project to develop a web-based retrofit toolkit for small and medium-sized commercial buildings in California, which provides real-time retrofit feedback by querying DEEP for recommended measures, estimated energy savings and financial payback periods based on users’ decision criteria of maximizing energy savings, energy cost savings, carbon reduction, or payback of investment. The pre-simulated database and the associated comprehensive measure analysis enhance the ability to assess retrofits that reduce energy use in small and medium buildings, whose owners typically do not have the resources to conduct costly building energy audits. DEEP will be migrated into DEnCity - DOE’s Energy City, which integrates large-scale energy data into a multi-purpose, open, and dynamic database leveraging diverse sources of existing simulation data.
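
    Conceptually, querying a pre-simulated database like DEEP reduces to a lookup keyed by building type, vintage, climate zone, and measure; the sketch below is a toy illustration of that idea, and every key, measure name, and savings value in it is hypothetical.

```python
# Toy illustration only: querying a pre-simulated savings table keyed by
# building type, vintage, climate zone, and measure, in the spirit of DEEP.
# The keys, measure names, and numbers below are hypothetical.
SAVINGS_KWH_PER_SQFT = {
    ("small_office", "1980s", "CZ03", "led_lighting"): 1.8,
    ("small_office", "1980s", "CZ03", "economizer"): 0.9,
    ("retail", "2000s", "CZ12", "led_lighting"): 1.2,
}

def estimate_savings(building_type, vintage, climate_zone, measures):
    """Sum pre-simulated per-measure savings; unknown combinations contribute 0."""
    return sum(
        SAVINGS_KWH_PER_SQFT.get((building_type, vintage, climate_zone, m), 0.0)
        for m in measures
    )

print(estimate_savings("small_office", "1980s", "CZ03", ["led_lighting", "economizer"]))
```

    Note that simply summing per-measure savings ignores interactions between measures, which is why packages of measures are simulated separately in DEEP.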

  12. Addition of a breeding database in the Genome Database for Rosaceae

    PubMed Central

    Evans, Kate; Jung, Sook; Lee, Taein; Brutcher, Lisa; Cho, Ilhyung; Peace, Cameron; Main, Dorrie

    2013-01-01

    Breeding programs produce large datasets that require efficient management systems to keep track of performance, pedigree, geographical and image-based data. With the development of DNA-based screening technologies, more breeding programs perform genotyping in addition to phenotyping for performance evaluation. The integration of breeding data with other genomic and genetic data is instrumental for the refinement of marker-assisted breeding tools, enhances genetic understanding of important crop traits and maximizes access and utility by crop breeders and allied scientists. New infrastructure in the Genome Database for Rosaceae (GDR) was designed and implemented to enable secure and efficient storage, management and analysis of large datasets from the Washington State University apple breeding program, and was subsequently expanded to fit datasets from other Rosaceae breeders. The infrastructure was built using the software Chado and Drupal, making use of the Natural Diversity module to accommodate large-scale phenotypic and genotypic data. Breeders can search accessions within the GDR to identify individuals with specific trait combinations. Results from Search by Parentage list individuals with parents in common, and Individual Variety pages link to all data available on each chosen individual, including pedigree, phenotypic and genotypic information. Genotypic data are searchable by markers and alleles; results are linked to other pages in the GDR to enable the user to access tools such as GBrowse and CMap. This breeding database provides users with the opportunity to search datasets in a fully targeted manner and retrieve and compare performance data from multiple selections, years and sites, and to output the data needed for variety release publications and patent applications. The breeding database facilitates efficient program management. Storing publicly available breeding data in a database together with genomic and genetic data will further accelerate the cross-utilization of diverse data types by researchers from various disciplines. Database URL: http://www.rosaceae.org/breeders_toolbox PMID:24247530

  13. Addition of a breeding database in the Genome Database for Rosaceae.

    PubMed

    Evans, Kate; Jung, Sook; Lee, Taein; Brutcher, Lisa; Cho, Ilhyung; Peace, Cameron; Main, Dorrie

    2013-01-01

    Breeding programs produce large datasets that require efficient management systems to keep track of performance, pedigree, geographical and image-based data. With the development of DNA-based screening technologies, more breeding programs perform genotyping in addition to phenotyping for performance evaluation. The integration of breeding data with other genomic and genetic data is instrumental for the refinement of marker-assisted breeding tools, enhances genetic understanding of important crop traits and maximizes access and utility by crop breeders and allied scientists. New infrastructure in the Genome Database for Rosaceae (GDR) was designed and implemented to enable secure and efficient storage, management and analysis of large datasets from the Washington State University apple breeding program, and was subsequently expanded to fit datasets from other Rosaceae breeders. The infrastructure was built using the software Chado and Drupal, making use of the Natural Diversity module to accommodate large-scale phenotypic and genotypic data. Breeders can search accessions within the GDR to identify individuals with specific trait combinations. Results from Search by Parentage list individuals with parents in common, and Individual Variety pages link to all data available on each chosen individual, including pedigree, phenotypic and genotypic information. Genotypic data are searchable by markers and alleles; results are linked to other pages in the GDR to enable the user to access tools such as GBrowse and CMap. This breeding database provides users with the opportunity to search datasets in a fully targeted manner and retrieve and compare performance data from multiple selections, years and sites, and to output the data needed for variety release publications and patent applications. The breeding database facilitates efficient program management. Storing publicly available breeding data in a database together with genomic and genetic data will further accelerate the cross-utilization of diverse data types by researchers from various disciplines. Database URL: http://www.rosaceae.org/breeders_toolbox.

  14. BIG: a large-scale data integration tool for renal physiology

    PubMed Central

    Zhao, Yue; Yang, Chin-Rang; Raghuram, Viswanathan; Parulekar, Jaya

    2016-01-01

    Due to recent advances in high-throughput techniques, we and others have generated multiple proteomic and transcriptomic databases to describe and quantify gene expression, protein abundance, or cellular signaling on the scale of the whole genome/proteome in kidney cells. The existence of so much data from diverse sources raises the following question: “How can researchers find information efficiently for a given gene product over all of these data sets without searching each data set individually?” This is the type of problem that has motivated the “Big-Data” revolution in Data Science, which has driven progress in fields such as marketing. Here we present an online Big-Data tool called BIG (Biological Information Gatherer) that allows users to submit a single online query to obtain all relevant information from all indexed databases. BIG is accessible at http://big.nhlbi.nih.gov/. PMID:27279488

  15. CELL5M: A geospatial database of agricultural indicators for Africa South of the Sahara.

    PubMed

    Koo, Jawoo; Cox, Cindy M; Bacou, Melanie; Azzarri, Carlo; Guo, Zhe; Wood-Sichra, Ulrike; Gong, Queenie; You, Liangzhi

    2016-01-01

    Recent progress in large-scale georeferenced data collection is widening opportunities for combining multi-disciplinary datasets from biophysical to socioeconomic domains, advancing our analytical and modeling capacity. Granular spatial datasets provide critical information necessary for decision makers to identify target areas, assess baseline conditions, prioritize investment options, set goals and targets and monitor impacts. However, key challenges in reconciling data across themes, scales and borders restrict our capacity to produce global and regional maps and time series. This paper provides an overview of the structure and coverage of CELL5M, an open-access database of geospatial indicators at 5 arc-minute grid resolution, and introduces a range of analytical applications and use cases. CELL5M covers a wide set of agriculture-relevant domains for all countries in Africa South of the Sahara and supports our understanding of multi-dimensional spatial variability inherent in farming landscapes throughout the region.
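
    A 5 arc-minute grid implies a simple mapping from coordinates to cells; the sketch below shows one such mapping, with a row/column convention chosen for illustration rather than CELL5M's actual cell identifier scheme.

```python
# Minimal sketch: mapping a latitude/longitude pair to a 5 arc-minute grid cell,
# the resolution used by CELL5M. The row/column indexing convention here is an
# assumption for illustration, not the database's actual cell identifiers.
CELL_DEG = 5.0 / 60.0  # 5 arc-minutes in degrees

def cell_index(lat, lon):
    """Return (row, col) of the 5 arc-minute cell containing the point."""
    row = int((90.0 - lat) // CELL_DEG)    # rows counted from the north pole
    col = int((lon + 180.0) // CELL_DEG)   # columns counted from 180 degrees west
    return row, col

print(cell_index(-1.2921, 36.8219))  # e.g. a point near Nairobi
```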

  16. Leaf optical properties shed light on foliar trait variability at individual to global scales

    NASA Astrophysics Data System (ADS)

    Shiklomanov, A. N.; Serbin, S.; Dietze, M.

    2017-12-01

    Recent syntheses of large trait databases have contributed immensely to our understanding of drivers of plant function at the global scale. However, the global trade-offs revealed by such syntheses, such as the trade-off between leaf productivity and resilience (i.e. "leaf economics spectrum"), are often absent at smaller scales and fail to correlate with actual functional limitations. An improved understanding of how traits vary among communities, species, and individuals is critical to accurate representations of vegetation ecophysiology and ecological dynamics in ecosystem models. Spectral data from both field observations and remote sensing platforms present a rich and widely available source of information on plant traits. Here, we apply Bayesian inversion of the PROSPECT leaf radiative transfer model to a large global database of over 60,000 field spectra and plant traits to (1) comprehensively assess the accuracy of leaf trait estimation using PROSPECT spectral inversion; (2) investigate the correlations between optical traits estimable from PROSPECT and other important foliar traits such as nitrogen and lignin concentrations; and (3) identify dominant sources of variability and characterize trade-offs in optical and non-optical foliar traits. Our work provides a key methodological contribution by validating physically-based retrieval of plant traits from remote sensing observations, and provides insights about trait trade-offs related to plant acclimation, adaptation, and community assembly.

  17. Benchmarking of HPCC: A novel 3D molecular representation combining shape and pharmacophoric descriptors for efficient molecular similarity assessments.

    PubMed

    Karaboga, Arnaud S; Petronin, Florent; Marchetti, Gino; Souchet, Michel; Maigret, Bernard

    2013-04-01

    Since 3D molecular shape is an important determinant of biological activity, designing accurate 3D molecular representations is still of high interest. Several chemoinformatic approaches have been developed to try to describe accurate molecular shapes. Here, we present a novel 3D molecular description, namely the harmonic pharma chemistry coefficient (HPCC), which combines a ligand-centric pharmacophoric description projected onto a spherical-harmonic-based representation of the ligand's shape. The performance of HPCC was evaluated by comparison to the standard ROCS software in a ligand-based virtual screening (VS) approach using the publicly available directory of useful decoys (DUD) data set comprising over 100,000 compounds distributed across 40 protein targets. Our results were analyzed using commonly reported statistics such as the area under the curve (AUC) and normalized sum of logarithms of ranks (NSLR) metrics. Overall, our 3D HPCC method is as efficient as the state-of-the-art ROCS software in terms of enrichment, and slightly better for more than half of the DUD targets. Since it is widely acknowledged that VS results depend strongly on the nature of the protein families, we believe that the present HPCC solution is a useful addition to current ligand-based VS methods. Copyright © 2013 Elsevier Inc. All rights reserved.
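
    The AUC reported in studies like this one can be computed directly from a ranked screening list; the sketch below is generic bookkeeping for that statistic (ties ignored), not the HPCC or ROCS implementation, and the scores and labels are toy values.

```python
# Generic ROC AUC for a ligand-based virtual screen (the enrichment statistic
# reported above); standard bookkeeping, not the HPCC or ROCS code.
def roc_auc(scores, is_active):
    """AUC = probability that a random active is ranked above a random decoy."""
    ranked = sorted(zip(scores, is_active), key=lambda x: x[0], reverse=True)
    n_act = sum(is_active)
    n_dec = len(is_active) - n_act
    actives_seen = 0   # actives encountered while walking down the ranked list
    pairs_won = 0.0    # (active, decoy) pairs ranked in the correct order
    for _, active in ranked:
        if active:
            actives_seen += 1
        else:
            pairs_won += actives_seen
    return pairs_won / (n_act * n_dec)

print(roc_auc([0.9, 0.8, 0.4, 0.3, 0.1], [1, 0, 1, 0, 0]))  # toy example -> 0.833...
```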

  18. Interactive Profiler: An Intuitive, Web-Based Statistical Application in Visualizing Educational and Marketing Databases

    ERIC Educational Resources Information Center

    Ip, Edward H.; Leung, Phillip; Johnson, Joseph

    2004-01-01

    We describe the design and implementation of a web-based statistical program--the Interactive Profiler (IP). The prototypical program, developed in Java, was motivated by the need for the general public to query against data collected from the National Assessment of Educational Progress (NAEP), a large-scale US survey of the academic state of…

  19. Radiocarbon Dating the Anthropocene

    NASA Astrophysics Data System (ADS)

    Chaput, M. A.; Gajewski, K. J.

    2015-12-01

    The Anthropocene has no agreed start date since current suggestions for its beginning range from Pre-Industrial times to the Industrial Revolution, and from the mid-twentieth century to the future. To set the boundary of the Anthropocene in geological time, we must first understand when, how and to what extent humans began altering the Earth system. One aspect of this involves reconstructing the effects of prehistoric human activity on the physical landscape. However, for global reconstructions of land use and land cover change to be more accurately interpreted in the context of human interaction with the landscape, large-scale spatio-temporal demographic changes in prehistoric populations must be known. Estimates of the relative number of prehistoric humans in different regions of the world and at different moments in time are needed. To this end, we analyze a dataset of radiocarbon dates from the Canadian Archaeological Radiocarbon Database (CARD), the Palaeolithic Database of Europe and the AustArch Database of Australia, as well as published dates from South America. This is the first time such a large quantity of dates (approximately 60,000) has been mapped and studied at a global scale. Initial results from the analysis of temporal frequency distributions of calibrated radiocarbon dates, assumed to be proportional to population density, will be discussed. The utility of radiocarbon dates in studies of the Anthropocene will be evaluated, and potential links between population density and changes in atmospheric greenhouse gas concentrations, climate, migration patterning and fire frequency will be considered.

  20. Metaproteomics as a Complementary Approach to Gut Microbiota in Health and Disease

    NASA Astrophysics Data System (ADS)

    Petriz, Bernardo A.; Franco, Octávio L.

    2017-01-01

    Classic studies on phylotype profiling are limited to the identification of microbial constituents, where information is lacking about the molecular interaction of these bacterial communities with the host genome and the possible outcomes in host biology. A range of OMICs approaches have provided great progress linking the microbiota to health and disease. However, the investigation of this context through proteomic mass spectrometry-based tools is still being improved. Therefore, metaproteomics or community proteogenomics has emerged as a complementary approach to metagenomic data, as a field in proteomics aiming to perform large-scale characterization of proteins from environmental microbiota such as the human gut. The advances in molecular separation methods coupled with mass spectrometry (e.g. LC-MS/MS) and proteome bioinformatics have been fundamental in these novel large-scale metaproteomic studies, which have further been performed in a wide range of samples including soil, plant and human environments. Metaproteomic studies will make major progress if a comprehensive database covering the genes and expressed proteins from all gut microbial species is developed. To this end, we here present some of the main limitations of metaproteomic studies in complex microbiota environments such as the gut, also addressing the up-to-date pipelines in sample preparation prior to fractionation/separation and mass spectrometry analysis. In addition, a novel approach to the limitations of metagenomic databases is also discussed. Finally, prospects are addressed regarding the application of metaproteomic analysis using a unified host-microbiome gene database and other meta-OMICs platforms.

  1. Large Scale Analyses and Visualization of Adaptive Amino Acid Changes Projects.

    PubMed

    Vázquez, Noé; Vieira, Cristina P; Amorim, Bárbara S R; Torres, André; López-Fernández, Hugo; Fdez-Riverola, Florentino; Sousa, José L R; Reboiro-Jato, Miguel; Vieira, Jorge

    2018-03-01

    When changes at a few amino acid sites are the target of selection, adaptive amino acid changes in protein sequences can be identified using maximum-likelihood methods based on models of codon substitution (such as codeml). Although such methods have been employed numerous times for a variety of organisms, the time needed to collect the data and prepare the input files means that usually only tens or hundreds of coding regions are analyzed. Nevertheless, the recent availability of flexible and easy-to-use computer applications that collect relevant data (such as BDBM) and infer positively selected amino acid sites (such as ADOPS) means that the entire process is easier and quicker than before. However, as reported here, the lack of a batch option in ADOPS still precludes the analysis of hundreds or thousands of sequence files. Given the interest in and possibility of running such large-scale projects, we have also developed a database where ADOPS projects can be stored. Therefore, this study also presents the B+ database, which is both a data repository and a convenient interface for browsing the information contained in ADOPS projects without the need to download and unzip the corresponding project files. The ADOPS projects available at B+ can also be downloaded, unzipped, and opened using the ADOPS graphical interface. The availability of such a database ensures repeatability of results, promotes data reuse with significant savings in the time needed for preparing datasets, and effortlessly allows further exploration of the data contained in ADOPS projects.
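
    A batch driver of the kind the abstract calls for can be reduced to looping over pre-written control files and invoking the codeml executable for each. The sketch below assumes codeml is on the PATH and one .ctl file per data set, with placeholder paths; it is not part of ADOPS or BDBM.

```python
# Minimal batch driver around the codeml executable, in the spirit of the
# large-scale projects discussed above. Assumes codeml is on the PATH and that
# a control file (*.ctl) has already been prepared per data set; the directory
# name is a placeholder, and this is not part of ADOPS or BDBM.
import subprocess
from pathlib import Path

def run_codeml_batch(ctl_dir):
    """Run codeml once per control file, each in its own working directory."""
    for ctl in sorted(Path(ctl_dir).glob("*.ctl")):
        workdir = ctl.parent / ctl.stem          # keep outputs separated per gene
        workdir.mkdir(exist_ok=True)
        result = subprocess.run(
            ["codeml", str(ctl.resolve())],
            cwd=workdir,
            capture_output=True,
            text=True,
        )
        status = "ok" if result.returncode == 0 else "FAILED"
        print(f"{ctl.name}: {status}")

if __name__ == "__main__":
    run_codeml_batch("codeml_projects")  # placeholder directory
```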

  2. [Drug Repositioning Research Utilizing a Large-scale Medical Claims Database to Improve Survival Rates after Cardiopulmonary Arrest].

    PubMed

    Zamami, Yoshito; Niimura, Takahiro; Takechi, Kenshi; Imanishi, Masaki; Koyama, Toshihiro; Ishizawa, Keisuke

    2017-01-01

    Approximately 100,000 people suffer cardiopulmonary arrest in Japan every year, and the aging of society means that this number is expected to increase. Worldwide, approximately 100 million people develop cardiac arrest annually, making it an international issue. Although survival has improved thanks to advances in cardiopulmonary resuscitation, there is a high rate of postresuscitation encephalopathy after the return of spontaneous circulation, and the proportion of patients who can return to normal life is extremely low. Treatment for postresuscitation encephalopathy is long term, and if sequelae persist then nursing care is required, causing immeasurable economic burdens as a result of ballooning medical costs. As there is at present no drug treatment to improve postresuscitation encephalopathy as a complication of cardiopulmonary arrest, the development of novel drug treatments is desirable. In recent years, new efficacies have been discovered for existing drugs used in the clinical setting, and drug repositioning has been proposed as a strategy for developing those drugs as therapeutic agents for different diseases. This review describes a large-scale database study carried out following a discovery strategy for drug repositioning with the objective of improving survival rates after cardiopulmonary arrest and discusses future repositioning prospects.

  3. Managing Large Scale Project Analysis Teams through a Web Accessible Database

    NASA Technical Reports Server (NTRS)

    O'Neil, Daniel A.

    2008-01-01

    Large-scale space programs analyze thousands of requirements while mitigating safety, performance, schedule, and cost risks. These efforts involve a variety of roles with interdependent use cases and goals. For example, study managers and facilitators identify ground rules and assumptions for a collection of studies required for a program or project milestone. Task leaders derive product requirements from the ground rules and assumptions and describe activities to produce needed analytical products. Discipline specialists produce the specified products and load results into a file management system. Organizational and project managers provide the personnel and funds to conduct the tasks. Each role has responsibilities to establish information linkages and provide status reports to management. Projects conduct design and analysis cycles to refine designs to meet the requirements and implement risk mitigation plans. At the program level, integrated design and analysis cycle studies are conducted to eliminate every 'to-be-determined' item and to develop plans to mitigate every risk. At the agency level, strategic studies analyze different approaches to exploration architectures and campaigns. This paper describes a web-accessible database developed by NASA to coordinate and manage tasks at three organizational levels. Other topics in this paper cover integration technologies and techniques for process modeling and enterprise architectures.

  4. The Global Streamflow Indices and Metadata archive (G-SIM): A compilation of global streamflow time series indices and meta-data

    NASA Astrophysics Data System (ADS)

    Do, Hong; Gudmundsson, Lukas; Leonard, Michael; Westra, Seth; Senerivatne, Sonia

    2017-04-01

    In-situ observations of daily streamflow with global coverage are a crucial asset for understanding large-scale freshwater resources, which are an essential component of the Earth system and a prerequisite for societal development. Here we present the Global Streamflow Indices and Metadata archive (G-SIM), a collection of indices derived from more than 20,000 daily streamflow time series across the globe. These indices are designed to support global assessments of change in wet and dry extremes, and have been compiled from 12 free-to-access online databases (seven national databases and five international collections). The G-SIM archive also includes significant metadata to support detailed understanding of streamflow dynamics, including drainage area shapefiles and many essential catchment properties such as land cover type, soil and topographic characteristics. The automated data handling and quality control procedures make G-SIM a reproducible, extendible archive that can be utilised for many purposes in large-scale hydrology. Some potential applications include the identification of observational trends in hydrological extremes, the assessment of climate change impacts on streamflow regimes, and the validation of global hydrological models.
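
    Indices of the kind archived in G-SIM can be derived from a daily flow series with a few lines of pandas; the sketch below uses a synthetic series and three illustrative indices (annual maximum, annual mean, annual 5th percentile), which are not necessarily those defined by the archive.

```python
# Sketch of deriving simple annual indices from a daily streamflow series, of
# the kind collected in an archive like G-SIM. The flow series is synthetic.
import numpy as np
import pandas as pd

# Synthetic daily flow series standing in for one gauge record.
dates = pd.date_range("1980-01-01", "1999-12-31", freq="D")
flow = pd.Series(np.random.default_rng(0).gamma(2.0, 5.0, len(dates)), index=dates)

def streamflow_indices(daily_flow):
    """Annual maximum, annual mean and annual 5th-percentile (low) flow."""
    yearly = daily_flow.groupby(daily_flow.index.year)
    return pd.DataFrame({
        "annual_max": yearly.max(),
        "annual_mean": yearly.mean(),
        "annual_q05": yearly.quantile(0.05),   # low-flow index
    })

print(streamflow_indices(flow).head())
```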

  5. Characterization of the Kenaf (Hibiscus cannabinus) Global Transcriptome Using Illumina Paired-End Sequencing and Development of EST-SSR Markers

    PubMed Central

    Li, Hui; Li, Defang; Chen, Anguo; Tang, Huijuan; Li, Jianjun; Huang, Siqi

    2016-01-01

    Kenaf (Hibiscus cannabinus L.) is an economically important natural fiber crop grown worldwide. However, only 20 expressed sequence tags (ESTs) for kenaf are available in public databases. The aim of this study was to develop large-scale simple sequence repeat (SSR) markers to lay a solid foundation for the construction of genetic linkage maps and marker-assisted breeding in kenaf. We used Illumina paired-end sequencing technology to generate new EST sequences and the MISA software to mine SSR markers. We identified 71,318 unigenes with an average length of 1143 nt and annotated these unigenes using four different protein databases. Overall, 9324 primer pairs were designated as EST-SSR markers, and their quality was validated using 100 randomly selected SSR markers. In total, 72 primer pairs reproducibly amplified target amplicons, and 61 of these primer pairs detected significant polymorphism among 28 kenaf accessions. Thus, in this study, we have developed large-scale SSR markers for kenaf; this new resource will facilitate the construction of genetic linkage maps and the investigation of fiber growth and development in kenaf, and will also be of value for novel gene discovery and functional genomic studies. PMID:26960153
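
    SSR mining of the kind MISA performs amounts to scanning sequences for short motifs repeated in tandem; the sketch below is a toy regular-expression version with arbitrary repeat thresholds and an invented example sequence, not the MISA tool itself.

```python
# Toy SSR search with regular expressions, illustrating the kind of motif
# mining MISA performs on unigene sequences; thresholds and the example
# sequence are arbitrary.
import re

# Di-, tri- and tetranucleotide motifs repeated at least 6, 5 and 5 times.
SSR_PATTERNS = [
    (2, re.compile(r"([ACGT]{2})\1{5,}")),
    (3, re.compile(r"([ACGT]{3})\1{4,}")),
    (4, re.compile(r"([ACGT]{4})\1{4,}")),
]

def find_ssrs(seq):
    """Yield (motif, repeat_count, start) for each simple sequence repeat found."""
    seq = seq.upper()
    for unit_len, pattern in SSR_PATTERNS:
        for m in pattern.finditer(seq):
            motif = m.group(1)
            yield motif, len(m.group(0)) // unit_len, m.start()

example = "TTGACAGAGAGAGAGAGAGATCCATGATGATGATGATGCC"
for motif, n, pos in find_ssrs(example):
    print(f"({motif}){n} at position {pos}")
```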

  6. Systematic analytical characterization of new psychoactive substances: A case study.

    PubMed

    Lobo Vicente, Joana; Chassaigne, Hubert; Holland, Margaret V; Reniero, Fabiano; Kolář, Kamil; Tirendi, Salvatore; Vandecasteele, Ine; Vinckier, Inge; Guillou, Claude

    2016-08-01

    New psychoactive substances (NPS) are synthesized compounds that are not usually covered by European and/or international laws. With a slight alteration in the chemical structure of existing illegal substances registered in the European Union (EU), these NPS circumvent existing controls and are thus referred to as "legal highs". They are becoming increasingly available and can easily be purchased through both the internet and other means (smart shops). Thus, it is essential that the identification of NPS keeps up with this rapidly evolving market. In this case study, the Belgian Customs authorities apprehended a parcel, originating from China, containing two samples, declared as being "white pigments". For routine identification, the Belgian Customs Laboratory first analysed both samples by gas chromatography-mass spectrometry and Fourier-transform infrared spectroscopy. The information obtained by these techniques is essential and can give an indication of the chemical structure of an unknown substance, but not the complete identification of its structure. To bridge this gap, scientific and technical support is ensured by the Joint Research Centre (JRC) to the European Commission Directorate General for Taxation and Customs Union (DG TAXUD) and the Customs Laboratory European Network (CLEN) through an Administrative Arrangement for fast recognition of NPS and identification of unknown chemicals. The samples were sent to the JRC for a complete characterization using advanced techniques and chemoinformatic tools. The aim of this study was also to encourage the development of a science-based, policy-driven approach on NPS. These samples were fully characterized and identified as 5F-AMB and PX-3 using (1)H and (13)C nuclear magnetic resonance (NMR), high-resolution tandem mass spectrometry (HR-MS/MS) and Raman spectroscopy. A chemoinformatic platform was used to manage and unify the analytical data from multiple techniques and instruments and to combine them with chemical and structural information. Copyright © 2016 The Authors. Published by Elsevier Ireland Ltd. All rights reserved.

  7. Chemoinformatics Profiling of the Chromone Nucleus as a MAO-B/A2AAR Dual Binding Scaffold

    PubMed Central

    Cruz-Monteagudo, Maykel; Borges, Fernanda; Cordeiro, M. Natália D. S.; Helguera, Aliuska Morales; Tejera, Eduardo; Paz-y-Miño, Cesar; Sánchez-Rodríguez, Aminael; Perera-Sardiña, Yunier; Perez-Castillo, Yunierkis

    2017-01-01

    Background: In the context of current drug discovery efforts to find disease-modifying therapies for Parkinson's disease (PD), the single-target strategy has proved inefficient. Consequently, the search for multi-potent agents is attracting more and more attention due to the multiple pathogenetic factors implicated in PD. Multiple lines of evidence point to the inhibition of monoamine oxidase B (MAO-B) combined with adenosine A2A receptor (A2AAR) blockade as a promising approach to prevent the neurodegeneration involved in PD. Currently, only two chemical scaffolds have been proposed as potential dual MAO-B inhibitors/A2AAR antagonists (caffeine derivatives and benzothiazinones). Methods: In this study, we conduct a series of chemoinformatics analyses in order to evaluate and advance the potential of the chromone nucleus as a MAO-B/A2AAR dual binding scaffold. Results: The information provided by SAR data mining analysis based on network similarity graphs and molecular docking studies supports the suitability of the chromone nucleus as a potential MAO-B/A2AAR dual binding scaffold. Additionally, a virtual screening tool based on a group fusion similarity search approach was developed for the prioritization of potential MAO-B/A2AAR dual binder candidates. Among several data fusion schemes evaluated, the MEAN-SIM and MIN-RANK GFSS approaches demonstrated to be efficient virtual screening tools. Then, a combinatorial library potentially enriched with MAO-B/A2AAR dual binding chromone derivatives was assembled and sorted by using the MIN-RANK and then the MEAN-SIM GFSS VS approaches. Conclusion: The information and tools provided in this work represent valuable decision making elements in the search for novel chromone derivatives with a favorable dual binding profile as MAO-B inhibitors and A2AAR antagonists with the potential to act as a disease-modifying therapeutic for Parkinson's disease. PMID:28093976
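
    A minimal version of group fusion similarity scoring can be written with RDKit Morgan fingerprints: each candidate is compared against a group of reference actives and the per-reference Tanimoto similarities are fused, here by MEAN and MAX. The SMILES are placeholders and the descriptors and fusion rules are simplified relative to the study above.

```python
# Minimal group-fusion similarity search sketch using RDKit Morgan fingerprints.
# SMILES strings are placeholders; the descriptors and fusion rules are
# simplified relative to the study described above.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def group_fusion_scores(reference_smiles, database_smiles):
    """Return (smiles, mean_fusion, max_fusion) for every database molecule."""
    refs = [fingerprint(s) for s in reference_smiles]
    scores = []
    for smi in database_smiles:
        fp = fingerprint(smi)
        sims = [DataStructs.TanimotoSimilarity(fp, ref) for ref in refs]
        scores.append((smi, sum(sims) / len(sims), max(sims)))
    return sorted(scores, key=lambda x: x[1], reverse=True)   # rank by MEAN fusion

if __name__ == "__main__":
    actives = ["O=C1C=COc2ccccc21", "O=C1C=COc2cc(O)ccc21"]   # placeholder chromone-like SMILES
    candidates = ["O=C1c2ccccc2OC=C1", "CCO", "c1ccccc1"]
    for smi, mean_s, max_s in group_fusion_scores(actives, candidates):
        print(f"{smi}: mean={mean_s:.2f} max={max_s:.2f}")
```

    A MIN-RANK fusion, by contrast, would rank the full database separately against each reference structure and keep each candidate's best (minimum) rank.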

  8. The Camden & Islington Research Database: Using electronic mental health records for research.

    PubMed

    Werbeloff, Nomi; Osborn, David P J; Patel, Rashmi; Taylor, Matthew; Stewart, Robert; Broadbent, Matthew; Hayes, Joseph F

    2018-01-01

    Electronic health records (EHRs) are widely used in mental health services. Case registers using EHRs from secondary mental healthcare have the potential to deliver large-scale projects evaluating mental health outcomes in real-world clinical populations. We describe the Camden and Islington NHS Foundation Trust (C&I) Research Database which uses the Clinical Record Interactive Search (CRIS) tool to extract and de-identify routinely collected clinical information from a large UK provider of secondary mental healthcare, and demonstrate its capabilities to answer a clinical research question regarding time to diagnosis and treatment of bipolar disorder. The C&I Research Database contains records from 108,168 mental health patients, of which 23,538 were receiving active care. The characteristics of the patient population are compared to those of the catchment area, of London, and of England as a whole. The median time to diagnosis of bipolar disorder was 76 days (interquartile range: 17-391) and median time to treatment was 37 days (interquartile range: 5-194). Compulsory admission under the UK Mental Health Act was associated with shorter intervals to diagnosis and treatment. Prior diagnoses of other psychiatric disorders were associated with longer intervals to diagnosis, though prior diagnoses of schizophrenia and related disorders were associated with decreased time to treatment. The CRIS tool, developed by the South London and Maudsley NHS Foundation Trust (SLaM) Biomedical Research Centre (BRC), functioned very well at C&I. It is reassuring that data from different organizations deliver similar results, and that applications developed in one Trust can then be successfully deployed in another. The information can be retrieved in a quicker and more efficient fashion than more traditional methods of health research. The findings support the secondary use of EHRs for large-scale mental health research in naturalistic samples and settings investigated across large, diverse geographical areas.

  9. Automatic initialization and quality control of large-scale cardiac MRI segmentations.

    PubMed

    Albà, Xènia; Lekadir, Karim; Pereañez, Marco; Medrano-Gracia, Pau; Young, Alistair A; Frangi, Alejandro F

    2018-01-01

    Continuous advances in imaging technologies enable ever more comprehensive phenotyping of human anatomy and physiology. Concomitant reduction of imaging costs has resulted in widespread use of imaging in large clinical trials and population imaging studies. Magnetic Resonance Imaging (MRI), in particular, offers one-stop-shop multidimensional biomarkers of cardiovascular physiology and pathology. A wide range of analysis methods offer sophisticated cardiac image assessment and quantification for clinical and research studies. However, most methods have only been evaluated on relatively small databases often not accessible for open and fair benchmarking. Consequently, published performance indices are not directly comparable across studies and their translation and scalability to large clinical trials or population imaging cohorts is uncertain. Most existing techniques still rely on considerable manual intervention for the initialization and quality control of the segmentation process, becoming prohibitive when dealing with thousands of images. The contributions of this paper are three-fold. First, we propose a fully automatic method for initializing cardiac MRI segmentation, by using image features and random forests regression to predict an initial position of the heart and key anatomical landmarks in an MRI volume. In processing a full imaging database, the technique predicts the optimal corrective displacements and positions in relation to the initial rough intersections of the long and short axis images. Second, we introduce for the first time a quality control measure capable of identifying incorrect cardiac segmentations with no visual assessment. The method uses statistical, pattern and fractal descriptors in a random forest classifier to detect failures to be corrected or removed from subsequent statistical analysis. Finally, we validate these new techniques within a full pipeline for cardiac segmentation applicable to large-scale cardiac MRI databases. The results obtained based on over 1200 cases from the Cardiac Atlas Project show the promise of fully automatic initialization and quality control for population studies. Copyright © 2017 Elsevier B.V. All rights reserved.
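
    The two learning steps described above map naturally onto standard scikit-learn estimators: a random-forest regressor predicting landmark coordinates from image features, and a random-forest classifier flagging failed segmentations from shape descriptors. The sketch below uses synthetic features and labels as stand-ins for the paper's actual descriptors and data.

```python
# Hedged sketch of the two learning steps described above, using scikit-learn:
# a random-forest regressor for landmark prediction and a random-forest
# classifier for segmentation quality control. All features and labels here
# are synthetic stand-ins, not the paper's descriptors or data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# --- Initialization: predict (x, y, z) landmark positions from image features.
X_img = rng.normal(size=(500, 32))              # 32 synthetic image features per volume
y_landmarks = rng.normal(size=(500, 3))         # 3 coordinates of one landmark
landmark_model = RandomForestRegressor(n_estimators=100, random_state=0)
landmark_model.fit(X_img[:400], y_landmarks[:400])
print("predicted landmark:", landmark_model.predict(X_img[400:401])[0])

# --- Quality control: classify segmentations as usable (1) or failed (0).
X_shape = rng.normal(size=(500, 10))            # 10 synthetic shape/fractal descriptors
y_ok = rng.integers(0, 2, size=500)             # ground-truth QC labels
qc_model = RandomForestClassifier(n_estimators=200, random_state=0)
qc_model.fit(X_shape[:400], y_ok[:400])
print("QC accuracy on held-out cases:", qc_model.score(X_shape[400:], y_ok[400:]))
```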

  10. Use of relational databases to evaluate regional petroleum accumulation, groundwater flow, and CO2 sequestration in Kansas

    USGS Publications Warehouse

    Carr, T.R.; Merriam, D.F.; Bartley, J.D.

    2005-01-01

    Large-scale relational databases and geographic information system tools are used to integrate temperature, pressure, and water geochemistry data from numerous wells to better understand regional-scale geothermal and hydrogeological regimes of the lower Paleozoic aquifer systems in the mid-continent and to evaluate their potential for geologic CO2 sequestration. The lower Paleozoic (Cambrian to Mississippian) aquifer systems in Kansas, Missouri, and Oklahoma comprise one of the largest regional-scale saline aquifer systems in North America. Understanding hydrologic conditions and processes of these regional-scale aquifer systems provides insight to the evolution of the various sedimentary basins, migration of hydrocarbons out of the Anadarko and Arkoma basins, and the distribution of Arbuckle petroleum reservoirs across Kansas and provides a basis to evaluate CO2 sequestration potential. The Cambrian and Ordovician stratigraphic units form a saline aquifer that is in hydrologic continuity with the freshwater recharge from the Ozark plateau and along the Nemaha anticline. The hydrologic continuity with areas of freshwater recharge provides an explanation for the apparent underpressure in the Arbuckle Group. Copyright © 2005. The American Association of Petroleum Geologists. All rights reserved.

  11. Analysis on the Critical Rainfall Value For Predicting Large Scale Landslides Caused by Heavy Rainfall In Taiwan.

    NASA Astrophysics Data System (ADS)

    Tsai, Kuang-Jung; Chiang, Jie-Lun; Lee, Ming-Hsi; Chen, Yie-Ruey

    2017-04-01

    An accumulated rainfall of more than 2,900 mm was recorded within three consecutive days during Typhoon Morakot in August 2009. Very serious landslides and sediment-related disasters were induced by this heavy rainfall event. The satellite image analysis project conducted by the Soil and Water Conservation Bureau after the Morakot event identified more than 10,904 landslide sites with a total sliding area of 18,113 ha. At the same time, all severe sediment-related disaster areas were characterized by disaster type, scale, topography, major bedrock formations and geologic structures during the period of extremely heavy rainfall events in southern Taiwan. Characteristics and mechanisms of large-scale landslides are collected on the basis of field investigation integrated with GPS/GIS/RS techniques. In order to decrease the risk of large-scale landslides on slope land, a strategy for slope land conservation and a critical rainfall database should be established and put into operation as soon as possible. Meanwhile, establishing the critical rainfall value used for predicting large-scale landslides induced by heavy rainfall has become an important issue of serious concern to the government and the people of Taiwan. The mechanism of large-scale landslides, rainfall frequency analysis, sediment budget estimation and river hydraulic analysis under the extreme climate conditions of the past 10 years are addressed by this research. Hopefully, the results developed from this research can be used as a warning system for predicting large-scale landslides in southern Taiwan. Keywords: heavy rainfall, large-scale landslides, critical rainfall value

  12. ICA model order selection of task co-activation networks.

    PubMed

    Ray, Kimberly L; McKay, D Reese; Fox, Peter M; Riedel, Michael C; Uecker, Angela M; Beckmann, Christian F; Smith, Stephen M; Fox, Peter T; Laird, Angela R

    2013-01-01

    Independent component analysis (ICA) has become a widely used method for extracting functional networks in the brain during rest and task. Historically, preferred ICA dimensionality has widely varied within the neuroimaging community, but typically varies between 20 and 100 components. This can be problematic when comparing results across multiple studies because of the impact ICA dimensionality has on the topology of its resultant components. Recent studies have demonstrated that ICA can be applied to peak activation coordinates archived in a large neuroimaging database (i.e., BrainMap Database) to yield whole-brain task-based co-activation networks. A strength of applying ICA to BrainMap data is that the vast amount of metadata in BrainMap can be used to quantitatively assess tasks and cognitive processes contributing to each component. In this study, we investigated the effect of model order on the distribution of functional properties across networks as a method for identifying the most informative decompositions of BrainMap-based ICA components. Our findings suggest dimensionality of 20 for low model order ICA to examine large-scale brain networks, and dimensionality of 70 to provide insight into how large-scale networks fractionate into sub-networks. We also provide a functional and organizational assessment of visual, motor, emotion, and interoceptive task co-activation networks as they fractionate from low to high model-orders.

  13. ICA model order selection of task co-activation networks

    PubMed Central

    Ray, Kimberly L.; McKay, D. Reese; Fox, Peter M.; Riedel, Michael C.; Uecker, Angela M.; Beckmann, Christian F.; Smith, Stephen M.; Fox, Peter T.; Laird, Angela R.

    2013-01-01

    Independent component analysis (ICA) has become a widely used method for extracting functional networks in the brain during rest and task. Historically, preferred ICA dimensionality has widely varied within the neuroimaging community, but typically varies between 20 and 100 components. This can be problematic when comparing results across multiple studies because of the impact ICA dimensionality has on the topology of its resultant components. Recent studies have demonstrated that ICA can be applied to peak activation coordinates archived in a large neuroimaging database (i.e., BrainMap Database) to yield whole-brain task-based co-activation networks. A strength of applying ICA to BrainMap data is that the vast amount of metadata in BrainMap can be used to quantitatively assess tasks and cognitive processes contributing to each component. In this study, we investigated the effect of model order on the distribution of functional properties across networks as a method for identifying the most informative decompositions of BrainMap-based ICA components. Our findings suggest dimensionality of 20 for low model order ICA to examine large-scale brain networks, and dimensionality of 70 to provide insight into how large-scale networks fractionate into sub-networks. We also provide a functional and organizational assessment of visual, motor, emotion, and interoceptive task co-activation networks as they fractionate from low to high model-orders. PMID:24339802
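
    The effect of model order can be illustrated by decomposing the same data matrix with FastICA at 20 and at 70 components; the sketch below uses scikit-learn on a random matrix standing in for the BrainMap modeled-activation data, so it shows the mechanics only, not the reported networks.

```python
# Toy illustration of the model-order choice discussed above: the same data
# decomposed with FastICA at 20 and at 70 components (scikit-learn). The random
# matrix is a stand-in for the BrainMap data, which it is not.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 200))   # observations x voxels/features (synthetic)

for n_components in (20, 70):
    ica = FastICA(n_components=n_components, random_state=0, max_iter=500)
    sources = ica.fit_transform(data)          # shape: (1000, n_components)
    print(n_components, "components ->", sources.shape, ica.components_.shape)
```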

  14. hEIDI: An Intuitive Application Tool To Organize and Treat Large-Scale Proteomics Data.

    PubMed

    Hesse, Anne-Marie; Dupierris, Véronique; Adam, Claire; Court, Magali; Barthe, Damien; Emadali, Anouk; Masselon, Christophe; Ferro, Myriam; Bruley, Christophe

    2016-10-07

    Advances in high-throughput proteomics have led to a rapid increase in the number, size, and complexity of the associated data sets. Managing and extracting reliable information from such large series of data sets require the use of dedicated software organized in a consistent pipeline to reduce, validate, exploit, and ultimately export data. The compilation of multiple mass-spectrometry-based identification and quantification results obtained in the context of a large-scale project represents a real challenge for developers of bioinformatics solutions. In response to this challenge, we developed a dedicated software suite called hEIDI to manage and combine both identifications and semiquantitative data related to multiple LC-MS/MS analyses. This paper describes how, through a user-friendly interface, hEIDI can be used to compile analyses and retrieve lists of nonredundant protein groups. Moreover, hEIDI allows direct comparison of series of analyses, on the basis of protein groups, while ensuring consistent protein inference and also computing spectral counts. hEIDI ensures that validated results are compliant with MIAPE guidelines as all information related to samples and results is stored in appropriate databases. Thanks to the database structure, validated results generated within hEIDI can be easily exported in the PRIDE XML format for subsequent publication. hEIDI can be downloaded from http://biodev.extra.cea.fr/docs/heidi .
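
    As a rough illustration of the spectral-counting step described above, the following sketch pools peptide-spectrum matches from several analyses and tallies counts per protein group; it assumes a simplified in-memory record format and is not hEIDI's actual implementation.

        # Minimal sketch (not hEIDI itself): pool validated peptide-spectrum matches
        # (PSMs) from several LC-MS/MS analyses and compute spectral counts per
        # protein group and per analysis. Input records are hypothetical.
        from collections import defaultdict

        psms = [
            {"analysis": "run_1", "peptide": "LVNELTEFAK", "proteins": ("P02768",)},
            {"analysis": "run_1", "peptide": "QTALVELVK", "proteins": ("P02768",)},
            {"analysis": "run_2", "peptide": "LVNELTEFAK", "proteins": ("P02768", "P02769")},
        ]

        # Here a "protein group" is simply the set of proteins a PSM maps to;
        # real tools such as hEIDI apply consistent protein-inference rules instead.
        counts = defaultdict(lambda: defaultdict(int))
        for psm in psms:
            group = frozenset(psm["proteins"])
            counts[group][psm["analysis"]] += 1

        for group, per_analysis in sorted(counts.items(), key=lambda kv: sorted(kv[0])):
            print("/".join(sorted(group)), dict(per_analysis))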

  15. Designing for Peta-Scale in the LSST Database

    NASA Astrophysics Data System (ADS)

    Kantor, J.; Axelrod, T.; Becla, J.; Cook, K.; Nikolaev, S.; Gray, J.; Plante, R.; Nieto-Santisteban, M.; Szalay, A.; Thakar, A.

    2007-10-01

    The Large Synoptic Survey Telescope (LSST), a proposed ground-based 8.4 m telescope with a 10 deg^2 field of view, will generate 15 TB of raw images every observing night. When calibration and processed data are added, the image archive, catalogs, and meta-data will grow 15 PB yr^{-1} on average. The LSST Data Management System (DMS) must capture, process, store, index, replicate, and provide open access to this data. Alerts must be triggered within 30 s of data acquisition. To do this in real-time at these data volumes will require advances in data management, database, and file system techniques. This paper describes the design of the LSST DMS and emphasizes features for peta-scale data. The LSST DMS will employ a combination of distributed database and file systems, with schema, partitioning, and indexing oriented for parallel operations. Image files are stored in a distributed file system with references to, and meta-data from, each file stored in the databases. The schema design supports pipeline processing, rapid ingest, and efficient query. Vertical partitioning reduces disk input/output requirements, horizontal partitioning allows parallel data access using arrays of servers and disks. Indexing is extensive, utilizing both conventional RAM-resident indexes and column-narrow, row-deep tag tables/covering indices that are extracted from tables that contain many more attributes. The DMS Data Access Framework is encapsulated in a middleware framework to provide a uniform service interface to all framework capabilities. This framework will provide the automated work-flow, replication, and data analysis capabilities necessary to make data processing and data quality analysis feasible at this scale.
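
    Two of the design ideas described above, horizontal spatial partitioning and narrow tag tables extracted from wide catalogs, can be sketched with an ordinary relational engine. The example below uses SQLite purely as a stand-in; all table, column, and chunking details are hypothetical rather than taken from the LSST schema.

        # Sketch of horizontal (spatial) partitioning plus a column-narrow "tag
        # table" covering index, using SQLite as a stand-in. Names are made up.
        import sqlite3

        con = sqlite3.connect(":memory:")
        cur = con.cursor()

        # Wide catalog table for one spatial chunk (many more attributes in reality).
        cur.execute("""CREATE TABLE object_chunk_042 (
                           object_id INTEGER PRIMARY KEY,
                           ra REAL, decl REAL, flux_g REAL, flux_r REAL, notes TEXT)""")

        # Narrow, heavily indexed tag table holding only the columns hot queries need.
        cur.execute("""CREATE TABLE object_tag (
                           object_id INTEGER, chunk_id INTEGER, ra REAL, decl REAL)""")
        cur.execute("CREATE INDEX idx_tag_radec ON object_tag (ra, decl)")

        def chunk_id(ra, decl, chunks_per_axis=360):
            # Trivial spatial chunking: roughly one chunk per degree of RA/Dec.
            return int(ra) % chunks_per_axis * 1000 + int(decl + 90)

        cur.execute("INSERT INTO object_chunk_042 VALUES (1, 42.3, -12.7, 21.5, 20.9, '')")
        cur.execute("INSERT INTO object_tag VALUES (1, ?, 42.3, -12.7)", (chunk_id(42.3, -12.7),))

        # A positional search touches only the narrow indexed table, not the wide one.
        cur.execute("SELECT object_id FROM object_tag WHERE ra BETWEEN 42 AND 43 AND decl BETWEEN -13 AND -12")
        print(cur.fetchall())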

  16. A comparison of working in small-scale and large-scale nursing homes: A systematic review of quantitative and qualitative evidence.

    PubMed

    Vermeerbergen, Lander; Van Hootegem, Geert; Benders, Jos

    2017-02-01

    Ongoing shortages of care workers, together with an ageing population, make it of utmost importance to increase the quality of working life in nursing homes. Since the 1970s, normalised and small-scale nursing homes have been increasingly introduced to provide care in a family and homelike environment, potentially providing a richer work life for care workers as well as improved living conditions for residents. 'Normalised' refers to the opportunities given to residents to live in a manner as close as possible to the everyday life of persons not needing care. The study purpose is to provide a synthesis and overview of empirical research comparing the quality of working life - together with related work and health outcomes - of professional care workers in normalised small-scale nursing homes as compared to conventional large-scale ones. A systematic review of qualitative and quantitative studies. A systematic literature search (April 2015) was performed using the electronic databases Pubmed, Embase, PsycInfo, CINAHL and Web of Science. References and citations were tracked to identify additional, relevant studies. We identified 825 studies in the selected databases. After checking the inclusion and exclusion criteria, nine studies were selected for review. Two additional studies were selected after reference and citation tracking. Three studies were excluded after requesting more information on the research setting. The findings from the individual studies suggest that levels of job control and job demands (all but "time pressure") are higher in normalised small-scale homes than in conventional large-scale nursing homes. Additionally, some studies suggested that social support and work motivation are higher, while risks of burnout and mental strain are lower, in normalised small-scale nursing homes. Other studies found no differences or even opposing findings. The studies reviewed showed that these inconclusive findings can be attributed to care workers in some normalised small-scale homes experiencing isolation and too high job demands in their work roles. This systematic review suggests that normalised small-scale homes are a good starting point for creating a higher quality of working life in the nursing home sector. Higher job control enables care workers to manage higher job demands in normalised small-scale homes. However, some jobs would benefit from interventions to address care workers' perceptions of too low social support and of too high job demands. More research is needed to examine strategies to enhance these working life issues in normalised small-scale settings. Copyright © 2016 Elsevier Ltd. All rights reserved.

  17. Ice-Accretion Test Results for Three Large-Scale Swept-Wing Models in the NASA Icing Research Tunnel

    NASA Technical Reports Server (NTRS)

    Broeren, Andy P.; Potapczuk, Mark G.; Lee, Sam; Malone, Adam M.; Paul, Bernard P., Jr.; Woodard, Brian S.

    2016-01-01

    Icing simulation tools and computational fluid dynamics codes are reaching levels of maturity such that they are being proposed by manufacturers for use in certification of aircraft for flight in icing conditions with increasingly less reliance on natural-icing flight testing and icing-wind-tunnel testing. Sufficient high-quality data to evaluate the performance of these tools is not currently available. The objective of this work was to generate a database of ice-accretion geometry that can be used for development and validation of icing simulation tools as well as for aerodynamic testing. Three large-scale swept wing models were built and tested at the NASA Glenn Icing Research Tunnel (IRT). The models represented the Inboard (20% semispan), Midspan (64% semispan) and Outboard stations (83% semispan) of a wing based upon a 65% scale version of the Common Research Model (CRM). The IRT models utilized a hybrid design that maintained the full-scale leading-edge geometry with a truncated afterbody and flap. The models were instrumented with surface pressure taps in order to acquire sufficient aerodynamic data to verify the hybrid model design capability to simulate the full-scale wing section. A series of ice-accretion tests were conducted over a range of total temperatures from -23.8 deg C to -1.4 deg C with all other conditions held constant. The results showed the changing ice-accretion morphology from rime ice at the colder temperatures to highly 3-D scallop ice in the range of -11.2 deg C to -6.3 deg C. Warmer temperatures generated highly 3-D ice accretion with glaze ice characteristics. The results indicated that the general scallop ice morphology was similar for all three models. Icing results were documented for limited parametric variations in angle of attack, drop size and cloud liquid-water content (LWC). The effect of velocity on ice accretion was documented for the Midspan and Outboard models for a limited number of test cases. The data suggest that there are morphological characteristics of glaze and scallop ice accretion on these swept-wing models that are dependent upon the velocity. This work has resulted in a large database of ice-accretion geometry on large-scale, swept-wing models.

  18. Ice-Accretion Test Results for Three Large-Scale Swept-Wing Models in the NASA Icing Research Tunnel

    NASA Technical Reports Server (NTRS)

    Broeren, Andy P.; Potapczuk, Mark G.; Lee, Sam; Malone, Adam M.; Paul, Bernard P., Jr.; Woodard, Brian S.

    2016-01-01

    Icing simulation tools and computational fluid dynamics codes are reaching levels of maturity such that they are being proposed by manufacturers for use in certification of aircraft for flight in icing conditions with increasingly less reliance on natural-icing flight testing and icing-wind-tunnel testing. Sufficient high-quality data to evaluate the performance of these tools is not currently available. The objective of this work was to generate a database of ice-accretion geometry that can be used for development and validation of icing simulation tools as well as for aerodynamic testing. Three large-scale swept wing models were built and tested at the NASA Glenn Icing Research Tunnel (IRT). The models represented the Inboard (20 percent semispan), Midspan (64 percent semispan) and Outboard stations (83 percent semispan) of a wing based upon a 65 percent scale version of the Common Research Model (CRM). The IRT models utilized a hybrid design that maintained the full-scale leading-edge geometry with a truncated afterbody and flap. The models were instrumented with surface pressure taps in order to acquire sufficient aerodynamic data to verify the hybrid model design capability to simulate the full-scale wing section. A series of ice-accretion tests were conducted over a range of total temperatures from -23.8 to -1.4 C with all other conditions held constant. The results showed the changing ice-accretion morphology from rime ice at the colder temperatures to highly 3-D scallop ice in the range of -11.2 to -6.3 C. Warmer temperatures generated highly 3-D ice accretion with glaze ice characteristics. The results indicated that the general scallop ice morphology was similar for all three models. Icing results were documented for limited parametric variations in angle of attack, drop size and cloud liquid-water content (LWC). The effect of velocity on ice accretion was documented for the Midspan and Outboard models for a limited number of test cases. The data suggest that there are morphological characteristics of glaze and scallop ice accretion on these swept-wing models that are dependent upon the velocity. This work has resulted in a large database of ice-accretion geometry on large-scale, swept-wing models.

  19. JEnsembl: a version-aware Java API to Ensembl data systems

    PubMed Central

    Paterson, Trevor; Law, Andy

    2012-01-01

    Motivation: The Ensembl Project provides release-specific Perl APIs for efficient high-level programmatic access to data stored in various Ensembl database schema. Although Perl scripts are perfectly suited for processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications nor embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-orientated development of new Bioinformatics tools with which to access, analyse and visualize Ensembl data. Results: The JEnsembl API implementation provides basic data retrieval and manipulation functionality from the Core, Compara and Variation databases for all species in Ensembl and EnsemblGenomes and is a platform for the development of a richer API to Ensembl datasources. The JEnsembl architecture uses a text-based configuration module to provide evolving, versioned mappings from database schema to code objects. A single installation of the JEnsembl API can therefore simultaneously and transparently connect to current and previous database instances (such as those in the public archive) thus facilitating better analysis repeatability and allowing ‘through time’ comparative analyses to be performed. Availability: Project development, released code libraries, Maven repository and documentation are hosted at SourceForge (http://jensembl.sourceforge.net). Contact: jensembl-develop@lists.sf.net, andy.law@roslin.ed.ac.uk, trevor.paterson@roslin.ed.ac.uk PMID:22945789

  20. An alternative to the search for single polymorphisms: toward molecular personality scales for the five-factor model.

    PubMed

    McCrae, Robert R; Scally, Matthew; Terracciano, Antonio; Abecasis, Gonçalo R; Costa, Paul T

    2010-12-01

    There is growing evidence that personality traits are affected by many genes, all of which have very small effects. As an alternative to the largely unsuccessful search for individual polymorphisms associated with personality traits, the authors identified large sets of potentially related single nucleotide polymorphisms (SNPs) and summed them to form molecular personality scales (MPSs) with from 4 to 2,497 SNPs. Scales were derived from two thirds of a large (N = 3,972) sample of individuals from Sardinia who completed the Revised NEO Personality Inventory (P. T. Costa, Jr., & R. R. McCrae, 1992) and were assessed in a genomewide association scan. When MPSs were correlated with the phenotype in the remaining one third of the sample, very small but significant associations were found for 4 of the 5 personality factors when the longest scales were examined. These data suggest that MPSs for Neuroticism, Openness to Experience, Agreeableness, and Conscientiousness (but not Extraversion) contain genetic information that can be refined in future studies, and the procedures described here should be applicable to other quantitative traits. PsycINFO Database Record (c) 2010 APA, all rights reserved.
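
    The scale-construction logic described above can be sketched as follows: estimate per-SNP association signs in a training split, sum signed allele counts in a held-out split, and correlate the resulting score with the phenotype. The code uses random data and arbitrary cut-offs, so it illustrates the procedure rather than reproducing the authors' analysis.

        # Minimal sketch (not the authors' pipeline): build a "molecular personality
        # scale" by summing SNP allele counts, signed by the direction of their
        # association in a training split, then correlate the score with the
        # phenotype in a held-out split. Data here are random and illustrative only.
        import numpy as np

        rng = np.random.default_rng(1)
        n, p = 3000, 500
        genotypes = rng.integers(0, 3, size=(n, p)).astype(float)   # 0/1/2 allele counts
        phenotype = rng.standard_normal(n)                          # e.g. a NEO factor score

        train, test = slice(0, 2000), slice(2000, n)

        # Per-SNP association in the training split (correlation sign and strength).
        g, y = genotypes[train], phenotype[train]
        r = ((g - g.mean(0)) * (y - y.mean())[:, None]).mean(0) / (g.std(0) * y.std() + 1e-12)

        # Keep the most associated SNPs and sum allele counts weighted by sign.
        keep = np.argsort(-np.abs(r))[:100]
        mps = genotypes[test][:, keep] @ np.sign(r[keep])

        print(np.corrcoef(mps, phenotype[test])[0, 1])  # near zero for random data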

  1. A Comprehensive Analysis of Multiscale Field-Aligned Currents: Characteristics, Controlling Parameters, and Relationships

    NASA Astrophysics Data System (ADS)

    McGranaghan, Ryan M.; Mannucci, Anthony J.; Forsyth, Colin

    2017-12-01

    We explore the characteristics, controlling parameters, and relationships of multiscale field-aligned currents (FACs) using a rigorous, comprehensive, and cross-platform analysis. Our unique approach combines FAC data from the Swarm satellites and the Advanced Magnetosphere and Planetary Electrodynamics Response Experiment (AMPERE) to create a database of small-scale (˜10-150 km, <1° latitudinal width), mesoscale (˜150-250 km, 1-2° latitudinal width), and large-scale (>250 km) FACs. We examine these data for the repeatable behavior of FACs across scales (i.e., the characteristics), the dependence on the interplanetary magnetic field orientation, and the degree to which each scale "departs" from nominal large-scale specification. We retrieve new information by utilizing magnetic latitude and local time dependence, correlation analyses, and quantification of the departure of smaller from larger scales. We find that (1) FAC characteristics and dependence on controlling parameters do not map between scales in a straightforward manner, (2) relationships between FAC scales exhibit local time dependence, and (3) the dayside high-latitude region is characterized by remarkably distinct FAC behavior when analyzed at different scales, and the locations of distinction correspond to "anomalous" ionosphere-thermosphere behavior. Comparing with nominal large-scale FACs, we find that differences are characterized by a horseshoe shape, maximizing across dayside local times, and that difference magnitudes increase when smaller-scale observed FACs are considered. We suggest that both new physics and increased resolution of models are required to address the multiscale complexities. We include a summary table of our findings to provide a quick reference for differences between multiscale FACs.

  2. Graph Databases for Large-Scale Healthcare Systems: A Framework for Efficient Data Management and Data Services

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Park, Yubin; Shankar, Mallikarjun; Park, Byung H.

    Designing a database system for both efficient data management and data services has been one of the enduring challenges in the healthcare domain. In many healthcare systems, data services and data management are often viewed as two orthogonal tasks; data services refer to retrieval and analytic queries such as search, joins, statistical data extraction, and simple data mining algorithms, while data management refers to building error-tolerant and non-redundant database systems. The gap between service and management has resulted in rigid database systems and schemas that do not support effective analytics. We compose a rich graph structure from an abstracted healthcare RDBMS to illustrate how we can fill this gap in practice. We show how a healthcare graph can be automatically constructed from a normalized relational database using the proposed 3NF Equivalent Graph (3EG) transformation. We discuss a set of real world graph queries such as finding self-referrals, shared providers, and collaborative filtering, and evaluate their performance over a relational database and its 3EG-transformed graph. Experimental results show that the graph representation serves as multiple de-normalized tables, thus reducing complexity in a database and enhancing data accessibility of users. Based on this finding, we propose an ensemble framework of databases for healthcare applications.
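
    A conceptual analogue of the relational-to-graph transformation described above (not the paper's 3EG implementation) is sketched below with networkx: rows from a made-up normalized claims schema become nodes and edges, so a query such as "shared providers" reduces to a neighborhood lookup.

        # Conceptual sketch only: turn rows from a normalized claims schema into a
        # property graph so that "shared providers" becomes a neighborhood query.
        # Table contents and attribute names are made up.
        import networkx as nx

        patients = [("pat_1", {"age": 64}), ("pat_2", {"age": 57})]
        providers = [("prov_9", {"specialty": "cardiology"})]
        claims = [  # foreign keys patient_id -> provider_id become edges
            {"patient_id": "pat_1", "provider_id": "prov_9", "code": "99213"},
            {"patient_id": "pat_2", "provider_id": "prov_9", "code": "99214"},
        ]

        G = nx.Graph()
        G.add_nodes_from(patients, kind="patient")
        G.add_nodes_from(providers, kind="provider")
        for c in claims:
            G.add_edge(c["patient_id"], c["provider_id"], code=c["code"])

        # "Shared providers": patients connected to the same provider node.
        shared = [p for p in G.neighbors("prov_9") if G.nodes[p]["kind"] == "patient"]
        print(shared)  # ['pat_1', 'pat_2']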

  3. Changes in the High-Latitude Topside Ionospheric Vertical Electron-Density Profiles in Response to Solar-Wind Perturbations During Large Magnetic Storms

    NASA Technical Reports Server (NTRS)

    Benson, Robert F.; Fainberg, Joseph; Osherovich, Vladimir; Truhlik, Vladimir; Wang, Yongli; Arbacher, Becca

    2011-01-01

    The latest results from an investigation to establish links between solar-wind and topside-ionospheric parameters will be presented including a case where high-latitude topside electron-density Ne(h) profiles indicated dramatic rapid changes in the scale height during the main phase of a large magnetic storm (Dst < -200 nT). These scale-height changes suggest a large heat input to the topside ionosphere at this time. The topside profiles were derived from ISIS-1 digital ionograms obtained from the NASA Space Physics Data Facility (SPDF) Coordinated Data Analysis Web (CDA Web). Solar-wind data obtained from the NASA OMNIWeb database indicated that the magnetic storm was due to a magnetic cloud. This event is one of several large magnetic storms being investigated during the interval from 1965 to 1984 when both solar-wind and digital topside ionograms, from either Alouette-2, ISIS-1, or ISIS-2, are potentially available.

  4. Introduction to the special section "Big'er' Data": Scaling up psychotherapy research in counseling psychology.

    PubMed

    Owen, Jesse; Imel, Zac E

    2016-04-01

    This article introduces the special section on utilizing large data sets to explore psychotherapy processes and outcomes. The increased use of technology has provided new opportunities for psychotherapy researchers. In particular, there is a rise in large databases of tens of thousands of clients. Additionally, there are new ways to pool valuable resources for meta-analytic processes. At the same time, these tools also come with limitations. These issues are introduced, along with a brief overview of the articles. (c) 2016 APA, all rights reserved.

  5. The GalICS Project: Virtual Galaxies from Cosmological N-body Simulations

    NASA Astrophysics Data System (ADS)

    Guiderdoni, B.

    The GalICS project develops extensive semi-analytic post-processing of large cosmological simulations to describe hierarchical galaxy formation. The multiwavelength statistical properties of high-redshift and local galaxies are predicted within the large-scale structures. The fake catalogs and mock images that are generated from the outputs are used for the analysis and preparation of deep surveys. The whole set of results is now available in an on-line database that can be easily queried. The GalICS project represents a first step towards a 'Virtual Observatory of virtual galaxies'.

  6. GenomeVista

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Poliakov, Alexander; Couronne, Olivier

    2002-11-04

    Aligning large vertebrate genomes that are structurally complex poses a variety of problems not encountered on smaller scales. Such genomes are rich in repetitive elements and contain multiple segmental duplications, which increases the difficulty of identifying true orthologous DNA segments in alignments. The sizes of the sequences make many alignment algorithms designed for comparing single proteins extremely inefficient when processing large genomic intervals. We integrated both local and global alignment tools and developed a suite of programs for automatically aligning large vertebrate genomes and identifying conserved non-coding regions in the alignments. Our method uses the BLAT local alignment program to find anchors on the base genome to identify regions of possible homology for a query sequence. These regions are postprocessed to find the best candidates, which are then globally aligned using the AVID global alignment program. In the last step conserved non-coding segments are identified using VISTA. Our methods are fast and the resulting alignments exhibit a high degree of sensitivity, covering more than 90% of known coding exons in the human genome. The GenomeVISTA software is a suite of Perl programs that is built on a MySQL database platform. The scheduler gets control data from the database, builds a queue of jobs, and dispatches them to a PC cluster for execution. The main program, running on each node of the cluster, processes individual sequences. A Perl library acts as an interface between the database and the above programs. The use of a separate library allows the programs to function independently of the database schema. The library also improves on the standard Perl MySQL database interface package by providing auto-reconnect functionality and improved error handling.

  7. Asymmetric distances for binary embeddings.

    PubMed

    Gordo, Albert; Perronnin, Florent; Gong, Yunchao; Lazebnik, Svetlana

    2014-01-01

    In large-scale query-by-example retrieval, embedding image signatures in a binary space offers two benefits: data compression and search efficiency. While most embedding algorithms binarize both query and database signatures, it has been noted that this is not strictly a requirement. Indeed, asymmetric schemes that binarize the database signatures but not the query still enjoy the same two benefits but may provide superior accuracy. In this work, we propose two general asymmetric distances that are applicable to a wide variety of embedding techniques including locality sensitive hashing (LSH), locality sensitive binary codes (LSBC), spectral hashing (SH), PCA embedding (PCAE), PCAE with random rotations (PCAE-RR), and PCAE with iterative quantization (PCAE-ITQ). We experiment on four public benchmarks containing up to 1M images and show that the proposed asymmetric distances consistently lead to large improvements over the symmetric Hamming distance for all binary embedding techniques.
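
    The core idea, binarizing only the database side while keeping the query real-valued, can be shown generically with NumPy. The sketch below contrasts the symmetric Hamming distance with a simple asymmetric distance to sign-code reconstructions; it is an illustration of the principle, not the paper's specific PCAE or LSH formulations.

        # Generic illustration (not the paper's exact formulations): compare the
        # symmetric Hamming distance between binarized signatures with an
        # asymmetric distance that binarizes only the database side.
        import numpy as np

        rng = np.random.default_rng(2)
        d, n = 64, 10_000
        database = rng.standard_normal((n, d))
        query = rng.standard_normal(d)

        db_codes = (database > 0).astype(np.int8)       # binarized database signatures
        q_code = (query > 0).astype(np.int8)            # binarized query (symmetric case)

        hamming = (db_codes != q_code).sum(axis=1)      # symmetric: both sides binary

        # Asymmetric: squared Euclidean distance from the real-valued query to the
        # reconstruction of each binary code (codes mapped back to {-1, +1}).
        recon = db_codes * 2.0 - 1.0
        asym = ((query - recon) ** 2).sum(axis=1)

        # Compare both rankings against distances in the original real-valued space;
        # the asymmetric distance typically tracks the true distances more closely.
        true = ((query - database) ** 2).sum(axis=1)
        print(np.corrcoef(true, hamming)[0, 1], np.corrcoef(true, asym)[0, 1])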

  8. RBscore&NBench: a high-level web server for nucleic acid binding residues prediction with a large-scale benchmarking database.

    PubMed

    Miao, Zhichao; Westhof, Eric

    2016-07-08

    RBscore&NBench combines a web server, RBscore and a database, NBench. RBscore predicts RNA-/DNA-binding residues in proteins and visualizes the prediction scores and features on protein structures. The scoring scheme of RBscore directly links feature values to nucleic acid binding probabilities and illustrates the nucleic acid binding energy funnel on the protein surface. To avoid dataset, binding site definition and assessment metric biases, we compared RBscore with 18 web servers and 3 stand-alone programs on 41 datasets, which demonstrated the high and stable accuracy of RBscore. A comprehensive comparison led us to develop a benchmark database named NBench. The web server is available on: http://ahsoka.u-strasbg.fr/rbscorenbench/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  9. High-Performance Secure Database Access Technologies for HEP Grids

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Matthew Vranicar; John Weicher

    2006-04-17

    The Large Hadron Collider (LHC) at the CERN Laboratory will become the largest scientific instrument in the world when it starts operations in 2007. Large Scale Analysis Computer Systems (computational grids) are required to extract rare signals of new physics from petabytes of LHC detector data. In addition to file-based event data, LHC data processing applications require access to large amounts of data in relational databases: detector conditions, calibrations, etc. U.S. high energy physicists demand efficient performance of grid computing applications in LHC physics research where world-wide remote participation is vital to their success. To empower physicists with data-intensive analysis capabilities, a whole hyperinfrastructure of distributed databases cross-cuts a multi-tier hierarchy of computational grids. The crosscutting allows separation of concerns across both the global environment of a federation of computational grids and the local environment of a physicist’s computer used for analysis. Very few efforts are on-going in the area of database and grid integration research. Most of these are outside of the U.S. and rely on traditional approaches to secure database access via an extraneous security layer separate from the database system core, preventing efficient data transfers. Our findings are shared by the Database Access and Integration Services Working Group of the Global Grid Forum, who states that "Research and development activities relating to the Grid have generally focused on applications where data is stored in files. However, in many scientific and commercial domains, database management systems have a central role in data storage, access, organization, authorization, etc, for numerous applications.” There is a clear opportunity for a technological breakthrough, requiring innovative steps to provide high-performance secure database access technologies for grid computing. We believe that an innovative database architecture where the secure authorization is pushed into the database engine will eliminate inefficient data transfer bottlenecks. Furthermore, traditionally separated database and security layers provide an extra vulnerability, leaving a weak clear-text password authorization as the only protection on the database core systems. Due to the legacy limitations of the systems’ security models, the allowed passwords often cannot even comply with the DOE password guideline requirements. We see an opportunity for the tight integration of the secure authorization layer with the database server engine resulting in both improved performance and improved security. Phase I has focused on the development of a proof-of-concept prototype using Argonne National Laboratory’s (ANL) Argonne Tandem-Linac Accelerator System (ATLAS) project as a test scenario. By developing a grid-security enabled version of the ATLAS project’s current relational database solution, MySQL, PIOCON Technologies aims to offer a more efficient solution to secure database access.

  10. Continuous evolutionary change in Plio-Pleistocene mammals of eastern Africa

    NASA Astrophysics Data System (ADS)

    Bibi, Faysal; Kiessling, Wolfgang

    2015-08-01

    Much debate has revolved around the question of whether the mode of evolutionary and ecological turnover in the fossil record of African mammals was continuous or pulsed, and the degree to which faunal turnover tracked changes in global climate. Here, we assembled and analyzed large specimen databases of the fossil record of eastern African Bovidae (antelopes) and Turkana Basin large mammals. Our results indicate that speciation and extinction proceeded continuously throughout the Pliocene and Pleistocene, as did increases in the relative abundance of arid-adapted bovids, and in bovid body mass. Species durations were similar among clades with different ecological attributes. Occupancy patterns were unimodal, with long and nearly symmetrical origination and extinction phases. A single origination pulse may be present at 2.0-1.75 Ma, but besides this, there is no evidence that evolutionary or ecological changes in the eastern African record tracked rapid, 100,000-y-scale changes in global climate. Rather, eastern African large mammal evolution tracked global or regional climatic trends at long (million year) time scales, while local, basin-scale changes (e.g., tectonic or hydrographic) and biotic interactions ruled at shorter timescales.

  11. Breaking free from chemical spreadsheets.

    PubMed

    Segall, Matthew; Champness, Ed; Leeding, Chris; Chisholm, James; Hunt, Peter; Elliott, Alex; Garcia-Martinez, Hector; Foster, Nick; Dowling, Samuel

    2015-09-01

    Drug discovery scientists often consider compounds and data in terms of groups, such as chemical series, and relationships, representing similarity or structural transformations, to aid compound optimisation. This is often supported by chemoinformatics algorithms, for example clustering and matched molecular pair analysis. However, chemistry software packages commonly present these data as spreadsheets or form views that make it hard to find relevant patterns or compare related compounds conveniently. Here, we review common data visualisation and analysis methods used to extract information from chemistry data. We introduce a new framework that enables scientists to work flexibly with drug discovery data to reflect their thought processes and interact with the output of algorithms to identify key structure-activity relationships and guide further optimisation intuitively. Copyright © 2015 Elsevier Ltd. All rights reserved.

  12. Large-Scale Quantitative Analysis of Painting Arts

    PubMed Central

    Kim, Daniel; Son, Seung-Woo; Jeong, Hawoong

    2014-01-01

    Scientists have made efforts to understand the beauty of painting art in their own languages. As digital image acquisition of painting arts has made rapid progress, researchers have come to a point where it is possible to perform statistical analysis of a large-scale database of artistic paintings to make a bridge between art and science. Using digital image processing techniques, we investigate three quantitative measures of images – the usage of individual colors, the variety of colors, and the roughness of the brightness. We found a difference in color usage between classical paintings and photographs, and a significantly low color variety of the medieval period. Interestingly, moreover, the increment of roughness exponent as painting techniques such as chiaroscuro and sfumato have advanced is consistent with historical circumstances. PMID:25501877
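
    Two of the three image measures mentioned above are simple to compute; the sketch below quantizes colors to count usage and variety, and uses a crude gradient statistic as a stand-in for the roughness analysis (the published roughness exponent requires a proper scaling estimate). The input image is random, not a digitized painting.

        # Simple sketch (not the authors' exact estimators): quantify color usage
        # and color variety for one RGB image with NumPy. The "image" is random;
        # in practice it would be a digitized painting.
        import numpy as np

        rng = np.random.default_rng(3)
        img = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)

        # Quantize each channel to 4 bits so color usage is a tractable histogram.
        quantized = (img >> 4).reshape(-1, 3).astype(int)
        codes = quantized[:, 0] * 256 + quantized[:, 1] * 16 + quantized[:, 2]

        values, counts = np.unique(codes, return_counts=True)
        p = counts / counts.sum()

        color_variety = len(values)                      # number of distinct colors used
        color_entropy = -(p * np.log2(p)).sum()          # evenness of color usage

        brightness = img.mean(axis=2)
        roughness_proxy = np.abs(np.diff(brightness, axis=1)).mean()  # crude proxy, not the exponent

        print(color_variety, round(color_entropy, 2), round(roughness_proxy, 2))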

  13. Large-eddy simulation of the urban boundary layer in the MEGAPOLI Paris Plume experiment

    NASA Astrophysics Data System (ADS)

    Esau, Igor

    2010-05-01

    This study presents results from the specific large-eddy simulation study of the urban boundary layer in the MEGAPOLI Paris Plume field campaign. We used the LESNIC and PALM codes, the MEGAPOLI city morphology database, nudging to the observed meteorological conditions during the Paris Plume campaign and some concentration measurements from that campaign to simulate and better understand the nature of the urban boundary layer on scales larger than the street canyon scales. Primary attention was paid to turbulence self-organization and structure-to-surface interaction. The study was aimed at demonstrating feasibility and estimating the resources required for such research. Therefore, at this stage we do not compare the simulation with other relevant studies, nor do we formulate theoretical conclusions.

  14. Impact of systemic sclerosis oral manifestations on patients' health-related quality of life: a systematic review.

    PubMed

    Smirani, Rawen; Truchetet, Marie-Elise; Poursac, Nicolas; Naveau, Adrien; Schaeverbeke, Thierry; Devillard, Raphaël

    2018-06-01

    Oropharyngeal features are frequent and often understated in the treatment clinical guidelines of systemic sclerosis, in spite of important consequences on comfort, esthetics, nutrition and daily life. The aim of this systematic review was to assess a correlation between the oropharyngeal manifestations of systemic sclerosis and patients' health-related quality of life. A systematic search was conducted using four databases [PubMed®, Cochrane Database®, Dentistry & Oral Sciences Source®, and SCOPUS®] up to January 2018, according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). Grey literature and hand search were also included. Study selection, risk of bias assessment (Newcastle-Ottawa scale) and data extraction were performed by two independent reviewers. The review protocol was registered on the PROSPERO database with the code CRD42018085994. From 375 screened studies, 6 cross-sectional studies were included in the systematic review. The total number of patients included per study ranged from 84 to 178. These studies reported a statistically significant association between oropharyngeal manifestations of systemic sclerosis (mainly assessed by maximal mouth opening and the Mouth Handicap in Systemic Sclerosis scale) and an impaired quality of life (measured by different scales). Studies were unequal concerning risk of bias, mostly because of low level of evidence, different recruiting sources of samples, and different scales to assess the quality of life. This systematic review demonstrates a correlation between oropharyngeal manifestations of systemic sclerosis and impaired quality of life, despite the low level of evidence of the included studies. Large-scale studies are needed to provide stronger evidence of this association. This article is protected by copyright. All rights reserved.

  15. Large-scale patterns of insect and disease activity in the conterminous United States and Alaska from the National Insect and Disease Detection Survey Database, 2010

    Treesearch

    Kevin M. Potter; Jeanine L. Paschke

    2013-01-01

    Analyzing patterns of forest pest infestations, disease occurrences, forest declines and related biotic stress factors is necessary to monitor the health of forested ecosystems and their potential impacts on forest structure, composition, biodiversity, and species distributions (Castello and others 1995). Introduced nonnative insects and diseases, in particular, can...

  16. Stability and Change in Interests: A Longitudinal Study of Adolescents from Grades 8 through 12

    ERIC Educational Resources Information Center

    Tracey, Terence J. G.; Robbins, Steven B.; Hofsess, Christy D.

    2005-01-01

    The patterns of RIASEC interests and academic skills were assessed longitudinally from a large-scale national database at three time points: 8th grade, 10th grade, and 12th grade. Validation and cross-validation samples of 1000 males and 1000 females in each set were used to test the pattern of these scores over time relative to mean changes,…

  17. Statistical Literacy in Data Revolution Era: Building Blocks and Instructional Dilemmas

    ERIC Educational Resources Information Center

    Prodromou, Theodosia; Dunne, Tim

    2017-01-01

    The data revolution has given citizens access to enormous large-scale open databases. In order to take into account the full complexity of data, we have to change the way we think in terms of the nature of data and its availability, the ways in which it is displayed and used, and the skills that are required for its interpretation. Substantial…

  18. MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring

    NASA Technical Reports Server (NTRS)

    Saeed, M.; Lieu, C.; Raber, G.; Mark, R. G.

    2002-01-01

    Development and evaluation of Intensive Care Unit (ICU) decision-support systems would be greatly facilitated by the availability of a large-scale ICU patient database. Following our previous efforts with the MIMIC (Multi-parameter Intelligent Monitoring for Intensive Care) Database, we have leveraged advances in networking and storage technologies to develop a far more massive temporal database, MIMIC II. MIMIC II is an ongoing effort: data is continuously and prospectively archived from all ICU patients in our hospital. MIMIC II now consists of over 800 ICU patient records including over 120 gigabytes of data and is growing. A customized archiving system was used to store continuously up to four waveforms and 30 different parameters from ICU patient monitors. An integrated user-friendly relational database was developed for browsing of patients' clinical information (lab results, fluid balance, medications, nurses' progress notes). Based upon its unprecedented size and scope, MIMIC II will prove to be an important resource for intelligent patient monitoring research, and will support efforts in medical data mining and knowledge-discovery.

  19. A geospatial database model for the management of remote sensing datasets at multiple spectral, spatial, and temporal scales

    NASA Astrophysics Data System (ADS)

    Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George

    2017-10-01

    In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata is presented. To develop the multi-source geospatial database we have used a Relational Database Management System (RDBMS) on a Structured Query Language (SQL) server which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information were linked (e.g. location, date, and instrument). Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.

  20. The Coral Trait Database, a curated database of trait information for coral species from the global oceans

    NASA Astrophysics Data System (ADS)

    Madin, Joshua S.; Anderson, Kristen D.; Andreasen, Magnus Heide; Bridge, Tom C. L.; Cairns, Stephen D.; Connolly, Sean R.; Darling, Emily S.; Diaz, Marcela; Falster, Daniel S.; Franklin, Erik C.; Gates, Ruth D.; Hoogenboom, Mia O.; Huang, Danwei; Keith, Sally A.; Kosnik, Matthew A.; Kuo, Chao-Yang; Lough, Janice M.; Lovelock, Catherine E.; Luiz, Osmar; Martinelli, Julieta; Mizerek, Toni; Pandolfi, John M.; Pochon, Xavier; Pratchett, Morgan S.; Putnam, Hollie M.; Roberts, T. Edward; Stat, Michael; Wallace, Carden C.; Widman, Elizabeth; Baird, Andrew H.

    2016-03-01

    Trait-based approaches advance ecological and evolutionary research because traits provide a strong link to an organism’s function and fitness. Trait-based research might lead to a deeper understanding of the functions of, and services provided by, ecosystems, thereby improving management, which is vital in the current era of rapid environmental change. Coral reef scientists have long collected trait data for corals; however, these are difficult to access and often under-utilized in addressing large-scale questions. We present the Coral Trait Database initiative that aims to bring together physiological, morphological, ecological, phylogenetic and biogeographic trait information into a single repository. The database houses species- and individual-level data from published field and experimental studies alongside contextual data that provide important framing for analyses. In this data descriptor, we release data for 56 traits for 1547 species, and present a collaborative platform on which other trait data are being actively federated. Our overall goal is for the Coral Trait Database to become an open-source, community-led data clearinghouse that accelerates coral reef research.

  1. The Coral Trait Database, a curated database of trait information for coral species from the global oceans

    PubMed Central

    Madin, Joshua S.; Anderson, Kristen D.; Andreasen, Magnus Heide; Bridge, Tom C.L.; Cairns, Stephen D.; Connolly, Sean R.; Darling, Emily S.; Diaz, Marcela; Falster, Daniel S.; Franklin, Erik C.; Gates, Ruth D.; Hoogenboom, Mia O.; Huang, Danwei; Keith, Sally A.; Kosnik, Matthew A.; Kuo, Chao-Yang; Lough, Janice M.; Lovelock, Catherine E.; Luiz, Osmar; Martinelli, Julieta; Mizerek, Toni; Pandolfi, John M.; Pochon, Xavier; Pratchett, Morgan S.; Putnam, Hollie M.; Roberts, T. Edward; Stat, Michael; Wallace, Carden C.; Widman, Elizabeth; Baird, Andrew H.

    2016-01-01

    Trait-based approaches advance ecological and evolutionary research because traits provide a strong link to an organism’s function and fitness. Trait-based research might lead to a deeper understanding of the functions of, and services provided by, ecosystems, thereby improving management, which is vital in the current era of rapid environmental change. Coral reef scientists have long collected trait data for corals; however, these are difficult to access and often under-utilized in addressing large-scale questions. We present the Coral Trait Database initiative that aims to bring together physiological, morphological, ecological, phylogenetic and biogeographic trait information into a single repository. The database houses species- and individual-level data from published field and experimental studies alongside contextual data that provide important framing for analyses. In this data descriptor, we release data for 56 traits for 1547 species, and present a collaborative platform on which other trait data are being actively federated. Our overall goal is for the Coral Trait Database to become an open-source, community-led data clearinghouse that accelerates coral reef research. PMID:27023900

  2. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W

    2010-01-01

    GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bi-monthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI homepage: www.ncbi.nlm.nih.gov.
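
    Programmatic retrieval of individual GenBank records is commonly done through the NCBI Entrez E-utilities that sit behind the Entrez system mentioned above. The sketch below issues an efetch request with urllib; the parameters follow the widely documented pattern and the accession is the standard NCBI example, but current NCBI usage guidelines (API keys, rate limits) should be checked before relying on it.

        # Illustrative sketch of fetching one GenBank flat-file record via the
        # NCBI Entrez E-utilities (efetch). Requires network access; consult the
        # current NCBI documentation for rate limits and API-key requirements.
        from urllib.parse import urlencode
        from urllib.request import urlopen

        params = {
            "db": "nuccore",        # nucleotide database
            "id": "U49845",         # example accession used in NCBI documentation
            "rettype": "gb",        # GenBank flat-file format
            "retmode": "text",
        }
        url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + urlencode(params)

        with urlopen(url) as response:
            record = response.read().decode()

        print(record.splitlines()[0])   # LOCUS line of the GenBank record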

  3. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W

    2009-01-01

    GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank(R) staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.

  4. The Coral Trait Database, a curated database of trait information for coral species from the global oceans.

    PubMed

    Madin, Joshua S; Anderson, Kristen D; Andreasen, Magnus Heide; Bridge, Tom C L; Cairns, Stephen D; Connolly, Sean R; Darling, Emily S; Diaz, Marcela; Falster, Daniel S; Franklin, Erik C; Gates, Ruth D; Harmer, Aaron; Hoogenboom, Mia O; Huang, Danwei; Keith, Sally A; Kosnik, Matthew A; Kuo, Chao-Yang; Lough, Janice M; Lovelock, Catherine E; Luiz, Osmar; Martinelli, Julieta; Mizerek, Toni; Pandolfi, John M; Pochon, Xavier; Pratchett, Morgan S; Putnam, Hollie M; Roberts, T Edward; Stat, Michael; Wallace, Carden C; Widman, Elizabeth; Baird, Andrew H

    2016-03-29

    Trait-based approaches advance ecological and evolutionary research because traits provide a strong link to an organism's function and fitness. Trait-based research might lead to a deeper understanding of the functions of, and services provided by, ecosystems, thereby improving management, which is vital in the current era of rapid environmental change. Coral reef scientists have long collected trait data for corals; however, these are difficult to access and often under-utilized in addressing large-scale questions. We present the Coral Trait Database initiative that aims to bring together physiological, morphological, ecological, phylogenetic and biogeographic trait information into a single repository. The database houses species- and individual-level data from published field and experimental studies alongside contextual data that provide important framing for analyses. In this data descriptor, we release data for 56 traits for 1547 species, and present a collaborative platform on which other trait data are being actively federated. Our overall goal is for the Coral Trait Database to become an open-source, community-led data clearinghouse that accelerates coral reef research.

  5. The MIND PALACE: A Multi-Spectral Imaging and Spectroscopy Database for Planetary Science

    NASA Astrophysics Data System (ADS)

    Eshelman, E.; Doloboff, I.; Hara, E. K.; Uckert, K.; Sapers, H. M.; Abbey, W.; Beegle, L. W.; Bhartia, R.

    2017-12-01

    The Multi-Instrument Database (MIND) is the web-based home to a well-characterized set of analytical data collected by a suite of deep-UV fluorescence/Raman instruments built at the Jet Propulsion Laboratory (JPL). Samples derive from a growing body of planetary surface analogs, mineral and microbial standards, meteorites, spacecraft materials, and other astrobiologically relevant materials. In addition to deep-UV spectroscopy, datasets stored in MIND are obtained from a variety of analytical techniques obtained over multiple spatial and spectral scales including electron microscopy, optical microscopy, infrared spectroscopy, X-ray fluorescence, and direct fluorescence imaging. Multivariate statistical analysis techniques, primarily Principal Component Analysis (PCA), are used to guide interpretation of these large multi-analytical spectral datasets. Spatial co-referencing of integrated spectral/visual maps is performed using QGIS (geographic information system software). Georeferencing techniques transform individual instrument data maps into a layered co-registered data cube for analysis across spectral and spatial scales. The body of data in MIND is intended to serve as a permanent, reliable, and expanding database of deep-UV spectroscopy datasets generated by this unique suite of JPL-based instruments on samples of broad planetary science interest.

  6. Development of a database for the verification of trans-ionospheric remote sensing systems

    NASA Astrophysics Data System (ADS)

    Leitinger, R.

    2005-08-01

    Remote sensing systems need verification by means of in-situ data or by means of model data. In the case of ionospheric occultation inversion, ionosphere tomography and other imaging methods on the basis of satellite-to-ground or satellite-to-satellite electron content, the availability of in-situ data with adequate spatial and temporal co-location is a very rare case, indeed. Therefore the method of choice for verification is to produce artificial electron content data with realistic properties, subject these data to the inversion/retrieval method, compare the results with model data and apply a suitable type of “goodness of fit” classification. Inter-comparison of inversion/retrieval methods should be done with sets of artificial electron contents in a “blind” (or even “double blind”) way. The set up of a relevant database for the COST 271 Action is described. One part of the database will be made available to everyone interested in testing of inversion/retrieval methods. The artificial electron content data are calculated by means of large-scale models that are “modulated” in a realistic way to include smaller scale and dynamic structures, like troughs and traveling ionospheric disturbances.

  7. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L

    2007-01-01

    GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage (www.ncbi.nlm.nih.gov).

  8. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L

    2005-01-01

    GenBank is a comprehensive database that contains publicly available DNA sequences for more than 165,000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in the UK and the DNA Data Bank of Japan helps to ensure worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov.

  9. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L

    2006-01-01

    GenBank (R) is a comprehensive database that contains publicly available DNA sequences for more than 205 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the Web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at www.ncbi.nlm.nih.gov.

  10. A psycholinguistic database for traditional Chinese character naming.

    PubMed

    Chang, Ya-Ning; Hsu, Chun-Hsien; Tsai, Jie-Li; Chen, Chien-Liang; Lee, Chia-Ying

    2016-03-01

    In this study, we aimed to provide a large-scale set of psycholinguistic norms for 3,314 traditional Chinese characters, along with their naming reaction times (RTs), collected from 140 Chinese speakers. The lexical and semantic variables in the database include frequency, regularity, familiarity, consistency, number of strokes, homophone density, semantic ambiguity rating, phonetic combinability, semantic combinability, and the number of disyllabic compound words formed by a character. Multiple regression analyses were conducted to examine the predictive powers of these variables for the naming RTs. The results demonstrated that these variables could account for a significant portion of variance (55.8%) in the naming RTs. An additional multiple regression analysis was conducted to demonstrate the effects of consistency and character frequency. Overall, the regression results were consistent with the findings of previous studies on Chinese character naming. This database should be useful for research into Chinese language processing, Chinese education, or cross-linguistic comparisons. The database can be accessed via an online inquiry system (http://ball.ling.sinica.edu.tw/namingdatabase/index.html).
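
    The regression analysis summarized above can be sketched in a few lines: assemble a design matrix of predictors, fit by least squares, and report the proportion of variance explained. The sketch below uses random stand-in data and only three of the database's predictors, so the coefficients and R^2 are illustrative only.

        # Minimal sketch of the multiple-regression step: regress naming RTs on a
        # few lexical predictors and report R^2. Data are random stand-ins, not
        # the published norms.
        import numpy as np

        rng = np.random.default_rng(4)
        n = 3314
        frequency = rng.standard_normal(n)      # e.g. standardized log character frequency
        strokes = rng.standard_normal(n)        # standardized number of strokes
        consistency = rng.standard_normal(n)
        rt = 600 - 20 * frequency - 10 * consistency + rng.normal(0, 50, n)

        X = np.column_stack([np.ones(n), frequency, strokes, consistency])
        beta, *_ = np.linalg.lstsq(X, rt, rcond=None)

        pred = X @ beta
        r2 = 1 - ((rt - pred) ** 2).sum() / ((rt - rt.mean()) ** 2).sum()
        print(beta.round(2), round(r2, 3))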

  11. EDULISS: a small-molecule database with data-mining and pharmacophore searching capabilities

    PubMed Central

    Hsin, Kun-Yi; Morgan, Hugh P.; Shave, Steven R.; Hinton, Andrew C.; Taylor, Paul; Walkinshaw, Malcolm D.

    2011-01-01

    We present the relational database EDULISS (EDinburgh University Ligand Selection System), which stores structural, physicochemical and pharmacophoric properties of small molecules. The database comprises a collection of over 4 million commercially available compounds from 28 different suppliers. A user-friendly web-based interface for EDULISS (available at http://eduliss.bch.ed.ac.uk/) has been established providing a number of data-mining possibilities. For each compound a single 3D conformer is stored along with over 1600 calculated descriptor values (molecular properties). A very efficient method for unique compound recognition, especially for a large scale database, is demonstrated by making use of small subgroups of the descriptors. Many of the shape and distance descriptors are held as pre-calculated bit strings permitting fast and efficient similarity and pharmacophore searches which can be used to identify families of related compounds for biological testing. Two ligand searching applications are given to demonstrate how EDULISS can be used to extract families of molecules with selected structural and biophysical features. PMID:21051336
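
    A minimal Python sketch of the bit-string similarity search this record describes: fingerprints are held as bit masks and compared with the Jaccard-Tanimoto coefficient. The fingerprints and molecule names below are toy values, not EDULISS descriptors.

      # Fingerprints stored as integers acting as bit masks; similarity search
      # returns entries above a Tanimoto threshold.
      def tanimoto(fp_a: int, fp_b: int) -> float:
          """Jaccard-Tanimoto similarity of two fingerprint bit masks."""
          common = bin(fp_a & fp_b).count("1")
          total = bin(fp_a | fp_b).count("1")
          return common / total if total else 0.0

      def search(query_fp: int, database: dict, threshold: float = 0.7):
          """Return (name, similarity) pairs exceeding the threshold."""
          return [(name, tanimoto(query_fp, fp))
                  for name, fp in database.items()
                  if tanimoto(query_fp, fp) >= threshold]

      if __name__ == "__main__":
          db = {"mol_A": 0b101101, "mol_B": 0b111100, "mol_C": 0b000011}
          print(search(0b101100, db))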

  12. SPSmart: adapting population based SNP genotype databases for fast and comprehensive web access.

    PubMed

    Amigo, Jorge; Salas, Antonio; Phillips, Christopher; Carracedo, Angel

    2008-10-10

    In the last five years large online resources of human variability have appeared, notably HapMap, Perlegen and the CEPH foundation. These databases of genotypes with population information act as catalogues of human diversity, and are widely used as reference sources for population genetics studies. Although many useful conclusions may be extracted by querying databases individually, the lack of flexibility for combining data from within and between each database does not allow the calculation of key population variability statistics. We have developed a novel tool for accessing and combining large-scale genomic databases of single nucleotide polymorphisms (SNPs) in widespread use in human population genetics: SPSmart (SNPs for Population Studies). A fast pipeline creates and maintains a data mart from the most commonly accessed databases of genotypes containing population information: data is mined, summarized into the standard statistical reference indices, and stored into a relational database that currently handles as many as 4 × 10^9 genotypes and that can be easily extended to new database initiatives. We have also built a web interface to the data mart that allows the browsing of underlying data indexed by population and the combining of populations, allowing intuitive and straightforward comparison of population groups. All the information served is optimized for web display, and most of the computations are already pre-processed in the data mart to speed up the data browsing and any computational treatment requested. In practice, SPSmart allows populations to be combined into user-defined groups, while multiple databases can be accessed and compared in a few simple steps from a single query. It performs the queries rapidly and gives straightforward graphical summaries of SNP population variability through visual inspection of allele frequencies outlined in standard pie-chart format. In addition, full numerical description of the data is output in statistical results panels that include common population genetics metrics such as heterozygosity, Fst and In.
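
    For readers unfamiliar with the summary statistics mentioned at the end of this record, the Python sketch below computes expected heterozygosity and a simple two-population Fst from biallelic allele frequencies; the frequencies are invented and the Hs/Ht formulation is one common convention, not necessarily the one used by SPSmart.

      # Expected heterozygosity and a simple two-population Fst for a
      # biallelic SNP; allele frequencies are invented placeholders.
      def expected_heterozygosity(p: float) -> float:
          """Expected heterozygosity for a biallelic locus with frequency p."""
          return 2.0 * p * (1.0 - p)

      def fst_two_populations(p1: float, p2: float) -> float:
          """Wright's Fst for two equal-sized populations (Hs vs Ht form)."""
          hs = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
          p_bar = (p1 + p2) / 2.0
          ht = expected_heterozygosity(p_bar)
          return 0.0 if ht == 0 else (ht - hs) / ht

      print(expected_heterozygosity(0.3))     # ≈0.42
      print(fst_two_populations(0.2, 0.6))    # degree of differentiation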

  13. BrEPS 2.0: Optimization of sequence pattern prediction for enzyme annotation.

    PubMed

    Dudek, Christian-Alexander; Dannheim, Henning; Schomburg, Dietmar

    2017-01-01

    The prediction of gene functions is crucial for a large number of different life science areas. Faster high throughput sequencing techniques generate more and larger datasets. The manual annotation by classical wet-lab experiments is not suitable for these large amounts of data. We showed earlier that the automatic sequence pattern-based BrEPS protocol, based on manually curated sequences, can be used for the prediction of enzymatic functions of genes. The growing sequence databases provide the opportunity for more reliable patterns, but are also a challenge for the implementation of automatic protocols. We reimplemented and optimized the BrEPS pattern generation to be applicable for larger datasets in an acceptable timescale. The primary improvement of the new BrEPS protocol is the enhanced data selection step. Manually curated annotations from Swiss-Prot are used as a reliable source for function prediction of enzymes observed at the protein level. The pool of sequences is extended by highly similar sequences from TrEMBL and Swiss-Prot. This allows us to restrict the selection of Swiss-Prot entries, without losing the diversity of sequences needed to generate significant patterns. Additionally, a supporting pattern type was introduced by extending the patterns at semi-conserved positions with highly similar amino acids. Extended patterns have an increased complexity, increasing the chance to match more sequences, without losing the essential structural information of the pattern. To enhance the usability of the database, we introduced enzyme function prediction based on consensus EC numbers and IUBMB enzyme nomenclature. BrEPS is part of the Braunschweig Enzyme Database (BRENDA) and is available on a completely redesigned website and as a download. The database can be downloaded and used with the BrEPScmd command line tool for large scale sequence analysis. The BrEPS website and downloads for the database creation tool, command line tool and database are freely accessible at http://breps.tu-bs.de.
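
    A minimal Python sketch in the spirit of pattern-based enzyme annotation: a PROSITE-style pattern is translated to a regular expression and scanned against protein sequences, assigning an EC number on a match. The pattern, sequences and EC number are invented illustrations, not actual BrEPS patterns.

      # Convert a simple PROSITE-style pattern to a regex and annotate
      # sequences that match it. All data below are invented examples.
      import re

      def prosite_to_regex(pattern: str) -> str:
          """Translate a simple PROSITE-style pattern into a Python regex."""
          regex = pattern.replace("-", "")      # positions are dash-separated
          regex = regex.replace("x", ".")       # 'x' matches any residue
          regex = regex.replace("{", "[^").replace("}", "]")  # exclusions
          return regex                          # classes like [LIV] pass through

      def annotate(sequences: dict, pattern: str, ec_number: str) -> dict:
          """Assign an EC number to every sequence matching the pattern."""
          rx = re.compile(prosite_to_regex(pattern))
          return {name: ec_number for name, seq in sequences.items() if rx.search(seq)}

      seqs = {"prot1": "MKTAYGHSCLVA", "prot2": "MKLLNDQPRT"}
      print(annotate(seqs, "G-H-S-x-[LIV]", "3.1.1.-"))   # {'prot1': '3.1.1.-'}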

  14. BrEPS 2.0: Optimization of sequence pattern prediction for enzyme annotation

    PubMed Central

    Schomburg, Dietmar

    2017-01-01

    The prediction of gene functions is crucial for a large number of different life science areas. Faster high throughput sequencing techniques generate more and larger datasets. The manual annotation by classical wet-lab experiments is not suitable for these large amounts of data. We showed earlier that the automatic sequence pattern-based BrEPS protocol, based on manually curated sequences, can be used for the prediction of enzymatic functions of genes. The growing sequence databases provide the opportunity for more reliable patterns, but are also a challenge for the implementation of automatic protocols. We reimplemented and optimized the BrEPS pattern generation to be applicable for larger datasets in an acceptable timescale. The primary improvement of the new BrEPS protocol is the enhanced data selection step. Manually curated annotations from Swiss-Prot are used as a reliable source for function prediction of enzymes observed at the protein level. The pool of sequences is extended by highly similar sequences from TrEMBL and Swiss-Prot. This allows us to restrict the selection of Swiss-Prot entries, without losing the diversity of sequences needed to generate significant patterns. Additionally, a supporting pattern type was introduced by extending the patterns at semi-conserved positions with highly similar amino acids. Extended patterns have an increased complexity, increasing the chance to match more sequences, without losing the essential structural information of the pattern. To enhance the usability of the database, we introduced enzyme function prediction based on consensus EC numbers and IUBMB enzyme nomenclature. BrEPS is part of the Braunschweig Enzyme Database (BRENDA) and is available on a completely redesigned website and as a download. The database can be downloaded and used with the BrEPScmd command line tool for large scale sequence analysis. The BrEPS website and downloads for the database creation tool, command line tool and database are freely accessible at http://breps.tu-bs.de. PMID:28750104

  15. Evaluation of Online Information Sources on Alien Species in Europe: The Need of Harmonization and Integration

    NASA Astrophysics Data System (ADS)

    Gatto, Francesca; Katsanevakis, Stelios; Vandekerkhove, Jochen; Zenetos, Argyro; Cardoso, Ana Cristina

    2013-06-01

    Europe is severely affected by alien invasions, which impact biodiversity, ecosystem services, economy, and human health. A large number of national, regional, and global online databases provide information on the distribution, pathways of introduction, and impacts of alien species. The sufficiency and efficiency of the current online information systems to assist the European policy on alien species were investigated by a comparative analysis of occurrence data across 43 online databases. Large differences among databases were found, which are partially explained by variations in their taxonomical, environmental, and geographical scopes, but also by variable efforts at continuous updates and by inconsistencies in the definition of "alien" or "invasive" species. No single database covered all European environments, countries, and taxonomic groups. In many European countries national databases do not exist, which greatly affects the quality of reported information. To be operational and useful to scientists, managers, and policy makers, online information systems need to be regularly updated through continuous monitoring on a country or regional level. We propose the creation of a network of online interoperable web services through which information in distributed resources can be accessed, aggregated and then used for reporting and further analysis at different geographical and political scales, as an efficient approach to increase the accessibility of information. Harmonization, standardization, conformity to international standards for nomenclature, and agreement on common definitions of alien and invasive species are among the necessary prerequisites.

  16. Fire Detection Organizing Questions

    NASA Technical Reports Server (NTRS)

    2004-01-01

    Verified models of fire precursor transport in low and partial gravity: a. Development of models for large-scale transport in reduced gravity. b. Validated CFD simulations of transport of fire precursors. c. Evaluation of the effect of scale on transport and reduced gravity fires. Advanced fire detection system for gaseous and particulate pre-fire and fire signatures: a. Quantification of pre-fire pyrolysis products in microgravity. b. Suite of gas and particulate sensors. c. Reduced gravity evaluation of candidate detector technologies. d. Reduced gravity verification of advanced fire detection system. e. Validated database of fire and pre-fire signatures in low and partial gravity.

  17. TheCellMap.org: A Web-Accessible Database for Visualizing and Mining the Global Yeast Genetic Interaction Network

    PubMed Central

    Usaj, Matej; Tan, Yizhao; Wang, Wen; VanderSluis, Benjamin; Zou, Albert; Myers, Chad L.; Costanzo, Michael; Andrews, Brenda; Boone, Charles

    2017-01-01

    Providing access to quantitative genomic data is key to ensure large-scale data validation and promote new discoveries. TheCellMap.org serves as a central repository for storing and analyzing quantitative genetic interaction data produced by genome-scale Synthetic Genetic Array (SGA) experiments with the budding yeast Saccharomyces cerevisiae. In particular, TheCellMap.org allows users to easily access, visualize, explore, and functionally annotate genetic interactions, or to extract and reorganize subnetworks, using data-driven network layouts in an intuitive and interactive manner. PMID:28325812

  18. TheCellMap.org: A Web-Accessible Database for Visualizing and Mining the Global Yeast Genetic Interaction Network.

    PubMed

    Usaj, Matej; Tan, Yizhao; Wang, Wen; VanderSluis, Benjamin; Zou, Albert; Myers, Chad L; Costanzo, Michael; Andrews, Brenda; Boone, Charles

    2017-05-05

    Providing access to quantitative genomic data is key to ensure large-scale data validation and promote new discoveries. TheCellMap.org serves as a central repository for storing and analyzing quantitative genetic interaction data produced by genome-scale Synthetic Genetic Array (SGA) experiments with the budding yeast Saccharomyces cerevisiae. In particular, TheCellMap.org allows users to easily access, visualize, explore, and functionally annotate genetic interactions, or to extract and reorganize subnetworks, using data-driven network layouts in an intuitive and interactive manner. Copyright © 2017 Usaj et al.
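
    As a rough illustration of the subnetwork extraction described in this record, the Python sketch below filters pairwise interaction scores by a cutoff and keeps only interactions among a query gene set; the scores and cutoff are invented placeholders, not SGA data or TheCellMap.org defaults.

      # Keep interactions among a query gene set whose scores pass a cutoff.
      # Gene names, scores and the cutoff are invented for illustration.
      def extract_subnetwork(interactions, query_genes, score_cutoff=-0.08):
          """Keep negative (aggravating) interactions between query genes."""
          query = set(query_genes)
          return [(a, b, s) for a, b, s in interactions
                  if a in query and b in query and s <= score_cutoff]

      interactions = [
          ("RAD51", "RAD52", -0.35),
          ("RAD51", "ACT1", -0.02),
          ("CDC28", "CLN2", -0.15),
      ]
      print(extract_subnetwork(interactions, ["RAD51", "RAD52", "CDC28", "CLN2"]))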

  19. Effects of local and large-scale climate patterns on estuarine resident fishes: The example of Pomatoschistus microps and Pomatoschistus minutus

    NASA Astrophysics Data System (ADS)

    Nyitrai, Daniel; Martinho, Filipe; Dolbeth, Marina; Rito, João; Pardal, Miguel A.

    2013-12-01

    Large-scale and local climate patterns are known to influence several aspects of the life cycle of marine fish. In this paper, we used a 9-year database (2003-2011) to analyse the populations of two estuarine resident fishes, Pomatoschistus microps and Pomatoschistus minutus, in order to determine their relationships with varying environmental stressors operating over local and large scales. This study was performed in the Mondego estuary, Portugal. Firstly, the variations in abundance, growth, population structure and secondary production were evaluated. These species appeared in high densities at the beginning of the study period, with subsequent occasional high annual density peaks, while their secondary production was lower in dry years. The relationships between yearly fish abundance and the environmental variables were evaluated separately for both species using Spearman correlation analysis, considering the yearly abundance peaks for the whole population, juveniles and adults. Among the local climate patterns, precipitation, river runoff, salinity and temperature were used in the analyses, and the North Atlantic Oscillation (NAO) index and sea surface temperature (SST) were tested as large-scale factors. For P. microps, precipitation and the NAO were the significant factors explaining the abundance of the whole population as well as of the adults and juveniles. Regarding P. minutus, river runoff was the significant predictor for the whole population, juveniles and adults. The results for both species suggest a differential influence of climate patterns on the various life cycle stages, also confirming the importance of estuarine resident fishes as indicators of changes in local and large-scale climate patterns related to global climate change.
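
    The Spearman correlation analysis used here can be reproduced in a few lines; the sketch below uses scipy with invented abundance and runoff values rather than the Mondego estuary data.

      # Spearman rank correlation between yearly abundance peaks and an
      # environmental driver. All numbers are invented placeholders.
      from scipy.stats import spearmanr

      abundance = [12.4, 8.1, 15.3, 6.7, 9.9, 14.2, 7.5, 11.0, 10.3]   # yearly peaks
      river_runoff = [310, 150, 420, 90, 200, 380, 120, 260, 240]      # annual means

      rho, p_value = spearmanr(abundance, river_runoff)
      print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")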

  20. Data-Driven High-Throughput Prediction of the 3D Structure of Small Molecules: Review and Progress

    PubMed Central

    Andronico, Alessio; Randall, Arlo; Benz, Ryan W.; Baldi, Pierre

    2011-01-01

    Accurate prediction of the 3D structure of small molecules is essential in order to understand their physical, chemical, and biological properties including how they interact with other molecules. Here we survey the field of high-throughput methods for 3D structure prediction and set up new target specifications for the next generation of methods. We then introduce COSMOS, a novel data-driven prediction method that utilizes libraries of fragment and torsion angle parameters. We illustrate COSMOS using parameters extracted from the Cambridge Structural Database (CSD) by analyzing their distribution and then evaluating the system’s performance in terms of speed, coverage, and accuracy. Results show that COSMOS represents a significant improvement when compared to the state-of-the-art, particularly in terms of coverage of complex molecular structures, including metal-organics. COSMOS can predict structures for 96.4% of the molecules in the CSD [99.6% organic, 94.6% metal-organic] whereas the widely used commercial method CORINA predicts structures for 68.5% [98.5% organic, 51.6% metal-organic]. On the common subset of molecules predicted by both methods COSMOS makes predictions with an average speed per molecule of 0.15s [0.10s organic, 0.21s metal-organic], and an average RMSD of 1.57Å [1.26Å organic, 1.90Å metal-organic], and CORINA makes predictions with an average speed per molecule of 0.13s [0.18s organic, 0.08s metal-organic], and an average RMSD of 1.60Å [1.13Å organic, 2.11Å metal-organic]. COSMOS is available through the ChemDB chemoinformatics web portal at: http://cdb.ics.uci.edu/. PMID:21417267
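
    The RMSD figures quoted above measure coordinate agreement between predicted and reference conformers; the short Python sketch below computes RMSD for matched atom coordinates, assuming the two structures have already been superimposed. Coordinates are invented.

      # Root-mean-square deviation between two (N, 3) coordinate arrays with
      # matched atom order, assuming prior superposition. Toy coordinates.
      import numpy as np

      def rmsd(pred: np.ndarray, ref: np.ndarray) -> float:
          """RMSD between predicted and reference coordinates."""
          diff = pred - ref
          return float(np.sqrt((diff ** 2).sum() / len(pred)))

      predicted = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.1, 0.3]])
      reference = np.array([[0.1, 0.0, 0.0], [1.4, 0.1, 0.0], [2.0, 1.0, 0.5]])
      print(f"RMSD = {rmsd(predicted, reference):.2f} Angstrom")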

  1. Database for the geologic map of the Chelan 30-minute by 60-minute quadrangle, Washington (I-1661)

    USGS Publications Warehouse

    Tabor, R.W.; Frizzell, V.A.; Whetten, J.T.; Waitt, R.B.; Swanson, D.A.; Byerly, G.R.; Booth, D.B.; Hetherington, M.J.; Zartman, R.E.

    2006-01-01

    This digital map database has been prepared by R. W. Tabor from the published Geologic map of the Chelan 30-Minute Quadrangle, Washington. Together with the accompanying text files as PDF, it provides information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The authors mapped most of the bedrock geology at 1:100,000 scale, but compiled Quaternary units at 1:24,000 scale. The Quaternary contacts and structural data have been much simplified for the 1:100,000-scale map and database. The spatial resolution (scale) of the database is 1:100,000 or smaller. This database depicts the distribution of geologic materials and structures at a regional (1:100,000) scale. The report is intended to provide geologic information for the regional study of materials properties, earthquake shaking, landslide potential, mineral hazards, seismic velocity, and earthquake faults. In addition, the report contains information and interpretations about the regional geologic history and framework. However, the regional scale of this report does not provide sufficient detail for site development purposes.

  2. Database for the geologic map of the Snoqualmie Pass 30-minute by 60-minute quadrangle, Washington (I-2538)

    USGS Publications Warehouse

    Tabor, R.W.; Frizzell, V.A.; Booth, D.B.; Waitt, R.B.

    2006-01-01

    This digital map database has been prepared by R.W. Tabor from the published Geologic map of the Snoqualmie Pass 30' X 60' Quadrangle, Washington. Together with the accompanying text files as PDF, it provides information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The authors mapped most of the bedrock geology at 1:100,000 scale, but compiled Quaternary units at 1:24,000 scale. The Quaternary contacts and structural data have been much simplified for the 1:100,000-scale map and database. The spatial resolution (scale) of the database is 1:100,000 or smaller. This database depicts the distribution of geologic materials and structures at a regional (1:100,000) scale. The report is intended to provide geologic information for the regional study of materials properties, earthquake shaking, landslide potential, mineral hazards, seismic velocity, and earthquake faults. In addition, the report contains information and interpretations about the regional geologic history and framework. However, the regional scale of this report does not provide sufficient detail for site development purposes.

  3. Geologic Map of the Wenatchee 1:100,000 Quadrangle, Central Washington: A Digital Database

    USGS Publications Warehouse

    Tabor, R.W.; Waitt, R.B.; Frizzell, V.A.; Swanson, D.A.; Byerly, G.R.; Bentley, R.D.

    2005-01-01

    This digital map database has been prepared by R.W. Tabor from the published Geologic map of the Wenatchee 1:100,000 Quadrangle, Central Washington. Together with the accompanying text files as PDF, it provides information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The authors mapped most of the bedrock geology at 1:100,000 scale, but compiled Quaternary units at 1:24,000 scale. The Quaternary contacts and structural data have been much simplified for the 1:100,000-scale map and database. The spatial resolution (scale) of the database is 1:100,000 or smaller. This database depicts the distribution of geologic materials and structures at a regional (1:100,000) scale. The report is intended to provide geologic information for the regional study of materials properties, earthquake shaking, landslide potential, mineral hazards, seismic velocity, and earthquake faults. In addition, the report contains information and interpretations about the regional geologic history and framework. However, the regional scale of this report does not provide sufficient detail for site development purposes.

  4. An Evaluation of Database Solutions to Spatial Object Association

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kumar, V S; Kurc, T; Saltz, J

    2008-06-24

    Object association is a common problem encountered in many applications. Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two datasets based on their positions in a common spatial coordinate system--one of the datasets may correspond to a catalog of objects observed over time in a multi-dimensional domain; the other dataset may consist of objects observed in a snapshot of the domain at a time point. The use of database management systems to solve the object association problem provides portability across different platforms and also greater flexibility. Increasing dataset sizes in today's applications, however, have made object association a data/compute-intensive problem that requires targeted optimizations for efficient execution. In this work, we investigate how database-based crossmatch algorithms can be deployed on different database system architectures and evaluate the deployments to understand the impact of architectural choices on crossmatch performance and associated trade-offs. We investigate the execution of two crossmatch algorithms on (1) a parallel database system with active disk style processing capabilities, (2) a high-throughput network database (MySQL Cluster), and (3) shared-nothing databases with replication. We have conducted our study in the context of a large-scale astronomy application with real use-case scenarios.
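
    A minimal Python sketch of a positional crossmatch of the kind evaluated in this study: objects from one catalogue are binned into grid cells so that only nearby candidates are compared, and pairs within a tolerance radius are reported. The catalogues and radius are toy values; real deployments push this logic into the database systems discussed above.

      # Grid-binned positional crossmatch of two toy catalogues.
      from collections import defaultdict
      from math import hypot

      def crossmatch(catalog_a, catalog_b, radius):
          """Return (id_a, id_b) pairs whose positions are within `radius`."""
          cell = radius  # grid cell size equal to the match radius
          grid = defaultdict(list)
          for obj_id, x, y in catalog_b:
              grid[(int(x // cell), int(y // cell))].append((obj_id, x, y))

          matches = []
          for obj_id, x, y in catalog_a:
              cx, cy = int(x // cell), int(y // cell)
              for dx in (-1, 0, 1):               # search the 3x3 neighbourhood
                  for dy in (-1, 0, 1):
                      for other_id, ox, oy in grid.get((cx + dx, cy + dy), []):
                          if hypot(x - ox, y - oy) <= radius:
                              matches.append((obj_id, other_id))
          return matches

      cat_a = [("a1", 10.02, 20.01), ("a2", 35.5, 7.8)]
      cat_b = [("b1", 10.03, 20.00), ("b2", 90.0, 90.0)]
      print(crossmatch(cat_a, cat_b, radius=0.05))   # [('a1', 'b1')]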

  5. Degree program changes and curricular flexibility: Addressing long held beliefs about student progression

    NASA Astrophysics Data System (ADS)

    Ricco, George Dante

    In higher education and in engineering education in particular, changing majors is generally considered a negative event - or at least an event with negative consequences. An emergent field of study within engineering education revolves around understanding the factors and processes driving student changes of major. Of key importance to further the field of change of major research is a grasp of large scale phenomena occurring throughout multiple systems, knowledge of previous attempts at describing such issues, and the adoption of metrics to probe them effectively. The problem posed is exacerbated by the drive in higher education institutions and among state legislatures to understand and reduce time-to-degree and student attrition. With these factors in mind, insights into large-scale processes that affect student progression are essential to evaluating the success or failure of programs. The goals of this work include describing the current educational research on switchers, identifying core concepts and stumbling blocks in my treatment of switchers, and using the Multiple Institutional Database for Investigating Engineering Longitudinal Development (MIDFIELD) to explore how those who change majors perform as a function of large-scale academic pathways within and without the engineering context. To accomplish these goals, it was first necessary to delve into a recent history of the treatment of switchers within the literature and categorize their approaches. While three categories of papers exist in the literature concerning change of major, all three may or may not be applicable to a given database of students or even a single institution. Furthermore, while the term has been coined in the literature, no portable metric for discussing large-scale navigational flexibility exists in engineering education. What such a metric would look like will be discussed, as well as the delimitations involved. The results and subsequent discussion will include a description of changes of major, how they may or may not have a deleterious effect on one's academic pathway, the special context of changes of major in the pathways of students within first-year engineering programs and of students labeled as undecided, an exploration of curricular flexibility by the construction of a novel metric, and proposed future work.

  6. The Computing and Data Grid Approach: Infrastructure for Distributed Science Applications

    NASA Technical Reports Server (NTRS)

    Johnston, William E.

    2002-01-01

    With the advent of Grids - infrastructure for using and managing widely distributed computing and data resources in the science environment - there is now an opportunity to provide a standard, large-scale, computing, data, instrument, and collaboration environment for science that spans many different projects and provides the required infrastructure and services in a relatively uniform and supportable way. Grid technology has evolved over the past several years to provide the services and infrastructure needed for building 'virtual' systems and organizations. We argue that Grid technology provides an excellent basis for the creation of the integrated environments that can combine the resources needed to support the large-scale science projects located at multiple laboratories and universities. We present some science case studies that indicate that a paradigm shift in the process of science will come about as a result of Grids providing transparent and secure access to advanced and integrated information and technologies infrastructure: powerful computing systems, large-scale data archives, scientific instruments, and collaboration tools. These changes will be in the form of services that can be integrated with the user's work environment, and that enable uniform and highly capable access to these computers, data, and instruments, regardless of the location or exact nature of these resources. These services will integrate transient-use resources like computing systems, scientific instruments, and data caches (e.g., as they are needed to perform a simulation or analyze data from a single experiment); persistent-use resources, such as databases, data catalogues, and archives; and collaborators, whose involvement will continue for the lifetime of a project or longer. While we largely address large-scale science in this paper, Grids, particularly when combined with Web Services, will address a broad spectrum of science scenarios, both large and small scale.

  7. Introducing the Global Fire WEather Database (GFWED)

    NASA Astrophysics Data System (ADS)

    Field, R. D.

    2015-12-01

    The Canadian Fire Weather Index (FWI) System is the most widely used fire danger rating system in the world. We have developed a global database of daily FWI System calculations beginning in 1980, called the Global Fire WEather Database (GFWED), gridded to a spatial resolution of 0.5° latitude by 2/3° longitude. Input weather data were obtained from the NASA Modern Era Retrospective-Analysis for Research (MERRA), and two different estimates of daily precipitation from rain gauges over land. FWI System Drought Code calculations from the gridded datasets were compared to calculations from individual weather station data for a representative set of 48 stations in North, Central and South America, Europe, Russia, Southeast Asia and Australia. The gridded and station-based calculations tended to differ most at low latitudes for the strictly MERRA-based calculations. Strong biases could be seen in either direction: the MERRA-based DC over the Mato Grosso in Brazil reached unrealistically high values exceeding DC=1500 during the dry season, but was too low over Southeast Asia during its dry season. These biases are consistent with those previously identified in MERRA's precipitation and reinforce the need to consider alternative sources of precipitation data. GFWED is being used by researchers around the world for analyzing historical relationships between fire weather and fire activity at large scales, in identifying large-scale atmosphere-ocean controls on fire weather, and calibration of FWI-based fire prediction models. These applications will be discussed. More information on GFWED can be found at http://data.giss.nasa.gov/impacts/gfwed/

  8. ClearedLeavesDB: an online database of cleared plant leaf images

    PubMed Central

    2014-01-01

    Background Leaf vein networks are critical to both the structure and function of leaves. A growing body of recent work has linked leaf vein network structure to the physiology, ecology and evolution of land plants. In the process, multiple institutions and individual researchers have assembled collections of cleared leaf specimens in which vascular bundles (veins) are rendered visible. In an effort to facilitate analysis and digitally preserve these specimens, high-resolution images are usually created, either of entire leaves or of magnified leaf subsections. In a few cases, collections of digital images of cleared leaves are available for use online. However, these collections do not share a common platform nor is there a means to digitally archive cleared leaf images held by individual researchers (in addition to those held by institutions). Hence, there is a growing need for a digital archive that enables online viewing, sharing and disseminating of cleared leaf image collections held by both institutions and individual researchers. Description The Cleared Leaf Image Database (ClearedLeavesDB) is an online web-based resource for a community of researchers to contribute, access and share cleared leaf images. ClearedLeavesDB leverages resources of large-scale, curated collections while enabling the aggregation of small-scale collections within the same online platform. ClearedLeavesDB is built on Drupal, an open source content management platform. It allows plant biologists to store leaf images online with corresponding meta-data, share image collections with a user community and discuss images and collections via a common forum. We provide tools to upload processed images and results to the database via a web services client application that can be downloaded from the database. Conclusions We developed ClearedLeavesDB, a database focusing on cleared leaf images that combines interactions between users and data via an intuitive web interface. The web interface allows storage of large collections and integrates with leaf image analysis applications via an open application programming interface (API). The open API allows uploading of processed images and other trait data to the database, further enabling distribution and documentation of analyzed data within the community. The initial database is seeded with nearly 19,000 cleared leaf images representing over 40 GB of image data. Extensible storage and growth of the database is ensured by using the data storage resources of the iPlant Discovery Environment. ClearedLeavesDB can be accessed at http://clearedleavesdb.org. PMID:24678985

  9. ClearedLeavesDB: an online database of cleared plant leaf images.

    PubMed

    Das, Abhiram; Bucksch, Alexander; Price, Charles A; Weitz, Joshua S

    2014-03-28

    Leaf vein networks are critical to both the structure and function of leaves. A growing body of recent work has linked leaf vein network structure to the physiology, ecology and evolution of land plants. In the process, multiple institutions and individual researchers have assembled collections of cleared leaf specimens in which vascular bundles (veins) are rendered visible. In an effort to facilitate analysis and digitally preserve these specimens, high-resolution images are usually created, either of entire leaves or of magnified leaf subsections. In a few cases, collections of digital images of cleared leaves are available for use online. However, these collections do not share a common platform nor is there a means to digitally archive cleared leaf images held by individual researchers (in addition to those held by institutions). Hence, there is a growing need for a digital archive that enables online viewing, sharing and disseminating of cleared leaf image collections held by both institutions and individual researchers. The Cleared Leaf Image Database (ClearedLeavesDB) is an online web-based resource for a community of researchers to contribute, access and share cleared leaf images. ClearedLeavesDB leverages resources of large-scale, curated collections while enabling the aggregation of small-scale collections within the same online platform. ClearedLeavesDB is built on Drupal, an open source content management platform. It allows plant biologists to store leaf images online with corresponding meta-data, share image collections with a user community and discuss images and collections via a common forum. We provide tools to upload processed images and results to the database via a web services client application that can be downloaded from the database. We developed ClearedLeavesDB, a database focusing on cleared leaf images that combines interactions between users and data via an intuitive web interface. The web interface allows storage of large collections and integrates with leaf image analysis applications via an open application programming interface (API). The open API allows uploading of processed images and other trait data to the database, further enabling distribution and documentation of analyzed data within the community. The initial database is seeded with nearly 19,000 cleared leaf images representing over 40 GB of image data. Extensible storage and growth of the database is ensured by using the data storage resources of the iPlant Discovery Environment. ClearedLeavesDB can be accessed at http://clearedleavesdb.org.

  10. What if we took a global look?

    NASA Astrophysics Data System (ADS)

    Ouellet Dallaire, C.; Lehner, B.

    2014-12-01

    Freshwater resources are facing unprecedented pressures. In the hope of coping with this, Environmental Hydrology, Freshwater Biology, and Fluvial Geomorphology have defined conceptual approaches such as "environmental flow requirements", "instream flow requirements" or "normative flow regime" to define an appropriate flow regime to maintain a given ecological status. These advances in the fields of freshwater resources management are asking scientists to create bridges across disciplines. Holistic and multi-scale approaches are becoming more and more common in water sciences research. The intrinsic nature of river systems demands that these approaches account for the upstream-downstream link of watersheds. Before recent technological developments, large-scale analyses were cumbersome and, often, the necessary data were unavailable. However, new technologies, both for information collection and computing capacity, enable a high-resolution look at the global scale. For rivers around the world, this new outlook is facilitated by the hydrologically relevant geo-spatial database HydroSHEDS. This database now offers more than 24 million kilometers of rivers, some never mapped before, at one's fingertips. Large- and even global-scale assessments can now be used to compare rivers around the world. A river classification framework called GloRiC (Global River Classification) was developed using HydroSHEDS. This framework advocates a holistic approach to river systems by using sub-classifications drawn from six disciplines related to river sciences: Hydrology, Physiography and climate, Geomorphology, Chemistry, Biology and Human impact. Each of these disciplines brings complementary information on the rivers that is relevant at different scales. A first version of a global river reach classification was produced at 500 m resolution. Variables used in the classification influence processes involved at different scales (e.g., topographic index vs. pH). However, all variables are computed at the same high spatial resolution. This way, we can take a global look at local phenomena.

  11. Extreme Precipitation and High-Impact Landslides

    NASA Technical Reports Server (NTRS)

    Kirschbaum, Dalia; Adler, Robert; Huffman, George; Peters-Lidard, Christa

    2012-01-01

    It is well known that extreme or prolonged rainfall is the dominant trigger of landslides; however, there remain large uncertainties in characterizing the distribution of these hazards and meteorological triggers at the global scale. Researchers have evaluated the spatiotemporal distribution of extreme rainfall and landslides at local and regional scales primarily using in situ data, yet few studies have mapped rainfall-triggered landslide distribution globally due to the dearth of landslide data and consistent precipitation information. This research uses a newly developed Global Landslide Catalog (GLC) and a 13-year satellite-based precipitation record from Tropical Rainfall Measuring Mission (TRMM) data. For the first time, these two unique products provide the foundation to quantitatively evaluate the co-occurrence of precipitation and rainfall-triggered landslides globally. The GLC, available from 2007 to the present, contains information on reported rainfall-triggered landslide events around the world using online media reports, disaster databases, etc. When evaluating this database, we observed that 2010 had a large number of high-impact landslide events relative to previous years. This study considers how variations in extreme and prolonged satellite-based rainfall are related to the distribution of landslides over the same time scales for three active landslide areas: Central America, the Himalayan Arc, and central-eastern China. Several test statistics confirm that TRMM rainfall generally scales with the observed increase in landslide reports and fatal events for 2010 and previous years over each region. These findings suggest that the co-occurrence of satellite precipitation and landslide reports may serve as a valuable indicator for characterizing the spatiotemporal distribution of landslide-prone areas in order to establish a global rainfall-triggered landslide climatology. This research also considers the sources for this extreme rainfall, citing teleconnections from ENSO as likely contributors to regional precipitation variability. This work demonstrates the potential for using satellite-based precipitation estimates to identify potentially active landslide areas at the global scale in order to improve landslide cataloging and quantify landslide triggering at daily, monthly and yearly time scales.

  12. The Cancer Epidemiology Descriptive Cohort Database: A Tool to Support Population-Based Interdisciplinary Research

    PubMed Central

    Kennedy, Amy E.; Khoury, Muin J.; Ioannidis, John P.A.; Brotzman, Michelle; Miller, Amy; Lane, Crystal; Lai, Gabriel Y.; Rogers, Scott D.; Harvey, Chinonye; Elena, Joanne W.; Seminara, Daniela

    2017-01-01

    Background We report on the establishment of a web-based Cancer Epidemiology Descriptive Cohort Database (CEDCD). The CEDCD’s goals are to enhance awareness of resources, facilitate interdisciplinary research collaborations, and support existing cohorts for the study of cancer-related outcomes. Methods Comprehensive descriptive data were collected from large cohorts established to study cancer as primary outcome using a newly developed questionnaire. These included an inventory of baseline and follow-up data, biospecimens, genomics, policies, and protocols. Additional descriptive data extracted from publicly available sources were also collected. This information was entered in a searchable and publicly accessible database. We summarized the descriptive data across cohorts and reported the characteristics of this resource. Results As of December 2015, the CEDCD includes data from 46 cohorts representing more than 6.5 million individuals (29% ethnic/racial minorities). Overall, 78% of the cohorts have collected blood at least once, 57% at multiple time points, and 46% collected tissue samples. Genotyping has been performed by 67% of the cohorts, while 46% have performed whole-genome or exome sequencing in subsets of enrolled individuals. Information on medical conditions other than cancer has been collected in more than 50% of the cohorts. More than 600,000 incident cancer cases and more than 40,000 prevalent cases are reported, with 24 cancer sites represented. Conclusions The CEDCD assembles detailed descriptive information on a large number of cancer cohorts in a searchable database. Impact Information from the CEDCD may assist the interdisciplinary research community by facilitating identification of well-established population resources and large-scale collaborative and integrative research. PMID:27439404

  13. Computational Modeling as a Design Tool in Microelectronics Manufacturing

    NASA Technical Reports Server (NTRS)

    Meyyappan, Meyya; Arnold, James O. (Technical Monitor)

    1997-01-01

    Plans to introduce pilot lines or fabs for 300 mm processing are in progress. The IC technology is simultaneously moving towards 0.25/0.18 micron. The convergence of these two trends places unprecedentedly stringent demands on processes and equipment. More than ever, computational modeling is called upon to play a complementary role in equipment and process design. The pace in hardware/process development needs a matching pace in software development: an aggressive move towards developing "virtual reactors" is desirable and essential to reduce design cycle and costs. This goal has three elements: reactor scale model, feature level model, and database of physical/chemical properties. With these elements coupled, the complete model should function as a design aid in a CAD environment. This talk would aim at the description of various elements. At the reactor level, continuum, DSMC (or particle) and hybrid models will be discussed and compared using examples of plasma and thermal process simulations. In microtopography evolution, approaches such as level set methods compete with conventional geometric models. Regardless of the approach, the reliance on empiricism is to be eliminated through coupling to reactor model and computational surface science. This coupling poses challenging issues of orders of magnitude variation in length and time scales. Finally, database development has fallen behind; the current situation is rapidly aggravated by the ever newer chemistries emerging to meet process metrics. The virtual reactor would be a useless concept without an accompanying reliable database that consists of: thermal reaction pathways and rate constants, electron-molecule cross sections, thermochemical properties, transport properties, and finally, surface data on the interaction of radicals, atoms and ions with various surfaces. Large scale computational chemistry efforts are critical as experiments alone cannot meet database needs due to the difficulties associated with such controlled experiments and costs.

  14. The role of Natural Flood Management in managing floods in large scale basins during extreme events

    NASA Astrophysics Data System (ADS)

    Quinn, Paul; Owen, Gareth; ODonnell, Greg; Nicholson, Alex; Hetherington, David

    2016-04-01

    There is a strong evidence base showing the negative impacts of land use intensification and soil degradation in NW European river basins on hydrological response and flood impact downstream. However, the ability to target zones of high runoff production, and the extent to which we can manage flood risk using nature-based flood management solutions, are less well known. A move to planting more trees and having less intensely farmed landscapes is part of natural flood management (NFM) solutions, and these methods suggest that flood risk can be managed in alternative and more holistic ways. So which local NFM methods should be used, where in a large-scale basin should they be deployed, and how does flow propagate to any point downstream? Generally, how much intervention is needed, and will it compromise food production systems? If we are observing record levels of rainfall and flow, for example during Storm Desmond in Dec 2015 in the North West of England, what other flood management options are really needed to complement our traditional defences in large basins for the future? In this paper we will show examples of NFM interventions in the UK that have had an impact at local-scale sites. We will demonstrate the impact of interventions at local, sub-catchment (meso-scale) and finally at the large scale. These tools include observations, process-based models and more generalised Flood Impact Models. Issues of synchronisation and the design level of protection will be debated. By reworking observed rainfall and discharge (runoff) for observed extreme events in the River Eden and River Tyne during Storm Desmond, we will show how much flood protection is needed in large-scale basins. The research will thus pose a number of key questions as to how floods may have to be managed in large-scale basins in the future. We will seek to support a method of catchment systems engineering that holds water back across the whole landscape as a major opportunity to manage water in large-scale basins in the future. The broader benefits of engineering landscapes to hold water for pollution control, sediment-loss reduction and drought minimisation will also be shown.

  15. SensorDB: a virtual laboratory for the integration, visualization and analysis of varied biological sensor data.

    PubMed

    Salehi, Ali; Jimenez-Berni, Jose; Deery, David M; Palmer, Doug; Holland, Edward; Rozas-Larraondo, Pablo; Chapman, Scott C; Georgakopoulos, Dimitrios; Furbank, Robert T

    2015-01-01

    To our knowledge, there is no software or database solution that supports large volumes of biological time series sensor data efficiently and enables data visualization and analysis in real time. Existing solutions for managing data typically use unstructured file systems or relational databases. These systems are not designed to provide instantaneous response to user queries. Furthermore, they do not support rapid data analysis and visualization to enable interactive experiments. In large scale experiments, this behaviour slows research discovery, discourages the widespread sharing and reuse of data that could otherwise inform critical decisions in a timely manner and encourage effective collaboration between groups. In this paper we present SensorDB, a web based virtual laboratory that can manage large volumes of biological time series sensor data while supporting rapid data queries and real-time user interaction. SensorDB is sensor agnostic and uses web-based, state-of-the-art cloud and storage technologies to efficiently gather, analyse and visualize data. Collaboration and data sharing between different agencies and groups is thereby facilitated. SensorDB is available online at http://sensordb.csiro.au.
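
    As a rough sketch of the core operation such a system must support (not SensorDB's actual implementation), the Python code below keeps timestamped readings in a sorted index and answers time-window queries with binary search. Sensor values are invented.

      # Toy time-series store with fast window queries via a sorted index.
      import bisect
      from datetime import datetime

      class TimeSeriesStore:
          def __init__(self):
              self._times, self._values = [], []   # kept sorted by timestamp

          def insert(self, ts: datetime, value: float):
              i = bisect.bisect(self._times, ts)
              self._times.insert(i, ts)
              self._values.insert(i, value)

          def window(self, start: datetime, end: datetime):
              lo = bisect.bisect_left(self._times, start)
              hi = bisect.bisect_right(self._times, end)
              return list(zip(self._times[lo:hi], self._values[lo:hi]))

      store = TimeSeriesStore()
      store.insert(datetime(2015, 1, 1, 12, 0), 21.5)   # e.g. canopy temp, degC
      store.insert(datetime(2015, 1, 1, 12, 10), 22.1)
      print(store.window(datetime(2015, 1, 1, 12, 0), datetime(2015, 1, 1, 12, 5)))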

  16. Validity of deterministic record linkage using multiple indirect personal identifiers: linking a large registry to claims data.

    PubMed

    Setoguchi, Soko; Zhu, Ying; Jalbert, Jessica J; Williams, Lauren A; Chen, Chih-Ying

    2014-05-01

    Linking patient registries with administrative databases can enhance the utility of the databases for epidemiological and comparative effectiveness research. However, registries often lack direct personal identifiers, and the validity of record linkage using multiple indirect personal identifiers is not well understood. Using a large contemporary national cardiovascular device registry and 100% Medicare inpatient data, we linked hospitalization-level records. The main outcomes were the validity measures of several deterministic linkage rules using multiple indirect personal identifiers compared with rules using both direct and indirect personal identifiers. Linkage rules using 2 or 3 indirect, patient-level identifiers (ie, date of birth, sex, admission date) and hospital ID produced linkages with sensitivity of 95% and specificity of 98% compared with a gold standard linkage rule using a combination of both direct and indirect identifiers. Ours is the first large-scale study to validate the performance of deterministic linkage rules without direct personal identifiers. When linking hospitalization-level records in the absence of direct personal identifiers, provider information is necessary for successful linkage. © 2014 American Heart Association, Inc.
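
    The deterministic rule described above can be expressed compactly: build a composite key from the indirect identifiers plus hospital ID and join the two files on it. The Python sketch below uses fabricated example records, not registry or Medicare data.

      # Deterministic linkage on indirect identifiers plus hospital ID.
      def composite_key(record):
          return (record["dob"], record["sex"], record["admit_date"], record["hospital_id"])

      def deterministic_link(registry, claims):
          """Return (registry record, claims record) pairs sharing the same key."""
          index = {}
          for rec in claims:
              index.setdefault(composite_key(rec), []).append(rec)
          links = []
          for rec in registry:
              for match in index.get(composite_key(rec), []):
                  links.append((rec, match))
          return links

      registry = [{"dob": "1941-03-02", "sex": "F", "admit_date": "2010-06-15",
                   "hospital_id": "H123", "device": "ICD"}]
      claims = [{"dob": "1941-03-02", "sex": "F", "admit_date": "2010-06-15",
                 "hospital_id": "H123", "claim_id": "C9001"}]
      print(len(deterministic_link(registry, claims)))   # 1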

  17. Intact mass detection, interpretation, and visualization to automate Top-Down proteomics on a large scale

    PubMed Central

    Durbin, Kenneth R.; Tran, John C.; Zamdborg, Leonid; Sweet, Steve M. M.; Catherman, Adam D.; Lee, Ji Eun; Li, Mingxi; Kellie, John F.; Kelleher, Neil L.

    2011-01-01

    Applying high-throughput Top-Down MS to an entire proteome requires a yet-to-be-established model for data processing. Since Top-Down is becoming possible on a large scale, we report our latest software pipeline dedicated to capturing the full value of intact protein data in automated fashion. For intact mass detection, we combine algorithms for processing MS1 data from both isotopically resolved (FT) and charge-state resolved (ion trap) LC-MS data, which are then linked to their fragment ions for database searching using ProSight. Automated determination of human keratin and tubulin isoforms is one result. Optimized for the intricacies of whole proteins, new software modules visualize proteome-scale data based on the LC retention time and intensity of intact masses and enable selective detection of PTMs to automatically screen for acetylation, phosphorylation, and methylation. Software functionality was demonstrated using comparative LC-MS data from yeast strains in addition to human cells undergoing chemical stress. We further these advances as a key aspect of realizing Top-Down MS on a proteomic scale. PMID:20848673

  18. Large-scale patterns of insect and disease activity in the Conterminous United States and Alaska from the National Insect and Disease Detection Survey Database, 2007 and 2008

    Treesearch

    Kevin M. Potter

    2012-01-01

    Analyzing patterns of forest pest infestation is necessary for monitoring the health of forested ecosystems because of the impacts that insects and diseases can have on forest structure, composition, biodiversity, and species distributions (Castello and others 1995). In particular, introduced nonnative insects and diseases can extensively damage the diversity, ecology...

  19. Validation databases for simulation models: aboveground biomass and net primary productivity (NPP) estimation using eastwide FIA data

    Treesearch

    Jennifer C. Jenkins; Richard A. Birdsey

    2000-01-01

    As interest grows in the role of forest growth in the carbon cycle, and as simulation models are applied to predict future forest productivity at large spatial scales, the need for reliable and field-based data for evaluation of model estimates is clear. We created estimates of potential forest biomass and annual aboveground production for the Chesapeake Bay watershed...

  20. LLMapReduce: Multi-Level Map-Reduce for High Performance Data Analysis

    DTIC Science & Technology

    2016-05-23

    LLMapReduce works with several schedulers such as SLURM, Grid Engine and LSF. Keywords: LLMapReduce; map-reduce; performance; scheduler; Grid Engine; SLURM; LSF. Large-scale computing is currently dominated by four ecosystems: supercomputing, database, enterprise, and big data. High-performance interconnects and high-performance math libraries (e.g., BLAS, LAPACK, ScaLAPACK) are designed to exploit special processing hardware.

  1. Proceedings of the Annual Meeting of the Association for Education in Journalism and Mass Communication (74th, Boston, Massachusetts, August 7-10, 1991). Part VI: Technology and the Mass Media.

    ERIC Educational Resources Information Center

    Association for Education in Journalism and Mass Communication.

    The Technology and the Media section of the proceedings contains the following 18 papers: "What's Wrong with This Picture?: Attitudes of Photographic Editors at Daily Newspapers and Their Tolerance toward Digital Manipulation" (Shiela Reaves); "Strategies for the Analysis of Large-Scale Databases in Computer-Assisted Investigative…

  2. Design and Implementation of a Metadata-rich File System

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ames, S; Gokhale, M B; Maltzahn, C

    2010-01-19

    Despite continual improvements in the performance and reliability of large scale file systems, the management of user-defined file system metadata has changed little in the past decade. The mismatch between the size and complexity of large scale data stores and their ability to organize and query their metadata has led to a de facto standard in which raw data is stored in traditional file systems, while related, application-specific metadata is stored in relational databases. This separation of data and semantic metadata requires considerable effort to maintain consistency and can result in complex, slow, and inflexible system operation. To address these problems, we have developed the Quasar File System (QFS), a metadata-rich file system in which files, user-defined attributes, and file relationships are all first class objects. In contrast to hierarchical file systems and relational databases, QFS defines a graph data model composed of files and their relationships. QFS incorporates Quasar, an XPATH-extended query language for searching the file system. Results from our QFS prototype show the effectiveness of this approach. Compared to the de facto standard, the QFS prototype shows superior ingest performance and comparable query performance on user metadata-intensive operations and superior performance on normal file metadata operations.
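
    A toy Python sketch of the graph-style metadata model this record describes (not QFS code, and not the Quasar query language): files carry user-defined attributes and typed relationships, and a query filters on attributes. Names and attributes are invented.

      # Files as graph nodes with user-defined attributes and typed links.
      class FileNode:
          def __init__(self, name, **attributes):
              self.name = name
              self.attributes = attributes
              self.links = []                      # (relationship, FileNode)

          def link(self, relationship, other):
              self.links.append((relationship, other))

      def query(files, **required):
          """Return files whose attributes contain all required key/value pairs."""
          return [f for f in files
                  if all(f.attributes.get(k) == v for k, v in required.items())]

      raw = FileNode("run42.dat", experiment="run42", kind="raw")
      plot = FileNode("run42.png", experiment="run42", kind="plot")
      plot.link("derived_from", raw)

      print([f.name for f in query([raw, plot], experiment="run42", kind="plot")])
      print([rel for rel, _ in plot.links])        # ['derived_from']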

  3. Ordinal feature selection for iris and palmprint recognition.

    PubMed

    Sun, Zhenan; Wang, Libin; Tan, Tieniu

    2014-09-01

    Ordinal measures have been demonstrated as an effective feature representation model for iris and palmprint recognition. However, ordinal measures are a general concept of image analysis, and numerous variants with different parameter settings, such as location, scale, orientation, and so on, can be derived to construct a huge feature space. This paper proposes a novel optimization formulation for ordinal feature selection with successful applications to both iris and palmprint recognition. The objective function of the proposed feature selection method has two parts, i.e., the misclassification error of intra- and interclass matching samples and the weighted sparsity of ordinal feature descriptors. Therefore, the feature selection aims to achieve an accurate and sparse representation of ordinal measures. The optimization is subject to a number of linear inequality constraints, which require that all intra- and interclass matching pairs be well separated with a large margin. Ordinal feature selection is formulated as a linear programming (LP) problem so that a solution can be efficiently obtained even on a large-scale feature pool and training database. Extensive experimental results demonstrate that the proposed LP formulation is advantageous over existing feature selection methods, such as mRMR, ReliefF, Boosting, and Lasso for biometric recognition, reporting state-of-the-art accuracy on CASIA and PolyU databases.
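
    A simplified Python sketch of a linear-programming feature selection in the spirit of this formulation (not the authors' exact model): non-negative feature weights are chosen so that inter-class matching distances exceed intra-class distances by a margin, with slack variables absorbing violations and an L1 term encouraging sparsity. Data are random placeholders.

      # LP feature selection sketch: variables are feature weights w >= 0 and
      # per-pair slacks xi >= 0; margin constraints d_diff @ w + xi >= 1.
      import numpy as np
      from scipy.optimize import linprog

      rng = np.random.default_rng(0)
      n_features, n_pairs = 6, 20
      # each row: (inter-class distance vector) - (intra-class distance vector)
      d_diff = rng.normal(loc=0.5, scale=1.0, size=(n_pairs, n_features))

      lam = 0.1                                   # sparsity weight on features
      c = np.concatenate([lam * np.ones(n_features), np.ones(n_pairs)])
      # margin constraints rewritten for linprog: -d_diff @ w - xi <= -1
      A_ub = np.hstack([-d_diff, -np.eye(n_pairs)])
      b_ub = -np.ones(n_pairs)
      bounds = [(0, None)] * (n_features + n_pairs)

      res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
      weights = res.x[:n_features]
      print("selected features:", np.nonzero(weights > 1e-6)[0])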

  4. Proteogenomic database construction driven from large scale RNA-seq data.

    PubMed

    Woo, Sunghee; Cha, Seong Won; Merrihew, Gennifer; He, Yupeng; Castellana, Natalie; Guest, Clark; MacCoss, Michael; Bafna, Vineet

    2014-01-03

    The advent of inexpensive RNA-seq technologies and other deep sequencing technologies for RNA has the promise to radically improve genomic annotation, providing information on transcribed regions and splicing events in a variety of cellular conditions. Using MS-based proteogenomics, many of these events can be confirmed directly at the protein level. However, the integration of large amounts of redundant RNA-seq data and mass spectrometry data poses a challenging problem. Our paper addresses this by construction of a compact database that contains all useful information expressed in RNA-seq reads. Applying our method to cumulative C. elegans data reduced 496.2 GB of aligned RNA-seq SAM files to 410 MB of splice graph database written in FASTA format. This corresponds to 1000× compression of data size, without loss of sensitivity. We performed a proteogenomics study using the custom data set, using a completely automated pipeline, and identified a total of 4044 novel events, including 215 novel genes, 808 novel exons, 12 alternative splicings, 618 gene-boundary corrections, 245 exon-boundary changes, 938 frame shifts, 1166 reverse strands, and 42 translated UTRs. Our results highlight the usefulness of transcript + proteomic integration for improved genome annotations.

  5. Geologic map of Chickasaw National Recreation Area, Murray County, Oklahoma

    USGS Publications Warehouse

    Blome, Charles D.; Lidke, David J.; Wahl, Ronald R.; Golab, James A.

    2013-01-01

    This 1:24,000-scale geologic map is a compilation of previous geologic maps and new geologic mapping of areas in and around Chickasaw National Recreation Area. The geologic map includes revisions of numerous unit contacts and faults, and a number of previously “undifferentiated” rock units were subdivided in some areas. Numerous circular-shaped hills in and around Chickasaw National Recreation Area are probably the result of karst-related collapse and may represent the erosional remnants of large, exhumed sinkholes. Geospatial registration of existing, smaller scale (1:72,000- and 1:100,000-scale) geologic maps of the area and construction of an accurate Geographic Information System (GIS) database preceded 2 years of fieldwork wherein previously mapped geology (unit contacts and faults) was verified and new geologic mapping was carried out. The geologic map of Chickasaw National Recreation Area and this pamphlet include information pertaining to how the geologic units and structural features in the map area relate to the formation of the northern Arbuckle Mountains and its Arbuckle-Simpson aquifer. The development of an accurate geospatial GIS database and the use of a handheld computer in the field greatly increased both the accuracy and efficiency in producing the 1:24,000-scale geologic map.

  6. Automation of a N-S S and C Database Generation for the Harrier in Ground Effect

    NASA Technical Reports Server (NTRS)

    Murman, Scott M.; Chaderjian, Neal M.; Pandya, Shishir; Kwak, Dochan (Technical Monitor)

    2001-01-01

    A method of automating the generation of a time-dependent, Navier-Stokes static stability and control database for the Harrier aircraft in ground effect is outlined. Reusable, lightweight components are described which allow different facets of the computational fluid dynamics simulation process to utilize a consistent interface to a remote database. These components also allow changes and customizations to be easily incorporated into the solution process to enhance performance, without relying upon third-party support. An analysis of the multi-level parallel solver OVERFLOW-MLP is presented, and the results indicate that it is feasible to utilize large numbers of processors (≈100) even with a grid system with a relatively small number of cells (≈10^6). A more detailed discussion of the simulation process, as well as refined data for the scaling of the OVERFLOW-MLP flow solver, will be included in the full paper.

  7. The Developmental Lexicon Project: A behavioral database to investigate visual word recognition across the lifespan.

    PubMed

    Schröter, Pauline; Schroeder, Sascha

    2017-12-01

    With the Developmental Lexicon Project (DeveL), we present a large-scale study that was conducted to collect data on visual word recognition in German across the lifespan. A total of 800 children from Grades 1 to 6, as well as two groups of younger and older adults, participated in the study and completed a lexical decision and a naming task. We provide a database for 1,152 German words, comprising behavioral data from seven different stages of reading development, along with sublexical and lexical characteristics for all stimuli. The present article describes our motivation for this project, explains the methods we used to collect the data, and reports analyses on the reliability of our results. In addition, we explored developmental changes in three marker effects in psycholinguistic research: word length, word frequency, and orthographic similarity. The database is available online.

  8. A priori testing of subgrid-scale models for the velocity-pressure and vorticity-velocity formulations

    NASA Technical Reports Server (NTRS)

    Winckelmans, G. S.; Lund, T. S.; Carati, D.; Wray, A. A.

    1996-01-01

    Subgrid-scale models for Large Eddy Simulation (LES) in both the velocity-pressure and the vorticity-velocity formulations were evaluated and compared in a priori tests using spectral Direct Numerical Simulation (DNS) databases of isotropic turbulence: a 128^3 DNS of forced turbulence (Re_λ = 95.8) filtered, using the sharp cutoff filter, to both 32^3 and 16^3 synthetic LES fields; and a 512^3 DNS of decaying turbulence (Re_λ = 63.5) filtered to both 64^3 and 32^3 LES fields. Gaussian and top-hat filters were also used with the 128^3 database. Different LES models were evaluated for each formulation: eddy-viscosity models, hyper eddy-viscosity models, mixed models, and scale-similarity models. Correlations between exact versus modeled subgrid-scale quantities were measured at three levels: tensor (traceless), vector (solenoidal 'force'), and scalar (dissipation) levels, and for both cases of uniform and variable coefficient(s). Different choices for the 1/T scaling appearing in the eddy-viscosity were also evaluated. It was found that the models for the vorticity-velocity formulation produce higher correlations with the filtered DNS data than their counterparts in the velocity-pressure formulation. It was also found that the hyper eddy-viscosity model performs better than the eddy-viscosity model, in both formulations.
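
    In an a priori test, the exact subgrid-scale quantity computed from filtered DNS data is compared against the model's prediction at the same points, and a correlation coefficient is reported. The snippet below is a schematic of the scalar (dissipation) level check only, with synthetic arrays standing in for the filtered DNS fields; it is not the authors' analysis code.

      import numpy as np

      rng = np.random.default_rng(1)
      n = 4096                                   # number of grid points (synthetic stand-in)

      # Exact SGS dissipation from filtered DNS, and a "modeled" one that is only
      # partially correlated with it (synthetic data for illustration).
      eps_exact = rng.normal(size=n) ** 2
      eps_model = 0.6 * eps_exact + 0.4 * rng.normal(size=n) ** 2

      def correlation(a, b):
          """Pearson correlation coefficient between exact and modeled SGS quantities."""
          a, b = a - a.mean(), b - b.mean()
          return float(a @ b / np.sqrt((a @ a) * (b @ b)))

      print("scalar-level (dissipation) correlation:", round(correlation(eps_exact, eps_model), 3))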

  9. Fast large-scale object retrieval with binary quantization

    NASA Astrophysics Data System (ADS)

    Zhou, Shifu; Zeng, Dan; Shen, Wei; Zhang, Zhijiang; Tian, Qi

    2015-11-01

    The objective of large-scale object retrieval systems is to search for images that contain the target object in an image database. Whereas state-of-the-art approaches rely on global image representations to conduct searches, we consider many boxes per image as candidates and search locally within a picture. In this paper, a feature quantization algorithm called binary quantization is proposed. In binary quantization, a scale-invariant feature transform (SIFT) feature is quantized into a descriptive and discriminative bit-vector, which adapts naturally to the classic inverted file structure for box indexing. The inverted file, which stores the bit-vector and the ID of the box in which the SIFT feature is located, is compact and can be loaded into main memory for efficient box indexing. We evaluate our approach on available object retrieval datasets. Experimental results demonstrate that the proposed approach is fast and achieves excellent search quality. Therefore, the proposed approach is an improvement over state-of-the-art approaches for object retrieval.
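
    The core idea is to quantize each 128-dimensional SIFT descriptor into a short bit pattern and use it as a key into an inverted file whose posting lists record which box (and image) the feature came from. The following sketch illustrates this with a simple per-dimension median threshold; the thresholding rule and key layout are assumptions for illustration, not the paper's exact binary quantization scheme.

      import numpy as np
      from collections import defaultdict

      rng = np.random.default_rng(2)
      descriptors = rng.random((1000, 128))          # synthetic SIFT descriptors
      boxes = [(i % 50, i % 7) for i in range(1000)] # (image_id, box_id) for each descriptor

      # Binarize each descriptor against per-dimension medians; keep the first 16 bits as the key.
      medians = np.median(descriptors, axis=0)
      def quantize(d):
          bits = (d > medians).astype(np.uint8)[:16]
          return bits.tobytes()                      # compact hashable key for the inverted file

      inverted_file = defaultdict(list)              # bit pattern -> posting list of (image, box)
      for d, ib in zip(descriptors, boxes):
          inverted_file[quantize(d)].append(ib)

      # Query: look up boxes whose features share the query's bit pattern.
      query = rng.random(128)
      candidates = inverted_file.get(quantize(query), [])
      print(len(inverted_file), "posting lists;", len(candidates), "candidate boxes for the query")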

  10. Development of the Transport Class Model (TCM) Aircraft Simulation From a Sub-Scale Generic Transport Model (GTM) Simulation

    NASA Technical Reports Server (NTRS)

    Hueschen, Richard M.

    2011-01-01

    A six degree-of-freedom, flat-earth dynamics, non-linear, and non-proprietary aircraft simulation was developed that is representative of a generic mid-sized twin-jet transport aircraft. The simulation was developed from a non-proprietary, publicly available, subscale twin-jet transport aircraft simulation using scaling relationships and a modified aerodynamic database. The simulation has an extended aerodynamics database with aero data outside the normal transport-operating envelope (large angle-of-attack and sideslip values). The simulation has representative transport aircraft surface actuator models with variable rate-limits and generally fixed position limits. The simulation contains a generic 40,000 lb sea level thrust engine model. The engine model is a first order dynamic model with a variable time constant that changes according to simulation conditions. The simulation provides a means for interfacing a flight control system to use the simulation sensor variables and to command the surface actuators and throttle position of the engine model.

  11. Intermittency measurement in two-dimensional bacterial turbulence

    NASA Astrophysics Data System (ADS)

    Qiu, Xiang; Ding, Long; Huang, Yongxiang; Chen, Ming; Lu, Zhiming; Liu, Yulu; Zhou, Quan

    2016-06-01

    In this paper, an experimental velocity database of a bacterial collective motion, e.g., Bacillus subtilis, in the turbulent phase with volume filling fraction 84%, provided by Professor Goldstein at Cambridge University (UK), was analyzed to emphasize the scaling behavior of this active turbulence system. This was accomplished by performing a Hilbert-based methodology analysis to retrieve the scaling property without the β-limitation. A dual power-law behavior separated by the viscosity scale ℓ_ν was observed for the qth-order Hilbert moment L_q(k). This dual power law belongs to an inverse cascade since the scaling range is above the injection scale R, e.g., the bacterial body length. The measured scaling exponents ζ(q) of both the small-scale (k > k_ν) and large-scale (k
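
    The dual power law referred to above can be written schematically as follows; the split into separate large-scale and small-scale exponents and the exact definition of L_q(k) are illustrative notation and may differ in detail from the paper's Hilbert-based formulation.

      % Schematic dual power-law scaling of the q-th order Hilbert moment
      % (\zeta_L and \zeta_S label the assumed large- and small-scale exponents)
      \mathcal{L}_q(k) \;\propto\;
      \begin{cases}
        k^{-\zeta_L(q)}, & k < k_\nu \quad \text{(large scales, above the injection scale } R\text{)},\\
        k^{-\zeta_S(q)}, & k > k_\nu \quad \text{(small scales)},
      \end{cases}
      \qquad k_\nu \sim 1/\ell_\nu .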

  12. CanvasDB: a local database infrastructure for analysis of targeted- and whole genome re-sequencing projects

    PubMed Central

    Ameur, Adam; Bunikis, Ignas; Enroth, Stefan; Gyllensten, Ulf

    2014-01-01

    CanvasDB is an infrastructure for management and analysis of genetic variants from massively parallel sequencing (MPS) projects. The system stores SNP and indel calls in a local database, designed to handle very large datasets, to allow for rapid analysis using simple commands in R. Functional annotations are included in the system, making it suitable for direct identification of disease-causing mutations in human exome- (WES) or whole-genome sequencing (WGS) projects. The system has a built-in filtering function implemented to simultaneously take into account variant calls from all individual samples. This enables advanced comparative analysis of variant distribution between groups of samples, including detection of candidate causative mutations within family structures and genome-wide association by sequencing. In most cases, these analyses are executed within just a matter of seconds, even when there are several hundreds of samples and millions of variants in the database. We demonstrate the scalability of canvasDB by importing the individual variant calls from all 1092 individuals present in the 1000 Genomes Project into the system, over 4.4 billion SNPs and indels in total. Our results show that canvasDB makes it possible to perform advanced analyses of large-scale WGS projects on a local server. Database URL: https://github.com/UppsalaGenomeCenter/CanvasDB PMID:25281234
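
    canvasDB stores all SNP and indel calls in a local database and filters them across samples simultaneously. The SQL sketch below (via Python's sqlite3) illustrates the general idea of such a cross-sample filter, selecting variants carried by two affected family members but absent from an unaffected one; the table layout and sample names are hypothetical and are not canvasDB's actual schema or R interface.

      import sqlite3

      con = sqlite3.connect(":memory:")
      con.execute("CREATE TABLE calls (sample TEXT, chrom TEXT, pos INTEGER, alt TEXT)")
      con.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", [
          ("affected_1", "chr1", 12345, "T"),
          ("affected_2", "chr1", 12345, "T"),
          ("unaffected", "chr2", 999,   "G"),
          ("affected_1", "chr2", 999,   "G"),
      ])

      # Variants present in both affected samples and absent from the unaffected sample.
      rows = con.execute("""
          SELECT chrom, pos, alt FROM calls
          WHERE sample IN ('affected_1', 'affected_2')
          GROUP BY chrom, pos, alt
          HAVING COUNT(DISTINCT sample) = 2
             AND chrom || ':' || pos || ':' || alt NOT IN
                 (SELECT chrom || ':' || pos || ':' || alt FROM calls WHERE sample = 'unaffected')
      """).fetchall()
      print(rows)   # -> [('chr1', 12345, 'T')]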

  13. CanvasDB: a local database infrastructure for analysis of targeted- and whole genome re-sequencing projects.

    PubMed

    Ameur, Adam; Bunikis, Ignas; Enroth, Stefan; Gyllensten, Ulf

    2014-01-01

    CanvasDB is an infrastructure for management and analysis of genetic variants from massively parallel sequencing (MPS) projects. The system stores SNP and indel calls in a local database, designed to handle very large datasets, to allow for rapid analysis using simple commands in R. Functional annotations are included in the system, making it suitable for direct identification of disease-causing mutations in human exome- (WES) or whole-genome sequencing (WGS) projects. The system has a built-in filtering function implemented to simultaneously take into account variant calls from all individual samples. This enables advanced comparative analysis of variant distribution between groups of samples, including detection of candidate causative mutations within family structures and genome-wide association by sequencing. In most cases, these analyses are executed within just a matter of seconds, even when there are several hundreds of samples and millions of variants in the database. We demonstrate the scalability of canvasDB by importing the individual variant calls from all 1092 individuals present in the 1000 Genomes Project into the system, over 4.4 billion SNPs and indels in total. Our results show that canvasDB makes it possible to perform advanced analyses of large-scale WGS projects on a local server. Database URL: https://github.com/UppsalaGenomeCenter/CanvasDB. © The Author(s) 2014. Published by Oxford University Press.

  14. Improvements of the Penalty Avoiding Rational Policy Making Algorithm and an Application to the Othello Game

    NASA Astrophysics Data System (ADS)

    Miyazaki, Kazuteru; Tsuboi, Sougo; Kobayashi, Shigenobu

    The purpose of reinforcement learning is, in general, to learn an optimal policy. However, in two-player games such as Othello, it is important to acquire a penalty avoiding policy. In this paper, we focus on the formation of a penalty avoiding policy based on the Penalty Avoiding Rational Policy Making algorithm [Miyazaki 01]. In applying it to large-scale problems, we are confronted with the curse of dimensionality. We introduce several ideas and heuristics to overcome the combinatorial explosion in large-scale problems. First, we propose an algorithm that saves memory by calculating state transitions. Second, we describe how to restrict exploration using two types of knowledge: a KIFU database and an evaluation function. We show that our learning player can always defeat the well-known Othello program KITTY.

  15. Methods comparison for microsatellite marker development: Different isolation methods, different yield efficiency

    NASA Astrophysics Data System (ADS)

    Zhan, Aibin; Bao, Zhenmin; Hu, Xiaoli; Lu, Wei; Hu, Jingjie

    2009-06-01

    Microsatellite markers have become one of the most important kinds of molecular tools used in various research fields. A large number of microsatellite markers are required for whole genome surveys in the fields of molecular ecology, quantitative genetics, and genomics. Therefore, it is extremely necessary to select several versatile, low-cost, efficient, and time- and labor-saving methods to develop a large panel of microsatellite markers. In this study, we used the Zhikong scallop (Chlamys farreri) as the target species to compare the efficiency of five methods derived from three strategies for microsatellite marker development. The results showed that the strategy of constructing a small-insert genomic DNA library resulted in poor efficiency, while the microsatellite-enriched strategy greatly improved the isolation efficiency. Although the public-database mining strategy is time- and cost-saving, it is difficult to obtain a large number of microsatellite markers this way, mainly due to the limited sequence data of non-model species deposited in public databases. Based on the results of this study, we recommend two methods, the microsatellite-enriched library construction method and the FIASCO-colony hybridization method, for large-scale microsatellite marker development. Both methods were derived from the microsatellite-enriched strategy. The experimental results obtained from the Zhikong scallop also provide a reference for microsatellite marker development in other species with large genomes.

  16. Adverse Events Associated with Prolonged Antibiotic Use

    PubMed Central

    Meropol, Sharon B.; Chan, K. Arnold; Chen, Zhen; Finkelstein, Jonathan A.; Hennessy, Sean; Lautenbach, Ebbing; Platt, Richard; Schech, Stephanie D.; Shatin, Deborah; Metlay, Joshua P.

    2014-01-01

    Purpose The Infectious Diseases Society of America and US CDC recommend 60 days of ciprofloxacin, doxycycline or amoxicillin for anthrax prophylaxis. It is not possible to determine severe adverse drug event (ADE) risks from the few people thus far exposed to anthrax prophylaxis. This study’s objective was to estimate risks of severe ADEs associated with long-term ciprofloxacin, doxycycline and amoxicillin exposure using 3 large databases: one electronic medical record (General Practice Research Database) and two claims databases (UnitedHealthcare, HMO Research Network). Methods We include office visit, hospital admission and prescription data for 1/1/1999–6/30/2001. The exposure variable was oral antibiotic person-days (pds). The primary outcome was hospitalization during exposure with ADE diagnoses: anaphylaxis, phototoxicity, hepatotoxicity, nephrotoxicity, seizures, ventricular arrhythmia or infectious colitis. Results We randomly sampled 999,773, 1,047,496 and 1,819,004 patients from Databases A, B and C respectively. 33,183 amoxicillin, 15,250 ciprofloxacin and 50,171 doxycycline prescriptions continued ≥30 days. ADE hospitalizations during long-term exposure were not observed in Database A. ADEs during long-term amoxicillin were seen only in Database C with 5 ADEs or 1.2(0.4–2.7) ADEs/100,000 pds exposure. Long-term ciprofloxacin showed 3 and 4 ADEs with 5.7(1.2–16.6) and 3.5(1.0–9.0) ADEs/100,000 pds in Databases B and C, respectively. Only Database B had ADEs during long-term doxycycline with 3 ADEs or 0.9(0.2–2.6) ADEs/100,000 pds. For most events, the incidence rate ratio, comparing >28 vs. 1–28 pds exposure, was <1, showing limited evidence for cumulative dose-related ADEs from long-term exposure. Conclusions Long-term amoxicillin, ciprofloxacin and doxycycline appear safe, supporting use of these medications if needed for large-scale post-exposure anthrax prophylaxis. PMID:18215001
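
    The event rates above are expressed as ADEs per 100,000 person-days with a confidence interval. The helper below shows a standard way such a rate and an exact Poisson interval can be computed; the numbers in the example are illustrative, not the study's data.

      from scipy.stats import chi2

      def rate_per_100k(events, person_days, alpha=0.05):
          """Point estimate and exact Poisson CI, expressed per 100,000 person-days."""
          lower = 0.0 if events == 0 else chi2.ppf(alpha / 2, 2 * events) / 2
          upper = chi2.ppf(1 - alpha / 2, 2 * (events + 1)) / 2
          scale = 1e5 / person_days
          return events * scale, lower * scale, upper * scale

      # Example: 5 events observed over roughly 417,000 person-days of exposure.
      est, lo, hi = rate_per_100k(5, 417_000)
      print(f"{est:.1f} ({lo:.1f}-{hi:.1f}) ADEs per 100,000 person-days")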

  17. Redundancy control in pathway databases (ReCiPa): an application for improving gene-set enrichment analysis in Omics studies and "Big data" biology.

    PubMed

    Vivar, Juan C; Pemu, Priscilla; McPherson, Ruth; Ghosh, Sujoy

    2013-08-01

    Unparalleled technological advances have fueled an explosive growth in the scope and scale of biological data and have propelled life sciences into the realm of "Big Data" that cannot be managed or analyzed by conventional approaches. Big Data in the life sciences are driven primarily via a diverse collection of 'omics'-based technologies, including genomics, proteomics, metabolomics, transcriptomics, metagenomics, and lipidomics. Gene-set enrichment analysis is a powerful approach for interrogating large 'omics' datasets, leading to the identification of biological mechanisms associated with observed outcomes. While several factors influence the results from such analysis, the impact from the contents of pathway databases is often under-appreciated. Pathway databases often contain variously named pathways that overlap with one another to varying degrees. Ignoring such redundancies during pathway analysis can lead to the designation of several pathways as being significant due to high content-similarity, rather than truly independent biological mechanisms. Statistically, such dependencies also result in correlated p values and overdispersion, leading to biased results. We investigated the level of redundancies in multiple pathway databases and observed large discrepancies in the nature and extent of pathway overlap. This prompted us to develop the application, ReCiPa (Redundancy Control in Pathway Databases), to control redundancies in pathway databases based on user-defined thresholds. Analysis of genomic and genetic datasets, using ReCiPa-generated overlap-controlled versions of KEGG and Reactome pathways, led to a reduction in redundancy among the top-scoring gene-sets and allowed for the inclusion of additional gene-sets representing possibly novel biological mechanisms. Using obesity as an example, bioinformatic analysis further demonstrated that gene-sets identified from overlap-controlled pathway databases show stronger evidence of prior association to obesity compared to pathways identified from the original databases.
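
    Controlling redundancy amounts to measuring how much gene sets overlap and collapsing pathways that exceed a user-defined threshold. The sketch below shows a basic overlap computation and a naive merge step; the overlap measure (Jaccard) and the merging rule are illustrative assumptions, not ReCiPa's exact procedure.

      from itertools import combinations

      pathways = {
          "glycolysis_A": {"HK1", "PFKM", "PKM", "ALDOA"},
          "glycolysis_B": {"HK1", "PFKM", "PKM", "ENO1"},
          "tca_cycle":    {"CS", "IDH1", "SDHA", "FH"},
      }
      threshold = 0.5                       # user-defined overlap threshold

      def jaccard(a, b):
          return len(a & b) / len(a | b)

      # Merge pathway pairs whose overlap exceeds the threshold into a single gene set.
      merged = dict(pathways)
      for p, q in combinations(pathways, 2):
          if p in merged and q in merged and jaccard(pathways[p], pathways[q]) > threshold:
              merged[f"{p}|{q}"] = merged.pop(p) | merged.pop(q)

      print(sorted(merged))                 # the two redundant glycolysis sets collapse into one entry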

  18. Scaling laws and fluctuations in the statistics of word frequencies

    NASA Astrophysics Data System (ADS)

    Gerlach, Martin; Altmann, Eduardo G.

    2014-11-01

    In this paper, we combine statistical analysis of written texts and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. The average vocabulary of an ensemble of fixed-length texts is known to scale sublinearly with the total number of words (Heaps’ law). Analyzing the fluctuations around this average in three large databases (Google-ngram, English Wikipedia, and a collection of scientific articles), we find that the standard deviation scales linearly with the average (Taylor's law), in contrast to the prediction of decaying fluctuations obtained using simple sampling arguments. We explain both scaling laws (Heaps’ and Taylor's) by modeling the usage of words using a Poisson process with a fat-tailed distribution of word frequencies (Zipf's law) and topic-dependent frequencies of individual words (as in topic models). Considering topical variations leads to quenched averages, turns the vocabulary size into a non-self-averaging quantity, and explains the empirical observations. For the numerous practical applications relying on estimations of vocabulary size, our results show that uncertainties remain large even for long texts. We show how to account for these uncertainties in measurements of the lexical richness of texts with different lengths.
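
    The sketch below illustrates the kind of stochastic model involved: words are drawn with Zipf-distributed frequencies, and the vocabulary size of fixed-length texts is measured across an ensemble to compare the mean (Heaps' law) with the fluctuations around it. The vocabulary size, exponent, and ensemble size here are arbitrary toy choices, and the simple sampling shown omits the topic-dependent frequencies the paper adds.

      import numpy as np

      rng = np.random.default_rng(3)
      V, gamma = 50_000, 1.8                     # vocabulary size and Zipf exponent (toy values)
      probs = 1.0 / np.arange(1, V + 1) ** gamma
      probs /= probs.sum()

      def vocabulary_sizes(text_length, n_texts=100):
          """Number of distinct words in each of n_texts random texts of the given length."""
          draws = rng.choice(V, size=(n_texts, text_length), p=probs)
          return np.array([len(np.unique(row)) for row in draws])

      for n in (1_000, 5_000, 25_000):
          v = vocabulary_sizes(n)
          print(f"N={n:>6}: mean vocab {v.mean():8.1f}, std {v.std():7.1f}")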

  19. bioNerDS: exploring bioinformatics’ database and software use through literature mining

    PubMed Central

    2013-01-01

    Background Biology-focused databases and software define bioinformatics and their use is central to computational biology. In such a complex and dynamic field, it is of interest to understand what resources are available, which are used, how much they are used, and for what they are used. While scholarly literature surveys can provide some insights, large-scale computer-based approaches to identify mentions of bioinformatics databases and software from primary literature would automate systematic cataloguing, facilitate the monitoring of usage, and provide the foundations for the recovery of computational methods for analysing biological data, with the long-term aim of identifying best/common practice in different areas of biology. Results We have developed bioNerDS, a named entity recogniser for the recovery of bioinformatics databases and software from primary literature. We identify such entities with an F-measure ranging from 63% to 91% at the mention level and 63-78% at the document level, depending on corpus. Not attaining a higher F-measure is mostly due to high ambiguity in resource naming, which is compounded by the on-going introduction of new resources. To demonstrate the software, we applied bioNerDS to full-text articles from BMC Bioinformatics and Genome Biology. General mention patterns reflect the remit of these journals, highlighting BMC Bioinformatics’s emphasis on new tools and Genome Biology’s greater emphasis on data analysis. The data also illustrates some shifts in resource usage: for example, the past decade has seen R and the Gene Ontology join BLAST and GenBank as the main components in bioinformatics processing. Conclusions We demonstrate the feasibility of automatically identifying resource names on a large-scale from the scientific literature and show that the generated data can be used for exploration of bioinformatics database and software usage. For example, our results help to investigate the rate of change in resource usage and corroborate the suspicion that a vast majority of resources are created, but rarely (if ever) used thereafter. bioNerDS is available at http://bionerds.sourceforge.net/. PMID:23768135

  20. Visualising biological data: a semantic approach to tool and database integration

    PubMed Central

    Pettifer, Steve; Thorne, David; McDermott, Philip; Marsh, James; Villéger, Alice; Kell, Douglas B; Attwood, Teresa K

    2009-01-01

    Motivation In the biological sciences, the need to analyse vast amounts of information has become commonplace. Such large-scale analyses often involve drawing together data from a variety of different databases, held remotely on the internet or locally on in-house servers. Supporting these tasks are ad hoc collections of data-manipulation tools, scripting languages and visualisation software, which are often combined in arcane ways to create cumbersome systems that have been customised for a particular purpose, and are consequently not readily adaptable to other uses. For many day-to-day bioinformatics tasks, the sizes of current databases, and the scale of the analyses necessary, now demand increasing levels of automation; nevertheless, the unique experience and intuition of human researchers is still required to interpret the end results in any meaningful biological way. Putting humans in the loop requires tools to support real-time interaction with these vast and complex data-sets. Numerous tools do exist for this purpose, but many do not have optimal interfaces, most are effectively isolated from other tools and databases owing to incompatible data formats, and many have limited real-time performance when applied to realistically large data-sets: much of the user's cognitive capacity is therefore focused on controlling the software and manipulating esoteric file formats rather than on performing the research. Methods To confront these issues, harnessing expertise in human-computer interaction (HCI), high-performance rendering and distributed systems, and guided by bioinformaticians and end-user biologists, we are building reusable software components that, together, create a toolkit that is both architecturally sound from a computing point of view, and addresses both user and developer requirements. Key to the system's usability is its direct exploitation of semantics, which, crucially, gives individual components knowledge of their own functionality and allows them to interoperate seamlessly, removing many of the existing barriers and bottlenecks from standard bioinformatics tasks. Results The toolkit, named Utopia, is freely available from http://utopia.cs.man.ac.uk/. PMID:19534744

  1. High throughput profile-profile based fold recognition for the entire human proteome.

    PubMed

    McGuffin, Liam J; Smith, Richard T; Bryson, Kevin; Sørensen, Søren-Aksel; Jones, David T

    2006-06-07

    In order to maintain the most comprehensive structural annotation databases we must carry out regular updates for each proteome using the latest profile-profile fold recognition methods. The ability to carry out these updates on demand is necessary to keep pace with the regular updates of sequence and structure databases. Providing the highest quality structural models requires the most intensive profile-profile fold recognition methods running with the very latest available sequence databases and fold libraries. However, running these methods on such a regular basis for every sequenced proteome requires large amounts of processing power. In this paper we describe and benchmark the JYDE (Job Yield Distribution Environment) system, which is a meta-scheduler designed to work above cluster schedulers, such as Sun Grid Engine (SGE) or Condor. We demonstrate the ability of JYDE to distribute the load of genomic-scale fold recognition across multiple independent Grid domains. We use the most recent profile-profile version of our mGenTHREADER software in order to annotate the latest version of the Human proteome against the latest sequence and structure databases in as short a time as possible. We show that our JYDE system is able to scale to large numbers of intensive fold recognition jobs running across several independent computer clusters. Using our JYDE system we have been able to annotate 99.9% of the protein sequences within the Human proteome in less than 24 hours, by harnessing over 500 CPUs from 3 independent Grid domains. This study clearly demonstrates the feasibility of carrying out on demand high quality structural annotations for the proteomes of major eukaryotic organisms. Specifically, we have shown that it is now possible to provide complete regular updates of profile-profile based fold recognition models for entire eukaryotic proteomes, through the use of Grid middleware such as JYDE.

  2. Frequency and pattern of Chinese herbal medicine prescriptions for urticaria in Taiwan during 2009: analysis of the national health insurance database

    PubMed Central

    2013-01-01

    Background Large-scale pharmaco-epidemiological studies of Chinese herbal medicine (CHM) for treatment of urticaria are few, even though clinical trials showed some CHM are effective. The purpose of this study was to explore the frequencies and patterns of CHM prescriptions for urticaria by analysing the population-based CHM database in Taiwan. Methods This study was linked to and processed through the complete traditional CHM database of the National Health Insurance Research Database in Taiwan during 2009. We calculated the frequencies and patterns of CHM prescriptions used for treatment of urticaria, of which the diagnosis was defined as the single ICD-9 Code of 708. Frequent itemset mining, as applied to data mining, was used to analyse co-prescription of CHM for patients with urticaria. Results There were 37,386 subjects who visited traditional Chinese Medicine clinics for urticaria in Taiwan during 2009 and received a total of 95,765 CHM prescriptions. Subjects between 18 and 35 years of age comprised the largest number of those treated (32.76%). In addition, women used CHM for urticaria more frequently than men (female:male = 1.94:1). There was an average of 5.54 items prescribed in the form of either individual Chinese herbs or a formula in a single CHM prescription for urticaria. Bai-Xian-Pi (Dictamnus dasycarpus Turcz) was the most commonly prescribed single Chinese herb while Xiao-Feng San was the most commonly prescribed Chinese herbal formula. The most commonly prescribed CHM drug combination was Xiao-Feng San plus Bai-Xian-Pi while the most commonly prescribed triple drug combination was Xiao-Feng San, Bai-Xian-Pi, and Di-Fu Zi (Kochia scoparia). Conclusions In view of the popularity of CHM such as Xiao-Feng San prescribed for the wind-heat pattern of urticaria in this study, a large-scale, randomized clinical trial is warranted to research their efficacy and safety. PMID:23947955
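
    Frequent itemset mining over prescription records amounts to counting how often combinations of herbs and formulae co-occur within the same prescription and keeping the combinations that exceed a support threshold. A minimal counting sketch with made-up prescription data is shown below; it is not the specific mining implementation used in the study.

      from collections import Counter
      from itertools import combinations

      prescriptions = [                             # each list = one CHM prescription (toy data)
          ["Xiao-Feng San", "Bai-Xian-Pi", "Di-Fu Zi"],
          ["Xiao-Feng San", "Bai-Xian-Pi"],
          ["Xiao-Feng San", "Di-Fu Zi"],
          ["Bai-Xian-Pi"],
      ]
      min_support = 2

      counts = Counter()
      for items in prescriptions:
          for size in (2, 3):                       # pairs and triples of co-prescribed items
              counts.update(combinations(sorted(items), size))

      frequent = {combo: n for combo, n in counts.items() if n >= min_support}
      print(frequent)   # e.g. ('Bai-Xian-Pi', 'Xiao-Feng San') appears in 2 prescriptions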

  3. Frequency and pattern of Chinese herbal medicine prescriptions for urticaria in Taiwan during 2009: analysis of the national health insurance database.

    PubMed

    Chien, Pei-Shan; Tseng, Yu-Fang; Hsu, Yao-Chin; Lai, Yu-Kai; Weng, Shih-Feng

    2013-08-15

    Large-scale pharmaco-epidemiological studies of Chinese herbal medicine (CHM) for treatment of urticaria are few, even though clinical trials showed some CHM are effective. The purpose of this study was to explore the frequencies and patterns of CHM prescriptions for urticaria by analysing the population-based CHM database in Taiwan. This study was linked to and processed through the complete traditional CHM database of the National Health Insurance Research Database in Taiwan during 2009. We calculated the frequencies and patterns of CHM prescriptions used for treatment of urticaria, of which the diagnosis was defined as the single ICD-9 Code of 708. Frequent itemset mining, as applied to data mining, was used to analyse co-prescription of CHM for patients with urticaria. There were 37,386 subjects who visited traditional Chinese Medicine clinics for urticaria in Taiwan during 2009 and received a total of 95,765 CHM prescriptions. Subjects between 18 and 35 years of age comprised the largest number of those treated (32.76%). In addition, women used CHM for urticaria more frequently than men (female:male = 1.94:1). There was an average of 5.54 items prescribed in the form of either individual Chinese herbs or a formula in a single CHM prescription for urticaria. Bai-Xian-Pi (Dictamnus dasycarpus Turcz) was the most commonly prescribed single Chinese herb while Xiao-Feng San was the most commonly prescribed Chinese herbal formula. The most commonly prescribed CHM drug combination was Xiao-Feng San plus Bai-Xian-Pi while the most commonly prescribed triple drug combination was Xiao-Feng San, Bai-Xian-Pi, and Di-Fu Zi (Kochia scoparia). In view of the popularity of CHM such as Xiao-Feng San prescribed for the wind-heat pattern of urticaria in this study, a large-scale, randomized clinical trial is warranted to research their efficacy and safety.

  4. Visualising biological data: a semantic approach to tool and database integration.

    PubMed

    Pettifer, Steve; Thorne, David; McDermott, Philip; Marsh, James; Villéger, Alice; Kell, Douglas B; Attwood, Teresa K

    2009-06-16

    In the biological sciences, the need to analyse vast amounts of information has become commonplace. Such large-scale analyses often involve drawing together data from a variety of different databases, held remotely on the internet or locally on in-house servers. Supporting these tasks are ad hoc collections of data-manipulation tools, scripting languages and visualisation software, which are often combined in arcane ways to create cumbersome systems that have been customized for a particular purpose, and are consequently not readily adaptable to other uses. For many day-to-day bioinformatics tasks, the sizes of current databases, and the scale of the analyses necessary, now demand increasing levels of automation; nevertheless, the unique experience and intuition of human researchers is still required to interpret the end results in any meaningful biological way. Putting humans in the loop requires tools to support real-time interaction with these vast and complex data-sets. Numerous tools do exist for this purpose, but many do not have optimal interfaces, most are effectively isolated from other tools and databases owing to incompatible data formats, and many have limited real-time performance when applied to realistically large data-sets: much of the user's cognitive capacity is therefore focused on controlling the software and manipulating esoteric file formats rather than on performing the research. To confront these issues, harnessing expertise in human-computer interaction (HCI), high-performance rendering and distributed systems, and guided by bioinformaticians and end-user biologists, we are building reusable software components that, together, create a toolkit that is both architecturally sound from a computing point of view, and addresses both user and developer requirements. Key to the system's usability is its direct exploitation of semantics, which, crucially, gives individual components knowledge of their own functionality and allows them to interoperate seamlessly, removing many of the existing barriers and bottlenecks from standard bioinformatics tasks. The toolkit, named Utopia, is freely available from http://utopia.cs.man.ac.uk/.

  5. Database for the geologic map of the Sauk River 30-minute by 60-minute quadrangle, Washington (I-2592)

    USGS Publications Warehouse

    Tabor, R.W.; Booth, D.B.; Vance, J.A.; Ford, A.B.

    2006-01-01

    This digital map database has been prepared by R.W. Tabor from the published Geologic map of the Sauk River 30- by 60 Minute Quadrangle, Washington. Together with the accompanying text files as PDF, it provides information on the geologic structure and stratigraphy of the area covered. The database delineates map units that are identified by general age and lithology following the stratigraphic nomenclature of the U.S. Geological Survey. The authors mapped most of the bedrock geology at 1:100,000 scale, but compiled most Quaternary units at 1:24,000 scale. The Quaternary contacts and structural data have been much simplified for the 1:100,000-scale map and database. The spatial resolution (scale) of the database is 1:100,000 or smaller. This database depicts the distribution of geologic materials and structures at a regional (1:100,000) scale. The report is intended to provide geologic information for the regional study of materials properties, earthquake shaking, landslide potential, mineral hazards, seismic velocity, and earthquake faults. In addition, the report contains information and interpretations about the regional geologic history and framework. However, the regional scale of this report does not provide sufficient detail for site development purposes.

  6. GIS-project: geodynamic globe for global monitoring of geological processes

    NASA Astrophysics Data System (ADS)

    Ryakhovsky, V.; Rundquist, D.; Gatinsky, Yu.; Chesalova, E.

    2003-04-01

    A multilayer geodynamic globe at the scale 1:10,000,000 was created at the end of the nineties in the GIS Center of the Vernadsky Museum. A special software-and-hardware complex, with a set of multitarget object-oriented databases, was elaborated for its visualization. The globe includes separate thematic covers represented by digital sets of spatial geological, geochemical, and geophysical information (maps, schemes, profiles, stratigraphic columns, organized databases, etc.). At present the largest databases included in the globe program concern petrochemical and isotopic data on magmatic rocks of the World Ocean and the large and superlarge mineral deposits. Software by the Environmental Scientific Research Institute (ESRI), USA, as well as the ArcScan vectorizer, was used for digitizing the covers and adapting the databases (ARC/INFO 7.0, 8.0). All layers of the geoinformational project were obtained by scanning separate objects and transferring them to real geographic coordinates in an equidistant conic projection. The covers were then projected onto plane degree-system geographic coordinates. Attributive databases were formed for each thematic layer, and in the last stage all covers were combined into a single information system. Separate digital covers represent mathematical descriptions of geological objects and the relations between them, such as the Earth's altimetry, active fault systems, seismicity, etc. Principles of cartographic generalization were taken into consideration during compilation of the covers, with projection and coordinate systems precisely matching the given scale. The globe allows us to carry out, in an interactive regime, the formation of mutually coordinated object-oriented databases and the thematic covers directly connected with them. They can be extended to the whole Earth and the near-Earth space, and to the best known parts of divergent and convergent boundaries of the lithosphere plates. Such covers and time series reflect in diagram form the total combination and dynamics of data on the geological structure, geophysical fields, seismicity, geomagnetism, composition of rock complexes, and metallogeny of different areas of the Earth's surface. They make it possible to scale, detail, and develop 3D spatial visualization. The information filling the covers can be replenished with new data, both in existing and in newly formed databases. Integrated analysis of the data allows us to refine our ideas on regularities in the development of lithosphere and mantle inhomogeneities using original technologies. It also enables us to work out 3D digital models of the geodynamic development of tectonic zones at convergent and divergent plate boundaries, with the purpose of integrated monitoring of mineral resources and establishing correlations between seismicity, magmatic activity, and metallogeny in time-spatial coordinates. The created multifold geoinformation system makes it possible to perform integrated analysis of geoinformation flows in an interactive regime and, in particular, to establish regularities in the time-spatial distribution and dynamics of the main structural units of the lithosphere, as well as to illuminate the connection between stages of their development and epochs of formation of large and superlarge mineral deposits. We are now trying to use the system for prediction of large oil and gas concentrations in the main sedimentary basins.
The work was supported by RFBR (grants 93-07-14680, 96-07-89499, 99-07-90030, 00-15-98535, 02-07-90140) and MTC.

  7. Long-term citizen-collected data reveal geographical patterns and temporal trends in lake water clarity

    USGS Publications Warehouse

    Lottig, Noah R.; Wagner, Tyler; Henry, Emily N.; Cheruvelil, Kendra Spence; Webster, Katherine E.; Downing, John A.; Stow, Craig A.

    2014-01-01

    We compiled a lake-water clarity database using publicly available, citizen volunteer observations made between 1938 and 2012 across eight states in the Upper Midwest, USA. Our objectives were to determine (1) whether temporal trends in lake-water clarity existed across this large geographic area and (2) whether trends were related to the lake-specific characteristics of latitude, lake size, or time period the lake was monitored. Our database consisted of >140,000 individual Secchi observations from 3,251 lakes that we summarized per lake-year, resulting in 21,020 summer averages. Using Bayesian hierarchical modeling, we found approximately a 1% per year increase in water clarity (quantified as Secchi depth) for the entire population of lakes. On an individual lake basis, 7% of lakes showed increased water clarity and 4% showed decreased clarity. Trend direction and strength were related to latitude and median sample date. Lakes in the southern part of our study-region had lower average annual summer water clarity, more negative long-term trends, and greater inter-annual variability in water clarity compared to northern lakes. Increasing trends were strongest for lakes with median sample dates earlier in the period of record (1938–2012). Our ability to identify specific mechanisms for these trends is currently hampered by the lack of a large, multi-thematic database of variables that drive water clarity (e.g., climate, land use/cover). Our results demonstrate, however, that citizen science can provide the critical monitoring data needed to address environmental questions at large spatial and long temporal scales. Collaborations among citizens, research scientists, and government agencies may be important for developing the data sources and analytical tools necessary to move toward an understanding of the factors influencing macro-scale patterns such as those shown here for lake water clarity.

  8. Application of advanced data collection and quality assurance methods in open prospective study - a case study of PONS project.

    PubMed

    Wawrzyniak, Zbigniew M; Paczesny, Daniel; Mańczuk, Marta; Zatoński, Witold A

    2011-01-01

    Large-scale epidemiologic studies can assess health indicators that differentiate social groups, as well as important health outcomes such as the incidence of and mortality from cancer, cardiovascular disease, and other conditions, establishing a solid knowledge base for preventing the causes of premature morbidity and mortality. This study presents new advanced methods of data collection and data management, with ongoing data quality control and security, to ensure high-quality assessment of health indicators in the large epidemiologic PONS study (The Polish-Norwegian Study). The material for the experiment is the data management design of the large-scale population study in Poland (PONS), and the managed processes are applied to establishing a high-quality and solid knowledge base. The functional requirements of PONS data collection, supported by advanced web-based IT methods, are fulfilled and shared by the IT system, resulting in medical data of high quality together with data security, quality assessment, process control, and evolution monitoring. Data from disparate, distributed sources of information are integrated into databases via software interfaces and archived by a multi-task secure server. The practical, implemented solution of modern database technologies and a remote software/hardware structure successfully supports the research of the large PONS study project. Development and implementation of follow-up control of the consistency and quality of data analysis and of the processes of the PONS sub-databases show excellent measurement properties, with data consistency of more than 99%. The project itself, through a tailored hardware/software application, shows the positive impact of Quality Assurance (QA) on the quality of outcome analyses and on effective data management within a shorter time. This efficiency ensures the quality of the epidemiological data and health indicators through the elimination of common errors in research questionnaires and medical measurements.

  9. KA-SB: from data integration to large scale reasoning

    PubMed Central

    Roldán-García, María del Mar; Navas-Delgado, Ismael; Kerzazi, Amine; Chniber, Othmane; Molina-Castro, Joaquín; Aldana-Montes, José F

    2009-01-01

    Background The analysis of information in the biological domain is usually focused on the analysis of data from single on-line data sources. Unfortunately, studying a biological process requires having access to disperse, heterogeneous, autonomous data sources. In this context, an analysis of the information is not possible without the integration of such data. Methods KA-SB is a querying and analysis system for final users based on combining a data integration solution with a reasoner. Thus, the tool has been created with a process divided into two steps: 1) KOMF, the Khaos Ontology-based Mediator Framework, is used to retrieve information from heterogeneous and distributed databases; 2) the integrated information is crystallized in a (persistent and high performance) reasoner (DBOWL). This information can then be further analyzed (by means of querying and reasoning). Results In this paper we present a novel system that combines the use of a mediation system with the reasoning capabilities of a large scale reasoner to provide a way of finding new knowledge and of analyzing the integrated information from different databases, which is retrieved as a set of ontology instances. This tool uses a graphical query interface to build user queries easily, which shows a graphical representation of the ontology and allows users to build queries by clicking on the ontology concepts. Conclusion These kinds of systems (based on KOMF) will provide users with very large amounts of information (interpreted as ontology instances once retrieved), which cannot be managed using traditional main memory-based reasoners. We propose a process for creating persistent and scalable knowledge bases from sets of OWL instances obtained by integrating heterogeneous data sources with KOMF. This process has been applied to develop a demo tool, which uses the BioPax Level 3 ontology as the integration schema, and integrates the UNIPROT, KEGG, CHEBI, BRENDA and SABIORK databases. PMID:19796402

  10. High dimensional biological data retrieval optimization with NoSQL technology.

    PubMed

    Wang, Shicai; Pandis, Ioannis; Wu, Chao; He, Sijin; Johnson, David; Emam, Ibrahim; Guitton, Florian; Guo, Yike

    2014-01-01

    High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
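
    Much of the gain in the key-value model comes from the row-key design: encoding, say, a patient identifier and a gene in a single sorted key so that all expression values for one entity can be fetched with a prefix scan instead of a relational join. The snippet below mimics that idea with an in-memory sorted dictionary; the key layout is an assumption for illustration and is not tranSMART's or the authors' actual HBase schema.

      # Emulate an HBase-style sorted key-value store with composite row keys.
      expr = {}                                       # row_key -> expression value

      def put(patient, gene, value):
          expr[f"{patient}#{gene}"] = value           # composite key: patient id + gene symbol

      def prefix_scan(patient):
          """Return all (gene, value) pairs for one patient, as a row-prefix scan would."""
          prefix = f"{patient}#"
          return {k.split("#", 1)[1]: v for k, v in sorted(expr.items()) if k.startswith(prefix)}

      put("P001", "TP53", 7.2)
      put("P001", "MYC", 9.1)
      put("P002", "TP53", 6.4)
      print(prefix_scan("P001"))                      # {'MYC': 9.1, 'TP53': 7.2}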

  11. High dimensional biological data retrieval optimization with NoSQL technology

    PubMed Central

    2014-01-01

    Background High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. Results In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB. Conclusions The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data. PMID:25435347

  12. Long-Term Citizen-Collected Data Reveal Geographical Patterns and Temporal Trends in Lake Water Clarity

    PubMed Central

    Lottig, Noah R.; Wagner, Tyler; Norton Henry, Emily; Spence Cheruvelil, Kendra; Webster, Katherine E.; Downing, John A.; Stow, Craig A.

    2014-01-01

    We compiled a lake-water clarity database using publicly available, citizen volunteer observations made between 1938 and 2012 across eight states in the Upper Midwest, USA. Our objectives were to determine (1) whether temporal trends in lake-water clarity existed across this large geographic area and (2) whether trends were related to the lake-specific characteristics of latitude, lake size, or time period the lake was monitored. Our database consisted of >140,000 individual Secchi observations from 3,251 lakes that we summarized per lake-year, resulting in 21,020 summer averages. Using Bayesian hierarchical modeling, we found approximately a 1% per year increase in water clarity (quantified as Secchi depth) for the entire population of lakes. On an individual lake basis, 7% of lakes showed increased water clarity and 4% showed decreased clarity. Trend direction and strength were related to latitude and median sample date. Lakes in the southern part of our study-region had lower average annual summer water clarity, more negative long-term trends, and greater inter-annual variability in water clarity compared to northern lakes. Increasing trends were strongest for lakes with median sample dates earlier in the period of record (1938–2012). Our ability to identify specific mechanisms for these trends is currently hampered by the lack of a large, multi-thematic database of variables that drive water clarity (e.g., climate, land use/cover). Our results demonstrate, however, that citizen science can provide the critical monitoring data needed to address environmental questions at large spatial and long temporal scales. Collaborations among citizens, research scientists, and government agencies may be important for developing the data sources and analytical tools necessary to move toward an understanding of the factors influencing macro-scale patterns such as those shown here for lake water clarity. PMID:24788722

  13. The Porcelain Crab Transcriptome and PCAD, the Porcelain Crab Microarray and Sequence Database

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tagmount, Abderrahmane; Wang, Mei; Lindquist, Erika

    2010-01-27

    Background: With the emergence of a completed genome sequence of the freshwater crustacean Daphnia pulex, construction of genomic-scale sequence databases for additional crustacean sequences is important for comparative genomics and annotation. Porcelain crabs, genus Petrolisthes, have been powerful crustacean models for environmental and evolutionary physiology with respect to thermal adaptation and understanding responses of marine organisms to climate change. Here, we present a large-scale EST sequencing and cDNA microarray database project for the porcelain crab Petrolisthes cinctipes. Methodology/Principal Findings: A set of ~30K unique sequences (UniSeqs) representing ~19K clusters were generated from ~98K high quality ESTs from a set of tissue-specific non-normalized and mixed-tissue normalized cDNA libraries from the porcelain crab Petrolisthes cinctipes. Homology for each UniSeq was assessed using BLAST, InterProScan, GO and KEGG database searches. Approximately 66 percent of the UniSeqs had homology in at least one of the databases. All EST and UniSeq sequences along with annotation results and coordinated cDNA microarray datasets have been made publicly accessible at the Porcelain Crab Array Database (PCAD), a feature-enriched version of the Stanford and Longhorn Array Databases. Conclusions/Significance: The EST project presented here represents the third largest sequencing effort for any crustacean, and the largest effort for any crab species. Our assembly and clustering results suggest that our porcelain crab EST data set is equally diverse to the much larger EST set generated in the Daphnia pulex genome sequencing project, and thus will be an important resource to the Daphnia research community. Our homology results support the pancrustacea hypothesis and suggest that Malacostraca may be ancestral to Branchiopoda and Hexapoda. Our results also suggest that our cDNA microarrays cover as much of the transcriptome as can reasonably be captured in EST library sequencing approaches, and thus represent a rich resource for studies of environmental genomics.

  14. A Web-based Distributed Voluntary Computing Platform for Large Scale Hydrological Computations

    NASA Astrophysics Data System (ADS)

    Demir, I.; Agliamzanov, R.

    2014-12-01

    Distributed volunteer computing can enable researchers and scientists to form large parallel computing environments that harness the computing power of millions of computers on the Internet and use them to run large-scale environmental simulations and models serving the common good of local communities and the world. Recent developments in web technologies and standards allow client-side scripting languages to run at speeds close to native applications and to utilize the power of Graphics Processing Units (GPU). Using a client-side scripting language, JavaScript, we have developed an open distributed computing framework that makes it easy for researchers to write their own hydrologic models and run them on volunteer computers. Users can easily enable their websites so that visitors can volunteer their computer resources to help run advanced hydrological models and simulations. Using a web-based system allows users to start volunteering their computational resources within seconds without installing any software. The framework distributes the model simulation to thousands of nodes in small spatial and computational units. A relational database system is utilized for managing data connections and queue management for the distributed computing nodes. In this paper, we present a web-based distributed volunteer computing platform to enable large-scale hydrological simulations and model runs in an open and integrated environment.

  15. Expanding the chemical information science gateway.

    PubMed

    Bajorath, Jürgen

    2017-01-01

    Broadly defined, chemical information science (CIS) covers chemical structure and data analysis including biological activity data as well as processing, organization, and retrieval of any form of chemical information. The CIS Gateway (CISG) of F1000Research was created to communicate research involving the entire spectrum of chemical information, including chem(o)informatics. CISG provides a forum for high-quality publications and a meaningful alternative to conventional journals. This gateway is supported by leading experts in the field recognizing the need for open science and a flexible publication platform enabling off-the-beaten path contributions. This editorial aims to further rationalize the scope of CISG, position it within its scientific environment, and open it up to a wider audience. Chemical information science is an interdisciplinary field with high potential to interface with experimental work.

  16. Expanding the chemical information science gateway

    PubMed Central

    Bajorath, Jürgen

    2017-01-01

    Broadly defined, chemical information science (CIS) covers chemical structure and data analysis including biological activity data as well as processing, organization, and retrieval of any form of chemical information. The CIS Gateway (CISG) of F1000Research was created to communicate research involving the entire spectrum of chemical information, including chem(o)informatics. CISG provides a forum for high-quality publications and a meaningful alternative to conventional journals. This gateway is supported by leading experts in the field recognizing the need for open science and a flexible publication platform enabling off-the-beaten path contributions. This editorial aims to further rationalize the scope of CISG, position it within its scientific environment, and open it up to a wider audience. Chemical information science is an interdisciplinary field with high potential to interface with experimental work. PMID:29043072

  17. Private and Efficient Query Processing on Outsourced Genomic Databases.

    PubMed

    Ghasemi, Reza; Al Aziz, Md Momin; Mohammed, Noman; Dehkordi, Massoud Hadian; Jiang, Xiaoqian

    2017-09-01

    Applications of genomic studies are spreading rapidly in many domains of science and technology, such as healthcare, biomedical research, direct-to-consumer services, and legal and forensic applications. However, there are a number of obstacles that make it hard to access and process a big genomic database for these applications. First, sequencing a genome is a time-consuming and expensive process. Second, processing genomic sequences requires large-scale computation and storage systems. Third, genomic databases are often owned by different organizations, and thus are not available for public usage. The cloud computing paradigm can be leveraged to facilitate the creation and sharing of big genomic databases for these applications. Genomic data owners can outsource their databases to a centralized cloud server to ease access to their data. However, data owners are reluctant to adopt this model, as it requires outsourcing the data to an untrusted cloud service provider that may cause data breaches. In this paper, we propose a privacy-preserving model for outsourcing genomic data to a cloud. The proposed model enables query processing while providing privacy protection of genomic databases. Privacy of the individuals is guaranteed by permuting the database and adding fake genomic records. These techniques allow the cloud to evaluate count and top-k queries securely and efficiently. Experimental results demonstrate that a count and a top-k query over 40 Single Nucleotide Polymorphisms (SNPs) in a database of 20,000 records take around 100 and 150 s, respectively.
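
    As an illustration of the privacy mechanism sketched in the abstract (permutation plus fake records), the following toy example, with made-up record and field names, shows a count query over a padded and shuffled SNP table; the real protocol filters fake records cryptographically rather than in the clear.

      # Toy sketch: shuffle the record order and pad the database with fake
      # records so the stored layout leaks little, while count queries over SNP
      # values still return correct results once the fake rows are discounted.
      # This is an illustration, not the paper's protocol.
      import random

      def outsource(records, n_fake, n_snps):
          fakes = [{"fake": True,
                    "snps": [random.choice("ACGT") for _ in range(n_snps)]}
                   for _ in range(n_fake)]
          real = [{"fake": False, "snps": list(r)} for r in records]
          db = real + fakes
          random.shuffle(db)          # permute record order
          return db

      def count_query(db, snp_index, allele):
          """Count matching real records; fake rows are filtered in the clear
          here for clarity, whereas a real protocol does this securely."""
          return sum(1 for row in db
                     if not row["fake"] and row["snps"][snp_index] == allele)

      db = outsource(["ACGT", "AAGT", "TCGT"], n_fake=5, n_snps=4)
      print(count_query(db, snp_index=0, allele="A"))   # 2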

  18. Private and Efficient Query Processing on Outsourced Genomic Databases

    PubMed Central

    Ghasemi, Reza; Al Aziz, Momin; Mohammed, Noman; Dehkordi, Massoud Hadian; Jiang, Xiaoqian

    2017-01-01

    Applications of genomic studies are spreading rapidly in many domains of science and technology, such as healthcare, biomedical research, direct-to-consumer services, and legal and forensic applications. However, there are a number of obstacles that make it hard to access and process a big genomic database for these applications. First, sequencing a genome is a time-consuming and expensive process. Second, processing genomic sequences requires large-scale computation and storage systems. Third, genomic databases are often owned by different organizations and thus are not available for public usage. The cloud computing paradigm can be leveraged to facilitate the creation and sharing of big genomic databases for these applications. Genomic data owners can outsource their databases to a centralized cloud server to ease access to their data. However, data owners are reluctant to adopt this model, as it requires outsourcing the data to an untrusted cloud service provider that may cause data breaches. In this paper, we propose a privacy-preserving model for outsourcing genomic data to a cloud. The proposed model enables query processing while providing privacy protection of genomic databases. Privacy of the individuals is guaranteed by permuting the database and adding fake genomic records. These techniques allow the cloud to evaluate count and top-k queries securely and efficiently. Experimental results demonstrate that a count and a top-k query over 40 SNPs in a database of 20,000 records take around 100 and 150 seconds, respectively. PMID:27834660

  19. Enabling virtual screening of potent and safer antimicrobial agents against noma: mtk-QSBER model for simultaneous prediction of antibacterial activities and ADMET properties.

    PubMed

    Speck-Planche, Alejandro; Cordeiro, M N D S

    2015-01-01

    Neglected diseases are infections that thrive mainly in underdeveloped countries, particularly those belonging to regions in Asia, Africa, and America. One of the most complex of these diseases is noma, a dangerous health condition characterized by a polymicrobial and opportunistic nature. The search for potent and safer antibacterial agents against this disease is therefore a goal of particular interest. Chemoinformatics can be used to rationalize the discovery of drug candidates, diminishing time and financial resources. However, in the case of noma, there is no in silico model available for use in the discovery of efficacious antibacterial agents. This work reports the first mtk-QSBER model, which integrates dissimilar kinds of chemical and biological data. The model was generated with the aim of simultaneously predicting activity against bacteria present in noma and ADMET (absorption, distribution, metabolism, elimination, toxicity) parameters. The mtk-QSBER model was constructed by employing a large and heterogeneous dataset of chemicals and displayed accuracies higher than 90% in both training and prediction sets. We confirmed the practical applicability of the model by predicting multiple profiles of the investigational antibacterial drug delafloxacin, and the predictions agreed with the experimental reports. To date, this is the first model focused on the virtual search for desirable anti-noma agents.

  20. Preparing Laboratory and Real-World EEG Data for Large-Scale Analysis: A Containerized Approach

    PubMed Central

    Bigdely-Shamlo, Nima; Makeig, Scott; Robbins, Kay A.

    2016-01-01

    Large-scale analysis of EEG and other physiological measures promises new insights into brain processes and more accurate and robust brain–computer interface models. However, the absence of standardized vocabularies for annotating events in a machine-understandable manner, the welter of collection-specific data organizations, the difficulty in moving data across processing platforms, and the unavailability of agreed-upon standards for preprocessing have prevented large-scale analyses of EEG. Here we describe a "containerized" approach and freely available tools we have developed to facilitate the process of annotating, packaging, and preprocessing EEG data collections to enable data sharing, archiving, large-scale machine learning/data mining and (meta-)analysis. The EEG Study Schema (ESS) comprises three data "Levels," each with its own XML-document schema and file/folder convention, plus a standardized (PREP) pipeline to move raw (Data Level 1) data to a basic preprocessed state (Data Level 2) suitable for application of a large class of EEG analysis methods. Researchers can ship a study as a single unit and operate on its data using a standardized interface. ESS does not require a central database and provides all the metadata necessary to execute a wide variety of EEG processing pipelines. The primary focus of ESS is automated in-depth analysis and meta-analysis of EEG studies. However, ESS can also encapsulate meta-information for other modalities, such as eye tracking, that are increasingly used in both laboratory and real-world neuroimaging. ESS schema and tools are freely available at www.eegstudy.org, and a central catalog of over 850 GB of existing data in ESS format is available at studycatalog.org. These tools and resources are part of a larger effort to enable data sharing at sufficient scale for researchers to engage in truly large-scale EEG analysis and data mining (BigEEG.org). PMID:27014048

  1. Detecting and characterizing high-frequency oscillations in epilepsy: a case study of big data analysis

    NASA Astrophysics Data System (ADS)

    Huang, Liang; Ni, Xuan; Ditto, William L.; Spano, Mark; Carney, Paul R.; Lai, Ying-Cheng

    2017-01-01

    We develop a framework to uncover and analyse dynamical anomalies from massive, nonlinear and non-stationary time series data. The framework consists of three steps: preprocessing of massive datasets to eliminate erroneous data segments, application of the empirical mode decomposition and Hilbert transform paradigm to obtain the fundamental components embedded in the time series at distinct time scales, and statistical/scaling analysis of the components. As a case study, we apply our framework to detecting and characterizing high-frequency oscillations (HFOs) from a big database of rat electroencephalogram recordings. We find a striking phenomenon: HFOs exhibit on-off intermittency that can be quantified by algebraic scaling laws. Our framework can be generalized to big data-related problems in other fields such as large-scale sensor data and seismic data analysis.
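
    As a toy illustration of the envelope-based detection step, the sketch below band-passes a synthetic signal (simple filtering stands in for empirical mode decomposition), takes the Hilbert-transform amplitude, and thresholds it to flag candidate HFO samples; all parameters are assumptions, not those of the study.

      # Hedged sketch: instantaneous amplitude of a narrow-band component via the
      # Hilbert transform, followed by simple thresholding to flag candidate
      # high-frequency-oscillation intervals.
      import numpy as np
      from scipy.signal import butter, filtfilt, hilbert

      fs = 2000.0                                   # sampling rate (Hz), assumed
      t = np.arange(0, 2.0, 1.0 / fs)
      eeg = np.random.randn(t.size)                 # stand-in for a recording
      eeg[1000:1200] += 5 * np.sin(2 * np.pi * 180 * t[1000:1200])  # injected burst

      b, a = butter(4, [100 / (fs / 2), 300 / (fs / 2)], btype="band")
      component = filtfilt(b, a, eeg)               # 100-300 Hz component
      envelope = np.abs(hilbert(component))         # instantaneous amplitude

      threshold = envelope.mean() + 3 * envelope.std()
      candidates = np.where(envelope > threshold)[0]
      print("candidate HFO samples:", candidates.size)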

  2. Part 2 of a Computational Study of a Drop-Laden Mixing Layer

    NASA Technical Reports Server (NTRS)

    Okongo, Nora; Bellan, Josette

    2004-01-01

    This second of three reports on a computational study of a mixing layer laden with evaporating liquid drops presents the evaluation of Large Eddy Simulation (LES) models. The LES models were evaluated on an existing database that had been generated using Direct Numerical Simulation (DNS). The DNS method and the database are described in the first report of this series, Part 1 of a Computational Study of a Drop-Laden Mixing Layer (NPO-30719), NASA Tech Briefs, Vol. 28, No. 7 (July 2004), page 59. The LES equations, which are derived by applying a spatial filter to the DNS set, govern the evolution of the larger scales of the flow and can therefore be solved on a coarser grid. Consistent with the reduction in grid points, the DNS drops would be represented by fewer drops, called computational drops in the LES context. The LES equations contain terms that cannot be directly computed on the coarser grid and that must instead be modeled. Two types of models are necessary: (1) those for the filtered source terms representing the effects of drops on the filtered flow field and (2) those for the sub-grid-scale (SGS) fluxes arising from filtering the convective terms in the DNS equations. All of the filtered-source-term models that were developed were found to overestimate the filtered source terms. For modeling the SGS fluxes, constant-coefficient Smagorinsky, gradient, and scale-similarity models were assessed and calibrated on the DNS database. The Smagorinsky model correlated poorly with the SGS fluxes, whereas the gradient and scale-similarity models were well correlated with the SGS quantities that they represented.
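
    The report names three constant-coefficient SGS models but does not reproduce their equations; for reference, the standard forms (notation assumed here), written for the SGS stress \tau_{ij} = \overline{u_i u_j} - \bar{u}_i \bar{u}_j with analogous expressions for the scalar and drop-related SGS fluxes, are:

      \tau_{ij} - \tfrac{1}{3}\delta_{ij}\tau_{kk} = -2\,(C_S \bar{\Delta})^2 \, |\bar{S}| \, \bar{S}_{ij}, \qquad |\bar{S}| = (2\,\bar{S}_{mn}\bar{S}_{mn})^{1/2}        (Smagorinsky)

      \tau_{ij} \approx C_G \, \frac{\bar{\Delta}^2}{12} \, \frac{\partial \bar{u}_i}{\partial x_k} \frac{\partial \bar{u}_j}{\partial x_k}        (gradient)

      \tau_{ij} \approx C_{SS} \left( \widehat{\bar{u}_i \bar{u}_j} - \hat{\bar{u}}_i \, \hat{\bar{u}}_j \right)        (scale similarity)

    Here C_S, C_G and C_{SS} are the constant coefficients calibrated on the DNS database, \bar{\Delta} is the filter width, and the hat denotes a test filter coarser than \bar{\Delta}.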

  3. Including the Group Quarters Population in the US Synthesized Population Database

    PubMed Central

    Chasteen, Bernadette M.; Wheaton, William D.; Cooley, Philip C.; Ganapathi, Laxminarayana; Wagener, Diane K.

    2011-01-01

    In 2005, RTI International researchers developed methods to generate synthesized population data on US households for the US Synthesized Population Database. These data are used in agent-based modeling, which simulates large-scale social networks to test how changes in the behaviors of individuals affect the overall network. Group quarters are residences where individuals live in close proximity and interact frequently. Although the Synthesized Population Database represents the population living in households, data for the nation’s group quarters residents are not easily quantified because of US Census Bureau reporting methods designed to protect individuals’ privacy. Including group quarters population data can be an important factor in agent-based modeling because the number of residents and the frequency of their interactions are variables that directly affect modeling results. Particularly with infectious disease modeling, the increased frequency of agent interaction may increase the probability of infectious disease transmission between individuals and the probability of disease outbreaks. This report reviews our methods to synthesize data on group quarters residents to match US Census Bureau data. Our goal in developing the Group Quarters Population Database was to enable its use with RTI’s US Synthesized Population Database in the Modeling of Infectious Diseases Agent Study. PMID:21841972

  4. A DBMS architecture for global change research

    NASA Astrophysics Data System (ADS)

    Hachem, Nabil I.; Gennert, Michael A.; Ward, Matthew O.

    1993-08-01

    The goal of this research is the design and development of an integrated system for the management of very large scientific databases, cartographic/geographic information processing, and exploratory scientific data analysis for global change research. The system will represent both spatial and temporal knowledge about natural and man-made entities on the earth's surface, following an object-oriented paradigm. A user will be able to derive, modify, and apply procedures to perform operations on the data, including comparison, derivation, prediction, validation, and visualization. This work represents an effort to extend database technology with an intrinsic class of operators that is extensible and responds to the growing needs of scientific research. Of significance is the integration of many diverse forms of data into the database, including cartography, geography, hydrography, hypsography, images, and urban planning data. Equally important is the maintenance of metadata, that is, data about the data, such as coordinate transformation parameters, map scales, and audit trails of previous processing operations. This project will impact the fields of geographical information systems and global change research as well as the database community. It will provide an integrated database management testbed for scientific research, and a testbed for the development of analysis tools to understand and predict global change.

  5. ChlamyCyc: an integrative systems biology database and web-portal for Chlamydomonas reinhardtii.

    PubMed

    May, Patrick; Christian, Jan-Ole; Kempa, Stefan; Walther, Dirk

    2009-05-04

    The unicellular green alga Chlamydomonas reinhardtii is an important eukaryotic model organism for the study of photosynthesis and plant growth. In the era of modern high-throughput technologies there is an imperative need to integrate large-scale data sets from high-throughput experimental techniques using computational methods and database resources to provide comprehensive information about the molecular and cellular organization of a single organism. In the framework of the German Systems Biology initiative GoFORSYS, a pathway database and web-portal for Chlamydomonas (ChlamyCyc) was established, which currently features about 250 metabolic pathways with associated genes, enzymes, and compound information. ChlamyCyc was assembled using an integrative approach combining the recently published genome sequence, bioinformatics methods, and experimental data from metabolomics and proteomics experiments. We analyzed and integrated a combination of primary and secondary database resources, such as existing genome annotations from JGI, EST collections, orthology information, and MapMan classification. ChlamyCyc provides a curated and integrated systems biology repository that will enable and assist in systematic studies of fundamental cellular processes in Chlamydomonas. The ChlamyCyc database and web-portal is freely available under http://chlamycyc.mpimp-golm.mpg.de.

  6. Mining the Mind Research Network: A Novel Framework for Exploring Large Scale, Heterogeneous Translational Neuroscience Research Data Sources

    PubMed Central

    Bockholt, Henry J.; Scully, Mark; Courtney, William; Rachakonda, Srinivas; Scott, Adam; Caprihan, Arvind; Fries, Jill; Kalyanam, Ravi; Segall, Judith M.; de la Garza, Raul; Lane, Susan; Calhoun, Vince D.

    2009-01-01

    A neuroinformatics (NI) system is critical to brain imaging research in order to shorten the time between study conception and results. Such an NI system is required to scale well when large numbers of subjects are studied. Further, when multiple sites participate in research projects, organizational issues become increasingly difficult. Optimized NI applications mitigate these problems. Additionally, NI software enables coordination across multiple studies, leveraging advantages that can potentially lead to exponential research discoveries. The web-based Mind Research Network (MRN) database system has been designed and improved through our experience with 200 research studies and 250 researchers from seven different institutions. The MRN tools permit the collection, management, reporting and efficient use of large-scale, heterogeneous data sources, e.g., multiple institutions, multiple principal investigators, multiple research programs and studies, and multimodal acquisitions. We have collected and analyzed data sets on thousands of research participants and have set up a framework to automatically analyze the data, thereby making efficient, practical data mining of this vast resource possible. This paper presents a comprehensive framework for capturing and analyzing heterogeneous neuroscience research data sources that has been fully optimized for end-users to perform novel data mining. PMID:20461147

  7. A Methodology for Integrated, Multiregional Life Cycle Assessment Scenarios under Large-Scale Technological Change.

    PubMed

    Gibon, Thomas; Wood, Richard; Arvesen, Anders; Bergesen, Joseph D; Suh, Sangwon; Hertwich, Edgar G

    2015-09-15

    Climate change mitigation demands large-scale technological change on a global level and, if successfully implemented, will significantly affect how products and services are produced and consumed. In order to anticipate the life cycle environmental impacts of products under climate mitigation scenarios, we present the modeling framework of an integrated hybrid life cycle assessment model covering nine world regions. Life cycle assessment databases and multiregional input-output tables are adapted using forecasted changes in technology and resources up to 2050 under a 2 °C scenario. We call the result of this modeling "technology hybridized environmental-economic model with integrated scenarios" (THEMIS). As a case study, we apply THEMIS in an integrated environmental assessment of concentrating solar power. Life-cycle greenhouse gas emissions for this plant range from 33 to 95 g CO2 eq./kWh across different world regions in 2010, falling to 30-87 g CO2 eq./kWh in 2050. Using regional life cycle data yields insightful results. More generally, these results also highlight the need for systematic life cycle frameworks that capture the actual consequences and feedback effects of large-scale policies in the long term.

  8. Large-scale Exploration of Neuronal Morphologies Using Deep Learning and Augmented Reality.

    PubMed

    Li, Zhongyu; Butler, Erik; Li, Kang; Lu, Aidong; Ji, Shuiwang; Zhang, Shaoting

    2018-02-12

    Recently released large-scale neuron morphological data has greatly facilitated research in neuroinformatics. However, the sheer volume and complexity of these data pose significant challenges for efficient and accurate neuron exploration. In this paper, we propose an effective retrieval framework to address these problems, based on frontier techniques of deep learning and binary coding. For the first time, we develop a deep learning based feature representation method for the neuron morphological data, where the 3D neurons are first projected into binary images and features are then learned using an unsupervised deep neural network, i.e., stacked convolutional autoencoders (SCAEs). The deep features are subsequently fused with hand-crafted features for a more accurate representation. Considering that exhaustive search is usually very time-consuming in large-scale databases, we employ a novel binary coding method to compress feature vectors into short binary codes. Our framework is validated on a public data set including 58,000 neurons, showing promising retrieval precision and efficiency compared with state-of-the-art methods. In addition, we develop a novel neuron visualization program based on the techniques of augmented reality (AR), which can help users explore neuron morphologies in an interactive and immersive manner.
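
    The paper's learned binary coding is not reproduced here; the sketch below uses random hyperplane projections as a simple stand-in to show how short binary codes and Hamming distance replace exhaustive search over real-valued features.

      # Hedged sketch of binary-coding retrieval: compress feature vectors into
      # short binary codes and rank database items by Hamming distance to the
      # query code. Feature values are simulated placeholders.
      import numpy as np

      rng = np.random.default_rng(0)
      n_db, dim, n_bits = 1000, 128, 32

      features = rng.normal(size=(n_db, dim))       # stand-in neuron features
      hyperplanes = rng.normal(size=(dim, n_bits))

      def encode(x):
          """Binary code: sign of the projection onto each random hyperplane."""
          return (x @ hyperplanes > 0).astype(np.uint8)

      codes = encode(features)
      query_code = encode(features[42])             # query with a known item

      hamming = np.count_nonzero(codes != query_code, axis=1)
      top_k = np.argsort(hamming)[:5]
      print(top_k)                                  # item 42 should rank first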

  9. Seeing is believing: on the use of image databases for visually exploring plant organelle dynamics.

    PubMed

    Mano, Shoji; Miwa, Tomoki; Nishikawa, Shuh-ichi; Mimura, Tetsuro; Nishimura, Mikio

    2009-12-01

    Organelle dynamics vary dramatically depending on cell type, developmental stage and environmental stimuli, so that various parameters, such as size, number and behavior, are required for the description of the dynamics of each organelle. Imaging techniques are superior to other techniques for describing organelle dynamics because these parameters are visually exhibited. Therefore, as the results can be seen immediately, investigators can more easily grasp organelle dynamics. At present, imaging techniques are emerging as fundamental tools in plant organelle research, and the development of new methodologies to visualize organelles and the improvement of analytical tools and equipment have allowed the large-scale generation of image and movie data. Accordingly, image databases that accumulate information on organelle dynamics are an increasingly indispensable part of modern plant organelle research. In addition, image databases are potentially rich data sources for computational analyses, as image and movie data reposited in the databases contain valuable and significant information, such as size, number, length and velocity. Computational analytical tools support image-based data mining, such as segmentation, quantification and statistical analyses, to extract biologically meaningful information from each database and combine them to construct models. In this review, we outline the image databases that are dedicated to plant organelle research and present their potential as resources for image-based computational analyses.

  10. Source attribution using FLEXPART and carbon monoxide emission inventories for the IAGOS In-situ Observation database

    NASA Astrophysics Data System (ADS)

    Fontaine, Alain; Sauvage, Bastien; Pétetin, Hervé; Auby, Antoine; Boulanger, Damien; Thouret, Valerie

    2016-04-01

    Since 1994, the IAGOS program (In-Service Aircraft for a Global Observing System, http://www.iagos.org) and its predecessor MOZAIC have produced in-situ measurements of the atmospheric composition during more than 46000 commercial aircraft flights. In order to help analyze these observations and further understand the processes driving their evolution, we developed a modelling tool, SOFT-IO, that quantifies their source/receptor link. We improved the methodology used by Stohl et al. (2003), based on the FLEXPART plume dispersion model, to simulate the contributions of anthropogenic and biomass burning emissions from the ECCAD database (http://eccad.aeris-data.fr) to the measured carbon monoxide mixing ratio along each IAGOS flight. Thanks to automated processes, contributions are simulated for the last 20 days before observation, separating individual contributions from the different source regions. The main goal is to supply value-added products to the IAGOS database showing the geographical origin and emission type of pollutants. Using this information, it may be possible to link trends in the atmospheric composition to changes in the transport pathways and to the evolution of emissions. This tool could be used for statistical validation as well as for inter-comparisons of emission inventories using large amounts of data, as Lagrangian models are able to bring the global-scale emissions down to a smaller scale, where they can be directly compared to the in-situ observations from the IAGOS database.

  11. Studying the Sky/Planets Can Drown You in Images: Machine Learning Solutions at JPL/Caltech

    NASA Technical Reports Server (NTRS)

    Fayyad, U. M.

    1995-01-01

    JPL is working to develop a domain-independent system capable of small-scale object recognition in large image databases for science analysis. Two applications discussed are the cataloging of three billion sky objects in the Sky Image Cataloging and Analysis Tool (SKICAT) and the detection of possibly one million small volcanoes visible in the Magellan synthetic aperture radar images of Venus (JPL Adaptive Recognition Tool, JARTool).

  12. Modeling and Databases for Teaching Petrology

    NASA Astrophysics Data System (ADS)

    Asher, P.; Dutrow, B.

    2003-12-01

    With the widespread availability of high-speed computers with massive storage and the ability to readily transport large amounts of data, computational and petrologic modeling and the use of databases provide new tools with which to teach petrology. Modeling can be used to gain insights into a system, predict system behavior, describe a system's processes, compare with a natural system, or simply to be illustrative. These aspects result from data-driven or empirical, analytical, or numerical models, or the concurrent examination of multiple lines of evidence. At the same time, the use of models can enhance core foundations of the geosciences by improving critical thinking skills and by reinforcing prior knowledge. However, the use of modeling to teach petrology is dictated by the level of expectation we have for students and their facility with modeling approaches. For example, do we expect students to push buttons and navigate a program, understand the conceptual model, and/or evaluate the results of a model? Whatever the desired level of sophistication, specific elements of design should be incorporated into a modeling exercise for effective teaching. These include, but are not limited to: use of the scientific method, use of prior knowledge, a clear statement of purpose and goals, attainable goals, a connection to the natural/actual system, a demonstration that complex heterogeneous natural systems are amenable to analysis by these techniques and, ideally, connections to other disciplines and the larger earth system. Databases offer another avenue with which to explore petrology. Large datasets are available that allow integration of multiple lines of evidence to attack a petrologic problem or understand a petrologic process. These are collected into a database that offers a tool for exploring, organizing and analyzing the data. For example, datasets may be geochemical, mineralogic, experimental and/or visual in nature, covering global, regional to local scales. These datasets provide students with access to large amounts of related data through space and time. Goals of the database working group include educating earth scientists about information systems in general, about the importance of metadata, about ways of using databases and datasets as educational tools, and about the availability of existing datasets and databases. The modeling and databases groups hope to create additional petrologic teaching tools using these aspects and invite the community to contribute to the effort.

  13. Subgrid-scale scalar flux modelling based on optimal estimation theory and machine-learning procedures

    NASA Astrophysics Data System (ADS)

    Vollant, A.; Balarac, G.; Corre, C.

    2017-09-01

    New procedures are explored for the development of models in the context of large eddy simulation (LES) of a passive scalar. They rely on the combination of optimal estimator theory with machine-learning algorithms. The concept of the optimal estimator allows one to identify the most accurate set of parameters to be used when deriving a model. The model itself can then be defined by training an artificial neural network (ANN) on a database derived from the filtering of direct numerical simulation (DNS) results. This procedure leads to a subgrid-scale model displaying good structural performance, which allows LESs to be performed very close to the filtered DNS results. However, this first procedure does not control the functional performance, so the model can fail when the flow configuration differs from the training database. Another procedure is then proposed, where the model functional form is imposed and the ANN is used only to define the model coefficients. The training step is a bi-objective optimisation in order to control both structural and functional performances. The model derived from this second procedure proves to be more robust. It also provides stable LESs for a turbulent plane jet flow configuration very far from the training database but over-estimates the mixing process in that case.
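
    As a hedged illustration of the second procedure (ANN-predicted coefficients for an imposed model form), the sketch below fits a small multilayer perceptron to synthetic features; the actual inputs and targets in the paper come from the optimal-estimator analysis of filtered DNS fields.

      # Sketch: an artificial neural network mapped from (placeholder) filtered
      # flow features to a single model coefficient per sample. Features, targets
      # and network size are assumptions, not the paper's setup.
      import numpy as np
      from sklearn.neural_network import MLPRegressor
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(1)
      X = rng.normal(size=(5000, 3))        # e.g. filtered gradients / invariants
      coeff = 0.1 + 0.05 * np.tanh(X[:, 0] * X[:, 1]) + 0.01 * rng.normal(size=5000)

      X_tr, X_te, y_tr, y_te = train_test_split(X, coeff, random_state=0)
      ann = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
      ann.fit(X_tr, y_tr)
      print("held-out R^2:", round(ann.score(X_te, y_te), 3))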

  14. Composing Data Parallel Code for a SPARQL Graph Engine

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Castellana, Vito G.; Tumeo, Antonino; Villa, Oreste

    Big data analytics process large amounts of data to extract knowledge from them. Semantic databases are big data applications that adopt the Resource Description Framework (RDF) to structure metadata through a graph-based representation. The graph-based representation provides several benefits, such as the possibility to perform in-memory processing with large amounts of parallelism. SPARQL is a language used to perform queries on RDF-structured data through graph matching. In this paper we present a tool that automatically translates SPARQL queries to parallel graph crawling and graph matching operations. The tool also supports complex SPARQL constructs, which require more than basic graph matching for their implementation. The tool generates parallel code annotated with OpenMP pragmas for x86 shared-memory multiprocessors (SMPs). With respect to commercial database systems such as Virtuoso, our approach reduces memory occupation due to join operations and provides higher performance. We show the scaling of the automatically generated graph-matching code on a 48-core SMP.
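
    For readers unfamiliar with SPARQL, the snippet below shows the kind of basic graph pattern such a translator has to turn into graph-crawling and matching operations; rdflib is used here only to illustrate the query semantics, not the paper's OpenMP code generator.

      # Illustration of a SPARQL basic graph pattern evaluated over a tiny RDF
      # graph; the triple data and prefixes are made up for this example.
      import rdflib

      ttl = """
      @prefix ex: <http://example.org/> .
      ex:alice ex:knows ex:bob .
      ex:bob   ex:knows ex:carol .
      ex:alice ex:worksAt ex:acme .
      ex:carol ex:worksAt ex:acme .
      """

      g = rdflib.Graph()
      g.parse(data=ttl, format="turtle")

      query = """
      PREFIX ex: <http://example.org/>
      SELECT ?a ?c WHERE {
          ?a ex:knows   ?b .
          ?b ex:knows   ?c .
          ?c ex:worksAt ex:acme .
      }
      """
      for a, c in g.query(query):
          print(a, c)    # matches the two-hop pattern alice -> bob -> carol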

  15. A MySQL Based EPICS Archiver

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Christopher Slominski

    2009-10-01

    Archiving a large fraction of the EPICS signals within the Jefferson Lab (JLAB) Accelerator control system is vital for postmortem and real-time analysis of the accelerator performance. This analysis is performed on a daily basis by scientists, operators, engineers, technicians, and software developers. Archiving poses unique challenges due to the magnitude of the control system. A MySQL Archiving system (Mya) was developed to scale to the needs of the control system; it currently archives 58,000 EPICS variables, updating at a rate of 11,000 events per second. In addition to the large collection rate, retrieval of the archived data must also be fast and robust. Archived data retrieval clients obtain data at a rate of over 100,000 data points per second. Managing the data in a relational database provides a number of benefits. This paper describes an archiving solution that uses an open source database and standard off-the-shelf hardware to reach high performance archiving needs. Mya has been in production at Jefferson Lab since February of 2007.

  16. Using the Saccharomyces Genome Database (SGD) for analysis of genomic information

    PubMed Central

    Skrzypek, Marek S.; Hirschman, Jodi

    2011-01-01

    Analysis of genomic data requires access to software tools that place the sequence-derived information in the context of biology. The Saccharomyces Genome Database (SGD) integrates functional information about budding yeast genes and their products with a set of analysis tools that facilitate exploring their biological details. This unit describes how the various types of functional data available at SGD can be searched, retrieved, and analyzed. Starting with the guided tour of the SGD Home page and Locus Summary page, this unit highlights how to retrieve data using YeastMine, how to visualize genomic information with GBrowse, how to explore gene expression patterns with SPELL, and how to use Gene Ontology tools to characterize large-scale datasets. PMID:21901739

  17. Computing Properties of Hadrons, Nuclei and Nuclear Matter from Quantum Chromodynamics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Savage, Martin J.

    This project was part of a coordinated software development effort which the nuclear physics lattice QCD community pursues in order to ensure that lattice calculations can make optimal use of present and forthcoming leadership-class and dedicated hardware, including those of the national laboratories, and prepares for the exploitation of future computational resources in the exascale era. The UW team improved and extended software libraries used in lattice QCD calculations related to multi-nucleon systems, enhanced production running codes related to load balancing multi-nucleon production on large-scale computing platforms, developed SQLite (addressable database) interfaces to efficiently archive and analyze multi-nucleon data, and developed a Mathematica interface for the SQLite databases.
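
    A minimal sketch of what an SQLite archive-and-analyze interface for correlator-style data might look like follows; the table layout and column names are hypothetical, not the project's actual schema.

      # Hedged sketch of an SQLite-backed archive in the spirit of the interfaces
      # mentioned above; values and schema are placeholders (use a file path
      # instead of ":memory:" for a persistent archive).
      import sqlite3

      con = sqlite3.connect(":memory:")
      con.execute("""CREATE TABLE IF NOT EXISTS correlators (
                         ensemble   TEXT,
                         source     TEXT,
                         timeslice  INTEGER,
                         re_value   REAL,
                         im_value   REAL)""")

      # archive one set of measurements
      rows = [("a09m310", "NN_1S0", t, 1.0e-3 / (t + 1), 0.0) for t in range(16)]
      con.executemany("INSERT INTO correlators VALUES (?, ?, ?, ?, ?)", rows)
      con.commit()

      # analyze: average the real part over sources for each timeslice
      for t, avg in con.execute("""SELECT timeslice, AVG(re_value)
                                   FROM correlators
                                   WHERE ensemble = 'a09m310'
                                   GROUP BY timeslice ORDER BY timeslice"""):
          print(t, avg)
      con.close()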

  18. Performance evaluation of redundant disk array support for transaction recovery

    NASA Technical Reports Server (NTRS)

    Mourad, Antoine N.; Fuchs, W. Kent; Saab, Daniel G.

    1991-01-01

    Redundant disk arrays provide a way of achieving rapid recovery from media failures with a relatively low storage cost for large scale data systems requiring high availability. Here, we propose a method for using redundant disk arrays to support rapid recovery from system crashes and transaction aborts in addition to their role in providing media failure recovery. A twin page scheme is used to store the parity information in the array so that the time for transaction commit processing is not degraded. Using an analytical model, we show that the proposed method achieves a significant increase in the throughput of database systems using redundant disk arrays by reducing the number of recovery operations needed to maintain the consistency of the database.
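
    The media-failure recovery that redundant disk arrays provide rests on block parity; the toy example below shows parity reconstruction only and does not depict the paper's twin-page scheme for transaction recovery.

      # Minimal illustration of array parity: the parity block is the XOR of the
      # data blocks, so any single lost block can be rebuilt from the survivors.
      from functools import reduce

      def parity(blocks):
          return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

      data = [b"AAAA", b"BBBB", b"CCCC"]       # three data blocks on three disks
      p = parity(data)                         # parity block on a fourth disk

      lost_index = 1                           # pretend disk 1 fails
      survivors = [blk for i, blk in enumerate(data) if i != lost_index] + [p]
      recovered = parity(survivors)
      assert recovered == data[lost_index]
      print(recovered)                         # b'BBBB'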

  19. [Effects of soil data and map scale on assessment of total phosphorus storage in upland soils].

    PubMed

    Li, Heng Rong; Zhang, Li Ming; Li, Xiao di; Yu, Dong Sheng; Shi, Xue Zheng; Xing, Shi He; Chen, Han Yue

    2016-06-01

    Accurate assessment of total phosphorus storage in farmland soils is of great significance to sustainable agriculture and non-point source pollution control. However, previous studies have not considered the estimation errors arising from mapping scales and from databases built on different sources of soil profile data. In this study, a total of 393×10⁴ hm² of upland in the 29 counties (or cities) of North Jiangsu was taken as a case study. We analyzed how the four sources of soil profile data, namely "Soils of County", "Soils of Prefecture", "Soils of Province" and "Soils of China", and the six map scales, i.e. 1:50000, 1:250000, 1:500000, 1:1000000, 1:4000000 and 1:10000000, used in the 24 soil databases established from the four soil journals, affected the assessment of soil total phosphorus. Compared with the most detailed 1:50000 soil database, established with 983 upland soil profiles, the relative deviation of the estimates of soil total phosphorus density (STPD) and soil total phosphorus storage (STPS) from the other soil databases varied from 4.8% to 48.9% and from 1.6% to 48.4%, respectively. The estimates of STPD and STPS based on the 1:50000 database of "Soils of County" differed from most of the estimates based on the databases of each scale in "Soils of County" and "Soils of Prefecture", at significance levels of P<0.001 or P<0.05. Extremely significant differences (P<0.001) existed between the estimates based on the 1:50000 database of "Soils of County" and those based on the databases of each scale in "Soils of Province" and "Soils of China". This study demonstrates the importance of appropriate soil data sources and appropriate mapping scales in estimating STPS.
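
    The abstract does not give the estimation formulas; a standard accounting of the kind implied, with symbols assumed here, is:

      \mathrm{STPD}_i = \sum_{k=1}^{n_i} \rho_{b,ik}\, c_{ik}\, h_{ik}, \qquad \mathrm{STPS} = \sum_i A_i\, \mathrm{STPD}_i,

    where \rho_{b,ik}, c_{ik} and h_{ik} are the bulk density, total-phosphorus content and thickness of layer k in map unit i, and A_i is the unit area. The relative deviations quoted above then follow as \mathrm{RD} = |E - E_{1:50000}| / E_{1:50000} \times 100\%, with E_{1:50000} the estimate from the reference 1:50000 database.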

  20. MEGALEX: A megastudy of visual and auditory word recognition.

    PubMed

    Ferrand, Ludovic; Méot, Alain; Spinelli, Elsa; New, Boris; Pallier, Christophe; Bonin, Patrick; Dufau, Stéphane; Mathôt, Sebastiaan; Grainger, Jonathan

    2018-06-01

    Using the megastudy approach, we report a new database (MEGALEX) of visual and auditory lexical decision times and accuracy rates for tens of thousands of words. We collected visual lexical decision data for 28,466 French words and the same number of pseudowords, and auditory lexical decision data for 17,876 French words and the same number of pseudowords (synthesized tokens were used for the auditory modality). This constitutes the first large-scale database for auditory lexical decision, and the first database to enable a direct comparison of word recognition in different modalities. Different regression analyses were conducted to illustrate potential ways to exploit this megastudy database. First, we compared the proportions of variance accounted for by five word frequency measures. Second, we conducted item-level regression analyses to examine the relative importance of the lexical variables influencing performance in the different modalities (visual and auditory). Finally, we compared the similarities and differences between the two modalities. All data are freely available on our website (https://sedufau.shinyapps.io/megalex/) and are searchable at www.lexique.org, inside the Open Lexique search engine.
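
    As an illustration of the item-level regressions mentioned above, the sketch below fits an ordinary least-squares model of per-word response times on frequency and length using simulated data; with the real database the predictors would come from the MEGALEX/Lexique tables.

      # Hedged sketch of an item-level regression: per-word lexical decision
      # latencies regressed on word frequency and length. The data frame is
      # simulated; variable names are illustrative.
      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(0)
      n = 2000
      items = pd.DataFrame({
          "log_freq": rng.normal(2.0, 1.0, n),
          "length":   rng.integers(3, 12, n),
      })
      items["rt"] = (900 - 40 * items["log_freq"] + 15 * items["length"]
                     + rng.normal(0, 60, n))        # simulated mean RT per item

      model = smf.ols("rt ~ log_freq + length", data=items).fit()
      print(model.params)                           # frequency and length effects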
