Wildenhain, Jan; Spitzer, Michaela; Dolma, Sonam; Jarvik, Nick; White, Rachel; Roy, Marcia; Griffiths, Emma; Bellows, David S.; Wright, Gerard D.; Tyers, Mike
2016-01-01
The network structure of biological systems suggests that effective therapeutic intervention may require combinations of agents that act synergistically. However, a dearth of systematic chemical combination datasets has limited the development of predictive algorithms for chemical synergism. Here, we report two large datasets of linked chemical-genetic and chemical-chemical interactions in the budding yeast Saccharomyces cerevisiae. We screened 5,518 unique compounds against 242 diverse yeast gene deletion strains to generate an extended chemical-genetic matrix (CGM) of 492,126 chemical-gene interaction measurements. This CGM dataset contained 1,434 genotype-specific inhibitors, termed cryptagens. We selected 128 structurally diverse cryptagens and tested all pairwise combinations to generate a benchmark dataset of 8,128 pairwise chemical-chemical interaction tests for synergy prediction, termed the cryptagen matrix (CM). An accompanying database resource called ChemGRID was developed to enable analysis, visualisation and downloads of all data. The CGM and CM datasets will facilitate the benchmarking of computational approaches for synergy prediction, as well as chemical structure-activity relationship models for anti-fungal drug discovery. PMID:27874849
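The size of the cryptagen matrix follows from counting unordered pairs: 128 compounds give 128 × 127 / 2 = 8,128 combinations. A minimal Python sketch of that enumeration is shown here; the compound labels are placeholders invented for illustration and are not identifiers from ChemGRID.

```python
from itertools import combinations
from math import comb

N_CRYPTAGENS = 128  # number of cryptagens selected, as stated in the abstract

# Placeholder labels (hypothetical); the real compound identifiers are not given here.
compounds = [f"cryptagen_{i:03d}" for i in range(1, N_CRYPTAGENS + 1)]

# Each unordered pair of distinct compounds is tested once.
pairs = list(combinations(compounds, 2))

# The enumeration agrees with the closed-form count n * (n - 1) / 2.
assert len(pairs) == comb(N_CRYPTAGENS, 2)
print(len(pairs))  # 8128
```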
Benchmarking protein classification algorithms via supervised cross-validation.
Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor
2008-04-24
Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
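The idea of supervised cross-validation can be sketched roughly as follows: for each group in a two-level hierarchy, one known subtype is held out as the positive test set and the remaining subtypes of the same group form the positive training set. This is only an illustration under stated assumptions. The record layout, the function name `supervised_splits` and the random 50/50 split of negatives are hypothetical choices and are not taken from the benchmark collection itself.

```python
import random
from collections import defaultdict

def supervised_splits(records, seed=0):
    """Yield one classification task per (group, held-out subtype).

    records: list of (item_id, group, subtype) tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    by_group = defaultdict(lambda: defaultdict(list))
    for item_id, group, subtype in records:
        by_group[group][subtype].append(item_id)

    for group, subtypes in by_group.items():
        if len(subtypes) < 2:
            continue  # at least one subtype must remain for training
        # Negatives are all members of the other groups.
        negatives = [i for i, g, _ in records if g != group]
        for held_out, members in subtypes.items():
            # Positive training items come from the other subtypes of this group.
            train_pos = [i for s, ms in subtypes.items() if s != held_out for i in ms]
            neg = negatives[:]
            rng.shuffle(neg)  # assumption: negatives are split at random
            half = len(neg) // 2
            yield {
                "group": group,
                "held_out": held_out,
                "train_pos": train_pos,
                "test_pos": list(members),
                "train_neg": neg[:half],
                "test_neg": neg[half:],
            }

# Toy data: groups A and B have two subtypes each, group C has only one.
toy_records = [
    ("p1", "A", "A1"), ("p2", "A", "A1"), ("p3", "A", "A2"),
    ("p4", "B", "B1"), ("p5", "B", "B2"), ("p6", "C", "C1"),
]
tasks = list(supervised_splits(toy_records))
print(len(tasks))  # 4
```

Because the held-out subtype never contributes training positives, each task measures generalization to a subtype the classifier has not seen, which is the property that random k-fold splitting does not guarantee.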
The ISPRS Benchmark on Indoor Modelling
NASA Astrophysics Data System (ADS)
Khoshelham, K.; Díaz Vilariño, L.; Peter, M.; Kang, Z.; Acharya, D.
2017-09-01
Automated generation of 3D indoor models from point cloud data has been a topic of intensive research in recent years. While results on various datasets have been reported in literature, a comparison of the performance of different methods has not been possible due to the lack of benchmark datasets and a common evaluation framework. The ISPRS benchmark on indoor modelling aims to address this issue by providing a public benchmark dataset and an evaluation framework for performance comparison of indoor modelling methods. In this paper, we present the benchmark dataset comprising several point clouds of indoor environments captured by different sensors. We also discuss the evaluation and comparison of indoor modelling methods based on manually created reference models and appropriate quality evaluation criteria. The benchmark dataset is available for download at: http://www2.isprs.org/commissions/comm4/wg5/benchmark-on-indoor-modelling.html.
A Benchmark for Endoluminal Scene Segmentation of Colonoscopy Images.
Vázquez, David; Bernal, Jorge; Sánchez, F Javier; Fernández-Esparrach, Gloria; López, Antonio M; Romero, Adriana; Drozdzal, Michal; Courville, Aaron
2017-01-01
Colorectal cancer (CRC) is the third leading cause of cancer death worldwide. Currently, the standard approach to reduce CRC-related mortality is to perform regular screening in search of polyps, and colonoscopy is the screening tool of choice. The main limitations of this screening procedure are polyp miss rate and the inability to perform visual assessment of polyp malignancy. These drawbacks can be reduced by designing decision support systems (DSS) aiming to help clinicians in the different stages of the procedure by providing endoluminal scene segmentation. Thus, in this paper, we introduce an extended benchmark of colonoscopy image segmentation, with the hope of establishing a new strong benchmark for colonoscopy image analysis research. The proposed dataset consists of 4 relevant classes to inspect the endoluminal scene, targeting different clinical needs. Together with the dataset and taking advantage of advances in semantic segmentation literature, we provide new baselines by training standard fully convolutional networks (FCNs). We perform a comparative study to show that FCNs significantly outperform, without any further postprocessing, prior results in endoluminal scene segmentation, especially with respect to polyp segmentation and localization.
Benchmark Dataset for Whole Genome Sequence Compression.
C L, Biji; S Nair, Achuthsankar
2017-01-01
Research in DNA data compression lacks a standard dataset for testing compression tools specific to DNA. This paper argues that the current state of achievement in DNA compression cannot be benchmarked in the absence of such a scientifically compiled whole genome sequence dataset, and proposes a benchmark dataset built using a multistage sampling procedure. Considering the genome sequences of organisms available in the National Center for Biotechnology Information (NCBI) as the universe, the proposed dataset selects 1,105 prokaryotes, 200 plasmids, 164 viruses, and 65 eukaryotes. This paper reports the results of using three established tools on the newly compiled dataset and shows that their strengths and weaknesses are evident only with a comparison based on the scientifically compiled benchmark dataset. The sample dataset and the respective links are available at https://sourceforge.net/projects/benchmarkdnacompressiondataset/.
Gururaj, Anupama E.; Chen, Xiaoling; Pournejati, Saeid; Alter, George; Hersh, William R.; Demner-Fushman, Dina; Ohno-Machado, Lucila
2017-01-01
The rapid proliferation of publicly available biomedical datasets has provided abundant resources that are potentially of value as a means to reproduce prior experiments, and to generate and explore novel hypotheses. However, there are a number of barriers to the re-use of such datasets, which are distributed across a broad array of dataset repositories, focusing on different data types and indexed using different terminologies. New methods are needed to enable biomedical researchers to locate datasets of interest within this rapidly expanding information ecosystem, and new resources are needed for the formal evaluation of these methods as they emerge. In this paper, we describe the design and generation of a benchmark for information retrieval of biomedical datasets, which was developed and used for the 2016 bioCADDIE Dataset Retrieval Challenge. In the tradition of the seminal Cranfield experiments, and as exemplified by the Text Retrieval Conference (TREC), this benchmark includes a corpus (biomedical datasets), a set of queries, and relevance judgments relating these queries to elements of the corpus. This paper describes the process through which each of these elements was derived, with a focus on those aspects that distinguish this benchmark from typical information retrieval reference sets. Specifically, we discuss the origin of our queries in the context of a larger collaborative effort, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium, and the distinguishing features of biomedical dataset retrieval as a task. The resulting benchmark set has been made publicly available to advance research in the area of biomedical dataset retrieval. Database URL: https://biocaddie.org/benchmark-data PMID:29220453
Rand, Hugh; Shumway, Martin; Trees, Eija K.; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E.; Defibaugh-Chavez, Stephanie; Carleton, Heather A.; Klimke, William A.; Katz, Lee S.
2017-01-01
Background: As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. Methods: We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Results: Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the “known tree” can be accurately called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets.
Discussion: These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines. PMID:29372115
Timme, Ruth E; Rand, Hugh; Shumway, Martin; Trees, Eija K; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E; Defibaugh-Chavez, Stephanie; Carleton, Heather A; Klimke, William A; Katz, Lee S
2017-01-01
As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets.
These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
Federal Register 2010, 2011, 2012, 2013, 2014
2010-08-24
... evaluates potential datasets and recommends which datasets are appropriate for assessment analyses. The... points to datasets incorporated in the original SEDAR benchmark assessment and run the benchmark... Webinar II November 22, 2010; 10 a.m. - 1 p.m.; SEDAR Update Assessment Webinar III Using updated datasets...
Screening for High Conductivity/Low Viscosity Ionic Liquids Using Product Descriptors.
Martin, Shawn; Pratt, Harry D; Anderson, Travis M
2017-07-01
We seek to optimize ionic liquids (ILs) for application to redox flow batteries. As part of this effort, we have developed a computational method for suggesting ILs with high conductivity and low viscosity. Since ILs consist of cation-anion pairs, we consider a method for treating ILs as pairs using product descriptors for QSPRs, a concept borrowed from the prediction of protein-protein interactions in bioinformatics. We demonstrate the method by predicting electrical conductivity, viscosity, and melting point on a dataset taken from the ILThermo database on June 18th, 2014. The dataset consists of 4,329 measurements taken from 165 ILs made up of 72 cations and 34 anions. We benchmark our QSPRs on the known values in the dataset then extend our predictions to screen all 2,448 possible cation-anion pairs in the dataset. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
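One common way to build a descriptor for a pair from the descriptors of its two members is a flattened outer product, with one feature per (cation feature, anion feature) combination. The sketch below illustrates that construction and the count of 72 × 34 = 2,448 candidate pairs. The descriptor values and labels are made up for illustration, and the abstract does not confirm that this is the exact product form the authors used.

```python
from itertools import product

def product_descriptor(cation_vec, anion_vec):
    """Flattened outer product of two descriptor vectors."""
    return [c * a for c in cation_vec for a in anion_vec]

# Hypothetical descriptor vectors: 2 features per cation, 3 per anion.
cations = {f"cation_{i}": [1.0 + i, 0.5 * i] for i in range(72)}
anions = {f"anion_{j}": [2.0, 0.1 * j, 1.0] for j in range(34)}

# One feature vector per possible cation-anion pair, measured or not.
pair_features = {
    (c_name, a_name): product_descriptor(c_vec, a_vec)
    for (c_name, c_vec), (a_name, a_vec) in product(cations.items(), anions.items())
}

print(len(pair_features))  # 2448
```

A model fitted on the pairs with measured values can then be applied to every entry of `pair_features`, which is the screening step the abstract describes.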
Screening for High Conductivity/Low Viscosity Ionic Liquids Using Product Descriptors
Martin, Shawn; Pratt, III, Harry D.; Anderson, Travis M.
2017-02-21
We seek to optimize ionic liquids (ILs) for application to redox flow batteries. As part of this effort, we have developed a computational method for suggesting ILs with high conductivity and low viscosity. Since ILs consist of cation-anion pairs, we consider a method for treating ILs as pairs using product descriptors for QSPRs, a concept borrowed from the prediction of protein-protein interactions in bioinformatics. We demonstrate the method by predicting electrical conductivity, viscosity, and melting point on a dataset taken from the ILThermo database on June 18th, 2014. The dataset consists of 4,329 measurements taken from 165 ILs made up of 72 cations and 34 anions. In conclusion, we benchmark our QSPRs on the known values in the dataset then extend our predictions to screen all 2,448 possible cation-anion pairs in the dataset.
Screening for High Conductivity/Low Viscosity Ionic Liquids Using Product Descriptors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Martin, Shawn; Pratt, III, Harry D.; Anderson, Travis M.
We seek to optimize ionic liquids (ILs) for application to redox flow batteries. As part of this effort, we have developed a computational method for suggesting ILs with high conductivity and low viscosity. Since ILs consist of cation-anion pairs, we consider a method for treating ILs as pairs using product descriptors for QSPRs, a concept borrowed from the prediction of protein-protein interactions in bioinformatics. We demonstrate the method by predicting electrical conductivity, viscosity, and melting point on a dataset taken from the ILThermo database on June 18th, 2014. The dataset consists of 4,329 measurements taken from 165 ILs made up of 72 cations and 34 anions. In conclusion, we benchmark our QSPRs on the known values in the dataset then extend our predictions to screen all 2,448 possible cation-anion pairs in the dataset.
PMLB: a large benchmark suite for machine learning evaluation and comparison.
Olson, Randal S; La Cava, William; Orzechowski, Patryk; Urbanowicz, Ryan J; Moore, Jason H
2017-01-01
The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists. The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. From this study, we find that existing benchmarks lack the diversity to properly benchmark machine learning algorithms, and there are several gaps in benchmarking problems that still need to be considered. This work represents another important step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future.
Decoys Selection in Benchmarking Datasets: Overview and Perspectives
Réau, Manon; Langenfeld, Florent; Zagury, Jean-François; Lagarde, Nathalie; Montes, Matthieu
2018-01-01
Virtual Screening (VS) is designed to prospectively help identify potential hits, i.e., compounds capable of interacting with a given target and potentially modulating its activity, out of large compound collections. Among the variety of methodologies, it is crucial to select the protocol that is best adapted to the query/target system under study and that yields the most reliable output. To this end, the performance of VS methods is commonly evaluated and compared by computing their ability to retrieve active compounds in benchmarking datasets. The benchmarking datasets contain a subset of known active compounds together with a subset of decoys, i.e., assumed non-active molecules. The composition of both the active and the decoy compound subsets is critical to limit the biases in the evaluation of the VS methods. In this review, we focus on the selection of decoy compounds, which has considerably changed over the years, from randomly selected compounds to highly customized or experimentally validated negative compounds. We first outline the evolution of decoy selection in benchmarking databases as well as current benchmarking databases that tend to minimize the introduction of biases, and secondly, we propose recommendations for the selection and the design of benchmarking datasets. PMID:29416509
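The ability of a VS method to retrieve actives from a set of actives and decoys is often summarized with an enrichment factor: the hit rate among the top-ranked fraction divided by the hit rate of the whole dataset. The sketch below shows one such calculation on made-up scores. The function name, the rounding of the cutoff and the toy data are illustrative assumptions, and the review itself does not prescribe this particular metric.

```python
def enrichment_factor(scores, is_active, fraction=0.1):
    """Hit rate in the top-ranked fraction divided by the overall hit rate.

    scores: higher means predicted more likely active.
    is_active: 1 for known actives, 0 for decoys (same order as scores).
    """
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("dataset contains no actives")
    # Rank compounds from best to worst score.
    ranked = sorted(zip(scores, is_active), key=lambda p: p[0], reverse=True)
    n_top = max(1, int(round(fraction * len(ranked))))
    hits_top = sum(active for _, active in ranked[:n_top])
    return (hits_top / n_top) / (total_actives / len(ranked))

# Toy benchmark: 2 actives and 8 decoys with invented scores.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.05, 0.6, 0.5, 0.7]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
ef_20 = enrichment_factor(toy_scores, toy_labels, fraction=0.2)
print(ef_20)
```

In this toy case one of the two actives lands in the top two positions, so the top-20% hit rate is 0.5 against a baseline of 0.2. The value depends directly on which decoys are in the set, which is why decoy selection can bias the comparison of methods.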
Generation of openEHR Test Datasets for Benchmarking.
El Helou, Samar; Karvonen, Tuukka; Yamamoto, Goshiro; Kume, Naoto; Kobayashi, Shinji; Kondo, Eiji; Hiragi, Shusuke; Okamoto, Kazuya; Tamura, Hiroshi; Kuroda, Tomohiro
2017-01-01
openEHR is a widely used EHR specification. Given its technology-independent nature, different approaches for implementing openEHR data repositories exist. Public openEHR datasets are needed to conduct benchmark analyses over different implementations. To address their current unavailability, we propose a method for generating openEHR test datasets that can be publicly shared and used.
Willemse, Elias J; Joubert, Johan W
2016-09-01
In this article we present benchmark datasets for the Mixed Capacitated Arc Routing Problem under Time restrictions with Intermediate Facilities (MCARPTIF). The problem is a generalisation of the Capacitated Arc Routing Problem (CARP), and closely represents waste collection routing. Four different test sets are presented, each consisting of multiple instance files, and which can be used to benchmark different solution approaches for the MCARPTIF. An in-depth description of the datasets can be found in "Constructive heuristics for the Mixed Capacity Arc Routing Problem under Time Restrictions with Intermediate Facilities" (Willemse and Joubert, 2016) [2] and "Splitting procedures for the Mixed Capacitated Arc Routing Problem under Time restrictions with Intermediate Facilities" (Willemse and Joubert, in press) [4]. The datasets are publicly available from "Library of benchmark test sets for variants of the Capacitated Arc Routing Problem under Time restrictions with Intermediate Facilities" (Willemse and Joubert, 2016) [3].
How does spatial extent of fMRI datasets affect independent component analysis decomposition?
Aragri, Adriana; Scarabino, Tommaso; Seifritz, Erich; Comani, Silvia; Cirillo, Sossio; Tedeschi, Gioacchino; Esposito, Fabrizio; Di Salle, Francesco
2006-09-01
Spatial independent component analysis (sICA) of functional magnetic resonance imaging (fMRI) time series can generate meaningful activation maps and associated descriptive signals, which are useful to evaluate datasets of the entire brain or selected portions of it. Besides computational implications, variations in the input dataset combined with the multivariate nature of ICA may lead to different spatial or temporal readouts of brain activation phenomena. By reducing and increasing a volume of interest (VOI), we applied sICA to different datasets from real activation experiments with multislice acquisition and single or multiple sensory-motor task-induced blood oxygenation level-dependent (BOLD) signal sources with different spatial and temporal structure. Using receiver operating characteristics (ROC) methodology for accuracy evaluation and multiple regression analysis as benchmark, we compared sICA decompositions of reduced and increased VOI fMRI time-series containing auditory, motor and hemifield visual activation occurring separately or simultaneously in time. Both approaches yielded valid results; however, the results of the increased VOI approach were spatially more accurate compared to the results of the decreased VOI approach. This is consistent with the capability of sICA to take advantage of extended samples of statistical observations and suggests that sICA is more powerful with extended rather than reduced VOI datasets to delineate brain activity. (c) 2006 Wiley-Liss, Inc.
Yamada, Kazunori D.; Tomii, Kentaro; Katoh, Kazutaka
2016-01-01
Motivation: Large multiple sequence alignments (MSAs), consisting of thousands of sequences, are becoming more and more common, due to advances in sequencing technologies. The MAFFT MSA program has several options for building large MSAs, but their performances have not been sufficiently assessed yet, because realistic benchmarking of large MSAs has been difficult. Recently, such assessments have been made possible through the HomFam and ContTest benchmark protein datasets. Along with the development of these datasets, an interesting theory was proposed: chained guide trees increase the accuracy of MSAs of structurally conserved regions. This theory challenges the basis of progressive alignment methods and needs to be examined by being compared with other known methods including computationally intensive ones. Results: We used HomFam, ContTest and OXFam (an extended version of OXBench) to evaluate several methods enabled in MAFFT: (1) a progressive method with approximate guide trees, (2) a progressive method with chained guide trees, (3) a combination of an iterative refinement method and a progressive method and (4) a less approximate progressive method that uses a rigorous guide tree and consistency score. Other programs, Clustal Omega and UPP, available for large MSAs, were also included in the comparison. The effect of method 2 (chained guide trees) was positive in ContTest but negative in HomFam and OXFam. Methods 3 and 4 increased the benchmark scores more consistently than method 2 for the three datasets, suggesting that they are safer to use. Availability and Implementation: http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27378296
Wu, Zhenqin; Ramsundar, Bharath; Feinberg, Evan N.; Gomes, Joseph; Geniesse, Caleb; Pappu, Aneesh S.; Leswing, Karl
2017-01-01
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle to deal with complex tasks under data scarcity and highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than choice of particular learning algorithm. PMID:29629118
A Benchmark Dataset for SSVEP-Based Brain-Computer Interfaces.
Wang, Yijun; Chen, Xiaogang; Gao, Xiaorong; Gao, Shangkai
2017-10-01
This paper presents a benchmark steady-state visual evoked potential (SSVEP) dataset acquired with a 40-target brain-computer interface (BCI) speller. The dataset consists of 64-channel electroencephalogram (EEG) data from 35 healthy subjects (8 experienced and 27 naïve) while they performed a cue-guided target selecting task. The virtual keyboard of the speller was composed of 40 visual flickers, which were coded using a joint frequency and phase modulation (JFPM) approach. The stimulation frequencies ranged from 8 Hz to 15.8 Hz with an interval of 0.2 Hz. The phase difference between two adjacent frequencies was . For each subject, the data included six blocks of 40 trials corresponding to all 40 flickers indicated by a visual cue in a random order. The stimulation duration in each trial was five seconds. The dataset can be used as a benchmark dataset to compare the methods for stimulus coding and target identification in SSVEP-based BCIs. Through offline simulation, the dataset can be used to design new system diagrams and evaluate their BCI performance without collecting any new data. The dataset also provides high-quality data for computational modeling of SSVEPs. The dataset is freely available from http://bci.med.tsinghua.edu.cn/download.html.
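The frequency grid described above can be reproduced with a few lines of Python: 8 Hz to 15.8 Hz in 0.2 Hz steps gives exactly 40 targets, and six blocks of 40 trials give 240 trials per subject. In the sketch below the phase step is left as a caller-supplied parameter because its value is not legible in this record, and the sinusoidal luminance formula is an assumed, commonly used form rather than something stated in the abstract.

```python
import math

N_TARGETS = 40   # flickers on the virtual keyboard
F0_HZ = 8.0      # lowest stimulation frequency
DF_HZ = 0.2      # spacing between adjacent frequencies
N_BLOCKS = 6     # blocks per subject

# Stimulation frequencies, rounded to one decimal to avoid float drift.
frequencies = [round(F0_HZ + k * DF_HZ, 1) for k in range(N_TARGETS)]
trials_per_subject = N_BLOCKS * N_TARGETS

def jfpm_targets(phase_step_rad):
    """Pair each frequency with a phase that grows by a fixed step.

    phase_step_rad must be supplied by the caller; it is not given in this record.
    """
    return [(f, (k * phase_step_rad) % (2.0 * math.pi))
            for k, f in enumerate(frequencies)]

def flicker_sample(freq_hz, phase_rad, t_s):
    """Assumed sinusoidal luminance in [0, 1] at time t_s (seconds)."""
    return 0.5 * (1.0 + math.sin(2.0 * math.pi * freq_hz * t_s + phase_rad))

print(frequencies[0], frequencies[-1], len(frequencies), trials_per_subject)
```

Calling `jfpm_targets` with the phase step reported in the original paper yields the full frequency-phase table for the 40 targets.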
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mendoza, Paul Michael
2016-08-31
The project goals seek to develop applications in order to automate MCNP criticality benchmark execution; create a dataset containing static benchmark information; combine MCNP output with benchmark information; and fit and visually represent data.
ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers.
Teodoro, Douglas; Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio
2018-01-01
The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating inserting throughput and query latency performances of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.
ORBDA: An openEHR benchmark dataset for performance assessment of electronic health record servers
Sundvall, Erik; João Junior, Mario; Ruch, Patrick; Miranda Freire, Sergio
2018-01-01
The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing hospitalisation and high complexity procedures information and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating inserting throughput and query latency performances of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms. PMID:29293556
Ó Conchúir, Shane; Barlow, Kyle A; Pache, Roland A; Ollikainen, Noah; Kundert, Kale; O'Meara, Matthew J; Smith, Colin A; Kortemme, Tanja
2015-01-01
The development and validation of computational macromolecular modeling and design methods depend on suitable benchmark datasets and informative metrics for comparing protocols. In addition, if a method is intended to be adopted broadly in diverse biological applications, there needs to be information on appropriate parameters for each protocol, as well as metrics describing the expected accuracy compared to experimental data. In certain disciplines, there exist established benchmarks and public resources where experts in a particular methodology are encouraged to supply their most efficient implementation of each particular benchmark. We aim to provide such a resource for protocols in macromolecular modeling and design. We present a freely accessible web resource (https://kortemmelab.ucsf.edu/benchmarks) to guide the development of protocols for protein modeling and design. The site provides benchmark datasets and metrics to compare the performance of a variety of modeling protocols using different computational sampling methods and energy functions, providing a "best practice" set of parameters for each method. Each benchmark has an associated downloadable benchmark capture archive containing the input files, analysis scripts, and tutorials for running the benchmark. The captures may be run with any suitable modeling method; we supply command lines for running the benchmarks using the Rosetta software suite. We have compiled initial benchmarks for the resource spanning three key areas: prediction of energetic effects of mutations, protein design, and protein structure prediction, each with associated state-of-the-art modeling protocols. With the help of the wider macromolecular modeling community, we hope to expand the variety of benchmarks included on the website and continue to evaluate new iterations of current methods as they become available.
A benchmark for vehicle detection on wide area motion imagery
NASA Astrophysics Data System (ADS)
Catrambone, Joseph; Amzovski, Ismail; Liang, Pengpeng; Blasch, Erik; Sheaff, Carolyn; Wang, Zhonghai; Chen, Genshe; Ling, Haibin
2015-05-01
Wide area motion imagery (WAMI) has been attracting an increased amount of research attention due to its large spatial and temporal coverage. An important application includes moving target analysis, where vehicle detection is often one of the first steps before advanced activity analysis. While there exist many vehicle detection algorithms, a thorough evaluation of them on WAMI data still remains a challenge mainly due to the lack of an appropriate benchmark data set. In this paper, we address a research need by presenting a new benchmark for wide area motion imagery vehicle detection data. The WAMI benchmark is based on the recently available Wright-Patterson Air Force Base (WPAFB09) dataset and the Temple Resolved Uncertainty Target History (TRUTH) associated target annotation. Trajectory annotations were provided in the original release of the WPAFB09 dataset, but detailed vehicle annotations were not available with the dataset. In addition, annotations of static vehicles, e.g., in parking lots, are also not identified in the original release. Addressing these issues, we re-annotated the whole dataset with detailed information for each vehicle, including not only a target's location, but also its pose and size. The annotated WAMI data set should be useful to the community as a common benchmark to compare WAMI detection, tracking, and identification methods.
Identifying key genes in glaucoma based on a benchmarked dataset and the gene regulatory network.
Chen, Xi; Wang, Qiao-Ling; Zhang, Meng-Hui
2017-10-01
The current study aimed to identify key genes in glaucoma based on a benchmarked dataset and gene regulatory network (GRN). Local and global noise was added to the gene expression dataset to produce a benchmarked dataset. Differentially-expressed genes (DEGs) between patients with glaucoma and normal controls were identified utilizing the Linear Models for Microarray Data (Limma) package based on the benchmarked dataset. A total of 5 GRN inference methods, including Zscore, GeneNet, context likelihood of relatedness (CLR) algorithm, Partial Correlation coefficient with Information Theory (PCIT) and GEne Network Inference with Ensemble of Trees (Genie3) were evaluated using receiver operating characteristic (ROC) and precision and recall (PR) curves. The inference method with the best performance was selected to construct the GRN. Subsequently, topological centrality analysis (degree, closeness and betweenness) was conducted to identify key genes in the GRN of glaucoma. Finally, the key genes were validated by performing reverse transcription-quantitative polymerase chain reaction (RT-qPCR). A total of 176 DEGs were detected from the benchmarked dataset. The ROC and PR curves of the 5 methods were analyzed and it was determined that Genie3 had a clear advantage over the other methods; thus, Genie3 was used to construct the GRN. Following topological centrality analysis, 14 key genes for glaucoma were identified, including IL6, EPHA2 and GSTT1, and 5 of these 14 key genes were validated by RT-qPCR. Therefore, the current study identified 14 key genes in glaucoma, which may be potential biomarkers to use in the diagnosis of glaucoma and aid in identifying the molecular mechanism of this disease.
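As a rough illustration of the degree-centrality step mentioned in the abstract above, the sketch below ranks nodes of an undirected network by normalized degree. It is not the authors' pipeline: the function name `degree_centrality`, the edge list and the node labels `g1` to `g4` are hypothetical placeholders, and closeness and betweenness are not covered.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected graph given as (a, b) pairs."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    # Degree divided by the maximum possible degree (n - 1).
    return {node: len(nb) / (n - 1) for node, nb in neighbours.items()}


# Hypothetical toy network; the labels are placeholders, not real regulatory edges.
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g3")]
scores = degree_centrality(edges)
top_node = max(scores, key=scores.get)
print(top_node, scores[top_node])
```

In this toy network `g1` touches every other node, so it receives the maximum score; a real analysis would combine several centrality measures, as the abstract describes.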
Benchmarking neuromorphic vision: lessons learnt from computer vision
Tan, Cheston; Lallee, Stephane; Orchard, Garrick
2015-01-01
Neuromorphic Vision sensors have improved greatly since the first silicon retina was presented almost three decades ago. They have recently matured to the point where they are commercially available and can be operated by laymen. However, despite improved availability of sensors, there remains a lack of good datasets, while algorithms for processing spike-based visual data are still in their infancy. On the other hand, frame-based computer vision algorithms are far more mature, thanks in part to widely accepted datasets which allow direct comparison between algorithms and encourage competition. We are presented with a unique opportunity to shape the development of Neuromorphic Vision benchmarks and challenges by leveraging what has been learnt from the use of datasets in frame-based computer vision. Taking advantage of this opportunity, in this paper we review the role that benchmarks and challenges have played in the advancement of frame-based computer vision, and suggest guidelines for the creation of Neuromorphic Vision benchmarks and challenges. We also discuss the unique challenges faced when benchmarking Neuromorphic Vision algorithms, particularly when attempting to provide direct comparison with frame-based computer vision. PMID:26528120
MicroRNA Array Normalization: An Evaluation Using a Randomized Dataset as the Benchmark
Qin, Li-Xuan; Zhou, Qin
2014-01-01
MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays. PMID:24905456
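The comparison against a benchmark described above comes down to counting true and false positives. The following sketch assumes the benchmark and the normalized analysis each yield a set of marker identifiers; the function name `discovery_rates` and the marker names are hypothetical and do not come from the paper.

```python
def discovery_rates(called, benchmark):
    """Compare markers called after normalization against benchmark markers."""
    called, benchmark = set(called), set(benchmark)
    true_pos = called & benchmark   # called and confirmed by the benchmark
    false_pos = called - benchmark  # called but absent from the benchmark
    # False discovery rate: share of called markers that are false positives.
    fdr = len(false_pos) / len(called) if called else 0.0
    # True positive rate: share of benchmark markers that were recovered.
    tpr = len(true_pos) / len(benchmark) if benchmark else 0.0
    return {"tp": len(true_pos), "fp": len(false_pos), "fdr": fdr, "tpr": tpr}


# Hypothetical marker identifiers, for illustration only.
benchmark_markers = {"miR-a", "miR-b", "miR-c", "miR-d"}
called_markers = {"miR-a", "miR-b", "miR-x", "miR-y"}
rates = discovery_rates(called_markers, benchmark_markers)
print(rates)
```

With these made-up sets, two of the four called markers are absent from the benchmark, so the false discovery rate is 0.5.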
Mahmood, Khalid; Jung, Chol-Hee; Philip, Gayle; Georgeson, Peter; Chung, Jessica; Pope, Bernard J; Park, Daniel J
2017-05-16
Genetic variant effect prediction algorithms are used extensively in clinical genomics and research to determine the likely consequences of amino acid substitutions on protein function. It is vital that we better understand their accuracies and limitations because published performance metrics are confounded by serious problems of circularity and error propagation. Here, we derive three independent, functionally determined human mutation datasets, UniFun, BRCA1-DMS and TP53-TA, and employ them, alongside previously described datasets, to assess the pre-eminent variant effect prediction tools. Apparent accuracies of variant effect prediction tools were influenced significantly by the benchmarking dataset. Benchmarking with the assay-determined datasets UniFun and BRCA1-DMS yielded areas under the receiver operating characteristic curves in the modest ranges of 0.52 to 0.63 and 0.54 to 0.75, respectively, considerably lower than observed for other, potentially more conflicted datasets. These results raise concerns about how such algorithms should be employed, particularly in a clinical setting. Contemporary variant effect prediction tools are unlikely to be as accurate at the general prediction of functional impacts on proteins as reported prior. Use of functional assay-based datasets that avoid prior dependencies promises to be valuable for the ongoing development and accurate benchmarking of such tools.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mardirossian, Narbe; Head-Gordon, Martin
2016-11-09
Benchmark datasets of non-covalent interactions are essential for assessing the performance of density functionals and other quantum chemistry approaches. In a recent blind test, Taylor et al. benchmarked 14 methods on a new dataset consisting of 10 dimer potential energy curves calculated using coupled cluster with singles, doubles, and perturbative triples (CCSD(T)) at the complete basis set (CBS) limit (80 data points in total). Finally, the dataset is particularly interesting because compressed, near-equilibrium, and stretched regions of the potential energy surface are extensively sampled.
ClimateNet: A Machine Learning dataset for Climate Science Research
NASA Astrophysics Data System (ADS)
Prabhat, M.; Biard, J.; Ganguly, S.; Ames, S.; Kashinath, K.; Kim, S. K.; Kahou, S.; Maharaj, T.; Beckham, C.; O'Brien, T. A.; Wehner, M. F.; Williams, D. N.; Kunkel, K.; Collins, W. D.
2017-12-01
Deep Learning techniques have revolutionized commercial applications in Computer vision, speech recognition and control systems. The key for all of these developments was the creation of a curated, labeled dataset ImageNet, for enabling multiple research groups around the world to develop methods, benchmark performance and compete with each other. The success of Deep Learning can be largely attributed to the broad availability of this dataset. Our empirical investigations have revealed that Deep Learning is similarly poised to benefit the task of pattern detection in climate science. Unfortunately, labeled datasets, a key pre-requisite for training, are hard to find. Individual research groups are typically interested in specialized weather patterns, making it hard to unify, and share datasets across groups and institutions. In this work, we are proposing ClimateNet: a labeled dataset that provides labeled instances of extreme weather patterns, as well as associated raw fields in model and observational output. We develop a schema in NetCDF to enumerate weather pattern classes/types, store bounding boxes, and pixel-masks. We are also working on a TensorFlow implementation to natively import such NetCDF datasets, and are providing a reference convolutional architecture for binary classification tasks. Our hope is that researchers in Climate Science, as well as ML/DL, will be able to use (and extend) ClimateNet to make rapid progress in the application of Deep Learning for Climate Science research.
Benchmarking of Typical Meteorological Year datasets dedicated to Concentrated-PV systems
NASA Astrophysics Data System (ADS)
Realpe, Ana Maria; Vernay, Christophe; Pitaval, Sébastien; Blanc, Philippe; Wald, Lucien; Lenoir, Camille
2016-04-01
Accurate analysis of meteorological and pyranometric data for long-term analysis is the basis of decision-making for banks and investors, regarding solar energy conversion systems. This has led to the development of methodologies for the generation of Typical Meteorological Years (TMY) datasets. The most used method for solar energy conversion systems was proposed in 1978 by the Sandia Laboratory (Hall et al., 1978) considering a specific weighted combination of different meteorological variables with notably global, diffuse horizontal and direct normal irradiances, air temperature, wind speed, relative humidity. In 2012, a new approach was proposed in the framework of the European project FP7 ENDORSE. It introduced the concept of "driver" that is defined by the user as an explicit function of the pyranometric and meteorological relevant variables to improve the representativeness of the TMY datasets with respect to the specific solar energy conversion system of interest. The present study aims at comparing and benchmarking different TMY datasets considering a specific Concentrated-PV (CPV) system as the solar energy conversion system of interest. Using long-term (15+ years) time-series of high quality meteorological and pyranometric ground measurements, three types of TMY datasets were generated by the following methods: the Sandia method, a simplified driver with DNI as the only representative variable and a more sophisticated driver. The latter takes into account the sensitivities of the CPV system with respect to the spectral distribution of the solar irradiance and wind speed. Different TMY datasets from the three methods have been generated considering different numbers of years in the historical dataset, ranging from 5 to 15 years. The comparisons and benchmarking of these TMY datasets are conducted considering the long-term time series of simulated CPV electric production as a reference.
The results of this benchmarking clearly show that the Sandia method is not suitable for CPV systems. For these systems, the TMY datasets obtained using dedicated drivers (DNI only or more precise one) are more representative when derived from a limited long-term meteorological dataset.
Local feature saliency classifier for real-time intrusion monitoring
NASA Astrophysics Data System (ADS)
Buch, Norbert; Velastin, Sergio A.
2014-07-01
We propose a texture saliency classifier to detect people in a video frame by identifying salient texture regions. The image is classified into foreground and background in real time. No temporal image information is used during the classification. The system is used for the task of detecting people entering a sterile zone, which is a common scenario for visual surveillance. Testing is performed on the Imagery Library for Intelligent Detection Systems sterile zone benchmark dataset of the United Kingdom's Home Office. The basic classifier is extended by fusing its output with simple motion information, which significantly outperforms standard motion tracking. A lower detection time can be achieved by combining texture classification with Kalman filtering. The fusion approach running at 10 fps gives the highest result of F1=0.92 for the 24-h test dataset. The paper concludes with a detailed analysis of the computation time required for the different parts of the algorithm.
Watson, Nathanial E; Parsons, Brendon A; Synovec, Robert E
2016-08-12
Performance of tile-based Fisher Ratio (F-ratio) data analysis, recently developed for discovery-based studies using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC-TOFMS), is evaluated with a metabolomics dataset that had been previously analyzed in great detail, but while taking a brute force approach. The previously analyzed data (referred to herein as the benchmark dataset) were intracellular extracts from Saccharomyces cerevisiae (yeast), either metabolizing glucose (repressed) or ethanol (derepressed), which define the two classes in the discovery-based analysis to find metabolites that are statistically different in concentration between the two classes. Beneficially, this previously analyzed dataset provides a concrete means to validate the tile-based F-ratio software. Herein, we demonstrate and validate the significant benefits of applying tile-based F-ratio analysis. The yeast metabolomics data are analyzed more rapidly in about one week versus one year for the prior studies with this dataset. Furthermore, a null distribution analysis is implemented to statistically determine an adequate F-ratio threshold, whereby the variables with F-ratio values below the threshold can be ignored as not class distinguishing, which provides the analyst with confidence when analyzing the hit table. Forty-six of the fifty-four benchmarked changing metabolites were discovered by the new methodology while consistently excluding all but one of the benchmarked nineteen false positive metabolites previously identified. Copyright © 2016 Elsevier B.V. All rights reserved.
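For readers unfamiliar with the statistic, a Fisher ratio for a single variable is the between-class variance divided by the within-class variance. The sketch below computes the standard one-way ANOVA form of that ratio; it does not reproduce the tile-based GC×GC-TOFMS software or its null distribution analysis, and the function name `fisher_ratio` and the signal values are made up for illustration.

```python
from statistics import mean


def fisher_ratio(groups):
    """One-way ANOVA F-ratio: between-class variance over within-class variance."""
    k = len(groups)                  # number of classes
    n = sum(len(g) for g in groups)  # total number of samples
    grand_mean = mean(x for g in groups for x in g)
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n - k)
    # Note: raises ZeroDivisionError if every class has zero internal variance.
    return between / within


# Made-up signal values for one variable in the two classes.
repressed = [1.0, 2.0, 3.0]
derepressed = [4.0, 5.0, 6.0]
f_value = fisher_ratio([repressed, derepressed])
print(f_value)
```

Variables whose ratio falls below a chosen threshold would be treated as not class distinguishing; how that threshold is set is specific to the published method and is not modelled here.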
Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset
2014-12-23
publications for benchmarking prognostics algorithms. The turbofan degradation datasets have received over seven thousand unique downloads in the last five...approaches that researchers have taken to implement prognostics using these turbofan datasets. Some unique characteristics of these datasets are also...Description of the five turbofan degradation datasets available from NASA repository. Datasets #Fault Modes #Conditions #Train Units #Test Units
Benchmarking Deep Learning Models on Large Healthcare Datasets.
Purushotham, Sanjay; Meng, Chuizheng; Che, Zhengping; Liu, Yan
2018-06-04
Deep learning models (aka Deep Neural Networks) have revolutionized many fields including computer vision, natural language processing, speech recognition, and are being increasingly used in clinical healthcare applications. However, few works exist which have benchmarked the performance of the deep learning models with respect to the state-of-the-art machine learning models and prognostic scoring systems on publicly available healthcare datasets. In this paper, we present the benchmarking results for several clinical prediction tasks such as mortality prediction, length of stay prediction, and ICD-9 code group prediction using Deep Learning models, ensemble of machine learning models (Super Learner algorithm), SAPS II and SOFA scores. We used the Medical Information Mart for Intensive Care III (MIMIC-III) (v1.4) publicly available dataset, which includes all patients admitted to an ICU at the Beth Israel Deaconess Medical Center from 2001 to 2012, for the benchmarking tasks. Our results show that deep learning models consistently outperform all the other approaches especially when the 'raw' clinical time series data is used as input features to the models. Copyright © 2018 Elsevier Inc. All rights reserved.
Sczyrba, Alexander; Hofmann, Peter; Belmann, Peter; Koslicki, David; Janssen, Stefan; Dröge, Johannes; Gregor, Ivan; Majda, Stephan; Fiedler, Jessika; Dahms, Eik; Bremges, Andreas; Fritz, Adrian; Garrido-Oter, Ruben; Jørgensen, Tue Sparholt; Shapiro, Nicole; Blood, Philip D.; Gurevich, Alexey; Bai, Yang; Turaev, Dmitrij; DeMaere, Matthew Z.; Chikhi, Rayan; Nagarajan, Niranjan; Quince, Christopher; Meyer, Fernando; Balvočiūtė, Monika; Hansen, Lars Hestbjerg; Sørensen, Søren J.; Chia, Burton K. H.; Denis, Bertrand; Froula, Jeff L.; Wang, Zhong; Egan, Robert; Kang, Dongwan Don; Cook, Jeffrey J.; Deltel, Charles; Beckstette, Michael; Lemaitre, Claire; Peterlongo, Pierre; Rizk, Guillaume; Lavenier, Dominique; Wu, Yu-Wei; Singer, Steven W.; Jain, Chirag; Strous, Marc; Klingenberg, Heiner; Meinicke, Peter; Barton, Michael; Lingner, Thomas; Lin, Hsin-Hung; Liao, Yu-Chieh; Silva, Genivaldo Gueiros Z.; Cuevas, Daniel A.; Edwards, Robert A.; Saha, Surya; Piro, Vitor C.; Renard, Bernhard Y.; Pop, Mihai; Klenk, Hans-Peter; Göker, Markus; Kyrpides, Nikos C.; Woyke, Tanja; Vorholt, Julia A.; Schulze-Lefert, Paul; Rubin, Edward M.; Darling, Aaron E.; Rattei, Thomas; McHardy, Alice C.
2018-01-01
In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performances, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions. PMID:28967888
Federal Register 2010, 2011, 2012, 2013, 2014
2010-09-02
... is a data report which compiles and evaluates potential datasets and recommends which datasets are... add additional data points to datasets incorporated in the original SEDAR benchmark assessment and run... Conference Call Using updated datasets adopted during the Data Webinar, participants will employ assessment...
Dataset-Driven Research to Support Learning and Knowledge Analytics
ERIC Educational Resources Information Center
Verbert, Katrien; Manouselis, Nikos; Drachsler, Hendrik; Duval, Erik
2012-01-01
In various research areas, the availability of open datasets is considered as key for research and application purposes. These datasets are used as benchmarks to develop new algorithms and to compare them to other algorithms in given settings. Finding such available datasets for experimentation can be a challenging task in technology enhanced…
A benchmark for comparison of cell tracking algorithms
Maška, Martin; Ulman, Vladimír; Svoboda, David; Matula, Pavel; Matula, Petr; Ederra, Cristina; Urbiola, Ainhoa; España, Tomás; Venkatesan, Subramanian; Balak, Deepak M.W.; Karas, Pavel; Bolcková, Tereza; Štreitová, Markéta; Carthel, Craig; Coraluppi, Stefano; Harder, Nathalie; Rohr, Karl; Magnusson, Klas E. G.; Jaldén, Joakim; Blau, Helen M.; Dzyubachyk, Oleh; Křížek, Pavel; Hagen, Guy M.; Pastor-Escuredo, David; Jimenez-Carretero, Daniel; Ledesma-Carbayo, Maria J.; Muñoz-Barrutia, Arrate; Meijering, Erik; Kozubek, Michal; Ortiz-de-Solorzano, Carlos
2014-01-01
Motivation: Automatic tracking of cells in multidimensional time-lapse fluorescence microscopy is an important task in many biomedical applications. A novel framework for objective evaluation of cell tracking algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2013 Cell Tracking Challenge. In this article, we present the logistics, datasets, methods and results of the challenge and lay down the principles for future uses of this benchmark. Results: The main contributions of the challenge include the creation of a comprehensive video dataset repository and the definition of objective measures for comparison and ranking of the algorithms. With this benchmark, six algorithms covering a variety of segmentation and tracking paradigms have been compared and ranked based on their performance on both synthetic and real datasets. Given the diversity of the datasets, we do not declare a single winner of the challenge. Instead, we present and discuss the results for each individual dataset separately. Availability and implementation: The challenge Web site (http://www.codesolorzano.com/celltrackingchallenge) provides access to the training and competition datasets, along with the ground truth of the training videos. It also provides access to Windows and Linux executable files of the evaluation software and most of the algorithms that competed in the challenge. Contact: codesolorzano@unav.es Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24526711
Antibody-protein interactions: benchmark datasets and prediction tools evaluation
Ponomarenko, Julia V; Bourne, Philip E
2007-01-01
Background The ability to predict antibody binding sites (aka antigenic determinants or B-cell epitopes) for a given protein is a precursor to new vaccine design and diagnostics. Among the various methods of B-cell epitope identification X-ray crystallography is one of the most reliable methods. Using these experimental data computational methods exist for B-cell epitope prediction. As the number of structures of antibody-protein complexes grows, further interest in prediction methods using 3D structure is anticipated. This work aims to establish a benchmark for 3D structure-based epitope prediction methods. Results Two B-cell epitope benchmark datasets inferred from the 3D structures of antibody-protein complexes were defined. The first is a dataset of 62 representative 3D structures of protein antigens with inferred structural epitopes. The second is a dataset of 82 structures of antibody-protein complexes containing different structural epitopes. Using these datasets, eight web-servers developed for antibody and protein binding sites prediction have been evaluated. In no method did performance exceed a 40% precision and 46% recall. The values of the area under the receiver operating characteristic curve for the evaluated methods were about 0.6 for ConSurf, DiscoTope, and PPI-PRED methods and above 0.65 but not exceeding 0.70 for protein-protein docking methods when the best of the top ten models for the bound docking were considered; the remaining methods performed close to random. The benchmark datasets are included as a supplement to this paper. Conclusion It may be possible to improve epitope prediction methods through training on datasets which include only immune epitopes and through utilizing more features characterizing epitopes, for example, the evolutionary conservation score. Notwithstanding, overall poor performance may reflect the generality of antigenicity and hence the inability to decipher B-cell epitopes as an intrinsic feature of the protein. 
It is an open question as to whether ultimately discriminatory features can be found. PMID:17910770
Food Recognition: A New Dataset, Experiments, and Results.
Ciocca, Gianluigi; Napoletano, Paolo; Schettini, Raimondo
2017-05-01
We propose a new dataset for the evaluation of food recognition algorithms that can be used in dietary monitoring applications. Each image depicts a real canteen tray with dishes and foods arranged in different ways. Each tray contains multiple instances of food classes. The dataset contains 1027 canteen trays for a total of 3616 food instances belonging to 73 food classes. The food on the tray images has been manually segmented using carefully drawn polygonal boundaries. We have benchmarked the dataset by designing an automatic tray analysis pipeline that takes a tray image as input, finds the regions of interest, and predicts for each region the corresponding food class. We have experimented with three different classification strategies using also several visual descriptors. We achieve about 79% of food and tray recognition accuracy using convolutional-neural-networks-based features. The dataset, as well as the benchmark framework, are available to the research community.
MIPS bacterial genomes functional annotation benchmark dataset.
Tetko, Igor V; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Fobo, Gisela; Ruepp, Andreas; Antonov, Alexey V; Surmeli, Dimitrij; Mewes, Hans-Werner
2005-05-15
Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as benchmark) as well as tedious preparatory work to generate sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that could be used to develop and benchmark methods for functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. BFAB is available at http://mips.gsf.de/proj/bfab
De Hertogh, Benoît; De Meulder, Bertrand; Berger, Fabrice; Pierre, Michael; Bareke, Eric; Gaigneaux, Anthoula; Depiereux, Eric
2010-01-11
Recent reanalysis of spike-in datasets underscored the need for new and more accurate benchmark datasets for statistical microarray analysis. We present here a fresh method using biologically-relevant data to evaluate the performance of statistical methods. Our novel method ranks the probesets from a dataset composed of publicly-available biological microarray data and extracts subset matrices with precise information/noise ratios. Our method can be used to determine the capability of different methods to better estimate variance for a given number of replicates. The mean-variance and mean-fold change relationships of the matrices revealed a closer approximation of biological reality. Performance analysis refined the results from benchmarks published previously. We show that the Shrinkage t test (close to Limma) was the best of the methods tested, except when two replicates were examined, where the Regularized t test and the Window t test performed slightly better. The R scripts used for the analysis are available at http://urbm-cluster.urbm.fundp.ac.be/~bdemeulder/.
Assessment of composite motif discovery methods.
Klepper, Kjetil; Sandve, Geir K; Abul, Osman; Johansen, Jostein; Drablos, Finn
2008-02-26
Computational discovery of regulatory elements is an important area of bioinformatics research and more than a hundred motif discovery methods have been published. Traditionally, most of these methods have addressed the problem of single motif discovery - discovering binding motifs for individual transcription factors. In higher organisms, however, transcription factors usually act in combination with nearby bound factors to induce specific regulatory behaviours. Hence, recent focus has shifted from single motifs to the discovery of sets of motifs bound by multiple cooperating transcription factors, so called composite motifs or cis-regulatory modules. Given the large number and diversity of methods available, independent assessment of methods becomes important. Although there have been several benchmark studies of single motif discovery, no similar studies have previously been conducted concerning composite motif discovery. We have developed a benchmarking framework for composite motif discovery and used it to evaluate the performance of eight published module discovery tools. Benchmark datasets were constructed based on real genomic sequences containing experimentally verified regulatory modules, and the module discovery programs were asked to predict both the locations of these modules and to specify the single motifs involved. To aid the programs in their search, we provided position weight matrices corresponding to the binding motifs of the transcription factors involved. In addition, selections of decoy matrices were mixed with the genuine matrices on one dataset to test the response of programs to varying levels of noise. Although some of the methods tested tended to score somewhat better than others overall, there were still large variations between individual datasets and no single method performed consistently better than the rest in all situations. 
The variation in performance on individual datasets also shows that the new benchmark datasets represent a suitable variety of challenges to most methods for module discovery.
Bellot, Pau; Olsen, Catharina; Salembier, Philippe; Oliveras-Vergés, Albert; Meyer, Patrick E
2015-09-29
In the last decade, a great number of methods for reconstructing gene regulatory networks from expression data have been proposed. However, very few tools and datasets allow those methods to be evaluated accurately and reproducibly. Hence, we propose here a new tool, able to perform a systematic, yet fully reproducible, evaluation of transcriptional network inference methods. Our open-source and freely available Bioconductor package aggregates a large set of tools to assess the robustness of network inference algorithms against different simulators, topologies, sample sizes and noise intensities. The benchmarking framework that uses various datasets highlights the specialization of some methods toward network types and data. As a result, it is possible to identify the techniques that have broad overall performance.
Li, Zhucui; Lu, Yan; Guo, Yufeng; Cao, Haijie; Wang, Qinhong; Shui, Wenqing
2018-10-31
Data analysis represents a key challenge for untargeted metabolomics studies and it commonly requires extensive processing of thousands of metabolite peaks included in raw high-resolution MS data. Although a number of software packages have been developed to facilitate untargeted data processing, they have not been comprehensively scrutinized for their capability in feature detection, quantification and marker selection using a well-defined benchmark sample set. In this study, we acquired a benchmark dataset from standard mixtures consisting of 1100 compounds with specified concentration ratios including 130 compounds with significant variation of concentrations. The five software packages evaluated here (MS-Dial, MZmine 2, XCMS, MarkerView, and Compound Discoverer) showed similar performance in detection of true features derived from compounds in the mixtures. However, significant differences between untargeted metabolomics software were observed in relative quantification of true features in the benchmark dataset. MZmine 2 outperformed the other software in terms of quantification accuracy and it reported the most true discriminating markers together with the fewest false markers. Furthermore, we assessed selection of discriminating markers by different software using both the benchmark dataset and a real-case metabolomics dataset to propose combined usage of two software packages for increasing confidence of biomarker identification. Our findings from comprehensive evaluation of untargeted metabolomics software would help guide future improvements of these widely used bioinformatics tools and enable users to properly interpret their metabolomics results.
Design of an Evolutionary Approach for Intrusion Detection
2013-01-01
A novel evolutionary approach is proposed for effective intrusion detection based on benchmark datasets. The proposed approach can generate a pool of noninferior individual solutions and ensemble solutions thereof. The generated ensembles can be used to detect the intrusions accurately. For the intrusion detection problem, the proposed approach can consider conflicting objectives simultaneously, such as the detection rate of each attack class, error rate, accuracy, diversity, and so forth. The proposed approach can generate a pool of noninferior solutions and ensembles thereof having optimized trade-off values of multiple conflicting objectives. In this paper, a three-phase approach is proposed to generate solutions from a simple chromosome design. In the first phase, a Pareto front of noninferior individual solutions is approximated. In the second phase of the proposed approach, the entire solution set is further refined to determine effective ensemble solutions considering solution interaction. In this phase, another improved Pareto front of ensemble solutions over that of individual solutions is approximated. The ensemble solutions in the improved Pareto front reported improved detection results based on benchmark datasets for intrusion detection. In the third phase, a combination method such as majority voting is used to fuse the predictions of individual solutions for determining the prediction of the ensemble solution. Benchmark datasets, namely, the KDD Cup 1999 and ISCX 2012 datasets, are used to demonstrate and validate the performance of the proposed approach for intrusion detection. The proposed approach can discover individual solutions and ensemble solutions thereof with a good support and a detection rate from benchmark datasets (in comparison with well-known ensemble methods like bagging and boosting). 
In addition, the proposed approach is a generalized classification approach that is applicable to the problem of any field having multiple conflicting objectives, and a dataset can be represented in the form of labelled instances in terms of its features. PMID:24376390
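The third phase described above fuses the predictions of individual solutions by a combination method such as majority voting. The following is only a minimal sketch of plain majority voting, not the authors' implementation; the function name `majority_vote`, the label strings, and the tie-breaking rule (the earliest classifier among the tied labels wins) are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-classifier label lists into one ensemble prediction.

    predictions: list of lists, one inner list of labels per classifier,
    all of the same length (one label per instance).
    Ties are broken in favour of the earliest classifier (an assumption).
    """
    if not predictions:
        raise ValueError("at least one classifier is required")
    n_instances = len(predictions[0])
    if any(len(p) != n_instances for p in predictions):
        raise ValueError("all classifiers must label the same instances")

    fused = []
    for votes in zip(*predictions):  # votes for a single instance
        counts = Counter(votes)
        top = max(counts.values())
        # first label, in classifier order, that reaches the top count
        fused.append(next(v for v in votes if counts[v] == top))
    return fused


if __name__ == "__main__":
    # Three hypothetical classifiers labelling three connections.
    clf_a = ["normal", "dos", "probe"]
    clf_b = ["normal", "dos", "dos"]
    clf_c = ["dos", "dos", "probe"]
    print(majority_vote([clf_a, clf_b, clf_c]))
```

A weighted vote, for example one that weights each classifier by its detection rate on the attack class in question, would be a natural variation, but the abstract does not say which variant was used.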
Validation project. This report describes the procedure used to generate the noise models output dataset, and then it compares that dataset to the...benchmark, the Engineer Research and Development Centers Long-Range Sound Propagation dataset. It was found that the models consistently underpredict the
Deriving empirical benchmarks from existing monitoring datasets for rangeland adaptive management
USDA-ARS's Scientific Manuscript database
Under adaptive management, goals and decisions for managing rangeland resources are shaped by requirements like the Bureau of Land Management’s (BLM’s) Land Health Standards, which specify desired conditions. Without formalized, quantitative benchmarks for triggering management actions, adaptive man...
APPLICATION OF BENCHMARK DOSE METHODOLOGY TO DATA FROM PRENATAL DEVELOPMENTAL TOXICITY STUDIES
The benchmark dose (BMD) concept was applied to 246 conventional developmental toxicity datasets from government, industry and commercial laboratories. Five modeling approaches were used, two generic and three specific to developmental toxicity (DT models). BMDs for both quantal ...
A benchmark testing ground for integrating homology modeling and protein docking.
Bohnuud, Tanggis; Luo, Lingqi; Wodak, Shoshana J; Bonvin, Alexandre M J J; Weng, Zhiping; Vajda, Sandor; Schueler-Furman, Ora; Kozakov, Dima
2017-01-01
Protein docking procedures carry out the task of predicting the structure of a protein-protein complex starting from the known structures of the individual protein components. More often than not, however, the structure of one or both components is not known, but can be derived by homology modeling on the basis of known structures of related proteins deposited in the Protein Data Bank (PDB). Thus, the problem is to develop methods that optimally integrate homology modeling and docking with the goal of predicting the structure of a complex directly from the amino acid sequences of its component proteins. One possibility is to use the best available homology modeling and docking methods. However, the models built for the individual subunits often differ to a significant degree from the bound conformation in the complex, often much more so than the differences observed between free and bound structures of the same protein, and therefore additional conformational adjustments, both at the backbone and side chain levels need to be modeled to achieve an accurate docking prediction. In particular, even homology models of overall good accuracy frequently include localized errors that unfavorably impact docking results. The predicted reliability of the different regions in the model can also serve as a useful input for the docking calculations. Here we present a benchmark dataset that should help to explore and solve combined modeling and docking problems. This dataset comprises a subset of the experimentally solved 'target' complexes from the widely used Docking Benchmark from the Weng Lab (excluding antibody-antigen complexes). This subset is extended to include the structures from the PDB related to those of the individual components of each complex, and hence represent potential templates for investigating and benchmarking integrated homology modeling and docking approaches. 
Template sets can be dynamically customized by specifying ranges in sequence similarity and in PDB release dates, or using other filtering options, such as excluding sets of specific structures from the template list. Multiple sequence alignments, as well as structural alignments of the templates to their corresponding subunits in the target are also provided. The resource is accessible online or can be downloaded at http://cluspro.org/benchmark, and is updated on a weekly basis in synchrony with new PDB releases. Proteins 2016; 85:10-16. © 2016 Wiley Periodicals, Inc.
Review and Analysis of Algorithmic Approaches Developed for Prognostics on CMAPSS Dataset
NASA Technical Reports Server (NTRS)
Ramasso, Emmanuel; Saxena, Abhinav
2014-01-01
Benchmarking of prognostic algorithms has been challenging due to limited availability of common datasets suitable for prognostics. In an attempt to alleviate this problem several benchmarking datasets have been collected by NASA's prognostic center of excellence and made available to the Prognostics and Health Management (PHM) community to allow evaluation and comparison of prognostics algorithms. Among those datasets are five C-MAPSS datasets that have been extremely popular due to their unique characteristics making them suitable for prognostics. The C-MAPSS datasets pose several challenges that have been tackled by different methods in the PHM literature. In particular, management of high variability due to sensor noise, effects of operating conditions, and presence of multiple simultaneous fault modes are some factors that have great impact on the generalization capabilities of prognostics algorithms. More than 70 publications have used the C-MAPSS datasets for developing data-driven prognostic algorithms. The C-MAPSS datasets are also shown to be well-suited for development of new machine learning and pattern recognition tools for several key preprocessing steps such as feature extraction and selection, failure mode assessment, operating conditions assessment, health status estimation, uncertainty management, and prognostics performance evaluation. This paper summarizes a comprehensive literature review of publications using C-MAPSS datasets and provides guidelines and references to further usage of these datasets in a manner that allows clear and consistent comparison between different approaches.
CIFAR10-DVS: An Event-Stream Dataset for Object Classification
Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping
2017-01-01
Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task, because the data must be recorded using neuromorphic cameras. Currently, there are limited event-stream datasets available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named “CIFAR10-DVS.” The conversion to an event-stream dataset was implemented by a repeated closed-loop smooth (RCLS) movement of frame-based images. Unlike the conversion of frame-based images by moving the camera, the image movement is more realistic in respect of its practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark in event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm developments in event-driven pattern recognition and object classification. PMID:28611582
Area-to-point regression kriging for pan-sharpening
NASA Astrophysics Data System (ADS)
Wang, Qunming; Shi, Wenzhong; Atkinson, Peter M.
2016-04-01
Pan-sharpening is a technique to combine the fine spatial resolution panchromatic (PAN) band with the coarse spatial resolution multispectral bands of the same satellite to create a fine spatial resolution multispectral image. In this paper, area-to-point regression kriging (ATPRK) is proposed for pan-sharpening. ATPRK considers the PAN band as the covariate. Moreover, ATPRK is extended with a local approach, called adaptive ATPRK (AATPRK), which fits a regression model using a local, non-stationary scheme such that the regression coefficients change across the image. The two geostatistical approaches, ATPRK and AATPRK, were compared to the 13 state-of-the-art pan-sharpening approaches summarized in Vivone et al. (2015) in experiments on three separate datasets. ATPRK and AATPRK produced more accurate pan-sharpened images than the 13 benchmark algorithms in all three experiments. Unlike the benchmark algorithms, the two geostatistical solutions precisely preserved the spectral properties of the original coarse data. Furthermore, ATPRK can be enhanced by a local scheme in AATPRK, in cases where the residuals from a global regression model are such that their spatial character varies locally.
Laajala, Teemu D; Murtojärvi, Mika; Virkki, Arho; Aittokallio, Tero
2018-06-15
Prognostic models are widely used in clinical decision-making, such as risk stratification and tailoring treatment strategies, with the aim to improve patient outcomes while reducing overall healthcare costs. While prognostic models have been adopted into clinical use, benchmarking their performance has been difficult due to lack of open clinical datasets. The recent DREAM 9.5 Prostate Cancer Challenge carried out an extensive benchmarking of prognostic models for metastatic Castration-Resistant Prostate Cancer (mCRPC), based on multiple cohorts of open clinical trial data. We make available an open-source implementation of the top-performing model, ePCR, along with an extended toolbox for its further re-use and development, and demonstrate how to best apply the implemented model to real-world data cohorts of advanced prostate cancer patients. The open-source R-package ePCR and its reference documentation are available at the Comprehensive R Archive Network (CRAN): https://CRAN.R-project.org/package=ePCR. An R vignette provides step-by-step examples of ePCR usage. Supplementary data are available at Bioinformatics online.
BMDExpress Data Viewer: A Visualization Tool to Analyze BMDExpress Datasets
Regulatory agencies increasingly apply benchmark dose (BMD) modeling to determine points of departure in human risk assessments. BMDExpress applies BMD modeling to transcriptomics datasets and groups genes to biological processes and pathways for rapid assessment of doses at whic...
Federal Register 2010, 2011, 2012, 2013, 2014
2013-05-13
..., describes the fisheries, evaluates the status of the stock, estimates biological benchmarks, projects future.... Participants will evaluate and recommend datasets appropriate for assessment analysis, employ assessment models to evaluate stock status, estimate population benchmarks and management criteria, and project future...
HEp-2 cell image classification method based on very deep convolutional networks with small datasets
NASA Astrophysics Data System (ADS)
Lu, Mengchi; Gao, Long; Guo, Xifeng; Liu, Qiang; Yin, Jianping
2017-07-01
Classification of staining patterns in Human Epithelial-2 (HEp-2) cell images has been widely used to identify autoimmune diseases by the anti-Nuclear antibodies (ANA) test in the Indirect Immunofluorescence (IIF) protocol. Because manual testing is time consuming, subjective and labor intensive, image-based Computer Aided Diagnosis (CAD) systems for HEp-2 cell classification are being developed. However, recently proposed methods mostly rely on manual feature extraction and have low accuracy. Besides, the available benchmark datasets are small, which makes them poorly suited to deep learning methods. This issue directly influences the accuracy of cell classification, even after data augmentation. To address these issues, this paper presents a high accuracy automatic HEp-2 cell classification method with small datasets, by utilizing very deep convolutional networks (VGGNet). Specifically, the proposed method consists of three main phases, namely image preprocessing, feature extraction and classification. Moreover, an improved VGGNet is presented to address the challenges of small-scale datasets. Experimental results over two benchmark datasets demonstrate that the proposed method achieves superior performance in terms of accuracy compared with existing methods.
Evolutionary Wavelet Neural Network ensembles for breast cancer and Parkinson's disease prediction.
Khan, Maryam Mahsal; Mendes, Alexandre; Chalup, Stephan K
2018-01-01
Wavelet Neural Networks are a combination of neural networks and wavelets and have been mostly used in the area of time-series prediction and control. Recently, Evolutionary Wavelet Neural Networks have been employed to develop cancer prediction models. The present study proposes to use ensembles of Evolutionary Wavelet Neural Networks. The search for a high quality ensemble is directed by a fitness function that incorporates the accuracy of the classifiers both independently and as part of the ensemble itself. The ensemble approach is tested on three publicly available biomedical benchmark datasets, one on Breast Cancer and two on Parkinson's disease, using a 10-fold cross-validation strategy. Our experimental results show that, for the first dataset, the performance was similar to previous studies reported in literature. On the second dataset, the Evolutionary Wavelet Neural Network ensembles performed better than all previous methods. The third dataset is relatively new and this study is the first to report benchmark results. PMID:29420578
HPC Analytics Support. Requirements for Uncertainty Quantification Benchmarks
DOE Office of Scientific and Technical Information (OSTI.GOV)
Paulson, Patrick R.; Purohit, Sumit; Rodriguez, Luke R.
2015-05-01
This report outlines techniques for extending benchmark generation products so they support uncertainty quantification by benchmarked systems. We describe how uncertainty quantification requirements can be presented to candidate analytical tools supporting SPARQL. We describe benchmark data sets for evaluating uncertainty quantification, as well as an approach for using our benchmark generator to produce data sets for generating benchmark data sets.
A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video
2011-06-01
orders of magnitude larger than existing datasets such as CAVIAR [7]. TRECVID 2008 airport dataset [16] contains 100 hours of video, but it provides only...entire human figure (e.g., above shoulder), amounting to 500% human to video 2Some statistics are approximate, obtained from the CAVIAR 1st scene and...and diversity in both collection sites and viewpoints. In comparison to surveillance datasets such as CAVIAR [7] and TRECVID [16] shown in Fig. 3
NASA Astrophysics Data System (ADS)
Pernot, Pascal; Savin, Andreas
2018-06-01
Benchmarking studies in computational chemistry use reference datasets to assess the accuracy of a method through error statistics. The commonly used error statistics, such as the mean signed and mean unsigned errors, do not inform end-users on the expected amplitude of prediction errors attached to these methods. We show that, the distributions of model errors being neither normal nor zero-centered, these error statistics cannot be used to infer prediction error probabilities. To overcome this limitation, we advocate for the use of more informative statistics, based on the empirical cumulative distribution function of unsigned errors, namely, (1) the probability for a new calculation to have an absolute error below a chosen threshold and (2) the maximal amplitude of errors one can expect with a chosen high confidence level. Those statistics are also shown to be well suited for benchmarking and ranking studies. Moreover, the standard error on all benchmarking statistics depends on the size of the reference dataset. Systematic publication of these standard errors would be very helpful to assess the statistical reliability of benchmarking conclusions.
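The two statistics advocated above can be read directly off the empirical cumulative distribution function (ECDF) of the absolute errors. The sketch below is an illustration under stated assumptions, not the authors' code: the function name `ecdf_error_statistics` is hypothetical, the high-confidence amplitude is taken as the smallest observed absolute error whose ECDF value reaches the chosen confidence level, and the standard error of the probability uses the usual binomial approximation, which is one simple way to show the dependence on dataset size.

```python
import math


def ecdf_error_statistics(errors, threshold, confidence=0.95):
    """Return (p_below, q_conf, se_p) for a list of signed model errors.

    p_below: fraction of absolute errors strictly below `threshold`.
    q_conf:  smallest observed absolute error x with ECDF(x) >= confidence.
    se_p:    binomial standard error of p_below (an assumed approximation).
    """
    abs_err = sorted(abs(e) for e in errors)
    n = len(abs_err)
    if n == 0:
        raise ValueError("at least one error value is required")
    if not 0.0 < confidence <= 1.0:
        raise ValueError("confidence must lie in (0, 1]")

    # (1) probability that a new calculation has |error| < threshold
    p_below = sum(1 for e in abs_err if e < threshold) / n

    # (2) error amplitude not exceeded with the chosen confidence level
    k = math.ceil(confidence * n)  # rank of the empirical quantile
    q_conf = abs_err[k - 1]

    # the standard error shrinks as the reference dataset grows
    se_p = math.sqrt(p_below * (1.0 - p_below) / n)
    return p_below, q_conf, se_p


if __name__ == "__main__":
    # Hypothetical signed errors (arbitrary units) for five reference systems.
    sample = [-0.5, 0.2, 1.5, -2.0, 0.1]
    print(ecdf_error_statistics(sample, threshold=1.0))
```

With a reference set this small, the standard error of the probability is large, which is the point the abstract makes about reporting such uncertainties alongside benchmarking statistics.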
Benchmark datasets for 3D MALDI- and DESI-imaging mass spectrometry.
Oetjen, Janina; Veselkov, Kirill; Watrous, Jeramie; McKenzie, James S; Becker, Michael; Hauberg-Lotte, Lena; Kobarg, Jan Hendrik; Strittmatter, Nicole; Mróz, Anna K; Hoffmann, Franziska; Trede, Dennis; Palmer, Andrew; Schiffler, Stefan; Steinhorst, Klaus; Aichler, Michaela; Goldin, Robert; Guntinas-Lichius, Orlando; von Eggeling, Ferdinand; Thiele, Herbert; Maedler, Kathrin; Walch, Axel; Maass, Peter; Dorrestein, Pieter C; Takats, Zoltan; Alexandrov, Theodore
2015-01-01
Three-dimensional (3D) imaging mass spectrometry (MS) is an analytical chemistry technique for the 3D molecular analysis of a tissue specimen, entire organ, or microbial colonies on an agar plate. 3D-imaging MS has unique advantages over existing 3D imaging techniques, offers novel perspectives for understanding the spatial organization of biological processes, and has growing potential to be introduced into routine use in both biology and medicine. Owing to the sheer quantity of data generated, the visualization, analysis, and interpretation of 3D imaging MS data remain a significant challenge. Bioinformatics research in this field is hampered by the lack of publicly available benchmark datasets needed to evaluate and compare algorithms. High-quality 3D imaging MS datasets from different biological systems at several labs were acquired, supplied with overview images and scripts demonstrating how to read them, and deposited into MetaboLights, an open repository for metabolomics data. 3D imaging MS data were collected from five samples using two types of 3D imaging MS. 3D matrix-assisted laser desorption/ionization imaging (MALDI) MS data were collected from murine pancreas, murine kidney, human oral squamous cell carcinoma, and interacting microbial colonies cultured in Petri dishes. 3D desorption electrospray ionization (DESI) imaging MS data were collected from a human colorectal adenocarcinoma. With the aim to stimulate computational research in the field of computational 3D imaging MS, selected high-quality 3D imaging MS datasets are provided that could be used by algorithm developers as benchmark datasets.
Multi-Complementary Model for Long-Term Tracking
Zhang, Deng; Zhang, Junchang; Xia, Chenyang
2018-01-01
In recent years, video target tracking algorithms have been widely used. However, many tracking algorithms do not achieve satisfactory performance, especially when dealing with problems such as object occlusions, background clutters, motion blur, low illumination color images, and sudden illumination changes in real scenes. In this paper, we incorporate an object model based on contour information into a Staple tracker that combines the correlation filter model and color model to greatly improve the tracking robustness. Since each model is responsible for tracking specific features, the three complementary models combine for more robust tracking. In addition, we propose an efficient object detection model with contour and color histogram features, which has good detection performance and better detection efficiency compared to the traditional target detection algorithm. Finally, we optimize the traditional scale calculation, which greatly improves the tracking execution speed. We evaluate our tracker on the Object Tracking Benchmarks 2013 (OTB-13) and Object Tracking Benchmarks 2015 (OTB-15) benchmark datasets. With the OTB-13 benchmark datasets, our algorithm is improved by 4.8%, 9.6%, and 10.9% on the success plots of OPE, TRE and SRE, respectively, in contrast to another classic LCT (Long-term Correlation Tracking) algorithm. On the OTB-15 benchmark datasets, when compared with the LCT algorithm, our algorithm achieves 10.4%, 12.5%, and 16.1% improvement on the success plots of OPE, TRE, and SRE, respectively. At the same time, it needs to be emphasized that, due to the high computational efficiency of the color model and the object detection model using efficient data structures, and the speed advantage of the correlation filters, our tracking algorithm could still achieve good tracking speed. PMID:29425170
Fan, Ming; Zheng, Bin; Li, Lihua
2015-10-01
Knowledge of the structural class of a given protein is important for understanding its folding patterns. Although considerable effort has been made, predicting protein structural class solely from protein sequences remains a challenging problem. The feature extraction and classification of proteins are the main problems in prediction. In this research, we extended our earlier work regarding these two aspects. In protein feature extraction, we proposed a scheme that calculates the word frequency and word position from sequences of amino acid, reduced amino acid, and secondary structure. For an accurate classification of the structural class of protein, we developed a novel Multi-Agent Ada-Boost (MA-Ada) method by integrating the features of Multi-Agent system into Ada-Boost algorithm. Extensive experiments were conducted to test and compare the proposed method using four low-homology benchmark datasets. The results showed classification accuracies of 88.5%, 96.0%, 88.4%, and 85.5%, respectively, which are much better than those of the existing methods. The source code and dataset are available on request.
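The feature-extraction step above computes word frequency and word position from a sequence. The abstract does not define either quantity, so the following is only a hedged sketch: it assumes a "word" is a length-k substring, that frequency is the share of all words in the sequence, and that position is the mean normalised start index of a word's occurrences. The function name `word_features` and these definitions are illustrative assumptions, not the authors' scheme.

```python
from collections import defaultdict


def word_features(sequence, k=2):
    """Map each length-k word to (relative frequency, mean normalised position).

    Positions are 1-based start indices divided by the number of words,
    so they fall in (0, 1]. Both definitions are assumptions.
    """
    n_words = len(sequence) - k + 1
    if k < 1 or n_words <= 0:
        raise ValueError("sequence must contain at least one word of length k")

    counts = defaultdict(int)
    position_sums = defaultdict(float)
    for i in range(n_words):
        word = sequence[i:i + k]
        counts[word] += 1
        position_sums[word] += (i + 1) / n_words

    return {
        word: (counts[word] / n_words, position_sums[word] / counts[word])
        for word in counts
    }


if __name__ == "__main__":
    # Hypothetical secondary-structure string (H = helix, E = strand, C = coil).
    for word, (freq, pos) in sorted(word_features("HHHEECCHH", k=2).items()):
        print(word, round(freq, 3), round(pos, 3))
```

The same function could be applied to the amino acid sequence, a reduced-alphabet version of it, and the secondary-structure string, with the resulting dictionaries concatenated into one feature vector for the classifier.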
Daily Temperature and Precipitation Data for 518 Russian Meteorological Stations (1881 - 2010)
Bulygina, O. N. [All-Russian Research Institute of Hydrometeorological Information-World Data Centre]; Razuvaev, V. N. [All-Russian Research Institute of Hydrometeorological Information-World Data Centre]
2012-01-01
Over the past several decades, many climate datasets have been exchanged directly between the principal climate data centers of the United States (NOAA's National Climatic Data Center (NCDC)) and the former-USSR/Russia (All-Russian Research Institute for Hydrometeorological Information-World Data Center (RIHMI-WDC)). This data exchange has its roots in a bilateral initiative known as the Agreement on Protection of the Environment (Tatusko 1990). CDIAC has partnered with NCDC and RIHMI-WDC since the early 1990s to help make former-USSR climate datasets available to the public. The first former-USSR daily temperature and precipitation dataset released by CDIAC was initially created within the framework of the international cooperation between RIHMI-WDC and CDIAC and was published by CDIAC as NDP-040, consisting of data from 223 stations over the former USSR whose data were published in USSR Meteorological Monthly (Part 1: Daily Data). The database presented here consists of records from 518 Russian stations (excluding the former-USSR stations outside the Russian territory contained in NDP-040), for the most part extending through 2010. Records not extending through 2010 result from stations having closed or else their data were not published in Meteorological Monthly of CIS Stations (Part 1: Daily Data). The database was created from the digital media of the State Data Holding. The station inventory was arrived at using (a) the list of Roshydromet stations that are included in the Global Climate Observation Network (this list was approved by the Head of Roshydromet on 25 March 2004) and (b) the list of Roshydromet benchmark meteorological stations prepared by V.I. Kodratyuk, Head of the Department at Voeikov Main Geophysical Observatory.
NASA Astrophysics Data System (ADS)
Klos, Anna; Pottiaux, Eric; Van Malderen, Roeland; Bock, Olivier; Bogusz, Janusz
2017-04-01
A synthetic benchmark dataset of Integrated Water Vapour (IWV) was created within the "Data homogenisation" activity of sub-working group WG3 of COST ES1206 Action. The benchmark dataset was based on an analysis of IWV differences between values retrieved at Global Positioning System (GPS) International GNSS Service (IGS) stations and European Centre for Medium-Range Weather Forecasts (ECMWF) reanalysis data (ERA-Interim). Having analysed a set of 120 series of IWV differences (ERAI-GPS) derived for IGS stations, we derived parameters describing the number of gaps and breaks for each station. Moreover, we estimated the trends, the significant seasonalities and the character of the residuals once the deterministic model was removed. We tested five different noise models and found that a combination of white and first-order autoregressive processes describes the stochastic part with good accuracy. Based on this analysis, we performed Monte Carlo simulations of 25-year-long data with two different types of noise: white noise, and a combination of white and autoregressive processes. We also added a few strictly defined offsets, creating three variants of the synthetic dataset: easy, less-complicated and fully-complicated. The 'Easy' dataset included seasonal signals (annual, semi-annual, 3 and 4 months if present for a particular station), offsets and white noise. The 'Less-complicated' dataset included the above, as well as the combination of white and first-order autoregressive processes (AR(1)+WH). The 'Fully-complicated' dataset additionally included a trend and gaps. In this research, we show the impact of manual homogenisation on the estimates of the trend and its error. We also cross-compare the results for the three datasets, as the synthesised noise type might have a significant influence on manual homogenisation and, if inappropriately handled, might affect the trend values and their uncertainties. In the future, the synthetic dataset we present will be used as a benchmark to test various statistical homogenisation tools.
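The three dataset variants described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' simulation code: the function name `synthetic_series`, all amplitudes, noise parameters, offset epochs and gap positions are hypothetical, and the 3- and 4-month seasonal terms are left out for brevity.

```python
import math
import random


def synthetic_series(n_days, annual_amp, semiannual_amp, offsets,
                     white_sigma, ar1_phi=0.0, ar1_sigma=0.0,
                     trend_per_year=0.0, gaps=(), seed=0):
    """Daily synthetic series: seasonal terms + step offsets + noise.

    offsets: list of (epoch_day, size); each shifts the series from its epoch on.
    gaps:    list of (start_day, end_day), end exclusive; values become None.
    All parameter values used with this function are hypothetical.
    """
    rng = random.Random(seed)
    ar_state = 0.0
    series = []
    for day in range(n_days):
        t = day / 365.25  # time in years
        value = annual_amp * math.sin(2.0 * math.pi * t)
        value += semiannual_amp * math.sin(4.0 * math.pi * t)
        value += trend_per_year * t
        value += sum(size for epoch, size in offsets if day >= epoch)
        # First-order autoregressive noise: x_t = phi * x_(t-1) + e_t
        ar_state = ar1_phi * ar_state + rng.gauss(0.0, ar1_sigma)
        value += ar_state + rng.gauss(0.0, white_sigma)
        in_gap = any(start <= day < end for start, end in gaps)
        series.append(None if in_gap else value)
    return series


n = int(25 * 365.25)                   # 25 years of daily values
breaks = [(3000, 0.8), (6500, -0.5)]   # hypothetical offsets

# 'Easy': seasonal signals, offsets and white noise
easy = synthetic_series(n, 1.0, 0.3, breaks, white_sigma=0.5)
# 'Less-complicated': adds AR(1) noise on top of the white noise
less_complicated = synthetic_series(n, 1.0, 0.3, breaks, white_sigma=0.5,
                                    ar1_phi=0.6, ar1_sigma=0.3)
# 'Fully-complicated': additionally a trend and gaps
fully_complicated = synthetic_series(n, 1.0, 0.3, breaks, white_sigma=0.5,
                                     ar1_phi=0.6, ar1_sigma=0.3,
                                     trend_per_year=0.02, gaps=[(4000, 4100)])
```

Keeping the offsets as explicit (epoch, size) pairs means the true break positions are known, which is what allows a homogenisation tool to be scored against such a benchmark.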
A Benchmark and Comparative Study of Video-Based Face Recognition on COX Face Database.
Huang, Zhiwu; Shan, Shiguang; Wang, Ruiping; Zhang, Haihong; Lao, Shihong; Kuerban, Alifu; Chen, Xilin
2015-12-01
Face recognition with still face images has been widely studied, while research on video-based face recognition remains relatively inadequate, especially in terms of benchmark datasets and comparisons. Real-world video-based face recognition applications require techniques for three distinct scenarios: 1) Video-to-Still (V2S); 2) Still-to-Video (S2V); and 3) Video-to-Video (V2V), respectively, taking video or still image as query or target. To the best of our knowledge, few datasets and evaluation protocols have been benchmarked for all three scenarios. In order to facilitate the study of this specific topic, this paper contributes a benchmarking and comparative study based on a newly collected still/video face database, named COX Face DB. Specifically, we make three contributions. First, we collect and release a large-scale still/video face database to simulate video surveillance with three different video-based face recognition scenarios (i.e., V2S, S2V, and V2V). Second, for benchmarking the three scenarios designed on our database, we review and experimentally compare a number of existing set-based methods. Third, we further propose a novel Point-to-Set Correlation Learning (PSCL) method, and experimentally show that it can be used as a promising baseline method for V2S/S2V face recognition on COX Face DB. Extensive experimental results clearly demonstrate that video-based face recognition needs more effort, and that our COX Face DB is a good benchmark database for evaluation.
Improving Protein Fold Recognition by Deep Learning Networks.
Jo, Taeho; Hou, Jie; Eickholt, Jesse; Cheng, Jianlin
2015-12-04
For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict if a given query-template protein pair belongs to the same structural fold. The input used stemmed from the protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl's benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed results comparable to ensemble DN-Fold at the family and superfamily levels. Finally, we extended the binary classification problem of fold recognition to a real-value regression task, which also showed promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.
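The Top 1 and Top 5 recognition rates quoted above can be read as top-k hit rates over ranked template lists. The following is a minimal sketch under that assumption; the function name `top_k_rate` and the toy query/template identifiers are hypothetical and do not come from DN-Fold.

```python
def top_k_rate(ranked_templates, true_matches, k):
    """Fraction of queries with at least one correct template in the top k.

    ranked_templates: dict mapping query id -> list of template ids, best first
    true_matches:     dict mapping query id -> set of correct template ids
    """
    hits = 0
    for query, ranking in ranked_templates.items():
        if any(template in true_matches[query] for template in ranking[:k]):
            hits += 1
    return hits / len(ranked_templates)


# Hypothetical toy data: two queries, three ranked templates each
ranked = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
truth = {"q1": {"a"}, "q2": {"f"}}

top1 = top_k_rate(ranked, truth, 1)
top3 = top_k_rate(ranked, truth, 3)
```

In this toy case only q1 has a correct template ranked first, while both queries have one within the first three, so the rate can only stay equal or rise as k grows.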
A global water resources ensemble of hydrological models: the eartH2Observe Tier-1 dataset
NASA Astrophysics Data System (ADS)
Schellekens, Jaap; Dutra, Emanuel; Martínez-de la Torre, Alberto; Balsamo, Gianpaolo; van Dijk, Albert; Sperna Weiland, Frederiek; Minvielle, Marie; Calvet, Jean-Christophe; Decharme, Bertrand; Eisner, Stephanie; Fink, Gabriel; Flörke, Martina; Peßenteiner, Stefanie; van Beek, Rens; Polcher, Jan; Beck, Hylke; Orth, René; Calton, Ben; Burke, Sophia; Dorigo, Wouter; Weedon, Graham P.
2017-07-01
The dataset presented here consists of an ensemble of 10 global hydrological and land surface models for the period 1979-2012 using a reanalysis-based meteorological forcing dataset (0.5° resolution). The current dataset serves as a state of the art in current global hydrological modelling and as a benchmark for further improvements in the coming years. A signal-to-noise ratio analysis revealed low inter-model agreement over (i) snow-dominated regions and (ii) tropical rainforest and monsoon areas. The large uncertainty of precipitation in the tropics is not reflected in the ensemble runoff. Verification of the results against benchmark datasets for evapotranspiration, snow cover, snow water equivalent, soil moisture anomaly and total water storage anomaly using the tools from The International Land Model Benchmarking Project (ILAMB) showed overall useful model performance, while the ensemble mean generally outperformed the single model estimates. The results also show that there is currently no single best model for all variables and that model performance is spatially variable. In our unconstrained model runs the ensemble mean of total runoff into the ocean was 46 268 km³ yr⁻¹ (334 kg m⁻² yr⁻¹), while the ensemble mean of total evaporation was 537 kg m⁻² yr⁻¹. All data are made available openly through a Water Cycle Integrator portal (WCI, wci.earth2observe.eu), and via a direct http and ftp download. The portal follows the protocols of the Open Geospatial Consortium such as OPeNDAP, WCS and WMS. The DOI for the data is https://doi.org/10.5281/zenodo.167070.
Federal Register 2010, 2011, 2012, 2013, 2014
2012-11-26
... coverage in the individual and small group markets, Medicaid benchmark and benchmark-equivalent plans...) Act extends the coverage of the EHB package to issuers of non-grandfathered individual and small group... small group markets, and not to Medicaid benchmark or benchmark-equivalent plans. EHB applicability to...
Kirwan, Jennifer A; Weber, Ralf J M; Broadhurst, David I; Viant, Mark R
2014-01-01
Direct-infusion mass spectrometry (DIMS) metabolomics is an important approach for characterising molecular responses of organisms to disease, drugs and the environment. Increasingly large-scale metabolomics studies are being conducted, necessitating improvements in both bioanalytical and computational workflows to maintain data quality. This dataset represents a systematic evaluation of the reproducibility of a multi-batch DIMS metabolomics study of cardiac tissue extracts. It comprises twenty biological samples (cow vs. sheep) that were analysed repeatedly, in 8 batches across 7 days, together with a concurrent set of quality control (QC) samples. Data are presented from each step of the workflow and are available in MetaboLights. The strength of the dataset is that intra- and inter-batch variation can be corrected using QC spectra and the quality of this correction assessed independently using the repeatedly-measured biological samples. Originally designed to test the efficacy of a batch-correction algorithm, it will enable others to evaluate novel data processing algorithms. Furthermore, this dataset serves as a benchmark for DIMS metabolomics, derived using best-practice workflows and rigorous quality assessment. PMID:25977770
DOE Office of Scientific and Technical Information (OSTI.GOV)
McLoughlin, K.
2016-01-22
The software application “MetaQuant” was developed by our group at Lawrence Livermore National Laboratory (LLNL). It is designed to profile microbial populations in a sample using data from whole-genome shotgun (WGS) metagenomic DNA sequencing. Several other metagenomic profiling applications have been described in the literature. We ran a series of benchmark tests to compare the performance of MetaQuant against that of a few existing profiling tools, using real and simulated sequence datasets. This report describes our benchmarking procedure and results.
ERIC Educational Resources Information Center
Wall, Andrew; Frost, Robert; Smith, Ryan; Keeling, Richard
2008-01-01
Although datasets such as the Integrated Postsecondary Data System are available as inputs to higher education funding formulas, these datasets can be unreliable, incomplete, or unresponsive to criteria identified by state education officials. State formulas do not always match the state's economic and human capital goals. This article analyzes…
A Novel Performance Evaluation Methodology for Single-Target Trackers.
Kristan, Matej; Matas, Jiri; Leonardis, Ales; Vojir, Tomas; Pflugfelder, Roman; Fernandez, Gustavo; Nebehay, Georg; Porikli, Fatih; Cehovin, Luka
2016-11-01
This paper addresses the problem of single-target tracker performance evaluation. We consider the performance measures, the dataset and the evaluation system to be the most important components of tracker evaluation and propose requirements for each of them. The requirements are the basis of a new evaluation methodology that aims at a simple and easily interpretable tracker comparison. The ranking-based methodology addresses tracker equivalence in terms of statistical significance and practical differences. A fully-annotated dataset with per-frame annotations with several visual attributes is introduced. The diversity of its visual properties is maximized in a novel way by clustering a large number of videos according to their visual attributes. This makes it the most sophistically constructed and annotated dataset to date. A multi-platform evaluation system allowing easy integration of third-party trackers is presented as well. The proposed evaluation methodology was tested on the VOT2014 challenge on the new dataset and 38 trackers, making it the largest benchmark to date. Most of the tested trackers are indeed state-of-the-art since they outperform the standard baselines, resulting in a highly-challenging benchmark. An exhaustive analysis of the dataset from the perspective of tracking difficulty is carried out. To facilitate tracker comparison a new performance visualization technique is proposed.
Huang, Chien-Hung; Peng, Huai-Shun; Ng, Ka-Lok
2015-01-01
Many proteins are known to be associated with cancer diseases. It is quite often that their precise functional role in disease pathogenesis remains unclear. A strategy to gain a better understanding of the function of these proteins is to make use of a combination of different aspects of proteomics data types. In this study, we extended Aragues's method by employing the protein-protein interaction (PPI) data, domain-domain interaction (DDI) data, weighted domain frequency score (DFS), and cancer linker degree (CLD) data to predict cancer proteins. Performances were benchmarked based on three kinds of experiments as follows: (I) using individual algorithm, (II) combining algorithms, and (III) combining the same classification types of algorithms. When compared with Aragues's method, our proposed methods, that is, machine learning algorithm and voting with the majority, are significantly superior in all seven performance measures. We demonstrated the accuracy of the proposed method on two independent datasets. The best algorithm can achieve a hit ratio of 89.4% and 72.8% for lung cancer dataset and lung cancer microarray study, respectively. It is anticipated that the current research could help understand disease mechanisms and diagnosis. PMID:25866773
Predicting drug-target interactions by dual-network integrated logistic matrix factorization
NASA Astrophysics Data System (ADS)
Hao, Ming; Bryant, Stephen H.; Wang, Yanli
2017-01-01
In this work, we propose a dual-network integrated logistic matrix factorization (DNILMF) algorithm to predict potential drug-target interactions (DTI). The prediction procedure consists of four steps: (1) inferring new drug/target profiles and constructing profile kernel matrix; (2) diffusing drug profile kernel matrix with drug structure kernel matrix; (3) diffusing target profile kernel matrix with target sequence kernel matrix; and (4) building DNILMF model and smoothing new drug/target predictions based on their neighbors. We compare our algorithm with the state-of-the-art method based on the benchmark dataset. Results indicate that the DNILMF algorithm outperforms the previously reported approaches in terms of AUPR (area under precision-recall curve) and AUC (area under curve of receiver operating characteristic) based on the 5 trials of 10-fold cross-validation. We conclude that the performance improvement depends on not only the proposed objective function, but also the used nonlinear diffusion technique which is important but under studied in the DTI prediction field. In addition, we also compile a new DTI dataset for increasing the diversity of currently available benchmark datasets. The top prediction results for the new dataset are confirmed by experimental studies or supported by other computational research.
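The two evaluation measures named above, AUC and AUPR, can be computed from predicted interaction scores along the following lines. This is a hedged sketch, not the DNILMF evaluation code: the function names are hypothetical, AUPR is approximated here by average precision (one common estimator), and the four-pair example is made up.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    Tied scores are ordered arbitrarily by the sort, so this assumes few ties.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits


# Hypothetical drug-target pairs: 1 = known interaction, 0 = unknown
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

auc = roc_auc(labels, scores)
aupr = average_precision(labels, scores)
```

In a cross-validation setting such as the one described above, these functions would be applied to the held-out pairs of each fold and the results averaged over folds and trials.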
Cancer Detection in Microarray Data Using a Modified Cat Swarm Optimization Clustering Approach
M, Pandi; R, Balamurugan; N, Sadhasivam
2017-12-29
Objective: A better understanding of functional genomics can be obtained by extracting patterns hidden in gene expression data. This could have paramount implications for cancer diagnosis, gene treatments and other domains. Clustering may reveal natural structures and identify interesting patterns in underlying data. The main objective of this research was to derive a heuristic approach to detection of highly co-expressed genes related to cancer from gene expression data with minimum Mean Squared Error (MSE). Methods: A modified CSO algorithm using Harmony Search (MCSO-HS) for clustering cancer gene expression data was applied. Experimental results were analyzed using two cancer gene expression benchmark datasets, namely for leukaemia and for breast cancer. Result: In terms of MSE, MCSO-HS performed better than HS and CSO by 13% and 9%, respectively, on the leukaemia dataset, and by 22% and 17%, respectively, on the breast cancer dataset. Conclusion: The results showed MCSO-HS to outperform HS and CSO with both benchmark datasets. To validate the clustering results, this work was tested with internal and external cluster validation indices. This work also points to biological validation of clusters with gene ontology in terms of function, process and component.
Metric Evaluation Pipeline for 3d Modeling of Urban Scenes
NASA Astrophysics Data System (ADS)
Bosch, M.; Leichtman, A.; Chilcott, D.; Goldberg, H.; Brown, M.
2017-05-01
Publicly available benchmark data and metric evaluation approaches have been instrumental in enabling research to advance state of the art methods for remote sensing applications in urban 3D modeling. Most publicly available benchmark datasets have consisted of high resolution airborne imagery and lidar suitable for 3D modeling on a relatively modest scale. To enable research in larger scale 3D mapping, we have recently released a public benchmark dataset with multi-view commercial satellite imagery and metrics to compare 3D point clouds with lidar ground truth. We now define a more complete metric evaluation pipeline developed as publicly available open source software to assess semantically labeled 3D models of complex urban scenes derived from multi-view commercial satellite imagery. Evaluation metrics in our pipeline include horizontal and vertical accuracy and completeness, volumetric completeness and correctness, perceptual quality, and model simplicity. Sources of ground truth include airborne lidar and overhead imagery, and we demonstrate a semi-automated process for producing accurate ground truth shape files to characterize building footprints. We validate our current metric evaluation pipeline using 3D models produced using open source multi-view stereo methods. Data and software are made publicly available to enable further research and planned benchmarking activities.
Boulesteix, Anne-Laure; Wilson, Rory; Hapfelmeier, Alexander
2017-09-09
The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly "evidence-based". Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of "evidence-based" statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. We suggest that benchmark studies-a method of assessment of statistical methods using real-world datasets-might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.
Li, Jia; Xia, Changqun; Chen, Xiaowu
2017-10-12
Image-based salient object detection (SOD) has been extensively studied in past decades. However, video-based SOD is much less explored due to the lack of large-scale video datasets within which salient objects are unambiguously defined and annotated. Toward this end, this paper proposes a video-based SOD dataset that consists of 200 videos. In constructing the dataset, we manually annotate all objects and regions over 7,650 uniformly sampled keyframes and collect the eye-tracking data of 23 subjects who free-view all videos. From the user data, we find that salient objects in a video can be defined as objects that consistently pop-out throughout the video, and objects with such attributes can be unambiguously annotated by combining manually annotated object/region masks with eye-tracking data of multiple subjects. To the best of our knowledge, it is currently the largest dataset for video-based salient object detection. Based on this dataset, this paper proposes an unsupervised baseline approach for video-based SOD by using saliency-guided stacked autoencoders. In the proposed approach, multiple spatiotemporal saliency cues are first extracted at the pixel, superpixel and object levels. With these saliency cues, stacked autoencoders are constructed in an unsupervised manner that automatically infers a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. In experiments, the proposed unsupervised approach is compared with 31 state-of-the-art models on the proposed dataset and outperforms 30 of them, including 19 image-based classic (unsupervised or non-deep learning) models, six image-based deep learning models, and five video-based unsupervised models. Moreover, benchmarking results show that the proposed dataset is very challenging and has the potential to boost the development of video-based SOD.
Versari, Cristian; Stoma, Szymon; Batmanov, Kirill; Llamosi, Artémis; Mroz, Filip; Kaczmarek, Adam; Deyell, Matt; Lhoussaine, Cédric; Hersen, Pascal; Batt, Gregory
2017-02-01
With the continuous expansion of single cell biology, the observation of the behaviour of individual cells over extended durations and with high accuracy has become a problem of central importance. Surprisingly, even for yeast cells that have relatively regular shapes, no solution has been proposed that reaches the high quality required for long-term experiments for segmentation and tracking (S&T) based on brightfield images. Here, we present CellStar, a tool chain designed to achieve good performance in long-term experiments. The key features are the use of a new variant of parametrized active rays for segmentation, a neighbourhood-preserving criterion for tracking, and the use of an iterative approach that incrementally improves S&T quality. A graphical user interface enables manual corrections of S&T errors and their use for the automated correction of other, related errors and for parameter learning. We created a benchmark dataset with manually analysed images and compared CellStar with six other tools, showing its high performance, notably in long-term tracking. As a community effort, we set up a website, the Yeast Image Toolkit, with the benchmark and the Evaluation Platform to gather this and additional information provided by others. PMID:28179544
Improving Protein Fold Recognition by Deep Learning Networks
NASA Astrophysics Data System (ADS)
Jo, Taeho; Hou, Jie; Eickholt, Jesse; Cheng, Jianlin
2015-12-01
For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict if a given query-template protein pair belongs to the same structural fold. The input used stemmed from the protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl’s benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed results comparable to ensemble DN-Fold at the family and superfamily levels. Finally, we extended the binary classification problem of fold recognition to a real-value regression task, which also showed promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.
Spectral relative standard deviation: a practical benchmark in metabolomics.
Parsons, Helen M; Ekman, Drew R; Collette, Timothy W; Viant, Mark R
2009-03-01
Metabolomics datasets, by definition, comprise measurements of large numbers of metabolites. Both technical (analytical) and biological factors will induce variation within these measurements that is not consistent across all metabolites. Consequently, criteria are required to assess the reproducibility of metabolomics datasets that are derived from all the detected metabolites. Here we calculate spectrum-wide relative standard deviations (RSDs; also termed coefficient of variation, CV) for ten metabolomics datasets, spanning a variety of sample types from mammals, fish, invertebrates and a cell line, and display them succinctly as boxplots. We demonstrate multiple applications of spectral RSDs for characterising technical as well as inter-individual biological variation: for optimising metabolite extractions, comparing analytical techniques, investigating matrix effects, and comparing biofluids and tissue extracts from single and multiple species for optimising experimental design. Technical variation within metabolomics datasets, recorded using one- and two-dimensional NMR and mass spectrometry, ranges from 1.6 to 20.6% (reported as the median spectral RSD). Inter-individual biological variation is typically larger, ranging from as low as 7.2% for tissue extracts from laboratory-housed rats to 58.4% for fish plasma. In addition, for some of the datasets we confirm that the spectral RSD values are largely invariant across different spectral processing methods, such as baseline correction, normalisation and binning resolution. In conclusion, we propose spectral RSDs and their median values contained herein as practical benchmarks for metabolomics studies.
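The median spectral RSD described above can be illustrated with a short calculation. This is a sketch under assumptions: the function name `median_spectral_rsd` is hypothetical, the sample standard deviation (n - 1 denominator) is assumed, and the intensities are made-up numbers rather than data from the study.

```python
import statistics


def median_spectral_rsd(spectra):
    """Median relative standard deviation (in percent) across all peaks.

    spectra: list of replicate spectra, each a list of peak intensities
             with the same peaks in the same order.
    """
    rsds = []
    for peak_values in zip(*spectra):  # iterate peak by peak across replicates
        mean = statistics.mean(peak_values)
        if mean == 0:
            continue  # RSD is undefined for a zero mean
        rsds.append(100.0 * statistics.stdev(peak_values) / mean)
    return statistics.median(rsds)


# Hypothetical data: three replicate spectra with two peaks each
replicates = [
    [10.0, 100.0],
    [12.0, 100.0],
    [8.0, 100.0],
]
rsd = median_spectral_rsd(replicates)
```

Here the first peak has an RSD of 20% and the second 0%, so the median over the two peaks is 10%. Summarising by the median rather than the mean keeps a few very noisy peaks from dominating the benchmark value.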
Berthon, Beatrice; Spezi, Emiliano; Galavis, Paulina; Shepherd, Tony; Apte, Aditya; Hatt, Mathieu; Fayad, Hadi; De Bernardi, Elisabetta; Soffientini, Chiara D; Ross Schmidtlein, C; El Naqa, Issam; Jeraj, Robert; Lu, Wei; Das, Shiva; Zaidi, Habib; Mawlawi, Osama R; Visvikis, Dimitris; Lee, John A; Kirov, Assen S
2017-08-01
The aim of this paper is to define the requirements and describe the design and implementation of a standard benchmark tool for evaluation and validation of PET-auto-segmentation (PET-AS) algorithms. This work follows the recommendations of Task Group 211 (TG211) appointed by the American Association of Physicists in Medicine (AAPM). The recommendations published in the AAPM TG211 report were used to derive a set of required features and to guide the design and structure of a benchmarking software tool. These items included the selection of appropriate representative data and reference contours obtained from established approaches and the description of available metrics. The benchmark was designed in a way that it could be extendable by inclusion of bespoke segmentation methods, while maintaining its main purpose of being a standard testing platform for newly developed PET-AS methods. An example of implementation of the proposed framework, named PETASset, was built. In this work, a selection of PET-AS methods representing common approaches to PET image segmentation was evaluated within PETASset for the purpose of testing and demonstrating the capabilities of the software as a benchmark platform. A selection of clinical, physical, and simulated phantom data, including "best estimates" reference contours from macroscopic specimens, simulation template, and CT scans was built into the PETASset application database. Specific metrics such as Dice Similarity Coefficient (DSC), Positive Predictive Value (PPV), and Sensitivity (S), were included to allow the user to compare the results of any given PET-AS algorithm to the reference contours. In addition, a tool to generate structured reports on the evaluation of the performance of PET-AS algorithms against the reference contours was built. 
Across the PET-AS methods evaluated for demonstration, the agreement values with the reference contours ranged from 0.51 to 0.83 for DSC, from 0.44 to 0.86 for PPV, and from 0.61 to 1.00 for S. Examples of agreement limits were provided to show how the software could be used to evaluate a new algorithm against the existing state of the art. PETASset provides a platform that allows standardizing the evaluation and comparison of different PET-AS methods on a wide range of PET datasets. The developed platform will be available to users willing to evaluate their PET-AS methods and contribute with more evaluation datasets. © 2017 The Authors. Medical Physics published by Wiley Periodicals, Inc. on behalf of American Association of Physicists in Medicine.
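The three overlap metrics named in the abstract above can be illustrated with a minimal sketch. It assumes each contour is given as a set of voxel indices; the function name `overlap_metrics` and the toy voxel sets are hypothetical and are not taken from PETASset.

```python
def overlap_metrics(segmented, reference):
    """Return (DSC, PPV, sensitivity) for two collections of voxel indices."""
    seg, ref = set(segmented), set(reference)
    tp = len(seg & ref)   # voxels inside both contours
    fp = len(seg - ref)   # voxels segmented but outside the reference
    fn = len(ref - seg)   # reference voxels missed by the segmentation
    dsc = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    return dsc, ppv, sens


# Toy example (made-up voxels): 3 shared voxels, 1 false positive, 1 false negative.
reference = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
segmented = {(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1)}
dsc, ppv, sens = overlap_metrics(segmented, reference)
print(dsc, ppv, sens)
```

DSC penalises both over- and under-segmentation, whereas PPV and sensitivity separate the two error types, which is one reason to report all three together.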
The Harvard organic photovoltaic dataset
Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; ...
2016-09-27
Presented in this work is the Harvard Organic Photovoltaic Dataset (HOPV15), a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.
The Harvard organic photovoltaic dataset
Lopez, Steven A.; Pyzer-Knapp, Edward O.; Simm, Gregor N.; Lutzow, Trevor; Li, Kewei; Seress, Laszlo R.; Hachmann, Johannes; Aspuru-Guzik, Alán
2016-01-01
The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications. PMID:27676312
Benchmarking Foot Trajectory Estimation Methods for Mobile Gait Analysis
Ollenschläger, Malte; Roth, Nils; Klucken, Jochen
2017-01-01
Mobile gait analysis systems based on inertial sensing on the shoe are applied in a wide range of applications. Especially for medical applications, they can give new insights into motor impairment in, e.g., neurodegenerative disease and help objectify patient assessment. One key component in these systems is the reconstruction of the foot trajectories from inertial data. In literature, various methods for this task have been proposed. However, performance is evaluated on a variety of datasets due to the lack of large, generally accepted benchmark datasets. This hinders a fair comparison of methods. In this work, we implement three orientation estimation and three double integration schemes for use in a foot trajectory estimation pipeline. All methods are drawn from literature and evaluated against a marker-based motion capture reference. We provide a fair comparison on the same dataset consisting of 735 strides from 16 healthy subjects. As a result, the implemented methods are ranked and we identify the most suitable processing pipeline for foot trajectory estimation in the context of mobile gait analysis. PMID:28832511
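As a rough illustration of what a double integration scheme does, the sketch below integrates a single axis of acceleration over one stride and removes linear velocity drift. It assumes the acceleration is already gravity-compensated and expressed in a fixed frame, and that the foot is at rest at both ends of the stride. It is a generic textbook scheme, not necessarily one of the three evaluated in the paper, and `integrate_stride` is a hypothetical name.

```python
def integrate_stride(acc, dt):
    """Double-integrate one stride of 1-D acceleration with linear dedrifting."""
    # First integration (trapezoidal rule): acceleration -> velocity.
    vel = [0.0]
    for a0, a1 in zip(acc, acc[1:]):
        vel.append(vel[-1] + 0.5 * (a0 + a1) * dt)
    # Zero-velocity assumption: the velocity left over at the end of the
    # stride is treated as drift and removed linearly over the stride.
    n = len(vel) - 1
    drift = vel[-1]
    vel = [v - drift * i / n for i, v in enumerate(vel)]
    # Second integration: velocity -> position.
    pos = [0.0]
    for v0, v1 in zip(vel, vel[1:]):
        pos.append(pos[-1] + 0.5 * (v0 + v1) * dt)
    return vel, pos


# Toy stride: a symmetric acceleration profile plus a constant sensor bias.
bias = 0.1
acc = [a + bias for a in (0.0, 1.0, 0.0, -1.0, 0.0)]
vel, pos = integrate_stride(acc, dt=0.1)
print(vel[-1], pos[-1])
```

Because a constant bias produces a linearly growing velocity error, linear dedrifting cancels it in this toy case; real sensor errors are not this well behaved.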
Hagopian, Raffi; Davidson, John R; Datta, Ruchira S; Samad, Bushra; Jarvis, Glen R; Sjölander, Kimmen
2010-07-01
We present the jump-start simultaneous alignment and tree construction using hidden Markov models (SATCHMO-JS) web server for simultaneous estimation of protein multiple sequence alignments (MSAs) and phylogenetic trees. The server takes as input a set of sequences in FASTA format, and outputs a phylogenetic tree and MSA; these can be viewed online or downloaded from the website. SATCHMO-JS is an extension of the SATCHMO algorithm, and employs a divide-and-conquer strategy to jump-start SATCHMO at a higher point in the phylogenetic tree, reducing the computational complexity of the progressive all-versus-all HMM-HMM scoring and alignment. Results on a benchmark dataset of 983 structurally aligned pairs from the PREFAB benchmark dataset show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, Multiple Alignment using Fast Fourier Transform (MAFFT), ClustalW and the original SATCHMO algorithm. The SATCHMO-JS webserver is available at http://phylogenomics.berkeley.edu/satchmo-js. The datasets used in these experiments are available for download at http://phylogenomics.berkeley.edu/satchmo-js/supplementary/.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Madduri, Kamesh; Ediger, David; Jiang, Karl
2009-02-15
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Madduri, Kamesh; Ediger, David; Jiang, Karl
2009-05-29
We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
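For readers unfamiliar with the kernel itself, the sketch below computes exact betweenness centrality for an unweighted, undirected graph with the standard sequential Brandes algorithm. It is not the lock-free parallel algorithm of the abstract above; it only shows the per-source breadth-first search and dependency accumulation that such implementations parallelise. The adjacency-dict input format is an assumption.

```python
from collections import deque


def betweenness_centrality(adj):
    """Sequential Brandes algorithm; adj maps each vertex to its neighbour list."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}   # number of shortest s-v paths
        dist = {v: -1 for v in adj}   # BFS distance from s (-1 = unvisited)
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        sigma[s], dist[s] = 1, 0
        order = []                    # vertices in order of discovery
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:       # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest vertices back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph every pair is counted from both endpoints.
    return {v: c / 2.0 for v, c in bc.items()}


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness_centrality(path))
```

On the three-vertex path, only the middle vertex lies on a shortest path between two others, so it is the only vertex with non-zero centrality.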
Stanislawski, L.V.
2009-01-01
The United States Geological Survey has been researching generalization approaches to enable multiple-scale display and delivery of geographic data. This paper presents automated methods to prune network and polygon features of the United States high-resolution National Hydrography Dataset (NHD) to lower resolutions. Feature-pruning rules, data enrichment, and partitioning are derived from knowledge of surface water, the NHD model, and associated feature specification standards. Relative prominence of network features is estimated from upstream drainage area (UDA). Network and polygon features are pruned by UDA and NHD reach code to achieve a drainage density appropriate for any less detailed map scale. Data partitioning maintains local drainage density variations that characterize the terrain. For demonstration, a 48 subbasin area of 1:24 000-scale NHD was pruned to 1:100 000-scale (100 K) and compared to a benchmark, the 100 K NHD. The coefficient of line correspondence (CLC) is used to evaluate how well pruned network features match the benchmark network. CLC values of 0.82 and 0.77 result from pruning with and without partitioning, respectively. The number of polygons that remain after pruning is about seven times that of the benchmark, but the area covered by the polygons that remain after pruning is only about 10% greater than the area covered by benchmark polygons. © 2009.
Integrative Analysis of High-throughput Cancer Studies with Contrasted Penalization
Shi, Xingjie; Liu, Jin; Huang, Jian; Zhou, Yong; Shia, BenChang; Ma, Shuangge
2015-01-01
In cancer studies with high-throughput genetic and genomic measurements, integrative analysis provides a way to effectively pool and analyze heterogeneous raw data from multiple independent studies and outperforms “classic” meta-analysis and single-dataset analysis. When marker selection is of interest, the genetic basis of multiple datasets can be described using the homogeneity model or the heterogeneity model. In this study, we consider marker selection under the heterogeneity model, which includes the homogeneity model as a special case and can be more flexible. Penalization methods have been developed in the literature for marker selection. This study advances from the published ones by introducing the contrast penalties, which can accommodate the within- and across-dataset structures of covariates/regression coefficients and, by doing so, further improve marker selection performance. Specifically, we develop a penalization method that accommodates the across-dataset structures by smoothing over regression coefficients. An effective iterative algorithm, which calls an inner coordinate descent iteration, is developed. Simulation shows that the proposed method outperforms the benchmark with more accurate marker identification. The analysis of breast cancer and lung cancer prognosis studies with gene expression measurements shows that the proposed method identifies genes different from those using the benchmark and has better prediction performance. PMID:24395534
Miao, Zhichao; Westhof, Eric
2016-07-08
RBscore&NBench combines a web server, RBscore and a database, NBench. RBscore predicts RNA-/DNA-binding residues in proteins and visualizes the prediction scores and features on protein structures. The scoring scheme of RBscore directly links feature values to nucleic acid binding probabilities and illustrates the nucleic acid binding energy funnel on the protein surface. To avoid dataset, binding site definition and assessment metric biases, we compared RBscore with 18 web servers and 3 stand-alone programs on 41 datasets, which demonstrated the high and stable accuracy of RBscore. A comprehensive comparison led us to develop a benchmark database named NBench. The web server is available on: http://ahsoka.u-strasbg.fr/rbscorenbench/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Large-scale Labeled Datasets to Fuel Earth Science Deep Learning Applications
NASA Astrophysics Data System (ADS)
Maskey, M.; Ramachandran, R.; Miller, J.
2017-12-01
Deep learning has revolutionized computer vision and natural language processing with various algorithms scaled using high-performance computing. However, generic large-scale labeled datasets such as the ImageNet are the fuel that drives the impressive accuracy of deep learning results. Large-scale labeled datasets already exist in domains such as medical science, but creating them in the Earth science domain is a challenge. While there are ways to apply deep learning using limited labeled datasets, there is a need in the Earth sciences for creating large-scale labeled datasets for benchmarking and scaling deep learning applications. At the NASA Marshall Space Flight Center, we are using deep learning for a variety of Earth science applications where we have encountered the need for large-scale labeled datasets. We will discuss our approaches for creating such datasets and why these datasets are just as valuable as deep learning algorithms. We will also describe successful usage of these large-scale labeled datasets with our deep learning based applications.
MatchingLand, geospatial data testbed for the assessment of matching methods.
Xavier, Emerson M A; Ariza-López, Francisco J; Ureña-Cámara, Manuel A
2017-12-05
This article presents datasets prepared with the aim of helping the evaluation of geospatial matching methods for vector data. These datasets were built up from mapping data produced by official Spanish mapping agencies. The testbed supplied encompasses the three geometry types: point, line and area. Initial datasets were submitted to geometric transformations in order to generate synthetic datasets. These transformations represent factors that might influence the performance of geospatial matching methods, like the morphology of linear or areal features, systematic transformations, and random disturbance over initial data. We call our 11 GiB benchmark data 'MatchingLand' and we hope it can be useful for the geographic information science research community.
PredictSNP: Robust and Accurate Consensus Classifier for Prediction of Disease-Related Mutations
Bendl, Jaroslav; Stourac, Jan; Salanda, Ondrej; Pavelka, Antonin; Wieben, Eric D.; Zendulka, Jaroslav; Brezovsky, Jan; Damborsky, Jiri
2014-01-01
Single nucleotide variants represent a prevalent form of genetic variation. Mutations in the coding regions are frequently associated with the development of various genetic diseases. Computational tools for the prediction of the effects of mutations on protein function are very important for analysis of single nucleotide variants and their prioritization for experimental characterization. Many computational tools are already widely employed for this purpose. Unfortunately, their comparison and further improvement is hindered by large overlaps between the training datasets and benchmark datasets, which lead to biased and overly optimistic reported performances. In this study, we have constructed three independent datasets by removing all duplicates, inconsistencies and mutations previously used in the training of evaluated tools. The benchmark dataset containing over 43,000 mutations was employed for the unbiased evaluation of eight established prediction tools: MAPP, nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen-1, PolyPhen-2, SIFT and SNAP. The six best performing tools were combined into a consensus classifier PredictSNP, resulting in significantly improved prediction performance, and at the same time returned results for all mutations, confirming that consensus prediction represents an accurate and robust alternative to the predictions delivered by individual tools. A user-friendly web interface enables easy access to all eight prediction tools, the consensus classifier PredictSNP and annotations from the Protein Mutant Database and the UniProt database. The web server and the datasets are freely available to the academic community at http://loschmidt.chemi.muni.cz/predictsnp. PMID:24453961
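The idea of a consensus classifier that still returns a result when some tools give none can be sketched as a weighted vote. This is a generic illustration and should not be read as PredictSNP's actual scoring scheme: the function `consensus_vote`, the +1/-1 encoding, the optional weights, and the tool names `tool_a` to `tool_d` are all hypothetical.

```python
def consensus_vote(predictions, weights=None):
    """Combine per-tool calls into one label and a score in [-1, 1].

    predictions maps a tool name to +1 (deleterious), -1 (neutral),
    or None when the tool returned no result for this mutation.
    """
    score, total = 0.0, 0.0
    for tool, call in predictions.items():
        if call is None:          # skip tools without a prediction
            continue
        w = 1.0 if weights is None else weights.get(tool, 1.0)
        score += w * call
        total += w
    if total == 0.0:              # no tool produced a result
        return None, 0.0
    normalised = score / total
    # Ties (normalised == 0) fall to "neutral" in this sketch.
    label = "deleterious" if normalised > 0 else "neutral"
    return label, normalised


# Made-up calls for one mutation; tool_d returned nothing.
calls = {"tool_a": 1, "tool_b": 1, "tool_c": -1, "tool_d": None}
label, score = consensus_vote(calls)
print(label, score)
```

The magnitude of the normalised score can serve as a crude confidence indicator: values near zero mean the tools disagree.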
Bio-inspired benchmark generator for extracellular multi-unit recordings
Mondragón-González, Sirenia Lizbeth; Burguière, Eric
2017-01-01
The analysis of multi-unit extracellular recordings of brain activity has led to the development of numerous tools, ranging from signal processing algorithms to electronic devices and applications. Currently, the evaluation and optimisation of these tools are hampered by the lack of ground-truth databases of neural signals. These databases must be parameterisable, easy to generate and bio-inspired, i.e. containing features encountered in real electrophysiological recording sessions. Towards that end, this article introduces an original computational approach to create fully annotated and parameterised benchmark datasets, generated from the summation of three components: neural signals from compartmental models and recorded extracellular spikes, non-stationary slow oscillations, and a variety of different types of artefacts. We present three application examples. (1) We reproduced in-vivo extracellular hippocampal multi-unit recordings from either tetrode or polytrode designs. (2) We simulated recordings in two different experimental conditions: anaesthetised and awake subjects. (3) Last, we also conducted a series of simulations to study the impact of different levels of artefacts on extracellular recordings and their influence in the frequency domain. Beyond the results presented here, such a benchmark dataset generator has many applications such as calibration, evaluation and development of both hardware and software architectures. PMID:28233819
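The three-component summation described above can be illustrated with a deliberately simplified toy generator. Everything here is an assumption made for illustration: a fixed spike template stands in for compartmental-model output, a stationary sine stands in for the non-stationary slow oscillation, and single-sample steps stand in for artefacts. The function `synth_recording` is hypothetical and is not the authors' tool.

```python
import math


def synth_recording(n_samples, fs, spike_times, template,
                    osc_freq=2.0, osc_amp=0.5,
                    artefact_samples=(), artefact_amp=5.0):
    """Return (signal, ground_truth) as the sum of three components."""
    # Component 1: spike template pasted at each requested onset time.
    spikes = [0.0] * n_samples
    onsets = []
    for t in spike_times:
        start = round(t * fs)
        onsets.append(start)
        for k, s in enumerate(template):
            if start + k < n_samples:
                spikes[start + k] += s
    # Component 2: slow oscillation (stationary here, unlike the paper).
    slow = [osc_amp * math.sin(2 * math.pi * osc_freq * i / fs)
            for i in range(n_samples)]
    # Component 3: artefacts, modelled as isolated large-amplitude samples.
    artefacts = [0.0] * n_samples
    for i in artefact_samples:
        artefacts[i] = artefact_amp
    signal = [a + b + c for a, b, c in zip(spikes, slow, artefacts)]
    truth = {"spike_onsets": onsets, "artefact_samples": list(artefact_samples)}
    return signal, truth


signal, truth = synth_recording(1000, 1000.0, [0.1, 0.5], [1.0, -2.0, 0.5],
                                artefact_samples=[700])
print(truth)
```

Returning the annotations alongside the signal is what makes such data usable as ground truth when scoring a spike detector or artefact-rejection step.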
Nesvizhskii, Alexey I.
2013-01-01
Analysis of protein interaction networks and protein complexes using affinity purification and mass spectrometry (AP/MS) is among the most commonly used and successful applications of proteomics technologies. One of the foremost challenges of AP/MS data is the large number of false positive protein interactions present in unfiltered datasets. Here we review computational and informatics strategies for detecting specific protein interaction partners in AP/MS experiments, with a focus on incomplete (as opposed to genome-wide) interactome mapping studies. These strategies range from standard statistical approaches, to empirical scoring schemes optimized for a particular type of data, to advanced computational frameworks. The common denominator among these methods is the use of label-free quantitative information such as spectral counts or integrated peptide intensities that can be extracted from AP/MS data. We also discuss related issues such as combining multiple biological or technical replicates, and dealing with data generated using different tagging strategies. Computational approaches for benchmarking of scoring methods are discussed, and the need for generation of reference AP/MS datasets is highlighted. Finally, we discuss the possibility of more extended modeling of experimental AP/MS data, including integration with external information such as protein interaction predictions based on functional genomics data. PMID:22611043
InfAcrOnt: calculating cross-ontology term similarities using information flow by a random walk.
Cheng, Liang; Jiang, Yue; Ju, Hong; Sun, Jie; Peng, Jiajie; Zhou, Meng; Hu, Yang
2018-01-19
Since the establishment of the first biomedical ontology, Gene Ontology (GO), the number of biomedical ontologies has increased dramatically. Nowadays over 300 ontologies have been built, including the extensively used Disease Ontology (DO) and Human Phenotype Ontology (HPO). Because of the advantage of identifying novel relationships between terms, calculating similarity between ontology terms is one of the major tasks in this research area. Though similarities between terms within each ontology have been studied with in silico methods, term similarities across different ontologies were not investigated as deeply. The latest method took advantage of a gene functional interaction network (GFIN) to explore such inter-ontology similarities of terms. However, it only used gene interactions and failed to make full use of the connectivity among gene nodes of the network. In addition, all existing methods are specifically designed for GO, and their performance on the extended ontology community remains unknown. We proposed a method, InfAcrOnt, to infer similarities between terms across ontologies utilizing the entire GFIN. InfAcrOnt builds a term-gene-gene network comprising ontology annotations and the GFIN, and acquires similarities between terms across ontologies by modeling the information flow within the network by random walk. In our benchmark experiments on sub-ontologies of GO, InfAcrOnt achieves a high average area under the receiver operating characteristic curve (AUC) (0.9322 and 0.9309) and low standard deviations (1.8746e-6 and 3.0977e-6) in both human and yeast benchmark datasets, exhibiting superior performance. Meanwhile, comparisons of InfAcrOnt results and prior knowledge on pair-wise DO-HPO terms and pair-wise DO-GO terms show high correlations. The experimental results show that InfAcrOnt significantly improves the performance of inferring similarities between terms across ontologies in the benchmark set.
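To make the idea of modelling information flow by random walk concrete, here is a minimal sketch of a random walk with restart on a small term-gene-gene network. The abstract does not specify the walk variant, so the restart formulation, the restart probability of 0.3, and the node names `termA`, `gene1`, `gene2`, `termB` are all assumptions for illustration.

```python
def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-12, max_iter=10000):
    """Stationary visiting probabilities of a walker that restarts at `seed`."""
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(max_iter):
        nxt = {v: 0.0 for v in adj}
        for v, neighbours in adj.items():
            if not neighbours:
                # Isolated node: its probability mass returns to the seed.
                nxt[seed] += (1.0 - restart) * p[v]
                continue
            share = (1.0 - restart) * p[v] / len(neighbours)
            for w in neighbours:
                nxt[w] += share
        nxt[seed] += restart      # restart mass (probabilities sum to 1)
        change = sum(abs(nxt[v] - p[v]) for v in adj)
        p = nxt
        if change < tol:
            break
    return p


# termA annotates gene1, gene1 interacts with gene2, termB annotates gene2.
network = {
    "termA": ["gene1"],
    "gene1": ["termA", "gene2"],
    "gene2": ["gene1", "termB"],
    "termB": ["gene2"],
}
scores = random_walk_with_restart(network, seed="termA")
print(scores)
```

The score that a term from a second ontology receives when the walk is seeded at a term from the first can then be read as a directional similarity; how such scores are combined or normalised is left open here.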
Lessons Learned over Four Benchmark Exercises from the Community Structure-Activity Resource
Carlson, Heather A.
2016-01-01
Preparing datasets and analyzing the results is difficult and time-consuming, and I hope the points raised here will help other scientists avoid some of the thorny issues we wrestled with. PMID:27345761
Spectral Relative Standard Deviation: A Practical Benchmark in Metabolomics
Metabolomics datasets, by definition, comprise measurements of large numbers of metabolites. Both technical (analytical) and biological factors will induce variation within these measurements that is not consistent across all metabolites. Consequently, criteria are required to...
Liu, Bin; Wu, Hao; Zhang, Deyuan; Wang, Xiaolong; Chou, Kuo-Chen
2017-02-21
To expedite the pace in conducting genome/proteome analysis, we have developed a Python package called Pse-Analysis. The powerful package can automatically complete the following five procedures: (1) sample feature extraction, (2) optimal parameter selection, (3) model training, (4) cross validation, and (5) evaluating prediction quality. All the work a user needs to do is to input a benchmark dataset along with the query biological sequences concerned. Based on the benchmark dataset, Pse-Analysis will automatically construct an ideal predictor, followed by yielding the predicted results for the submitted query samples. All the aforementioned tedious jobs can be automatically done by the computer. Moreover, the multiprocessing technique was adopted to enhance computational speed about 6-fold. The Pse-Analysis Python package is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/Pse-Analysis/, and can be directly run on Windows, Linux, and Unix.
Karim, Rashed; Bhagirath, Pranav; Claus, Piet; James Housden, R; Chen, Zhong; Karimaghaloo, Zahra; Sohn, Hyon-Mok; Lara Rodríguez, Laura; Vera, Sergio; Albà, Xènia; Hennemuth, Anja; Peitgen, Heinz-Otto; Arbel, Tal; Gonzàlez Ballester, Miguel A; Frangi, Alejandro F; Götte, Marco; Razavi, Reza; Schaeffter, Tobias; Rhode, Kawal
2016-05-01
Studies have demonstrated the feasibility of late Gadolinium enhancement (LGE) cardiovascular magnetic resonance (CMR) imaging for guiding the management of patients with sequelae to myocardial infarction, such as ventricular tachycardia and heart failure. Clinical implementation of these developments necessitates a reproducible and reliable segmentation of the infarcted regions. It is challenging to compare new algorithms for infarct segmentation in the left ventricle (LV) with existing algorithms. Benchmarking datasets with evaluation strategies are much needed to facilitate comparison. This manuscript presents a benchmarking evaluation framework for future algorithms that segment infarct from LGE CMR of the LV. The image database consists of 30 LGE CMR images of both humans and pigs that were acquired from two separate imaging centres. A consensus ground truth was obtained for all data using maximum likelihood estimation. Six widely-used fixed-thresholding methods and five recently developed algorithms are tested on the benchmarking framework. Results demonstrate that the algorithms have better overlap with the consensus ground truth than most of the n-SD fixed-thresholding methods, with the exception of the Full-Width-at-Half-Maximum (FWHM) fixed-thresholding method. Some of the pitfalls of fixed thresholding methods are demonstrated in this work. The benchmarking evaluation framework, which is a contribution of this work, can be used to test and benchmark future algorithms that detect and quantify infarct in LGE CMR images of the LV. The datasets, ground truth and evaluation code have been made publicly available through the website: https://www.cardiacatlas.org/web/guest/challenges. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
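The two families of fixed-thresholding methods mentioned above can be sketched as follows, using their common textbook definitions: an n-SD threshold is the mean intensity of remote (healthy) myocardium plus n standard deviations, and an FWHM threshold is half of the maximum myocardial intensity. The function names and the toy intensity values are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def n_sd_threshold(remote_intensities, n):
    """Mean of remote myocardium plus n standard deviations."""
    return mean(remote_intensities) + n * pstdev(remote_intensities)


def fwhm_threshold(myocardium_intensities):
    """Half of the maximum intensity found in the myocardium."""
    return 0.5 * max(myocardium_intensities)


def segment(intensities, threshold):
    """Label each pixel as infarct (True) when it reaches the threshold."""
    return [value >= threshold for value in intensities]


# Made-up pixel intensities for illustration only.
remote = [10, 12, 11, 9, 8]
myocardium = [9, 11, 30, 40, 12, 35]
print(segment(myocardium, n_sd_threshold(remote, 2)))
print(segment(myocardium, fwhm_threshold(myocardium)))
```

The n-SD result depends on which region is chosen as remote myocardium and on n, whereas FWHM depends on a single maximum value; both sensitivities are plausible sources of the pitfalls the abstract refers to.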
Thorsen, Jonathan; Brejnrod, Asker; Mortensen, Martin; Rasmussen, Morten A; Stokholm, Jakob; Al-Soud, Waleed Abu; Sørensen, Søren; Bisgaard, Hans; Waage, Johannes
2016-11-25
There is an immense scientific interest in the human microbiome and its effects on human physiology, health, and disease. A common approach for examining bacterial communities is high-throughput sequencing of 16S rRNA gene hypervariable regions, aggregating sequence-similar amplicons into operational taxonomic units (OTUs). Strategies for detecting differential relative abundance of OTUs between sample conditions include classical statistical approaches as well as a plethora of newer methods, many borrowing from the related field of RNA-seq analysis. This effort is complicated by unique data characteristics, including sparsity, sequencing depth variation, and nonconformity of read counts to theoretical distributions, which is often exacerbated by exploratory and/or unbalanced study designs. Here, we assess the robustness of available methods for (1) inference in differential relative abundance analysis and (2) beta-diversity-based sample separation, using a rigorous benchmarking framework based on large clinical 16S microbiome datasets from different sources. Running more than 380,000 full differential relative abundance tests on real datasets with permuted case/control assignments and in silico-spiked OTUs, we identify large differences in method performance on a range of parameters, including false positive rates, sensitivity to sparsity and case/control balances, and spike-in retrieval rate. In large datasets, methods with the highest false positive rates also tend to have the best detection power. For beta-diversity-based sample separation, we show that library size normalization has very little effect and that the distance metric is the most important factor in terms of separation power. Our results, generalizable to datasets from different sequencing platforms, demonstrate how the choice of method considerably affects analysis outcome. 
Here, we give recommendations for tools that exhibit low false positive rates, have good retrieval power across effect sizes and case/control proportions, and have low sparsity bias. Result output from some commonly used methods should be interpreted with caution. We provide an easily extensible framework for benchmarking of new methods and future microbiome datasets.
Deep Filter Banks for Texture Recognition, Description, and Segmentation.
Cimpoi, Mircea; Maji, Subhransu; Kokkinos, Iasonas; Vedaldi, Andrea
Visual textures have played a key role in image understanding because they convey important semantics of images, and because texture representations that pool local image descriptors in an orderless manner have had a tremendous impact in diverse applications. In this paper we make several contributions to texture understanding. First, instead of focusing on texture instance and material category recognition, we propose a human-interpretable vocabulary of texture attributes to describe common texture patterns, complemented by a new describable texture dataset for benchmarking. Second, we look at the problem of recognizing materials and texture attributes in realistic imaging conditions, including when textures appear in clutter, developing corresponding benchmarks on top of the recently proposed OpenSurfaces dataset. Third, we revisit classic texture representations, including bag-of-visual-words and the Fisher vectors, in the context of deep learning and show that these have excellent efficiency and generalization properties if the convolutional layers of a deep model are used as filter banks. We obtain in this manner state-of-the-art performance in numerous datasets well beyond textures, an efficient method to apply deep features to image regions, as well as benefit in transferring features from one domain to another.
Evaluating the Quantitative Capabilities of Metagenomic Analysis Software.
Kerepesi, Csaba; Grolmusz, Vince
2016-05-01
DNA sequencing technologies are applied widely and frequently today to describe metagenomes, i.e., microbial communities in environmental or clinical samples, without the need for culturing them. These technologies usually return short (100-300 base-pairs long) DNA reads, and these reads are processed by metagenomic analysis software that assign phylogenetic composition-information to the dataset. Here we evaluate three metagenomic analysis software (AmphoraNet--a webserver implementation of AMPHORA2--, MG-RAST, and MEGAN5) for their capabilities of assigning quantitative phylogenetic information for the data, describing the frequency of appearance of the microorganisms of the same taxa in the sample. The difficulties of the task arise from the fact that longer genomes produce more reads from the same organism than shorter genomes, and some software assign higher frequencies to species with longer genomes than to those with shorter ones. This phenomenon is called the "genome length bias." Dozens of complex artificial metagenome benchmarks can be found in the literature. Because of the complexity of those benchmarks, it is usually difficult to judge the resistance of a metagenomic software to this "genome length bias." Therefore, we have made a simple benchmark for the evaluation of the "taxon-counting" in a metagenomic sample: we have taken the same number of copies of three full bacterial genomes of different lengths, broken them up randomly into short reads with an average length of 150 bp, and mixed the reads, creating our simple benchmark. Because of its simplicity, the benchmark is not supposed to serve as a mock metagenome, but if a software fails on that simple task, it will surely fail on most real metagenomes. We applied three software for the benchmark. The ideal quantitative solution would assign the same proportion to the three bacterial taxa. 
We have found that AMPHORA2/AmphoraNet gave the most accurate results and the other two software were under-performers: they counted quite reliably each short read to their respective taxon, producing the typical genome length bias. The benchmark dataset is available at http://pitgroup.org/static/3RandomGenome-100kavg150bps.fna.
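The genome length bias described above can be illustrated with a small calculation. The following is a minimal sketch, not the procedure used by any of the three evaluated tools: the genome lengths and the names `read_counts`, `naive_proportions` and `length_normalized_proportions` are hypothetical, and dividing read counts by genome length is only one possible correction, shown here as an assumption.

```python
def read_counts(genome_lengths, copies=10, read_len=150):
    """Approximate reads per taxon: each genome copy yields about length / read_len reads."""
    return {name: copies * length // read_len
            for name, length in genome_lengths.items()}


def naive_proportions(counts):
    """Proportion of reads per taxon; this is where the genome length bias appears."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def length_normalized_proportions(counts, genome_lengths):
    """Divide read counts by genome length before computing proportions (assumed correction)."""
    per_base = {name: counts[name] / genome_lengths[name] for name in counts}
    total = sum(per_base.values())
    return {name: value / total for name, value in per_base.items()}


# Hypothetical genome lengths in base pairs (not the genomes used in the benchmark).
GENOMES = {"short": 1_500_000, "medium": 3_000_000, "long": 6_000_000}

if __name__ == "__main__":
    counts = read_counts(GENOMES)
    print("naive:", naive_proportions(counts))
    print("length-normalized:", length_normalized_proportions(counts, GENOMES))
```

With equal copy numbers, the naive read proportions follow the genome lengths, while the length-normalized proportions come out equal, which is the ideal quantitative answer the abstract describes.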
Özgür, Arzucan; Hur, Junguk; He, Yongqun
2016-01-01
The Interaction Network Ontology (INO) logically represents biological interactions, pathways, and networks. INO has been demonstrated to be valuable in providing a set of structured ontological terms and associated keywords to support literature mining of gene-gene interactions from biomedical literature. However, previous work using INO focused on single keyword matching, while many interactions are represented with two or more interaction keywords used in combination. This paper reports our extension of INO to include combinatory patterns of two or more literature mining keywords co-existing in one sentence to represent specific INO interaction classes. Such keyword combinations and related INO interaction type information could be automatically obtained via SPARQL queries, formatted in Excel format, and used in an INO-supported SciMiner, an in-house literature mining program. We studied the gene interaction sentences from the commonly used benchmark Learning Logic in Language (LLL) dataset and one internally generated vaccine-related dataset to identify and analyze interaction types containing multiple keywords. Patterns obtained from the dependency parse trees of the sentences were used to identify the interaction keywords that are related to each other and collectively represent an interaction type. The INO ontology currently has 575 terms including 202 terms under the interaction branch. The relations between the INO interaction types and associated keywords are represented using the INO annotation relations: 'has literature mining keywords' and 'has keyword dependency pattern'. The keyword dependency patterns were generated by running the Stanford Parser to obtain dependency relation types. Out of the 107 interactions in the LLL dataset represented with two-keyword interaction types, 86 were identified by using the direct dependency relations. The LLL dataset contained 34 gene regulation interaction types, each of which was associated with multiple keywords. 
A hierarchical display of these 34 interaction types and their ancestor terms in INO resulted in the identification of specific gene-gene interaction patterns from the LLL dataset. The phenomenon of having multi-keyword interaction types was also frequently observed in the vaccine dataset. By modeling and representing multiple textual keywords for interaction types, the extended INO enabled the identification of complex biological gene-gene interactions represented with multiple keywords.
Decibel: The Relational Dataset Branching System
Maddox, Michael; Goehring, David; Elmore, Aaron J.; Madden, Samuel; Parameswaran, Aditya; Deshpande, Amol
2017-01-01
As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in wasted storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs. PMID:28149668
Traffic sign classification with dataset augmentation and convolutional neural network
NASA Astrophysics Data System (ADS)
Tang, Qing; Kurnianggoro, Laksono; Jo, Kang-Hyun
2018-04-01
This paper presents a method for traffic sign classification using a convolutional neural network (CNN). As a preprocessing step, we first convert each color image to grayscale and then normalize it to the range (-1, 1). To increase the robustness of the classification model, we apply a dataset augmentation algorithm and create new images to train the model. To avoid overfitting, we use a dropout module before the last fully connected layer. To assess the performance of the proposed method, the German Traffic Sign Recognition Benchmark (GTSRB) dataset is used. Experimental results show that the method is effective in classifying traffic signs.
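The preprocessing step (grayscale conversion followed by normalization to roughly the range -1 to 1) might look like the sketch below. The abstract does not say how the grayscale conversion is done, so the ITU-R BT.601 luminance weights, the 8-bit input assumption, and the function names `to_grayscale` and `normalize` are all assumptions made for illustration.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) pixels to luminance.

    Assumption: ITU-R BT.601 weights; the paper may use a different conversion.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def normalize(gray_image):
    """Map 8-bit intensities in [0, 255] linearly onto [-1, 1]."""
    return [[pixel / 127.5 - 1.0 for pixel in row] for row in gray_image]


if __name__ == "__main__":
    # A tiny 1x3 hypothetical image: white, black, and pure red pixels.
    image = [[(255, 255, 255), (0, 0, 0), (255, 0, 0)]]
    print(normalize(to_grayscale(image)))
```

Centering the inputs around zero in this way is a common choice for CNN training; the augmentation and dropout steps mentioned in the abstract are not sketched here.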
Development and Validation of a High-Quality Composite Real-World Mortality Endpoint.
Curtis, Melissa D; Griffith, Sandra D; Tucker, Melisa; Taylor, Michael D; Capra, William B; Carrigan, Gillis; Holzman, Ben; Torres, Aracelis Z; You, Paul; Arnieri, Brandon; Abernethy, Amy P
2018-05-14
To create a high-quality electronic health record (EHR)-derived mortality dataset for retrospective and prospective real-world evidence generation. Oncology EHR data, supplemented with external commercial and US Social Security Death Index data, benchmarked to the National Death Index (NDI). We developed a recent, linkable, high-quality mortality variable amalgamated from multiple data sources to supplement EHR data, benchmarked against the highest completeness U.S. mortality data, the NDI. Data quality of the mortality variable version 2.0 is reported here. For advanced non-small-cell lung cancer, sensitivity of mortality information improved from 66 percent in EHR structured data to 91 percent in the composite dataset, with high date agreement compared to the NDI. For advanced melanoma, metastatic colorectal cancer, and metastatic breast cancer, sensitivity of the final variable was 85 to 88 percent. Kaplan-Meier survival analyses showed that improving mortality data completeness minimized overestimation of survival relative to NDI-based estimates. For EHR-derived data to yield reliable real-world evidence, it needs to be of known and sufficiently high quality. Considering the impact of mortality data completeness on survival endpoints, we highlight the importance of data quality assessment and advocate benchmarking to the NDI. © 2018 The Authors. Health Services Research published by Wiley Periodicals, Inc. on behalf of Health Research and Educational Trust.
Furber, Gareth; Brann, Peter; Skene, Clive; Allison, Stephen
2011-06-01
The purpose of this study was to benchmark the cost efficiency of community care across six child and adolescent mental health services (CAMHS) drawn from different Australian states. Organizational, contact and outcome data from the National Mental Health Benchmarking Project (NMHBP) data-sets were used to calculate cost per "treatment hour" and cost per episode for the six participating organizations. We also explored the relationship between intake severity as measured by the Health of the Nations Outcome Scales for Children and Adolescents (HoNOSCA) and cost per episode. The average cost per treatment hour was $223, with cost differences across the six services ranging from a mean of $156 to $273 per treatment hour. The average cost per episode was $3349 (median $1577) and there were significant differences in the CAMHS organizational medians ranging from $388 to $7076 per episode. HoNOSCA scores explained at best 6% of the cost variance per episode. These large cost differences indicate that community CAMHS have the potential to make substantial gains in cost efficiency through collaborative benchmarking. Benchmarking forums need considerable financial and business expertise for detailed comparison of business models for service provision.
BMDExpress Data Viewer: A Visualization Tool to Analyze BMDExpress Datasets (SoTC)
Background: Benchmark Dose (BMD) modelling is a mathematical approach used to determine where a dose-response change begins to take place relative to controls following chemical exposure. BMDs are being increasingly applied in regulatory toxicology to estimate acceptable exposure...
BMDExpress Data Viewer: A Visualization Tool to Analyze BMDExpress Datasets (STC symposium)
Background: Benchmark Dose (BMD) modelling is a mathematical approach used to determine where a dose-response change begins to take place relative to controls following chemical exposure. BMDs are being increasingly applied in regulatory toxicology to estimate acceptable exposure...
International land Model Benchmarking (ILAMB) Package v002.00
Collier, Nathaniel [Oak Ridge National Laboratory]; Hoffman, Forrest M. [Oak Ridge National Laboratory]; Mu, Mingquan [University of California, Irvine]; Randerson, James T. [University of California, Irvine]; Riley, William J. [Lawrence Berkeley National Laboratory]
2016-05-09
As a contribution to International Land Model Benchmarking (ILAMB) Project, we are providing new analysis approaches, benchmarking tools, and science leadership. The goal of ILAMB is to assess and improve the performance of land models through international cooperation and to inform the design of new measurement campaigns and field studies to reduce uncertainties associated with key biogeochemical processes and feedbacks. ILAMB is expected to be a primary analysis tool for CMIP6 and future model-data intercomparison experiments. This team has developed initial prototype benchmarking systems for ILAMB, which will be improved and extended to include ocean model metrics and diagnostics.
International land Model Benchmarking (ILAMB) Package v001.00
Mu, Mingquan [University of California, Irvine; Randerson, James T. [University of California, Irvine; Riley, William J. [Lawrence Berkeley National Laboratory; Hoffman, Forrest M. [Oak Ridge National Laboratory
2016-05-02
As a contribution to International Land Model Benchmarking (ILAMB) Project, we are providing new analysis approaches, benchmarking tools, and science leadership. The goal of ILAMB is to assess and improve the performance of land models through international cooperation and to inform the design of new measurement campaigns and field studies to reduce uncertainties associated with key biogeochemical processes and feedbacks. ILAMB is expected to be a primary analysis tool for CMIP6 and future model-data intercomparison experiments. This team has developed initial prototype benchmarking systems for ILAMB, which will be improved and extended to include ocean model metrics and diagnostics.
Super Normal Vector for Human Activity Recognition with Depth Cameras.
Yang, Xiaodong; Tian, YingLi
2017-05-01
The advent of cost-effective and easy-to-operate depth cameras has facilitated a variety of visual recognition tasks including human activity recognition. This paper presents a novel framework for recognizing human activities from video sequences captured by depth cameras. We extend the surface normal to polynormal by assembling local neighboring hypersurface normals from a depth sequence to jointly characterize local motion and shape information. We then propose a general scheme of super normal vector (SNV) to aggregate the low-level polynormals into a discriminative representation, which can be viewed as a simplified version of the Fisher kernel representation. In order to globally capture the spatial layout and temporal order, an adaptive spatio-temporal pyramid is introduced to subdivide a depth video into a set of space-time cells. In extensive experiments, the proposed approach achieves superior performance to the state-of-the-art methods on four public benchmark datasets, i.e., MSRAction3D, MSRDailyActivity3D, MSRGesture3D, and MSRActionPairs3D.
Alpha Matting with KL-Divergence Based Sparse Sampling.
Karacan, Levent; Erdem, Aykut; Erdem, Erkut
2017-06-22
In this paper, we present a new sampling-based alpha matting approach for the accurate estimation of foreground and background layers of an image. Previous sampling-based methods typically rely on certain heuristics in collecting representative samples from known regions, and thus their performance deteriorates if the underlying assumptions are not satisfied. To alleviate this, we take an entirely new approach and formulate sampling as a sparse subset selection problem where we propose to pick a small set of candidate samples that best explains the unknown pixels. Moreover, we describe a new dissimilarity measure for comparing two samples which is based on the KL-divergence between the distributions of features extracted in the vicinity of the samples. The proposed framework is general and could be easily extended to video matting by additionally taking temporal information into account in the sampling process. Evaluation on standard benchmark datasets for image and video matting demonstrates that our approach provides more accurate results compared to the state-of-the-art methods.
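A KL-divergence based dissimilarity between two feature distributions can be sketched as follows. The abstract does not give the exact form of the measure, so representing the distributions as histograms, symmetrizing by summing both directions, and the names `kl_divergence` and `symmetric_kl` are assumptions for illustration only.

```python
import math


def kl_divergence(p_hist, q_hist, eps=1e-12):
    """KL(P || Q) for two histograms with the same bins (normalized internally)."""
    p_total, q_total = sum(p_hist), sum(q_hist)
    if p_total <= 0 or q_total <= 0:
        raise ValueError("histograms must have positive mass")
    divergence = 0.0
    for p_count, q_count in zip(p_hist, q_hist):
        p, q = p_count / p_total, q_count / q_total
        if p > 0:
            # eps guards against division by zero when a bin of Q is empty.
            divergence += p * math.log((p + eps) / (q + eps))
    return divergence


def symmetric_kl(p_hist, q_hist):
    """Assumed symmetrization: KL(P || Q) + KL(Q || P)."""
    return kl_divergence(p_hist, q_hist) + kl_divergence(q_hist, p_hist)


if __name__ == "__main__":
    # Hypothetical feature histograms around two image samples.
    sample_a = [10, 20, 70]
    sample_b = [30, 30, 40]
    print(symmetric_kl(sample_a, sample_b))
```

A symmetric form is convenient when the measure is used as a pairwise dissimilarity, since plain KL-divergence depends on the order of its arguments.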
Multi-task feature selection in microarray data by binary integer programming.
Lan, Liang; Vucetic, Slobodan
2013-12-20
A major challenge in microarray classification is that the number of features is typically orders of magnitude larger than the number of examples. In this paper, we propose a novel feature filter algorithm to select the feature subset with maximal discriminative power and minimal redundancy by solving a quadratic objective function with binary integer constraints. To improve the computational efficiency, the binary integer constraints are relaxed and a low-rank approximation to the quadratic term is applied. The proposed feature selection algorithm was extended to solve multi-task microarray classification problems. We compared the single-task version of the proposed feature selection algorithm with 9 existing feature selection methods on 4 benchmark microarray data sets. The empirical results show that the proposed method achieved the most accurate predictions overall. We also evaluated the multi-task version of the proposed algorithm on 8 multi-task microarray datasets. The multi-task feature selection algorithm resulted in significantly higher accuracy than when using the single-task feature selection methods.
Discriminative least squares regression for multiclass classification and feature selection.
Xiang, Shiming; Nie, Feiping; Meng, Gaofeng; Pan, Chunhong; Zhang, Changshui
2012-11-01
This paper presents a framework of discriminative least squares regression (LSR) for multiclass classification and feature selection. The core idea is to enlarge the distance between different classes under the conceptual framework of LSR. First, a technique called ε-dragging is introduced to force the regression targets of different classes moving along opposite directions such that the distances between classes can be enlarged. Then, the ε-draggings are integrated into the LSR model for multiclass classification. Our learning framework, referred to as discriminative LSR, has a compact model form, where there is no need to train two-class machines that are independent of each other. With its compact form, this model can be naturally extended for feature selection. This goal is achieved in terms of L2,1 norm of matrix, generating a sparse learning model for feature selection. The model for multiclass classification and its extension for feature selection are finally solved elegantly and efficiently. Experimental evaluation over a range of benchmark datasets indicates the validity of our method.
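The L2,1 norm mentioned above is the sum of the Euclidean norms of the rows of a matrix. The sketch below computes it for a small weight matrix; treating each row as one feature is the usual convention for this norm in feature selection and is assumed here, and the function name `l21_norm` is hypothetical.

```python
import math


def l21_norm(matrix):
    """Sum over rows of the Euclidean (L2) norm of each row."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in matrix)


if __name__ == "__main__":
    # Hypothetical 3-feature, 2-class weight matrix; the second feature is unused.
    weights = [[3.0, 4.0],
               [0.0, 0.0],
               [5.0, 12.0]]
    print(l21_norm(weights))
```

Penalizing this quantity pushes whole rows toward zero, so a feature is either kept for all classes or dropped for all of them, which is why the norm yields a sparse model for feature selection.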
Spence, Richard T; Chang, David C; Kaafarani, Haytham M A; Panieri, Eugenio; Anderson, Geoffrey A; Hutter, Matthew M
2018-02-01
Despite the existence of multiple validated risk assessment and quality benchmarking tools in surgery, their utility outside of high-income countries is limited. We sought to derive, validate and apply a scoring system that is both (1) feasible, and (2) reliably predicts mortality in a middle-income country (MIC) context. A 5-step methodology was used: (1) development of a de novo surgical outcomes database modeled around the American College of Surgeons' National Surgical Quality Improvement Program (ACS-NSQIP) in South Africa (SA dataset), (2) use of the resultant data to identify all predictors of in-hospital death with more than 90% capture indicating feasibility of collection, (3) use of these predictors to derive and validate an integer-based score that reliably predicts in-hospital death in the 2012 ACS-NSQIP, (4) application of the score in the original SA dataset to demonstrate its performance, (5) identification of threshold cutoffs of the score to prompt action and drive quality improvement. Following steps one to three above, the 13-point Codman's score was derived and validated on 211,737 and 109,079 patients, respectively, and includes: age 65 (1), partially or completely dependent functional status (1), preoperative transfusions ≥4 units (1), emergency operation (2), sepsis or septic shock (2), American Society of Anesthesia score ≥3 (3) and operative procedure (1-3). Application of the score to 373 patients in the SA dataset showed good discrimination and calibration to predict an in-hospital death. A Codman Score of 8 is an optimal cutoff point for defining expected and unexpected deaths. We have designed a novel risk prediction score specific for a MIC context. The Codman Score can prove useful for both (1) preoperative decision-making and (2) benchmarking the quality of surgical care in MICs.
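The arithmetic of the integer score can be sketched from the point values listed in the abstract. This is an illustration only, not a clinical tool: reading the age criterion as "65 or older" is an assumption, the mapping from a given operation to its 1-3 procedure points is not stated in the abstract and is therefore taken as an input, and the function name `codman_score` is hypothetical.

```python
def codman_score(age, dependent, transfusion_units, emergency,
                 sepsis, asa_class, procedure_points):
    """Sum the point values listed in the abstract (maximum 13)."""
    if procedure_points not in (1, 2, 3):
        raise ValueError("procedure_points must be 1, 2 or 3")
    score = procedure_points                       # operative procedure (1-3)
    score += 1 if age >= 65 else 0                 # assumption: criterion means age >= 65
    score += 1 if dependent else 0                 # partially or completely dependent
    score += 1 if transfusion_units >= 4 else 0    # preoperative transfusions >= 4 units
    score += 2 if emergency else 0                 # emergency operation
    score += 2 if sepsis else 0                    # sepsis or septic shock
    score += 3 if asa_class >= 3 else 0            # ASA score >= 3
    return score


if __name__ == "__main__":
    # Hypothetical patient with every risk factor present.
    print(codman_score(age=70, dependent=True, transfusion_units=4,
                       emergency=True, sepsis=True, asa_class=3,
                       procedure_points=3))
```

The point values add up to the 13 points named in the abstract; how the cutoff of 8 is applied to individual patients is not specified there and is not modeled in this sketch.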
Creating a non-linear total sediment load formula using polynomial best subset regression model
NASA Astrophysics Data System (ADS)
Okcu, Davut; Pektas, Ali Osman; Uyumaz, Ali
2016-08-01
The aim of this study is to derive a new total sediment load formula that is more accurate and has fewer application constraints than the well-known formulae in the literature. The five best-known stream power concept sediment formulae approved by ASCE are used for benchmarking on a wide range of datasets that includes both field and flume (lab) observations. The dimensionless parameters of these widely used formulae are used as inputs in a new regression approach, called polynomial best subset regression (PBSR) analysis. The aim of the PBSR analysis is to fit and test all possible combinations of the input variables and to select the best subset. All the input variables, together with their second and third powers, are included in the regression to test the possible relation between the explanatory variables and the dependent variable. The best subset is selected with a multistep approach that depends on significance values and on the degree of multicollinearity of the inputs. The new formula is compared to the others on a holdout dataset, and detailed performance investigations are conducted for the field and lab datasets within this holdout data. Different goodness-of-fit statistics are used, as they represent different perspectives on model accuracy. After the detailed comparisons, we identified the most accurate equation that is also applicable to both flume and river data. On the field dataset in particular, the proposed formula outperformed the benchmark formulations.
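The idea of fitting all combinations of inputs and their second and third powers can be sketched as an exhaustive search. This is a simplified illustration under stated assumptions: it ranks subsets by adjusted R-squared instead of the significance and multicollinearity checks the abstract describes, it uses plain normal equations, and the names `solve`, `fit` and `best_subset` as well as the toy data are hypothetical.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(columns, y):
    """Least squares with intercept; returns (coefficients, sum of squared errors)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, sse


def best_subset(features, y, max_power=3, max_terms=2):
    """Try every subset of powered inputs; keep the one with the best adjusted R^2."""
    terms = {f"{name}^{p}": [v ** p for v in values]
             for name, values in features.items()
             for p in range(1, max_power + 1)}
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)
    best = None
    for size in range(1, max_terms + 1):
        if n - size - 1 <= 0:
            break  # not enough observations for this many terms
        for subset in combinations(terms, size):
            try:
                _, sse = fit([terms[t] for t in subset], y)
            except ValueError:
                continue  # skip collinear (singular) combinations
            adj_r2 = 1.0 - (sse / (n - size - 1)) / (sst / (n - 1))
            if best is None or adj_r2 > best[0]:
                best = (adj_r2, subset)
    return best


if __name__ == "__main__":
    x = [float(v) for v in range(1, 9)]
    z = [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0]
    y = [2.0 + 3.0 * v * v for v in x]  # toy response that depends on x squared only
    print(best_subset({"x": x, "z": z}, y, max_terms=1))
```

Exhaustive search grows quickly with the number of candidate terms, so a practical analysis would cap the subset size, as `max_terms` does here.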
Experimental Evaluation of Acoustic Engine Liner Models Developed with COMSOL Multiphysics
NASA Technical Reports Server (NTRS)
Schiller, Noah H.; Jones, Michael G.; Bertolucci, Brandon
2017-01-01
Accurate modeling tools are needed to design new engine liners capable of reducing aircraft noise. The purpose of this study is to determine if a commercially-available finite element package, COMSOL Multiphysics, can be used to accurately model a range of different acoustic engine liner designs, and in the process, collect and document a benchmark dataset that can be used in both current and future code evaluation activities. To achieve these goals, a variety of liner samples, ranging from conventional perforate-over-honeycomb to extended-reaction designs, were installed in one wall of the grazing flow impedance tube at the NASA Langley Research Center. The liners were exposed to high sound pressure levels and grazing flow, and the effect of the liner on the sound field in the flow duct was measured. These measurements were then compared with predictions. While this report only includes comparisons for a subset of the configurations, the full database of all measurements and predictions is available in electronic format upon request. The results demonstrate that both conventional perforate-over-honeycomb and extended-reaction liners can be accurately modeled using COMSOL. Therefore, this modeling tool can be used with confidence to supplement the current suite of acoustic propagation codes, and ultimately develop new acoustic engine liners designed to reduce aircraft noise.
NASA Astrophysics Data System (ADS)
Behlim, Sadaf Iqbal; Syed, Tahir Qasim; Malik, Muhammad Yameen; Vigneron, Vincent
2016-11-01
Grouping image tokens is an intermediate step needed to arrive at meaningful image representation and summarization. Usually, perceptual cues, for instance gestalt properties, inform token grouping. However, they do not take into account structural continuities that could be derived from other tokens belonging to similar structures irrespective of their location. We propose an image representation that encodes structural constraints emerging from local binary patterns (LBP), which provides a long-distance measure of similarity but in a structurally connected way. Our representation provides a grouping of pixels or larger image tokens that is free of numeric similarity measures and could therefore be extended to nonmetric spaces. The representation lends itself nicely to ubiquitous image processing applications such as connected component labeling and segmentation. We test our proposed representation on the perceptual grouping or segmentation task on the popular Berkeley segmentation dataset (BSD500), where it achieves an average F-measure of 0.559 with respect to human-segmented images. Our algorithm achieves a high average recall of 0.787 and is therefore well-suited to other applications such as object retrieval and category-independent object recognition. The proposed merging heuristic based on levels of singular tree component has shown promising results on the BSD500 dataset and currently ranks 12th among all benchmarked algorithms, but contrary to the others, it requires no data-driven training or specialized preprocessing.
A Multi-Sensor Fusion MAV State Estimation from Long-Range Stereo, IMU, GPS and Barometric Sensors.
Song, Yu; Nuske, Stephen; Scherer, Sebastian
2016-12-22
State estimation is the most critical capability for MAV (Micro-Aerial Vehicle) localization, autonomous obstacle avoidance, robust flight control and 3D environmental mapping. There are three main challenges for MAV state estimation: (1) it should deal with aggressive 6 DOF (Degree Of Freedom) motion; (2) it should be robust to intermittent GPS (Global Positioning System) (even GPS-denied) situations; (3) it should work well both for low- and high-altitude flight. In this paper, we present a state estimation technique that fuses long-range stereo visual odometry, GPS, barometric and IMU (Inertial Measurement Unit) measurements. The new estimation system has two main parts. The first is a stochastic cloning EKF (Extended Kalman Filter) estimator that loosely fuses both absolute state measurements (GPS, barometer) and relative state measurements (IMU, visual odometry); it is derived and discussed in detail. The second is a long-range stereo visual odometry, proposed for high-altitude MAV odometry calculation, which uses both multi-view stereo triangulation and a multi-view stereo inverse depth filter. The odometry takes the EKF information (IMU integral) for robust camera pose tracking and image feature matching, and the stereo odometry output serves as the relative measurements for the update of the state estimation. Experimental results on a benchmark dataset and our real flight dataset show the effectiveness of the proposed state estimation system, especially for aggressive, intermittent-GPS and high-altitude MAV flight.
Natural image sequences constrain dynamic receptive fields and imply a sparse code.
Häusler, Chris; Susemihl, Alex; Nawrot, Martin P
2013-11-06
In their natural environment, animals experience a complex and dynamic visual scenery. Under such natural stimulus conditions, neurons in the visual cortex employ a spatially and temporally sparse code. For the input scenario of natural still images, previous work demonstrated that unsupervised feature learning combined with the constraint of sparse coding can predict physiologically measured receptive fields of simple cells in the primary visual cortex. This convincingly indicated that the mammalian visual system is adapted to the natural spatial input statistics. Here, we extend this approach to the time domain in order to predict dynamic receptive fields that can account for both spatial and temporal sparse activation in biological neurons. We rely on temporal restricted Boltzmann machines and suggest a novel temporal autoencoding training procedure. When tested on a dynamic multi-variate benchmark dataset this method outperformed existing models of this class. Learning features on a large dataset of natural movies allowed us to model spatio-temporal receptive fields for single neurons. They resemble temporally smooth transformations of previously obtained static receptive fields and are thus consistent with existing theories. A neuronal spike response model demonstrates how the dynamic receptive field facilitates temporal and population sparseness. We discuss the potential mechanisms and benefits of a spatially and temporally sparse representation of natural visual input. Copyright © 2013 The Authors. Published by Elsevier B.V. All rights reserved.
The value of protein structure classification information-Surveying the scientific literature
Fox, Naomi K.; Brenner, Steven E.; Chandonia, John -Marc
2015-08-27
The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two-decade-old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.
The value of protein structure classification information-Surveying the scientific literature
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fox, Naomi K.; Brenner, Steven E.; Chandonia, John -Marc
The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two-decade-old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.
Bryan, Kenneth; Cunningham, Pádraig
2008-01-01
Background: Microarrays have the capacity to measure the expressions of thousands of genes in parallel over many experimental samples. The unsupervised classification technique of bicluster analysis has been employed previously to uncover gene expression correlations over subsets of samples with the aim of providing a more accurate model of the natural gene functional classes. This approach also has the potential to aid functional annotation of unclassified open reading frames (ORFs). Until now this aspect of biclustering has been under-explored. In this work we illustrate how bicluster analysis may be extended into a 'semi-supervised' ORF annotation approach referred to as BALBOA. Results: The efficacy of the BALBOA ORF classification technique is first assessed via cross validation and compared to a multi-class k-Nearest Neighbour (kNN) benchmark across three independent gene expression datasets. BALBOA is then used to assign putative functional annotations to unclassified yeast ORFs. These predictions are evaluated using existing experimental and protein sequence information. Lastly, we employ a related semi-supervised method to predict the presence of novel functional modules within yeast. Conclusion: In this paper we demonstrate how unsupervised classification methods, such as bicluster analysis, may be extended using available annotations to form semi-supervised approaches within the gene expression analysis domain. We show that such methods have the potential to improve upon supervised approaches and shed new light on the functions of unclassified ORFs and their co-regulation. PMID:18831786
CSmetaPred: a consensus method for prediction of catalytic residues.
Choudhary, Preeti; Kumar, Shailesh; Bachhawat, Anand Kumar; Pandit, Shashi Bhushan
2017-12-22
Knowledge of catalytic residues can play an essential role in elucidating mechanistic details of an enzyme. However, experimental identification of catalytic residues is a tedious and time-consuming task, which can be expedited by computational predictions. Despite significant development in active-site prediction methods, one of the remaining issues is ranked positions of putative catalytic residues among all ranked residues. In order to improve ranking of catalytic residues and their prediction accuracy, we have developed a meta-approach based method CSmetaPred. In this approach, residues are ranked based on the mean of normalized residue scores derived from four well-known catalytic residue predictors. The mean residue score of CSmetaPred is combined with predicted pocket information to improve prediction performance in meta-predictor, CSmetaPred_poc. Both meta-predictors are evaluated on two comprehensive benchmark datasets and three legacy datasets using Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves. The visual and quantitative analysis of ROC and PR curves shows that meta-predictors outperform their constituent methods and CSmetaPred_poc is the best of evaluated methods. For instance, on CSAMAC dataset CSmetaPred_poc (CSmetaPred) achieves highest Mean Average Specificity (MAS), a scalar measure for ROC curve, of 0.97 (0.96). Importantly, median predicted rank of catalytic residues is the lowest (best) for CSmetaPred_poc. Considering residues ranked ≤20 classified as true positive in binary classification, CSmetaPred_poc achieves prediction accuracy of 0.94 on CSAMAC dataset. Moreover, on the same dataset CSmetaPred_poc predicts all catalytic residues within top 20 ranks for ~73% of enzymes. Furthermore, benchmarking of prediction on comparative modelled structures showed that models result in better prediction than only sequence based predictions. 
These analyses suggest that CSmetaPred_poc is able to rank putative catalytic residues at lower (better) ranked positions, which can facilitate and expedite their experimental characterization. The benchmarking studies showed that employing a meta-approach to combine residue-level scores derived from well-known catalytic residue predictors can improve prediction accuracy as well as provide improved ranked positions of known catalytic residues. Hence, such predictions can assist experimentalists in prioritizing residues for mutational studies in their efforts to characterize catalytic residues. Both meta-predictors are available as a web server at: http://14.139.227.206/csmetapred/ .
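The score-averaging idea in this abstract (rank residues by the mean of normalized scores from four predictors) can be illustrated with a small sketch. This is not the CSmetaPred code: the abstract only says the scores are "normalized", so min-max scaling is an assumption here, and the predictor names, residue labels, and scores are all hypothetical.

```python
from statistics import mean


def min_max_normalize(scores):
    """Rescale one predictor's per-residue scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {res: 0.0 for res in scores}
    return {res: (s - lo) / (hi - lo) for res, s in scores.items()}


def meta_rank(predictor_scores):
    """Rank residues by the mean of their normalized scores (rank 1 = best)."""
    normalized = [min_max_normalize(s) for s in predictor_scores.values()]
    residues = normalized[0].keys()
    meta = {res: mean(n[res] for n in normalized) for res in residues}
    # Sort by descending meta score; residue label breaks exact ties.
    ordered = sorted(meta, key=lambda r: (-meta[r], r))
    return {res: rank for rank, res in enumerate(ordered, start=1)}


# Hypothetical raw scores from four placeholder predictors on different scales.
toy_scores = {
    "predictor_a": {"H57": 0.9, "D102": 0.7, "S195": 0.8, "G43": 0.1},
    "predictor_b": {"H57": 12.0, "D102": 9.0, "S195": 11.0, "G43": 2.0},
    "predictor_c": {"H57": 0.6, "D102": 0.5, "S195": 0.7, "G43": 0.2},
    "predictor_d": {"H57": 3.0, "D102": 2.5, "S195": 2.0, "G43": 0.5},
}
ranks = meta_rank(toy_scores)
```

Normalizing before averaging keeps a predictor with a wide score range from dominating the mean. A rank cutoff such as the "ranked ≤20" criterion mentioned in the abstract could then be applied to `ranks` directly.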
Benchmarking a Visual-Basic based multi-component one-dimensional reactive transport modeling tool
NASA Astrophysics Data System (ADS)
Torlapati, Jagadish; Prabhakar Clement, T.
2013-01-01
We present the details of a comprehensive numerical modeling tool, RT1D, which can be used for simulating biochemical and geochemical reactive transport problems. The code can be run within the standard Microsoft EXCEL Visual Basic platform, and it does not require any additional software tools. The code can be easily adapted by others for simulating different types of laboratory-scale reactive transport experiments. We illustrate the capabilities of the tool by solving five benchmark problems with varying levels of reaction complexity. These literature-derived benchmarks are used to highlight the versatility of the code for solving a variety of practical reactive transport problems. The benchmarks are described in detail to provide a comprehensive database, which can be used by model developers to test other numerical codes. The VBA code presented in the study is a practical tool that can be used by laboratory researchers for analyzing both batch and column datasets within an EXCEL platform.
A benchmarking procedure for PIGE related differential cross-sections
NASA Astrophysics Data System (ADS)
Axiotis, M.; Lagoyannis, A.; Fazinić, S.; Harissopulos, S.; Kokkoris, M.; Preketes-Sigalas, K.; Provatas, G.
2018-05-01
The application of standard-less PIGE requires the a priori knowledge of the differential cross section of the reaction used for the quantification of each detected light element. Towards this end, many datasets have been published in the last few years by several laboratories around the world. The discrepancies often found between different measured cross sections can be resolved by applying a rigorous benchmarking procedure through the measurement of thick target yields. Such a procedure is proposed in the present paper and is applied in the case of the 19F(p,p′γ)19F reaction.
Building Bridges Between Geoscience and Data Science through Benchmark Data Sets
NASA Astrophysics Data System (ADS)
Thompson, D. R.; Ebert-Uphoff, I.; Demir, I.; Gel, Y.; Hill, M. C.; Karpatne, A.; Güereque, M.; Kumar, V.; Cabral, E.; Smyth, P.
2017-12-01
The changing nature of observational field data demands richer and more meaningful collaboration between data scientists and geoscientists. Thus, among other efforts, the Working Group on Case Studies of the NSF-funded RCN on Intelligent Systems Research To Support Geosciences (IS-GEO) is developing a framework to strengthen such collaborations through the creation of benchmark datasets. Benchmark datasets provide an interface between disciplines without requiring extensive background knowledge. The goals are to create (1) a means for two-way communication between geoscience and data science researchers; (2) new collaborations, which may lead to new approaches for data analysis in the geosciences; and (3) a public, permanent repository of complex data sets, representative of geoscience problems, useful to coordinate efforts in research and education. The group identified 10 key elements and characteristics for ideal benchmarks. High impact: A problem with high potential impact. Active research area: A group of geoscientists should be eager to continue working on the topic. Challenge: The problem should be challenging for data scientists. Data science generality and versatility: It should stimulate development of new general and versatile data science methods. Rich information content: Ideally the data set provides stimulus for analysis at many different levels. Hierarchical problem statement: A hierarchy of suggested analysis tasks, from relatively straightforward to open-ended tasks. Means for evaluating success: Data scientists and geoscientists need means to evaluate whether the algorithms are successful and achieve intended purpose. Quick start guide: Introduction for data scientists on how to easily read the data to enable rapid initial data exploration. Geoscience context: Summary for data scientists of the specific data collection process, instruments used, any pre-processing and the science questions to be answered. 
Citability: A suitable identifier to facilitate tracking the use of the benchmark later on, e.g. allowing search engines to find all research papers using it. A first sample benchmark developed in collaboration with the Jet Propulsion Laboratory (JPL) deals with the automatic analysis of imaging spectrometer data to detect significant methane sources in the atmosphere.
Bruni, Renato; Cesarone, Francesco; Scozzari, Andrea; Tardella, Fabio
2016-09-01
A large number of portfolio selection models have appeared in the literature since the pioneering work of Markowitz. However, even when computational and empirical results are described, they are often hard to replicate and compare due to the unavailability of the datasets used in the experiments. We provide here several datasets for portfolio selection generated using real-world price values from several major stock markets. The datasets contain weekly return values, adjusted for dividends and for stock splits, which are cleaned from errors as much as possible. The datasets are available in different formats, and can be used as benchmarks for testing the performances of portfolio selection models and for comparing the efficiency of the algorithms used to solve them. We also provide, for these datasets, the portfolios obtained by several selection strategies based on Stochastic Dominance models (see "On Exact and Approximate Stochastic Dominance Strategies for Portfolio Selection" (Bruni et al. [2])). We believe that testing portfolio models on publicly available datasets greatly simplifies the comparison of the different portfolio selection strategies.
Benchmark Dose (BMD) modelling is a mathematical approach used to determine where a dose-response change begins to take place relative to controls following chemical exposure. BMDs are being increasingly applied in regulatory toxicology to determine points of departure. BMDExpres...
Ensemble reconstruction of spatio-temporal extreme low-flow events in France since 1871
NASA Astrophysics Data System (ADS)
Caillouet, Laurie; Vidal, Jean-Philippe; Sauquet, Eric; Devers, Alexandre; Graff, Benjamin
2017-06-01
The length of streamflow observations is generally limited to the last 50 years even in data-rich countries like France. It therefore offers too small a sample of extreme low-flow events to properly explore the long-term evolution of their characteristics and associated impacts. To overcome this limit, this work first presents a daily 140-year ensemble reconstructed streamflow dataset for a reference network of near-natural catchments in France. This dataset, called SCOPE Hydro (Spatially COherent Probabilistic Extended Hydrological dataset), is based on (1) a probabilistic precipitation, temperature, and reference evapotranspiration downscaling of the Twentieth Century Reanalysis over France, called SCOPE Climate, and (2) continuous hydrological modelling using SCOPE Climate as forcings over the whole period. This work then introduces tools for defining spatio-temporal extreme low-flow events. Extreme low-flow events are first locally defined through the sequent peak algorithm using a novel combination of a fixed threshold and a daily variable threshold. A dedicated spatial matching procedure is then established to identify spatio-temporal events across France. This procedure is furthermore adapted to the SCOPE Hydro 25-member ensemble to characterize in a probabilistic way unrecorded historical events at the national scale. Extreme low-flow events are described and compared in a spatially and temporally homogeneous way over 140 years on a large set of catchments. Results highlight well-known recent events like 1976 or 1989-1990, but also older and relatively forgotten ones like the 1878 and 1893 events. These results contribute to improving our knowledge of historical events and provide a selection of benchmark events for climate change adaptation purposes. Moreover, this study allows for further detailed analyses of the effect of climate variability and anthropogenic climate change on low-flow hydrology at the scale of France.
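The local event definition mentioned above relies on the sequent peak algorithm, which accumulates the deficit of streamflow below a threshold and closes an event when the deficit returns to zero. The following is a minimal sketch of that general algorithm. It takes one threshold value per day, which is an assumption made for illustration and does not reproduce the paper's specific combination of fixed and daily variable thresholds. The function name and the toy flow values are hypothetical.

```python
def sequent_peak_events(flows, thresholds):
    """Return (start_index, index_of_max_deficit, max_deficit) for each event.

    The running deficit follows d[t] = max(0, d[t-1] + threshold[t] - flow[t]).
    An event starts when the deficit becomes positive and ends when it is
    back at zero.
    """
    events = []
    deficit = 0.0
    start = None
    peak, peak_idx = 0.0, None
    for t, (q, q0) in enumerate(zip(flows, thresholds)):
        deficit = max(0.0, deficit + q0 - q)
        if deficit > 0.0:
            if start is None:
                # A new event begins at the first day with a positive deficit.
                start, peak, peak_idx = t, 0.0, t
            if deficit > peak:
                peak, peak_idx = deficit, t
        elif start is not None:
            # Deficit fully replenished: close the current event.
            events.append((start, peak_idx, peak))
            start = None
    if start is not None:
        # Series ended while an event was still open.
        events.append((start, peak_idx, peak))
    return events


# Toy daily flows against a constant threshold of 4 (arbitrary units).
toy_flows = [5, 3, 2, 4, 8, 6, 1, 7]
toy_events = sequent_peak_events(toy_flows, [4] * len(toy_flows))
```

Because the deficit is cumulative, short recoveries above the threshold do not split an event unless they fully replenish the accumulated deficit.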
A deep learning pipeline for Indian dance style classification
NASA Astrophysics Data System (ADS)
Dewan, Swati; Agarwal, Shubham; Singh, Navjyoti
2018-04-01
In this paper, we address the problem of dance style classification to classify Indian dance or any dance in general. We propose a 3-step deep learning pipeline. First, we extract 14 essential joint locations of the dancer from each video frame, which lets us derive any body region location within the frame. We use this in the second step, which forms the main part of our pipeline. Here, we divide the dancer into regions of important motion in each video frame. We then extract patches centered at these regions. The main discriminative motion is captured in these patches. We stack the features from all such patches of a frame into a single vector and form our hierarchical dance pose descriptor. Finally, in the third step, we build a high-level representation of the dance video using the hierarchical descriptors and train it using a Recurrent Neural Network (RNN) for classification. Our novelty also lies in the way we use multiple representations for a single video. This helps us to: (1) overcome the RNN limitation of learning small sequences over big sequences such as dance; (2) extract more data from the available dataset for effective deep learning by training multiple representations. Our contributions in this paper are threefold: (1) we provide a deep learning pipeline for classification of any form of dance; (2) we prove that a segmented representation of a dance video works well with sequence learning techniques for recognition purposes; (3) we extend and refine the ICD dataset and provide a new dataset for evaluation of dance. Our model performs comparably to, or in some cases better than, the state of the art on action recognition benchmarks.
Tahir, Muhammad; Jan, Bismillah; Hayat, Maqsood; Shah, Shakir Ullah; Amin, Muhammad
2018-04-01
Discriminative and informative feature extraction is the core requirement for accurate and efficient classification of protein subcellular localization images so that drug development could be more effective. The objective of this paper is to propose a novel modification in the Threshold Adjacency Statistics technique and enhance its discriminative power. In this work, we utilized Threshold Adjacency Statistics from a novel perspective to enhance its discrimination power and efficiency. In this connection, we utilized seven threshold ranges to produce seven distinct feature spaces, which are then used to train seven SVMs. The final prediction is obtained through the majority voting scheme. The proposed ETAS-SubLoc system is tested on two benchmark datasets using 5-fold cross-validation technique. We observed that our proposed novel utilization of TAS technique has improved the discriminative power of the classifier. The ETAS-SubLoc system has achieved 99.2% accuracy, 99.3% sensitivity and 99.1% specificity for Endogenous dataset outperforming the classical Threshold Adjacency Statistics technique. Similarly, 91.8% accuracy, 96.3% sensitivity and 91.6% specificity values are achieved for Transfected dataset. Simulation results validated the effectiveness of ETAS-SubLoc that provides superior prediction performance compared to the existing technique. The proposed methodology aims at providing support to pharmaceutical industry as well as research community towards better drug designing and innovation in the fields of bioinformatics and computational biology. The implementation code for replicating the experiments presented in this paper is available at: https://drive.google.com/file/d/0B7IyGPObWbSqRTRMcXI2bG5CZWs/view?usp=sharing. Copyright © 2018 Elsevier B.V. All rights reserved.
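The ensemble step described above (seven feature spaces, seven SVMs, one majority vote) can be sketched independently of the feature extraction. The snippet below is only an illustration of majority voting: the classifiers are stand-in callables rather than trained SVMs, the tie-breaking rule is an assumption, and all names and labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


def ensemble_predict(classifiers, feature_sets):
    """Feed each classifier its own feature vector and combine the votes."""
    votes = [clf(features) for clf, features in zip(classifiers, feature_sets)]
    return majority_vote(votes)


# Seven stand-in "classifiers", one per hypothetical threshold range.
# Each simply returns a fixed label, in place of a trained SVM's prediction.
fixed_labels = ["nucleus", "golgi", "nucleus", "er", "nucleus", "golgi", "nucleus"]
toy_classifiers = [lambda features, label=label: label for label in fixed_labels]
toy_features = [[0.0]] * 7  # placeholder feature vectors, one per feature space

predicted = ensemble_predict(toy_classifiers, toy_features)
```

Using an odd number of voters, as in the seven-classifier design, avoids ties in two-class problems. With more classes, ties remain possible, which is why the sketch fixes an explicit tie-breaking rule.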
Comprehensive comparison of gap filling techniques for eddy covariance net carbon fluxes
NASA Astrophysics Data System (ADS)
Moffat, A. M.; Papale, D.; Reichstein, M.; Hollinger, D. Y.; Richardson, A. D.; Barr, A. G.; Beckstein, C.; Braswell, B. H.; Churkina, G.; Desai, A. R.; Falge, E.; Gove, J. H.; Heimann, M.; Hui, D.; Jarvis, A. J.; Kattge, J.; Noormets, A.; Stauch, V. J.
2007-12-01
This work reviews fifteen techniques for estimating missing values of net ecosystem CO2 exchange (NEE) in eddy covariance time series and evaluates their performance for different artificial gap scenarios based on a set of ten benchmark datasets from six forested sites in Europe. The goal of gap filling is the reproduction of the NEE time series and hence this present work focuses on estimating missing NEE values, not on editing or the removal of suspect values in these time series due to systematic errors in the measurements (e.g. nighttime flux, advection). The gap filling was examined by generating fifty secondary datasets with artificial gaps (ranging in length from single half-hours to twelve consecutive days) for each benchmark dataset and evaluating the performance with a variety of statistical metrics. The performance of the gap filling varied among sites and depended on the level of aggregation (native half-hourly time step versus daily), long gaps were more difficult to fill than short gaps, and differences among the techniques were more pronounced during the day than at night. The non-linear regression techniques (NLRs), the look-up table (LUT), marginal distribution sampling (MDS), and the semi-parametric model (SPM) generally showed good overall performance. The artificial neural network based techniques (ANNs) were generally, if only slightly, superior to the other techniques. The simple interpolation technique of mean diurnal variation (MDV) showed a moderate but consistent performance. Several sophisticated techniques, the dual unscented Kalman filter (UKF), the multiple imputation method (MIM), the terrestrial biosphere model (BETHY), but also one of the ANNs and one of the NLRs showed high biases which resulted in a low reliability of the annual sums, indicating that additional development might be needed. 
An uncertainty analysis comparing the estimated random error in the ten benchmark datasets with the artificial gap residuals suggested that the techniques are already at or very close to the noise limit of the measurements. Based on the techniques and site data examined here, the effect of gap filling on the annual sums of NEE is modest, with most techniques falling within a range of ±25 g C m⁻² y⁻¹.
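The evaluation design described above (punch artificial gaps into a complete series, fill them, compare against the withheld values) can be sketched using the simplest technique named in the abstract, mean diurnal variation (MDV). This is a minimal illustration rather than the study's implementation: the window length, the function name, and the toy values are assumptions.

```python
def fill_gaps_mdv(series, steps_per_day=48, window_days=7):
    """Fill missing values (None) with the mean of valid values observed at
    the same time of day on neighbouring days within +/- window_days."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        neighbours = []
        for d in range(-window_days, window_days + 1):
            j = i + d * steps_per_day
            if d != 0 and 0 <= j < len(series) and series[j] is not None:
                neighbours.append(series[j])
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Toy series with 4 time steps per day over 3 days (arbitrary flux units).
truth = [1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6]

# Create an artificial gap at index 5, then fill it and score the residual.
gappy = list(truth)
gappy[5] = None
filled = fill_gaps_mdv(gappy, steps_per_day=4, window_days=1)
residual = filled[5] - truth[5]
```

Repeating this over many gap positions and lengths yields the residual distributions from which metrics such as bias or RMSE can be computed. A gap with no valid neighbour inside the window is left as `None`.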
A PDB-wide, evolution-based assessment of protein-protein interfaces.
Baskaran, Kumaran; Duarte, Jose M; Biyani, Nikhil; Bliven, Spencer; Capitani, Guido
2014-10-18
Thanks to the growth in sequence and structure databases, more than 50 million sequences are now available in UniProt and 100,000 structures in the PDB. Rich information about protein-protein interfaces can be obtained by a comprehensive study of protein contacts in the PDB, their sequence conservation and geometric features. An automated computational pipeline was developed to run our Evolutionary Protein-Protein Interface Classifier (EPPIC) software on the entire PDB and store the results in a relational database, currently containing > 800,000 interfaces. This allows the analysis of interface data on a PDB-wide scale. Two large benchmark datasets of biological interfaces and crystal contacts, each containing about 3000 entries, were automatically generated based on criteria thought to be strong indicators of interface type. The BioMany set of biological interfaces includes NMR dimers solved as crystal structures and interfaces that are preserved across diverse crystal forms, as catalogued by the Protein Common Interface Database (ProtCID) from Xu and Dunbrack. The second dataset, XtalMany, is derived from interfaces that would lead to infinite assemblies and are therefore crystal contacts. BioMany and XtalMany were used to benchmark the EPPIC approach. The performance of EPPIC was also compared to classifications from the Protein Interfaces, Surfaces, and Assemblies (PISA) program on a PDB-wide scale, finding that the two approaches give the same call in about 88% of PDB interfaces. By comparing our safest predictions to the PDB author annotations, we provide a lower-bound estimate of the error rate of biological unit annotations in the PDB. Additionally, we developed a PyMOL plugin for direct download and easy visualization of EPPIC interfaces for any PDB entry. Both the datasets and the PyMOL plugin are available at http://www.eppic-web.org/ewui/#downloads. 
Our computational pipeline allows us to analyze protein-protein contacts and their sequence conservation across the entire PDB. Two new benchmark datasets are provided, which are over an order of magnitude larger than existing manually curated ones. These tools enable the comprehensive study of several aspects of protein-protein contacts in the PDB and represent a basis for future, even larger scale studies of protein-protein interactions.
Ramus, Claire; Hovasse, Agnès; Marcellin, Marlène; Hesse, Anne-Marie; Mouton-Barbosa, Emmanuelle; Bouyssié, David; Vaca, Sebastian; Carapito, Christine; Chaoui, Karima; Bruley, Christophe; Garin, Jérôme; Cianférani, Sarah; Ferro, Myriam; Van Dorssaeler, Alain; Burlet-Schiltz, Odile; Schaeffer, Christine; Couté, Yohann; Gonzalez de Peredo, Anne
2016-01-30
Proteomic workflows based on nanoLC-MS/MS data-dependent-acquisition analysis have progressed tremendously in recent years. High-resolution and fast sequencing instruments have enabled the use of label-free quantitative methods, based either on spectral counting or on MS signal analysis, which appear as an attractive way to analyze differential protein expression in complex biological samples. However, the computational processing of the data for label-free quantification still remains a challenge. Here, we used a proteomic standard composed of an equimolar mixture of 48 human proteins (Sigma UPS1) spiked at different concentrations into a background of yeast cell lysate to benchmark several label-free quantitative workflows, involving different software packages developed in recent years. This experimental design allowed us to finely assess their performances in terms of sensitivity and false discovery rate, by measuring the number of true and false positives (UPS1 and yeast background proteins, respectively, found as differential). The spiked standard dataset has been deposited to the ProteomeXchange repository with the identifier PXD001819 and can be used to benchmark other label-free workflows, adjust software parameter settings, improve algorithms for extraction of the quantitative metrics from raw MS data, or evaluate downstream statistical methods. Bioinformatic pipelines for label-free quantitative analysis must be objectively evaluated in their ability to detect variant proteins with good sensitivity and low false discovery rate in large-scale proteomic studies. This can be done through the use of complex spiked samples, for which the "ground truth" of variant proteins is known, allowing a statistical evaluation of the performances of the data processing workflow. 
We provide here such a controlled standard dataset and used it to evaluate the performances of several label-free bioinformatics tools (including MaxQuant, Skyline, MFPaQ, IRMa-hEIDI and Scaffold) in different workflows, for detection of variant proteins with different absolute expression levels and fold change values. The dataset presented here can be useful for tuning software tool parameters, and also testing new algorithms for label-free quantitative analysis, or for evaluation of downstream statistical methods. Copyright © 2015 Elsevier B.V. All rights reserved.
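The scoring logic of a spiked-standard benchmark follows directly from the design above: any spiked protein called differential is a true positive, and any background protein called differential is a false positive. The sketch below shows that bookkeeping only. The function name and protein identifiers are hypothetical, and it does not reproduce any of the evaluated workflows.

```python
def evaluate_differential_calls(called, spiked):
    """Score a list of proteins called differential against the known spike-in set."""
    called, spiked = set(called), set(spiked)
    true_positives = len(called & spiked)
    false_positives = len(called - spiked)
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        # Fraction of spiked proteins that the workflow recovered.
        "sensitivity": true_positives / len(spiked) if spiked else 0.0,
        # Fraction of calls that are background proteins (observed FDR).
        "false_discovery_proportion": false_positives / len(called) if called else 0.0,
    }


# Hypothetical identifiers: four spiked proteins, one background protein called.
spiked_set = {"UPS_A", "UPS_B", "UPS_C", "UPS_D"}
called_set = {"UPS_A", "UPS_B", "UPS_C", "YEAST_X"}
result = evaluate_differential_calls(called_set, spiked_set)
```

Running the same function on the output of each workflow, or at each parameter setting, gives directly comparable sensitivity and false discovery figures.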
A benchmark for comparison of dental radiography analysis algorithms.
Wang, Ching-Wei; Huang, Cheng-Ta; Lee, Jia-Hong; Li, Chung-Hsing; Chang, Sheng-Wei; Siao, Ming-Jhih; Lai, Tat-Ming; Ibragimov, Bulat; Vrtovec, Tomaž; Ronneberger, Olaf; Fischer, Philipp; Cootes, Tim F; Lindner, Claudia
2016-07-01
Dental radiography plays an important role in clinical diagnosis, treatment and surgery. In recent years, efforts have been made on developing computerized dental X-ray image analysis systems for clinical usages. A novel framework for objective evaluation of automatic dental radiography analysis algorithms has been established under the auspices of the IEEE International Symposium on Biomedical Imaging 2015 Bitewing Radiography Caries Detection Challenge and Cephalometric X-ray Image Analysis Challenge. In this article, we present the datasets, methods and results of the challenge and lay down the principles for future uses of this benchmark. The main contributions of the challenge include the creation of the dental anatomy data repository of bitewing radiographs, the creation of the anatomical abnormality classification data repository of cephalometric radiographs, and the definition of objective quantitative evaluation for comparison and ranking of the algorithms. With this benchmark, seven automatic methods for analysing cephalometric X-ray image and two automatic methods for detecting bitewing radiography caries have been compared, and detailed quantitative evaluation results are presented in this paper. Based on the quantitative evaluation results, we believe automatic dental radiography analysis is still a challenging and unsolved problem. The datasets and the evaluation software will be made available to the research community, further encouraging future developments in this field. (http://www-o.ntust.edu.tw/~cweiwang/ISBI2015/). Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
Model evaluation using a community benchmarking system for land surface models
NASA Astrophysics Data System (ADS)
Mu, M.; Hoffman, F. M.; Lawrence, D. M.; Riley, W. J.; Keppel-Aleks, G.; Kluzek, E. B.; Koven, C. D.; Randerson, J. T.
2014-12-01
Evaluation of atmosphere, ocean, sea ice, and land surface models is an important step in identifying deficiencies in Earth system models and developing improved estimates of future change. For the land surface and carbon cycle, the design of an open-source system has been an important objective of the International Land Model Benchmarking (ILAMB) project. Here we evaluated CMIP5 and CLM models using a benchmarking system that enables users to specify models, data sets, and scoring systems so that results can be tailored to specific model intercomparison projects. Our scoring system used information from four different aspects of global datasets, including climatological mean spatial patterns, seasonal cycle dynamics, interannual variability, and long-term trends. Variable-to-variable comparisons enable investigation of the mechanistic underpinnings of model behavior, and allow for some control of biases in model drivers. Graphics modules allow users to evaluate model performance at local, regional, and global scales. Use of modular structures makes it relatively easy for users to add new variables, diagnostic metrics, benchmarking datasets, or model simulations. Diagnostic results are automatically organized into HTML files, so users can conveniently share results with colleagues. We used this system to evaluate atmospheric carbon dioxide, burned area, global biomass and soil carbon stocks, net ecosystem exchange, gross primary production, ecosystem respiration, terrestrial water storage, evapotranspiration, and surface radiation from CMIP5 historical and ESM historical simulations. We found that the multi-model mean often performed better than many of the individual models for most variables. We plan to publicly release a stable version of the software during fall of 2014 that has land surface, carbon cycle, hydrology, radiation and energy cycle components.
ERIC Educational Resources Information Center
Cullen, R. B.
A recent study of work skill competitiveness and overall national competitiveness worldwide revealed that 17 countries are more competitive than Australia. Some countries have a relative resource advantage and will be able to extend access to education and training more effectively than Australia will, and some countries have targeted education…
NASA Astrophysics Data System (ADS)
Giebel, Gregor; Cline, Joel; Frank, Helmut; Shaw, Will; Pinson, Pierre; Hodge, Bri-Mathias; Kariniotakis, Georges; Sempreviva, Anna Maria; Draxl, Caroline
2017-04-01
Wind power forecasts have been used operationally for over 20 years. Despite this fact, there are still several possibilities to improve the forecasts, both from the weather prediction side and from the usage of the forecasts. The new International Energy Agency (IEA) Task on Wind Power Forecasting tries to organise international collaboration among national weather centres with an interest and/or large projects on wind forecast improvements (NOAA, DWD, UK MetOffice, …) and operational forecasters and forecast users. The Task is divided into three work packages: Firstly, a collaboration on the improvement of the scientific basis for the wind predictions themselves. This includes numerical weather prediction model physics, but also widely distributed information on accessible datasets for verification. Secondly, we will be aiming at an international pre-standard (an IEA Recommended Practice) on benchmarking and comparing wind power forecasts, including probabilistic forecasts aiming at industry and forecasters alike. This WP will also organise benchmarks, in cooperation with the IEA Task WakeBench. Thirdly, we will be engaging end users aiming at dissemination of the best practice in the usage of wind power predictions, especially probabilistic ones. The Operating Agent is Gregor Giebel of DTU, Co-Operating Agent is Joel Cline of the US Department of Energy. Collaboration in the task is solicited from everyone interested in the forecasting business. We will collaborate with IEA Task 31 Wakebench, which developed the Windbench benchmarking platform, which this task will use for forecasting benchmarks. The task runs for three years, 2016-2018. 
Main deliverables are an up-to-date list of current projects and main project results, including datasets which can be used by researchers around the world to improve their own models, an IEA Recommended Practice on performance evaluation of probabilistic forecasts, a position paper regarding the use of probabilistic forecasts, and one or more benchmark studies implemented on the Windbench platform hosted at CENER. Additionally, spreading of relevant information in both the forecasters and the users community is paramount. The poster also shows the work done in the first half of the Task, e.g. the collection of available datasets and the learnings from a public workshop on 9 June in Barcelona on Experiences with the Use of Forecasts and Gaps in Research. Participation is open for all interested parties in member states of the IEA Annex on Wind Power, see ieawind.org for the up-to-date list. For collaboration, please contact the author (grgi@dtu.dk).
Bottini, Silvia; Hamouda-Tekaya, Nedra; Tanasa, Bogdan; Zaragosi, Laure-Emmanuelle; Grandjean, Valerie; Repetto, Emanuela; Trabucchi, Michele
2017-05-19
Experimental evidence indicates that about 60% of miRNA-binding activity does not follow the canonical rule about the seed matching between miRNA and target mRNAs, but rather a non-canonical miRNA targeting activity outside the seed or with seed-like motifs. Here, we propose a new unbiased method to identify canonical and non-canonical miRNA-binding sites from peaks identified by Ago2 Cross-Linked ImmunoPrecipitation associated to high-throughput sequencing (CLIP-seq). Since the quality of peaks is of pivotal importance for the final output of the proposed method, we provide a comprehensive benchmarking of four peak detection programs, namely CIMS, PIPE-CLIP, Piranha and Pyicoclip, on four publicly available Ago2-HITS-CLIP datasets and one unpublished in-house Ago2-dataset in stem cells. We measured the sensitivity, the specificity and the position accuracy toward miRNA binding sites identification, and the agreement with TargetScan. Secondly, we developed a new pipeline, called miRBShunter, to identify canonical and non-canonical miRNA-binding sites based on de novo motif identification from Ago2 peaks and prediction of miRNA::RNA heteroduplexes. miRBShunter was tested and experimentally validated on the in-house Ago2-dataset and on an Ago2-PAR-CLIP dataset in human stem cells. Overall, we provide guidelines to choose a suitable peak detection program and a new method for miRNA-target identification. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Bottini, Silvia; Hamouda-Tekaya, Nedra; Tanasa, Bogdan; Zaragosi, Laure-Emmanuelle; Grandjean, Valerie; Repetto, Emanuela
2017-01-01
Experimental evidence indicates that about 60% of miRNA-binding activity does not follow the canonical rule about the seed matching between miRNA and target mRNAs, but rather a non-canonical miRNA targeting activity outside the seed or with seed-like motifs. Here, we propose a new unbiased method to identify canonical and non-canonical miRNA-binding sites from peaks identified by Ago2 Cross-Linked ImmunoPrecipitation associated to high-throughput sequencing (CLIP-seq). Since the quality of peaks is of pivotal importance for the final output of the proposed method, we provide a comprehensive benchmarking of four peak detection programs, namely CIMS, PIPE-CLIP, Piranha and Pyicoclip, on four publicly available Ago2-HITS-CLIP datasets and one unpublished in-house Ago2-dataset in stem cells. We measured the sensitivity, the specificity and the position accuracy toward miRNA binding sites identification, and the agreement with TargetScan. Secondly, we developed a new pipeline, called miRBShunter, to identify canonical and non-canonical miRNA-binding sites based on de novo motif identification from Ago2 peaks and prediction of miRNA::RNA heteroduplexes. miRBShunter was tested and experimentally validated on the in-house Ago2-dataset and on an Ago2-PAR-CLIP dataset in human stem cells. Overall, we provide guidelines to choose a suitable peak detection program and a new method for miRNA-target identification. PMID:28108660
NASA Astrophysics Data System (ADS)
Cescatti, A.; Duveiller, G.; Hooker, J.
2017-12-01
Changing vegetation cover not only affects the atmospheric concentration of greenhouse gases but also alters the radiative and non-radiative properties of the surface. The result of competing biophysical processes on Earth's surface energy balance varies spatially and seasonally, and can lead to warming or cooling depending on the specific vegetation change and on the background climate. To date these effects are not accounted for in land-based climate policies because of the complexity of the phenomena, contrasting model predictions and the lack of global data-driven assessments. To overcome the limitations of available observation-based diagnostics and of the on-going model inter-comparison, here we present a new benchmarking dataset derived from satellite remote sensing. This global dataset provides the potential changes induced by multiple vegetation transitions on the single terms of the surface energy balance. We used this dataset for two major goals: 1) Quantify the impact of actual vegetation changes that occurred during the decade 2000-2010, showing the overwhelming role of tropical deforestation in warming the surface by reducing evapotranspiration despite the concurrent brightening of the Earth. 2) Benchmark a series of ESMs against data-driven metrics of the land cover change impacts on the various terms of the surface energy budget and on the surface temperature. We anticipate that the dataset could be also used to evaluate future scenarios of land cover change and to develop the monitoring, reporting and verification guidelines required for the implementation of mitigation plans that account for biophysical land processes.
How to Advance TPC Benchmarks with Dependability Aspects
NASA Astrophysics Data System (ADS)
Almeida, Raquel; Poess, Meikel; Nambiar, Raghunath; Patil, Indira; Vieira, Marco
Transactional systems are the core of the information systems of most organizations. Although there is general acknowledgement that failures in these systems often entail significant impact both on the proceeds and reputation of companies, the benchmarks developed and managed by the Transaction Processing Performance Council (TPC) still maintain their focus on reporting bare performance. Each TPC benchmark has to pass a list of dependability-related tests (to verify ACID properties), but not all benchmarks require measuring their performances. While TPC-E measures the recovery time of some system failures, TPC-H and TPC-C only require functional correctness of such recovery. Consequently, systems used in TPC benchmarks are tuned mostly for performance. In this paper we argue that nowadays systems should be tuned for a more comprehensive suite of dependability tests, and that a dependability metric should be part of TPC benchmark publications. The paper discusses WHY and HOW this can be achieved. Two approaches are introduced and discussed: augmenting each TPC benchmark in a customized way, by extending each specification individually; and pursuing a more unified approach, defining a generic specification that could be adjoined to any TPC benchmark.
Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models
Stephens, Zachary D.; Hudson, Matthew E.; Mainzer, Liudmila S.; Taschuk, Morgan; Weber, Matthew R.; Iyer, Ravishankar K.
2016-01-01
An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads. PMID:27893777
Mendenhall, Jeffrey; Meiler, Jens
2016-02-01
Dropout is an Artificial Neural Network (ANN) training technique that has been shown to improve ANN performance across canonical machine learning (ML) datasets. Quantitative Structure Activity Relationship (QSAR) datasets used to relate chemical structure to biological activity in Ligand-Based Computer-Aided Drug Discovery pose unique challenges for ML techniques, such as heavily biased dataset composition, and relatively large number of descriptors relative to the number of actives. To test the hypothesis that dropout also improves QSAR ANNs, we conduct a benchmark on nine large QSAR datasets. Use of dropout improved both enrichment false positive rate and log-scaled area under the receiver-operating characteristic curve (logAUC) by 22-46 % over conventional ANN implementations. Optimal dropout rates are found to be a function of the signal-to-noise ratio of the descriptor set, and relatively independent of the dataset. Dropout ANNs with 2D and 3D autocorrelation descriptors outperform conventional ANNs as well as optimized fingerprint similarity search methods.
Mendenhall, Jeffrey; Meiler, Jens
2016-01-01
Dropout is an Artificial Neural Network (ANN) training technique that has been shown to improve ANN performance across canonical machine learning (ML) datasets. Quantitative Structure Activity Relationship (QSAR) datasets used to relate chemical structure to biological activity in Ligand-Based Computer-Aided Drug Discovery (LB-CADD) pose unique challenges for ML techniques, such as heavily biased dataset composition, and relatively large number of descriptors relative to the number of actives. To test the hypothesis that dropout also improves QSAR ANNs, we conduct a benchmark on nine large QSAR datasets. Use of dropout improved both Enrichment false positive rate (FPR) and log-scaled area under the receiver-operating characteristic curve (logAUC) by 22–46% over conventional ANN implementations. Optimal dropout rates are found to be a function of the signal-to-noise ratio of the descriptor set, and relatively independent of the dataset. Dropout ANNs with 2D and 3D autocorrelation descriptors outperform conventional ANNs as well as optimized fingerprint similarity search methods. PMID:26830599
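As a rough illustration of the training technique named in the abstract above, the following sketch applies dropout to a list of hidden-layer activations. It assumes the common "inverted dropout" formulation, in which surviving activations are rescaled during training; the abstract does not say which variant the authors used, and the function name `dropout` and its parameters are hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout (an assumed variant, not taken from the paper).

    Each activation is zeroed with probability `rate`; survivors are
    divided by (1 - rate) so the expected value stays unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At prediction time the layer is passed through untouched.
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)
    hidden = [1.0] * 10
    print(dropout(hidden, rate=0.5, rng=rng))
```

The dropout rate is left as a free parameter here because the abstract reports that the optimal rate depends on the signal-to-noise ratio of the descriptor set.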
Benchmarking undedicated cloud computing providers for analysis of genomic datasets.
Yazar, Seyhan; Gooden, George E C; Mackey, David A; Hewitt, Alex W
2014-01-01
A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E.coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5-78.2) for E.coli and 53.5% (95% CI: 34.4-72.6) for human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with the costs on EMR being 257.3% (95% CI: 211.5-303.1) and 173.9% (95% CI: 134.6-213.1) more expensive for E.coli and human assemblies respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present available ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.
Benchmarking Undedicated Cloud Computing Providers for Analysis of Genomic Datasets
Yazar, Seyhan; Gooden, George E. C.; Mackey, David A.; Hewitt, Alex W.
2014-01-01
A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E.coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5–78.2) for E.coli and 53.5% (95% CI: 34.4–72.6) for human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with the costs on EMR being 257.3% (95% CI: 211.5–303.1) and 173.9% (95% CI: 134.6–213.1) more expensive for E.coli and human assemblies respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present available ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE. PMID:25247298
The US EPA’s N-Methyl Carbamate (NMC) Cumulative Risk assessment was based on the effect on acetylcholine esterase (AChE) activity of exposure to 10 NMC pesticides through dietary, drinking water, and residential exposures, assuming the effects of joint exposure to NMCs is dose-...
Divide and Conquer-Based 1D CNN Human Activity Recognition Using Test Data Sharpening
Yoon, Sang Min
2018-01-01
Human Activity Recognition (HAR) aims to identify the actions performed by humans using signals collected from various sensors embedded in mobile devices. In recent years, deep learning techniques have further improved HAR performance on several benchmark datasets. In this paper, we propose a one-dimensional Convolutional Neural Network (1D CNN) for HAR that employs divide and conquer-based classifier learning coupled with test data sharpening. Our approach leverages two-stage learning of multiple 1D CNN models; we first build a binary classifier for recognizing abstract activities, and then build two multi-class 1D CNN models for recognizing individual activities. We then introduce test data sharpening during the prediction phase to further improve the activity recognition accuracy. While numerous studies have explored the benefits of activity signal denoising for HAR, few have examined the effect of test data sharpening for HAR. We evaluate the effectiveness of our approach on two popular HAR benchmark datasets, and show that our approach outperforms both the two-stage 1D CNN-only method and other state-of-the-art approaches. PMID:29614767
Divide and Conquer-Based 1D CNN Human Activity Recognition Using Test Data Sharpening.
Cho, Heeryon; Yoon, Sang Min
2018-04-01
Human Activity Recognition (HAR) aims to identify the actions performed by humans using signals collected from various sensors embedded in mobile devices. In recent years, deep learning techniques have further improved HAR performance on several benchmark datasets. In this paper, we propose a one-dimensional Convolutional Neural Network (1D CNN) for HAR that employs divide and conquer-based classifier learning coupled with test data sharpening. Our approach leverages two-stage learning of multiple 1D CNN models; we first build a binary classifier for recognizing abstract activities, and then build two multi-class 1D CNN models for recognizing individual activities. We then introduce test data sharpening during the prediction phase to further improve the activity recognition accuracy. While numerous studies have explored the benefits of activity signal denoising for HAR, few have examined the effect of test data sharpening for HAR. We evaluate the effectiveness of our approach on two popular HAR benchmark datasets, and show that our approach outperforms both the two-stage 1D CNN-only method and other state-of-the-art approaches.
dynGENIE3: dynamical GENIE3 for the inference of gene networks from time series expression data.
Huynh-Thu, Vân Anh; Geurts, Pierre
2018-02-21
The elucidation of gene regulatory networks is one of the major challenges of systems biology. Measurements about genes that are exploited by network inference methods are typically available either in the form of steady-state expression vectors or time series expression data. In our previous work, we proposed the GENIE3 method that exploits variable importance scores derived from Random forests to identify the regulators of each target gene. This method provided state-of-the-art performance on several benchmark datasets, but it could however not specifically be applied to time series expression data. We propose here an adaptation of the GENIE3 method, called dynamical GENIE3 (dynGENIE3), for handling both time series and steady-state expression data. The proposed method is evaluated extensively on the artificial DREAM4 benchmarks and on three real time series expression datasets. Although dynGENIE3 does not systematically yield the best performance on each and every network, it is competitive with diverse methods from the literature, while preserving the main advantages of GENIE3 in terms of scalability.
A Bayesian approach to traffic light detection and mapping
NASA Astrophysics Data System (ADS)
Hosseinyalamdary, Siavash; Yilmaz, Alper
2017-03-01
Automatic traffic light detection and mapping is an open research problem. Traffic lights vary in color, shape, geolocation, activation pattern, and installation, which complicates their automated detection. In addition, the image of the traffic lights may be noisy, overexposed, underexposed, or occluded. In order to address this problem, we propose a Bayesian inference framework to detect and map traffic lights. In addition to the spatio-temporal consistency constraint, traffic light characteristics such as color, shape and height are shown to further improve the accuracy of the proposed approach. The proposed approach has been evaluated on two benchmark datasets and has been shown to outperform earlier studies. The results show that the precision and recall rates for the KITTI benchmark are 95.78% and 92.95% respectively, and the precision and recall rates for the LARA benchmark are 98.66% and 94.65%.
Benchmark of Machine Learning Methods for Classification of a SENTINEL-2 Image
NASA Astrophysics Data System (ADS)
Pirotti, F.; Sunar, F.; Piragnolo, M.
2016-06-01
Thanks mainly to ESA and USGS, a large bulk of free images of the Earth is readily available nowadays. One of the main goals of remote sensing is to label images according to a set of semantic categories, i.e. image classification. This is a very challenging issue since land cover of a specific class may present a large spatial and spectral variability and objects may appear at different scales and orientations. In this study, we report the results of benchmarking 9 machine learning algorithms tested for accuracy and speed in training and classification of land-cover classes in a Sentinel-2 dataset. The following machine learning methods (MLM) have been tested: linear discriminant analysis, k-nearest neighbour, random forests, support vector machines, multi layered perceptron, multi layered perceptron ensemble, ctree, boosting, logarithmic regression. The validation is carried out using a control dataset which consists of an independent classification in 11 land-cover classes of an area of about 60 km², obtained by manual visual interpretation of high resolution images (20 cm ground sampling distance) by experts. In this study five out of the eleven classes are used since the others have too few samples (pixels) for testing and validating subsets. The classes used are the following: (i) urban (ii) sowable areas (iii) water (iv) tree plantations (v) grasslands. Validation is carried out using three different approaches: (i) using pixels from the training dataset (train), (ii) using pixels from the training dataset and applying cross-validation with the k-fold method (kfold) and (iii) using all pixels from the control dataset. Five accuracy indices are calculated for the comparison between the values predicted with each model and control values over three sets of data: the training dataset (train), the whole control dataset (full) and with k-fold cross-validation (kfold) with ten folds.
Validation of predictions over the whole dataset (full) shows that the random forests method scores highest, with a kappa index ranging from 0.55 to 0.42 for the largest and smallest numbers of training pixels, respectively. The two neural networks (multi layered perceptron and its ensemble) and the support vector machines (with the default radial basis function kernel) follow closely with comparable performance.
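The kappa index reported above measures agreement between predicted and control labels after discounting the agreement expected by chance. The sketch below computes Cohen's kappa for two label lists. It assumes the study used the standard unweighted form, which the abstract does not state, and the function name `cohen_kappa` and the toy labels are illustrative only.

```python
from collections import Counter


def cohen_kappa(reference, predicted):
    """Unweighted Cohen's kappa between two equally long label sequences."""
    n = len(reference)
    if n == 0 or n != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    # Observed agreement: fraction of pixels with identical labels.
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    # Chance agreement: product of the marginal class frequencies.
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if expected == 1.0:
        # Both sequences contain one identical class; treated as perfect
        # agreement by convention here.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    control = ["urban", "urban", "water", "water"]
    model = ["urban", "water", "water", "water"]
    print(cohen_kappa(control, model))
```

In this toy case three of four labels agree (0.75) while chance agreement is 0.5, so kappa is 0.5, which is lower than the raw accuracy.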
A Multi-Sensor Fusion MAV State Estimation from Long-Range Stereo, IMU, GPS and Barometric Sensors
Song, Yu; Nuske, Stephen; Scherer, Sebastian
2016-01-01
State estimation is the most critical capability for MAV (Micro-Aerial Vehicle) localization, autonomous obstacle avoidance, robust flight control and 3D environmental mapping. There are three main challenges for MAV state estimation: (1) it should deal with aggressive 6 DOF (Degree Of Freedom) motion; (2) it should be robust to intermittent GPS (Global Positioning System) (even GPS-denied) situations; (3) it should work well both for low- and high-altitude flight. In this paper, we present a state estimation technique that fuses long-range stereo visual odometry, GPS, barometric and IMU (Inertial Measurement Unit) measurements. The new estimation system has two main parts. The first is a stochastic cloning EKF (Extended Kalman Filter) estimator that loosely fuses both absolute state measurements (GPS, barometer) and relative state measurements (IMU, visual odometry); it is derived and discussed in detail. The second is a long-range stereo visual odometry, proposed for high-altitude MAV odometry calculation, that uses both multi-view stereo triangulation and a multi-view stereo inverse depth filter. The odometry takes the EKF information (IMU integral) for robust camera pose tracking and image feature matching, and the stereo odometry output serves as the relative measurements for the update of the state estimation. Experimental results on a benchmark dataset and our real flight dataset show the effectiveness of the proposed state estimation system, especially for aggressive, intermittent-GPS and high-altitude MAV flight. PMID:28025524
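The distinction between relative and absolute measurements in the abstract above can be illustrated with a much simpler filter than the one the authors describe. The sketch below is a scalar linear Kalman filter, not a stochastic cloning EKF: a relative increment (standing in for odometry) drives the prediction step, and an absolute reading (standing in for GPS or a barometer) drives the correction step. All names and noise values are hypothetical.

```python
def predict(x, p, delta, q):
    """Propagate the state with a relative measurement.

    x, p  : current estimate and its variance
    delta : measured change since the last step (e.g. an odometry increment)
    q     : variance of that increment (assumed value)
    """
    return x + delta, p + q


def update(x, p, z, r):
    """Correct the state with an absolute measurement z of variance r."""
    gain = p / (p + r)  # Kalman gain for a directly observed scalar state
    return x + gain * (z - x), (1.0 - gain) * p


if __name__ == "__main__":
    x, p = 0.0, 1.0                              # initial altitude estimate
    x, p = predict(x, p, delta=1.0, q=0.5)       # relative step
    x, p = update(x, p, z=2.0, r=1.5)            # absolute fix
    print(x, p)
```

Uncertainty grows during prediction and shrinks at each absolute fix, which is why a filter of this kind can coast through intermittent absolute measurements.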
The value of protein structure classification information—Surveying the scientific literature
Fox, Naomi K.; Brenner, Steven E.
2015-01-01
ABSTRACT The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP–extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decade old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012–2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non‐SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings. Proteins 2015; 83:2025–2038. © 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc. PMID:26313554
Benchmarking homogenization algorithms for monthly data
NASA Astrophysics Data System (ADS)
Venema, V. K. C.; Mestre, O.; Aguilar, E.; Auer, I.; Guijarro, J. A.; Domonkos, P.; Vertacnik, G.; Szentimrey, T.; Stepanek, P.; Zahradnicek, P.; Viarre, J.; Müller-Westermeier, G.; Lakatos, M.; Williams, C. N.; Menne, M. J.; Lindau, R.; Rasol, D.; Rustemeier, E.; Kolokythas, K.; Marinova, T.; Andresen, L.; Acquaotta, F.; Fratianni, S.; Cheval, S.; Klancar, M.; Brunetti, M.; Gruber, C.; Prohom Duran, M.; Likso, T.; Esteban, P.; Brandsma, T.
2012-01-01
The COST (European Cooperation in Science and Technology) Action ES0601: advances in homogenization methods of climate series: an integrated approach (HOME) has executed a blind intercomparison and validation study for monthly homogenization algorithms. Time series of monthly temperature and precipitation were evaluated because of their importance for climate studies and because they represent two important types of statistics (additive and multiplicative). The algorithms were validated against a realistic benchmark dataset. The benchmark contains real inhomogeneous data as well as simulated data with inserted inhomogeneities. Random independent break-type inhomogeneities with normally distributed breakpoint sizes were added to the simulated datasets. To approximate real world conditions, breaks were introduced that occur simultaneously in multiple station series within a simulated network of station data. The simulated time series also contained outliers, missing data periods and local station trends. Further, a stochastic nonlinear global (network-wide) trend was added. Participants provided 25 separate homogenized contributions as part of the blind study. After the deadline at which details of the imposed inhomogeneities were revealed, 22 additional solutions were submitted. These homogenized datasets were assessed by a number of performance metrics including (i) the centered root mean square error relative to the true homogeneous value at various averaging scales, (ii) the error in linear trend estimates and (iii) traditional contingency skill scores. The metrics were computed both using the individual station series as well as the network average regional series. The performance of the contributions depends significantly on the error metric considered. Contingency scores by themselves are not very informative. Although relative homogenization algorithms typically improve the homogeneity of temperature data, only the best ones improve precipitation data. 
Training the users on homogenization software was found to be very important. Moreover, state-of-the-art relative homogenization algorithms developed to work with an inhomogeneous reference are shown to perform best. The study showed that automatic algorithms can perform as well as manual ones.
Benchmarking monthly homogenization algorithms
NASA Astrophysics Data System (ADS)
Venema, V. K. C.; Mestre, O.; Aguilar, E.; Auer, I.; Guijarro, J. A.; Domonkos, P.; Vertacnik, G.; Szentimrey, T.; Stepanek, P.; Zahradnicek, P.; Viarre, J.; Müller-Westermeier, G.; Lakatos, M.; Williams, C. N.; Menne, M.; Lindau, R.; Rasol, D.; Rustemeier, E.; Kolokythas, K.; Marinova, T.; Andresen, L.; Acquaotta, F.; Fratianni, S.; Cheval, S.; Klancar, M.; Brunetti, M.; Gruber, C.; Prohom Duran, M.; Likso, T.; Esteban, P.; Brandsma, T.
2011-08-01
The COST (European Cooperation in Science and Technology) Action ES0601: Advances in homogenization methods of climate series: an integrated approach (HOME) has executed a blind intercomparison and validation study for monthly homogenization algorithms. Time series of monthly temperature and precipitation were evaluated because of their importance for climate studies and because they represent two important types of statistics (additive and multiplicative). The algorithms were validated against a realistic benchmark dataset. The benchmark contains real inhomogeneous data as well as simulated data with inserted inhomogeneities. Random break-type inhomogeneities were added to the simulated datasets modeled as a Poisson process with normally distributed breakpoint sizes. To approximate real world conditions, breaks were introduced that occur simultaneously in multiple station series within a simulated network of station data. The simulated time series also contained outliers, missing data periods and local station trends. Further, a stochastic nonlinear global (network-wide) trend was added. Participants provided 25 separate homogenized contributions as part of the blind study as well as 22 additional solutions submitted after the details of the imposed inhomogeneities were revealed. These homogenized datasets were assessed by a number of performance metrics including (i) the centered root mean square error relative to the true homogeneous value at various averaging scales, (ii) the error in linear trend estimates and (iii) traditional contingency skill scores. The metrics were computed both using the individual station series as well as the network average regional series. The performance of the contributions depends significantly on the error metric considered. Contingency scores by themselves are not very informative. Although relative homogenization algorithms typically improve the homogeneity of temperature data, only the best ones improve precipitation data. 
Training was found to be very important. Moreover, state-of-the-art relative homogenization algorithms developed to work with an inhomogeneous reference are shown to perform best. The study showed that currently automatic algorithms can perform as well as manual ones.
Performance Characteristics of the Multi-Zone NAS Parallel Benchmarks
NASA Technical Reports Server (NTRS)
Jin, Haoqiang; VanderWijngaart, Rob F.
2003-01-01
We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of grids, but had not previously been captured in benchmarks. The new suite, named NPB Multi-Zone, is extended from the NAS Parallel Benchmarks suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the Message Passing Interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on three different parallel computers. We also use an empirical formula to investigate the performance characteristics of the multi-zone benchmarks.
Nicholson, Andrew G; Detterbeck, Frank; Marx, Alexander; Roden, Anja C; Marchevsky, Alberto M; Mukai, Kiyoshi; Chen, Gang; Marino, Mirella; den Bakker, Michael A; Yang, Woo-Ick; Judge, Meagan; Hirschowitz, Lynn
2017-03-01
The International Collaboration on Cancer Reporting (ICCR) is a not-for-profit organization formed by the Royal Colleges of Pathologists of Australasia and the United Kingdom, the College of American Pathologists, the Canadian Association of Pathologists-Association Canadienne des Pathologists in association with the Canadian Partnership Against Cancer, and the European Society of Pathology. Its goal is to produce standardized, internationally agreed, evidence-based datasets for use throughout the world. This article describes the development of a cancer dataset by the multidisciplinary ICCR expert panel for the reporting of thymic epithelial tumours. The dataset includes 'required' (mandatory) and 'recommended' (non-mandatory) elements, which are validated by a review of current evidence and supported by explanatory text. Seven required elements and 12 recommended elements were agreed by the international dataset authoring committee to represent the essential information for the reporting of thymic epithelial tumours. The use of an internationally agreed, structured pathology dataset for reporting thymic tumours provides all of the necessary information for optimal patient management, facilitates consistent and accurate data collection, and provides valuable data for research and international benchmarking. The dataset also provides a valuable resource for those countries and institutions that are not in a position to develop their own datasets. © 2016 John Wiley & Sons Ltd.
Mei, Suyu
2012-10-07
Recent years have witnessed much progress in computational modeling for protein subcellular localization. However, there are very few computational models for predicting plant protein subcellular multi-localization. In this paper, we propose a multi-label multi-kernel transfer learning model for predicting multiple subcellular locations of plant proteins (MLMK-TLM). The method proposes a multi-label confusion matrix and adapts one-against-all multi-class probabilistic outputs to the multi-label learning scenario, based on which we further extend our published work MK-TLM (multi-kernel transfer learning based on Chou's PseAAC formulation for protein submitochondria localization) for plant protein subcellular multi-localization. By proper homolog knowledge transfer, MLMK-TLM is applicable to novel plant protein subcellular localization in the multi-label learning scenario. The experiments on a plant protein benchmark dataset show that MLMK-TLM outperforms the baseline model. Unlike the existing models, MLMK-TLM also reports its misleading tendency, which is important for a comprehensive survey of the model's multi-labeling performance. Copyright © 2012 Elsevier Ltd. All rights reserved.
Brain tumor segmentation in multi-spectral MRI using convolutional neural networks (CNN).
Iqbal, Sajid; Ghani, M Usman; Saba, Tanzila; Rehman, Amjad
2018-04-01
A tumor could be found in any area of the brain and could be of any size, shape, and contrast. There may exist multiple tumors of different types in a human brain at the same time. Accurate tumor area segmentation is considered a primary step for treatment of brain tumors. Deep Learning is a set of promising techniques that could provide better results as compared to non-deep learning techniques for segmenting the tumorous part inside a brain. This article presents a deep convolutional neural network (CNN) to segment brain tumors in MRIs. The proposed network uses the BRATS segmentation challenge dataset, which is composed of images obtained through four different modalities. Accordingly, we present an extended version of an existing network to solve the segmentation problem. The network architecture consists of multiple neural network layers connected in sequential order with the feeding of Convolutional feature maps at the peer level. Experimental results on BRATS 2015 benchmark data thus show the usability of the proposed approach and its superiority over the other approaches in this area of research. © 2018 Wiley Periodicals, Inc.
Deep learning and face recognition: the state of the art
NASA Astrophysics Data System (ADS)
Balaban, Stephen
2015-05-01
Deep Neural Networks (DNNs) have established themselves as a dominant technique in machine learning. DNNs have been top performers on a wide variety of tasks including image classification, speech recognition, and face recognition.1-3 Convolutional neural networks (CNNs) have been used in nearly all of the top performing methods on the Labeled Faces in the Wild (LFW) dataset.3-6 In this talk and accompanying paper, I attempt to provide a review and summary of the deep learning techniques used in the state-of-the-art. In addition, I highlight the need for both larger and more challenging public datasets to benchmark these systems. Despite the ability of DNNs and autoencoders to perform unsupervised feature learning, modern facial recognition pipelines still require domain specific engineering in the form of re-alignment. For example, in Facebook's recent DeepFace paper, a 3D "frontalization" step lies at the beginning of the pipeline. This step creates a 3D face model for the incoming image and then uses a series of affine transformations of the fiducial points to "frontalize" the image. This step enables the DeepFace system to use a neural network architecture with locally connected layers without weight sharing as opposed to standard convolutional layers.6 Deep learning techniques combined with large datasets have allowed research groups to surpass human level performance on the LFW dataset.3, 5 The high accuracy (99.63% for FaceNet at the time of publishing) and utilization of outside data (hundreds of millions of images in the case of Google's FaceNet) suggest that current face verification benchmarks such as LFW may not be challenging enough, nor provide enough data, for current techniques.3, 5 There exist a variety of organizations with mobile photo sharing applications that would be capable of releasing a very large scale and highly diverse dataset of facial images captured on mobile devices. 
Such an "ImageNet for Face Recognition" would likely receive a warm welcome from researchers and practitioners alike.
Wherry, Susan A.; Wood, Tamara M.; Anderson, Chauncey W.
2015-01-01
Using the extended 1991–2010 external phosphorus loading dataset, the lake TMDL model was recalibrated following the same procedures outlined in the Phase 1 review. The version of the model selected for further development incorporated an updated sediment initial condition, a numerical solution method for the chlorophyll a model, changes to light and phosphorus factors limiting algal growth, and a new pH-model regression, which removed Julian day dependence in order to avoid discontinuities in pH at year boundaries. This updated lake TMDL model was recalibrated using the extended dataset in order to compare calibration parameters to those obtained from a calibration with the original 7.5-year dataset. The resulting algal settling velocity calibrated from the extended dataset was more than twice the value calibrated with the original dataset, and, because the calibrated values of algal settling velocity and recycle rate are related (more rapid settling required more rapid recycling), the recycling rate also was larger than that determined with the original dataset. These changes in calibration parameters highlight the uncertainty in critical rates in the Upper Klamath Lake TMDL model and argue for their direct measurement in future data collection to increase confidence in the model predictions.
NASA Astrophysics Data System (ADS)
Rock, Gilles; Fischer, Kim; Schlerf, Martin; Gerhards, Max; Udelhoven, Thomas
2017-04-01
The development and optimization of image processing algorithms require the availability of datasets depicting every step from earth surface to the sensor's detector. The lack of ground truth data makes it necessary to develop algorithms on simulated data. The simulation of hyperspectral remote sensing data is a useful tool for a variety of tasks such as the design of systems, the understanding of the image formation process, and the development and validation of data processing algorithms. An end-to-end simulator has been set up consisting of a forward simulator, a backward simulator and a validation module. The forward simulator derives radiance datasets based on laboratory sample spectra, applies atmospheric contributions using radiative transfer equations, and simulates the instrument response using configurable sensor models. This is followed by the backward simulation branch, consisting of an atmospheric correction (AC), a temperature and emissivity separation (TES) or a hybrid AC and TES algorithm. An independent validation module allows the comparison between input and output dataset and the benchmarking of different processing algorithms. In this study, hyperspectral thermal infrared scenes of a variety of surfaces have been simulated to analyze existing AC and TES algorithms. The ARTEMISS algorithm was optimized and benchmarked against the original implementations. The errors in TES were found to be related to incorrect water vapor retrieval. The atmospheric characterization could be optimized, resulting in increased accuracies in temperature and emissivity retrieval. Airborne datasets of different spectral resolutions were simulated from terrestrial HyperCam-LW measurements. The simulated airborne radiance spectra were subjected to atmospheric correction and TES and further used for a plant species classification study analyzing effects related to noise and mixed pixels.
Benchmarking Big Data Systems and the BigData Top100 List.
Baru, Chaitanya; Bhandarkar, Milind; Nambiar, Raghunath; Poess, Meikel; Rabl, Tilmann
2013-03-01
"Big data" has become a major force of innovation across enterprises of all sizes. New platforms with increasingly more features for managing big datasets are being announced almost on a weekly basis. Yet, there is currently a lack of any means of comparability among such platforms. While the performance of traditional database systems is well understood and measured by long-established institutions such as the Transaction Processing Performance Council (TPC), there is neither a clear definition of the performance of big data systems nor a generally agreed upon metric for comparing these systems. In this article, we describe a community-based effort for defining a big data benchmark. Over the past year, a Big Data Benchmarking Community has become established in order to fill this void. The effort focuses on defining an end-to-end application-layer benchmark for measuring the performance of big data applications, with the ability to easily adapt the benchmark specification to evolving challenges in the big data space. This article describes the efforts that have been undertaken thus far toward the definition of a BigData Top100 List. While highlighting the major technical as well as organizational challenges, through this article, we also solicit community input into this process.
The EB Factory: Fundamental Stellar Astrophysics with Eclipsing Binary Stars Discovered by Kepler
NASA Astrophysics Data System (ADS)
Stassun, Keivan
Eclipsing binaries (EBs) are key laboratories for determining the fundamental properties of stars. EBs are therefore foundational objects for constraining stellar evolution models, which in turn are central to determinations of stellar mass functions, of exoplanet properties, and many other areas. The primary goal of this proposal is to mine the Kepler mission light curves for: (1) EBs that include a subgiant star, from which precise ages can be derived and which can thus serve as critically needed age benchmarks; and within these, (2) long-period EBs that include low-mass M stars or brown dwarfs, which are increasingly becoming the focus of exoplanet searches, but for which there are the fewest available fundamental mass-radius-age benchmarks. A secondary goal of this proposal is to develop an end-to-end computational pipeline -- the Kepler EB Factory -- that allows automatic processing of Kepler light curves for EBs, from period finding, to object classification, to determination of EB physical properties for the most scientifically interesting EBs, and finally to accurate modeling of these EBs for detailed tests and benchmarking of theoretical stellar evolution models. We will integrate the most successful algorithms into a single, cohesive workflow environment, and apply this 'Kepler EB Factory' to the full public Kepler dataset to find and characterize new "benchmark grade" EBs, and will disseminate both the enhanced data products from this pipeline and the pipeline itself to the broader NASA science community. The proposed work responds directly to two of the defined Research Areas of the NASA Astrophysics Data Analysis Program (ADAP), specifically Research Area #2 (Stellar Astrophysics) and Research Area #9 (Astrophysical Databases). To be clear, our primary goal is the fundamental stellar astrophysics that will be enabled by the discovery and analysis of relatively rare, benchmark-grade EBs in the Kepler dataset. 
At the same time, to enable this goal will require bringing a suite of extant and new custom algorithms to bear on the Kepler data, and thus our development of the Kepler EB Factory represents a value-added product that will allow the widest scientific impact of the information locked within the vast reservoir of the Kepler light curves.
Expanded Outreach at Clemson University. A Case Study.
ERIC Educational Resources Information Center
Bennett, A. Wayne
This paper summarizes recent strategic planning activities at Clemson University, focusing on outreach and extended education goals at the university. Specific benchmarks for outreach and extended education include: (1) by May 1994, each department will develop an operational definition of its public service mission, an action plan to integrate…
Kansas Extended Curricular Standards for Mathematics.
ERIC Educational Resources Information Center
Kansas State Board of Education, Topeka.
This document is an extension of the Kansas Curricular Standards for Mathematics. These standards, benchmarks, and examples are intended to be used in developing curricular materials for students who are eligible for the alternative assessment. One difference in the extended mathematics standards from the general education standards is that grade…
Benchmarking Spike-Based Visual Recognition: A Dataset and Evaluation
Liu, Qian; Pineda-García, Garibaldi; Stromatias, Evangelos; Serrano-Gotarredona, Teresa; Furber, Steve B.
2016-01-01
Today, increasing attention is being paid to research into spike-based neural computation both to gain a better understanding of the brain and to explore biologically-inspired computation. Within this field, the primate visual pathway and its hierarchical organization have been extensively studied. Spiking Neural Networks (SNNs), inspired by the understanding of observed biological structure and function, have been successfully applied to visual recognition and classification tasks. In addition, implementations on neuromorphic hardware have enabled large-scale networks to run in (or even faster than) real time, making spike-based neural vision processing accessible on mobile robots. Neuromorphic sensors such as silicon retinas are able to feed such mobile systems with real-time visual stimuli. A new set of vision benchmarks for spike-based neural processing is now needed to measure progress quantitatively within this rapidly advancing field. We propose that a large dataset of spike-based visual stimuli is needed to provide meaningful comparisons between different systems, and a corresponding evaluation methodology is also required to measure the performance of SNN models and their hardware implementations. In this paper we first propose an initial NE (Neuromorphic Engineering) dataset based on standard computer vision benchmarks and that uses digits from the MNIST database. This dataset is compatible with the state of current research on spike-based image recognition. The corresponding spike trains are produced using a range of techniques: rate-based Poisson spike generation, rank order encoding, and recorded output from a silicon retina with both flashing and oscillating input stimuli. In addition, a complementary evaluation methodology is presented to assess both model-level and hardware-level performance. 
Finally, we demonstrate the use of the dataset and the evaluation methodology using two SNN models to validate the performance of the models and their hardware implementations. With this dataset we hope to (1) promote meaningful comparison between algorithms in the field of neural computation, (2) allow comparison with conventional image recognition methods, (3) provide an assessment of the state of the art in spike-based visual recognition, and (4) help researchers identify future directions and advance the field. PMID:27853419
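Rate-based Poisson spike generation, one of the encoding techniques named in the abstract, can be sketched as follows. This is a minimal illustration and not the authors' conversion code: the function names, the 1 ms time step, the 100 Hz maximum rate and the linear mapping from 8-bit pixel intensity to firing rate are all assumptions made for the example.

```python
import random


def poisson_spike_train(rate_hz, duration_s, dt_s=0.001, rng=None):
    """Return spike times in seconds for one neuron.

    A homogeneous Poisson process is approximated by an independent
    Bernoulli trial per time step with probability rate_hz * dt_s.
    """
    if rng is None:
        rng = random.Random()
    p_spike = rate_hz * dt_s
    if not 0.0 <= p_spike <= 1.0:
        raise ValueError("rate_hz * dt_s must lie in [0, 1]")
    n_steps = int(round(duration_s / dt_s))
    return [step * dt_s for step in range(n_steps) if rng.random() < p_spike]


def encode_image(pixels, max_rate_hz=100.0, duration_s=1.0, seed=0):
    """Map 8-bit pixel intensities to one spike train per pixel.

    Assumed mapping (hypothetical): intensity 255 fires at max_rate_hz,
    intensity 0 never fires, values in between scale linearly.
    """
    rng = random.Random(seed)
    return [
        poisson_spike_train(max_rate_hz * p / 255.0, duration_s, rng=rng)
        for p in pixels
    ]


if __name__ == "__main__":
    trains = encode_image([0, 128, 255])
    print([len(t) for t in trains])
```

Brighter pixels produce more spikes on average, which is the property a rate-coded SNN input layer relies on.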
Compensatory neurofuzzy model for discrete data classification in biomedical
NASA Astrophysics Data System (ADS)
Ceylan, Rahime
2015-03-01
Biomedical data fall into two main categories: signals and discrete data. Studies in this area therefore address either biomedical signal classification or biomedical discrete data classification. Artificial intelligence models exist for the classification of ECG, EMG or EEG signals. Likewise, many models in the literature classify discrete data taken as sample values, such as the results of blood analysis or biopsy in a medical process. Not every algorithm can achieve a high accuracy rate on the classification of both signals and discrete data. In this study, a compensatory neurofuzzy network model is presented for the classification of discrete data in the biomedical pattern recognition area. The compensatory neurofuzzy network has a hybrid, binary classifier. In this system, the parameters of the fuzzy systems are updated by the backpropagation algorithm. The classifier model was applied to two benchmark datasets (the Wisconsin Breast Cancer dataset and the Pima Indian Diabetes dataset). Experimental studies show that the compensatory neurofuzzy network model achieved a 96.11% accuracy rate on the breast cancer dataset and a 69.08% accuracy rate on the diabetes dataset with only 10 iterations.
Indoor Modelling Benchmark for 3D Geometry Extraction
NASA Astrophysics Data System (ADS)
Thomson, C.; Boehm, J.
2014-06-01
A combination of faster, cheaper and more accurate hardware, more sophisticated software, and greater industry acceptance have all laid the foundations for an increased desire for accurate 3D parametric models of buildings. Pointclouds are the data source of choice currently with static terrestrial laser scanning the predominant tool for large, dense volume measurement. The current importance of pointclouds as the primary source of real world representation is endorsed by CAD software vendor acquisitions of pointcloud engines in 2011. Both the capture and modelling of indoor environments require great effort in time by the operator (and therefore cost). Automation is seen as a way to aid this by reducing the workload of the user and some commercial packages have appeared that provide automation to some degree. In the data capture phase, advances in indoor mobile mapping systems are speeding up the process, albeit currently with a reduction in accuracy. As a result this paper presents freely accessible pointcloud datasets of two typical areas of a building each captured with two different capture methods and each with an accurate wholly manually created model. These datasets are provided as a benchmark for the research community to gauge the performance and improvements of various techniques for indoor geometry extraction. With this in mind, non-proprietary, interoperable formats are provided such as E57 for the scans and IFC for the reference model. The datasets can be found at: http://indoor-bench.github.io/indoor-bench.
Joint estimation of motion and illumination change in a sequence of images
NASA Astrophysics Data System (ADS)
Koo, Ja-Keoung; Kim, Hyo-Hun; Hong, Byung-Woo
2015-09-01
We present an algorithm that simultaneously computes optical flow and estimates illumination change from an image sequence in a unified framework. We propose an energy functional consisting of conventional optical flow energy based on Horn-Schunck method and an additional constraint that is designed to compensate for illumination changes. Any undesirable illumination change that occurs in the imaging procedure in a sequence while the optical flow is being computed is considered a nuisance factor. In contrast to the conventional optical flow algorithm based on Horn-Schunck functional, which assumes the brightness constancy constraint, our algorithm is shown to be robust with respect to temporal illumination changes in the computation of optical flows. An efficient conjugate gradient descent technique is used in the optimization procedure as a numerical scheme. The experimental results obtained from the Middlebury benchmark dataset demonstrate the robustness and the effectiveness of our algorithm. In addition, comparative analysis of our algorithm and Horn-Schunck algorithm is performed on the additional test dataset that is constructed by applying a variety of synthetic bias fields to the original image sequences in the Middlebury benchmark dataset in order to demonstrate that our algorithm outperforms the Horn-Schunck algorithm. The superior performance of the proposed method is observed in terms of both qualitative visualizations and quantitative accuracy errors when compared to Horn-Schunck optical flow algorithm that easily yields poor results in the presence of small illumination changes leading to violation of the brightness constancy constraint.
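For reference, the classical Horn-Schunck energy that the abstract takes as its starting point can be written as below. Here $I$ is the image brightness with partial derivatives $I_x$, $I_y$, $I_t$, $(u, v)$ is the flow field, and $\alpha$ is the smoothness weight. The additional illumination-compensation constraint proposed by the authors is not specified in the abstract and is therefore not reproduced.

```latex
E(u, v) = \iint \Big[ \left( I_x u + I_y v + I_t \right)^2
        + \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \Big] \, dx \, dy
```

The first term is the linearized brightness constancy constraint $I_x u + I_y v + I_t = 0$, which is the assumption that temporal illumination changes violate.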
Benchmark for license plate character segmentation
NASA Astrophysics Data System (ADS)
Gonçalves, Gabriel Resende; da Silva, Sirlene Pio Gomes; Menotti, David; Shwartz, William Robson
2016-09-01
Automatic license plate recognition (ALPR) has been the focus of much research in recent years. In general, ALPR is divided into the following problems: detection of on-track vehicles, license plate detection, segmentation of license plate characters, and optical character recognition (OCR). Even though commercial solutions are available for controlled acquisition conditions, e.g., the entrance of a parking lot, ALPR is still an open problem when dealing with data acquired from uncontrolled environments, such as roads and highways when relying only on imaging sensors. Due to the multiple orientations and scales of the license plates captured by the camera, a very challenging task of the ALPR is the license plate character segmentation (LPCS) step, because its effectiveness is required to be (near) optimal to achieve a high recognition rate by the OCR. To tackle the LPCS problem, this work proposes a benchmark composed of a dataset designed to focus specifically on the character segmentation step of the ALPR within an evaluation protocol. Furthermore, we propose the Jaccard-centroid coefficient, an evaluation measure more suitable than the Jaccard coefficient regarding the location of the bounding box within the ground-truth annotation. The dataset is composed of 2000 Brazilian license plates consisting of 14000 alphanumeric symbols and their corresponding bounding box annotations. We also present a straightforward approach to perform LPCS efficiently. Finally, we provide an experimental evaluation for the dataset based on five LPCS approaches and demonstrate the importance of character segmentation for achieving an accurate OCR.
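The plain Jaccard coefficient that the abstract uses as its baseline measure can be sketched for axis-aligned bounding boxes as follows. The box layout `(x_min, y_min, x_max, y_max)` and the function name are assumptions for illustration. The Jaccard-centroid coefficient proposed by the authors is not defined in the abstract, so it is not attempted here.

```python
def jaccard(box_a, box_b):
    """Jaccard coefficient (intersection over union) of two boxes.

    Each box is (x_min, y_min, x_max, y_max), an assumed layout.
    Returns 0.0 when the boxes do not overlap or have zero area.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)
    ground_truth = (1, 0, 3, 2)
    print(jaccard(predicted, ground_truth))
```

Because the coefficient depends only on overlap area, two predictions with the same overlap but different offsets inside the ground-truth box score identically, which appears to be the limitation the location-aware measure is meant to address.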
Pedotransfer functions for isoproturon sorption on soils and vadose zone materials.
Moeys, Julien; Bergheaud, Valérie; Coquet, Yves
2011-10-01
Sorption coefficients (the linear K_D or the non-linear K_F and N_F) are critical parameters in models of pesticide transport to groundwater or surface water. In this work, a dataset of isoproturon sorption coefficients and corresponding soil properties (264 K_D and 55 K_F) was compiled, and pedotransfer functions were built for predicting isoproturon sorption in soils and vadose zone materials. These were benchmarked against various other prediction methods. The results show that the organic carbon content (OC) and pH are the two main soil properties influencing isoproturon K_D. The pedotransfer function is K_D = 1.7822 + 0.0162 OC^1.5 - 0.1958 pH (K_D in L kg^-1 and OC in g kg^-1). For low-OC soils (OC < 6.15 g kg^-1), clay and pH are most influential. The pedotransfer function is then K_D = 0.9980 + 0.0002 clay - 0.0990 pH (clay in g kg^-1). Benchmarking K_D estimations showed that functions calibrated on more specific subsets of the data perform better on these subsets than functions calibrated on larger subsets. Predicting isoproturon sorption in soils in unsampled locations should rely, whenever possible, and by order of preference, on (a) site- or soil-specific pedotransfer functions, (b) pedotransfer functions calibrated on a large dataset, (c) K_OC values calculated on a large dataset or (d) K_OC values taken from existing pesticide properties databases. Copyright © 2011 Society of Chemical Industry.
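The two pedotransfer functions quoted in the abstract can be evaluated with a short helper, sketched below. The coefficients and the 6.15 g/kg threshold come from the abstract. Reading the exponent on OC as a power of 1.5 is an interpretation of the abstract's notation, and the function name, argument names and the decision to raise an error when clay content is missing for a low-OC soil are assumptions of this sketch.

```python
LOW_OC_THRESHOLD_G_PER_KG = 6.15  # threshold quoted in the abstract


def isoproturon_kd(oc_g_per_kg, ph, clay_g_per_kg=None):
    """Estimate the isoproturon sorption coefficient K_D in L/kg.

    oc_g_per_kg   -- organic carbon content in g/kg
    ph            -- soil pH
    clay_g_per_kg -- clay content in g/kg (needed only for low-OC soils)
    """
    if oc_g_per_kg < LOW_OC_THRESHOLD_G_PER_KG:
        # Low-OC function: clay and pH are the predictors.
        if clay_g_per_kg is None:
            raise ValueError("clay content is required for low-OC soils")
        return 0.9980 + 0.0002 * clay_g_per_kg - 0.0990 * ph
    # General function: OC (assumed raised to the power 1.5) and pH.
    return 1.7822 + 0.0162 * oc_g_per_kg ** 1.5 - 0.1958 * ph


if __name__ == "__main__":
    print(isoproturon_kd(16.0, 7.0))
    print(isoproturon_kd(5.0, 7.0, clay_g_per_kg=200.0))
```

For example, under these assumptions a soil with OC = 16 g/kg and pH 7 gives 1.7822 + 0.0162 × 64 − 0.1958 × 7 ≈ 1.448 L/kg.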
Rafferty, Sharon A.; Arnold, L.R.; Char, Stephen J.
2002-01-01
The U.S. Geological Survey developed this dataset as part of the Colorado Front Range Infrastructure Resources Project (FRIRP). One goal of the FRIRP was to provide information on the availability of those hydrogeologic resources that are either critical to maintaining infrastructure along the northern Front Range or that may become less available because of urban expansion in the northern Front Range. This dataset extends from the Boulder-Jefferson County line on the south, to the middle of Larimer and Weld Counties on the North. On the west, this dataset is bounded by the approximate mountain front of the Front Range of the Rocky Mountains; on the east, by an arbitrary north-south line extending through a point about 6.5 kilometers east of Greeley. This digital geospatial dataset consists of digitized contours of unconsolidated-sediment thickness (depth to bedrock).
Shi, Ruijia; Xu, Cunshuan
2011-06-01
The study of rat proteins is an indispensable task in experimental medicine and drug development. The function of a rat protein is closely related to its subcellular location. Based on this concept, we construct a benchmark rat protein dataset and develop a combined approach for predicting the subcellular localization of rat proteins. From the protein primary sequence, multiple sequential features are obtained using discrete Fourier analysis, a position conservation scoring function and the increment of diversity, and these sequential features are selected as input parameters of the support vector machine. By the jackknife test, the overall success rate of prediction is 95.6% on the rat proteins dataset. When our method is applied to the apoptosis proteins dataset and the Gram-negative bacterial proteins dataset with the jackknife test, the overall success rates are 89.9% and 96.4%, respectively. These results indicate that our proposed method is quite promising and may play a complementary role to the existing predictors in this area.
Drug-target interaction prediction: A Bayesian ranking approach.
Peska, Ladislav; Buza, Krisztian; Koller, Júlia
2017-12-01
In silico prediction of drug-target interactions (DTI) could provide valuable information and speed-up the process of drug repositioning - finding novel usage for existing drugs. In our work, we focus on machine learning algorithms supporting drug-centric repositioning approach, which aims to find novel usage for existing or abandoned drugs. We aim at proposing a per-drug ranking-based method, which reflects the needs of drug-centric repositioning research better than conventional drug-target prediction approaches. We propose Bayesian Ranking Prediction of Drug-Target Interactions (BRDTI). The method is based on Bayesian Personalized Ranking matrix factorization (BPR) which has been shown to be an excellent approach for various preference learning tasks, however, it has not been used for DTI prediction previously. In order to successfully deal with DTI challenges, we extended BPR by proposing: (i) the incorporation of target bias, (ii) a technique to handle new drugs and (iii) content alignment to take structural similarities of drugs and targets into account. Evaluation on five benchmark datasets shows that BRDTI outperforms several state-of-the-art approaches in terms of per-drug nDCG and AUC. BRDTI results w.r.t. nDCG are 0.929, 0.953, 0.948, 0.897 and 0.690 for G-Protein Coupled Receptors (GPCR), Ion Channels (IC), Nuclear Receptors (NR), Enzymes (E) and Kinase (K) datasets respectively. Additionally, BRDTI significantly outperformed other methods (BLM-NII, WNN-GIP, NetLapRLS and CMF) w.r.t. nDCG in 17 out of 20 cases. Furthermore, BRDTI was also shown to be able to predict novel drug-target interactions not contained in the original datasets. The average recall at top-10 predicted targets for each drug was 0.762, 0.560, 1.000 and 0.404 for GPCR, IC, NR, and E datasets respectively. 
Based on the evaluation, we can conclude that BRDTI is an appropriate choice for researchers looking for an in silico DTI prediction technique to be used in drug-centric repositioning scenarios. BRDTI Software and supplementary materials are available online at www.ksi.mff.cuni.cz/~peska/BRDTI. Copyright © 2017 Elsevier B.V. All rights reserved.
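The per-drug nDCG used as the headline metric can be illustrated with a small sketch. This is not the BRDTI evaluation code: it assumes binary interaction labels, the common gain `rel / log2(rank + 1)` with no rank cutoff, and a plain average over drugs. The exact variant used in the paper is not stated in the abstract, and all function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(scores, labels):
    """nDCG for one drug: rank targets by score, compare with ideal order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    ideal_dcg = dcg(sorted(labels, reverse=True))
    return dcg(ranked_labels) / ideal_dcg if ideal_dcg > 0 else 0.0


def per_drug_ndcg(score_rows, label_rows):
    """Average nDCG over drugs; each row holds one drug's targets."""
    values = [ndcg(s, l) for s, l in zip(score_rows, label_rows)]
    return sum(values) / len(values)


if __name__ == "__main__":
    scores = [[0.9, 0.1, 0.5], [0.1, 0.9, 0.5]]
    labels = [[1, 0, 0], [1, 0, 0]]
    print(per_drug_ndcg(scores, labels))
```

Averaging per drug, rather than pooling all drug-target pairs, rewards a method for placing each drug's true targets near the top of that drug's own list, which matches the drug-centric repositioning setting described above.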
Lee, A S; Colagiuri, S; Flack, J R
2018-04-06
We developed and implemented a national audit and benchmarking programme to describe the clinical status of people with diabetes attending specialist diabetes services in Australia. The Australian National Diabetes Information Audit and Benchmarking (ANDIAB) initiative was established as a quality audit activity. De-identified data on demographic, clinical, biochemical and outcome items were collected from specialist diabetes services across Australia to provide cross-sectional data on people with diabetes attending specialist centres at least biennially during the years 1998 to 2011. In total, 38 155 sets of data were collected over the eight ANDIAB audits. Each ANDIAB audit achieved its primary objective to collect, collate, analyse, audit and report clinical diabetes data in Australia. Each audit resulted in the production of a pooled data report, as well as individual site reports allowing comparison and benchmarking against other participating sites. The ANDIAB initiative resulted in the largest cross-sectional national de-identified dataset describing the clinical status of people with diabetes attending specialist diabetes services in Australia. ANDIAB showed that people treated by specialist services had a high burden of diabetes complications. This quality audit activity provided a framework to guide planning of healthcare services. © 2018 Diabetes UK.
Fusion and Sense Making of Heterogeneous Sensor Network and Other Sources
2017-03-16
multimodal fusion framework that uses both training data and web resources for scene classification, the experimental results on the benchmark datasets...show that the proposed text-aided scene classification framework could significantly improve classification performance. Experimental results also show...human whose adaptability is achieved by reliability- dependent weighting of different sensory modalities. Experimental results show that the proposed
Rotation-invariant features for multi-oriented text detection in natural images.
Yao, Cong; Zhang, Xin; Bai, Xiang; Liu, Wenyu; Ma, Yi; Tu, Zhuowen
2013-01-01
Texts in natural scenes carry rich semantic information, which can be used to assist a wide range of applications, such as object recognition, image/video retrieval, mapping/navigation, and human computer interaction. However, most existing systems are designed to detect and recognize horizontal (or near-horizontal) texts. Due to the increasing popularity of mobile-computing devices and applications, detecting texts of varying orientations from natural images under less controlled conditions has become an important but challenging task. In this paper, we propose a new algorithm to detect texts of varying orientations. Our algorithm is based on a two-level classification scheme and two sets of features specially designed for capturing the intrinsic characteristics of texts. To better evaluate the proposed method and compare it with the competing algorithms, we generate a comprehensive dataset with various types of texts in diverse real-world scenes. We also propose a new evaluation protocol, which is more suitable for benchmarking algorithms for detecting texts in varying orientations. Experiments on benchmark datasets demonstrate that our system compares favorably with the state-of-the-art algorithms when handling horizontal texts and achieves significantly enhanced performance on variant texts in complex natural scenes.
A dynamic fault tree model of a propulsion system
NASA Technical Reports Server (NTRS)
Xu, Hong; Dugan, Joanne Bechta; Meshkat, Leila
2006-01-01
We present a dynamic fault tree model of the benchmark propulsion system, and solve it using Galileo. Dynamic fault trees (DFT) extend traditional static fault trees with special gates to model spares and other sequence dependencies. Galileo solves DFT models using a judicious combination of automatically generated Markov and Binary Decision Diagram models. Galileo easily handles the complexities exhibited by the benchmark problem. In particular, Galileo is designed to model phased mission systems.
Classification and assessment tools for structural motif discovery algorithms.
Badr, Ghada; Al-Turaiki, Isra; Mathkour, Hassan
2013-01-01
Motif discovery is the problem of finding recurring patterns in biological data. Patterns can be sequential, mainly when discovered in DNA sequences. They can also be structural (e.g. when discovering RNA motifs). Finding common structural patterns helps to gain a better understanding of the mechanism of action (e.g. post-transcriptional regulation). Unlike DNA motifs, which are sequentially conserved, RNA motifs exhibit conservation in structure, which may be common even if the sequences are different. Over the past few years, hundreds of algorithms have been developed to solve the sequential motif discovery problem, while less work has been done for the structural case. In this paper, we survey, classify, and compare different algorithms that solve the structural motif discovery problem, where the underlying sequences may be different. We highlight their strengths and weaknesses. We start by proposing a benchmark dataset and a measurement tool that can be used to evaluate different motif discovery approaches. Then, we proceed by proposing our experimental setup. Finally, results are obtained using the proposed benchmark to compare available tools. To the best of our knowledge, this is the first attempt to compare tools solely designed for structural motif discovery. Results show that the accuracy of discovered motifs is relatively low. The results also suggest a complementary behavior among tools where some tools perform well on simple structures, while other tools are better for complex structures. We have classified and evaluated the performance of available structural motif discovery tools. In addition, we have proposed a benchmark dataset with tools that can be used to evaluate newly developed tools.
NASA Astrophysics Data System (ADS)
Kokkoris, M.; Dede, S.; Kantre, K.; Lagoyannis, A.; Ntemou, E.; Paneta, V.; Preketes-Sigalas, K.; Provatas, G.; Vlastou, R.; Bogdanović-Radović, I.; Siketić, Z.; Obajdin, N.
2017-08-01
The evaluated proton differential cross sections suitable for the Elastic Backscattering Spectroscopy (EBS) analysis of natSi and 16O, as obtained from SigmaCalc 2.0, have been benchmarked over a wide energy and angular range at two different accelerator laboratories, namely at N.C.S.R. 'Demokritos', Athens, Greece and at Ruđer Bošković Institute (RBI), Zagreb, Croatia, using a variety of high-purity thick targets of known stoichiometry. The results are presented in graphical and tabular forms, while the observed discrepancies, the limits in accuracy of the benchmarking procedure, and target-related effects are thoroughly discussed and analysed. In the case of oxygen the agreement between simulated and experimental spectra was generally good, while for silicon serious discrepancies were observed above Ep,lab = 2.5 MeV, suggesting that a further tuning of the appropriate nuclear model parameters in the evaluated differential cross-section datasets is required.
Benchmarking database performance for genomic data.
Khushi, Matloob
2015-06-01
Genomic regions represent features such as gene annotations, transcription factor binding sites and epigenetic modifications. Performing genomic operations such as identifying overlapping/non-overlapping regions or nearest gene annotations is a common research need. The data can be saved in a database system for easy management; however, there is at present no comprehensive built-in database algorithm to identify overlapping regions. Therefore I have developed a novel region-mapping (RegMap) SQL-based algorithm to perform genomic operations and have benchmarked the performance of different databases. Benchmarking identified that PostgreSQL extracts overlapping regions much faster than MySQL. Insertion and data uploads in PostgreSQL were also better, although general searching capability of both databases was almost equivalent. In addition, using the algorithm, pairwise overlaps of >1000 datasets of transcription factor binding sites and histone marks, collected from previous publications, were reported, and it was found that HNF4G significantly co-locates with cohesin subunit STAG1 (SA1). © 2015 Wiley Periodicals, Inc.
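The overlap test underlying such region operations can be illustrated with a minimal Python sketch. This is not the RegMap SQL algorithm, which the abstract does not spell out; the `Region` type, the closed-interval coordinate convention, and the example coordinates are assumptions made purely for illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A genomic region; coordinates are treated as inclusive (assumed convention)."""
    chrom: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """Two closed intervals on the same chromosome overlap exactly when
    each one starts no later than the other one ends."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def find_overlaps(query: list[Region], target: list[Region]) -> list[tuple[Region, Region]]:
    """Naive all-against-all scan; a database would normally rely on an index."""
    return [(q, t) for q in query for t in target if overlaps(q, t)]


# Hypothetical example data, not taken from the paper.
peaks = [Region("chr1", 100, 200), Region("chr1", 500, 600), Region("chr2", 100, 200)]
genes = [Region("chr1", 150, 300), Region("chr2", 201, 400)]
pairs = find_overlaps(peaks, genes)
```

The same predicate translates directly into a SQL join condition (`a.chrom = b.chrom AND a.start <= b.end AND b.start <= a.end`), which is the kind of query whose speed a database benchmark of this sort would compare.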
Benchmarking nitrogen removal suspended-carrier biofilm systems using dynamic simulation.
Vanhooren, H; Yuan, Z; Vanrolleghem, P A
2002-01-01
We are witnessing an enormous growth in biological nitrogen removal from wastewater. It presents specific challenges beyond traditional COD (carbon) removal. A possibility for optimised process design is the use of biomass-supporting media. In this paper, attached growth processes (AGP) are evaluated using dynamic simulations. The advantages of these systems, which were qualitatively described elsewhere, are validated quantitatively based on a simulation benchmark for activated sludge treatment systems. This simulation benchmark is extended with a biofilm model that allows for fast and accurate simulation of the conversion of different substrates in a biofilm. The economic feasibility of this system is evaluated using the data generated with the benchmark simulations. Capital savings due to volume reduction and reduced sludge production are weighed against increased aeration costs. In this evaluation, effluent quality is integrated as well.
SAR image classification based on CNN in real and simulation datasets
NASA Astrophysics Data System (ADS)
Peng, Lijiang; Liu, Ming; Liu, Xiaohua; Dong, Liquan; Hui, Mei; Zhao, Yuejin
2018-04-01
Convolutional neural networks (CNNs) have achieved great success in image classification tasks. Even in the field of synthetic aperture radar automatic target recognition (SAR-ATR), state-of-the-art results have been obtained by learning deep feature representations on the MSTAR benchmark. However, the raw MSTAR data have shortcomings for training a SAR-ATR model because the backgrounds of the SAR images within each class are highly similar. This indicates that a CNN would learn feature hierarchies of the backgrounds as well as of the targets. To assess the influence of the background, additional SAR image datasets were created that contain simulated SAR images of 10 manufactured targets, such as tanks and fighter aircraft, with backgrounds sampled from the whole original MSTAR data. The simulation datasets comprise one dataset in which the backgrounds of each class of images correspond to a single kind of background of MSTAR targets or clutter, and one dataset in which each image has a random background drawn from all MSTAR targets or clutter. In addition, mixed datasets of MSTAR and simulation data were created for use in the experiments. The CNN architecture proposed in this paper is trained on all of the datasets mentioned above. The experimental results show that the architecture achieves high performance on all datasets even when the image backgrounds are miscellaneous, which indicates that the architecture can learn a good representation of the targets despite drastic changes in background.
NASA Astrophysics Data System (ADS)
Evans, B. J. K.; Wyborn, L. A.; Druken, K. A.; Richards, C. J.; Trenham, C. E.; Wang, J.
2016-12-01
The Australian National Computational Infrastructure (NCI) manages a large geospatial repository (10+ PBytes) of Earth systems, environmental, water management and geophysics research data, co-located with a petascale supercomputer and an integrated research cloud. NCI has applied the principles of the "Common Framework for Earth-Observation Data" (the Framework) to the organisation of these collections enabling a diverse range of researchers to explore different aspects of the data and, in particular, for seamless programmatic data analysis, both in-situ access and via data services. NCI provides access to the collections through the National Environmental Research Data Interoperability Platform (NERDIP) - a comprehensive and integrated data platform with both common and emerging services designed to enable data accessibility and citability. Applying the Framework across the range of datasets ensures that programmatic access, both in-situ and network methods, work as uniformly as possible for any dataset, using both APIs and data services. NCI has also created a comprehensive quality assurance framework to regularise compliance checks across the data, library APIs and data services, and to establish a comprehensive set of benchmarks to quantify both functionality and performance perspectives for the Framework. The quality assurance includes organisation of datasets through a data management plan, which anchors the data directory structure, version controls and data information services so that they are kept aligned with operational changes over time. Specific attention has been placed on the way data are packed inside the files. Our experience has shown that complying with standards such as CF and ACDD is still not enough to ensure that all data services or software packages correctly read the data. Further, data may not be optimally organised for the different access patterns, which causes poor performance of the CPUs and bandwidth utilisation. 
We will also discuss some gaps in the Framework that have emerged and our approach to resolving these.
Zhong, Shangping; Chen, Tianshun; He, Fengying; Niu, Yuzhen
2014-09-01
For a practical pattern classification task solved by kernel methods, the computing time is mainly spent on kernel learning (or training). However, current kernel learning approaches are based on local optimization techniques and struggle to achieve good time performance, especially for large datasets. Thus the existing algorithms cannot be easily extended to large-scale tasks. In this paper, we present a fast Gaussian kernel learning method by solving a specially structured global optimization (SSGO) problem. We optimize the Gaussian kernel function by using the formulated kernel target alignment criterion, which is a difference of increasing (d.i.) functions. Using a power-transformation based convexification method, the objective criterion can be represented as a difference of convex (d.c.) functions with a fixed power-transformation parameter. The objective programming problem can then be converted to an SSGO problem: globally minimizing a concave function over a convex set. The SSGO problem is classical and has good solvability. Thus, to find the global optimal solution efficiently, we can adopt the improved Hoffman's outer approximation method, which need not repeat the searching procedure with different starting points to locate the best local minimum. Also, the proposed method can be proven to converge to the global solution for any classification task. We evaluate the proposed method on twenty benchmark datasets, and compare it with four other Gaussian kernel learning methods. Experimental results show that the proposed method stably achieves both good time-efficiency performance and good classification performance. Copyright © 2014 Elsevier Ltd. All rights reserved.
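To make the kernel target alignment criterion concrete, the following Python sketch computes the standard alignment between a Gaussian kernel matrix and the ideal label kernel, A(K, yy^T) = <K, yy^T>_F / (||K||_F ||yy^T||_F). The exact formulation used in the paper may differ, and the small grid over `sigma` below is only a stand-in for illustration; it is not the SSGO / outer approximation procedure described in the abstract. The toy data and candidate widths are hypothetical.

```python
import math


def gaussian_kernel_matrix(X, sigma):
    """K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return K


def kernel_target_alignment(K, y):
    """Standard alignment of K with yy^T for labels in {-1, +1}."""
    n = len(y)
    inner = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    norm_k = math.sqrt(sum(K[i][j] ** 2 for i in range(n) for j in range(n)))
    norm_y = float(n)  # ||yy^T||_F equals n when every label is +1 or -1
    return inner / (norm_k * norm_y)


# Hypothetical toy data: two well-separated classes in the plane.
X = [[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [3.2, 2.9]]
y = [1, 1, -1, -1]

# Illustrative grid over the kernel width (not the paper's global optimizer).
scores = {
    sigma: kernel_target_alignment(gaussian_kernel_matrix(X, sigma), y)
    for sigma in (0.1, 1.0, 10.0)
}
best_sigma = max(scores, key=scores.get)
```

A very small width makes the kernel matrix close to the identity and a very large width makes it close to all ones; both align poorly with the labels, which is why the criterion is useful for choosing the width.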
Normal Modes Expose Active Sites in Enzymes.
Glantz-Gashai, Yitav; Meirson, Tomer; Samson, Abraham O
2016-12-01
Accurate prediction of active sites is an important tool in bioinformatics. Here we present an improved structure based technique to expose active sites that is based on large changes of solvent accessibility accompanying normal mode dynamics. The technique which detects EXPOsure of active SITes through normal modEs is named EXPOSITE. The technique is trained using a small 133 enzyme dataset and tested using a large 845 enzyme dataset, both with known active site residues. EXPOSITE is also tested in a benchmark protein ligand dataset (PLD) comprising 48 proteins with and without bound ligands. EXPOSITE is shown to successfully locate the active site in most instances, and is found to be more accurate than other structure-based techniques. Interestingly, in several instances, the active site does not correspond to the largest pocket. EXPOSITE is advantageous due to its high precision and paves the way for structure based prediction of active site in enzymes.
Normal Modes Expose Active Sites in Enzymes
Glantz-Gashai, Yitav; Samson, Abraham O.
2016-01-01
Accurate prediction of active sites is an important tool in bioinformatics. Here we present an improved structure based technique to expose active sites that is based on large changes of solvent accessibility accompanying normal mode dynamics. The technique which detects EXPOsure of active SITes through normal modEs is named EXPOSITE. The technique is trained using a small 133 enzyme dataset and tested using a large 845 enzyme dataset, both with known active site residues. EXPOSITE is also tested in a benchmark protein ligand dataset (PLD) comprising 48 proteins with and without bound ligands. EXPOSITE is shown to successfully locate the active site in most instances, and is found to be more accurate than other structure-based techniques. Interestingly, in several instances, the active site does not correspond to the largest pocket. EXPOSITE is advantageous due to its high precision and paves the way for structure based prediction of active site in enzymes. PMID:28002427
Complex versus simple models: ion-channel cardiac toxicity prediction.
Mistry, Hitesh B
2018-01-01
There is growing interest in applying detailed mathematical models of the heart for ion-channel related cardiac toxicity prediction. However, there is debate as to whether such complex models are required. Here the predictive performance of two established large-scale biophysical cardiac models and a simple linear model, B net, was compared. Three ion-channel data-sets were extracted from the literature. Each compound was assigned a cardiac risk category using two different classification schemes based on information within CredibleMeds. The predictive performance of each model within each data-set for each classification scheme was assessed via leave-one-out cross validation. Overall the B net model performed as well as the leading cardiac models in two of the data-sets and outperformed both cardiac models on the latest. These results highlight the importance of benchmarking complex versus simple models but also encourage the development of simple models.
GiniClust: detecting rare cell types from single-cell gene expression data with Gini index.
Jiang, Lan; Chen, Huidong; Pinello, Luca; Yuan, Guo-Cheng
2016-07-01
High-throughput single-cell technologies have great potential to discover new cell types; however, it remains challenging to detect rare cell types that are distinct from a large population. We present a novel computational method, called GiniClust, to overcome this challenge. Validation against a benchmark dataset indicates that GiniClust achieves high sensitivity and specificity. Application of GiniClust to public single-cell RNA-seq datasets uncovers previously unrecognized rare cell types, including Zscan4-expressing cells within mouse embryonic stem cells and hemoglobin-expressing cells in the mouse cortex and hippocampus. GiniClust also correctly detects a small number of normal cells that are mixed in a cancer cell population.
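The statistic named in the title can be sketched in a few lines of Python. This is only the textbook Gini index applied to one gene's expression values across cells; the abstract does not describe GiniClust's normalisation, gene selection, or clustering steps, so none of those are reproduced here, and the expression values are hypothetical.

```python
def gini_index(values):
    """Gini index of non-negative values: 0 means perfectly even,
    values near 1 mean the total is concentrated in a few entries."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Sorted-rank form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical expression values for two genes across six cells.
housekeeping = [10, 11, 9, 10, 10, 10]  # expressed evenly in every cell
rare_marker = [0, 0, 0, 0, 0, 12]       # expressed in a single cell only

g_even = gini_index(housekeeping)
g_rare = gini_index(rare_marker)
```

A gene expressed in only a handful of cells gets a much higher index than an evenly expressed one, which is the intuition for using the Gini index to flag candidate markers of rare cell types.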
Matt: local flexibility aids protein multiple structure alignment.
Menke, Matthew; Berger, Bonnie; Cowen, Lenore
2008-01-01
Even when there is agreement on what measure a protein multiple structure alignment should be optimizing, finding the optimal alignment is computationally prohibitive. One approach used by many previous methods is aligned fragment pair chaining, where short structural fragments from all the proteins are aligned against each other optimally, and the final alignment chains these together in geometrically consistent ways. Ye and Godzik have recently suggested that adding geometric flexibility may help better model protein structures in a variety of contexts. We introduce the program Matt (Multiple Alignment with Translations and Twists), an aligned fragment pair chaining algorithm that, in intermediate steps, allows local flexibility between fragments: small translations and rotations are temporarily allowed to bring sets of aligned fragments closer, even if they are physically impossible under rigid body transformations. After a dynamic programming assembly guided by these "bent" alignments, geometric consistency is restored in the final step before the alignment is output. Matt is tested against other recent multiple protein structure alignment programs on the popular Homstrad and SABmark benchmark datasets. Matt's global performance is competitive with the other programs on Homstrad, but outperforms the other programs on SABmark, a benchmark of multiple structure alignments of proteins with more distant homology. On both datasets, Matt demonstrates an ability to better align the ends of alpha-helices and beta-strands, an important characteristic of any structure alignment program intended to help construct a structural template library for threading approaches to the inverse protein-folding problem. The related question of whether Matt alignments can be used to distinguish distantly homologous structure pairs from pairs of proteins that are not homologous is also considered. 
For this purpose, a p-value score based on the length of the common core and average root mean squared deviation (RMSD) of Matt alignments is shown to largely separate decoys from homologous protein structures in the SABmark benchmark dataset. We postulate that Matt's strong performance comes from its ability to model proteins in different conformational states and, perhaps even more important, its ability to model backbone distortions in more distantly related proteins.
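For readers unfamiliar with the RMSD term used in the score above, here is a minimal Python sketch of the root mean squared deviation between two equally sized sets of aligned 3D coordinates. It assumes the structures have already been superimposed and the residues paired; it performs no optimal superposition, and it is not Matt's scoring function or p-value. The coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between paired 3D points.

    Assumes the two coordinate lists are already superimposed and that
    coords_a[i] is aligned to coords_b[i].
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    sq_sum = sum(
        (a - b) ** 2
        for p, q in zip(coords_a, coords_b)
        for a, b in zip(p, q)
    )
    return math.sqrt(sq_sum / len(coords_a))


# Hypothetical aligned core of two residues; the second copy is shifted by (0, 3, 4).
core_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
core_b = [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0)]
deviation = rmsd(core_a, core_b)
```

Because RMSD tends to grow with the number of aligned residues, alignment scores usually weigh it against the length of the common core, as the abstract indicates.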
Tsatsaronis, George; Balikas, Georgios; Malakasiotis, Prodromos; Partalas, Ioannis; Zschunke, Matthias; Alvers, Michael R; Weissenborn, Dirk; Krithara, Anastasia; Petridis, Sergios; Polychronopoulos, Dimitris; Almirantis, Yannis; Pavlopoulos, John; Baskiotis, Nicolas; Gallinari, Patrick; Artiéres, Thierry; Ngomo, Axel-Cyrille Ngonga; Heino, Norman; Gaussier, Eric; Barrio-Alvers, Liliana; Schroeder, Michael; Androutsopoulos, Ion; Paliouras, Georgios
2015-04-30
This article provides an overview of the first BIOASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BIOASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies. The 2013 BIOASQ competition comprised two tasks, Task 1a and Task 1b. In Task 1a participants were asked to automatically annotate new PUBMED documents with MESH headings. Twelve teams participated in Task 1a, with a total of 46 system runs submitted, and one of the teams performed consistently better than the MTI indexer used by NLM to suggest MESH headings to curators. Task 1b used benchmark datasets containing 29 development and 282 test English questions, along with gold standard (reference) answers, prepared by a team of biomedical experts from around Europe, and participants had to automatically produce answers. Three teams participated in Task 1b, with 11 system runs. The BIOASQ infrastructure, including benchmark datasets, evaluation mechanisms, and the results of the participants and baseline methods, is publicly available. A publicly available evaluation infrastructure for biomedical semantic indexing and QA has been developed, which includes benchmark datasets, and can be used to evaluate systems that: assign MESH headings to published articles or to English questions; retrieve relevant RDF triples from ontologies, relevant articles and snippets from PUBMED Central; produce "exact" and paragraph-sized "ideal" answers (summaries). The results of the systems that participated in the 2013 BIOASQ competition are promising. In Task 1a one of the systems performed consistently better than the NLM's MTI indexer.
In Task 1b the systems received high scores in the manual evaluation of the "ideal" answers; hence, they produced high quality summaries as answers. Overall, BIOASQ helped obtain a unified view of how techniques from text classification, semantic indexing, document and passage retrieval, question answering, and text summarization can be combined to allow biomedical experts to obtain concise, user-understandable answers to questions reflecting their real information needs.
Churg, Andrew; Attanoos, Richard; Borczuk, Alain C; Chirieac, Lucian R; Galateau-Sallé, Françoise; Gibbs, Allen; Henderson, Douglas; Roggli, Victor; Rusch, Valerie; Judge, Meagan J; Srigley, John R
2016-10-01
Context: The International Collaboration on Cancer Reporting is a not-for-profit organization formed by the Royal Colleges of Pathologists of Australasia and the United Kingdom; the College of American Pathologists; the Canadian Association of Pathologists-Association Canadienne des Pathologists, in association with the Canadian Partnership Against Cancer; and the European Society of Pathology. Its goal is to produce common, internationally agreed upon, evidence-based datasets for use throughout the world. Objective: To describe a dataset developed by the Expert Panel of the International Collaboration on Cancer Reporting for reporting malignant mesothelioma of both the pleura and peritoneum. The dataset is composed of "required" (mandatory) and "recommended" (nonmandatory) elements. Design: Based on a review of the most recent evidence and supported by explanatory commentary. Results: Eight required elements and 7 recommended elements were agreed upon by the Expert Panel to represent the essential information for reporting malignant mesothelioma of the pleura and peritoneum. Conclusions: In time, the widespread use of an internationally agreed upon, structured, pathology dataset for mesothelioma will lead not only to improved patient management but also provide valuable data for research and international benchmarks.
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation
Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi
2015-01-01
Most popular clustering methods make strong assumptions about the dataset. For example, k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density based isolation criteria, our proposed centroid distance isolation criterion addresses the problem caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method to separate naturally isolated clusters but also can identify the clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behavioral information. The results show that the LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it. PMID:26221133
A Manual Segmentation Tool for Three-Dimensional Neuron Datasets.
Magliaro, Chiara; Callara, Alejandro L; Vanello, Nicola; Ahluwalia, Arti
2017-01-01
To date, automated or semi-automated software and algorithms for segmentation of neurons from three-dimensional imaging datasets have had limited success. The gold standard for neural segmentation is considered to be the manual isolation performed by an expert. To facilitate the manual isolation of complex objects from image stacks, such as neurons in their native arrangement within the brain, a new Manual Segmentation Tool (ManSegTool) has been developed. ManSegTool allows the user to load an image stack, scroll through the images and manually draw the structures of interest stack-by-stack. Users can eliminate unwanted regions or split structures (i.e., branches from different neurons that are too close to each other, but, to the experienced eye, clearly belong to a unique cell), view the object in 3D and save the results obtained. The tool can be used for testing the performance of a single-neuron segmentation algorithm or to extract complex objects, where the available automated methods still fail. Here we describe the software's main features and then show an example of how ManSegTool can be used to segment neuron images acquired using a confocal microscope. In particular, expert neuroscientists were asked to segment different neurons from which morphometric variables were subsequently extracted as a benchmark for precision. In addition, a literature-defined index for evaluating the goodness of segmentation was used as a benchmark for accuracy. Neocortical layer axons from a DIADEM challenge dataset were also segmented with ManSegTool and compared with the manual "gold-standard" generated for the competition.
Methodology and issues of integral experiments selection for nuclear data validation
NASA Astrophysics Data System (ADS)
Tatiana, Ivanova; Ivanov, Evgeny; Hill, Ian
2017-09-01
Nuclear data validation involves a large suite of Integral Experiments (IEs) for criticality, reactor physics and dosimetry applications. [1] Often benchmarks are taken from international Handbooks. [2, 3] Depending on the application, IEs have different degrees of usefulness in validation, and usually the use of a single benchmark is not advised; indeed, it may lead to erroneous interpretation and results. [1] This work aims at quantifying the importance of benchmarks used in application dependent cross section validation. The approach is based on the well-known General Linear Least Squared Method (GLLSM), extended to establish biases and uncertainties for given cross sections (within a given energy interval). The statistical treatment results in a vector of weighting factors for the integral benchmarks. These factors characterize the value added by a benchmark for nuclear data validation for the given application. The methodology is illustrated by one example, selecting benchmarks for 239Pu cross section validation. The studies were performed in the framework of Subgroup 39 (Methods and approaches to provide feedback from nuclear and covariance data adjustment for improvement of nuclear data files) established at the Working Party on International Nuclear Data Evaluation Cooperation (WPEC) of the Nuclear Science Committee under the Nuclear Energy Agency (NEA/OECD).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Peiyuan; Brown, Timothy; Fullmer, William D.
Five benchmark problems are developed and simulated with the computational fluid dynamics and discrete element model code MFiX. The benchmark problems span dilute and dense regimes, consider statistically homogeneous and inhomogeneous (both clusters and bubbles) particle concentrations and a range of particle and fluid dynamic computational loads. Several variations of the benchmark problems are also discussed to extend the computational phase space to cover granular (particles only), bidisperse and heat transfer cases. A weak scaling analysis is performed for each benchmark problem and, in most cases, the scalability of the code appears reasonable up to approx. 10^3 cores. Profiling of the benchmark problems indicates that the most substantial computational time is being spent on particle-particle force calculations, drag force calculations and interpolating between discrete particle and continuum fields. Hardware performance analysis was also carried out, showing significant Level 2 cache miss ratios and a rather low degree of vectorization. These results are intended to serve as a baseline for future developments to the code as well as a preliminary indicator of where to best focus performance optimizations.
Bayesian estimation of differential transcript usage from RNA-seq data.
Papastamoulis, Panagiotis; Rattray, Magnus
2017-11-27
Next generation sequencing allows the identification of genes consisting of differentially expressed transcripts, a term which usually refers to changes in the overall expression level. A specific type of differential expression is differential transcript usage (DTU), which targets changes in the relative within-gene expression of a transcript. The contribution of this paper is to: (a) extend to the DTU context the use of cjBitSeq, a previously introduced Bayesian model originally designed for identifying changes in overall expression levels, and (b) propose a Bayesian version of DRIMSeq, a frequentist model for inferring DTU. cjBitSeq is a read based model and performs fully Bayesian inference by MCMC sampling on the space of latent state of each transcript per gene. BayesDRIMSeq is a count based model and estimates the Bayes Factor of a DTU model against a null model using Laplace's approximation. The proposed models are benchmarked against the existing ones using a recent independent simulation study as well as a real RNA-seq dataset. Our results suggest that the Bayesian methods exhibit similar performance to DRIMSeq in terms of precision/recall but offer better calibration of False Discovery Rate.
NASA Astrophysics Data System (ADS)
Arnold, Nicholas; Loch, Stuart; Ballance, Connor; Thomas, Ed
2017-10-01
Low temperature plasmas (Te < 10 eV) are ubiquitous in the medical, industrial, basic, and dusty plasma communities, and offer an opportunity for researchers to gain a better understanding of atomic processes in plasmas. Here, we report on a new atomic dataset for neutral and low charge states of argon, from which rate coefficients and cross-sections for the electron-impact excitation of neutral argon are determined. We benchmark by comparing with electron impact excitation cross-sections available in the literature, with very good agreement. We have used the Atomic Data and Analysis Structure (ADAS) code suite to calculate a level-resolved, generalized collisional-radiative (GCR) model for line emission in low temperature argon plasmas. By combining our theoretical model with experimental electron temperature, density, and spectral measurements from the Auburn Linear eXperiment for Instability Studies (ALEXIS), we have developed diagnostic techniques to measure metastable fraction, electron temperature, and electron density. In the future we hope to refine our methods, and extend our model to plasmas other than ALEXIS. Supported by the U.S. Department of Energy. Grant Number: DE-FG02-00ER54476.
Galpert, Deborah; del Río, Sara; Herrera, Francisco; Ancede-Gallardo, Evys; Antunes, Agostinho; Agüero-Chapin, Guillermin
2015-01-01
Orthology detection requires more effective scaling algorithms. In this paper, a set of gene pair features based on similarity measures (alignment scores, sequence length, gene membership to conserved regions, and physicochemical profiles) are combined in a supervised pairwise ortholog detection approach to improve effectiveness considering low ortholog ratios in relation to the possible pairwise comparison between two genomes. In this scenario, big data supervised classifiers managing imbalance between ortholog and nonortholog pair classes allow for an effective scaling solution built from two genomes and extended to other genome pairs. The supervised approach was compared with RBH, RSD, and OMA algorithms by using the following yeast genome pairs: Saccharomyces cerevisiae-Kluyveromyces lactis, Saccharomyces cerevisiae-Candida glabrata, and Saccharomyces cerevisiae-Schizosaccharomyces pombe as benchmark datasets. Because of the large amount of imbalanced data, the building and testing of the supervised model were only possible by using big data supervised classifiers managing imbalance. Evaluation metrics taking low ortholog ratios into account were applied. From the effectiveness perspective, MapReduce Random Oversampling combined with Spark SVM outperformed RBH, RSD, and OMA, probably because of the consideration of gene pair features beyond alignment similarities combined with the advances in big data supervised classification. PMID:26605337
Estimating differential expression from multiple indicators
Ilmjärv, Sten; Hundahl, Christian Ansgar; Reimets, Riin; Niitsoo, Margus; Kolde, Raivo; Vilo, Jaak; Vasar, Eero; Luuk, Hendrik
2014-01-01
Regardless of the advent of high-throughput sequencing, microarrays remain central in current biomedical research. Conventional microarray analysis pipelines apply data reduction before the estimation of differential expression, which is likely to render the estimates susceptible to noise from signal summarization and reduce statistical power. We present a probe-level framework, which capitalizes on the high number of concurrent measurements to provide more robust differential expression estimates. The framework naturally extends to various experimental designs and target categories (e.g. transcripts, genes, genomic regions) as well as small sample sizes. Benchmarking in relation to popular microarray and RNA-sequencing data-analysis pipelines indicated high and stable performance on the Microarray Quality Control dataset and in a cell-culture model of hypoxia. Experimental data exhibiting long-range epigenetic silencing of gene expression was used to demonstrate the efficacy of detecting differential expression of genomic regions, a level of analysis not embraced by conventional workflows. Finally, we designed and conducted an experiment to identify hypothermia-responsive genes in terms of monotonic time-response. As a novel insight, hypothermia-dependent up-regulation of multiple genes of two major antioxidant pathways was identified and verified by quantitative real-time PCR. PMID:24586062
An Active Learning Framework for Hyperspectral Image Classification Using Hierarchical Segmentation
NASA Technical Reports Server (NTRS)
Zhang, Zhou; Pasolli, Edoardo; Crawford, Melba M.; Tilton, James C.
2015-01-01
Augmenting spectral data with spatial information for image classification has recently gained significant attention, as classification accuracy can often be improved by extracting spatial information from neighboring pixels. In this paper, we propose a new framework in which active learning (AL) and hierarchical segmentation (HSeg) are combined for spectral-spatial classification of hyperspectral images. The spatial information is extracted from a best segmentation obtained by pruning the HSeg tree using a new supervised strategy. The best segmentation is updated at each iteration of the AL process, thus taking advantage of informative labeled samples provided by the user. The proposed strategy incorporates spatial information in two ways: 1) concatenating the extracted spatial features and the original spectral features into a stacked vector and 2) extending the training set using a self-learning-based semi-supervised learning (SSL) approach. Finally, the two strategies are combined within an AL framework. The proposed framework is validated with two benchmark hyperspectral datasets. Higher classification accuracies are obtained by the proposed framework with respect to five other state-of-the-art spectral-spatial classification approaches. Moreover, the effectiveness of the proposed pruning strategy is also demonstrated relative to the approaches based on a fixed segmentation.
Multimodal emotional state recognition using sequence-dependent deep hierarchical features.
Barros, Pablo; Jirak, Doreen; Weber, Cornelius; Wermter, Stefan
2015-12-01
Emotional state recognition has become an important topic for human-robot interaction in recent years. By determining emotion expressions, robots can identify important variables of human behavior and use these to communicate in a more human-like fashion and thereby extend the interaction possibilities. Human emotions are multimodal and spontaneous, which makes them hard for robots to recognize. Each modality has its own restrictions and constraints which, together with the non-structured behavior of spontaneous expressions, create several difficulties for the approaches present in the literature, which are based on several explicit feature extraction techniques and manual modality fusion. Our model uses a hierarchical feature representation to deal with spontaneous emotions, and learns how to integrate multiple modalities for non-verbal emotion recognition, making it suitable to be used in an HRI scenario. Our experiments show that a significant improvement of recognition accuracy is achieved when we use hierarchical features and multimodal information, and our model improves the accuracy of state-of-the-art approaches from 82.5% reported in the literature to 91.3% for a benchmark dataset on spontaneous emotion expressions. Copyright © 2015 The Authors. Published by Elsevier Ltd. All rights reserved.
Analysis of energy-based algorithms for RNA secondary structure prediction
2012-01-01
Background: RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since RNA function depends in large part on its folded structures, much effort has been invested in developing accurate methods for prediction of RNA secondary structure from the base sequence. Minimum free energy (MFE) predictions are widely used, based on nearest neighbor thermodynamic parameters of Mathews, Turner et al. or those of Andronescu et al. Some recently proposed alternatives that leverage partition function calculations find the structure with maximum expected accuracy (MEA) or pseudo-expected accuracy (pseudo-MEA) methods. Advances in prediction methods are typically benchmarked using sensitivity, positive predictive value and their harmonic mean, namely F-measure, on datasets of known reference structures. Since such benchmarks document progress in improving accuracy of computational prediction methods, it is important to understand how measures of accuracy vary as a function of the reference datasets and whether advances in algorithms or thermodynamic parameters yield statistically significant improvements. Our work advances such understanding for the MFE and (pseudo-)MEA-based methods, with respect to the latest datasets and energy parameters. Results: We present three main findings. First, using the bootstrap percentile method, we show that the average F-measure accuracy of the MFE and (pseudo-)MEA-based algorithms, as measured on our largest datasets with over 2000 RNAs from diverse families, is a reliable estimate (within a 2% range with high confidence) of the accuracy of a population of RNA molecules represented by this set. However, average accuracy on smaller classes of RNAs such as a class of 89 Group I introns used previously in benchmarking algorithm accuracy is not reliable enough to draw meaningful conclusions about the relative merits of the MFE and MEA-based algorithms.
Second, on our large datasets, the algorithm with best overall accuracy is a pseudo MEA-based algorithm of Hamada et al. that uses a generalized centroid estimator of base pairs. However, between MFE and other MEA-based methods, there is no clear winner in the sense that the relative accuracy of the MFE versus MEA-based algorithms changes depending on the underlying energy parameters. Third, of the four parameter sets we considered, the best accuracy for the MFE-, MEA-based, and pseudo-MEA-based methods is 0.686, 0.680, and 0.711, respectively (on a scale from 0 to 1 with 1 meaning perfect structure predictions) and is obtained with a thermodynamic parameter set obtained by Andronescu et al. called BL* (named after the Boltzmann likelihood method by which the parameters were derived). Conclusions: Large datasets should be used to obtain reliable measures of the accuracy of RNA structure prediction algorithms, and average accuracies on specific classes (such as Group I introns and Transfer RNAs) should be interpreted with caution, considering the relatively small size of currently available datasets for such classes. The accuracy of the MEA-based methods is significantly higher when using the BL* parameter set of Andronescu et al. than when using the parameters of Mathews and Turner, and there is no significant difference between the accuracy of MEA-based methods and MFE when using the BL* parameters. The pseudo-MEA-based method of Hamada et al. with the BL* parameter set significantly outperforms all other MFE and MEA-based algorithms on our large data sets. PMID:22296803
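The two statistical ingredients named in this abstract, the F-measure as the harmonic mean of sensitivity and positive predictive value, and a bootstrap percentile interval for an average accuracy, can be sketched as below. This is a generic illustration rather than the authors' code; the function names, the 1000-resample default, and the 95% level are assumptions chosen for the example.

```python
import random

def f_measure(sensitivity, ppv):
    """Harmonic mean of sensitivity and positive predictive value."""
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)

def bootstrap_percentile_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-RNA accuracies.

    Resamples the values with replacement, records each resample mean,
    and reads the interval off the sorted means.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi
```

A narrow interval on a large set and a wide one on a small class (such as 89 sequences) is the kind of contrast the abstract describes.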
Analysis of energy-based algorithms for RNA secondary structure prediction.
Hajiaghayi, Monir; Condon, Anne; Hoos, Holger H
2012-02-01
RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since RNA function depends in large part on its folded structures, much effort has been invested in developing accurate methods for prediction of RNA secondary structure from the base sequence. Minimum free energy (MFE) predictions are widely used, based on nearest neighbor thermodynamic parameters of Mathews, Turner et al. or those of Andronescu et al. Some recently proposed alternatives that leverage partition function calculations find the structure with maximum expected accuracy (MEA) or pseudo-expected accuracy (pseudo-MEA) methods. Advances in prediction methods are typically benchmarked using sensitivity, positive predictive value and their harmonic mean, namely F-measure, on datasets of known reference structures. Since such benchmarks document progress in improving accuracy of computational prediction methods, it is important to understand how measures of accuracy vary as a function of the reference datasets and whether advances in algorithms or thermodynamic parameters yield statistically significant improvements. Our work advances such understanding for the MFE and (pseudo-)MEA-based methods, with respect to the latest datasets and energy parameters. We present three main findings. First, using the bootstrap percentile method, we show that the average F-measure accuracy of the MFE and (pseudo-)MEA-based algorithms, as measured on our largest datasets with over 2000 RNAs from diverse families, is a reliable estimate (within a 2% range with high confidence) of the accuracy of a population of RNA molecules represented by this set. However, average accuracy on smaller classes of RNAs such as a class of 89 Group I introns used previously in benchmarking algorithm accuracy is not reliable enough to draw meaningful conclusions about the relative merits of the MFE and MEA-based algorithms. 
Second, on our large datasets, the algorithm with best overall accuracy is a pseudo MEA-based algorithm of Hamada et al. that uses a generalized centroid estimator of base pairs. However, between MFE and other MEA-based methods, there is no clear winner in the sense that the relative accuracy of the MFE versus MEA-based algorithms changes depending on the underlying energy parameters. Third, of the four parameter sets we considered, the best accuracy for the MFE-, MEA-based, and pseudo-MEA-based methods is 0.686, 0.680, and 0.711, respectively (on a scale from 0 to 1 with 1 meaning perfect structure predictions) and is obtained with a thermodynamic parameter set obtained by Andronescu et al. called BL* (named after the Boltzmann likelihood method by which the parameters were derived). Large datasets should be used to obtain reliable measures of the accuracy of RNA structure prediction algorithms, and average accuracies on specific classes (such as Group I introns and Transfer RNAs) should be interpreted with caution, considering the relatively small size of currently available datasets for such classes. The accuracy of the MEA-based methods is significantly higher when using the BL* parameter set of Andronescu et al. than when using the parameters of Mathews and Turner, and there is no significant difference between the accuracy of MEA-based methods and MFE when using the BL* parameters. The pseudo-MEA-based method of Hamada et al. with the BL* parameter set significantly outperforms all other MFE and MEA-based algorithms on our large data sets.
Evaluation of Graph Pattern Matching Workloads in Graph Analysis Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hong, Seokyong; Lee, Sangkeun; Lim, Seung-Hwan
2016-01-01
Graph analysis has emerged as a powerful method for data scientists to represent, integrate, query, and explore heterogeneous data sources. As a result, graph data management and mining became a popular area of research, and led to the development of a plethora of systems in recent years. Unfortunately, the number of emerging graph analysis systems and the wide range of applications, coupled with a lack of apples-to-apples comparisons, make it difficult to understand the trade-offs between different systems and the graph operations for which they are designed. A fair comparison of these systems is a challenging task for the following reasons: multiple data models, non-standardized serialization formats, various query interfaces to users, and diverse environments they operate in. To address these key challenges, in this paper we present a new benchmark suite by extending the Lehigh University Benchmark (LUBM) to cover the most common capabilities of various graph analysis systems. We provide the design process of the benchmark, which generalizes the workflow for data scientists to conduct the desired graph analysis on different graph analysis systems. Equipped with this extended benchmark suite, we present a performance comparison for nine subgraph pattern retrieval operations over six graph analysis systems, namely NetworkX, Neo4j, Jena, Titan, GraphX, and uRiKA. Through the proposed benchmark suite, this study reveals both quantitative and qualitative findings in (1) implications in loading data into each system; (2) challenges in describing graph patterns for each query interface; and (3) different sensitivity of each system to query selectivity. We envision that this study will pave the road for: (i) data scientists to select the suitable graph analysis systems, and (ii) data management system designers to advance graph analysis systems.
Zapata-Peñasco, Icoquih; Poot-Hernandez, Augusto Cesar; Eguiarte, Luis E
2017-01-01
The increasing number of metagenomic and genomic sequences has dramatically improved our understanding of microbial diversity, yet our ability to infer metabolic capabilities in such datasets remains challenging. We describe the Multigenomic Entropy Based Score pipeline (MEBS), a software platform designed to evaluate, compare, and infer complex metabolic pathways in large “omic” datasets, including entire biogeochemical cycles. MEBS is open source and available through https://github.com/eead-csic-compbio/metagenome_Pfam_score. To demonstrate its use, we modeled the sulfur cycle by exhaustively curating the molecular and ecological elements involved (compounds, genes, metabolic pathways, and microbial taxa). This information was reduced to a collection of 112 characteristic Pfam protein domains and a list of complete-sequenced sulfur genomes. Using the mathematical framework of relative entropy (H΄), we quantitatively measured the enrichment of these domains among sulfur genomes. The entropy of each domain was used both to build up a final score that indicates whether a (meta)genomic sample contains the metabolic machinery of interest and to propose marker domains in metagenomic sequences such as DsrC (PF04358). MEBS was benchmarked with a dataset of 2107 non-redundant microbial genomes from RefSeq and 935 metagenomes from MG-RAST. Its performance, reproducibility, and robustness were evaluated using several approaches, including random sampling, linear regression models, receiver operator characteristic plots, and the area under the curve metric (AUC). Our results support the broad applicability of this algorithm to accurately classify (AUC = 0.985) hard-to-culture genomes (e.g., Candidatus Desulforudis audaxviator), previously characterized ones, and metagenomic environments such as hydrothermal vents, or deep-sea sediment.
Our benchmark indicates that an entropy-based score can capture the metabolic machinery of interest and can be used to efficiently classify large genomic and metagenomic datasets, including uncultivated/unexplored taxa. PMID:29069412
SU-G-BRC-17: Using Generalized Mean for Equivalent Square Estimation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhou, S; Fan, Q; Lei, Y
Purpose: Equivalent Square (ES) is a widely used concept in radiotherapy. It enables us to determine many important quantities for a rectangular treatment field, without measurement, based on the corresponding values from its ES field. In this study, we propose a Generalized Mean (GM) type ES formula and compare it with other established formulae using benchmark datasets. Methods: Our GM approach is expressed as ES = (w•fx^α + (1−w)•fy^α)^(1/α), where fx, fy, α, and w represent field sizes, power index, and a weighting factor, respectively. When α=−1 it reduces to well-known Sterling type ES formulae. In our study, α and w are determined through least-square-fitting. Akaike Information Criterion (AIC) was used to benchmark the performance of each formula. BJR (Supplement 17) ES field table for X-ray PDDs and open field output factor tables in Varian TrueBeam representative dataset were used for validation. Results: Switching from α=−1 to α=−1.25, a 20% reduction in standard deviation of residual error in ES estimation was achieved for the BJR dataset. The maximum relative residual error was reduced from ∼3% (in Sterling formula) or ∼2% (in Vadash/Bjarngard formula) down to ∼1% in GM formula for open fields of all energies and at rectangular field sizes from 3cm to 40cm in the Varian dataset. The improvement of the GM over the Sterling type ES formulae is particularly noticeable for very elongated rectangular fields with short width. AIC analysis confirmed the superior performance of the GM formula after taking into account the expanded parameter space. Conclusion: The GM significantly outperforms Sterling type formulae at slightly increased computational cost. The GM calculation may nullify the requirement of data measurement for many rectangular fields and hence shorten the Linac commissioning process. Improved dose calculation accuracy is also expected by adopting the GM formula into treatment planning and secondary MU check systems.
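The generalized-mean formula stated in the Methods can be evaluated directly, as in the minimal sketch below. The function names are hypothetical, and the defaults α = −1 and w = 0.5 are assumptions chosen so that the expression coincides with the familiar 2·fx·fy/(fx + fy) form; the abstract reports α = −1.25 from fitting but does not give the fitted w, so none is assumed here.

```python
def equivalent_square_gm(fx, fy, alpha=-1.0, w=0.5):
    """Generalized-mean equivalent square: (w*fx^a + (1-w)*fy^a)^(1/a).

    fx, fy: rectangular field sizes (same length unit, e.g. cm).
    alpha, w: power index and weighting factor; defaults are illustrative.
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("field sizes must be positive")
    if alpha == 0:
        raise ValueError("alpha must be non-zero")
    return (w * fx**alpha + (1 - w) * fy**alpha) ** (1 / alpha)

def equivalent_square_sterling(fx, fy):
    """Sterling-type estimate 2*fx*fy/(fx+fy), i.e. alpha=-1 with w=0.5."""
    return 2 * fx * fy / (fx + fy)
```

For a square field (fx = fy) the generalized mean returns the side length for any α and w, which is a quick sanity check on an implementation.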
De Anda, Valerie; Zapata-Peñasco, Icoquih; Poot-Hernandez, Augusto Cesar; Eguiarte, Luis E; Contreras-Moreira, Bruno; Souza, Valeria
2017-11-01
The increasing number of metagenomic and genomic sequences has dramatically improved our understanding of microbial diversity, yet our ability to infer metabolic capabilities in such datasets remains challenging. We describe the Multigenomic Entropy Based Score pipeline (MEBS), a software platform designed to evaluate, compare, and infer complex metabolic pathways in large "omic" datasets, including entire biogeochemical cycles. MEBS is open source and available through https://github.com/eead-csic-compbio/metagenome_Pfam_score. To demonstrate its use, we modeled the sulfur cycle by exhaustively curating the molecular and ecological elements involved (compounds, genes, metabolic pathways, and microbial taxa). This information was reduced to a collection of 112 characteristic Pfam protein domains and a list of complete-sequenced sulfur genomes. Using the mathematical framework of relative entropy (H΄), we quantitatively measured the enrichment of these domains among sulfur genomes. The entropy of each domain was used both to build up a final score that indicates whether a (meta)genomic sample contains the metabolic machinery of interest and to propose marker domains in metagenomic sequences such as DsrC (PF04358). MEBS was benchmarked with a dataset of 2107 non-redundant microbial genomes from RefSeq and 935 metagenomes from MG-RAST. Its performance, reproducibility, and robustness were evaluated using several approaches, including random sampling, linear regression models, receiver operator characteristic plots, and the area under the curve metric (AUC). Our results support the broad applicability of this algorithm to accurately classify (AUC = 0.985) hard-to-culture genomes (e.g., Candidatus Desulforudis audaxviator), previously characterized ones, and metagenomic environments such as hydrothermal vents, or deep-sea sediment. 
Our benchmark indicates that an entropy-based score can capture the metabolic machinery of interest and can be used to efficiently classify large genomic and metagenomic datasets, including uncultivated/unexplored taxa. © The Author 2017. Published by Oxford University Press.
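Relative entropy, the framework this abstract names, is commonly computed as the Kullback-Leibler divergence between an observed distribution P and a background distribution Q. The sketch below shows that generic calculation in bits. How MEBS actually defines P and Q for each Pfam domain is not stated in the abstract, so the function name and the two-state usage described afterwards are assumptions for illustration only.

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits.

    p, q: sequences of probabilities over the same outcomes.
    Terms with p_i == 0 contribute nothing; p_i > 0 with q_i == 0
    makes the divergence infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total
```

As a hypothetical use, P could hold the present/absent frequencies of one domain among the curated genomes and Q the same frequencies among all genomes; a larger value would then indicate stronger enrichment.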
BLOND, a building-level office environment dataset of typical electrical appliances.
Kriechbaumer, Thomas; Jacobsen, Hans-Arno
2018-03-27
Energy metering has gained popularity as conventional meters are replaced by electronic smart meters that promise energy savings and higher comfort levels for occupants. Achieving these goals requires a deeper understanding of consumption patterns to reduce the energy footprint: load profile forecasting, power disaggregation, appliance identification, startup event detection, etc. Publicly available datasets are used to test, verify, and benchmark possible solutions to these problems. For this purpose, we present the BLOND dataset: continuous energy measurements of a typical office environment at high sampling rates with common appliances and load profiles. We provide voltage and current readings for aggregated circuits and matching fully-labeled ground truth data (individual appliance measurements). The dataset contains 53 appliances (16 classes) in a 3-phase power grid. BLOND-50 contains 213 days of measurements sampled at 50kSps (aggregate) and 6.4kSps (individual appliances). BLOND-250 consists of the same setup: 50 days, 250kSps (aggregate), 50kSps (individual appliances). These are the longest continuous measurements at such high sampling rates and fully-labeled ground truth we are aware of.
BLOND, a building-level office environment dataset of typical electrical appliances
NASA Astrophysics Data System (ADS)
Kriechbaumer, Thomas; Jacobsen, Hans-Arno
2018-03-01
Energy metering has gained popularity as conventional meters are replaced by electronic smart meters that promise energy savings and higher comfort levels for occupants. Achieving these goals requires a deeper understanding of consumption patterns to reduce the energy footprint: load profile forecasting, power disaggregation, appliance identification, startup event detection, etc. Publicly available datasets are used to test, verify, and benchmark possible solutions to these problems. For this purpose, we present the BLOND dataset: continuous energy measurements of a typical office environment at high sampling rates with common appliances and load profiles. We provide voltage and current readings for aggregated circuits and matching fully-labeled ground truth data (individual appliance measurements). The dataset contains 53 appliances (16 classes) in a 3-phase power grid. BLOND-50 contains 213 days of measurements sampled at 50kSps (aggregate) and 6.4kSps (individual appliances). BLOND-250 consists of the same setup: 50 days, 250kSps (aggregate), 50kSps (individual appliances). These are the longest continuous measurements at such high sampling rates and fully-labeled ground truth we are aware of.
BLOND, a building-level office environment dataset of typical electrical appliances
Kriechbaumer, Thomas; Jacobsen, Hans-Arno
2018-01-01
Energy metering has gained popularity as conventional meters are replaced by electronic smart meters that promise energy savings and higher comfort levels for occupants. Achieving these goals requires a deeper understanding of consumption patterns to reduce the energy footprint: load profile forecasting, power disaggregation, appliance identification, startup event detection, etc. Publicly available datasets are used to test, verify, and benchmark possible solutions to these problems. For this purpose, we present the BLOND dataset: continuous energy measurements of a typical office environment at high sampling rates with common appliances and load profiles. We provide voltage and current readings for aggregated circuits and matching fully-labeled ground truth data (individual appliance measurements). The dataset contains 53 appliances (16 classes) in a 3-phase power grid. BLOND-50 contains 213 days of measurements sampled at 50kSps (aggregate) and 6.4kSps (individual appliances). BLOND-250 consists of the same setup: 50 days, 250kSps (aggregate), 50kSps (individual appliances). These are the longest continuous measurements at such high sampling rates and fully-labeled ground truth we are aware of. PMID:29583141
Integration of heterogeneous features for remote sensing scene classification
NASA Astrophysics Data System (ADS)
Wang, Xin; Xiong, Xingnan; Ning, Chen; Shi, Aiye; Lv, Guofang
2018-01-01
Scene classification is one of the most important issues in remote sensing (RS) image processing. We find that features from different channels (shape, spectral, texture, etc.), levels (low-level and middle-level), or perspectives (local and global) could provide various properties for RS images, and then propose a heterogeneous feature framework to extract and integrate heterogeneous features of different types for RS scene classification. The proposed method is composed of three modules: (1) heterogeneous features extraction, where three heterogeneous feature types, called DS-SURF-LLC, mean-Std-LLC, and MS-CLBP, are calculated, (2) heterogeneous features fusion, where multiple kernel learning (MKL) is utilized to integrate the heterogeneous features, and (3) an MKL support vector machine classifier for RS scene classification. The proposed method is extensively evaluated on three challenging benchmark datasets (a 6-class dataset, a 12-class dataset, and a 21-class dataset), and the experimental results show that the proposed method leads to good classification performance. It produces good informative features to describe the RS image scenes. Moreover, the integration of heterogeneous features outperforms some state-of-the-art features on RS scene classification tasks.
Yu, Jinchao; Guerois, Raphaël
2016-12-15
Protein-protein docking methods are of great importance for understanding interactomes at the structural level. It has become increasingly appealing to use not only experimental structures but also homology models of unbound subunits as input for docking simulations. So far we are missing a large scale assessment of the success of rigid-body free docking methods on homology models. We explored how we could benefit from comparative modelling of unbound subunits to expand docking benchmark datasets. Starting from a collection of 3157 non-redundant, high X-ray resolution heterodimers, we developed the PPI4DOCK benchmark containing 1417 docking targets based on unbound homology models. Rigid-body docking by Zdock showed that for 1208 cases (85.2%), at least one correct decoy was generated, emphasizing the efficiency of rigid-body docking in generating correct assemblies. Overall, the PPI4DOCK benchmark contains a large set of realistic cases and provides new ground for assessing docking and scoring methodologies. Benchmark sets can be downloaded from http://biodev.cea.fr/interevol/ppi4dock/ CONTACT: guerois@cea.fr Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Navier-Stokes analysis of a liquid rocket engine disk cavity
NASA Technical Reports Server (NTRS)
Benjamin, Theodore G.; Mcconnaughey, Paul K.
1991-01-01
This paper presents a Navier-Stokes analysis of hydrodynamic phenomena occurring in the aft disk cavity of a liquid rocket engine turbine. The cavity analyzed is in the Space Shuttle Main Engine Alternate Turbopump currently being developed by NASA and Pratt and Whitney. Comparisons of results obtained from the Navier-Stokes code for two rotating disk datasets available in the literature are presented as benchmark validations. The benchmark results obtained using the code show good agreement relative to experimental data, and the turbine disk cavity was analyzed with comparable grid resolution, dissipation levels, and turbulence models. Predicted temperatures in the cavity show that little mixing of hot and cold fluid occurs in the cavity and the flow is dominated by swirl and pumping up the rotating disk.
OpenSHS: Open Smart Home Simulator.
Alshammari, Nasser; Alshammari, Talal; Sedky, Mohamed; Champion, Justin; Bauer, Carolin
2017-05-02
This paper develops a new hybrid, open-source, cross-platform 3D smart home simulator, OpenSHS, for dataset generation. OpenSHS offers an opportunity for researchers in the field of the Internet of Things (IoT) and machine learning to test and evaluate their models. Following a hybrid approach, OpenSHS combines advantages from both interactive and model-based approaches. This approach reduces the time and effort required to generate simulated smart home datasets. We have designed a replication algorithm for extending and expanding a dataset. A small sample dataset produced by OpenSHS can be extended without affecting the logical order of the events. The replication provides a solution for generating large representative smart home datasets. We have built an extensible library of smart devices that facilitates the simulation of current and future smart home environments. Our tool divides the dataset generation process into three distinct phases: first, design: the researcher designs the initial virtual environment by building the home, importing smart devices and creating contexts; second, simulation: the participant simulates his/her context-specific events; and third, aggregation: the researcher applies the replication algorithm to generate the final dataset. We conducted a study to assess the ease of use of our tool on the System Usability Scale (SUS).
OpenSHS: Open Smart Home Simulator
Alshammari, Nasser; Alshammari, Talal; Sedky, Mohamed; Champion, Justin; Bauer, Carolin
2017-01-01
This paper develops a new hybrid, open-source, cross-platform 3D smart home simulator, OpenSHS, for dataset generation. OpenSHS offers an opportunity for researchers in the field of the Internet of Things (IoT) and machine learning to test and evaluate their models. Following a hybrid approach, OpenSHS combines advantages from both interactive and model-based approaches. This approach reduces the time and effort required to generate simulated smart home datasets. We have designed a replication algorithm for extending and expanding a dataset. A small sample dataset produced by OpenSHS can be extended without affecting the logical order of the events. The replication provides a solution for generating large representative smart home datasets. We have built an extensible library of smart devices that facilitates the simulation of current and future smart home environments. Our tool divides the dataset generation process into three distinct phases: first, design: the researcher designs the initial virtual environment by building the home, importing smart devices and creating contexts; second, simulation: the participant simulates his/her context-specific events; and third, aggregation: the researcher applies the replication algorithm to generate the final dataset. We conducted a study to assess the ease of use of our tool on the System Usability Scale (SUS). PMID:28468330
Bauer, Matthias R; Ibrahim, Tamer M; Vogel, Simon M; Boeckler, Frank M
2013-06-24
The application of molecular benchmarking sets helps to assess the actual performance of virtual screening (VS) workflows. To improve the efficiency of structure-based VS approaches, the selection and optimization of various parameters can be guided by benchmarking. With the DEKOIS 2.0 library, we aim to further extend and complement the collection of publicly available decoy sets. Based on BindingDB bioactivity data, we provide 81 new and structurally diverse benchmark sets for a wide variety of different target classes. To ensure a meaningful selection of ligands, we address several issues that can be found in bioactivity data. We have improved our previously introduced DEKOIS methodology with enhanced physicochemical matching, now including the consideration of molecular charges, as well as a more sophisticated elimination of latent actives in the decoy set (LADS). We evaluate the docking performance of Glide, GOLD, and AutoDock Vina with our data sets and highlight existing challenges for VS tools. All DEKOIS 2.0 benchmark sets will be made accessible at http://www.dekois.com.
Zheng, Heping; Shabalin, Ivan G.; Handing, Katarzyna B.; Bujnicki, Janusz M.; Minor, Wladek
2015-01-01
The ubiquitous presence of magnesium ions in RNA has long been recognized as a key factor governing RNA folding, and is crucial for many diverse functions of RNA molecules. In this work, Mg2+-binding architectures in RNA were systematically studied using a database of RNA crystal structures from the Protein Data Bank (PDB). Due to the abundance of poorly modeled or incorrectly identified Mg2+ ions, the set of all sites was comprehensively validated and filtered to identify a benchmark dataset of 15 334 ‘reliable’ RNA-bound Mg2+ sites. The normalized frequencies by which specific RNA atoms coordinate Mg2+ were derived for both the inner and outer coordination spheres. A hierarchical classification system of Mg2+ sites in RNA structures was designed and applied to the benchmark dataset, yielding a set of 41 types of inner-sphere and 95 types of outer-sphere coordinating patterns. This classification system has also been applied to describe six previously reported Mg2+-binding motifs and detect them in new RNA structures. Investigation of the most populous site types resulted in the identification of seven novel Mg2+-binding motifs, and all RNA structures in the PDB were screened for the presence of these motifs. PMID:25800744
Benchmarking homogenization algorithms for monthly data
NASA Astrophysics Data System (ADS)
Venema, V. K. C.; Mestre, O.; Aguilar, E.; Auer, I.; Guijarro, J. A.; Domonkos, P.; Vertacnik, G.; Szentimrey, T.; Stepanek, P.; Zahradnicek, P.; Viarre, J.; Müller-Westermeier, G.; Lakatos, M.; Williams, C. N.; Menne, M. J.; Lindau, R.; Rasol, D.; Rustemeier, E.; Kolokythas, K.; Marinova, T.; Andresen, L.; Acquaotta, F.; Fratianni, S.; Cheval, S.; Klancar, M.; Brunetti, M.; Gruber, C.; Prohom Duran, M.; Likso, T.; Esteban, P.; Brandsma, T.; Willett, K.
2013-09-01
The COST (European Cooperation in Science and Technology) Action ES0601: Advances in homogenization methods of climate series: an integrated approach (HOME) has executed a blind intercomparison and validation study for monthly homogenization algorithms. Time series of monthly temperature and precipitation were evaluated because of their importance for climate studies. The algorithms were validated against a realistic benchmark dataset. Participants provided 25 separate homogenized contributions as part of the blind study as well as 22 additional solutions submitted after the details of the imposed inhomogeneities were revealed. These homogenized datasets were assessed by a number of performance metrics including i) the centered root mean square error relative to the true homogeneous values at various averaging scales, ii) the error in linear trend estimates and iii) traditional contingency skill scores. The metrics were computed both using the individual station series as well as the network average regional series. The performance of the contributions depends significantly on the error metric considered. Although relative homogenization algorithms typically improve the homogeneity of temperature data, only the best ones improve precipitation data. Moreover, state-of-the-art relative homogenization algorithms developed to work with an inhomogeneous reference are shown to perform best. The study showed that currently automatic algorithms can perform as well as manual ones.
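The first metric listed above, the centered root mean square error, can be illustrated with a minimal sketch. It assumes the common definition in which each series has its own mean removed before the differences are squared, so that a constant offset between the homogenized and the true series is not penalized. The function name `centered_rmse` and the sample values are hypothetical and are not taken from the HOME study.

```python
import math


def centered_rmse(estimate, truth):
    """RMSE computed after subtracting each series' own mean.

    Assumed definition: a constant bias between the two series
    contributes nothing; only differences in the anomalies count.
    """
    if len(estimate) != len(truth) or not estimate:
        raise ValueError("series must be non-empty and of equal length")
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(truth) / len(truth)
    squared = [((e - mean_e) - (t - mean_t)) ** 2
               for e, t in zip(estimate, truth)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical monthly values: a "true" homogeneous series, a copy with a
# constant offset, and a copy containing a step change (a break).
truth = [10.0, 11.0, 12.0, 13.0]
shifted = [t + 2.5 for t in truth]
broken = [10.0, 11.0, 13.0, 14.0]

print(centered_rmse(shifted, truth))  # constant offset is ignored
print(centered_rmse(broken, truth))   # the break is penalized
```

Under this assumed definition the metric responds to remaining breaks and spurious variability but not to a uniform shift of the whole series.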
Benchmarking Commercial Conformer Ensemble Generators.
Friedrich, Nils-Ole; de Bruyn Kops, Christina; Flachsenberg, Florian; Sommer, Kai; Rarey, Matthias; Kirchmair, Johannes
2017-11-27
We assess and compare the performance of eight commercial conformer ensemble generators (ConfGen, ConfGenX, cxcalc, iCon, MOE LowModeMD, MOE Stochastic, MOE Conformation Import, and OMEGA) and one leading free algorithm, the distance geometry algorithm implemented in RDKit. The comparative study is based on a new version of the Platinum Diverse Dataset, a high-quality benchmarking dataset of 2859 protein-bound ligand conformations extracted from the PDB. Differences in the performance of commercial algorithms are much smaller than those observed for free algorithms in our previous study (J. Chem. Inf. Model. 2017, 57, 529-539). For commercial algorithms, the median minimum root-mean-square deviations measured between protein-bound ligand conformations and ensembles of a maximum of 250 conformers are between 0.46 and 0.61 Å. Commercial conformer ensemble generators are characterized by their high robustness, with at least 99% of all input molecules successfully processed and few or even no substantial geometrical errors detectable in their output conformations. The RDKit distance geometry algorithm (with minimization enabled) appears to be a good free alternative since its performance is comparable to that of the midranked commercial algorithms. Based on a statistical analysis, we elaborate on which algorithms to use and how to parametrize them for best performance in different application scenarios.
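The "minimum RMSD" accuracy measure mentioned above can be sketched as follows. This is a simplified illustration, not the protocol of the study: it assumes that all conformers share the same atom ordering and have already been superposed onto the reference, whereas a real benchmark would perform an optimal alignment and handle molecular symmetry. The function names and coordinates are hypothetical.

```python
import math


def rmsd(conf_a, conf_b):
    """RMSD between two conformations given as lists of (x, y, z) tuples.

    Assumes identical atom ordering and pre-superposed coordinates.
    """
    if len(conf_a) != len(conf_b) or not conf_a:
        raise ValueError("conformations must be non-empty and equally sized")
    total = sum((p - q) ** 2
                for atom_a, atom_b in zip(conf_a, conf_b)
                for p, q in zip(atom_a, atom_b))
    return math.sqrt(total / len(conf_a))


def min_rmsd(reference, ensemble):
    """Smallest RMSD between the reference and any conformer in the ensemble."""
    return min(rmsd(reference, conformer) for conformer in ensemble)


# Hypothetical two-atom example: the second conformer is closer to the
# reference, so it determines the minimum RMSD of the ensemble.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.2)],
]
best = min_rmsd(reference, ensemble)
print(round(best, 4))
```

The median of this quantity over all ligands in a benchmark set is the kind of summary statistic reported in the abstract.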
Simple mathematical law benchmarks human confrontations.
Johnson, Neil F; Medina, Pablo; Zhao, Guannan; Messinger, Daniel S; Horgan, John; Gill, Paul; Bohorquez, Juan Camilo; Mattson, Whitney; Gangi, Devon; Qi, Hong; Manrique, Pedro; Velasquez, Nicolas; Morgenstern, Ana; Restrepo, Elvira; Johnson, Nicholas; Spagat, Michael; Zarama, Roberto
2013-12-10
Many high-profile societal problems involve an individual or group repeatedly attacking another - from child-parent disputes, sexual violence against women, civil unrest, violent conflicts and acts of terror, to current cyber-attacks on national infrastructure and ultrafast cyber-trades attacking stockholders. There is an urgent need to quantify the likely severity and timing of such future acts, shed light on likely perpetrators, and identify intervention strategies. Here we present a combined analysis of multiple datasets across all these domains which account for >100,000 events, and show that a simple mathematical law can benchmark them all. We derive this benchmark and interpret it, using a minimal mechanistic model grounded by state-of-the-art fieldwork. Our findings provide quantitative predictions concerning future attacks; a tool to help detect common perpetrators and abnormal behaviors; insight into the trajectory of a 'lone wolf'; identification of a critical threshold for spreading a message or idea among perpetrators; an intervention strategy to erode the most lethal clusters; and more broadly, a quantitative starting point for cross-disciplinary theorizing about human aggression at the individual and group level, in both real and online worlds.
Simple mathematical law benchmarks human confrontations
NASA Astrophysics Data System (ADS)
Johnson, Neil F.; Medina, Pablo; Zhao, Guannan; Messinger, Daniel S.; Horgan, John; Gill, Paul; Bohorquez, Juan Camilo; Mattson, Whitney; Gangi, Devon; Qi, Hong; Manrique, Pedro; Velasquez, Nicolas; Morgenstern, Ana; Restrepo, Elvira; Johnson, Nicholas; Spagat, Michael; Zarama, Roberto
2013-12-01
Many high-profile societal problems involve an individual or group repeatedly attacking another - from child-parent disputes, sexual violence against women, civil unrest, violent conflicts and acts of terror, to current cyber-attacks on national infrastructure and ultrafast cyber-trades attacking stockholders. There is an urgent need to quantify the likely severity and timing of such future acts, shed light on likely perpetrators, and identify intervention strategies. Here we present a combined analysis of multiple datasets across all these domains which account for >100,000 events, and show that a simple mathematical law can benchmark them all. We derive this benchmark and interpret it, using a minimal mechanistic model grounded by state-of-the-art fieldwork. Our findings provide quantitative predictions concerning future attacks; a tool to help detect common perpetrators and abnormal behaviors; insight into the trajectory of a 'lone wolf'; identification of a critical threshold for spreading a message or idea among perpetrators; an intervention strategy to erode the most lethal clusters; and more broadly, a quantitative starting point for cross-disciplinary theorizing about human aggression at the individual and group level, in both real and online worlds.
The Filament Sensor for Near Real-Time Detection of Cytoskeletal Fiber Structures
Eltzner, Benjamin; Wollnik, Carina; Gottschlich, Carsten; Huckemann, Stephan; Rehfeldt, Florian
2015-01-01
A reliable extraction of filament data from microscopic images is of high interest in the analysis of acto-myosin structures as early morphological markers in mechanically guided differentiation of human mesenchymal stem cells and the understanding of the underlying fiber arrangement processes. In this paper, we propose the filament sensor (FS), a fast and robust processing sequence which detects and records location, orientation, length, and width for each single filament of an image, and thus allows for the above described analysis. The extraction of these features has previously not been possible with existing methods. We evaluate the performance of the proposed FS in terms of accuracy and speed in comparison to three existing methods with respect to their limited output. Further, we provide a benchmark dataset of real cell images along with filaments manually marked by a human expert as well as simulated benchmark images. The FS clearly outperforms existing methods in terms of computational runtime and filament extraction accuracy. The implementation of the FS and the benchmark database are available as open source. PMID:25996921
NASA Astrophysics Data System (ADS)
Moriarty, Patrick; Sanz Rodrigo, Javier; Gancarski, Pawel; Churchfield, Matthew; Naughton, Jonathan W.; Hansen, Kurt S.; Machefaux, Ewan; Maguire, Eoghan; Castellani, Francesco; Terzi, Ludovico; Breton, Simon-Philippe; Ueda, Yuko
2014-06-01
Researchers within the International Energy Agency (IEA) Task 31: Wakebench have created a framework for the evaluation of wind farm flow models operating at the microscale level. The framework consists of a model evaluation protocol integrated with a web-based portal for model benchmarking (www.windbench.net). This paper provides an overview of the building-block validation approach applied to wind farm wake models, including best practices for the benchmarking and data processing procedures for validation datasets from wind farm SCADA and meteorological databases. A hierarchy of test cases has been proposed for wake model evaluation, from similarity theory of the axisymmetric wake and idealized infinite wind farm, to single-wake wind tunnel (UMN-EPFL) and field experiments (Sexbierum), to wind farm arrays in offshore (Horns Rev, Lillgrund) and complex terrain conditions (San Gregorio). A summary of results from the axisymmetric wake, Sexbierum, Horns Rev and Lillgrund benchmarks is used to discuss the state of the art of wake model validation and highlight the most relevant issues for future development.
Automated benchmarking of peptide-MHC class I binding predictions.
Trolle, Thomas; Metushi, Imir G; Greenbaum, Jason A; Kim, Yohan; Sidney, John; Lund, Ole; Sette, Alessandro; Peters, Bjoern; Nielsen, Morten
2015-07-01
Numerous in silico methods predicting peptide binding to major histocompatibility complex (MHC) class I molecules have been developed over the last decades. However, the multitude of available prediction tools makes it non-trivial for the end-user to select which tool to use for a given task. To provide a solid basis on which to compare different prediction tools, we here describe a framework for the automated benchmarking of peptide-MHC class I binding prediction tools. The framework runs weekly benchmarks on data that are newly entered into the Immune Epitope Database (IEDB), giving the public access to frequent, up-to-date performance evaluations of all participating tools. To overcome potential selection bias in the data included in the IEDB, a strategy was implemented that suggests a set of peptides for which different prediction methods give divergent predictions as to their binding capability. Upon experimental binding validation, these peptides entered the benchmark study. The benchmark has run for 15 weeks and includes evaluation of 44 datasets covering 17 MHC alleles and more than 4000 peptide-MHC binding measurements. Inspection of the results allows the end-user to make educated selections between participating tools. Of the four participating servers, NetMHCpan performed the best, followed by ANN, SMM and finally ARB. Up-to-date performance evaluations of each server can be found online at http://tools.iedb.org/auto_bench/mhci/weekly. All prediction tool developers are invited to participate in the benchmark. Sign-up instructions are available at http://tools.iedb.org/auto_bench/mhci/join. Contact: mniel@cbs.dtu.dk or bpeters@liai.org. Supplementary data are available at Bioinformatics online.
Automated benchmarking of peptide-MHC class I binding predictions
Trolle, Thomas; Metushi, Imir G.; Greenbaum, Jason A.; Kim, Yohan; Sidney, John; Lund, Ole; Sette, Alessandro; Peters, Bjoern; Nielsen, Morten
2015-01-01
Motivation: Numerous in silico methods predicting peptide binding to major histocompatibility complex (MHC) class I molecules have been developed over the last decades. However, the multitude of available prediction tools makes it non-trivial for the end-user to select which tool to use for a given task. To provide a solid basis on which to compare different prediction tools, we here describe a framework for the automated benchmarking of peptide-MHC class I binding prediction tools. The framework runs weekly benchmarks on data that are newly entered into the Immune Epitope Database (IEDB), giving the public access to frequent, up-to-date performance evaluations of all participating tools. To overcome potential selection bias in the data included in the IEDB, a strategy was implemented that suggests a set of peptides for which different prediction methods give divergent predictions as to their binding capability. Upon experimental binding validation, these peptides entered the benchmark study. Results: The benchmark has run for 15 weeks and includes evaluation of 44 datasets covering 17 MHC alleles and more than 4000 peptide-MHC binding measurements. Inspection of the results allows the end-user to make educated selections between participating tools. Of the four participating servers, NetMHCpan performed the best, followed by ANN, SMM and finally ARB. Availability and implementation: Up-to-date performance evaluations of each server can be found online at http://tools.iedb.org/auto_bench/mhci/weekly. All prediction tool developers are invited to participate in the benchmark. Sign-up instructions are available at http://tools.iedb.org/auto_bench/mhci/join. Contact: mniel@cbs.dtu.dk or bpeters@liai.org Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25717196
DeepSig: deep learning improves signal peptide detection in proteins.
Savojardo, Castrense; Martelli, Pier Luigi; Fariselli, Piero; Casadio, Rita
2018-05-15
The identification of signal peptides in protein sequences is an important step toward protein localization and function characterization. Here, we present DeepSig, an improved approach for signal peptide detection and cleavage-site prediction based on deep learning methods. Comparative benchmarks performed on an updated independent dataset of proteins show that DeepSig is the current best performing method, scoring better than other available state-of-the-art approaches on both signal peptide detection and precise cleavage-site identification. DeepSig is available as both a standalone program and a web server at https://deepsig.biocomp.unibo.it. All datasets used in this study can be obtained from the same website. Contact: pierluigi.martelli@unibo.it. Supplementary data are available at Bioinformatics online.
Blind Pose Prediction, Scoring, and Affinity Ranking of the CSAR 2014 Dataset.
Martiny, Virginie Y; Martz, François; Selwa, Edithe; Iorga, Bogdan I
2016-06-27
The 2014 CSAR Benchmark Exercise was focused on three protein targets: coagulation factor Xa, spleen tyrosine kinase, and bacterial tRNA methyltransferase. Our protocol involved a preliminary analysis of the structural information available in the Protein Data Bank for the protein targets, which allowed the identification of the most appropriate docking software and scoring functions to be used for the rescoring of several docking conformations datasets, as well as for pose prediction and affinity ranking. The two key points of this study were (i) the prior evaluation of molecular modeling tools that are most adapted for each target and (ii) the increased search efficiency during the docking process to better explore the conformational space of big and flexible ligands.
Visualized analysis of mixed numeric and categorical data via extended self-organizing map.
Hsu, Chung-Chian; Lin, Shu-Han
2012-01-01
Many real-world datasets are of mixed types, having numeric and categorical attributes. Even though difficult, analyzing mixed-type datasets is important. In this paper, we propose an extended self-organizing map (SOM), called MixSOM, which utilizes a data structure distance hierarchy to facilitate the handling of numeric and categorical values in a direct, unified manner. Moreover, the extended model regularizes the prototype distance between neighboring neurons in proportion to their map distance so that structures of the clusters can be portrayed better on the map. Extensive experiments on several synthetic and real-world datasets are conducted to demonstrate the capability of the model and to compare MixSOM with several existing models including Kohonen's SOM, the generalized SOM and visualization-induced SOM. The results show that MixSOM is superior to the other models in reflecting the structure of the mixed-type data and facilitates further analysis of the data such as exploration at various levels of granularity.
Transductive multi-view zero-shot learning.
Fu, Yanwei; Hospedales, Timothy M; Xiang, Tao; Gong, Shaogang
2015-11-01
Most existing zero-shot learning approaches exploit transfer learning via an intermediate semantic representation shared between an annotated auxiliary dataset and a target dataset with different classes and no annotation. A projection from a low-level feature space to the semantic representation space is learned from the auxiliary dataset and applied without adaptation to the target dataset. In this paper we identify two inherent limitations with these approaches. First, due to having disjoint and potentially unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target dataset/domain. We call this problem the projection domain shift problem and propose a novel framework, transductive multi-view embedding, to solve it. The second limitation is the prototype sparsity problem which refers to the fact that for each target class, only a single prototype is available for zero-shot learning given a semantic representation. To overcome this problem, a novel heterogeneous multi-view hypergraph label propagation method is formulated for zero-shot learning in the transductive embedding space. It effectively exploits the complementary information offered by different semantic representations and takes advantage of the manifold structures of multiple representation spaces in a coherent manner. We demonstrate through extensive experiments that the proposed approach (1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic representations, (3) significantly outperforms existing methods for both zero-shot and N-shot recognition on three image and video benchmark datasets, and (4) enables novel cross-view annotation tasks.
Integrative missing value estimation for microarray data.
Hu, Jianjun; Li, Haifeng; Waterman, Michael S; Zhou, Xianghong Jasmine
2006-10-12
Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in Stanford Microarray Database contain fewer than eight samples. We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference data sets into consideration. To determine whether the given reference data sets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Square (LLS) imputation algorithm, by up to 15% in our benchmark tests. We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.
Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae
Reguly, Teresa; Breitkreutz, Ashton; Boucher, Lorrie; Breitkreutz, Bobby-Joe; Hon, Gary C; Myers, Chad L; Parsons, Ainslie; Friesen, Helena; Oughtred, Rose; Tong, Amy; Stark, Chris; Ho, Yuen; Botstein, David; Andrews, Brenda; Boone, Charles; Troyanskaya, Olga G; Ideker, Trey; Dolinski, Kara; Batada, Nizar N; Tyers, Mike
2006-01-01
Background: The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference. Results: We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID and SGD databases. Conclusion: Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks. PMID:16762047
A Unified Picture of Mass Segregation in Globular Clusters
NASA Astrophysics Data System (ADS)
Watkins, Laura
2017-08-01
The sensitivity, stability and longevity of HST have opened up an exciting new parameter space: we now have velocity measurements, in the form of proper motions (PMs), for stars from the tip of the red giant branch to a few magnitudes below the main-sequence turn off for a large sample of globular clusters (GCs). For the very first time, we have the opportunity to measure both kinematic and spatial dependences on stellar mass in GCs. The formation and evolution histories of GCs are poorly understood, so too are their intermediate-mass black hole populations and binary fractions. However, the current structure and dynamical state of a GC is directly determined by its past history and its components, so by understanding the former we can gain insight into the latter. Quantifying variations in spatial structure for stars of different mass is extremely difficult with photometry alone as datasets are inhomogeneous and incomplete. We require kinematic data for stars that span a range of stellar masses, combined with proper dynamical modelling. We now have the data in hand, but still lack the models needed to maximise the scientific potential of our HST datasets. Here, we propose to extend existing single-mass discrete dynamical-modelling tools to include kinematic and spatial variations with stellar mass, and verify the upgrades using mock data generated from N-body models. We will then apply the models to HST PM data and directly quantify energy equipartition and mass segregation in the GCs. The theoretical phase of the project is vital for the success of the subsequent data analysis, and will serve as a benchmark for future observational campaigns with HST, JWST and beyond.
Possible world based consistency learning model for clustering and classifying uncertain data.
Liu, Han; Zhang, Xianchao; Zhang, Xiaotong
2018-06-01
Possible world has been shown to be effective for handling various types of data uncertainty in uncertain data management. However, few uncertain data clustering and classification algorithms based on possible world have been proposed. Moreover, existing possible world based algorithms suffer from the following issues: (1) they deal with each possible world independently and ignore the consistency principle across different possible worlds; (2) they require an extra post-processing procedure to obtain the final result, so their effectiveness relies heavily on the post-processing method and their efficiency is also limited. In this paper, we propose a novel possible world based consistency learning model for uncertain data, which can be extended both for clustering and classifying uncertain data. This model utilizes the consistency principle to learn a consensus affinity matrix for uncertain data, which can make full use of the information across different possible worlds and thereby improve the clustering and classification performance. Meanwhile, this model imposes a new rank constraint on the Laplacian matrix of the consensus affinity matrix, thereby ensuring that the number of connected components in the consensus affinity matrix is exactly equal to the number of classes. This also means that the clustering and classification results can be directly obtained without any post-processing procedure. Furthermore, we derive efficient optimization methods to solve the proposed model for the clustering and classification tasks, respectively. Experimental results on real benchmark datasets and real world uncertain datasets show that the proposed model outperforms the state-of-the-art uncertain data clustering and classification algorithms in effectiveness and performs competitively in efficiency.
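The link between the Laplacian and connected components that the abstract relies on can be illustrated with a small sketch. It uses the standard graph-theory fact that, for a symmetric non-negative affinity matrix W with Laplacian L = D - W, the indicator vector of each connected component lies in the null space of L, so a graph with c components has a Laplacian of rank n - c. The code below is not the authors' optimization method; the matrix values and function names are hypothetical, and components are found by plain graph traversal.

```python
def laplacian(w):
    """Unnormalized graph Laplacian L = D - W of a symmetric affinity matrix."""
    n = len(w)
    return [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]


def component_labels(w):
    """Label each node with the index of its connected component."""
    n = len(w)
    labels = [-1] * n
    count = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = count
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if w[i][j] > 0 and labels[j] == -1:
                    labels[j] = count
                    stack.append(j)
        count += 1
    return labels


# Hypothetical consensus affinity matrix with two blocks: {0, 1, 2} and {3, 4}.
W = [
    [0.0, 1.0, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.8, 0.0, 0.0],
    [0.5, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
]
labels = component_labels(W)
L = laplacian(W)

# Each component's indicator vector is mapped to (numerically) zero by L.
residuals = []
for c in range(max(labels) + 1):
    indicator = [1.0 if lab == c else 0.0 for lab in labels]
    residuals.extend(sum(L[i][j] * indicator[j] for j in range(len(W)))
                     for i in range(len(W)))

print(labels)
print(max(abs(r) for r in residuals) < 1e-12)
```

When the affinity matrix has exactly as many components as there are classes, the component labels can be read off directly as cluster assignments, which is why no separate post-processing step is needed.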
Higaki, Takumi; Kutsuna, Natsumaro; Hasezawa, Seiichiro
2013-05-16
Intracellular configuration is an important feature of cell status. Recent advances in microscopic imaging techniques allow us to easily obtain a large number of microscopic images of intracellular structures. In this circumstance, automated microscopic image recognition techniques are of extreme importance to future phenomics/visible screening approaches. However, there was no benchmark microscopic image dataset for intracellular organelles in a specified plant cell type. We previously established the Live Images of Plant Stomata (LIPS) database, a publicly available collection of optical-section images of various intracellular structures of plant guard cells, as a model system of environmental signal perception and transduction. Here we report recent updates to the LIPS database and the establishment of a database table, LIPService. We updated the LIPS dataset and established a new interface named LIPService to promote efficient inspection of intracellular structure configurations. Cell nuclei, microtubules, actin microfilaments, mitochondria, chloroplasts, endoplasmic reticulum, peroxisomes, endosomes, Golgi bodies, and vacuoles can be filtered using probe names or morphometric parameters such as stomatal aperture. In addition to the serial optical sectional images of the original LIPS database, new volume-rendering data for easy web browsing of three-dimensional intracellular structures have been released to allow easy inspection of their configurations or relationships with cell status/morphology. We also demonstrated the utility of the new LIPS image database for automated organelle recognition of images from another plant cell image database with image clustering analyses. The updated LIPS database provides a benchmark image dataset for representative intracellular structures in Arabidopsis guard cells. The newly released LIPService allows users to inspect the relationship between organellar three-dimensional configurations and morphometrical parameters.
Benchmarking the mesoscale variability in global ocean eddy-permitting numerical systems
NASA Astrophysics Data System (ADS)
Cipollone, Andrea; Masina, Simona; Storto, Andrea; Iovino, Doroteaciro
2017-10-01
The role of data assimilation procedures in representing ocean mesoscale variability is assessed by applying eddy statistics to a state-of-the-art global ocean reanalysis (C-GLORS), a free global ocean simulation (performed with the NEMO system) and an observation-based dataset (ARMOR3D) used as an independent benchmark. Numerical results are computed on a 1/4° horizontal grid (ORCA025) and share the same resolution as the ARMOR3D dataset. This "eddy-permitting" resolution is sufficient to allow ocean eddies to form. In addition to assessing the eddy statistics from three different datasets, a global three-dimensional eddy detection system is implemented in order to bypass the need for region-dependent definitions of thresholds, typical of commonly adopted eddy detection algorithms. It thus provides full three-dimensional eddy statistics, segmenting vertical profiles from local rotational velocities. This criterion is crucial for discerning real eddies from transient surface noise that inevitably affects any two-dimensional algorithm. Data assimilation enhances and corrects mesoscale variability on a wide range of features that cannot be well reproduced otherwise. The free simulation fairly reproduces eddies emerging from western boundary currents and deep baroclinic instabilities, while it underestimates shallower vortices that populate the full basin. The ocean reanalysis recovers most of the missing turbulence, shown by satellite products, that is not generated by the model itself and consistently projects surface variability deep into the water column. The comparison with the statistically reconstructed vertical profiles from ARMOR3D shows that ocean data assimilation is able to embed variability into the model dynamics, constraining eddies with in situ and altimetry observations and generating them consistently with the local environment.
Technical note: RabbitCT--an open platform for benchmarking 3D cone-beam reconstruction algorithms.
Rohkohl, C; Keck, B; Hofmann, H G; Hornegger, J
2009-09-01
Fast 3D cone beam reconstruction is mandatory for many clinical workflows. For that reason, researchers and industry work hard on hardware-optimized 3D reconstruction. Backprojection is a major component of many reconstruction algorithms that require a projection of each voxel onto the projection data, including data interpolation, before updating the voxel value. This step is the bottleneck of most reconstruction algorithms and the focus of optimization in recent publications. A crucial limitation, however, of these publications is that the presented results are not comparable to each other. This is mainly due to variations in data acquisitions, preprocessing, and chosen geometries and the lack of a common publicly available test dataset. The authors provide such a standardized dataset that allows for substantial comparison of hardware accelerated backprojection methods. They developed an open platform RabbitCT (www.rabbitCT.com) for worldwide comparison in backprojection performance and ranking on different architectures using a specific high resolution C-arm CT dataset of a rabbit. This includes a sophisticated benchmark interface, a prototype implementation in C++, and image quality measures. At the time of writing, six backprojection implementations are already listed on the website. Optimizations include multithreading using Intel threading building blocks and OpenMP, vectorization using SSE, and computation on the GPU using CUDA 2.0. There is a need for objectively comparing backprojection implementations for reconstruction algorithms. RabbitCT aims to provide a solution to this problem by offering an open platform with fair chances for all participants. The authors are looking forward to a growing community and await feedback regarding future evaluations of novel software- and hardware-based acceleration schemes.
Benchmark dose and the three Rs. Part I. Getting more information from the same number of animals.
Slob, Wout
2014-08-01
Evaluating dose-response data using the Benchmark dose (BMD) approach rather than by the no observed adverse effect (NOAEL) approach implies a considerable step forward from the perspective of the Reduction, Replacement, and Refinement, three Rs, in particular the R of reduction: more information is obtained from the same number of animals, or, vice versa, similar information may be obtained from fewer animals. The first part of this twin paper focusses on the former, the second on the latter aspect. Regarding the former, the BMD approach provides more information from any given dose-response dataset in various ways. First, the BMDL (= BMD lower confidence bound) provides more information by its more explicit definition. Further, as compared to the NOAEL approach the BMD approach results in more statistical precision in the value of the point of departure (PoD), for deriving exposure limits. While part of the animals in the study do not directly contribute to the numerical value of a NOAEL, all animals are effectively used and do contribute to a BMDL. In addition, the BMD approach allows for combining similar datasets for the same chemical (e.g., both sexes) in a single analysis, which further increases precision. By combining a dose-response dataset with similar historical data for other chemicals, the precision can even be substantially increased. Further, the BMD approach results in more precise estimates for relative potency factors (RPFs, or TEFs). And finally, the BMD approach is not only more precise, it also allows for quantification of the precision in the BMD estimate, which is not possible in the NOAEL approach.
3D shape representation with spatial probabilistic distribution of intrinsic shape keypoints
NASA Astrophysics Data System (ADS)
Ghorpade, Vijaya K.; Checchin, Paul; Malaterre, Laurent; Trassoudaine, Laurent
2017-12-01
The accelerated advancement in modeling, digitizing, and visualizing techniques for 3D shapes has led to increasing creation and use of 3D models, thanks to 3D sensors that are readily available and easy to use. As a result, determining the similarity between 3D shapes has become consequential and is a fundamental task in shape-based recognition, retrieval, clustering, and classification. Several decades of research in Content-Based Information Retrieval (CBIR) have resulted in diverse techniques for 2D and 3D shape or object classification/retrieval and many benchmark data sets. In this article, a novel technique for 3D shape representation and object classification is proposed based on analyses of spatial, geometric distributions of 3D keypoints. These distributions capture the intrinsic geometric structure of 3D objects. The result of the approach is a probability distribution function (PDF) produced from the spatial disposition of 3D keypoints, which are stable on the object surface and invariant to pose changes. Each class/instance of an object can be uniquely represented by a PDF. This shape representation is robust yet based on a simple idea, and it is easy to implement and fast to compute. Both Euclidean and topological space on the object's surface are considered to build the PDFs. Topology-based geodesic distances between keypoints exploit the non-planar surface properties of the object. The performance of the novel shape signature is tested with object classification accuracy. The classification efficacy of the new shape analysis method is evaluated on a new dataset acquired with a Time-of-Flight camera, and a comparative evaluation against state-of-the-art methods on a standard benchmark dataset is also performed. Experimental results demonstrate superior classification performance of the new approach on RGB-D dataset and depth data.
NASA Astrophysics Data System (ADS)
Christensen, Anders S.; Kromann, Jimmy C.; Jensen, Jan H.; Cui, Qiang
2017-10-01
To facilitate further development of approximate quantum mechanical methods for condensed phase applications, we present a new benchmark dataset of intermolecular interaction energies in the solution phase for a set of 15 dimers, each containing one charged monomer. The reference interaction energy in solution is computed via a thermodynamic cycle that integrates dimer binding energy in the gas phase at the coupled cluster level and solute-solvent interaction with density functional theory; the estimated uncertainty of such calculated interaction energy is ±1.5 kcal/mol. The dataset is used to benchmark the performance of a set of semi-empirical quantum mechanical (SQM) methods that include DFTB3-D3, DFTB3/CPE-D3, OM2-D3, PM6-D3, PM6-D3H+, and PM7 as well as the HF-3c method. We find that while all tested SQM methods tend to underestimate binding energies in the gas phase with a root-mean-squared error (RMSE) of 2-5 kcal/mol, they overestimate binding energies in the solution phase with an RMSE of 3-4 kcal/mol, with the exception of DFTB3/CPE-D3 and OM2-D3, for which the systematic deviation is less pronounced. In addition, we find that HF-3c systematically overestimates binding energies in both gas and solution phases. As most approximate QM methods are parametrized and evaluated using data measured or calculated in the gas phase, the dataset represents an important first step toward calibrating QM based methods for application in the condensed phase where polarization and exchange repulsion need to be treated in a balanced fashion.
SisFall: A Fall and Movement Dataset
Sucerquia, Angela; López, José David; Vargas-Bonilla, Jesús Francisco
2017-01-01
Research on fall and movement detection with wearable devices has witnessed promising growth. However, there are few publicly available datasets, all recorded with smartphones, which are insufficient for testing new proposals due to their absence of objective population, lack of performed activities, and limited information. Here, we present a dataset of falls and activities of daily living (ADLs) acquired with a self-developed device composed of two types of accelerometer and one gyroscope. It consists of 19 ADLs and 15 fall types performed by 23 young adults, 15 ADL types performed by 14 healthy and independent participants over 62 years old, and data from one participant of 60 years old that performed all ADLs and falls. These activities were selected based on a survey and a literature analysis. We test the dataset with widely used feature extraction and a simple-to-implement threshold-based classification, achieving up to 96% accuracy in fall detection. An individual activity analysis demonstrates that most errors are concentrated in a small number of activities, where new approaches could be focused. Finally, validation tests with elderly people significantly reduced the fall detection performance of the tested features. This validates findings of other authors and encourages developing new strategies with this new dataset as the benchmark. PMID:28117691
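The threshold-based classification mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' feature set or classifier: the feature (peak acceleration magnitude), the function names, and the 2.5 g threshold are all assumptions chosen for illustration, and the sample data are invented.

```python
import math


def acceleration_magnitudes(samples):
    """Return the Euclidean norm of each (ax, ay, az) sample, in g."""
    return [math.sqrt(ax * ax + ay * ay + az * az) for ax, ay, az in samples]


def detect_fall(samples, threshold_g=2.5):
    """Flag a fall when the peak magnitude exceeds the threshold.

    threshold_g is a hypothetical value, not one taken from the SisFall paper.
    An empty recording is never flagged.
    """
    peak = max(acceleration_magnitudes(samples), default=0.0)
    return peak > threshold_g


# Invented tri-axial accelerometer samples (in g), for illustration only.
walking = [(0.0, 0.0, 1.0), (0.1, 0.2, 1.1), (0.0, -0.1, 0.9)]
impact = [(0.0, 0.0, 1.0), (1.8, 2.0, 2.2), (0.0, 0.1, 1.0)]

print(detect_fall(walking))
print(detect_fall(impact))
```

A single fixed threshold is the simplest possible design; in practice a threshold would be tuned on labelled recordings, which is the kind of evaluation a benchmark dataset makes possible.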
SisFall: A Fall and Movement Dataset.
Sucerquia, Angela; López, José David; Vargas-Bonilla, Jesús Francisco
2017-01-20
Research on fall and movement detection with wearable devices has witnessed promising growth. However, there are few publicly available datasets, all recorded with smartphones, which are insufficient for testing new proposals due to their absence of objective population, lack of performed activities, and limited information. Here, we present a dataset of falls and activities of daily living (ADLs) acquired with a self-developed device composed of two types of accelerometer and one gyroscope. It consists of 19 ADLs and 15 fall types performed by 23 young adults, 15 ADL types performed by 14 healthy and independent participants over 62 years old, and data from one participant of 60 years old that performed all ADLs and falls. These activities were selected based on a survey and a literature analysis. We test the dataset with widely used feature extraction and a simple-to-implement threshold-based classification, achieving up to 96% accuracy in fall detection. An individual activity analysis demonstrates that most errors are concentrated in a small number of activities, where new approaches could be focused. Finally, validation tests with elderly people significantly reduced the fall detection performance of the tested features. This validates findings of other authors and encourages developing new strategies with this new dataset as the benchmark.
ISRUC-Sleep: A comprehensive public dataset for sleep researchers.
Khalighi, Sirvan; Sousa, Teresa; Santos, José Moutinho; Nunes, Urbano
2016-02-01
To facilitate the performance comparison of new methods for sleep pattern analysis, publicly available datasets with quality content are very important and useful. We introduce an open-access comprehensive sleep dataset, called ISRUC-Sleep. The data were obtained from human adults, including healthy subjects, subjects with sleep disorders, and subjects under the effect of sleep medication. Each recording was randomly selected from among the polysomnography (PSG) recordings that were acquired by the Sleep Medicine Centre of the Hospital of Coimbra University (CHUC). The dataset comprises three groups of data: (1) data concerning 100 subjects, with one recording session per subject; (2) data gathered from 8 subjects, with two recording sessions per subject; and (3) data collected from one recording session related to 10 healthy subjects. The PSG recordings associated with each subject were visually scored by two human experts. Compared with the existing sleep-related public datasets, ISRUC-Sleep provides data from a reasonable number of subjects with different characteristics, such as data useful for studies involving changes in the PSG signals over time, and data from healthy subjects useful for studies comparing healthy subjects with patients suffering from sleep disorders. This dataset was created to complement existing datasets by providing easy-to-apply data collection with some characteristics not yet covered. ISRUC-Sleep can be useful for the analysis of new contributions: (i) in biomedical signal processing; (ii) in the development of automatic sleep stage classification (ASSC) methods; and (iii) in sleep physiology studies. To evaluate and compare new contributions that use this dataset as a benchmark, results of applying a subject-independent ASSC method to the ISRUC-Sleep dataset are presented. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Mean velocity and turbulence measurements in a 90 deg curved duct with thin inlet boundary layer
NASA Technical Reports Server (NTRS)
Crawford, R. A.; Peters, C. E.; Steinhoff, J.; Hornkohl, J. O.; Nourinejad, J.; Ramachandran, K.
1985-01-01
The experimental database established by this investigation of the flow in a large rectangular turning duct is of benchmark quality. The experimental Reynolds numbers, Dean numbers and boundary layer characteristics are significantly different from previous benchmark curved-duct experimental parameters. This investigation extends the experimental database to higher Reynolds number and thinner entrance boundary layers. The 5% to 10% thick boundary layers, based on duct half-width, result in a large region of near-potential flow in the duct core surrounded by developing boundary layers with large crossflows. The turbulent entrance boundary layer case at R sub ed = 328,000 provides an incompressible flowfield which approaches real turbine blade cascade characteristics. The results of this investigation provide a challenging benchmark database for computational fluid dynamics code development.
Scene text detection by leveraging multi-channel information and local context
NASA Astrophysics Data System (ADS)
Wang, Runmin; Qian, Shengyou; Yang, Jianfeng; Gao, Changxin
2018-03-01
As an important information carrier, text plays a significant role in many applications. However, text detection in unconstrained scenes is a challenging problem due to cluttered backgrounds, varied appearances, uneven illumination, etc. In this paper, an approach based on multi-channel information and local context is proposed to detect text in natural scenes. Because character candidate detection plays a vital role in a text detection system, Maximally Stable Extremal Regions (MSERs) and a graph-cut based method are integrated to obtain the character candidates by leveraging the multi-channel image information. A cascaded false positive elimination mechanism is constructed from the perspective of the character and the text line, respectively. Since local context information is very valuable, it is used to retrieve missing characters and boost text detection performance. Experimental results on two benchmark datasets, i.e., the ICDAR 2011 dataset and the ICDAR 2013 dataset, demonstrate that the proposed method achieves state-of-the-art performance.
Performance Studies on Distributed Virtual Screening
Krüger, Jens; de la Garza, Luis; Kohlbacher, Oliver; Nagel, Wolfgang E.
2014-01-01
Virtual high-throughput screening (vHTS) is an invaluable method in modern drug discovery. It permits screening large datasets or databases of chemical structures for those structures that may bind to a drug target. Virtual screening is typically performed by docking code, which often runs sequentially. Processing of huge vHTS datasets can be parallelized by chunking the data because individual docking runs are independent of each other. The goal of this work is to find an optimal splitting that maximizes the speedup while considering overhead and available cores on Distributed Computing Infrastructures (DCIs). We have conducted thorough performance studies accounting not only for the runtime of the docking itself, but also for structure preparation. Performance studies were conducted via the workflow-enabled science gateway MoSGrid (Molecular Simulation Grid). As input we used benchmark datasets for protein kinases. Our performance studies show that docking workflows can be made to scale almost linearly up to 500 concurrent processes distributed even over large DCIs, thus accelerating vHTS campaigns significantly. PMID:25032219
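The chunking idea described in the abstract above, splitting a ligand library into independent pieces and weighing speedup against overhead, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the MoSGrid workflow: the function names are hypothetical, every docking run is assumed to take the same time, and each chunk is assumed to pay one fixed overhead (for example, structure preparation or job submission).

```python
def split_into_chunks(items, n_chunks):
    """Split items into contiguous chunks whose sizes differ by at most one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Never create more chunks than there are items (but keep at least one).
    n_chunks = min(n_chunks, len(items)) or 1
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def estimated_speedup(n_items, n_chunks, t_dock, t_overhead):
    """Toy speedup model: serial time over the time of the slowest chunk.

    Assumes one worker per chunk, identical docking time t_dock per item,
    and a fixed overhead t_overhead paid once per chunk.
    """
    serial_time = n_items * t_dock
    largest_chunk = -(-n_items // n_chunks)  # ceiling division
    return serial_time / (largest_chunk * t_dock + t_overhead)


ligands = list(range(10))  # stand-in for ligand identifiers
print(split_into_chunks(ligands, 3))
print(estimated_speedup(1000, 10, t_dock=1.0, t_overhead=100.0))
```

In this model the overhead term is what keeps the speedup below the number of chunks, which is why the choice of chunk size matters for near-linear scaling.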
The generalization ability of SVM classification based on Markov sampling.
Xu, Jie; Tang, Yuan Yan; Zou, Bin; Xu, Zongben; Li, Luoqing; Lu, Yang; Zhang, Baochang
2015-06-01
The previously known works studying the generalization ability of the support vector machine classification (SVMC) algorithm are usually based on the assumption of independent and identically distributed samples. In this paper, we go far beyond this classical framework by studying the generalization ability of SVMC based on uniformly ergodic Markov chain (u.e.M.c.) samples. We analyze the excess misclassification error of SVMC based on u.e.M.c. samples, and obtain the optimal learning rate of SVMC for u.e.M.c. samples. We also introduce a new Markov sampling algorithm for SVMC to generate u.e.M.c. samples from a given dataset, and present numerical studies on the learning performance of SVMC based on Markov sampling for benchmark datasets. The numerical studies show not only that SVMC based on Markov sampling has better generalization ability when the number of training samples is larger, but also that the classifiers based on Markov sampling are sparse when the size of the dataset is large relative to the input dimension.
Toward a Data Scalable Solution for Facilitating Discovery of Science Resources
DOE Office of Scientific and Technical Information (OSTI.GOV)
Weaver, Jesse R.; Castellana, Vito G.; Morari, Alessandro
Science is increasingly motivated by the need to process larger quantities of data. It is facing severe challenges in data collection, management, and processing, so much so that the computational demands of “data scaling” are competing with, and in many fields surpassing, the traditional objective of decreasing processing time. Example domains with large datasets include astronomy, biology, genomics, climate/weather, and material sciences. This paper presents a real-world use case in which we wish to answer queries provided by domain scientists in order to facilitate discovery of relevant science resources. The problem is that the metadata for these science resources is very large and is growing quickly, rapidly increasing the need for a data scaling solution. We propose a system – SGEM – designed for answering graph-based queries over large datasets on cluster architectures, and we report performance results for queries on the current RDESC dataset of nearly 1.4 billion triples, and on the well-known BSBM SPARQL query benchmark.
Single-shot diffraction data from the Mimivirus particle using an X-ray free-electron laser.
Ekeberg, Tomas; Svenda, Martin; Seibert, M Marvin; Abergel, Chantal; Maia, Filipe R N C; Seltzer, Virginie; DePonte, Daniel P; Aquila, Andrew; Andreasson, Jakob; Iwan, Bianca; Jönsson, Olof; Westphal, Daniel; Odić, Duško; Andersson, Inger; Barty, Anton; Liang, Meng; Martin, Andrew V; Gumprecht, Lars; Fleckenstein, Holger; Bajt, Saša; Barthelmess, Miriam; Coppola, Nicola; Claverie, Jean-Michel; Loh, N Duane; Bostedt, Christoph; Bozek, John D; Krzywinski, Jacek; Messerschmidt, Marc; Bogan, Michael J; Hampton, Christina Y; Sierra, Raymond G; Frank, Matthias; Shoeman, Robert L; Lomb, Lukas; Foucar, Lutz; Epp, Sascha W; Rolles, Daniel; Rudenko, Artem; Hartmann, Robert; Hartmann, Andreas; Kimmel, Nils; Holl, Peter; Weidenspointner, Georg; Rudek, Benedikt; Erk, Benjamin; Kassemeyer, Stephan; Schlichting, Ilme; Strüder, Lothar; Ullrich, Joachim; Schmidt, Carlo; Krasniqi, Faton; Hauser, Günter; Reich, Christian; Soltau, Heike; Schorb, Sebastian; Hirsemann, Helmut; Wunderer, Cornelia; Graafsma, Heinz; Chapman, Henry; Hajdu, Janos
2016-08-01
Free-electron lasers (FEL) hold the potential to revolutionize structural biology by producing X-ray pulses short enough to outrun radiation damage, thus allowing imaging of biological samples without the limitation from radiation damage. Thus, a major part of the scientific case for the first FELs was three-dimensional (3D) reconstruction of non-crystalline biological objects. In a recent publication we demonstrated the first 3D reconstruction of a biological object from an X-ray FEL using this technique. The sample was the giant Mimivirus, which is one of the largest known viruses with a diameter of 450 nm. Here we present the dataset used for this successful reconstruction. Data-analysis methods for single-particle imaging at FELs are undergoing heavy development but data collection relies on very limited time available through a highly competitive proposal process. This dataset provides experimental data to the entire community and could boost algorithm development and provide a benchmark dataset for new algorithms.
A multimodal dataset for various forms of distracted driving
Taamneh, Salah; Tsiamyrtzis, Panagiotis; Dcosta, Malcolm; Buddharaju, Pradeep; Khatri, Ashik; Manser, Michael; Ferris, Thomas; Wunderlich, Robert; Pavlidis, Ioannis
2017-01-01
We describe a multimodal dataset acquired in a controlled experiment on a driving simulator. The set includes data for n=68 volunteers that drove the same highway under four different conditions: No distraction, cognitive distraction, emotional distraction, and sensorimotor distraction. The experiment closed with a special driving session, where all subjects experienced a startle stimulus in the form of unintended acceleration—half of them under a mixed distraction, and the other half in the absence of a distraction. During the experimental drives key response variables and several explanatory variables were continuously recorded. The response variables included speed, acceleration, brake force, steering, and lane position signals, while the explanatory variables included perinasal electrodermal activity (EDA), palm EDA, heart rate, breathing rate, and facial expression signals; biographical and psychometric covariates as well as eye tracking data were also obtained. This dataset enables research into driving behaviors under neatly abstracted distracting stressors, which account for many car crashes. The set can also be used in physiological channel benchmarking and multispectral face recognition. PMID:28809848
A dataset mapping the potential biophysical effects of vegetation cover change
NASA Astrophysics Data System (ADS)
Duveiller, Gregory; Hooker, Josh; Cescatti, Alessandro
2018-02-01
Changing the vegetation cover of the Earth has impacts on the biophysical properties of the surface and ultimately on the local climate. Depending on the specific type of vegetation change and on the background climate, the resulting competing biophysical processes can have a net warming or cooling effect, which can further vary both spatially and seasonally. Due to uncertain climate impacts and the lack of robust observations, biophysical effects are not yet considered in land-based climate policies. Here we present a dataset based on satellite remote sensing observations that provides the potential changes i) of the full surface energy balance, ii) at global scale, and iii) for multiple vegetation transitions, as would now be required for the comprehensive evaluation of land based mitigation plans. We anticipate that this dataset will provide valuable information to benchmark Earth system models, to assess future scenarios of land cover change and to develop the monitoring, reporting and verification guidelines required for the implementation of mitigation plans that account for biophysical land processes.
A dataset mapping the potential biophysical effects of vegetation cover change
Duveiller, Gregory; Hooker, Josh; Cescatti, Alessandro
2018-01-01
Changing the vegetation cover of the Earth has impacts on the biophysical properties of the surface and ultimately on the local climate. Depending on the specific type of vegetation change and on the background climate, the resulting competing biophysical processes can have a net warming or cooling effect, which can further vary both spatially and seasonally. Due to uncertain climate impacts and the lack of robust observations, biophysical effects are not yet considered in land-based climate policies. Here we present a dataset based on satellite remote sensing observations that provides the potential changes i) of the full surface energy balance, ii) at global scale, and iii) for multiple vegetation transitions, as would now be required for the comprehensive evaluation of land based mitigation plans. We anticipate that this dataset will provide valuable information to benchmark Earth system models, to assess future scenarios of land cover change and to develop the monitoring, reporting and verification guidelines required for the implementation of mitigation plans that account for biophysical land processes. PMID:29461538
Single-shot diffraction data from the Mimivirus particle using an X-ray free-electron laser
NASA Astrophysics Data System (ADS)
Ekeberg, Tomas; Svenda, Martin; Seibert, M. Marvin; Abergel, Chantal; Maia, Filipe R. N. C.; Seltzer, Virginie; Deponte, Daniel P.; Aquila, Andrew; Andreasson, Jakob; Iwan, Bianca; Jönsson, Olof; Westphal, Daniel; Odić, Duško; Andersson, Inger; Barty, Anton; Liang, Meng; Martin, Andrew V.; Gumprecht, Lars; Fleckenstein, Holger; Bajt, Saša; Barthelmess, Miriam; Coppola, Nicola; Claverie, Jean-Michel; Loh, N. Duane; Bostedt, Christoph; Bozek, John D.; Krzywinski, Jacek; Messerschmidt, Marc; Bogan, Michael J.; Hampton, Christina Y.; Sierra, Raymond G.; Frank, Matthias; Shoeman, Robert L.; Lomb, Lukas; Foucar, Lutz; Epp, Sascha W.; Rolles, Daniel; Rudenko, Artem; Hartmann, Robert; Hartmann, Andreas; Kimmel, Nils; Holl, Peter; Weidenspointner, Georg; Rudek, Benedikt; Erk, Benjamin; Kassemeyer, Stephan; Schlichting, Ilme; Strüder, Lothar; Ullrich, Joachim; Schmidt, Carlo; Krasniqi, Faton; Hauser, Günter; Reich, Christian; Soltau, Heike; Schorb, Sebastian; Hirsemann, Helmut; Wunderer, Cornelia; Graafsma, Heinz; Chapman, Henry; Hajdu, Janos
2016-08-01
Free-electron lasers (FEL) hold the potential to revolutionize structural biology by producing X-ray pulses short enough to outrun radiation damage, thus allowing imaging of biological samples without the limitation from radiation damage. Thus, a major part of the scientific case for the first FELs was three-dimensional (3D) reconstruction of non-crystalline biological objects. In a recent publication we demonstrated the first 3D reconstruction of a biological object from an X-ray FEL using this technique. The sample was the giant Mimivirus, which is one of the largest known viruses with a diameter of 450 nm. Here we present the dataset used for this successful reconstruction. Data-analysis methods for single-particle imaging at FELs are undergoing heavy development but data collection relies on very limited time available through a highly competitive proposal process. This dataset provides experimental data to the entire community and could boost algorithm development and provide a benchmark dataset for new algorithms.
Diffusion-based recommendation with trust relations on tripartite graphs
NASA Astrophysics Data System (ADS)
Wang, Ximeng; Liu, Yun; Zhang, Guangquan; Xiong, Fei; Lu, Jie
2017-08-01
The diffusion-based recommendation approach is a vital branch in recommender systems, which successfully applies physical dynamics to make recommendations for users on bipartite or tripartite graphs. Trust links indicate users’ social relations and can provide the benefit of reducing data sparsity. However, traditional diffusion-based algorithms only consider rating links when making recommendations. In this paper, the complementarity of users’ implicit and explicit trust is exploited, and a novel resource-allocation strategy is proposed, which integrates these two kinds of trust relations on tripartite graphs. Through empirical studies on three benchmark datasets, our proposed method obtains better performance than most of the benchmark algorithms in terms of accuracy, diversity and novelty. According to the experimental results, our method is an effective and reasonable way to integrate additional features into the diffusion-based recommendation approach.
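For readers unfamiliar with diffusion-based recommendation, the following Python sketch shows the plain two-step mass-diffusion (resource-allocation) scheme on a bipartite user-item graph, which is the kind of rating-link-only baseline the abstract above contrasts against. It does not include trust links or the tripartite extension the authors propose; the function name and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def mass_diffusion_scores(user_items, target):
    """Score uncollected items for `target` by two-step mass diffusion.

    user_items maps each user to the set of items that user has collected.
    Each item collected by the target starts with one unit of resource.
    """
    # Invert the graph: which users collected each item.
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    # Step 1: every seed item splits its unit resource equally among its users.
    user_resource = defaultdict(float)
    for item in user_items[target]:
        share = 1.0 / len(item_users[item])
        for user in item_users[item]:
            user_resource[user] += share

    # Step 2: every user splits the received resource equally among their items.
    item_score = defaultdict(float)
    for user, resource in user_resource.items():
        share = resource / len(user_items[user])
        for item in user_items[user]:
            item_score[item] += share

    # Only items the target has not collected are recommendation candidates.
    return {i: s for i, s in item_score.items() if i not in user_items[target]}


# Invented toy data: three users and four items.
user_items = {
    "alice": {"a", "b"},
    "bob": {"a", "c"},
    "carol": {"b", "c", "d"},
}
scores = mass_diffusion_scores(user_items, "alice")
for item in sorted(scores, key=scores.get, reverse=True):
    print(item, round(scores[item], 4))
```

Total resource is conserved across both steps, so the scores can be read as the fraction of the target's initial resource that ends up on each candidate item.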
Assessing Ecosystem Model Performance in Semiarid Systems
NASA Astrophysics Data System (ADS)
Thomas, A.; Dietze, M.; Scott, R. L.; Biederman, J. A.
2017-12-01
In ecosystem process modelling, comparing outputs to benchmark datasets observed in the field is an important way to validate models, allowing the modelling community to track model performance over time and compare models at specific sites. Multi-model comparison projects as well as models themselves have largely been focused on temperate forests and similar biomes. Semiarid regions, on the other hand, are underrepresented in land surface and ecosystem modelling efforts, and yet will be disproportionately impacted by disturbances such as climate change due to their sensitivity to changes in the water balance. Benchmarking models at semiarid sites is an important step in assessing and improving models' suitability for predicting the impact of disturbance on semiarid ecosystems. In this study, several ecosystem models were compared at a semiarid grassland in southwestern Arizona using PEcAn, or the Predictive Ecosystem Analyzer, an open-source eco-informatics toolbox ideal for creating the repeatable model workflows necessary for benchmarking. Models included SIPNET, DALEC, JULES, ED2, GDAY, LPJ-GUESS, MAESPA, CLM, CABLE, and FATES. Comparison between model output and benchmarks such as net ecosystem exchange (NEE) tended to produce high root mean square error and low correlation coefficients, reflecting poor simulation of seasonality and the tendency for models to create much higher carbon sources than observed. These results indicate that ecosystem models do not currently adequately represent semiarid ecosystem processes.
New public dataset for spotting patterns in medieval document images
NASA Astrophysics Data System (ADS)
En, Sovann; Nicolas, Stéphane; Petitjean, Caroline; Jurie, Frédéric; Heutte, Laurent
2017-01-01
With advances in technology, a large part of our cultural heritage is becoming digitally available. In particular, in the field of historical document image analysis, there is now a growing need for indexing and data mining tools, thus allowing us to spot and retrieve the occurrences of an object of interest, called a pattern, in a large database of document images. Patterns may present some variability in terms of color, shape, or context, making the spotting of patterns a challenging task. Pattern spotting is a relatively new field of research, still hampered by the lack of available annotated resources. We present a new publicly available dataset named DocExplore dedicated to spotting patterns in historical document images. The dataset contains 1500 images and 1464 queries, and allows the evaluation of two tasks: image retrieval and pattern localization. A standardized benchmark protocol along with ad hoc metrics is provided for a fair comparison of the submitted approaches. We also provide some first results obtained with our baseline system on this new dataset, which show that there is room for improvement and that should encourage researchers of the document image analysis community to design new systems and submit improved results.
Sorting protein decoys by machine-learning-to-rank
Jing, Xiaoyang; Wang, Kai; Lu, Ruqian; Dong, Qiwen
2016-01-01
Much progress has been made in protein structure prediction during the last few decades. As the predicted models can span a broad range of accuracy, the accuracy of quality estimation becomes one of the key elements of successful protein structure prediction. Over the past years, a number of methods have been developed to address this issue, and these methods can be roughly divided into three categories: single-model methods, clustering-based methods and quasi single-model methods. In this study, we first develop a single-model method, MQAPRank, based on the learning-to-rank algorithm, and then implement a quasi single-model method, Quasi-MQAPRank. The proposed methods are benchmarked on the 3DRobot and CASP11 datasets. The five-fold cross-validation on the 3DRobot dataset shows that the proposed single-model method outperforms the other methods whose outputs are taken as its features, and that the quasi single-model method can further enhance the performance. On the CASP11 dataset, the proposed methods also perform well compared with other leading methods in the corresponding categories. In particular, the Quasi-MQAPRank method achieves considerable performance on the CASP11 Best150 dataset. PMID:27530967
Sorting protein decoys by machine-learning-to-rank.
Jing, Xiaoyang; Wang, Kai; Lu, Ruqian; Dong, Qiwen
2016-08-17
Much progress has been made in protein structure prediction during the last few decades. Because the predicted models can span a broad range of accuracy, the accuracy of quality estimation becomes one of the key elements of successful protein structure prediction. Over the past years, a number of methods have been developed to address this issue, and these methods can be roughly divided into three categories: single-model methods, clustering-based methods and quasi single-model methods. In this study, we first develop a single-model method, MQAPRank, based on the learning-to-rank algorithm, and then implement a quasi single-model method, Quasi-MQAPRank. The proposed methods are benchmarked on the 3DRobot and CASP11 datasets. Five-fold cross-validation on the 3DRobot dataset shows that the proposed single-model method outperforms the other methods whose outputs are taken as its features, and that the quasi single-model method can further enhance the performance. On the CASP11 dataset, the proposed methods also perform well compared with other leading methods in the corresponding categories. In particular, the Quasi-MQAPRank method achieves considerable performance on the CASP11 Best150 dataset.
Combined evaluation of optical and microwave satellite dataset for soil moisture deficit estimation
NASA Astrophysics Data System (ADS)
Srivastava, Prashant K.; Han, Dawei; Islam, Tanvir; Singh, Sudhir Kumar; Gupta, Manika; Gupta, Dileep Kumar; Kumar, Pradeep
2016-04-01
Soil moisture is a key variable responsible for water and energy exchanges from the land surface to the atmosphere (Srivastava et al., 2014). Soil Moisture Deficit (SMD), in turn, can help regulate the proper use of water at specified times to avoid agricultural losses (Srivastava et al., 2013b) and could help in preventing natural disasters, e.g. flood and drought (Srivastava et al., 2013a). In this study, Moderate Resolution Imaging Spectroradiometer (MODIS) Land Surface Temperature (LST) and soil moisture from the Soil Moisture and Ocean Salinity (SMOS) satellite are evaluated for prediction of SMD. An Adaptive Neuro Fuzzy Inference System (ANFIS) is used to predict SMD from the MODIS and SMOS datasets. The benchmark SMD estimated from the Probability Distributed Model (PDM) over the Brue catchment, Southwest of England, U.K., is used for all the validation. The performances are assessed in terms of Nash-Sutcliffe Efficiency, Root Mean Square Error and the percentage of bias between the ANFIS-simulated SMD and the benchmark. The performance statistics revealed a good agreement between the benchmark and the ANFIS-estimated SMD using the MODIS dataset. The assessment of the products with respect to this peculiar evidence is an important step for successful development of hydro-meteorological models and forecasting systems. The analysis of the satellite products (viz. SMOS soil moisture and MODIS LST) towards SMD prediction is a crucial step for successful hydrological modelling, agriculture and water resource management, and can provide important assistance in policy and decision making. Keywords: Land Surface Temperature, MODIS, SMOS, Soil Moisture Deficit, Fuzzy Logic System
References: Srivastava, P.K., Han, D., Ramirez, M.A., Islam, T., 2013a. Appraisal of SMOS soil moisture at a catchment scale in a temperate maritime climate. Journal of Hydrology 498, 292-304. Srivastava, P.K., Han, D., Rico-Ramirez, M.A., Al-Shrafany, D., Islam, T., 2013b. Data fusion techniques for improving soil moisture deficit using SMOS satellite and WRF-NOAH land surface model. Water Resources Management 27, 5069-5087. Srivastava, P.K., Han, D., Rico-Ramirez, M.A., O'Neill, P., Islam, T., Gupta, M., 2014. Assessment of SMOS soil moisture retrieval parameters using tau-omega algorithms for soil moisture deficit estimation. Journal of Hydrology 519, 574-587.
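The three performance measures named in the abstract above (Nash-Sutcliffe Efficiency, Root Mean Square Error and percentage of bias) can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the authors' code; the function names are hypothetical, and the sign convention for percent bias (positive when the simulation overestimates) is an assumption, since conventions differ between studies.

```python
import math


def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 matches the mean of obs."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / spread


def rmse(obs, sim):
    """Root Mean Square Error between benchmark and simulated values."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def percent_bias(obs, sim):
    """Percent bias; ASSUMED convention: positive means sim overestimates obs."""
    return 100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs)


if __name__ == "__main__":
    benchmark_smd = [1.0, 2.0, 3.0, 4.0]   # made-up benchmark values
    simulated_smd = [1.1, 1.9, 3.2, 3.8]   # made-up simulated values
    print(nash_sutcliffe(benchmark_smd, simulated_smd))
    print(rmse(benchmark_smd, simulated_smd))
    print(percent_bias(benchmark_smd, simulated_smd))
```

A simulation that simply reproduces the mean of the benchmark scores an efficiency of 0, which is why values close to 1 are read as good agreement.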
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jamieson, Kevin; Davis, IV, Warren L.
Active learning methods automatically adapt data collection by selecting the most informative samples in order to accelerate machine learning. Because of this, real-world testing and comparing active learning algorithms requires collecting new datasets (adaptively), rather than simply applying algorithms to benchmark datasets, as is the norm in (passive) machine learning research. To facilitate the development, testing and deployment of active learning for real applications, we have built an open-source software system for large-scale active learning research and experimentation. The system, called NEXT, provides a unique platform for real-world, reproducible active learning research. This paper details the challenges of building the system and demonstrates its capabilities with several experiments. The results show how experimentation can help expose strengths and weaknesses of active learning algorithms, in sometimes unexpected and enlightening ways.
Ju, Chunhua; Xu, Chonghuan
2013-01-01
Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods.
Ju, Chunhua
2013-01-01
Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods. PMID:24381525
HTM Spatial Pooler With Memristor Crossbar Circuits for Sparse Biometric Recognition.
James, Alex Pappachen; Fedorova, Irina; Ibrayev, Timur; Kudithipudi, Dhireesha
2017-06-01
Hierarchical Temporal Memory (HTM) is an online machine learning algorithm that emulates the neo-cortex. The development of a scalable on-chip HTM architecture is an open research area. The two core substructures of HTM are spatial pooler and temporal memory. In this work, we propose a new Spatial Pooler circuit design with parallel memristive crossbar arrays for the 2D columns. The proposed design was validated on two different benchmark datasets, face recognition, and speech recognition. The circuits are simulated and analyzed using a practical memristor device model and 0.18 μm IBM CMOS technology model. The databases AR, YALE, ORL, and UFI, are used to test the performance of the design in face recognition. TIMIT dataset is used for the speech recognition.
Wang, Zhi-wei; Wu, Xiao-dong; Yue, Guang-yang; Zhao, Lin; Wang, Qian; Nan, Zhuo-tong; Qin, Yu; Wu, Tong-hua; Shi, Jian-zong; Zou, De-fu
2016-02-01
Recently, considerable research has focused on monitoring vegetation changes because of their important role in regulating the terrestrial carbon cycle and the climate system. The Qinghai-Tibet Plateau (QTP), often referred to as the third pole of the world, contains the largest high-altitude areas, and vegetation in this region is highly sensitive to global warming. NDVI, a normalized transform of the near-infrared radiation (NIR) to red reflectance ratio, is one of the most useful tools for monitoring vegetation activity with high spatial and temporal resolution. Therefore, the GIMMS NDVI dataset was extended from 1982-2006 to 1982-2014 for the QTP using a unary linear regression against the MODIS dataset from 2000 to 2014. Compared with previous research, the accuracy of the extended NDVI dataset was further improved by taking into account the residuals derived from scale transformation, so the model used to extend the NDVI dataset could be a new method for integrating different NDVI products. With the extended NDVI dataset, we found a statistically significant increase in growing-season NDVI (0.0004 yr⁻¹, r² = 0.5859, p < 0.001) in the QTP from 1982 to 2014. During the study period, NDVI showed increasing trends in spring (0.0005 yr⁻¹, r² = 0.2954, p = 0.001), summer (0.0003 yr⁻¹, r² = 0.1053, p = 0.065) and autumn (0.0006 yr⁻¹, r² = 0.4367, p < 0.001). Because of the increased vegetation activity in the Qinghai-Tibet Plateau from 1982 to 2014, the carbon sink in this region also accumulated over the same period. Temperature and precipitation data were then used to explore the reasons for the vegetation changes. Although both show increasing trends, the correlation of NDVI with temperature is higher than with precipitation in the vegetation growing season, spring, summer and autumn. Furthermore, there is significant spatial heterogeneity in the trends of NDVI, temperature and precipitation at the Qinghai-Tibet Plateau scale.
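The two building blocks mentioned in the abstract above, the NDVI transform and a unary (single-predictor) linear regression between two NDVI products, can be sketched as below. This is an illustrative sketch only: the function names and the sample numbers are hypothetical, the regression is plain ordinary least squares, and the residual correction for scale transformation described in the abstract is not reproduced.

```python
def ndvi(nir, red):
    """Normalized difference of near-infrared and red reflectance."""
    return (nir - red) / (nir + red)


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical overlap-period values from two NDVI products.
    modis_like = [0.20, 0.25, 0.30, 0.35]
    gimms_like = [0.22, 0.26, 0.31, 0.35]
    slope, intercept = fit_line(modis_like, gimms_like)
    # Map a later MODIS-like value onto the GIMMS-like scale.
    print(slope * 0.32 + intercept)
```

Fitting the line on the years where both products exist and applying it to the later years is one simple way to read "extending" a record; the actual procedure in the study may differ in detail.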
A novel approach to identifying regulatory motifs in distantly related genomes
Van Hellemont, Ruth; Monsieurs, Pieter; Thijs, Gert; De Moor, Bart; Van de Peer, Yves; Marchal, Kathleen
2005-01-01
Although proven successful in the identification of regulatory motifs, phylogenetic footprinting methods still show some shortcomings. To assess these difficulties, most apparent when applying phylogenetic footprinting to distantly related organisms, we developed a two-step procedure that combines the advantages of sequence alignment and motif detection approaches. The results on well-studied benchmark datasets indicate that the presented method outperforms other methods when the sequences become either too long or too heterogeneous in size. PMID:16420672
Stephen, Reejis; Boxwala, Aziz; Gertman, Paul
2003-01-01
Data from Clinical Data Warehouses (CDWs) can be used for retrospective studies and for benchmarking. However, automated identification of cases from large datasets containing data items in free text fields is challenging. We developed an algorithm for categorizing pediatric patients presenting with respiratory distress into Bronchiolitis, Bacterial pneumonia and Asthma using clinical variables from a CDW. A feasibility study of this approach indicates that case selection may be automated.
Ellis, D W; Srigley, J
2016-01-01
Key quality parameters in diagnostic pathology include timeliness, accuracy, completeness, conformance with current agreed standards, consistency and clarity in communication. In this review, we argue that with worldwide developments in eHealth and big data, generally, there are two further, often overlooked, parameters if our reports are to be fit for purpose. Firstly, population-level studies have clearly demonstrated the value of providing timely structured reporting data in standardised electronic format as part of system-wide quality improvement programmes. Moreover, when combined with multiple health data sources through eHealth and data linkage, structured pathology reports become central to population-level quality monitoring, benchmarking, interventions and benefit analyses in public health management. Secondly, population-level studies, particularly for benchmarking, require a single agreed international and evidence-based standard to ensure interoperability and comparability. This has been taken for granted in tumour classification and staging for many years, yet international standardisation of cancer datasets is only now underway through the International Collaboration on Cancer Reporting (ICCR). In this review, we present evidence supporting the role of structured pathology reporting in quality improvement for both clinical care and population-level health management. Although this review of available evidence largely relates to structured reporting of cancer, it is clear that the same principles can be applied throughout anatomical pathology generally, as they are elsewhere in the health system.
Multi-Label Learning via Random Label Selection for Protein Subcellular Multi-Locations Prediction.
Wang, Xiao; Li, Guo-Zheng
2013-03-12
Prediction of protein subcellular localization is an important but challenging problem, particularly when proteins may simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing protein subcellular localization methods are only used to deal with the single-location proteins. In the past few years, only a few methods have been proposed to tackle proteins with multiple locations. However, they only adopt a simple strategy, that is, transforming the multi-location proteins to multiple proteins with single location, which doesn't take correlations among different subcellular locations into account. In this paper, a novel method named RALS (multi-label learning via RAndom Label Selection), is proposed to learn from multi-location proteins in an effective and efficient way. Through five-fold cross validation test on a benchmark dataset, we demonstrate our proposed method with consideration of label correlations obviously outperforms the baseline BR method without consideration of label correlations, indicating correlations among different subcellular locations really exist and contribute to improvement of prediction performance. Experimental results on two benchmark datasets also show that our proposed methods achieve significantly higher performance than some other state-of-the-art methods in predicting subcellular multi-locations of proteins. The prediction web server is available at http://levis.tongji.edu.cn:8080/bioinfo/MLPred-Euk/ for the public usage.
ERIC Educational Resources Information Center
Howe, Dorothea
2008-01-01
The state of Ohio made globalizing its education a priority. The Ohio Department of Education benchmarked its practices against world-class standards, expanded visiting teacher programs, and promoted Chinese Mandarin language instruction and curriculum development in Ohio classrooms. Numerous partnerships extended and supported those practices.…
The observed clustering of damaging extra-tropical cyclones in Europe
NASA Astrophysics Data System (ADS)
Cusack, S.
2015-12-01
The clustering of severe European windstorms on annual timescales has substantial impacts on the re/insurance industry. Management of the risk is impaired by large uncertainties in estimates of clustering from historical storm datasets typically covering the past few decades. The uncertainties are unusually large because clustering depends on the variance of storm counts. Eight storm datasets are gathered for analysis in this study in order to reduce these uncertainties. Six of the datasets contain more than 100 years of severe storm information to reduce sampling errors, and the diversity of information sources and analysis methods between datasets sample observational errors. All storm severity measures used in this study reflect damage, to suit re/insurance applications. It is found that the shortest storm dataset of 42 years in length provides estimates of clustering with very large sampling and observational errors. The dataset does provide some useful information: indications of stronger clustering for more severe storms, particularly for southern countries off the main storm track. However, substantially different results are produced by removal of one stormy season, 1989/1990, which illustrates the large uncertainties from a 42-year dataset. The extended storm records place 1989/1990 into a much longer historical context to produce more robust estimates of clustering. All the extended storm datasets show a greater degree of clustering with increasing storm severity and suggest clustering of severe storms is much more material than weaker storms. Further, they contain signs of stronger clustering in areas off the main storm track, and weaker clustering for smaller-sized areas, though these signals are smaller than uncertainties in actual values. Both the improvement of existing storm records and development of new historical storm datasets would help to improve management of this risk.
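The remark above that clustering depends on the variance of storm counts can be made concrete with a dispersion statistic, variance divided by mean, minus one. The abstract does not say which statistic the study uses, so this choice is an assumption, and the function name and sample counts are hypothetical. Under this measure, counts that behave like a Poisson process give a value near 0, and positive values indicate clustering.

```python
from statistics import mean, variance


def dispersion_statistic(annual_counts):
    """Variance-to-mean ratio minus one; > 0 suggests clustering (overdispersion)."""
    average = mean(annual_counts)
    if average == 0:
        raise ValueError("dispersion is undefined when no storms were counted")
    # statistics.variance is the sample variance (n - 1 in the denominator).
    return variance(annual_counts) / average - 1.0


if __name__ == "__main__":
    # Made-up storm counts per season, for illustration only.
    storms_per_season = [0, 1, 0, 5, 0, 0, 4, 1, 0, 0]
    print(dispersion_statistic(storms_per_season))
```

Because a variance estimate is sensitive to a few extreme seasons, removing a single stormy season from a short record can change the statistic considerably, which is consistent with the sensitivity to 1989/1990 reported above.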
Simple mathematical law benchmarks human confrontations
Johnson, Neil F.; Medina, Pablo; Zhao, Guannan; Messinger, Daniel S.; Horgan, John; Gill, Paul; Bohorquez, Juan Camilo; Mattson, Whitney; Gangi, Devon; Qi, Hong; Manrique, Pedro; Velasquez, Nicolas; Morgenstern, Ana; Restrepo, Elvira; Johnson, Nicholas; Spagat, Michael; Zarama, Roberto
2013-01-01
Many high-profile societal problems involve an individual or group repeatedly attacking another – from child-parent disputes, sexual violence against women, civil unrest, violent conflicts and acts of terror, to current cyber-attacks on national infrastructure and ultrafast cyber-trades attacking stockholders. There is an urgent need to quantify the likely severity and timing of such future acts, shed light on likely perpetrators, and identify intervention strategies. Here we present a combined analysis of multiple datasets across all these domains which account for >100,000 events, and show that a simple mathematical law can benchmark them all. We derive this benchmark and interpret it, using a minimal mechanistic model grounded by state-of-the-art fieldwork. Our findings provide quantitative predictions concerning future attacks; a tool to help detect common perpetrators and abnormal behaviors; insight into the trajectory of a ‘lone wolf'; identification of a critical threshold for spreading a message or idea among perpetrators; an intervention strategy to erode the most lethal clusters; and more broadly, a quantitative starting point for cross-disciplinary theorizing about human aggression at the individual and group level, in both real and online worlds. PMID:24322528
How well does your model capture the terrestrial ecosystem dynamics of the Arctic-Boreal Region?
NASA Astrophysics Data System (ADS)
Stofferahn, E.; Fisher, J. B.; Hayes, D. J.; Huntzinger, D. N.; Schwalm, C.
2016-12-01
The Arctic-Boreal Region (ABR) is a major source of uncertainties for terrestrial biosphere model (TBM) simulations. These uncertainties are precipitated by a lack of observational data from the region, affecting the parameterizations of cold environment processes in the models. Addressing these uncertainties requires a coordinated effort of data collection and integration of the following key indicators of the ABR ecosystem: disturbance, flora / fauna and related ecosystem function, carbon pools and biogeochemistry, permafrost, and hydrology. We are developing a model-data integration framework for NASA's Arctic Boreal Vulnerability Experiment (ABoVE), wherein data collection for the key ABoVE indicators is driven by matching observations and model outputs to the ABoVE indicators. The data are used as reference datasets for a benchmarking system which evaluates TBM performance with respect to ABR processes. The benchmarking system utilizes performance metrics to identify intra-model and inter-model strengths and weaknesses, which in turn provides guidance to model development teams for reducing uncertainties in TBM simulations of the ABR. The system is directly connected to the International Land Model Benchmarking (ILaMB) system, as an ABR-focused application.
SensorWeb 3G: Extending On-Orbit Sensor Capabilities to Enable Near Realtime User Configurability
NASA Technical Reports Server (NTRS)
Mandl, Daniel; Cappelaere, Pat; Frye, Stuart; Sohlberg, Rob; Ly, Vuong; Chien, Steve; Tran, Daniel; Davies, Ashley; Sullivan, Don; Ames, Troy;
2010-01-01
This research effort prototypes an implementation of a standard interface, Web Coverage Processing Service (WCPS), which is an Open Geospatial Consortium(OGC) standard, to enable users to define, test, upload and execute algorithms for on-orbit sensor systems. The user is able to customize on-orbit data products that result from raw data streaming from an instrument. This extends the SensorWeb 2.0 concept that was developed under a previous Advanced Information System Technology (AIST) effort in which web services wrap sensors and a standardized Extensible Markup Language (XML) based scripting workflow language orchestrates processing steps across multiple domains. SensorWeb 3G extends the concept by providing the user controls into the flight software modules associated with on-orbit sensor and thus provides a degree of flexibility which does not presently exist. The successful demonstrations to date will be presented, which includes a realistic HyspIRI decadal mission testbed. Furthermore, benchmarks that were run will also be presented along with future demonstration and benchmark tests planned. Finally, we conclude with implications for the future and how this concept dovetails into efforts to develop "cloud computing" methods and standards.
Gundogdu, Erhan; Ozkan, Huseyin; Alatan, A Aydin
2017-11-01
Correlation filters have been successfully used in visual tracking due to their modeling power and computational efficiency. However, the state-of-the-art correlation filter-based (CFB) tracking algorithms tend to quickly discard the previous poses of the target, since they consider only a single filter in their models. On the contrary, our approach is to register multiple CFB trackers for previous poses and exploit the registered knowledge when an appearance change occurs. To this end, we propose a novel tracking algorithm (of complexity O(D)) based on a large ensemble of CFB trackers. The ensemble (of size O(2^D)) is organized over a binary tree (depth D), and learns the target appearance subspaces such that each constituent tracker becomes an expert of a certain appearance. During tracking, the proposed algorithm combines only the appearance-aware relevant experts to produce boosted tracking decisions. Additionally, we propose a versatile spatial windowing technique to enhance the individual expert trackers. For this purpose, spatial windows are learned for target objects as well as the correlation filters and then the windowed regions are processed for more robust correlations. In our extensive experiments on benchmark datasets, we achieve a substantial performance increase by using the proposed tracking algorithm together with the spatial windowing.
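The relationship between the tree depth D, the ensemble size and the per-frame cost in the abstract above can be illustrated with a complete binary tree stored in heap order. This sketch only shows why a root-to-leaf walk touches D + 1 nodes while the tree holds on the order of 2^D nodes; it is not the authors' tracker, and the function names and the left/right decision list are hypothetical stand-ins for whatever appearance test selects the relevant experts.

```python
def total_nodes(depth):
    """Number of nodes in a complete binary tree with the given depth."""
    return 2 ** (depth + 1) - 1


def path_to_leaf(depth, go_right_decisions):
    """Heap-style indices visited from the root (index 0) down to one leaf."""
    if len(go_right_decisions) != depth:
        raise ValueError("need exactly one left/right decision per level")
    node = 0
    path = [node]
    for go_right in go_right_decisions:
        # Children of node i live at 2*i + 1 (left) and 2*i + 2 (right).
        node = 2 * node + (2 if go_right else 1)
        path.append(node)
    return path


if __name__ == "__main__":
    depth = 3
    visited = path_to_leaf(depth, [True, False, True])
    print(visited, "of", total_nodes(depth), "nodes")
```

Only the experts along one such path need to be combined for a given frame, which is how an ensemble that grows exponentially with depth can still be evaluated at a cost linear in depth.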
IgSimulator: a versatile immunosequencing simulator.
Safonova, Yana; Lapidus, Alla; Lill, Jennie
2015-10-01
The recent introduction of next-generation sequencing technologies to antibody studies has resulted in a growing number of immunoinformatics tools for antibody repertoire analysis. However, benchmarking these newly emerging tools remains problematic since the gold standard datasets that are needed to validate these tools are typically not available. Since simulating antibody repertoires is often the only feasible way to benchmark new immunoinformatics tools, we developed the IgSimulator tool that addresses various complications in generating realistic antibody repertoires. IgSimulator's code has a modular structure and can be easily adapted to new simulation requirements. IgSimulator is open source and freely available as a C++ and Python program running on all Unix-compatible platforms. The source code is available from yana-safonova.github.io/ig_simulator. Contact: safonova.yana@gmail.com. Supplementary data are available at Bioinformatics online.
Towards the quantitative evaluation of visual attention models.
Bylinskii, Z; DeGennaro, E M; Rajalingham, R; Ruda, H; Zhang, J; Tsotsos, J K
2015-11-01
Scores of visual attention models have been developed over the past several decades of research. Differences in implementation, assumptions, and evaluations have made comparison of these models very difficult. Taxonomies have been constructed in an attempt at the organization and classification of models, but are not sufficient at quantifying which classes of models are most capable of explaining available data. At the same time, a multitude of physiological and behavioral findings have been published, measuring various aspects of human and non-human primate visual attention. All of these elements highlight the need to integrate the computational models with the data by (1) operationalizing the definitions of visual attention tasks and (2) designing benchmark datasets to measure success on specific tasks, under these definitions. In this paper, we provide some examples of operationalizing and benchmarking different visual attention tasks, along with the relevant design considerations.
Integrating CFD, CAA, and Experiments Towards Benchmark Datasets for Airframe Noise Problems
NASA Technical Reports Server (NTRS)
Choudhari, Meelan M.; Yamamoto, Kazuomi
2012-01-01
Airframe noise corresponds to the acoustic radiation due to turbulent flow in the vicinity of airframe components such as high-lift devices and landing gears. The combination of geometric complexity, high Reynolds number turbulence, multiple regions of separation, and a strong coupling with adjacent physical components makes the problem of airframe noise highly challenging. Since 2010, the American Institute of Aeronautics and Astronautics has organized an ongoing series of workshops devoted to Benchmark Problems for Airframe Noise Computations (BANC). The BANC workshops are aimed at enabling a systematic progress in the understanding and high-fidelity predictions of airframe noise via collaborative investigations that integrate state of the art computational fluid dynamics, computational aeroacoustics, and in depth, holistic, and multifacility measurements targeting a selected set of canonical yet realistic configurations. This paper provides a brief summary of the BANC effort, including its technical objectives, strategy, and selective outcomes thus far.
Modeling Urban Scenarios & Experiments: Fort Indiantown Gap Data Collections Summary and Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Archer, Daniel E.; Bandstra, Mark S.; Davidson, Gregory G.
This report summarizes experimental radiation detector, contextual sensor, weather, and global positioning system (GPS) data collected to inform and validate a comprehensive, operational radiation transport modeling framework to evaluate radiation detector system and algorithm performance. This framework will be used to study the influence of systematic effects (such as geometry, background activity, background variability, environmental shielding, etc.) on detector responses and algorithm performance using synthetic time series data. This work consists of performing data collection campaigns at a canonical, controlled environment for complete radiological characterization to help construct and benchmark a high-fidelity model with quantified system geometries, detector response functions, and source terms for background and threat objects. This data also provides an archival, benchmark dataset that can be used by the radiation detection community. The data reported here spans four data collection campaigns conducted between May 2015 and September 2016.
Lazaris, Charalampos; Kelly, Stephen; Ntziachristos, Panagiotis; Aifantis, Iannis; Tsirigos, Aristotelis
2017-01-05
Chromatin conformation capture techniques have evolved rapidly over the last few years and have provided new insights into genome organization at an unprecedented resolution. Analysis of Hi-C data is complex and computationally intensive involving multiple tasks and requiring robust quality assessment. This has led to the development of several tools and methods for processing Hi-C data. However, most of the existing tools do not cover all aspects of the analysis and only offer few quality assessment options. Additionally, availability of a multitude of tools makes scientists wonder how these tools and associated parameters can be optimally used, and how potential discrepancies can be interpreted and resolved. Most importantly, investigators need to be ensured that slight changes in parameters and/or methods do not affect the conclusions of their studies. To address these issues (compare, explore and reproduce), we introduce HiC-bench, a configurable computational platform for comprehensive and reproducible analysis of Hi-C sequencing data. HiC-bench performs all common Hi-C analysis tasks, such as alignment, filtering, contact matrix generation and normalization, identification of topological domains, scoring and annotation of specific interactions using both published tools and our own. We have also embedded various tasks that perform quality assessment and visualization. HiC-bench is implemented as a data flow platform with an emphasis on analysis reproducibility. Additionally, the user can readily perform parameter exploration and comparison of different tools in a combinatorial manner that takes into account all desired parameter settings in each pipeline task. This unique feature facilitates the design and execution of complex benchmark studies that may involve combinations of multiple tool/parameter choices in each step of the analysis. 
To demonstrate the usefulness of our platform, we performed a comprehensive benchmark of existing and new TAD callers exploring different matrix correction methods, parameter settings and sequencing depths. Users can extend our pipeline by adding more tools as they become available. HiC-bench constitutes an easy-to-use and extensible platform for comprehensive analysis of Hi-C datasets. We expect that it will facilitate current analyses and help scientists formulate and test new hypotheses in the field of three-dimensional genome organization.
NASA Astrophysics Data System (ADS)
Kruithof, Maarten C.; Bouma, Henri; Fischer, Noëlle M.; Schutte, Klamer
2016-10-01
Object recognition is important to understand the content of video and allow flexible querying in a large number of cameras, especially for security applications. Recent benchmarks show that deep convolutional neural networks are excellent approaches for object recognition. This paper describes an approach of domain transfer, where features learned from a large annotated dataset are transferred to a target domain where less annotated examples are available as is typical for the security and defense domain. Many of these networks trained on natural images appear to learn features similar to Gabor filters and color blobs in the first layer. These first-layer features appear to be generic for many datasets and tasks while the last layer is specific. In this paper, we study the effect of copying all layers and fine-tuning a variable number. We performed an experiment with a Caffe-based network on 1000 ImageNet classes that are randomly divided in two equal subgroups for the transfer from one to the other. We copy all layers and vary the number of layers that is fine-tuned and the size of the target dataset. We performed additional experiments with the Keras platform on CIFAR-10 dataset to validate general applicability. We show with both platforms and both datasets that the accuracy on the target dataset improves when more target data is used. When the target dataset is large, it is beneficial to freeze only a few layers. For a large target dataset, the network without transfer learning performs better than the transfer network, especially if many layers are frozen. When the target dataset is small, it is beneficial to transfer (and freeze) many layers. For a small target dataset, the transfer network boosts generalization and it performs much better than the network without transfer learning. Learning time can be reduced by freezing many layers in a network.
The Use of Rubrics in Benchmarking and Assessing Employability Skills
ERIC Educational Resources Information Center
Riebe, Linda; Jackson, Denise
2014-01-01
Calls for employability skill development in undergraduates now extend across many culturally similar developed economies. Government initiatives, industry professional accreditation criteria, and the development of academic teaching and learning standards increasingly drive the employability agenda, further cementing the need for skill…
Electric load shape benchmarking for small- and medium-sized commercial buildings
DOE Office of Scientific and Technical Information (OSTI.GOV)
Luo, Xuan; Hong, Tianzhen; Chen, Yixing
Owners of small- and medium-sized commercial buildings and utility managers often look for opportunities for energy cost savings through energy efficiency and energy waste minimization. However, they currently lack easy access to low-cost tools that help interpret the massive amount of data needed to improve understanding of their energy use behaviors. Benchmarking is one of the techniques used in energy audits to identify which buildings are priorities for an energy analysis. Traditional energy performance indicators, such as the energy use intensity (annual energy per unit of floor area), consider only the total annual energy consumption, lacking consideration of the fluctuation of energy use behavior over time, which reveals the time of use information and represents distinct energy use behaviors during different time spans. To fill the gap, this study developed a general statistical method using 24-hour electric load shape benchmarking to compare a building or business/tenant space against peers. Specifically, the study developed new forms of benchmarking metrics and data analysis methods to infer the energy performance of a building based on its load shape. We first performed a data experiment with collected smart meter data using over 2,000 small- and medium-sized businesses in California. We then conducted a cluster analysis of the source data, and determined and interpreted the load shape features and parameters with peer group analysis. Finally, we implemented the load shape benchmarking feature in an open-access web-based toolkit (the Commercial Building Energy Saver) to provide straightforward and practical recommendations to users. The analysis techniques were generic and flexible for future datasets of other building types and in other utility territories.
Electric load shape benchmarking for small- and medium-sized commercial buildings
Luo, Xuan; Hong, Tianzhen; Chen, Yixing; ...
2017-07-28
Owners of small- and medium-sized commercial buildings and utility managers often look for opportunities for energy cost savings through energy efficiency and energy waste minimization. However, they currently lack easy access to low-cost tools that help interpret the massive amount of data needed to improve understanding of their energy use behaviors. Benchmarking is one of the techniques used in energy audits to identify which buildings are priorities for an energy analysis. Traditional energy performance indicators, such as the energy use intensity (annual energy per unit of floor area), consider only the total annual energy consumption, lacking consideration of the fluctuation of energy use behavior over time, which reveals the time of use information and represents distinct energy use behaviors during different time spans. To fill the gap, this study developed a general statistical method using 24-hour electric load shape benchmarking to compare a building or business/tenant space against peers. Specifically, the study developed new forms of benchmarking metrics and data analysis methods to infer the energy performance of a building based on its load shape. We first performed a data experiment with collected smart meter data using over 2,000 small- and medium-sized businesses in California. We then conducted a cluster analysis of the source data, and determined and interpreted the load shape features and parameters with peer group analysis. Finally, we implemented the load shape benchmarking feature in an open-access web-based toolkit (the Commercial Building Energy Saver) to provide straightforward and practical recommendations to users. The analysis techniques were generic and flexible for future datasets of other building types and in other utility territories.
Haider, Adil H; Hashmi, Zain G; Gupta, Sonia; Zafar, Syed Nabeel; David, Jean-Stephane; Efron, David T; Stevens, Kent A; Zafar, Hasnain; Schneider, Eric B; Voiglio, Eric; Coimbra, Raul; Haut, Elliott R
2014-08-01
National trauma registries have helped improve patient outcomes across the world. Recently, the idea of an International Trauma Data Bank (ITDB) has been suggested to establish global comparative assessments of trauma outcomes. The objective of this study was to determine whether global trauma data could be combined to perform international outcomes benchmarking. We used observed/expected (O/E) mortality ratios to compare two trauma centers [European high-income country (HIC) and Asian lower-middle income country (LMIC)] with centers in the North American National Trauma Data Bank (NTDB). Patients (≥16 years) with blunt/penetrating injuries were included. Multivariable logistic regression, adjusting for known predictors of trauma mortality, was performed. Estimates were used to predict the expected deaths at each center and to calculate O/E mortality ratios for benchmarking. A total of 375,433 patients from 301 centers were included from the NTDB (2002-2010). The LMIC trauma center had 806 patients (2002-2010), whereas the HIC reported 1,003 patients (2002-2004). The most important known predictors of trauma mortality were adequately recorded in all datasets. Mortality benchmarking revealed that the HIC center performed similarly to the NTDB centers [O/E = 1.11 (95% confidence interval (CI) 0.92-1.35)], whereas the LMIC center showed significantly worse survival [O/E = 1.52 (1.23-1.88)]. Subset analyses of patients with blunt or penetrating injury showed similar results. Using only a few key covariates, aggregated global trauma data can be used to adequately perform international trauma center benchmarking. The creation of the ITDB is feasible and recommended as it may be a pivotal step towards improving global trauma outcomes.
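The observed/expected (O/E) ratio described in the abstract above can be illustrated with a minimal sketch. It assumes that a risk model (a multivariable logistic regression in the study) has already produced a death probability for each patient; the function name and the toy numbers are hypothetical and are not taken from the study.

```python
def oe_ratio(outcomes, predicted_probs):
    """Observed/expected mortality ratio for one center.

    outcomes: 1 if the patient died, 0 otherwise.
    predicted_probs: per-patient death probability from a risk model
    (assumed to be fitted elsewhere, e.g. on a reference registry).
    """
    if len(outcomes) != len(predicted_probs):
        raise ValueError("need one predicted probability per patient")
    observed = sum(outcomes)          # deaths that actually occurred
    expected = sum(predicted_probs)   # deaths the model predicts
    if expected <= 0:
        raise ValueError("expected deaths must be positive")
    return observed / expected


# Hypothetical toy data: 3 observed deaths against 2.0 expected deaths.
outcomes = [1, 0, 0, 1, 0, 0, 0, 1]
probs = [0.5, 0.1, 0.2, 0.4, 0.1, 0.2, 0.2, 0.3]
ratio = oe_ratio(outcomes, probs)
```

A ratio above 1 means more deaths were observed than the model expected. The confidence intervals quoted in the abstract require an additional step that is not shown here.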
A Visual Evaluation Study of Graph Sampling Techniques
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Fangyan; Zhang, Song; Wong, Pak C.
2017-01-29
We evaluate a dozen prevailing graph-sampling techniques with an ultimate goal to better visualize and understand big and complex graphs that exhibit different properties and structures. The evaluation uses eight benchmark datasets with four different graph types collected from Stanford Network Analysis Platform and NetworkX to give a comprehensive comparison of various types of graphs. The study provides a practical guideline for visualizing big graphs of different sizes and structures. The paper discusses results and important observations from the study.
Anytime query-tuned kernel machine classifiers via Cholesky factorization
NASA Technical Reports Server (NTRS)
DeCoste, D.
2002-01-01
We recently demonstrated 2 to 64-fold query-time speedups of Support Vector Machine and Kernel Fisher classifiers via a new computational geometry method for anytime output bounds (DeCoste, 2002). This new paper refines our approach in two key ways. First, we introduce a simple linear algebra formulation based on Cholesky factorization, yielding simpler equations and lower computational overhead. Second, this new formulation suggests new methods for achieving additional speedups, including tuning on query samples. We demonstrate effectiveness on benchmark datasets.
Reliable prediction intervals with regression neural networks.
Papadopoulos, Harris; Haralambous, Haris
2011-10-01
This paper proposes an extension to conventional regression neural networks (NNs) for replacing the point predictions they produce with prediction intervals that satisfy a required level of confidence. Our approach follows a novel machine learning framework, called Conformal Prediction (CP), for assigning reliable confidence measures to predictions without assuming anything more than that the data are independent and identically distributed (i.i.d.). We evaluate the proposed method on four benchmark datasets and on the problem of predicting Total Electron Content (TEC), which is an important parameter in trans-ionospheric links; for the latter we use a dataset of more than 60000 TEC measurements collected over a period of 11 years. Our experimental results show that the prediction intervals produced by our method are both well calibrated and tight enough to be useful in practice. Copyright © 2011 Elsevier Ltd. All rights reserved.
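As a rough illustration of how a conformal method turns point predictions into intervals, the sketch below uses the split (inductive) variant with absolute residuals as nonconformity scores. This variant and every name in it are assumptions made for illustration; the abstract does not say which form of Conformal Prediction or which nonconformity measure the authors used.

```python
import math


def conformal_halfwidth(calib_residuals, confidence):
    """Half-width q of a split-conformal prediction interval.

    calib_residuals: (true value - prediction) on a held-out calibration set
    that was not used to train the regressor.
    confidence: required coverage level, e.g. 0.8 for 80%.
    Under the i.i.d. (exchangeability) assumption, the interval
    [prediction - q, prediction + q] covers a new true value with
    probability at least `confidence`.
    """
    n = len(calib_residuals)
    scores = sorted(abs(r) for r in calib_residuals)
    k = math.ceil((n + 1) * confidence)  # rank of the score to use
    if k > n:
        return math.inf  # too few calibration points for this confidence
    return scores[k - 1]


def predict_interval(point_prediction, halfwidth):
    """Symmetric interval around a point prediction."""
    return (point_prediction - halfwidth, point_prediction + halfwidth)


# Hypothetical calibration residuals from some fitted regression model.
residuals = [0.5, -1.0, 0.2, 1.5, -0.3, 0.8, -0.1, 0.4, -0.6]
q80 = conformal_halfwidth(residuals, 0.8)
interval = predict_interval(10.0, q80)
```

With only nine calibration points, a 95% interval cannot be guaranteed, so the function returns an infinite half-width in that case. In this sketch the intervals have the same width for every input.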
Hyperspectral Image Classification With Markov Random Fields and a Convolutional Neural Network
NASA Astrophysics Data System (ADS)
Cao, Xiangyong; Zhou, Feng; Xu, Lin; Meng, Deyu; Xu, Zongben; Paisley, John
2018-05-01
This paper presents a new supervised classification algorithm for remotely sensed hyperspectral image (HSI) which integrates spectral and spatial information in a unified Bayesian framework. First, we formulate the HSI classification problem from a Bayesian perspective. Then, we adopt a convolutional neural network (CNN) to learn the posterior class distributions using a patch-wise training strategy to better use the spatial information. Next, spatial information is further considered by placing a spatial smoothness prior on the labels. Finally, we iteratively update the CNN parameters using stochastic gradient descent (SGD) and update the class labels of all pixel vectors using an alpha-expansion min-cut-based algorithm. Compared with other state-of-the-art methods, the proposed classification method achieves better performance on one synthetic dataset and two benchmark HSI datasets in a number of experimental settings.
Active learning for noisy oracle via density power divergence.
Sogawa, Yasuhiro; Ueno, Tsuyoshi; Kawahara, Yoshinobu; Washio, Takashi
2013-10-01
The accuracy of active learning is critically influenced by the existence of noisy labels given by a noisy oracle. In this paper, we propose a novel pool-based active learning framework through robust measures based on density power divergence. By minimizing density power divergence, such as β-divergence and γ-divergence, one can estimate the model accurately even under the existence of noisy labels within data. Accordingly, we develop query selecting measures for pool-based active learning using these divergences. In addition, we propose an evaluation scheme for these measures based on asymptotic statistical analyses, which enables us to perform active learning by evaluating an estimation error directly. Experiments with benchmark datasets and real-world image datasets show that our active learning scheme performs better than several baseline methods. Copyright © 2013 Elsevier Ltd. All rights reserved.
Ahrenfeldt, Johanne; Skaarup, Carina; Hasman, Henrik; Pedersen, Anders Gorm; Aarestrup, Frank Møller; Lund, Ole
2017-01-05
Whole genome sequencing (WGS) is increasingly used in diagnostics and surveillance of infectious diseases. A major application for WGS is to use the data for identifying outbreak clusters, and there is therefore a need for methods that can accurately and efficiently infer phylogenies from sequencing reads. In the present study we describe a new dataset that we have created for the purpose of benchmarking such WGS-based methods for epidemiological data, and also present an analysis where we use the data to compare the performance of some current methods. Our aim was to create a benchmark data set that mimics sequencing data of the sort that might be collected during an outbreak of an infectious disease. This was achieved by letting an E. coli hypermutator strain grow in the lab for 8 consecutive days, each day splitting the culture in two while also collecting samples for sequencing. The result is a data set consisting of 101 whole genome sequences with known phylogenetic relationship. Among the sequenced samples 51 correspond to internal nodes in the phylogeny because they are ancestral, while the remaining 50 correspond to leaves. We also used the newly created data set to compare three different online available methods that infer phylogenies from whole-genome sequencing reads: NDtree, CSI Phylogeny and REALPHY. One complication when comparing the output of these methods with the known phylogeny is that phylogenetic methods typically build trees where all observed sequences are placed as leaves, even though some of them are in fact ancestral. We therefore devised a method for post-processing the inferred trees by collapsing short branches (thus relocating some leaves to internal nodes), and also present two new measures of tree similarity that take into account the identity of both internal and leaf nodes.
Based on this analysis we find that, among the investigated methods, CSI Phylogeny had the best performance, correctly identifying 73% of all branches in the tree and 71% of all clades. We have made all data from this experiment (raw sequencing reads, consensus whole-genome sequences, as well as descriptions of the known phylogeny in a variety of formats) publicly available, with the hope that other groups may find this data useful for benchmarking and exploring the performance of epidemiological methods. All data is freely available at: https://cge.cbs.dtu.dk/services/evolution_data.php .
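The post-processing idea described above, collapsing short branches so that some leaves are relocated to internal nodes, can be sketched as follows. The dictionary-based tree, the threshold value and the function names are all hypothetical; the abstract does not give the authors' actual procedure or cutoff.

```python
def make_node(name=None, length=0.0, children=None):
    """Tree node: optional sample name, branch length to parent, children."""
    return {"name": name, "length": length, "children": children or []}


def collapse_short_leaves(node, threshold):
    """Move leaves on very short branches onto their parent node.

    A leaf whose branch length is <= threshold is treated as the ancestral
    sample itself: its name is given to the (unnamed) parent and the leaf
    is removed. At most one sample is assigned to each internal node.
    """
    kept = []
    for child in node["children"]:
        collapse_short_leaves(child, threshold)  # process subtrees first
        is_leaf = not child["children"]
        if is_leaf and node["name"] is None and child["length"] <= threshold:
            node["name"] = child["name"]
        else:
            kept.append(child)
    node["children"] = kept
    return node


# Hypothetical inferred tree in which every sample (A-D) starts as a leaf.
tree = make_node(children=[
    make_node("A", 0.0),
    make_node(length=3.0, children=[
        make_node("B", 0.0),
        make_node("C", 2.0),
        make_node("D", 4.0),
    ]),
])
collapse_short_leaves(tree, threshold=0.5)
```

After the call, samples A and B label internal nodes, while C and D remain leaves.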
TRIPOLI-4® - MCNP5 ITER A-lite neutronic model benchmarking
NASA Astrophysics Data System (ADS)
Jaboulay, J.-C.; Cayla, P.-Y.; Fausser, C.; Lee, Y.-K.; Trama, J.-C.; Li-Puma, A.
2014-06-01
The aim of this paper is to present the capability of TRIPOLI-4®, the CEA Monte Carlo code, to model a large-scale fusion reactor with complex neutron source and geometry. In the past, numerous benchmarks were conducted to assess TRIPOLI-4® for fusion applications. Analyses of experiments (KANT, OKTAVIAN, FNG) and numerical benchmarks (between TRIPOLI-4® and MCNP5) on the HCLL DEMO2007 and ITER models were carried out successively. In the previous ITER benchmark, however, only the neutron wall loading was analyzed; its main purpose was to present the MCAM (the FDS Team CAD import tool) extension for TRIPOLI-4®. Starting from this work, a more extended benchmark has been performed on the estimation of neutron flux, nuclear heating in the shielding blankets and tritium production rate in the European TBMs (HCLL and HCPB), and it is presented in this paper. The methodology to build the TRIPOLI-4® A-lite model is based on MCAM and the MCNP A-lite model (version 4.1). Simplified TBMs (from KIT) have been integrated in the equatorial port. Comparisons of neutron wall loading, flux, nuclear heating and tritium production rate show good agreement between the two codes. Discrepancies mostly lie within the statistical error of the Monte Carlo codes.
WWTP dynamic disturbance modelling--an essential module for long-term benchmarking development.
Gernaey, K V; Rosen, C; Jeppsson, U
2006-01-01
Intensive use of the benchmark simulation model No. 1 (BSM1), a protocol for objective comparison of the effectiveness of control strategies in biological nitrogen removal activated sludge plants, has also revealed a number of limitations. Preliminary definitions of the long-term benchmark simulation model No. 1 (BSM1_LT) and the benchmark simulation model No. 2 (BSM2) have been made to extend BSM1 for evaluation of process monitoring methods and plant-wide control strategies, respectively. Influent-related disturbances for BSM1_LT/BSM2 are to be generated with a model, and this paper provides a general overview of the modelling methods used. Typical influent dynamic phenomena generated with the BSM1_LT/BSM2 influent disturbance model, including diurnal, weekend, seasonal and holiday effects, as well as rainfall, are illustrated with simulation results. As a result of the work described in this paper, a proposed influent model/file has been released to the benchmark developers for evaluation purposes. Pending this evaluation, a final BSM1_LT/BSM2 influent disturbance model definition is foreseen. Preliminary simulations with dynamic influent data generated by the influent disturbance model indicate that default BSM1 activated sludge plant control strategies will need extensions for BSM1_LT/BSM2 to efficiently handle 1 year of influent dynamics.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Arnis Judzis
2002-10-01
This document details the progress to date on the OPTIMIZATION OF MUD HAMMER DRILLING PERFORMANCE -- A PROGRAM TO BENCHMARK THE VIABILITY OF ADVANCED MUD HAMMER DRILLING contract for the quarter starting July 2002 through September 2002. Even though we are awaiting the optimization portion of the testing program, accomplishments include the following: (1) Smith International agreed to participate in the DOE Mud Hammer program. (2) Smith International chromed collars for upcoming benchmark tests at TerraTek, now scheduled for 4Q 2002. (3) ConocoPhillips had a field trial of the Smith fluid hammer offshore Vietnam. The hammer functioned properly, though the well encountered hole conditions and reaming problems. ConocoPhillips plan another field trial as a result. (4) DOE/NETL extended the contract for the fluid hammer program to allow Novatek to "optimize" their much delayed tool to 2003 and to allow Smith International to add "benchmarking" tests in light of SDS Digger Tools' current financial inability to participate. (5) ConocoPhillips joined the Industry Advisors for the mud hammer program. (6) TerraTek acknowledges Smith International, BP America, PDVSA, and ConocoPhillips for cost-sharing the Smith benchmarking tests allowing extension of the contract to complete the optimizations.
Kang, Guangliang; Du, Li; Zhang, Hong
2016-06-22
The growing complexity of biological experiment design based on high-throughput RNA sequencing (RNA-seq) is calling for more accommodative statistical tools. We focus on differential expression (DE) analysis using RNA-seq data in the presence of multiple treatment conditions. We propose a novel method, multiDE, for facilitating DE analysis using RNA-seq read count data with multiple treatment conditions. The read count is assumed to follow a log-linear model incorporating two factors (i.e., condition and gene), where an interaction term is used to quantify the association between gene and condition. The number of degrees of freedom is reduced to one through the first-order decomposition of the interaction, leading to a dramatic improvement in power when testing DE genes if the number of conditions is greater than two. In our simulation situations, multiDE outperformed the benchmark methods (i.e. edgeR and DESeq2) even if the underlying model was severely misspecified, and the power gain increased with the number of conditions. In the application to two real datasets, multiDE identified more biologically meaningful DE genes than the benchmark methods. An R package implementing multiDE is available publicly at http://homepage.fudan.edu.cn/zhangh/softwares/multiDE . When the number of conditions is two, multiDE performs comparably with the benchmark methods. When the number of conditions is greater than two, multiDE outperforms the benchmark methods.
Benchmarks for single-phase flow in fractured porous media
NASA Astrophysics Data System (ADS)
Flemisch, Bernd; Berre, Inga; Boon, Wietse; Fumagalli, Alessio; Schwenck, Nicolas; Scotti, Anna; Stefansson, Ivar; Tatomir, Alexandru
2018-01-01
This paper presents several test cases intended to be benchmarks for numerical schemes for single-phase fluid flow in fractured porous media. A number of solution strategies are compared, including a vertex and two cell-centred finite volume methods, a non-conforming embedded discrete fracture model, a primal and a dual extended finite element formulation, and a mortar discrete fracture model. The proposed benchmarks test the schemes by increasing the difficulties in terms of network geometry, e.g. intersecting fractures, and physical parameters, e.g. low and high fracture-matrix permeability ratio as well as heterogeneous fracture permeabilities. For each problem, the results presented are the number of unknowns, the approximation errors in the porous matrix and in the fractures with respect to a reference solution, and the sparsity and condition number of the discretized linear system. All data and meshes used in this study are publicly available for further comparisons.
A high level interface to SCOP and ASTRAL implemented in python.
Casbon, James A; Crooks, Gavin E; Saqi, Mansoor A S
2006-01-10
Benchmarking algorithms in structural bioinformatics often involves the construction of datasets of proteins with given sequence and structural properties. The SCOP database is a manually curated structural classification which groups together proteins on the basis of structural similarity. The ASTRAL compendium provides non-redundant subsets of SCOP domains on the basis of sequence similarity such that no two domains in a given subset share more than a defined degree of sequence similarity. Taken together these two resources provide a 'ground truth' for assessing structural bioinformatics algorithms. We present a small and easy to use API written in Python to enable construction of datasets from these resources. We have designed a set of Python modules to provide an abstraction of the SCOP and ASTRAL databases. The modules are designed to work as part of the Biopython distribution. Python users can now manipulate and use the SCOP hierarchy from within Python programs, and use ASTRAL to return sequences of domains in SCOP, as well as clustered representations of SCOP from ASTRAL. The modules make the analysis and generation of datasets for use in structural genomics easier and more principled.
Using CHIRPS Rainfall Dataset to detect rainfall trends in West Africa
NASA Astrophysics Data System (ADS)
Blakeley, S. L.; Husak, G. J.
2016-12-01
In West Africa, agriculture is often rain-fed, subjecting agricultural productivity and food availability to climate variability. Agricultural conditions will change as warming temperatures increase evaporative demand, and with a growing population dependent on the food supply, farmers will become more reliant on improved adaptation strategies. Development of such adaptation strategies will need to consider West African rainfall trends to remain relevant in a changing climate. Here, using the CHIRPS rainfall product (provided by the Climate Hazards Group at UC Santa Barbara), I examine trends in West African rainfall variability. My analysis will focus on seasonal rainfall totals, the structure of the rainy season, and the distribution of rainfall. I then use farmer-identified drought years to take an in-depth analysis of intra-seasonal rainfall irregularities. I will also examine other datasets such as potential evapotranspiration (PET) data, other remotely sensed rainfall data, rain gauge data in specific locations, and remotely sensed vegetation data. Farmer bad year data will also be used to isolate "bad" year markers in these additional datasets to provide benchmarks for identification in the future of problematic rainy seasons.
Canessa, Andrea; Gibaldi, Agostino; Chessa, Manuela; Fato, Marco; Solari, Fabio; Sabatini, Silvio P.
2017-01-01
Binocular stereopsis is the ability of a visual system, belonging to a live being or a machine, to interpret the different visual information deriving from two eyes/cameras for depth perception. From this perspective, the ground-truth information about three-dimensional visual space, which is hardly available, is an ideal tool both for evaluating human performance and for benchmarking machine vision algorithms. In the present work, we implemented a rendering methodology in which the camera pose mimics realistic eye pose for a fixating observer, thus including convergent eye geometry and cyclotorsion. The virtual environment we developed relies on highly accurate 3D virtual models, and its full controllability allows us to obtain the stereoscopic pairs together with the ground-truth depth and camera pose information. We thus created a stereoscopic dataset: GENUA PESTO—GENoa hUman Active fixation database: PEripersonal space STereoscopic images and grOund truth disparity. The dataset aims to provide a unified framework useful for a number of problems relevant to human and computer vision, from scene exploration and eye movement studies to 3D scene reconstruction. PMID:28350382
First stereo video dataset with ground truth for remote car pose estimation using satellite markers
NASA Astrophysics Data System (ADS)
Gil, Gustavo; Savino, Giovanni; Pierini, Marco
2018-04-01
A leading cause of PTW (Powered Two-Wheeler) crashes and near misses in urban areas is failed or delayed prediction of the changing trajectories of other vehicles. Regrettably, misperception from both car drivers and motorcycle riders results in fatal or serious consequences for riders. Intelligent vehicles could provide early warning about possible collisions, helping to avoid the crash. There is evidence that stereo cameras can be used for estimating the heading angle of other vehicles, which is key to anticipating their imminent location, but there is limited heading ground truth data available in the public domain. Consequently, we employed a marker-based technique for creating ground truth of car pose and created a dataset for computer vision benchmarking purposes. This dataset of a moving vehicle collected from a static mounted stereo camera is a simplification of a complex and dynamic reality, which serves as a test bed for car pose estimation algorithms. The dataset contains the accurate pose of the moving obstacle, and realistic imagery including texture-less and non-lambertian surfaces (e.g. reflectance and transparency).
101 Labeled Brain Images and a Consistent Human Cortical Labeling Protocol
Klein, Arno; Tourville, Jason
2012-01-01
We introduce the Mindboggle-101 dataset, the largest and most complete set of free, publicly accessible, manually labeled human brain images. To manually label the macroscopic anatomy in magnetic resonance images of 101 healthy participants, we created a new cortical labeling protocol that relies on robust anatomical landmarks and minimal manual edits after initialization with automated labels. The “Desikan–Killiany–Tourville” (DKT) protocol is intended to improve the ease, consistency, and accuracy of labeling human cortical areas. Given how difficult it is to label brains, the Mindboggle-101 dataset is intended to serve as brain atlases for use in labeling other brains, as a normative dataset to establish morphometric variation in a healthy population for comparison against clinical populations, and contribute to the development, training, testing, and evaluation of automated registration and labeling algorithms. To this end, we also introduce benchmarks for the evaluation of such algorithms by comparing our manual labels with labels automatically generated by probabilistic and multi-atlas registration-based approaches. All data and related software and updated information are available on the http://mindboggle.info/data website. PMID:23227001
A large dataset of synthetic SEM images of powder materials and their ground truth 3D structures.
DeCost, Brian L; Holm, Elizabeth A
2016-12-01
This data article presents a data set comprised of 2048 synthetic scanning electron microscope (SEM) images of powder materials and descriptions of the corresponding 3D structures that they represent. These images were created using open source rendering software, and the generating scripts are included with the data set. Eight particle size distributions are represented with 256 independent images from each. The particle size distributions are relatively similar to each other, so that the dataset offers a useful benchmark to assess the fidelity of image analysis techniques. The characteristics of the PSDs and the resulting images are described and analyzed in more detail in the research article "Characterizing powder materials using keypoint-based computer vision methods" (B.L. DeCost, E.A. Holm, 2016) [1]. These data are freely available in a Mendeley Data archive "A large dataset of synthetic SEM images of powder materials and their ground truth 3D structures" (B.L. DeCost, E.A. Holm, 2016) located at http://dx.doi.org/10.17632/tj4syyj9mr.1[2] for any academic, educational, or research purposes.
Systematic Poisoning Attacks on and Defenses for Machine Learning in Healthcare.
Mozaffari-Kermani, Mehran; Sur-Kolay, Susmita; Raghunathan, Anand; Jha, Niraj K
2015-11-01
Machine learning is being used in a wide range of application domains to discover patterns in large datasets. Increasingly, the results of machine learning drive critical decisions in applications related to healthcare and biomedicine. Such health-related applications are often sensitive, and thus, any security breach would be catastrophic. Naturally, the integrity of the results computed by machine learning is of great importance. Recent research has shown that some machine-learning algorithms can be compromised by augmenting their training datasets with malicious data, leading to a new class of attacks called poisoning attacks. Hindrance of a diagnosis may have life-threatening consequences and could cause distrust. On the other hand, not only may a false diagnosis prompt users to distrust the machine-learning algorithm and even abandon the entire system but also such a false positive classification may cause patient distress. In this paper, we present a systematic, algorithm-independent approach for mounting poisoning attacks across a wide range of machine-learning algorithms and healthcare datasets. The proposed attack procedure generates input data, which, when added to the training set, can either cause the results of machine learning to have targeted errors (e.g., increase the likelihood of classification into a specific class), or simply introduce arbitrary errors (incorrect classification). These attacks may be applied to both fixed and evolving datasets. They can be applied even when only statistics of the training dataset are available or, in some cases, even without access to the training dataset, although at a lower efficacy. We establish the effectiveness of the proposed attacks using a suite of six machine-learning algorithms and five healthcare datasets. Finally, we present countermeasures against the proposed generic attacks that are based on tracking and detecting deviations in various accuracy metrics, and benchmark their effectiveness.
NASA Astrophysics Data System (ADS)
Hiebl, Johann; Frei, Christoph
2018-04-01
Spatial precipitation datasets that are long-term consistent, highly resolved and extend over several decades are an increasingly popular basis for modelling and monitoring environmental processes and planning tasks in hydrology, agriculture, energy resources management, etc. Here, we present a grid dataset of daily precipitation for Austria meant to promote such applications. It has a grid spacing of 1 km, extends back till 1961 and is continuously updated. It is constructed with the classical two-tier analysis, involving separate interpolations for mean monthly precipitation and daily relative anomalies. The former was accomplished by kriging with topographic predictors as external drift utilising 1249 stations. The latter is based on angular distance weighting and uses 523 stations. The input station network was kept largely stationary over time to avoid artefacts on long-term consistency. Example cases suggest that the new analysis is at least as plausible as previously existing datasets. Cross-validation and comparison against experimental high-resolution observations (WegenerNet) suggest that the accuracy of the dataset depends on interpretation. Users interpreting grid point values as point estimates must expect systematic overestimates for light and underestimates for heavy precipitation as well as substantial random errors. Grid point estimates are typically within a factor of 1.5 from in situ observations. Interpreting grid point values as area mean values, conditional biases are reduced and the magnitude of random errors is considerably smaller. Together with a similar dataset of temperature, the new dataset (SPARTACUS) is an interesting basis for modelling environmental processes, studying climate change impacts and monitoring the climate of Austria.
Lucas, Rico; Groeneveld, Jürgen; Harms, Hauke; Johst, Karin; Frank, Karin; Kleinsteuber, Sabine
2017-01-01
In times of global change and intensified resource exploitation, advanced knowledge of ecophysiological processes in natural and engineered systems driven by complex microbial communities is crucial for both safeguarding environmental processes and optimising rational control of biotechnological processes. To gain such knowledge, high-throughput molecular techniques are routinely employed to investigate microbial community composition and dynamics within a wide range of natural or engineered environments. However, for molecular dataset analyses no consensus about a generally applicable alpha diversity concept and no appropriate benchmarking of corresponding statistical indices exist yet. To overcome this, we listed criteria for the appropriateness of an index for such analyses and systematically scrutinised commonly employed ecological indices describing diversity, evenness and richness based on artificial and real molecular datasets. We identified appropriate indices warranting interstudy comparability and intuitive interpretability. The unified diversity concept based on 'effective numbers of types' provides the mathematical framework for describing community composition. Additionally, the Bray-Curtis dissimilarity as a beta-diversity index was found to reflect compositional changes. The employed statistical procedure is presented comprising commented R-scripts and example datasets for user-friendly trial application.
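The two quantities named in the abstract above can be illustrated with a minimal sketch. It assumes the standard textbook definitions of Hill numbers ("effective numbers of types") and of the Bray-Curtis dissimilarity. The function names are hypothetical, and this is not the authors' R procedure.

```python
import math


def hill_number(counts, q):
    """Effective number of types (Hill number) of order q for raw abundances."""
    total = sum(counts)
    # Relative abundances; absent types (count 0) do not contribute.
    props = [c / total for c in counts if c > 0]
    if q == 1:
        # The order-1 value is defined as the limit q -> 1: exp(Shannon entropy).
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equally long abundance vectors."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator
```

Under these definitions a perfectly even community of four types has an effective number of 4 at every order q, while a community dominated by one type has an order-2 value close to 1. Two samples that share no types have a Bray-Curtis dissimilarity of 1.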
NASA Astrophysics Data System (ADS)
Peltz-Lewis, L. A.; Blake-Coleman, W.; Johnston, J.; DeLoatch, I. B.
2014-12-01
The Federal Geographic Data Committee (FGDC) is designing a portfolio management process for 193 geospatial datasets contained within the 16 topical National Spatial Data Infrastructure themes managed under OMB Circular A-16 "Coordination of Geographic Information and Related Spatial Data Activities." The 193 datasets are designated as National Geospatial Data Assets (NGDA) because of their significance in implementing the missions of multiple levels of government, partners and stakeholders. As a starting point, the data managers of these NGDAs will conduct a baseline maturity assessment of the dataset(s) for which they are responsible. The maturity is measured against benchmarks related to each of the seven stages of the data lifecycle management framework promulgated within the OMB Circular A-16 Supplemental Guidance issued by OMB in November 2010. This framework was developed by the interagency Lifecycle Management Work Group (LMWG), consisting of 16 Federal agencies, under the 2004 Presidential Initiative the Geospatial Line of Business, using OMB Circular A-130 "Management of Federal Information Resources" as guidance. The seven lifecycle stages are: Define, Inventory/Evaluate, Obtain, Access, Maintain, Use/Evaluate, and Archive. This paper will focus on the Lifecycle Baseline Maturity Assessment, and efforts to integrate the FGDC approach with other data maturity assessments.
Dashtban, M; Balafar, Mohammadali
2017-03-01
Gene selection is a demanding task for microarray data analysis. The diverse complexity of different cancers makes this issue still challenging. In this study, a novel evolutionary method based on genetic algorithms and artificial intelligence is proposed to identify predictive genes for cancer classification. A filter method was first applied to reduce the dimensionality of feature space followed by employing an integer-coded genetic algorithm with dynamic-length genotype, intelligent parameter settings, and modified operators. The algorithmic behaviors including convergence trends, mutation and crossover rate changes, and running time were studied, conceptually discussed, and shown to be coherent with literature findings. Two well-known filter methods, Laplacian and Fisher score, were examined considering similarities, the quality of selected genes, and their influences on the evolutionary approach. Several statistical tests concerning choice of classifier, choice of dataset, and choice of filter method were performed, and they revealed some significant differences between the performance of different classifiers and filter methods over datasets. The proposed method was benchmarked upon five popular high-dimensional cancer datasets; for each, top explored genes were reported. Comparing the experimental results with several state-of-the-art methods revealed that the proposed method outperforms previous methods on the DLBCL dataset.
Benchmark Simulation Model No 2: finalisation of plant layout and default control strategy.
Nopens, I; Benedetti, L; Jeppsson, U; Pons, M-N; Alex, J; Copp, J B; Gernaey, K V; Rosen, C; Steyer, J-P; Vanrolleghem, P A
2010-01-01
The COST/IWA Benchmark Simulation Model No 1 (BSM1) has been available for almost a decade. Its primary purpose has been to create a platform for control strategy benchmarking of activated sludge processes. The fact that the research work related to the benchmark simulation models has resulted in more than 300 publications worldwide demonstrates the interest in and need of such tools within the research community. Recent efforts within the IWA Task Group on "Benchmarking of control strategies for WWTPs" have focused on an extension of the benchmark simulation model. This extension aims at facilitating control strategy development and performance evaluation at a plant-wide level and, consequently, includes both pretreatment of wastewater as well as the processes describing sludge treatment. The motivation for the extension is the increasing interest and need to operate and control wastewater treatment systems not only at an individual process level but also on a plant-wide basis. To facilitate the changes, the evaluation period has been extended to one year. A prolonged evaluation period allows for long-term control strategies to be assessed and enables the use of control handles that cannot be evaluated in a realistic fashion in the one week BSM1 evaluation period. In this paper, the finalised plant layout is summarised and, as was done for BSM1, a default control strategy is proposed. A demonstration of how BSM2 can be used to evaluate control strategies is also given.
Vreck, D; Gernaey, K V; Rosen, C; Jeppsson, U
2006-01-01
In this paper, implementation of the Benchmark Simulation Model No 2 (BSM2) within Matlab-Simulink is presented. The BSM2 is developed for plant-wide WWTP control strategy evaluation on a long-term basis. It consists of a pre-treatment process, an activated sludge process and sludge treatment processes. Extended evaluation criteria are proposed for plant-wide control strategy assessment. Default open-loop and closed-loop strategies are also proposed to be used as references with which to compare other control strategies. Simulations indicate that the BSM2 is an appropriate tool for plant-wide control strategy evaluation.
Speeding up the Consensus Clustering methodology for microarray data analysis
2011-01-01
Background The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up. Results Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus. That is, the closely related algorithm FC (Fast Consensus) that would have the same precision as Consensus with a substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that summarize key features of microarray applications, such as cancer studies, gene expression with up and down patterns, and a full spectrum of dimensionality up to over a thousand. Based on their outcome, compared with previous benchmarking results available in the literature, FC turns out to be among the fastest internal validation methods, while retaining the same outstanding precision of Consensus. 
Moreover, it also provides a consensus matrix that can be used as a dissimilarity matrix, guaranteeing the same performance as the corresponding matrix produced by Consensus. We have also experimented with the use of Consensus and FC in conjunction with NMF (Nonnegative Matrix Factorization), in order to identify the correct number of clusters in a dataset. Although NMF is an increasingly popular technique for biological data mining, our results are somewhat disappointing and complement quite well the state of the art about NMF, shedding further light on its merits and limitations. Conclusions In summary, FC with a parameter setting that makes it robust with respect to small and medium-sized datasets, i.e., number of items to cluster in the hundreds and number of conditions up to a thousand, seems to be the internal validation measure of choice. Moreover, the technique we have developed here can be used in other contexts, in particular for the speed-up of stability-based validation measures. PMID:21235792
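The consensus matrix mentioned in the abstract above can be illustrated with a toy sketch. It assumes the generic resampling scheme: cluster many random subsamples and record, for each pair of items, how often they land in the same cluster when both are sampled. The 1-D 2-means base clusterer, the function names and the parameter defaults are all hypothetical choices for illustration. This is neither the authors' Consensus implementation nor their FC speed-up, and it omits the step that predicts the number of clusters.

```python
import random


def two_means_1d(values, iters=20):
    """Tiny 1-D 2-means used as the base clusterer; returns a 0/1 label per value."""
    lo, hi = min(values), max(values)  # initial centres: the two extremes
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if group0:
            lo = sum(group0) / len(group0)
        if group1:
            hi = sum(group1) / len(group1)
    return labels


def consensus_matrix(values, resamples=50, fraction=0.8, seed=0):
    """Fraction of co-sampled runs in which each pair of items was co-clustered."""
    rng = random.Random(seed)
    n = len(values)
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    sampled = [[0] * n for _ in range(n)]   # times i and j were both sampled
    size = max(2, round(fraction * n))
    for _ in range(resamples):
        idx = rng.sample(range(n), size)
        labels = two_means_1d([values[i] for i in idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]
```

For two well-separated groups of points the entries are 1 within a group and 0 across groups. Using one minus each entry gives the dissimilarity reading described in the abstract.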
A shortest-path graph kernel for estimating gene product semantic similarity.
Alvarez, Marco A; Qi, Xiaojun; Yan, Changhui
2011-07-29
Existing methods for calculating semantic similarity between gene products using the Gene Ontology (GO) often rely on external resources, which are not part of the ontology. Consequently, changes in these external resources like biased term distribution caused by shifting of hot research topics, will affect the calculation of semantic similarity. One way to avoid this problem is to use semantic methods that are "intrinsic" to the ontology, i.e. independent of external knowledge. We present a shortest-path graph kernel (spgk) method that relies exclusively on the GO and its structure. In spgk, a gene product is represented by an induced subgraph of the GO, which consists of all the GO terms annotating it. Then a shortest-path graph kernel is used to compute the similarity between two graphs. In a comprehensive evaluation using a benchmark dataset, spgk compares favorably with other methods that depend on external resources. Compared with simUI, a method that is also intrinsic to GO, spgk achieves slightly better results on the benchmark dataset. Statistical tests show that the improvement is significant when the resolution and EC similarity correlation coefficient are used to measure the performance, but is insignificant when the Pfam similarity correlation coefficient is used. Spgk uses a graph kernel method in polynomial time to exploit the structure of the GO to calculate semantic similarity between gene products. It provides an alternative to both methods that use external resources and "intrinsic" methods with comparable performance.
Evaluating virtual hosted desktops for graphics-intensive astronomy
NASA Astrophysics Data System (ADS)
Meade, B. F.; Fluke, C. J.
2018-04-01
Visualisation of data is critical to understanding astronomical phenomena. Today, many instruments produce datasets that are too big to be downloaded to a local computer, yet many of the visualisation tools used by astronomers are deployed only on desktop computers. Cloud computing is increasingly used to provide a computation and simulation platform in astronomy, but it also offers great potential as a visualisation platform. Virtual hosted desktops, with graphics processing unit (GPU) acceleration, allow interactive, graphics-intensive desktop applications to operate co-located with astronomy datasets stored in remote data centres. By combining benchmarking and user experience testing, with a cohort of 20 astronomers, we investigate the viability of replacing physical desktop computers with virtual hosted desktops. In our work, we compare two Apple MacBook computers (one old and one new, representing hardware at opposite ends of the useful lifetime) with two virtual hosted desktops: one commercial (Amazon Web Services) and one in a private research cloud (the Australian NeCTAR Research Cloud). For two-dimensional image-based tasks and graphics-intensive three-dimensional operations - typical of astronomy visualisation workflows - we found that benchmarks do not necessarily provide the best indication of performance. When compared to typical laptop computers, virtual hosted desktops can provide a better user experience, even with lower performing graphics cards. We also found that virtual hosted desktops are equally simple to use, provide greater flexibility in choice of configuration, and may actually be a more cost-effective option for typical usage profiles.
Wang, Huilin; Wang, Mingjun; Tan, Hao; Li, Yuan; Zhang, Ziding; Song, Jiangning
2014-01-01
X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed 'PredPPCrys' using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. 
In addition, the predicted crystallization targets of currently non-crystallizable proteins were provided as compendium data, which are anticipated to facilitate target selection and design for the worldwide structural genomics consortium. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys.
Functional evaluation of out-of-the-box text-mining tools for data-mining tasks
Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H
2015-01-01
Objective The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug–drug interactions, and learning used-to-treat relationships between drugs and indications. Materials We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. Results There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. Conclusions For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice. PMID:25336595
DeepPap: Deep Convolutional Networks for Cervical Cell Classification.
Zhang, Ling; Le Lu; Nogues, Isabella; Summers, Ronald M; Liu, Shaoxiong; Yao, Jianhua
2017-11-01
Automation-assisted cervical screening via Pap smear or liquid-based cytology (LBC) is a highly effective cell imaging based cancer detection tool, where cells are partitioned into "abnormal" and "normal" categories. However, the success of most traditional classification methods relies on the presence of accurate cell segmentations. Despite sixty years of research in this field, accurate segmentation remains a challenge in the presence of cell clusters and pathologies. Moreover, previous classification methods are only built upon the extraction of hand-crafted features, such as morphology and texture. This paper addresses these limitations by proposing a method to directly classify cervical cells, without prior segmentation, based on deep features, using convolutional neural networks (ConvNets). First, the ConvNet is pretrained on a natural image dataset. It is subsequently fine-tuned on a cervical cell dataset consisting of adaptively resampled image patches coarsely centered on the nuclei. In the testing phase, aggregation is used to average the prediction scores of a similar set of image patches. The proposed method is evaluated on both Pap smear and LBC datasets. Results show that our method outperforms previous algorithms in classification accuracy (98.3%), area under the curve (0.99) values, and especially specificity (98.3%), when applied to the Herlev benchmark Pap smear dataset and evaluated using five-fold cross validation. Similar superior performances are also achieved on the HEMLBC (H&E stained manual LBC) dataset. Our method is promising for the development of automation-assisted reading systems in primary cervical screening.
An Active Patch Model for Real World Texture and Appearance Classification
Mao, Junhua; Zhu, Jun; Yuille, Alan L.
2014-01-01
This paper addresses the task of natural texture and appearance classification. Our goal is to develop a simple and intuitive method that performs at state of the art on datasets ranging from homogeneous texture (e.g., material texture), to less homogeneous texture (e.g., the fur of animals), and to inhomogeneous texture (the appearance patterns of vehicles). Our method uses a bag-of-words model where the features are based on a dictionary of active patches. Active patches are raw intensity patches which can undergo spatial transformations (e.g., rotation and scaling) and adjust themselves to best match the image regions. The dictionary of active patches is required to be compact and representative, in the sense that we can use it to approximately reconstruct the images that we want to classify. We propose a probabilistic model to quantify the quality of image reconstruction and design a greedy learning algorithm to obtain the dictionary. We classify images using the occurrence frequency of the active patches. Feature extraction is fast (about 100 ms per image) using the GPU. The experimental results show that our method improves the state of the art on a challenging material texture benchmark dataset (KTH-TIPS2). To test our method on less homogeneous or inhomogeneous images, we construct two new datasets consisting of appearance image patches of animals and vehicles cropped from the PASCAL VOC dataset. Our method outperforms competing methods on these datasets. PMID:25531013
Zheng, Guangyong; Xu, Yaochen; Zhang, Xiujun; Liu, Zhi-Ping; Wang, Zhuo; Chen, Luonan; Zhu, Xin-Guang
2016-12-23
A gene regulatory network (GRN) represents interactions of genes inside a cell or tissue, in which vertices and edges stand for genes and their regulatory interactions, respectively. Reconstruction of gene regulatory networks, in particular genome-scale networks, is essential for comparative exploration of different species and mechanistic investigation of biological processes. Currently, most network inference methods are computationally intensive: they are usually effective for small-scale tasks (e.g., networks with a few hundred genes) but have difficulty constructing GRNs at genome scale. Here, we present a software package for gene regulatory network reconstruction at a genomic level, in which gene interaction is measured by conditional mutual information using a parallel computing framework (hence the package name, CMIP). The package is a greatly improved implementation of our previous PCA-CMI algorithm. In CMIP, we provide not only an automatic threshold determination method but also an effective parallel computing framework for network inference. Performance tests on benchmark datasets show that the accuracy of CMIP is comparable to that of most current network inference methods. Moreover, running tests on synthetic datasets demonstrate that CMIP can handle large datasets, especially genome-wide datasets, within an acceptable time period. In addition, successful application to a real genomic dataset confirms the practical applicability of the package. This new software package provides a powerful tool for genomic network reconstruction to the biological community. The software can be accessed at http://www.picb.ac.cn/CMIP/ .
Li, Liqi; Cui, Xiang; Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi
2014-01-01
Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. Among most homology-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets.
The Vulnerability Framework Integrates Various Models of Generating Surplus Revenue
ERIC Educational Resources Information Center
Maniaci, Vincent
2004-01-01
Budgets operationalize the strategic planning process, and institutions must have surplus revenue to be able to cope with future operations. There are three approaches to generate surplus revenue: increased revenue, decreased cost, and reallocation of resources. Extending their earlier work, where they established strategic benchmarks for annual…
Modeling conservation practices in APEX: From the field to the watershed
USDA-ARS's Scientific Manuscript database
The evaluation of USDA conservation programs is required as part of the Conservation Effects Assessment Project (CEAP). The Agricultural Policy/Environmental eXtender (APEX) model was applied to the St. Joseph River Watershed, one of CEAP’s benchmark watersheds. Using a previously calibrated and val...
2017-01-01
The authors use four criteria to examine a novel community detection algorithm: (a) effectiveness in terms of producing high values of normalized mutual information (NMI) and modularity, using well-known social networks for testing; (b) examination, meaning the ability to examine mitigating resolution limit problems using NMI values and synthetic networks; (c) correctness, meaning the ability to identify useful community structure results in terms of NMI values and Lancichinetti-Fortunato-Radicchi (LFR) benchmark networks; and (d) scalability, or the ability to produce comparable modularity values with fast execution times when working with large-scale real-world networks. In addition to describing a simple hierarchical arc-merging (HAM) algorithm that uses network topology information, we introduce rule-based arc-merging strategies for identifying community structures. Five well-studied social network datasets and eight sets of LFR benchmark networks were employed to validate the correctness of a ground-truth community, eight large-scale real-world complex networks were used to measure its efficiency, and two synthetic networks were used to determine its susceptibility to two resolution limit problems. Our experimental results indicate that the proposed HAM algorithm exhibited satisfactory performance efficiency, and that HAM-identified and ground-truth communities were comparable in terms of social and LFR benchmark networks, while mitigating resolution limit problems. PMID:29121100
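Normalized mutual information (NMI), the main score in the abstract above, can be computed from two community assignments as in the sketch below. It assumes the normalisation 2·I(A;B) / (H(A) + H(B)), which is common in the community detection literature but is not stated in the abstract. The function name is hypothetical.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """NMI of two partitions given as equally long lists of community labels."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Mutual information from the joint and marginal label frequencies.
    mutual_info = 0.0
    for (a, b), n_ab in joint.items():
        mutual_info += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))

    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if entropy_a == 0.0 and entropy_b == 0.0:
        return 1.0  # both partitions are a single community: treated as identical
    return 2.0 * mutual_info / (entropy_a + entropy_b)
```

The score depends only on how nodes are grouped, not on the label values, so relabelling the communities of an identical partition still gives 1. Two independent partitions give 0.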
Mu, John C.; Tootoonchi Afshar, Pegah; Mohiyuddin, Marghoob; Chen, Xi; Li, Jian; Bani Asadi, Narges; Gerstein, Mark B.; Wong, Wing H.; Lam, Hugo Y. K.
2015-01-01
A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools. PMID:26412485
Brandenburg, Marcus; Hahn, Gerd J
2018-06-01
Process industries typically involve complex manufacturing operations and thus require adequate decision support for aggregate production planning (APP). The need for powerful and efficient approaches to solve complex APP problems persists. Problem-specific solution approaches are advantageous compared to standardized approaches that are designed to provide basic decision support for a broad range of planning problems but inadequate to optimize under consideration of specific settings. This in turn calls for methods to compare different approaches regarding their computational performance and solution quality. In this paper, we present a benchmarking problem for APP in the chemical process industry. The presented problem focuses on (i) sustainable operations planning involving multiple alternative production modes/routings with specific production-related carbon emission and the social dimension of varying operating rates and (ii) integrated campaign planning with production mix/volume on the operational level. The mutual trade-offs between economic, environmental and social factors can be considered as externalized factors (production-related carbon emission and overtime working hours) as well as internalized ones (resulting costs). We provide data for all problem parameters in addition to a detailed verbal problem statement. We refer to Hahn and Brandenburg [1] for a first numerical analysis based on and for future research perspectives arising from this benchmarking problem.
Detecting text in natural scenes with multi-level MSER and SWT
NASA Astrophysics Data System (ADS)
Lu, Tongwei; Liu, Renjun
2018-04-01
Character detection in natural scenes is susceptible to factors such as complex backgrounds, variable viewing angles and diverse forms of language, which lead to poor detection results. To address these problems, a new text detection method is proposed, consisting of two main stages: candidate region extraction and text region detection. In the first stage, the method applies multiple scale transformations to the original image and multiple thresholds of maximally stable extremal regions (MSER) so that character regions are detected comprehensively. In the second stage, SWT maps are obtained by applying the stroke width transform (SWT) algorithm to the candidate regions, and cascaded classifiers are then used to remove non-text regions. The proposed method was evaluated on the standard ICDAR2011 benchmark dataset and on a dataset that we constructed ourselves. The experimental results show that the proposed method greatly improves on other text detection methods.
A Noise-Filtered Under-Sampling Scheme for Imbalanced Classification.
Kang, Qi; Chen, XiaoShuang; Li, SiSi; Zhou, MengChu
2017-12-01
Under-sampling is a popular data preprocessing method in dealing with class imbalance problems, with the purposes of balancing datasets to achieve a high classification rate and avoiding the bias toward majority class examples. It always uses full minority data in a training dataset. However, some noisy minority examples may reduce the performance of classifiers. In this paper, a new under-sampling scheme is proposed by incorporating a noise filter before executing resampling. In order to verify the efficiency, this scheme is implemented based on four popular under-sampling methods, i.e., Undersampling + Adaboost, RUSBoost, UnderBagging, and EasyEnsemble through benchmarks and significance analysis. Furthermore, this paper also summarizes the relationship between algorithm performance and imbalanced ratio. Experimental results indicate that the proposed scheme can improve the original undersampling-based methods with significance in terms of three popular metrics for imbalanced classification, i.e., the area under the curve, F-measure, and G-mean.
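As a rough illustration of the setting in the abstract above, the sketch below shows plain random under-sampling of the majority class. It also shows two metrics commonly used for imbalanced classification, F-measure and G-mean, which are assumed here to be the ones the abstract refers to. The function names are hypothetical, and the noise filter that the paper adds before resampling is not reproduced.

```python
import math
import random


def f_measure_and_g_mean(y_true, y_pred, positive=1):
    """Return (F1, G-mean), treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # G-mean: geometric mean of the per-class recalls.
    g_mean = math.sqrt(recall * specificity)
    return f1, g_mean


def random_under_sample(samples, labels, minority=1, seed=0):
    """Keep every minority example and an equally sized random majority subset."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    kept = rng.sample(majority_idx, min(len(minority_idx), len(majority_idx)))
    keep = sorted(minority_idx + kept)
    return [samples[i] for i in keep], [labels[i] for i in keep]
```

Because every minority example is kept, any mislabelled minority example survives resampling. That is the weakness a noise filter applied beforehand is meant to address.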
Ravindran, Sindhu; Jambek, Asral Bahari; Muthusamy, Hariharan; Neoh, Siew-Chin
2015-01-01
A novel clinical decision support system is proposed in this paper for evaluating fetal well-being from the cardiotocogram (CTG) dataset through an Improved Adaptive Genetic Algorithm (IAGA) and Extreme Learning Machine (ELM). IAGA employs a new scaling technique (called sigma scaling) to avoid premature convergence and applies adaptive crossover and mutation techniques with masking concepts to enhance population diversity. Also, this search algorithm utilizes three different fitness functions (two single-objective fitness functions and one multi-objective fitness function) to assess its performance. The classification results show that a promising classification accuracy of 94% is obtained with an optimal feature subset using IAGA. Also, the classification results are compared with those of other feature reduction techniques to substantiate its exhaustive search towards the global optimum. In addition, five other benchmark datasets are used to gauge the strength of the proposed IAGA algorithm.
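The abstract does not spell out how sigma scaling is computed. A minimal sketch follows, assuming the common textbook formulation in which a raw fitness f is mapped to 1 + (f - mean) / (2 * sigma), with a small floor so that weak individuals keep a nonzero selection chance. The function name, the floor value and the example fitnesses are assumptions for illustration, not details from the paper.

```python
import statistics

def sigma_scale(fitnesses, floor=0.1):
    """Rescale raw fitness values relative to the population spread.

    Assumed formulation: 1 + (f - mean) / (2 * sigma), clipped below at `floor`.
    """
    mean = statistics.fmean(fitnesses)
    sigma = statistics.pstdev(fitnesses)
    if sigma == 0:
        # Every individual is equally fit, so all get the same expected value.
        return [1.0] * len(fitnesses)
    return [max(floor, 1.0 + (f - mean) / (2.0 * sigma)) for f in fitnesses]

scaled = sigma_scale([1.0, 3.0])
print(scaled)
```

Because the scaled values depend on the standard deviation, selection pressure stays roughly constant as the population converges, which is the usual argument for this kind of scaling against premature convergence.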
NegBio: a high-performance tool for negation and uncertainty detection in radiology reports.
Peng, Yifan; Wang, Xiaosong; Lu, Le; Bagheri, Mohammadhadi; Summers, Ronald; Lu, Zhiyong
2018-01-01
Negative and uncertain medical findings are frequent in radiology reports, but discriminating them from positive findings remains challenging for information extraction. Here, we propose a new algorithm, NegBio, to detect negative and uncertain findings in radiology reports. Unlike previous rule-based methods, NegBio utilizes patterns on universal dependencies to identify the scope of triggers that are indicative of negation or uncertainty. We evaluated NegBio on four datasets, including two public benchmarking corpora of radiology reports, a new radiology corpus that we annotated for this work, and a public corpus of general clinical texts. Evaluation on these datasets demonstrates that NegBio is highly accurate for detecting negative and uncertain findings and compares favorably to a widely-used state-of-the-art system NegEx (an average of 9.5% improvement in precision and 5.1% in F1-score). https://github.com/ncbi-nlp/NegBio.
RBind: computational network method to predict RNA binding sites.
Wang, Kaili; Jian, Yiren; Wang, Huiwen; Zeng, Chen; Zhao, Yunjie
2018-04-26
Non-coding RNA molecules play essential roles by interacting with other molecules to perform various biological functions. However, it is difficult to determine RNA structures due to their flexibility. At present, the number of experimentally solved RNA-ligand and RNA-protein structures is still insufficient. Therefore, binding site prediction for non-coding RNA is required to understand their functions. Current RNA binding site prediction algorithms produce many false positive nucleotides that are distant from the binding sites. Here, we present a network approach, RBind, to predict RNA binding sites. We benchmarked RBind on RNA-ligand and RNA-protein datasets. The average accuracy of 0.82 in RNA-ligand and 0.63 in RNA-protein testing showed that this network strategy has a reliable accuracy for binding site prediction. The codes and datasets are available at https://zhaolab.com.cn/RBind. Contact: yjzhaowh@mail.ccnu.edu.cn. Supplementary data are available at Bioinformatics online.
On macromolecular refinement at subatomic resolution with interatomic scatterers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Afonine, Pavel V.; Grosse-Kunstleve, Ralf W.; Adams, Paul D.
2007-11-09
A study of the accurate electron density distribution in molecular crystals at subatomic resolution, better than ~1.0 Å, requires more detailed models than those based on independent spherical atoms. A tool conventionally used in small-molecule crystallography is the multipolar model. Even at upper resolution limits of 0.8-1.0 Å, the number of experimental data is insufficient for the full multipolar model refinement. As an alternative, a simpler model composed of conventional independent spherical atoms augmented by additional scatterers to model bonding effects has been proposed. Refinement of these mixed models for several benchmark datasets gave results comparable in quality with results of multipolar refinement and superior to those for conventional models. Applications to several datasets of both small- and macro-molecules are shown. These refinements were performed using the general-purpose macromolecular refinement module phenix.refine of the PHENIX package.
MetaQUAST: evaluation of metagenome assemblies.
Mikheenko, Alla; Saveliev, Vladislav; Gurevich, Alexey
2016-04-01
During the past years we have witnessed the rapid development of new metagenome assembly methods. Although there are many benchmark utilities designed for single-genome assemblies, there is no well-recognized evaluation and comparison tool for metagenomic-specific analogues. In this article, we present MetaQUAST, a modification of QUAST, the state-of-the-art tool for genome assembly evaluation based on alignment of contigs to a reference. MetaQUAST addresses such features of metagenome datasets as (i) unknown species content by detecting and downloading reference sequences, (ii) huge diversity by giving comprehensive reports for multiple genomes and (iii) presence of closely related species by detecting chimeric contigs. We demonstrate MetaQUAST performance by comparing several leading assemblers on one simulated and two real datasets. Availability: http://bioinf.spbau.ru/metaquast. Contact: aleksey.gurevich@spbu.ru. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
Buttenfield, B.P.; Stanislawski, L.V.; Brewer, C.A.
2011-01-01
This paper reports on generalization and data modeling to create reduced scale versions of the National Hydrographic Dataset (NHD) for dissemination through The National Map, the primary data delivery portal for USGS. Our approach distinguishes local differences in physiographic factors, to demonstrate that knowledge about varying terrain (mountainous, hilly or flat) and varying climate (dry or humid) can support decisions about algorithms, parameters, and processing sequences to create generalized, smaller scale data versions which preserve distinct hydrographic patterns in these regions. We work with multiple subbasins of the NHD that provide a range of terrain and climate characteristics. Specifically tailored generalization sequences are used to create simplified versions of the high resolution data, which was compiled for 1:24,000 scale mapping. Results are evaluated cartographically and metrically against a medium resolution benchmark version compiled for 1:100,000, developing coefficients of linear and areal correspondence.
Shot boundary detection and label propagation for spatio-temporal video segmentation
NASA Astrophysics Data System (ADS)
Piramanayagam, Sankaranaryanan; Saber, Eli; Cahill, Nathan D.; Messinger, David
2015-02-01
This paper proposes a two stage algorithm for streaming video segmentation. In the first stage, shot boundaries are detected within a window of frames by comparing dissimilarity between 2-D segmentations of each frame. In the second stage, the 2-D segments are propagated across the window of frames in both spatial and temporal direction. The window is moved across the video to find all shot transitions and obtain spatio-temporal segments simultaneously. As opposed to techniques that operate on entire video, the proposed approach consumes significantly less memory and enables segmentation of lengthy videos. We tested our segmentation based shot detection method on the TRECVID 2007 video dataset and compared it with block-based technique. Cut detection results on the TRECVID 2007 dataset indicate that our algorithm has comparable results to the best of the block-based methods. The streaming video segmentation routine also achieves promising results on a challenging video segmentation benchmark database.
Wan, Shibiao; Mak, Man-Wai; Kung, Sun-Yuan
2016-12-02
In the postgenomic era, the number of unreviewed protein sequences is remarkably larger and grows tremendously faster than that of reviewed ones. However, existing methods for protein subchloroplast localization often ignore the information from these unlabeled proteins. This paper proposes a multi-label predictor based on ensemble linear neighborhood propagation (LNP), namely, LNP-Chlo, which leverages hybrid sequence-based feature information from both labeled and unlabeled proteins for predicting localization of both single- and multi-label chloroplast proteins. Experimental results on a stringent benchmark dataset and a novel independent dataset suggest that LNP-Chlo performs at least 6% (absolute) better than state-of-the-art predictors. This paper also demonstrates that ensemble LNP significantly outperforms LNP based on individual features. For readers' convenience, the online Web server LNP-Chlo is freely available at http://bioinfo.eie.polyu.edu.hk/LNPChloServer/ .
Keuleers, Emmanuel; Balota, David A
2015-01-01
This paper introduces and summarizes the special issue on megastudies, crowdsourcing, and large datasets in psycholinguistics. We provide a brief historical overview and show how the papers in this issue have extended the field by compiling new databases and making important theoretical contributions. In addition, we discuss several studies that use text corpora to build distributional semantic models to tackle various interesting problems in psycholinguistics. Finally, as is the case across the papers, we highlight some methodological issues that are brought forth via the analyses of such datasets.
Improving average ranking precision in user searches for biomedical research datasets
Gobeill, Julien; Gaudinat, Arnaud; Vachon, Thérèse; Ruch, Patrick
2017-01-01
Availability of research datasets is a keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, being +22.3% higher than the median infAP of the participants’ best submissions. Overall, it is ranked in the top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed a positive impact on the system’s performance, increasing our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance in different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have a significant impact on the system’s performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data-driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. 
Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results. Database URL: https://biocaddie.org/benchmark-data PMID:29220475
Evaluation of CHO Benchmarks on the Arria 10 FPGA using Intel FPGA SDK for OpenCL
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Zheming; Yoshii, Kazutomo; Finkel, Hal
The OpenCL standard is an open programming model for accelerating algorithms on heterogeneous computing systems. OpenCL extends the C-based programming language for developing portable codes on different platforms such as CPUs, graphics processing units (GPUs), digital signal processors (DSPs) and field programmable gate arrays (FPGAs). The Intel FPGA SDK for OpenCL is a suite of tools that allows developers to abstract away the complex FPGA-based development flow for a high-level software development flow. Users can focus on the design of hardware-accelerated kernel functions in OpenCL and then direct the tools to generate the low-level FPGA implementations. The approach makes FPGA-based development more accessible to software users as the needs for hybrid computing using CPUs and FPGAs are increasing. It can also significantly reduce the hardware development time as users can evaluate different ideas in a high-level language without deep FPGA domain knowledge. Benchmarking of an OpenCL-based framework is an effective way of analyzing the performance of a system by studying the execution of the benchmark applications. CHO is a suite of benchmark applications that provides support for OpenCL [1]. The authors presented CHO as an OpenCL port of the CHStone benchmark. Using the Altera OpenCL (AOCL) compiler to synthesize the benchmark applications, they listed the resource usage and performance of each kernel that can be successfully synthesized by the compiler. In this report, we evaluate the resource usage and performance of the CHO benchmark applications using the Intel FPGA SDK for OpenCL and a Nallatech 385A FPGA board that features an Arria 10 FPGA device. The focus of the report is to have a better understanding of the resource usage and performance of the kernel implementations using Arria-10 FPGA devices compared to Stratix-5 FPGA devices. 
In addition, we also gain knowledge about the limitations of the current compiler when it fails to synthesize a benchmark application.
Skipping Strategy (SS) for Initial Population of Job-Shop Scheduling Problem
NASA Astrophysics Data System (ADS)
Abdolrazzagh-Nezhad, M.; Nababan, E. B.; Sarim, H. M.
2018-03-01
Generating the initial population in the job-shop scheduling problem (JSSP) is an essential step toward obtaining a near-optimal solution. Techniques used to solve JSSP are computationally demanding. A skipping strategy (SS) is employed to build the initial population after the sequence of jobs on machines and the sequence of operations (expressed in Plates-jobs and mPlates-jobs) are determined. The proposed technique is applied to benchmark datasets and the results are compared with those of other initialization techniques. It is shown that the initial population obtained from the SS approach can generate an optimal solution.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Reiche, Helmut Matthias; Vogel, Sven C.
New in situ data for the U-C system are presented, with the goal of improving knowledge of the phase diagram to enable production of new ceramic fuels. The non-quenchable, cubic δ-phase, which in turn is fundamental to computational methods, was identified. Rich datasets of the formation synthesis of uranium carbide yield kinetics data which allow the benchmarking of modeling, thermodynamic parameters, etc. The order-disorder transition (carbon sublattice melting) was observed due to equal sensitivity of neutrons to both elements. This dynamic has not been accurately described in some recent simulation-based publications.
NASA Astrophysics Data System (ADS)
Umbarkar, A. J.; Balande, U. T.; Seth, P. D.
2017-06-01
The field of nature-inspired computing and optimization techniques has evolved to solve difficult optimization problems in diverse fields of engineering, science and technology. The Firefly Algorithm (FA) mimics the firefly attraction process to solve optimization problems. In FA, the fireflies are ranked using a sorting algorithm. The original FA was proposed with bubble sort for ranking the fireflies. In this paper, quick sort replaces bubble sort to decrease the time complexity of FA. The dataset used is the set of unconstrained benchmark functions from CEC 2005 [22]. FA using bubble sort and FA using quick sort are compared with respect to best, worst, mean, standard deviation, number of comparisons and execution time. The experimental results show that FA using quick sort requires fewer comparisons but more execution time. Increasing the number of fireflies helps convergence to the optimal solution, whereas, when the dimension is varied, the algorithm performs better at lower dimensions than at higher dimensions.
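To make the comparison-count claim concrete, the sketch below ranks a handful of firefly brightness values with a plain bubble sort and with a simple first-element-pivot quick sort, counting element comparisons in each. Both routines are generic textbook versions written for illustration; they are assumptions and not the implementations used in the paper, and the brightness values are made up.

```python
def bubble_sort(values):
    """Sort ascending; return (sorted_list, number_of_comparisons)."""
    items, comparisons = list(values), 0
    n = len(items)
    for i in range(n - 1):
        for j in range(n - 1 - i):
            comparisons += 1
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items, comparisons

def quick_sort(values):
    """Sort ascending; return (sorted_list, number_of_comparisons)."""
    comparisons = 0

    def sort(items):
        nonlocal comparisons
        if len(items) <= 1:
            return items
        pivot, rest = items[0], items[1:]
        lower, upper = [], []
        for x in rest:
            comparisons += 1  # one comparison against the pivot per element
            (lower if x < pivot else upper).append(x)
        return sort(lower) + [pivot] + sort(upper)

    return sort(list(values)), comparisons

brightness = [0.42, 0.91, 0.13, 0.77, 0.58, 0.29]  # hypothetical firefly intensities
bubble_sorted, bubble_count = bubble_sort(brightness)
quick_sorted, quick_count = quick_sort(brightness)
print(bubble_count, quick_count)
```

Without an early-exit check, this bubble sort always performs n(n-1)/2 comparisons, while the quick sort count depends on the input order. Fewer comparisons do not guarantee a shorter run time, since recursion and list allocation add overhead of their own, which is consistent with the trade-off the abstract reports.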
Benchmarking Data for the Proposed Signature of Used Fuel Casks
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rauch, Eric Benton
2016-09-23
A set of benchmarking measurements to test facets of the proposed extended storage signature was conducted on May 17, 2016. The measurements were designed to test the overall concept of how the proposed signature can be used to identify a used fuel cask based only on the distribution of neutron sources within the cask. To simulate the distribution, 4 Cf-252 sources were chosen and arranged on a 3x3 grid in 3 different patterns, and raw neutron total counts were taken at 6 locations around the grid. This is a very simplified test of the typical geometry studied previously in simulation with simulated used nuclear fuel.
NetCTLpan: pan-specific MHC class I pathway epitope predictions
Larsen, Mette Voldby; Lundegaard, Claus; Nielsen, Morten
2010-01-01
Reliable predictions of immunogenic peptides are essential in rational vaccine design and can minimize the experimental effort needed to identify epitopes. In this work, we describe a pan-specific major histocompatibility complex (MHC) class I epitope predictor, NetCTLpan. The method integrates predictions of proteasomal cleavage, transporter associated with antigen processing (TAP) transport efficiency, and MHC class I binding affinity into a MHC class I pathway likelihood score and is an improved and extended version of NetCTL. The NetCTLpan method performs predictions for all MHC class I molecules with known protein sequence and allows predictions for 8-, 9-, 10-, and 11-mer peptides. In order to meet the need for a low false positive rate, the method is optimized to achieve high specificity. The method was trained and validated on large datasets of experimentally identified MHC class I ligands and cytotoxic T lymphocyte (CTL) epitopes. It has been reported that MHC molecules are differentially dependent on TAP transport and proteasomal cleavage. Here, we did not find any consistent signs of such MHC dependencies, and the NetCTLpan method is implemented with fixed weights for proteasomal cleavage and TAP transport for all MHC molecules. The predictive performance of the NetCTLpan method was shown to outperform other state-of-the-art CTL epitope prediction methods. Our results further confirm the importance of using full-type human leukocyte antigen restriction information when identifying MHC class I epitopes. Using the NetCTLpan method, the experimental effort to identify 90% of new epitopes can be reduced by 15% and 40%, respectively, when compared to the NetMHCpan and NetCTL methods. The method and benchmark datasets are available at http://www.cbs.dtu.dk/services/NetCTLpan/. Electronic supplementary material The online version of this article (doi:10.1007/s00251-010-0441-4) contains supplementary material, which is available to authorized users. PMID:20379710
DataMed - an open source discovery index for finding biomedical datasets.
Chen, Xiaoling; Gururaj, Anupama E; Ozyurt, Burak; Liu, Ruiling; Soysal, Ergin; Cohen, Trevor; Tiryaki, Firat; Li, Yueling; Zong, Nansu; Jiang, Min; Rogith, Deevakar; Salimi, Mandana; Kim, Hyeon-Eui; Rocca-Serra, Philippe; Gonzalez-Beltran, Alejandra; Farcas, Claudiu; Johnson, Todd; Margolis, Ron; Alter, George; Sansone, Susanna-Assunta; Fore, Ian M; Ohno-Machado, Lucila; Grethe, Jeffrey S; Xu, Hua
2018-01-13
Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain. DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health-funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine. Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and that core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the fraction of relevant results in the top 10 search results) of 0.6022, by implementing advanced natural language processing and terminology services. Currently, we have made the DataMed system publicly available as an open source package for the biomedical community. © The Author 2018. Published by Oxford University Press on behalf of the American Medical Informatics Association.
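For readers unfamiliar with the P@10 figure quoted above, a minimal sketch of precision at k follows. It assumes the standard definition (relevant results among the top k, divided by k); the function name and the dataset identifiers are hypothetical and unrelated to the DataMed code base.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked results that are judged relevant."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k

# Hypothetical ranking returned for one query, and its relevance judgements.
ranking = ["ds7", "ds2", "ds9", "ds4", "ds1"]
relevant = {"ds2", "ds4", "ds8"}
p_at_4 = precision_at_k(ranking, relevant, k=4)
print(p_at_4)
```

A system-level score such as the one reported is then typically the mean of this value over all benchmark queries.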
Deep convolutional neural networks for pan-specific peptide-MHC class I binding prediction.
Han, Youngmahn; Kim, Dongsup
2017-12-28
Computational scanning of peptide candidates that bind to a specific major histocompatibility complex (MHC) can speed up the peptide-based vaccine development process and therefore various methods are being actively developed. Recently, machine-learning-based methods have generated successful results by training on large amounts of experimental data. However, many machine-learning-based methods are generally less sensitive in recognizing locally-clustered interactions, which can synergistically stabilize peptide binding. The deep convolutional neural network (DCNN) is a deep learning method inspired by the visual recognition process of the animal brain and it is known to be able to capture meaningful local patterns from 2D images. Once the peptide-MHC interactions are encoded into image-like array (ILA) data, a DCNN can be employed to build a predictive model for peptide-MHC binding prediction. In this study, we demonstrated that a DCNN is able to not only reliably predict peptide-MHC binding, but also sensitively detect locally-clustered interactions. Nonapeptide-HLA-A and -B binding data were encoded into ILA data. A DCNN, as a pan-specific prediction model, was trained on the ILA data. The DCNN showed higher performance than other prediction tools for the latest benchmark datasets, which consist of 43 datasets for 15 HLA-A alleles and 25 datasets for 10 HLA-B alleles. In particular, the DCNN outperformed other tools for alleles belonging to the HLA-A3 supertype. The F1 scores of the DCNN were 0.86, 0.94, and 0.67 for HLA-A*31:01, HLA-A*03:01, and HLA-A*68:01 alleles, respectively, which were significantly higher than those of other tools. We found that the DCNN was able to recognize locally-clustered interactions that could synergistically stabilize peptide binding. We developed ConvMHC, a web server to provide user-friendly web interfaces for peptide-MHC class I binding predictions using the DCNN. The ConvMHC web server is accessible via http://jumong.kaist.ac.kr:8080/convmhc . 
We developed a novel method for peptide-HLA-I binding predictions using DCNN trained on ILA data that encode peptide binding data and demonstrated the reliable performance of the DCNN in nonapeptide binding predictions through the independent evaluation on the latest IEDB benchmark datasets. Our approaches can be applied to characterize locally-clustered patterns in molecular interactions, such as protein/DNA, protein/RNA, and drug/protein interactions.
Geraghty, John P; Grogan, Garry; Ebert, Martin A
2013-04-30
This study investigates the variation in segmentation of several pelvic anatomical structures on computed tomography (CT) between multiple observers and a commercial automatic segmentation method, in the context of quality assurance and evaluation during a multicentre clinical trial. CT scans of two prostate cancer patients ('benchmarking cases'), one high risk (HR) and one intermediate risk (IR), were sent to multiple radiotherapy centres for segmentation of prostate, rectum and bladder structures according to the TROG 03.04 "RADAR" trial protocol definitions. The same structures were automatically segmented using iPlan software for the same two patients, allowing structures defined by automatic segmentation to be quantitatively compared with those defined by multiple observers. A sample of twenty trial patient datasets were also used to automatically generate anatomical structures for quantitative comparison with structures defined by individual observers for the same datasets. There was considerable agreement amongst all observers and automatic segmentation of the benchmarking cases for bladder (mean spatial variations < 0.4 cm across the majority of image slices). Although there was some variation in interpretation of the superior-inferior (cranio-caudal) extent of rectum, human-observer contours were typically within a mean 0.6 cm of automatically-defined contours. Prostate structures were more consistent for the HR case than the IR case with all human observers segmenting a prostate with considerably more volume (mean +113.3%) than that automatically segmented. Similar results were seen across the twenty sample datasets, with disagreement between iPlan and observers dominant at the prostatic apex and superior part of the rectum, which is consistent with observations made during quality assurance reviews during the trial. This study has demonstrated quantitative analysis for comparison of multi-observer segmentation studies. 
For automatic segmentation algorithms based on image-registration as in iPlan, it is apparent that agreement between observer and automatic segmentation will be a function of patient-specific image characteristics, particularly for anatomy with poor contrast definition. For this reason, it is suggested that automatic registration based on transformation of a single reference dataset adds a significant systematic bias to the resulting volumes and their use in the context of a multicentre trial should be carefully considered.
ERIC Educational Resources Information Center
Orrill, Chandra Hawley; Brown, Rachael Eriksen; Burke, James P.; Millett, John; Nagar, Gili Gal; Park, Jinsook; Weiland, Travis
2017-01-01
In this study we extend our prior exploration focused on the extent to which middle school teachers appropriately identified proportional situations and whether there were relationships between attributes of the teachers and their ability to identify proportional situations. For this study, we analyzed both a larger dataset (n = 32) and two…
Large-scale Cross-modality Search via Collective Matrix Factorization Hashing.
Ding, Guiguang; Guo, Yuchen; Zhou, Jile; Gao, Yue
2016-09-08
By transforming data into binary representation, i.e., Hashing, we can perform high-speed search with low storage cost, and thus Hashing has collected increasing research interest in the recent years. Recently, how to generate Hashcode for multimodal data (e.g., images with textual tags, documents with photos, etc) for large-scale cross-modality search (e.g., searching semantically related images in database for a document query) is an important research issue because of the fast growth of multimodal data in the Web. To address this issue, a novel framework for multimodal Hashing is proposed, termed as Collective Matrix Factorization Hashing (CMFH). The key idea of CMFH is to learn unified Hashcodes for different modalities of one multimodal instance in the shared latent semantic space in which different modalities can be effectively connected. Therefore, accurate cross-modality search is supported. Based on the general framework, we extend it in the unsupervised scenario where it tries to preserve the Euclidean structure, and in the supervised scenario where it fully exploits the label information of data. The corresponding theoretical analysis and the optimization algorithms are given. We conducted comprehensive experiments on three benchmark datasets for cross-modality search. The experimental results demonstrate that CMFH can significantly outperform several state-of-the-art cross-modality Hashing methods, which validates the effectiveness of the proposed CMFH.
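The search step that makes hashing attractive can be sketched briefly: once items from both modalities share one binary code space, retrieval reduces to ranking by Hamming distance. The example below assumes codes are stored as Python integers and uses hypothetical 4-bit codes and item names; it illustrates the lookup only and does not reproduce how CMFH learns the codes.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")

def search(query_code, database, top_k=3):
    """Return the names of the top_k database items closest to the query code."""
    ranked = sorted(database.items(), key=lambda item: hamming(query_code, item[1]))
    return [name for name, _ in ranked[:top_k]]

# Hypothetical image codes; the query code would come from a text document.
image_codes = {"img1": 0b1010, "img2": 0b0101, "img3": 0b1011}
nearest = search(0b1010, image_codes, top_k=2)
print(nearest)
```

XOR followed by a bit count is cheap and the codes are compact, which is the source of the high-speed, low-storage property mentioned in the abstract.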
Multi-views Fusion CNN for Left Ventricular Volumes Estimation on Cardiac MR Images.
Luo, Gongning; Dong, Suyu; Wang, Kuanquan; Zuo, Wangmeng; Cao, Shaodong; Zhang, Henggui
2017-10-13
Left ventricular (LV) volume estimation is a critical procedure for cardiac disease diagnosis. The objective of this paper is to address the direct LV volume prediction task. In this paper, we propose a direct volume prediction method based on end-to-end deep convolutional neural networks (CNN). We study the end-to-end LV volume prediction method in terms of data preprocessing, network structure, and multi-view fusion strategy. The main contributions of this paper are the following. First, we propose a new data preprocessing method for cardiac magnetic resonance (CMR). Second, we propose a new network structure for end-to-end LV volume estimation. Third, we explore the representational capacity of different slices, and propose a fusion strategy to improve the prediction accuracy. The evaluation results show that the proposed method outperforms other state-of-the-art LV volume estimation methods on the openly accessible benchmark datasets. The clinical indexes derived from the predicted volumes agree well with the ground truth (EDV: R=0.974, RMSE=9.6ml; ESV: R=0.976, RMSE=7.1ml; EF: R=0.828, RMSE=4.71%). Experimental results show that the proposed method has high accuracy and efficiency on the LV volume prediction task. The proposed method not only has application potential for cardiac disease screening on large-scale CMR data, but also can be extended to other medical image research fields.
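The ejection fraction (EF) reported alongside EDV and ESV is derived from those two volumes. A minimal sketch follows, assuming the standard clinical definition EF = (EDV - ESV) / EDV expressed as a percentage; the function name and the example volumes are hypothetical.

```python
def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent from end-diastolic and end-systolic volumes (ml)."""
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv_ml - esv_ml) / edv_ml

ef = ejection_fraction(edv_ml=120.0, esv_ml=50.0)
print(round(ef, 2))
```

Because EF is a ratio of two predicted volumes, errors in EDV and ESV can compound, which may be one reason the EF correlation in the abstract is lower than those of the individual volumes.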
Autoreject: Automated artifact rejection for MEG and EEG data.
Jas, Mainak; Engemann, Denis A; Bekhti, Yousra; Raimondo, Federico; Gramfort, Alexandre
2017-10-01
We present an automated algorithm for unified rejection and repair of bad trials in magnetoencephalography (MEG) and electroencephalography (EEG) signals. Our method capitalizes on cross-validation in conjunction with a robust evaluation metric to estimate the optimal peak-to-peak threshold - a quantity commonly used for identifying bad trials in M/EEG. This approach is then extended to a more sophisticated algorithm which estimates this threshold for each sensor yielding trial-wise bad sensors. Depending on the number of bad sensors, the trial is then repaired by interpolation or by excluding it from subsequent analysis. All steps of the algorithm are fully automated thus lending itself to the name Autoreject. In order to assess the practical significance of the algorithm, we conducted extensive validation and comparisons with state-of-the-art methods on four public datasets containing MEG and EEG recordings from more than 200 subjects. The comparisons include purely qualitative efforts as well as quantitatively benchmarking against human supervised and semi-automated preprocessing pipelines. The algorithm allowed us to automate the preprocessing of MEG data from the Human Connectome Project (HCP) going up to the computation of the evoked responses. The automated nature of our method minimizes the burden of human inspection, hence supporting scalability and reliability demanded by data analysis in modern neuroscience. Copyright © 2017 Elsevier Inc. All rights reserved.
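The quantity at the centre of this abstract, the peak-to-peak amplitude, is simple to compute. The sketch below flags a trial as bad when any sensor exceeds a fixed threshold. It is a simplified illustration only: the threshold is passed in by hand, whereas the paper estimates it by cross-validation, and the per-sensor thresholds and interpolation-based repair are not attempted. All names and numbers are hypothetical.

```python
def peak_to_peak(signal):
    """Difference between the largest and smallest sample of one sensor trace."""
    return max(signal) - min(signal)

def reject_bad_trials(trials, threshold):
    """Split trial indices into (good, bad) using a global peak-to-peak threshold.

    `trials` is a list of trials; each trial is a list of sensor traces.
    """
    good, bad = [], []
    for index, trial in enumerate(trials):
        worst = max(peak_to_peak(sensor) for sensor in trial)
        (bad if worst > threshold else good).append(index)
    return good, bad

# Three hypothetical trials with two sensors each (arbitrary units).
trials = [
    [[0.0, 1.0, -1.0], [0.5, 0.2, 0.1]],
    [[0.0, 9.0, -8.0], [0.1, 0.0, 0.2]],   # contains a large artifact
    [[0.3, 0.1, 0.2], [1.0, -1.5, 0.5]],
]
good, bad = reject_bad_trials(trials, threshold=5.0)
print(good, bad)
```

A threshold set too low discards usable data and one set too high lets artifacts through, which is the trade-off that a data-driven estimate is meant to resolve.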
A study on the use of Gumbel approximation with the Bernoulli spatial scan statistic.
Read, S; Bath, P A; Willett, P; Maheswaran, R
2013-08-30
The Bernoulli version of the spatial scan statistic is a well established method of detecting localised spatial clusters in binary labelled point data, a typical application being the epidemiological case-control study. A recent study suggests the inferential accuracy of several versions of the spatial scan statistic (principally the Poisson version) can be improved, at little computational cost, by using the Gumbel distribution, a method now available in SaTScan(TM) (www.satscan.org). We study in detail the effect of this technique when applied to the Bernoulli version and demonstrate that it is highly effective, albeit with some increase in false alarm rates at certain significance thresholds. We explain how this increase is due to the discrete nature of the Bernoulli spatial scan statistic and demonstrate that it can affect even small p-values. Despite this, we argue that the Gumbel method is actually preferable for very small p-values. Furthermore, we extend previous research by running benchmark trials on 12 000 synthetic datasets, thus demonstrating that the overall detection capability of the Bernoulli version (i.e. ratio of power to false alarm rate) is not noticeably affected by the use of the Gumbel method. We also provide an example application of the Gumbel method using data on hospital admissions for chronic obstructive pulmonary disease. Copyright © 2013 John Wiley & Sons, Ltd.
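As a rough illustration of the Gumbel technique discussed in this abstract, the sketch below fits a Gumbel distribution to Monte Carlo replicates of the scan statistic and reads the p-value from its upper tail. The method-of-moments fit is an assumption made here for simplicity (the abstract does not state the fitting procedure), and the function names are hypothetical; this is not SaTScan code.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(replicates):
    """Method-of-moments Gumbel fit (assumed): returns (location, scale)."""
    beta = statistics.stdev(replicates) * math.sqrt(6) / math.pi
    mu = statistics.fmean(replicates) - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(observed, replicates):
    """Upper-tail probability of the fitted Gumbel at the observed statistic."""
    mu, beta = fit_gumbel(replicates)
    z = (observed - mu) / beta
    # 1 - exp(-exp(-z)), written with expm1 to keep precision for tiny p-values
    return -math.expm1(-math.exp(-z))


def rank_p_value(observed, replicates):
    """Conventional Monte Carlo rank p-value, bounded below by 1/(R+1)."""
    exceed = sum(1 for r in replicates if r >= observed)
    return (exceed + 1) / (len(replicates) + 1)
```

The rank-based p-value can never fall below 1/(R+1) for R replicates, whereas the fitted tail can, which is one reason a parametric approximation is attractive for very small p-values.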
BusyBee Web: metagenomic data analysis by bootstrapped supervised binning and annotation
Kiefer, Christina; Fehlmann, Tobias; Backes, Christina
2017-01-01
Metagenomics-based studies of mixed microbial communities are impacting biotechnology, life sciences and medicine. Computational binning of metagenomic data is a powerful approach for the culture-independent recovery of population-resolved genomic sequences, i.e. from individual or closely related, constituent microorganisms. Existing binning solutions often require a priori characterized reference genomes and/or dedicated compute resources. Extending currently available reference-independent binning tools, we developed the BusyBee Web server for the automated deconvolution of metagenomic data into population-level genomic bins using assembled contigs (Illumina) or long reads (Pacific Biosciences, Oxford Nanopore Technologies). A reversible compression step as well as bootstrapped supervised binning enable quick turnaround times. The binning results are represented in interactive 2D scatterplots. Moreover, bin quality estimates, taxonomic annotations and annotations of antibiotic resistance genes are computed and visualized. Ground truth-based benchmarks of BusyBee Web demonstrate comparably high performance to state-of-the-art binning solutions for assembled contigs and markedly improved performance for long reads (median F1 scores: 70.02–95.21%). Furthermore, the applicability to real-world metagenomic datasets is shown. In conclusion, our reference-independent approach automatically bins assembled contigs or long reads, exhibits high sensitivity and precision, enables intuitive inspection of the results, and only requires FASTA-formatted input. The web-based application is freely accessible at: https://ccb-microbe.cs.uni-saarland.de/busybee. PMID:28472498
Evaluating real-time Java for mission-critical large-scale embedded systems
NASA Technical Reports Server (NTRS)
Sharp, D. C.; Pla, E.; Luecke, K. R.; Hassan, R. J.
2003-01-01
This paper describes benchmarking results on an RT JVM. This paper extends previously published results by including additional tests, by being run on a recently available pre-release version of the first commercially supported RTSJ implementation, and by assessing results based on our experience with avionics systems in other languages.
Benchmarking Problems Used in Second Year Level Organic Chemistry Instruction
ERIC Educational Resources Information Center
Raker, Jeffrey R.; Towns, Marcy H.
2010-01-01
Investigations of the problem types used in college-level general chemistry examinations have been reported in this Journal and were first reported in the "Journal of Chemical Education" in 1924. This study extends the findings from general chemistry to the problems of four college-level organic chemistry courses. Three problem…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Green, Timothy F. G., E-mail: tim.green@materials.ox.ac.uk; Yates, Jonathan R., E-mail: jonathan.yates@materials.ox.ac.uk
2014-06-21
We present a method for the first-principles calculation of nuclear magnetic resonance (NMR) J-coupling in extended systems using state-of-the-art ultrasoft pseudopotentials and including scalar-relativistic effects. The use of ultrasoft pseudopotentials is allowed by extending the projector augmented wave (PAW) method of Joyce et al. [J. Chem. Phys. 127, 204107 (2007)]. We benchmark it against existing local-orbital quantum chemical calculations and experiments for small molecules containing light elements, with good agreement. Scalar-relativistic effects are included at the zeroth-order regular approximation level of theory and benchmarked against existing local-orbital quantum chemical calculations and experiments for a number of small molecules containing the heavy row six elements W, Pt, Hg, Tl, and Pb, with good agreement. Finally, ¹J(P-Ag) and ²J(P-Ag-P) couplings are calculated in some larger molecular crystals and compared against solid-state NMR experiments. Some remarks are also made as to improving the numerical stability of dipole perturbations using PAW.
A Simulation Environment for Benchmarking Sensor Fusion-Based Pose Estimators.
Ligorio, Gabriele; Sabatini, Angelo Maria
2015-12-19
In-depth analysis and performance evaluation of sensor fusion-based estimators can be challenging when performed using real-world sensor data. For this reason, simulation is widely recognized as one of the most powerful tools for algorithm benchmarking. In this paper, we present a simulation framework suitable for assessing the performance of sensor fusion-based pose estimators. The systems used for implementing the framework were magnetic/inertial measurement units (MIMUs) and a camera, although the addition of further sensing modalities is straightforward. Typical nuisance factors were also included for each sensor. The proposed simulation environment was validated using real-life sensor data employed for motion tracking. The highest mismatch between real and simulated sensors was about 5% of the measured quantity (for the camera simulation), whereas the lowest correlation was found for one axis of the gyroscope (0.90). In addition, a real benchmarking example of an extended Kalman filter for pose estimation from MIMU and camera data is presented.
Prediction of Body Fluids where Proteins are Secreted into Based on Protein Interaction Network
Hu, Le-Le; Huang, Tao; Cai, Yu-Dong; Chou, Kuo-Chen
2011-01-01
Determining the body fluids into which proteins can be secreted is important for protein function annotation and disease biomarker discovery. In this study, we developed a network-based method to predict the kinds of body fluids into which human proteins can be secreted. For a newly constructed benchmark dataset that consists of 529 human-secreted proteins, the prediction accuracy for the most likely body fluid location predicted by our method via the jackknife test was 79.02%, significantly higher than the success rate of a random guess (29.36%). The likelihood that the predicted body fluids of the first four orders contain all the true body fluids into which the proteins can be secreted is 62.94%. Our method was further demonstrated with two independent datasets: one contains 57 proteins that can be secreted into blood, while the other contains 61 proteins that can be secreted into plasma/serum and are possible biomarkers associated with various cancers. Of the 57 proteins in the first dataset, 55 were correctly predicted as blood-secreted proteins. Of the 61 proteins in the second dataset, 58 were predicted as most likely to be secreted into plasma/serum. These encouraging results indicate that the network-based prediction method is quite promising. It is anticipated that the method will benefit the relevant areas for both basic research and drug development. PMID:21829572
NASA Astrophysics Data System (ADS)
Gross, M. B.; Mayernik, M. S.; Rowan, L. R.; Khan, H.; Boler, F. M.; Maull, K. E.; Stott, D.; Williams, S.; Corson-Rikert, J.; Johns, E. M.; Daniels, M. D.; Krafft, D. B.
2015-12-01
UNAVCO, UCAR, and Cornell University are working together to leverage semantic web technologies to enable discovery of people, datasets, publications and other research products, as well as the connections between them. The EarthCollab project, an EarthCube Building Block, is enhancing an existing open-source semantic web application, VIVO, to address connectivity gaps across distributed networks of researchers and resources related to the following two geoscience-based communities: (1) the Bering Sea Project, an interdisciplinary field program whose data archive is hosted by NCAR's Earth Observing Laboratory (EOL), and (2) UNAVCO, a geodetic facility and consortium that supports diverse research projects informed by geodesy. People, publications, datasets and grant information have been mapped to an extended version of the VIVO-ISF ontology and ingested into VIVO's database. Data is ingested using a custom set of scripts that include the ability to perform basic automated and curated disambiguation. VIVO can display a page for every object ingested, including connections to other objects in the VIVO database. A dataset page, for example, includes the dataset type, time interval, DOI, related publications, and authors. The dataset type field provides a connection to all other datasets of the same type. The author's page will show, among other information, related datasets and co-authors. Information previously spread across several unconnected databases is now stored in a single location. In addition to VIVO's default display, the new database can also be queried using SPARQL, a query language for semantic data. EarthCollab will also extend the VIVO web application. One such extension is the ability to cross-link separate VIVO instances across institutions, allowing local display of externally curated information. 
For example, Cornell's VIVO faculty pages will display UNAVCO's dataset information and UNAVCO's VIVO will display Cornell faculty member contact and position information. Additional extensions, including enhanced geospatial capabilities, will be developed following task-centered usability testing.
Comparison of mapping algorithms used in high-throughput sequencing: application to Ion Torrent data
2014-01-01
Background: The rapid evolution in high-throughput sequencing (HTS) technologies has opened up new perspectives in several research fields and led to the production of large volumes of sequence data. A fundamental step in HTS data analysis is the mapping of reads onto reference sequences. Choosing a suitable mapper for a given technology and a given application is a subtle task because of the difficulty of evaluating mapping algorithms. Results: In this paper, we present a benchmark procedure to compare mapping algorithms used in HTS using both real and simulated datasets and considering four evaluation criteria: computational resource and time requirements, robustness of mapping, ability to report positions for reads in repetitive regions, and ability to retrieve true genetic variation positions. To measure robustness, we introduced a new definition for a correctly mapped read taking into account not only the expected start position of the read but also the end position and the number of indels and substitutions. We developed CuReSim, a new read simulator, that is able to generate customized benchmark data for any kind of HTS technology by adjusting parameters to the error types. CuReSim and CuReSimEval, a tool to evaluate the mapping quality of the CuReSim simulated reads, are freely available. We applied our benchmark procedure to evaluate 14 mappers in the context of whole genome sequencing of small genomes with Ion Torrent data for which such a comparison has not yet been established. Conclusions: A benchmark procedure to compare HTS data mappers is introduced with a new definition for the mapping correctness as well as tools to generate simulated reads and evaluate mapping quality. 
The application of this procedure to Ion Torrent data from the whole genome sequencing of small genomes has allowed us to validate our benchmark procedure and demonstrate that it is helpful for selecting a mapper based on the intended application, questions to be addressed, and the technology used. This benchmark procedure can be used to evaluate existing or in-development mappers as well as to optimize parameters of a chosen mapper for any application and any sequencing platform. PMID:24708189
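The stricter notion of a correctly mapped read described in this abstract (start position, end position, and the number of indels and substitutions) might be checked along the following lines. The abstract does not give the exact rule, so the `Alignment` record, the positional `tolerance` parameter and the exact-equality test on edit counts are all assumptions made for illustration; this is not the CuReSimEval implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one read alignment."""
    start: int
    end: int
    insertions: int
    deletions: int
    substitutions: int


def is_correctly_mapped(reported, expected, tolerance=0):
    """True if the reported alignment matches the simulated truth.

    Both ends must lie within `tolerance` bases of the expected positions,
    and the edit counts must agree (an assumed criterion).
    """
    return (
        abs(reported.start - expected.start) <= tolerance
        and abs(reported.end - expected.end) <= tolerance
        and reported.insertions == expected.insertions
        and reported.deletions == expected.deletions
        and reported.substitutions == expected.substitutions
    )
```

Checking only the start position would accept an alignment that begins in the right place but hides a spurious indel; comparing the end position and edit counts as well is what makes the criterion stricter.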
Robust prediction of protein subcellular localization combining PCA and WSVMs.
Tian, Jiang; Gu, Hong; Liu, Wenqi; Gao, Chiyang
2011-08-01
Automated prediction of protein subcellular localization is an important tool for genome annotation and drug discovery, and Support Vector Machines (SVMs) can effectively solve this problem in a supervised manner. However, the datasets obtained from real experiments are likely to contain outliers or noise, which can lead to poor generalization ability and classification accuracy. To address this problem, we adopt strategies to lower the effect of outliers. First, we design a method based on Weighted SVMs (WSVMs): different weights are assigned to different data points, so the training algorithm learns the decision boundary according to the relative importance of the data points. Second, we analyse the influence of Principal Component Analysis (PCA) on WSVM classification and propose a hybrid classifier combining the merits of both PCA and WSVM. After performing dimension reduction operations on the datasets, the kernel-based possibilistic c-means algorithm can generate more suitable weights for the training, as PCA transforms the data into a new coordinate system with largest variances affected greatly by the outliers. Experiments on benchmark datasets show promising results, which confirm the effectiveness of the proposed method in terms of prediction accuracy. Copyright © 2011 Elsevier Ltd. All rights reserved.
Picking ChIP-seq peak detectors for analyzing chromatin modification experiments
Micsinai, Mariann; Parisi, Fabio; Strino, Francesco; Asp, Patrik; Dynlacht, Brian D.; Kluger, Yuval
2012-01-01
Numerous algorithms have been developed to analyze ChIP-Seq data. However, the complexity of analyzing diverse patterns of ChIP-Seq signals, especially for epigenetic marks, still calls for the development of new algorithms and objective comparisons of existing methods. We developed Qeseq, an algorithm to detect regions of increased ChIP read density relative to background. Qeseq employs critical novel elements, such as iterative recalibration and neighbor joining of reads, to identify enriched regions of any length. To objectively assess its performance relative to 14 other ChIP-Seq peak finders, we designed a novel protocol based on Validation Discriminant Analysis (VDA) to optimally select validation sites and generated two validation datasets, which are the most comprehensive to date for algorithmic benchmarking of key epigenetic marks. In addition, we systematically explored a total of 315 diverse parameter configurations from these algorithms and found that, typically, optimal parameters in one dataset do not generalize to other datasets. Nevertheless, default parameters show the most stable performance, suggesting that they should be used. This study also provides a reproducible and generalizable methodology for unbiased comparative analysis of high-throughput sequencing tools that can facilitate future algorithmic development. PMID:22307239
Comparing NetCDF and SciDB on managing and querying 5D hydrologic dataset
NASA Astrophysics Data System (ADS)
Liu, Haicheng; Xiao, Xiao
2016-11-01
Efficiently extracting information from high-dimensional hydro-meteorological modelling datasets requires smart solutions. Traditional methods are mostly based on files, which can be edited and accessed conveniently but suffer from efficiency problems due to their contiguous storage structure. Others propose databases as an alternative, citing advantages such as native functionality for manipulating multidimensional (MD) arrays, smart caching strategies and scalability. In this research, NetCDF file-based solutions and the multidimensional array database management system (DBMS) SciDB, which applies a chunked storage structure, are benchmarked to determine the best solution for storing and querying a large 5D hydrologic modelling dataset. The effect of data storage configurations, including chunk size, dimension order and compression, on query performance is explored. Results indicate that the dimension order used to organize storage of 5D data has a significant influence on query performance if the chunk size is very large, but the effect becomes insignificant when the chunk size is properly set. Compression in SciDB mostly has a negative influence on query performance. Caching is an advantage but may be influenced by the execution of different query processes. On the whole, the NetCDF solution without compression is more efficient than the SciDB DBMS.
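Why chunk configuration matters for query performance can be seen by counting how many chunks a query has to touch. The sketch below is a generic illustration for any chunked N-dimensional store and does not use the NetCDF or SciDB APIs; the function name and the example 5D shapes are hypothetical.

```python
from math import prod


def chunks_touched(query_start, query_stop, chunk_shape):
    """Number of chunks intersected by the half-open box [start, stop).

    Each argument is a tuple with one entry per dimension. Along every
    dimension the query covers the chunks from index start // c up to
    (stop - 1) // c, and the total is the product over dimensions.
    """
    per_dim = []
    for lo, hi, c in zip(query_start, query_stop, chunk_shape):
        first = lo // c
        last = (hi - 1) // c
        per_dim.append(last - first + 1)
    return prod(per_dim)
```

For a time-series query at a single grid point, a chunk shape that is long along the time axis reads one chunk, whereas a shape that is wide in space but one step deep in time reads one chunk per time step. Poorly matched chunking therefore multiplies I/O, which is consistent with the abstract's observation that storage configuration affects query time.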
Classification as clustering: a Pareto cooperative-competitive GP approach.
McIntyre, Andrew R; Heywood, Malcolm I
2011-01-01
Intuitively, population-based algorithms such as genetic programming provide a natural environment for supporting solutions that learn to decompose the overall task between multiple individuals, or a team. This work presents a framework for evolving teams without recourse to prespecifying the number of cooperating individuals. To do so, each individual evolves a mapping to a distribution of outcomes that, following clustering, establishes the parameterization of a (Gaussian) local membership function. This gives individuals the opportunity to represent subsets of tasks, where the overall task is that of classification under the supervised learning domain. Thus, rather than each team member representing an entire class, individuals are free to identify unique subsets of the overall classification task. The framework is supported by techniques from evolutionary multiobjective optimization (EMO) and Pareto competitive coevolution. EMO establishes the basis for encouraging individuals to provide accurate yet nonoverlapping behaviors, whereas competitive coevolution provides the mechanism for scaling to potentially large unbalanced datasets. Benchmarking is performed against recent examples of nonlinear SVM classifiers over 12 UCI datasets with between 150 and 200,000 training instances. Solutions from the proposed coevolutionary multiobjective GP framework appear to provide a good balance between classification performance and model complexity, especially as the dataset instance count increases.
Yang, Guanxue; Wang, Lin; Wang, Xiaofan
2017-06-07
Reconstruction of networks underlying complex systems is one of the most crucial problems in many areas of engineering and science. In this paper, rather than identifying parameters of complex systems governed by pre-defined models or taking some polynomial and rational functions as prior information for subsequent model selection, we put forward a general framework for nonlinear causal network reconstruction from time series with limited observations. Using multi-source datasets obtained with a data-fusion strategy, we propose a novel method to handle the nonlinearity and directionality of complex networked systems, namely group lasso nonlinear conditional Granger causality. Specifically, our method can exploit different sets of radial basis functions to approximate the nonlinear interactions between each pair of nodes and integrate sparsity into grouped variable selection. The performance of our approach is first assessed with two types of simulated datasets, from a nonlinear vector autoregressive model and nonlinear dynamic models, and then verified on the benchmark datasets from DREAM3 Challenge4. Effects of data size and noise intensity are also discussed. All of the results demonstrate that the proposed method performs better in terms of area under the precision-recall curve.
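The role of the group lasso in this kind of network reconstruction can be sketched through its proximal (block soft-thresholding) step: all basis-function coefficients belonging to one candidate parent node form a group, and the whole group is either shrunk or set to zero together. The sketch below shows only that standard operator; the function names and the dictionary layout are illustrative assumptions, and the full estimation procedure of the paper is not reproduced.

```python
import math


def group_soft_threshold(group, lam):
    """Proximal operator of lam * ||group||_2 (block soft-thresholding).

    The group is scaled by max(0, 1 - lam / ||group||_2), so weak groups
    vanish entirely and strong groups are shrunk toward zero.
    """
    norm = math.sqrt(sum(w * w for w in group))
    if norm <= lam:
        return [0.0] * len(group)
    scale = 1.0 - lam / norm
    return [scale * w for w in group]


def selected_parents(coef_groups, lam):
    """Candidate parents whose coefficient group survives thresholding.

    `coef_groups` maps a (hypothetical) node name to its coefficient list.
    """
    shrunk = {node: group_soft_threshold(g, lam) for node, g in coef_groups.items()}
    return [node for node, g in shrunk.items() if any(w != 0.0 for w in g)]
```

Because a group is removed as a unit, a zeroed group can be read as "no inferred causal link from that node", which is what makes group-level sparsity a natural fit for edge selection.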
Scaling up: What coupled land-atmosphere models can tell us about critical zone processes
NASA Astrophysics Data System (ADS)
FitzGerald, K. A.; Masarik, M. T.; Rudisill, W. J.; Gelb, L.; Flores, A. N.
2017-12-01
A significant limitation to extending our knowledge of critical zone (CZ) evolution and function is a lack of hydrometeorological information at sufficiently fine spatial and temporal resolutions to resolve topo-climatic gradients and adequate spatial and temporal extent to capture a range of climatic conditions across ecoregions. Research at critical zone observatories (CZOs) suggests hydrometeorological stores and fluxes exert key controls on processes such as hydrologic partitioning and runoff generation, landscape evolution, soil formation, biogeochemical cycling, and vegetation dynamics. However, advancing fundamental understanding of CZ processes necessitates understanding how hydrometeorological drivers vary across space and time. As a result of recent advances in computational capabilities, it has become possible, although still computationally expensive, to simulate hydrometeorological conditions via high-resolution coupled land-atmosphere models. Using the Weather Research and Forecasting (WRF) model, we developed a high spatiotemporal resolution dataset extending from water year 1987 to present for the Snake River Basin in the northwestern USA including the Reynolds Creek and Dry Creek Experimental Watersheds, both part of the Reynolds Creek CZO, as well as a range of other ecosystems including shrubland desert, montane forests, and alpine tundra. Drawing from hypotheses generated by work at these sites and across the CZO network, we use the resulting dataset in combination with CZO observations and publicly available datasets to provide insights regarding hydrologic partitioning, vegetation distribution, and erosional processes. This dataset provides key context in interpreting and reconciling what observations obtained at particular sites reveal about underlying CZ structure and function. 
While this dataset does not extend to future climates, the same modeling framework can be used to dynamically downscale coarse global climate model output to scales relevant to CZ processes. This presents an opportunity to better characterize the impact of climate change on the CZ. We also argue that opportunities exist beyond the one way flow of information and that what we learn at CZOs has the potential to contribute significantly to improved Earth system models.
Passivity-based Robust Control of Aerospace Systems
NASA Technical Reports Server (NTRS)
Kelkar, Atul G.; Joshi, Suresh M. (Technical Monitor)
2000-01-01
This report provides a brief summary of the research work performed over the duration of the cooperative research agreement between NASA Langley Research Center and Kansas State University. The cooperative agreement, which was originally for a duration of three years, was extended by another year through a no-cost extension in order to accomplish the goals of the project. The main objective of the research was to develop passivity-based robust control methodology for passive and non-passive aerospace systems. The focus of the first year's research was limited to the investigation of passivity-based methods for the robust control of Linear Time-Invariant (LTI) single-input single-output (SISO), open-loop stable, minimum-phase non-passive systems. The second year's focus was mainly on extending the passivity-based methodology to a larger class of non-passive LTI systems which includes unstable and nonminimum phase SISO systems. For LTI non-passive systems, five different passification methods were developed. The primary effort during years three and four was on the development of passification methodology for MIMO systems, development of methods for checking robustness of passification, and developing synthesis techniques for passifying compensators. For passive LTI systems, an optimal synthesis procedure was also developed for the design of constant-gain positive real controllers. For nonlinear passive systems, a numerical optimization-based technique was developed for the synthesis of constant as well as time-varying gain positive-real controllers. The passivity-based control design methodology developed during the duration of this project was demonstrated by its application to various benchmark examples. 
These example systems included longitudinal model of an F-18 High Alpha Research Vehicle (HARV) for pitch axis control, NASA's supersonic transport wind tunnel model, ACC benchmark model, 1-D acoustic duct model, piezo-actuated flexible link model, and NASA's Benchmark Active Controls Technology (BACT) Wing model. Some of the stability results for linear passive systems were also extended to nonlinear passive systems. Several publications and conference presentations resulted from this research.
Hansen, Laura S; Sloth, Erik; Hjortdal, Vibeke E; Jakobsen, Carl-Johan
2015-08-01
Short-term (30 days) mortality frequently is used as an outcome measure after cardiac surgery, although it has been proposed that the follow-up period should be extended to 120 days to allow for more accurate benchmarking. The authors aimed to evaluate whether mortality rates 120 days after surgery were comparable to general mortality and to compare causes of death between the cohort and the general population. A multicenter descriptive cohort study using prospectively entered registry data. University hospital. The cohort was obtained from the Western Denmark Heart Registry and matched to the Danish National Hospital Register as well as the Danish Register of Causes of Death. A weighted, age-matched general population consisting of all Danish patients who died within the study period was identified through the central authority on Danish statistics. A total of 11,988 patients (>15 years) who underwent cardiac surgery at Aarhus, Aalborg and Odense University Hospitals from April 1, 2006 to December 31, 2012 were included. Coronary artery bypass grafting, valve surgery and combinations. Mortality after cardiac surgery matches mortality in the general population after 140 days. Mortality curves run almost parallel from this point onwards, regardless of the European System for Cardiac Operative Risk Evaluation (EuroSCORE) and intervention. The causes of death in the cohort differed statistically significantly from the background population (p<0.0001; one-sample t-test) throughout the first postoperative year. The leading cause of death in the cohort was cardiac (38%), 53% of which was categorized as heart failure. A total of 54% of these patients were assessed preoperatively as having normal or mildly impaired heart function (EuroSCORE). This study supported an extended follow-up period after cardiac surgery when benchmarking cardiac surgery centers. Regardless of preoperative heart function, heart failure was the consistent leading cause of death. 
Copyright © 2015 Elsevier Inc. All rights reserved.
Gomez, David; Byrne, James P; Alali, Aziz S; Xiong, Wei; Hoeft, Chris; Neal, Melanie; Subacius, Harris; Nathens, Avery B
2017-12-01
The Glasgow Coma Scale (GCS) is the most widely used measure of traumatic brain injury (TBI) severity. Currently, the arrival GCS motor component (mGCS) score is used in risk-adjustment models for external benchmarking of mortality. However, there is evidence that the highest mGCS score in the first 24 hours after injury might be a better predictor of death. Our objective was to evaluate the impact of including the highest mGCS score on the performance of risk-adjustment models and subsequent external benchmarking results. Data were derived from the Trauma Quality Improvement Program analytic dataset (January 2014 through March 2015) and were limited to the severe TBI cohort (16 years or older, isolated head injury, GCS ≤8). Risk-adjustment models were created that varied in the mGCS covariates only (initial score, highest score, or both initial and highest mGCS scores). Model performance and fit, as well as external benchmarking results, were compared. There were 6,553 patients with severe TBI across 231 trauma centers included. Initial and highest mGCS scores were different in 47% of patients (n = 3,097). Model performance and fit improved when both initial and highest mGCS scores were included, as evidenced by improved C-statistic, Akaike Information Criterion, and adjusted R-squared values. Three-quarters of centers changed their adjusted odds ratio decile, 2.6% of centers changed outlier status, and 45% of centers exhibited a ≥0.5-SD change in the odds ratio of death after including highest mGCS score in the model. This study supports the concept that additional clinical information has the potential to not only improve the performance of current risk-adjustment models, but can also have a meaningful impact on external benchmarking strategies. Highest mGCS score is a good potential candidate for inclusion in additional models. Copyright © 2017 American College of Surgeons. Published by Elsevier Inc. All rights reserved.
BEST: Improved Prediction of B-Cell Epitopes from Antigen Sequences
Gao, Jianzhao; Faraggi, Eshel; Zhou, Yaoqi; Ruan, Jishou; Kurgan, Lukasz
2012-01-01
Accurate identification of immunogenic regions in a given antigen chain is a difficult and actively pursued problem. Although accurate predictors for T-cell epitopes are already in place, the prediction of B-cell epitopes requires further research. We overview the available approaches for the prediction of B-cell epitopes and propose a novel and accurate sequence-based solution. Our BEST (B-cell Epitope prediction using Support vector machine Tool) method predicts epitopes from antigen sequences, in contrast to some methods that predict only from short sequence fragments, using a new architecture based on averaging selected scores generated from sliding 20-mers by a Support Vector Machine (SVM). The SVM predictor utilizes a comprehensive and custom-designed set of inputs generated by combining information derived from the chain, sequence conservation, similarity to known (training) epitopes, and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors including ABCPred, the method by Chen et al. (2007), BCPred, COBEpro, BayesB, and CBTOPE, when considering the predictions from antigen chains and from the chain fragments. Our method obtains a cross-validated area under the receiver operating characteristic curve (AUC) for the fragment-based prediction at 0.81 and 0.85, depending on the dataset. The AUCs of BEST on the benchmark sets of full antigen chains equal 0.57 and 0.6, which is significantly and slightly better than the next best method we tested. We also present case studies to contrast the propensity profiles generated by BEST and several other methods. PMID:22761950
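The sliding-window architecture mentioned in this abstract can be illustrated as follows: each 20-residue window receives one score, and every residue is then assigned the average of the scores of the windows that cover it. In this sketch `window_scorer` is a hypothetical stand-in for the trained SVM, and the step that selects which scores enter the average is omitted because the abstract does not describe it.

```python
def per_residue_scores(sequence, window_scorer, window=20):
    """Average window-level scores back onto individual residues.

    `window_scorer` is any callable mapping a window string to a float
    (a placeholder for a trained classifier). Residues near the chain
    ends are covered by fewer windows, so each sum is divided by its
    own window count.
    """
    n = len(sequence)
    if n < window:
        raise ValueError("sequence is shorter than the window length")
    totals = [0.0] * n
    counts = [0] * n
    for start in range(n - window + 1):
        score = window_scorer(sequence[start:start + window])
        for i in range(start, start + window):
            totals[i] += score
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]
```

Averaging over overlapping windows smooths the per-residue profile, which lets a fragment-level classifier be applied to a full antigen chain.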
Zou, Lingyun; Nan, Chonghan; Hu, Fuquan
2013-12-15
Various human pathogens secrete effector proteins into host cells via the type IV secretion system (T4SS). These proteins play important roles in the interaction between bacteria and hosts. Computational methods for T4SS effector prediction have been developed for screening experimental targets in several isolated bacterial species; however, widely applicable prediction approaches are still unavailable. In this work, four types of distinctive features, namely, amino acid composition, dipeptide composition, position-specific scoring matrix composition and auto covariance transformation of position-specific scoring matrix, were calculated from primary sequences. A classifier, T4EffPred, was developed using the support vector machine with these features and their different combinations for effector prediction. Various theoretical tests were performed in a newly established dataset, and the results were measured with four indexes. We demonstrated that T4EffPred can discriminate IVA and IVB effectors in benchmark datasets with positive rates of 76.7% and 89.7%, respectively. The overall accuracy of 95.9% shows that the present method is accurate for distinguishing the T4SS effector in unidentified sequences. A classifier ensemble was designed to synthesize all single classifiers. Notable performance improvement was observed using this ensemble system in benchmark tests. To demonstrate the model's application, a genome-scale prediction of effectors was performed in Bartonella henselae, an important zoonotic pathogen. A number of putative candidates were distinguished. A web server implementing the prediction method and the source code are both available at http://bioinfo.tmmu.edu.cn/T4EffPred.
NASA Astrophysics Data System (ADS)
Zhang, Z.; Zimmermann, N. E.; Poulter, B.
2015-12-01
Simulating the spatial-temporal dynamics of wetlands is key to understanding the role of wetland biogeochemistry under past and future climate variability. Hydrologic inundation models, such as TOPMODEL, are based on a fundamental parameter known as the compound topographic index (CTI) and provide a computationally cost-efficient approach to simulate global wetland dynamics. However, there remain large discrepancies in the implementations of TOPMODEL in land-surface models (LSMs) and thus in their performance against observations. This study describes new improvements to TOPMODEL implementation and estimates of global wetland dynamics using the LPJ-wsl DGVM, and quantifies uncertainties by comparing three digital elevation model products (HYDRO1k, GMTED, and HydroSHEDS) at different spatial resolution and accuracy on simulated inundation dynamics. We found that calibrating TOPMODEL with a benchmark dataset can help to successfully predict the seasonal and interannual variations of wetlands, as well as improve the spatial distribution of wetlands to be consistent with inventories. The HydroSHEDS DEM, using a river-basin scheme for aggregating the CTI, shows the best accuracy for capturing the spatio-temporal dynamics of wetlands among the three DEM products. This study demonstrates the feasibility of capturing the spatial heterogeneity of inundation and of estimating seasonal and interannual variations in wetlands by coupling a hydrological module in LSMs with appropriate benchmark datasets. It additionally highlights the importance of an adequate understanding of topographic indices for simulating global wetlands and shows the opportunity to converge wetland estimations in LSMs by identifying the uncertainty associated with existing wetland products.
Crabtree, Nathaniel M; Moore, Jason H; Bowyer, John F; George, Nysia I
2017-01-01
A computational evolution system (CES) is a knowledge discovery engine that can identify subtle, synergistic relationships in large datasets. Pareto optimization allows CESs to balance accuracy with model complexity when evolving classifiers. Using Pareto optimization, a CES is able to identify a very small number of features while maintaining high classification accuracy. A CES can be designed for various types of data, and the user can exploit expert knowledge about the classification problem in order to improve discrimination between classes. These characteristics give CES an advantage over other classification and feature selection algorithms, particularly when the goal is to identify a small number of highly relevant, non-redundant biomarkers. Previously, CESs have been developed only for binary class datasets. In this study, we developed a multi-class CES. The multi-class CES was compared to three common feature selection and classification algorithms: support vector machine (SVM), random k-nearest neighbor (RKNN), and random forest (RF). The algorithms were evaluated on three distinct multi-class RNA sequencing datasets. The comparison criteria were run-time, classification accuracy, number of selected features, and stability of selected feature set (as measured by the Tanimoto distance). The performance of each algorithm was data-dependent. CES performed best on the dataset with the smallest sample size, indicating that CES has a unique advantage since the accuracy of most classification methods suffers when sample size is small. The multi-class extension of CES increases the appeal of its application to complex, multi-class datasets in order to identify important biomarkers and features.
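The stability criterion mentioned in this abstract can be sketched as follows. The abstract does not give its exact formula, so this assumes the usual set-based Tanimoto (Jaccard) distance, one minus the ratio of shared features to all features selected in either run, averaged over all pairs of runs. The function names are hypothetical.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """Set-based Tanimoto distance: 1 - |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty selections are treated as identical
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(feature_sets):
    """Average Tanimoto distance over all pairs of selected feature sets."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)
```

A value near 0 means repeated runs select nearly the same features; a value near 1 means the selections barely overlap.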
Functional evaluation of out-of-the-box text-mining tools for data-mining tasks.
Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H
2015-01-01
The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications. We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.
Haque, Mohammad Nazmul; Noman, Nasimul; Berretta, Regina; Moscato, Pablo
2016-01-01
Classification of datasets with imbalanced sample distributions has always been a challenge. In general, a popular approach for enhancing classification performance is the construction of an ensemble of classifiers. However, the performance of an ensemble is dependent on the choice of constituent base classifiers. Therefore, we propose a genetic algorithm-based search method for finding the optimum combination from a pool of base classifiers to form a heterogeneous ensemble. The algorithm, called GA-EoC, utilises 10-fold cross-validation on training data for evaluating the quality of each candidate ensemble. In order to combine the base classifiers’ decisions into the ensemble’s output, we used the simple and widely used majority voting approach. The proposed algorithm, along with the random sub-sampling approach to balance the class distribution, has been used for classifying class-imbalanced datasets. Additionally, if a feature set was not available, we used the (α, β) − k Feature Set method to select a better subset of features for classification. We have tested GA-EoC with three benchmarking datasets from the UCI-Machine Learning repository, one Alzheimer’s disease dataset and a subset of the PubFig database of Columbia University. In general, the performance of the proposed method on the chosen datasets is robust and better than that of the constituent base classifiers and many other well-known ensembles. Based on our empirical study we claim that a genetic algorithm is a superior and reliable approach to heterogeneous ensemble construction and we expect that the proposed GA-EoC would perform consistently in other cases. PMID:26764911
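The majority-voting step named in this abstract can be sketched in a few lines. This is an illustration of plain majority voting only, not the GA-EoC search itself; the function name `majority_vote` and the tie-breaking rule (the smallest tied label wins) are assumptions, since the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions holds one list of labels per base classifier, all of the
    same length. Ties go to the smallest label (an assumption made here).
    """
    if not predictions:
        raise ValueError("need at least one classifier's predictions")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must label the same samples")
    combined = []
    # zip(*predictions) yields the votes cast for one sample at a time.
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined
```

With an odd number of base classifiers and two classes, ties cannot occur, so the tie-breaking rule matters only for even-sized or multi-class ensembles.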
Harnessing Diversity towards the Reconstructing of Large Scale Gene Regulatory Networks
Yamanaka, Ryota; Kitano, Hiroaki
2013-01-01
Elucidating gene regulatory networks (GRNs) from large scale experimental data remains a central challenge in systems biology. Recently, numerous techniques, particularly consensus driven approaches combining different algorithms, have become a potentially promising strategy to infer accurate GRNs. Here, we develop a novel consensus inference algorithm, TopkNet, that can integrate multiple algorithms to infer GRNs. Comprehensive performance benchmarking on a cloud computing framework demonstrated that (i) a simple strategy to combine many algorithms does not always lead to performance improvement compared to the cost of consensus and (ii) TopkNet integrating only high-performance algorithms provides significant performance improvement compared to the best individual algorithms and community prediction. These results suggest that a priori determination of high-performance algorithms is a key to reconstructing an unknown regulatory network. Similarity among gene-expression datasets can be useful to determine potential optimal algorithms for reconstruction of unknown regulatory networks, i.e., if expression data associated with a known regulatory network are similar to those associated with an unknown regulatory network, optimal algorithms determined for the known regulatory network can be repurposed to infer the unknown regulatory network. Based on this observation, we developed a quantitative measure of similarity among gene-expression datasets and demonstrated that, if similarity between the two expression datasets is high, TopkNet integrating algorithms that are optimal for the known dataset performs well on the unknown dataset. The consensus framework, TopkNet, together with the similarity measure proposed in this study provides a powerful strategy towards harnessing the wisdom of the crowds in reconstruction of unknown regulatory networks. PMID:24278007
Massive Open Online Course Completion Rates Revisited: Assessment, Length and Attrition
ERIC Educational Resources Information Center
Jordan, Katy
2015-01-01
This analysis is based upon enrolment and completion data collected for a total of 221 Massive Open Online Courses (MOOCs). It extends previously reported work (Jordan, 2014) with an expanded dataset; the original work is extended to include a multiple regression analysis of factors that affect completion rates and analysis of attrition rates…
Present Status and Extensions of the Monte Carlo Performance Benchmark
NASA Astrophysics Data System (ADS)
Hoogenboom, J. Eduard; Petrovic, Bojan; Martin, William R.
2014-06-01
The NEA Monte Carlo Performance benchmark started in 2011 aiming to monitor over the years the abilities to perform a full-size Monte Carlo reactor core calculation with a detailed power production for each fuel pin with axial distribution. This paper gives an overview of the contributed results thus far. It shows that reaching a statistical accuracy of 1 % for most of the small fuel zones requires about 100 billion neutron histories. The efficiency of parallel execution of Monte Carlo codes on a large number of processor cores shows clear limitations for computer clusters with common type computer nodes. However, using true supercomputers the speedup of parallel calculations is increasing up to large numbers of processor cores. More experience is needed from calculations on true supercomputers using large numbers of processors in order to predict if the requested calculations can be done in a short time. As the specifications of the reactor geometry for this benchmark test are well suited for further investigations of full-core Monte Carlo calculations and a need is felt for testing other issues than its computational performance, proposals are presented for extending the benchmark to a suite of benchmark problems for evaluating fission source convergence for a system with a high dominance ratio, for coupling with thermal-hydraulics calculations to evaluate the use of different temperatures and coolant densities and to study the correctness and effectiveness of burnup calculations. Moreover, other contemporary proposals for a full-core calculation with realistic geometry and material composition will be discussed.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gerhard Strydom; Cristian Rabiti; Andrea Alfonsi
2012-10-01
PHISICS is a neutronics code system currently under development at the Idaho National Laboratory (INL). Its goal is to provide state of the art simulation capability to reactor designers. The different modules for PHISICS currently under development are a nodal and semi-structured transport core solver (INSTANT), a depletion module (MRTAU) and a cross section interpolation (MIXER) module. The INSTANT module is the most developed of those mentioned above. Basic functionalities are ready to use, but the code is still in continuous development to extend its capabilities. This paper reports on the effort of coupling the nodal kinetics code package PHISICS (INSTANT/MRTAU/MIXER) to the thermal hydraulics system code RELAP5-3D, to enable full core and system modeling. This will enable the possibility to model coupled (thermal-hydraulics and neutronics) problems with more options for 3D neutron kinetics, compared to the existing diffusion theory neutron kinetics module in RELAP5-3D (NESTLE). In the second part of the paper, an overview of the OECD/NEA MHTGR-350 MW benchmark is given. This benchmark has been approved by the OECD, and is based on the General Atomics 350 MW Modular High Temperature Gas Reactor (MHTGR) design. The benchmark includes coupled neutronics thermal hydraulics exercises that require more capabilities than RELAP5-3D with NESTLE offers. Therefore, the MHTGR benchmark makes extensive use of the new PHISICS/RELAP5-3D coupling capabilities. The paper presents the preliminary results of the three steady state exercises specified in Phase I of the benchmark using PHISICS/RELAP5-3D.
Mohammadhassanzadeh, Hossein; Van Woensel, William; Abidi, Samina Raza; Abidi, Syed Sibte Raza
2017-01-01
Capturing complete medical knowledge is challenging, often due to incomplete patient Electronic Health Records (EHR), but also because of valuable, tacit medical knowledge hidden away in physicians' experiences. To extend the coverage of incomplete medical knowledge-based systems beyond their deductive closure, and thus enhance their decision-support capabilities, we argue that innovative, multi-strategy reasoning approaches should be applied. In particular, plausible reasoning mechanisms apply patterns from human thought processes, such as generalization, similarity and interpolation, based on attributional, hierarchical, and relational knowledge. Plausible reasoning mechanisms include inductive reasoning, which generalizes the commonalities among the data to induce new rules, and analogical reasoning, which is guided by data similarities to infer new facts. By further leveraging rich, biomedical Semantic Web ontologies to represent medical knowledge, both known and tentative, we increase the accuracy and expressivity of plausible reasoning, and cope with issues such as data heterogeneity, inconsistency and interoperability. In this paper, we present a Semantic Web-based, multi-strategy reasoning approach, which integrates deductive and plausible reasoning and exploits Semantic Web technology to solve complex clinical decision support queries. We evaluated our system using a real-world medical dataset of patients with hepatitis, from which we randomly removed different percentages of data (5%, 10%, 15%, and 20%) to reflect scenarios with increasing amounts of incomplete medical knowledge. To increase the reliability of the results, we generated 5 independent datasets for each percentage of missing values, which resulted in 20 experimental datasets (in addition to the original dataset).
The results show that plausibly inferred knowledge extends the coverage of the knowledge base by, on average, 2%, 7%, 12%, and 16% for datasets with, respectively, 5%, 10%, 15%, and 20% of missing values. This expansion in the KB coverage allowed solving complex disease diagnostic queries that were previously unresolvable, without losing the correctness of the answers. However, compared to deductive reasoning, data-intensive plausible reasoning mechanisms yield a significant performance overhead. We observed that plausible reasoning approaches, by generating tentative inferences and leveraging domain knowledge of experts, allow us to extend the coverage of medical knowledge bases, resulting in improved clinical decision support. Second, by leveraging OWL ontological knowledge, we are able to increase the expressivity and accuracy of plausible reasoning methods. Third, our approach is applicable to clinical decision support systems for a range of chronic diseases.
Adsorption structures and energetics of molecules on metal surfaces: Bridging experiment and theory
NASA Astrophysics Data System (ADS)
Maurer, Reinhard J.; Ruiz, Victor G.; Camarillo-Cisneros, Javier; Liu, Wei; Ferri, Nicola; Reuter, Karsten; Tkatchenko, Alexandre
2016-05-01
Adsorption geometry and stability of organic molecules on surfaces are key parameters that determine the observable properties and functions of hybrid inorganic/organic systems (HIOSs). Despite many recent advances in precise experimental characterization and improvements in first-principles electronic structure methods, reliable databases of structures and energetics for large adsorbed molecules are largely missing. In this review, we present such a database for a range of molecules adsorbed on metal single-crystal surfaces. The systems we analyze include noble-gas atoms, conjugated aromatic molecules, carbon nanostructures, and heteroaromatic compounds adsorbed on five different metal surfaces. The overall objective is to establish a diverse benchmark dataset that enables an assessment of current and future electronic structure methods, and motivates further experimental studies that provide ever more reliable data. Specifically, the benchmark structures and energetics from experiment are here compared with the recently developed van der Waals (vdW) inclusive density-functional theory (DFT) method, DFT + vdWsurf. In comparison to 23 adsorption heights and 17 adsorption energies from experiment we find a mean average deviation of 0.06 Å and 0.16 eV, respectively. This confirms the DFT + vdWsurf method as an accurate and efficient approach to treat HIOSs. A detailed discussion identifies remaining challenges to be addressed in future development of electronic structure methods, for which the benchmark database presented here may serve as an important reference.
Vehicle Sprung Mass Estimation for Rough Terrain
2011-03-01
distributions are greater than zero. The multivariate polynomials are functions of the Legendre polynomials (Poularikas (1999...developed methods based on polynomial chaos theory and on the maximum likelihood approach to estimate the most likely value of the vehicle sprung...mass. The polynomial chaos estimator is compared to benchmark algorithms including recursive least squares, recursive total least squares, extended
MAKER-P: a tool-kit for the creation, management, and quality control of plant genome annotations
USDA-ARS's Scientific Manuscript database
We have optimized and extended the widely used annotation-engine MAKER for use on plant genomes. We have benchmarked the resulting software, MAKER-P, using the A. thaliana genome and the TAIR10 gene models. Here we demonstrate the ability of the MAKER-P toolkit to generate de novo repeat databases, ...
ERIC Educational Resources Information Center
Virnig, Sean M.
2013-01-01
In this age of educational accountability, public schools are presumed to have the innate organizational capability to meet academic achievement benchmarks. Fair or not, this presumption also extends to schools serving students who are deaf, a population whose academic achievement continues to be unsatisfactory. This dissertation investigated how…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Greiner, Miles
Radial hydride formation in high-burnup used fuel cladding has the potential to radically reduce its ductility and suitability for long-term storage and eventual transport. To avoid this formation, the maximum post-reactor temperature must remain sufficiently low to limit the cladding hoop stress, and so that hydrogen from the existing circumferential hydrides will not dissolve and become available to re-precipitate into radial hydrides under the slow cooling conditions during drying, transfer and early dry-cask storage. The objective of this research is to develop and experimentally benchmark computational fluid dynamics simulations of heat transfer in post-pool-storage drying operations, when high-burnup fuel cladding is likely to experience its highest temperature. These benchmarked tools can play a key role in evaluating dry cask storage systems for extended storage of high-burnup fuels and post-storage transportation, including fuel retrievability. The benchmarked tools will be used to aid the design of efficient drying processes, as well as estimate variations of surface temperatures as a means of inferring helium integrity inside the canister or cask. This work will be conducted effectively because the principal investigator has experience developing these types of simulations, and has constructed a test facility that can be used to benchmark them.
NASA Astrophysics Data System (ADS)
Bartlett, Philip L.; Stelbovics, Andris T.
2010-02-01
The propagating exterior complex scaling (PECS) method is extended to all four-body processes in electron impact on helium in an S-wave model. Total and energy-differential cross sections are presented with benchmark accuracy for double ionization, single ionization with excitation, and double excitation (to autoionizing states) for incident-electron energies from threshold to 500 eV. While the PECS three-body cross sections for this model given in the preceding article [Phys. Rev. A 81, 022715 (2010)] are in good agreement with other methods, there are considerable discrepancies for these four-body processes. With this model we demonstrate the suitability of the PECS method for the complete solution of the electron-helium system.
NASA Astrophysics Data System (ADS)
Feldt, Jonas; Miranda, Sebastião; Pratas, Frederico; Roma, Nuno; Tomás, Pedro; Mata, Ricardo A.
2017-12-01
In this work, we present an optimized perturbative quantum mechanics/molecular mechanics (QM/MM) method for use in Metropolis Monte Carlo simulations. The model adopted is particularly tailored for the simulation of molecular systems in solution but can be readily extended to other applications, such as catalysis in enzymatic environments. The electrostatic coupling between the QM and MM systems is simplified by applying perturbation theory to estimate the energy changes caused by a movement in the MM system. This approximation, together with the effective use of GPU acceleration, leads to a negligible added computational cost for the sampling of the environment. Benchmark calculations are carried out to evaluate the impact of the approximations applied and the overall computational performance.
New fuzzy support vector machine for the class imbalance problem in medical datasets classification.
Gu, Xiaoqing; Ni, Tongguang; Wang, Hongyuan
2014-01-01
In medical dataset classification, the support vector machine (SVM) is considered to be one of the most successful methods. However, most real-world medical datasets contain some outliers/noise, and the data often have class imbalance problems. In this paper, a fuzzy support vector machine (FSVM) for the class imbalance problem (called FSVM-CIP) is presented, which can be seen as a modified class of FSVM obtained by extending manifold regularization and assigning two misclassification costs for two classes. The proposed FSVM-CIP can be used to handle the class imbalance problem in the presence of outliers/noise, and enhance the locality maximum margin. Five real-world medical datasets, breast, heart, hepatitis, BUPA liver, and pima diabetes, from the UCI medical database are employed to illustrate the method presented in this paper. Experimental results on these datasets show that FSVM-CIP achieves superior or comparable effectiveness.
Dataset definition for CMS operations and physics analyses
NASA Astrophysics Data System (ADS)
Franzoni, Giovanni; Compact Muon Solenoid Collaboration
2016-04-01
Data recorded at the CMS experiment are funnelled into streams, integrated in the HLT menu, and further organised in a hierarchical structure of primary datasets and secondary datasets/dedicated skims. Datasets are defined according to the final-state particles reconstructed by the high level trigger, the data format and the use case (physics analysis, alignment and calibration, performance studies). During the first LHC run, new workflows were added to this canonical scheme to best exploit the flexibility of the CMS trigger and data acquisition systems. The concepts of data parking and data scouting have been introduced to extend the physics reach of CMS, offering the opportunity of defining physics triggers with extremely loose selections (e.g. a dijet resonance trigger collecting data at 1 kHz). In this presentation, we review the evolution of the dataset definition during LHC run I, and we discuss the plans for run II.
NASA Astrophysics Data System (ADS)
Ito, Akihiko; Nishina, Kazuya; Reyer, Christopher P. O.; François, Louis; Henrot, Alexandra-Jane; Munhoven, Guy; Jacquemin, Ingrid; Tian, Hanqin; Yang, Jia; Pan, Shufen; Morfopoulos, Catherine; Betts, Richard; Hickler, Thomas; Steinkamp, Jörg; Ostberg, Sebastian; Schaphoff, Sibyll; Ciais, Philippe; Chang, Jinfeng; Rafique, Rashid; Zeng, Ning; Zhao, Fang
2017-08-01
Simulating vegetation photosynthetic productivity (or gross primary production, GPP) is a critical feature of the biome models used for impact assessments of climate change. We conducted a benchmarking of global GPP simulated by eight biome models participating in the second phase of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP2a) with four meteorological forcing datasets (30 simulations), using independent GPP estimates and recent satellite data of solar-induced chlorophyll fluorescence as a proxy of GPP. The simulated global terrestrial GPP ranged from 98 to 141 Pg C yr-1 (1981-2000 mean); considerable inter-model and inter-data differences were found. Major features of spatial distribution and seasonal change of GPP were captured by each model, showing good agreement with the benchmarking data. All simulations showed incremental trends of annual GPP, seasonal-cycle amplitude, radiation-use efficiency, and water-use efficiency, mainly caused by the CO2 fertilization effect. The incremental slopes were higher than those obtained by remote sensing studies, but comparable with those by recent atmospheric observation. Apparent differences were found in the relationship between GPP and incoming solar radiation, for which forcing data differed considerably. The simulated GPP trends co-varied with a vegetation structural parameter, leaf area index, at model-dependent strengths, implying the importance of constraining canopy properties. In terms of extreme events, GPP anomalies associated with a historical El Niño event and large volcanic eruption were not consistently simulated in the model experiments due to deficiencies in both forcing data and parameterized environmental responsiveness. Although the benchmarking demonstrated the overall advancement of contemporary biome models, further refinements are required, for example, for solar radiation data and vegetation canopy schemes.
García-Jacas, César R; Contreras-Torres, Ernesto; Marrero-Ponce, Yovani; Pupo-Meriño, Mario; Barigye, Stephen J; Cabrera-Leyva, Lisset
2016-01-01
Recently, novel 3D alignment-free molecular descriptors (also known as QuBiLS-MIDAS) based on two-linear, three-linear and four-linear algebraic forms have been introduced. These descriptors codify chemical information for relations between two, three and four atoms by using several (dis-)similarity metrics and multi-metrics. Several studies aimed at assessing the quality of these novel descriptors have been performed. However, a deeper analysis of their performance is necessary. Therefore, in the present manuscript an assessment and statistical validation of the performance of these novel descriptors in QSAR studies is performed. To this end, eight molecular datasets (angiotensin converting enzyme, acetylcholinesterase inhibitors, benzodiazepine receptor, cyclooxygenase-2 inhibitors, dihydrofolate reductase inhibitors, glycogen phosphorylase b, thermolysin inhibitors, thrombin inhibitors) widely used as benchmarks in the evaluation of several procedures are utilized. Three to nine variable QSAR models based on Multiple Linear Regression are built for each chemical dataset according to the original division into training/test sets. Comparisons with respect to leave-one-out cross-validation correlation coefficients[Formula: see text] reveal that the models based on QuBiLS-MIDAS indices possess superior predictive ability in 7 of the 8 datasets analyzed, outperforming methodologies based on similar or more complex techniques such as: Partial Least Square, Neural Networks, Support Vector Machine and others. On the other hand, superior external correlation coefficients[Formula: see text] are attained in 6 of the 8 test sets considered, confirming the good predictive power of the obtained models. For the [Formula: see text] values non-parametric statistic tests were performed, which demonstrated that the models based on QuBiLS-MIDAS indices have the best global performance and yield significantly better predictions in 11 of the 12 QSAR procedures used in the comparison. 
Lastly, a study concerning the performance of the indices according to several conformer generation methods was performed. This demonstrated that the quality of predictions of the QSAR models based on QuBiLS-MIDAS indices depends on the 3D structure generation method considered, although in this preliminary study the results achieved do not present statistically significant differences among them. In conclusion, the QuBiLS-MIDAS indices are suitable for extracting structural information of the molecules and thus constitute a promising alternative to build models that contribute to the prediction of pharmacokinetic, pharmacodynamic and toxicological properties of novel compounds.
Graphical abstract: Comparative graphical representation of the performance of the novel QuBiLS-MIDAS 3D-MDs with respect to other methodologies in QSAR modeling of eight chemical datasets.
NASA Astrophysics Data System (ADS)
Gómez, D. D.; Piñón, D. A.; Smalley, R.; Bevis, M.; Cimbaro, S. R.; Lenzano, L. E.; Barón, J.
2016-03-01
The 2010 (Mw 8.8) Maule, Chile, earthquake produced large co-seismic displacements and non-secular, post-seismic deformation within latitudes 28°S-40°S, extending from the Pacific to the Atlantic oceans. Although these effects are easily resolvable by fitting geodetic extended trajectory models (ETM) to continuous GPS (CGPS) time series, the co- and post-seismic deformation cannot be determined at locations without CGPS (e.g., on passive geodetic benchmarks). To estimate the trajectories of passive geodetic benchmarks, we used CGPS time series to fit an ETM that includes the secular South American plate motion and plate boundary deformation, the co-seismic discontinuity, and the non-secular, logarithmic post-seismic transient produced by the earthquake in the Posiciones Geodésicas Argentinas 2007 (POSGAR07) reference frame (RF). We then used least squares collocation (LSC) to model both the background secular inter-seismic and the non-secular post-seismic components of the ETM at the locations without CGPS. We tested the LSC modeled trajectories using campaign and CGPS data that were not used to generate the model and found standard deviations (95 % confidence level) for position estimates for the north and east components of 3.8 and 5.5 mm, respectively, indicating that the model predicts the post-seismic deformation field very well. Finally, we added the co-seismic displacement field, estimated using an elastic finite element model. The final trajectory model allows accessing the POSGAR07 RF using post-Maule earthquake coordinates within 5 cm for ~91 % of the passive test benchmarks.
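As a minimal sketch of the kind of trajectory model described above, the following Python fits a single coordinate component with an offset, a secular velocity, a co-seismic step and a logarithmic post-seismic transient by linear least squares. The function names, the relaxation time `tau` and the synthetic series are assumptions made for illustration; they are not values or code from the study.

```python
import numpy as np

def etm_design(t, t_eq, tau):
    """Design matrix columns: offset, velocity, co-seismic step, log transient."""
    after = (t >= t_eq).astype(float)          # Heaviside step at the earthquake
    dt = np.where(t >= t_eq, t - t_eq, 0.0)    # time elapsed since the earthquake
    return np.column_stack([np.ones_like(t), t, after, np.log1p(dt / tau)])

def fit_etm(t, x, t_eq, tau=1.0):
    """Least-squares estimate of the four ETM coefficients (tau is held fixed)."""
    A = etm_design(t, t_eq, tau)
    coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
    return coeffs

# Synthetic, noise-free series (hypothetical units: years and millimetres).
t = np.linspace(0.0, 10.0, 500)
true = np.array([2.0, 5.0, -30.0, 12.0])
x = etm_design(t, 4.0, 1.0) @ true
est = fit_etm(t, x, 4.0, 1.0)
```

Holding `tau` fixed keeps the problem linear; estimating it as well would require a nonlinear fit or a search over candidate values.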
Extreme learning machine for ranking: generalization analysis and applications.
Chen, Hong; Peng, Jiangtao; Zhou, Yicong; Li, Luoqing; Pan, Zhibin
2014-05-01
The extreme learning machine (ELM) has attracted increasing attention recently with its successful applications in classification and regression. In this paper, we investigate the generalization performance of ELM-based ranking. A new regularized ranking algorithm is proposed based on the combinations of activation functions in ELM. The generalization analysis is established for the ELM-based ranking (ELMRank) in terms of the covering numbers of hypothesis space. Empirical results on the benchmark datasets show the competitive performance of the ELMRank over the state-of-the-art ranking methods. Copyright © 2014 Elsevier Ltd. All rights reserved.
Bartschat, Klaus; Kushner, Mark J.
2016-01-01
Electron collisions with atoms, ions, molecules, and surfaces are critically important to the understanding and modeling of low-temperature plasmas (LTPs), and so in the development of technologies based on LTPs. Recent progress in obtaining experimental benchmark data and the development of highly sophisticated computational methods is highlighted. With the cesium-based diode-pumped alkali laser and remote plasma etching of Si3N4 as examples, we demonstrate how accurate and comprehensive datasets for electron collisions enable complex modeling of plasma-using technologies that empower our high-technology–based society. PMID:27317740
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ramos-Mendez, J; Faddegon, B; Perl, J
2015-06-15
Purpose: To develop and verify an extension to TOPAS for calculation of dose response models (TCP/NTCP). TOPAS wraps and extends Geant4. Methods: The TOPAS DICOM interface was extended to include structure contours, for subsequent calculation of DVHs and TCP/NTCP. The following dose response models were implemented: Lyman-Kutcher-Burman (LKB), critical element (CE), population based critical volume (CV), parallel-serial, a sigmoid-based model of Niemierko for NTCP and TCP, and a Poisson-based model for TCP. For verification, results for the parallel-serial and Poisson models, with 6 MV x-ray dose distributions calculated with TOPAS and Pinnacle v9.2, were compared to data from the benchmark configuration of the AAPM Task Group 166 (TG166). We provide a benchmark configuration suitable for proton therapy along with results for the implementation of the Niemierko, CV and CE models. Results: The maximum difference in DVH calculated with Pinnacle and TOPAS was 2%. Differences between TG166 data and Monte Carlo calculations of up to 4.2%±6.1% were found for the parallel-serial model and up to 1.0%±0.7% for the Poisson model (including the uncertainty due to lack of knowledge of the point spacing in TG166). For CE, CV and Niemierko models, the discrepancies between the Pinnacle and TOPAS results are 74.5%, 34.8% and 52.1% when using 29.7 cGy point spacing, the differences being highly sensitive to dose spacing. On the other hand, with our proposed benchmark configuration, the largest differences were 12.05%±0.38%, 3.74%±1.6%, 1.57%±4.9% and 1.97%±4.6% for the CE, CV, Niemierko and LKB models, respectively. Conclusion: Several dose response models were successfully implemented with the extension module. Reference data were calculated for future benchmarking. Dose response calculated for the different models varied much more widely for the TG166 benchmark than for the proposed benchmark, which had much lower sensitivity to the choice of DVH dose points.
This work was supported by National Cancer Institute Grant R01CA140735.
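To make the LKB model named above concrete, here is a small sketch in its commonly published form: the differential DVH is reduced to a generalized equivalent uniform dose, which is then passed through a probit function. This is not the TOPAS extension's code; the function names and the parameter values in the example are illustrative assumptions only.

```python
import math

def geud(doses, volumes, n):
    """Generalized equivalent uniform dose from a differential DVH.

    doses: bin doses; volumes: (relative) volume per bin; n: volume-effect parameter.
    """
    total = sum(volumes)
    return sum((v / total) * d ** (1.0 / n) for d, v in zip(doses, volumes)) ** n

def lkb_ntcp(doses, volumes, td50, m, n):
    """LKB NTCP: standard normal CDF of (gEUD - TD50) / (m * TD50)."""
    t = (geud(doses, volumes, n) - td50) / (m * td50)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Hypothetical organ parameters; a uniform dose equal to TD50 gives NTCP = 0.5.
ntcp_at_td50 = lkb_ntcp([50.0], [1.0], td50=50.0, m=0.1, n=0.5)
ntcp_above = lkb_ntcp([60.0], [1.0], td50=50.0, m=0.1, n=0.5)
```

Because the result depends on how the DVH is binned, a sketch like this also shows why the reported discrepancies are sensitive to the dose point spacing.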
Gene selection heuristic algorithm for nutrigenomics studies.
Valour, D; Hue, I; Grimard, B; Valour, B
2013-07-15
Large datasets from -omics studies need to be deeply investigated. The aim of this paper is to provide a new method (LEM method) for the search of transcriptome and metabolome connections. The heuristic algorithm described here extends the classical canonical correlation analysis (CCA) to a high number of variables (without regularization) and combines well-conditioning and fast computing in R. Reduced CCA models are summarized in PageRank matrices, the product of which gives a stochastic matrix that summarizes the self-avoiding walk covered by the algorithm. Then, a homogeneous Markov process applied to this stochastic matrix yields converged probabilities of interconnection between genes, providing a selection of disjoint subsets of genes. This is an alternative to regularized generalized CCA for the determination of blocks within the structure matrix. Each gene subset is thus linked to the whole metabolic or clinical dataset that represents the biological phenotype of interest. Moreover, this selection process meets the needs of biologists, who often require small sets of genes for further validation or extended phenotyping. The algorithm is shown to work efficiently on three published datasets, resulting in meaningfully broadened gene networks.
DAPAGLOCO - A global daily precipitation dataset from satellite and rain-gauge measurements
NASA Astrophysics Data System (ADS)
Spangehl, T.; Danielczok, A.; Dietzsch, F.; Andersson, A.; Schroeder, M.; Fennig, K.; Ziese, M.; Becker, A.
2017-12-01
The BMBF-funded project framework MiKlip (Mittelfristige Klimaprognosen) develops a global climate forecast system on decadal time scales for operational applications. Herein, the DAPAGLOCO project (Daily Precipitation Analysis for the validation of Global medium-range Climate predictions Operationalized) provides a global precipitation dataset as a combination of microwave-based satellite measurements over ocean and rain gauge measurements over land on a daily scale. The DAPAGLOCO dataset is created primarily for the evaluation of the MiKlip forecast system. The HOAPS dataset (Hamburg Ocean Atmosphere Parameter and Fluxes from Satellite data) is used for the derivation of precipitation rates over ocean and is extended by the use of measurements from TMI, GMI, and AMSR-E, in addition to measurements from SSM/I and SSMIS. A 1D-Var retrieval scheme is developed to retrieve rain rates from microwave imager data, which also allows for the determination of uncertainty estimates. Over land, the GPCC (Global Precipitation Climatology Center) Full Data Daily product is used. It consists of rain gauge measurements that are interpolated on a regular grid by ordinary Kriging. The currently available dataset is based on a neural network approach, consists of 21 years of data from 1988 to 2008 and is currently being extended until 2015 using the 1D-Var scheme and with improved sampling. Three dataset versions with different spatial resolutions are available: 1° and 2.5° global, and 0.5° for Europe. The evaluation of the MiKlip forecast system by DAPAGLOCO is based on the indices of the ETCCDI (Expert Team on Climate Change Detection and Indices). Hindcasts are used for the index-based comparison between model and observations. These indices allow for the evaluation of precipitation extremes, their spatial and temporal distribution, as well as the duration of dry and wet spells, average precipitation amounts and percentiles on a global scale.
Besides, an ETCCDI-based climatology of the DAPAGLOCO precipitation dataset has been derived.
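One of the dry-spell indices of the kind mentioned above is the maximum number of consecutive dry days (CDD), with a dry day defined as daily precipitation below 1 mm. The sketch below computes it for a single grid cell; the function name and the sample series are hypothetical, and the abstract does not state which indices DAPAGLOCO actually evaluates.

```python
def max_consecutive_dry_days(precip_mm, threshold=1.0):
    """Longest run of days with precipitation below `threshold` (in mm)."""
    longest = current = 0
    for p in precip_mm:
        current = current + 1 if p < threshold else 0  # extend or reset the run
        longest = max(longest, current)
    return longest

# One hypothetical week of daily totals in mm.
week = [0.0, 0.5, 2.0, 0.0, 0.0, 0.0, 5.0]
cdd = max_consecutive_dry_days(week)
```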
The Global Precipitation Climatology Project (GPCP) Combined Precipitation Dataset
NASA Technical Reports Server (NTRS)
Huffman, George J.; Adler, Robert F.; Arkin, Philip; Chang, Alfred; Ferraro, Ralph; Gruber, Arnold; Janowiak, John; McNab, Alan; Rudolf, Bruno; Schneider, Udo
1997-01-01
The Global Precipitation Climatology Project (GPCP) has released the GPCP Version 1 Combined Precipitation Data Set, a global, monthly precipitation dataset covering the period July 1987 through December 1995. The primary product in the dataset is a merged analysis incorporating precipitation estimates from low-orbit-satellite microwave data, geosynchronous-orbit-satellite infrared data, and rain gauge observations. The dataset also contains the individual input fields, a combination of the microwave and infrared satellite estimates, and error estimates for each field. The data are provided on 2.5 deg x 2.5 deg latitude-longitude global grids. Preliminary analyses show general agreement with prior studies of global precipitation and extend prior studies of El Nino-Southern Oscillation precipitation patterns. At the regional scale there are systematic differences with standard climatologies.
Ren, Kai; Wang, Yuan; Liu, Tingxi; Wang, Guanli
2017-02-01
The data presented in this paper are related to the research article entitled "Exploration of Outdoor Behavior System and Spatial Pattern in the Third Place in Cold Area- based on the perspective of new energy structure" (Ren, 2016) [1]. The dataset was obtained from a field sub-time extended investigation of residents of Power Home Community in Inner Mongolia, China, which belongs to the cold region of ID area according to the Chinese design code for buildings. These field data provide descriptive statistics about the environment-behavior symbiosis system, environment loading, behavior system, spatial demand and spatial pattern for all kinds of residents (older, younger, children). The field dataset is made publicly available to enable critical or extended analyses.
Wang, Huilin; Wang, Mingjun; Tan, Hao; Li, Yuan; Zhang, Ziding; Song, Jiangning
2014-01-01
X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed ‘PredPPCrys’ using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. 
In addition, the predicted crystallization targets of currently non-crystallizable proteins were provided as compendium data, which are anticipated to facilitate target selection and design for the worldwide structural genomics consortium. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys. PMID:25148528
NASA Astrophysics Data System (ADS)
Ahmad, Kashif; Conci, Nicola; Boato, Giulia; De Natale, Francesco G. B.
2017-11-01
Over the last few years, a rapid growth has been witnessed in the number of digital photos produced per year. This rapid process poses challenges in the organization and management of multimedia collections, and one viable solution consists of arranging the media on the basis of the underlying events. However, album-level annotation and the presence of irrelevant pictures in photo collections make event-based organization of personal photo albums a more challenging task. To tackle these challenges, in contrast to conventional approaches relying on supervised learning, we propose a pipeline for event recognition in personal photo collections relying on a multiple instance-learning (MIL) strategy. MIL is a modified form of supervised learning and fits well for such applications with weakly labeled data. The experimental evaluation of the proposed approach is carried out on two large-scale datasets including a self-collected and a benchmark dataset. On both, our approach significantly outperforms the existing state-of-the-art.
Robust QRS peak detection by multimodal information fusion of ECG and blood pressure signals.
Ding, Quan; Bai, Yong; Erol, Yusuf Bugra; Salas-Boni, Rebeca; Zhang, Xiaorong; Hu, Xiao
2016-11-01
QRS peak detection is a challenging problem when the ECG signal is corrupted. However, additional physiological signals may also provide information about the QRS position. In this study, we focus on a unique benchmark provided by the PhysioNet/Computing in Cardiology Challenge 2014 and the Physiological Measurement focus issue: robust detection of heart beats in multimodal data, which aimed to explore robust methods for QRS detection in multimodal physiological signals. A dataset of 200 training and 210 testing records is used, where the testing records are hidden and used only for evaluating performance. An information fusion framework for robust QRS detection is proposed by leveraging existing ECG and ABP analysis tools and combining heart beats derived from different sources. Results show that our approach achieves an overall accuracy of 90.94% and 88.66% on the training and testing datasets, respectively. Furthermore, we observe the expected performance at each step of the proposed approach, as evidence of its effectiveness. Discussion on the limitations of our approach is also provided.
Effect of the time window on the heat-conduction information filtering model
NASA Astrophysics Data System (ADS)
Guo, Qiang; Song, Wen-Jun; Hou, Lei; Zhang, Yi-Lu; Liu, Jian-Guo
2014-05-01
Recommendation systems have been proposed to filter out the potential tastes and preferences of normal online users; however, the physics of the time window effect on performance is missing, which is critical for saving memory and decreasing computational complexity. In this paper, by gradually expanding the time window, we investigate the impact of the time window on the heat-conduction information filtering model with ten similarity measures. The experimental results on the benchmark dataset Netflix indicate that by using only approximately 11.11% of the most recent rating records, the accuracy could be improved by an average of 33.16% and the diversity could be improved by 30.62%. In addition, the recommendation performance on the dataset MovieLens could be preserved by considering only approximately 10.91% of the most recent records. Besides improving the recommendation performance, our discoveries possess significant practical value by largely reducing the computational time and the data storage space.
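The core data-handling step behind such a time-window experiment, keeping only the most recent share of rating records, can be sketched as follows. The record layout `(user, item, timestamp)`, the function name and the rounding rule are assumptions for illustration; the paper's exact windowing procedure is not given in the abstract.

```python
def recent_window(records, fraction):
    """Keep the most recent `fraction` of (user, item, timestamp) records."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(records, key=lambda r: r[2])    # oldest first
    keep = max(1, round(len(ordered) * fraction))    # always keep at least one record
    return ordered[-keep:]

# Ten hypothetical ratings with increasing timestamps 0..9.
ratings = [("u%d" % (i % 3), "item%d" % i, i) for i in range(10)]
latest = recent_window(ratings, 0.2)
```

Expanding the window then amounts to calling the function with increasing `fraction` values and re-running the recommender on each subset.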
Improved hybrid information filtering based on limited time window
NASA Astrophysics Data System (ADS)
Song, Wen-Jun; Guo, Qiang; Liu, Jian-Guo
2014-12-01
Using all of the information collected from users, the hybrid information filtering of heat conduction and mass diffusion (HHM) (Zhou et al., 2010) was successfully proposed to solve the apparent diversity-accuracy dilemma. Since recent behaviors capture the users' potential interests more effectively, we present an improved hybrid information filtering method that uses only part of the recent information. We expand the time window to generate a series of training sets, each of which is treated as known information to predict the future links contained in the testing set. The experimental results on the benchmark dataset Netflix indicate that by using only approximately 31% of the most recent rating records, the accuracy could be improved by an average of 4.22% and the diversity could be improved by 13.74%. In addition, the performance on the dataset MovieLens could be preserved by considering approximately 60% of the most recent records. Furthermore, we find that the improved algorithm is effective in solving the cold-start problem. This work could improve the information filtering performance and shorten the computational time.
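For orientation, the hybrid of heat conduction and mass diffusion is usually written as an item-to-item weight matrix with a tuning parameter λ, where λ = 0 gives pure heat conduction and λ = 1 pure mass diffusion. The sketch below follows that commonly cited form as we understand it; the function name, the toy matrix and the assumption that every user and item has at least one link are ours.

```python
import numpy as np

def hybrid_weights(A, lam):
    """Item-item weights W[a, b] = sum_j A[a, j] A[b, j] / k_j / (k_a^(1-lam) k_b^lam).

    A is a binary items x users matrix; all degrees are assumed non-zero.
    """
    A = np.asarray(A, dtype=float)
    k_item = A.sum(axis=1)            # item degrees
    k_user = A.sum(axis=0)            # user degrees
    overlap = (A / k_user) @ A.T      # shared users, each weighted by 1 / k_j
    return overlap / np.outer(k_item ** (1.0 - lam), k_item ** lam)

# Toy data: 3 items x 3 users, every degree equal to 2.
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])
W_mass = hybrid_weights(A, 1.0)   # mass diffusion: columns sum to 1
W_heat = hybrid_weights(A, 0.0)   # heat conduction: rows sum to 1
scores = hybrid_weights(A, 0.5) @ A   # column u holds the item scores for user u
```

In practice the items a user has already collected would be masked out of `scores` before ranking.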
Processing large remote sensing image data sets on Beowulf clusters
Steinwand, Daniel R.; Maddox, Brian; Beckmann, Tim; Schmidt, Gail
2003-01-01
High-performance computing is often concerned with the speed at which floating-point calculations can be performed. The architectures of many parallel computers and/or their network topologies are based on these investigations. Often, benchmarks resulting from these investigations are compiled with little regard to how a large dataset would move about in these systems. This part of the Beowulf study addresses that concern by looking at specific applications software and system-level modifications. Applications include an implementation of a smoothing filter for time-series data, a parallel implementation of the decision tree algorithm used in the Landcover Characterization project, a parallel Kriging algorithm used to fit point data collected in the field on invasive species to a regular grid, and modifications to the Beowulf project's resampling algorithm to handle larger, higher resolution datasets at a national scale. Systems-level investigations include a feasibility study on Flat Neighborhood Networks and modifications of that concept with Parallel File Systems.
On representation of temporal variability in electricity capacity planning models
Merrick, James H.
2016-08-23
This study systematically investigates how to represent intra-annual temporal variability in models of optimum electricity capacity investment. Inappropriate aggregation of temporal resolution can introduce substantial error into model outputs and associated economic insight. The mechanisms underlying the introduction of this error are shown. How many representative periods are needed to fully capture the variability is then investigated. For a sample dataset, a scenario-robust aggregation of hourly (8760) resolution is possible in the order of 10 representative hours when electricity demand is the only source of variability. The inclusion of wind and solar supply variability increases the resolution of the robust aggregation to the order of 1000. A similar scale of expansion is shown for representative days and weeks. These concepts can be applied to any such temporal dataset, providing, at the least, a benchmark that any other aggregation method can aim to emulate. Finally, how prior information about peak pricing hours can potentially reduce resolution further is also discussed.
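When demand is the only source of variability, one simple way to obtain representative hours is to sort the hourly demand into a load duration curve and average it within equal-sized blocks. The sketch below illustrates that idea only; it is not the aggregation method of the study, and the function name and the synthetic demand series are assumptions.

```python
import numpy as np

def representative_hours(demand, k):
    """Aggregate hourly demand into k representative levels with hour weights."""
    ordered = np.sort(np.asarray(demand, dtype=float))[::-1]   # load duration curve
    blocks = np.array_split(ordered, k)                        # k near-equal blocks
    levels = np.array([b.mean() for b in blocks])              # representative demand
    weights = np.array([len(b) for b in blocks])               # hours each represents
    return levels, weights

# Hypothetical year of hourly demand (8760 values).
demand = np.arange(8760, dtype=float)
levels, weights = representative_hours(demand, 10)
```

Block means preserve total annual energy, but sorting discards chronology, which is one reason correlated wind and solar profiles need far more representative periods.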
A multi-center study benchmarks software tools for label-free proteome quantification
Gillet, Ludovic C; Bernhardt, Oliver M.; MacLean, Brendan; Röst, Hannes L.; Tate, Stephen A.; Tsou, Chih-Chiang; Reiter, Lukas; Distler, Ute; Rosenberger, George; Perez-Riverol, Yasset; Nesvizhskii, Alexey I.; Aebersold, Ruedi; Tenzer, Stefan
2016-01-01
The consistent and accurate quantification of proteins by mass spectrometry (MS)-based proteomics depends on the performance of instruments, acquisition methods and data analysis software. In collaboration with the software developers, we evaluated OpenSWATH, SWATH2.0, Skyline, Spectronaut and DIA-Umpire, five of the most widely used software methods for processing data from SWATH-MS (sequential window acquisition of all theoretical fragment ion spectra), a method that uses data-independent acquisition (DIA) for label-free protein quantification. We analyzed high-complexity test datasets from hybrid proteome samples of defined quantitative composition acquired on two different MS instruments using different SWATH isolation windows setups. For consistent evaluation we developed LFQbench, an R-package to calculate metrics of precision and accuracy in label-free quantitative MS, and report the identification performance, robustness and specificity of each software tool. Our reference datasets enabled developers to improve their software tools. After optimization, all tools provided highly convergent identification and reliable quantification performance, underscoring their robustness for label-free quantitative proteomics. PMID:27701404
Multiclass Posterior Probability Twin SVM for Motor Imagery EEG Classification.
She, Qingshan; Ma, Yuliang; Meng, Ming; Luo, Zhizeng
2015-01-01
Motor imagery electroencephalography is widely used in the brain-computer interface systems. Due to inherent characteristics of electroencephalography signals, accurate and real-time multiclass classification is always challenging. In order to solve this problem, a multiclass posterior probability solution for twin SVM is proposed by the ranking continuous output and pairwise coupling in this paper. First, two-class posterior probability model is constructed to approximate the posterior probability by the ranking continuous output techniques and Platt's estimating method. Secondly, a solution of multiclass probabilistic outputs for twin SVM is provided by combining every pair of class probabilities according to the method of pairwise coupling. Finally, the proposed method is compared with multiclass SVM and twin SVM via voting, and multiclass posterior probability SVM using different coupling approaches. The efficacy on the classification accuracy and time complexity of the proposed method has been demonstrated by both the UCI benchmark datasets and real world EEG data from BCI Competition IV Dataset 2a, respectively.
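Pairwise coupling turns the two-class probabilities r_ij = P(class i | class i or j) into one multiclass posterior. The sketch below uses a simple closed-form rule, p_i ∝ 1 / (Σ_{j≠i} 1/r_ij − (K − 2)), rather than the iterative coupling procedures the paper may have compared; the function name and the example values are illustrative assumptions.

```python
import numpy as np

def couple_pairwise(R):
    """Combine pairwise estimates R[i, j] = P(i | i or j) into class posteriors.

    The diagonal of R is ignored; off-diagonal entries must be non-zero.
    """
    R = np.asarray(R, dtype=float)
    K = R.shape[0]
    p = np.empty(K)
    for i in range(K):
        inv_sum = sum(1.0 / R[i, j] for j in range(K) if j != i)
        p[i] = 1.0 / (inv_sum - (K - 2))
    return p / p.sum()   # renormalize, since noisy estimates need not sum to one

# Consistent pairwise probabilities built from a known posterior.
true_p = np.array([0.5, 0.3, 0.2])
R = true_p[:, None] / (true_p[:, None] + true_p[None, :])
posterior = couple_pairwise(R)
```

With perfectly consistent inputs the rule recovers the underlying posterior exactly; with real classifier outputs it gives only an approximation.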
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pucci, Fabrizio, E-mail: fapucci@ulb.ac.be; Bourgeas, Raphaël, E-mail: rbourgeas@ulb.ac.be; Rooman, Marianne, E-mail: mrooman@ulb.ac.be
We have set up and manually curated a dataset containing experimental information on the impact of amino acid substitutions in a protein on its thermal stability. It consists of a repository of experimentally measured melting temperatures (T_m) and their changes upon point mutations (ΔT_m) for proteins having a well-resolved x-ray structure. This high-quality dataset is designed for the training or benchmarking of in silico thermal stability prediction methods. It also reports other experimentally measured thermodynamic quantities when available, i.e., the folding enthalpy (ΔH) and heat capacity (ΔC_P) of the wild type proteins and their changes upon mutations (ΔΔH and ΔΔC_P), as well as the change in folding free energy (ΔΔG) at a reference temperature. These data are analyzed in view of improving our insights into the correlation between thermal and thermodynamic stabilities, the asymmetry between the number of stabilizing and destabilizing mutations, and the difference in stabilization potential of thermostable versus mesostable proteins.
Guo, Song; Liu, Chunhua; Zhou, Peng; Li, Yanling
2016-01-01
Tyrosine sulfation is one of the ubiquitous protein posttranslational modifications, where some sulfate groups are added to the tyrosine residues. It plays significant roles in various physiological processes in eukaryotic cells. To explore the molecular mechanism of tyrosine sulfation, one of the prerequisites is to correctly identify possible protein tyrosine sulfation residues. In this paper, a novel method was presented to predict protein tyrosine sulfation residues from primary sequences. By means of informative feature construction and elaborate feature selection and parameter optimization scheme, the proposed predictor achieved promising results and outperformed many other state-of-the-art predictors. Using the optimal features subset, the proposed method achieved mean MCC of 94.41% on the benchmark dataset, and a MCC of 90.09% on the independent dataset. The experimental performance indicated that our new proposed method could be effective in identifying the important protein posttranslational modifications and the feature selection scheme would be powerful in protein functional residues prediction research fields. PMID:27034949
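The MCC reported above is the Matthews correlation coefficient, computed from the four cells of a binary confusion matrix. A short sketch of the standard formula follows; the function name and the convention of returning 0 when the denominator vanishes are our own choices.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0   # convention when any marginal total is zero
    return (tp * tn - fp * fn) / denom

perfect = mcc(50, 50, 0, 0)      # every residue classified correctly
chance = mcc(25, 25, 25, 25)     # no better than random
```

MCC ranges from -1 to 1, so a value quoted as 94.41% corresponds to 0.9441 on that scale.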
Stylized facts in social networks: Community-based static modeling
NASA Astrophysics Data System (ADS)
Jo, Hang-Hyun; Murase, Yohsuke; Török, János; Kertész, János; Kaski, Kimmo
2018-06-01
The past analyses of datasets of social networks have enabled us to make empirical findings of a number of aspects of human society, which are commonly featured as stylized facts of social networks, such as broad distributions of network quantities, existence of communities, assortative mixing, and intensity-topology correlations. Since the understanding of the structure of these complex social networks is far from complete, for deeper insight into human society more comprehensive datasets and modeling of the stylized facts are needed. Although the existing dynamical and static models can generate some stylized facts, here we take an alternative approach by devising a community-based static model with heterogeneous community sizes and larger communities having smaller link density and weight. With these few assumptions we are able to generate realistic social networks that show most stylized facts for a wide range of parameters, as demonstrated numerically and analytically. Since our community-based static model is simple to implement and easily scalable, it can be used as a reference system, benchmark, or testbed for further applications.
Brain tumor segmentation using holistically nested neural networks in MRI images.
Zhuge, Ying; Krauze, Andra V; Ning, Holly; Cheng, Jason Y; Arora, Barbara C; Camphausen, Kevin; Miller, Robert W
2017-10-01
Gliomas are rapidly progressive, neurologically devastating, largely fatal brain tumors. Magnetic resonance imaging (MRI) is widely used in the diagnosis and management of gliomas in clinical practice. MRI is also the standard imaging modality used to delineate the brain tumor target as part of treatment planning for radiation therapy. Despite more than 20 yr of research and development, computational brain tumor segmentation in MRI images remains a challenging task. We present a novel method of automatic image segmentation based on holistically nested neural networks that could be employed for brain tumor segmentation of MRI images. Two preprocessing techniques were applied to MRI images. The N4ITK method was employed for correction of bias field distortion. A novel landmark-based intensity normalization method was developed so that tissue types have a similar intensity scale in images of different subjects for the same MRI protocol. The holistically nested neural network (HNN), which extends the convolutional neural network (CNN) with deep supervision through an additional weighted-fusion output layer, was trained to learn the multiscale and multilevel hierarchical appearance representation of the brain tumor in MRI images and was subsequently applied to produce a prediction map of the brain tumor on test images. Finally, the brain tumor was obtained through an optimum thresholding on the prediction map. The proposed method was evaluated on both the Multimodal Brain Tumor Image Segmentation (BRATS) Benchmark 2013 training datasets and clinical data from our institute. A Dice similarity coefficient (DSC) and sensitivity of 0.78 and 0.81, respectively, were achieved on 20 BRATS 2013 training datasets with high-grade gliomas (HGG), based on a two-fold cross-validation. The HNN model built on the BRATS 2013 training data was applied to ten clinical datasets with HGG from a locally developed database.
DSC and sensitivity of 0.83 and 0.85, respectively, were achieved. A quantitative comparison indicated that the proposed method outperforms the popular fully convolutional network (FCN) method. In terms of efficiency, the proposed method took around 10 h for training with 50,000 iterations, and approximately 30 s for testing of a typical MRI image in the BRATS 2013 dataset with a size of 160 × 216 × 176, using a DELL PRECISION workstation T7400 with an NVIDIA Tesla K20c GPU. An effective brain tumor segmentation method for MRI images based on an HNN has been developed. The high level of accuracy and efficiency makes this method practical in brain tumor segmentation. It may play a crucial role in both brain tumor diagnostic analysis and in the treatment planning of radiation therapy. Published 2017. This article is a U.S. Government work and is in the public domain in the USA.
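The two evaluation metrics quoted above, the Dice similarity coefficient and sensitivity, together with the thresholding of a prediction map, can be illustrated with a minimal sketch. The function names, the flat-list voxel representation and the 0.5 cutoff are assumptions for illustration only; the abstract does not say how the optimum threshold was chosen.

```python
def threshold(prob_map, cutoff=0.5):
    """Turn a flat list of tumor probabilities into 0/1 labels.
    The cutoff of 0.5 is a placeholder, not the paper's optimum threshold."""
    return [1 if p >= cutoff else 0 for p in prob_map]


def dice_and_sensitivity(pred, truth):
    """Return (DSC, sensitivity) for two equal-length flat lists of 0/1 labels."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)      # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)  # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)  # false negatives
    # DSC = 2|A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty.
    dsc = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    # Sensitivity = TP / (TP + FN); defined as 1.0 when there is no tumor voxel.
    sens = tp / (tp + fn) if (tp + fn) else 1.0
    return dsc, sens
```

For example, `dice_and_sensitivity(threshold([0.9, 0.8, 0.1, 0.2]), [1, 0, 1, 0])` compares a thresholded prediction map against a reference mask voxel by voxel.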
Suresh, V; Parthasarathy, S
2014-01-01
We developed a support vector machine based web server called SVM-PB-Pred to predict the Protein Block for any given amino acid sequence. The input features of SVM-PB-Pred include i) sequence profiles (PSSM) and ii) actual secondary structures (SS) from the DSSP method or predicted secondary structures from the NPS@ and GOR4 methods. Three combined input features, PSSM+SS(DSSP), PSSM+SS(NPS@) and PSSM+SS(GOR4), were used to train and test the SVM models. Similarly, four datasets, RS90, DB433, LI1264 and SP1577, were used to develop the SVM models. The four SVM models were tested using three different benchmarking tests, namely (i) self consistency, (ii) seven-fold cross-validation and (iii) an independent case test. The maximum prediction accuracy of ~70% was observed in the self consistency test for the SVM models of both the LI1264 and SP1577 datasets, where the PSSM+SS(DSSP) input features were used for testing. The prediction accuracies were reduced to ~53% for PSSM+SS(NPS@) and ~43% for PSSM+SS(GOR4) in the independent case test for the SVM models of the same two datasets. Using our method, it is possible to predict the protein block letters for any query protein sequence with ~53% accuracy when the SP1577 dataset and predicted secondary structure from the NPS@ server are used. The SVM-PB-Pred server can be freely accessed through http://bioinfo.bdu.ac.in/~svmpbpred.
Systematic evaluation of deep learning based detection frameworks for aerial imagery
NASA Astrophysics Data System (ADS)
Sommer, Lars; Steinmann, Lucas; Schumann, Arne; Beyerer, Jürgen
2018-04-01
Object detection in aerial imagery is crucial for many applications in the civil and military domain. In recent years, deep learning based object detection frameworks have significantly outperformed conventional approaches based on hand-crafted features on several datasets. However, these detection frameworks are generally designed and optimized for common benchmark datasets, which differ considerably from aerial imagery, especially in object sizes. As already demonstrated for Faster R-CNN, several adaptations are necessary to account for these differences. In this work, we adapt several state-of-the-art detection frameworks, including Faster R-CNN, R-FCN, and Single Shot MultiBox Detector (SSD), to aerial imagery. We discuss in detail the adaptations that mainly improve the detection accuracy of all frameworks. As the output of deeper convolutional layers comprises more semantic information, these layers are generally used in detection frameworks as feature maps to locate and classify objects. However, the resolution of these feature maps is insufficient for handling small object instances, which results in inaccurate localization or incorrect classification of small objects. Furthermore, state-of-the-art detection frameworks perform bounding box regression to predict the exact object location. For this purpose, so-called anchor or default boxes are used as reference. We demonstrate how an appropriate choice of anchor box sizes can considerably improve detection performance. Furthermore, we evaluate the impact of the performed adaptations on two publicly available datasets to account for various ground sampling distances or differing backgrounds. The presented adaptations can be used as a guideline for further datasets or detection frameworks.
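Why the choice of anchor box sizes matters for small objects can be shown with a short sketch. It generates anchors from a list of sizes and aspect ratios and measures their overlap (intersection over union, IoU) with a ground-truth box. The function names and the pixel sizes used are illustrative assumptions, not the configuration evaluated in the paper.

```python
import math


def make_anchors(cx, cy, sizes, aspect_ratios):
    """Return (x_min, y_min, x_max, y_max) anchors centred on (cx, cy).
    Each anchor has area size**2 and width/height equal to the aspect ratio."""
    boxes = []
    for s in sizes:
        for r in aspect_ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_iou(gt_box, anchors):
    """Highest IoU between a ground-truth box and any anchor."""
    return max(iou(gt_box, a) for a in anchors)
```

With a hypothetical 16 × 16 pixel vehicle at `(92, 92, 108, 108)`, a 128-pixel anchor centred on the object overlaps it only marginally, whereas a 16-pixel anchor matches it closely. An anchor with very low IoU is typically not assigned to the object during training, which is one reason default anchor sizes tuned for common benchmarks can miss small aerial targets.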
USDA-ARS's Scientific Manuscript database
Computer simulation is a useful tool for benchmarking the electrical and fuel energy consumption and water use in a fluid milk plant. In this study, a computer simulation model of the fluid milk process based on high temperature short time (HTST) pasteurization was extended to include models for pr...
Using ontology databases for scalable query answering, inconsistency detection, and data integration
Dou, Dejing
2011-01-01
An ontology database is a basic relational database management system that models an ontology plus its instances. To reason over the transitive closure of instances in the subsumption hierarchy, for example, an ontology database can either unfold views at query time or propagate assertions using triggers at load time. In this paper, we use existing benchmarks to evaluate our method—using triggers—and we demonstrate that by forward computing inferences, we not only improve query time, but the improvement appears to cost only more space (not time). However, we go on to show that the true penalties were simply opaque to the benchmark, i.e., the benchmark inadequately captures load-time costs. We have applied our methods to two case studies in biomedicine, using ontologies and data from genetics and neuroscience to illustrate two important applications: first, ontology databases answer ontology-based queries effectively; second, using triggers, ontology databases detect instance-based inconsistencies—something not possible using views. Finally, we demonstrate how to extend our methods to perform data integration across multiple, distributed ontology databases. PMID:22163378
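The load-time strategy described above, forward computing inferences so that queries need no reasoning, can be sketched as follows. This is an in-memory stand-in for database triggers, written under stated assumptions: the class names, the dictionary-based store and the function names are hypothetical and are not taken from the paper.

```python
def ancestors_of(cls, direct_parents):
    """All superclasses of cls, i.e. the transitive closure of the
    direct-subsumption relation given as {class: set_of_direct_parents}."""
    seen = set()
    stack = list(direct_parents.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(direct_parents.get(parent, ()))
    return seen


def load_instance(store, instance, cls, direct_parents):
    """Emulate a load-time trigger: assert the instance in its own class
    and propagate the assertion to every ancestor class."""
    store.setdefault(cls, set()).add(instance)
    for ancestor in ancestors_of(cls, direct_parents):
        store.setdefault(ancestor, set()).add(instance)


def query(store, cls):
    """Query time is a plain lookup; no view unfolding is needed."""
    return store.get(cls, set())
```

Loading one instance into a class with two ancestors stores three assertions instead of one. That is the space cost the abstract mentions, and the extra work happens at load time, which a query-only benchmark would not measure.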
NASA Astrophysics Data System (ADS)
Dimitriadis, Panayiotis; Tegos, Aristoteles; Oikonomou, Athanasios; Pagana, Vassiliki; Koukouvinos, Antonios; Mamassis, Nikos; Koutsoyiannis, Demetris; Efstratiadis, Andreas
2016-03-01
One-dimensional and quasi-two-dimensional hydraulic freeware models (HEC-RAS, LISFLOOD-FP and FLO-2d) are widely used for flood inundation mapping. These models are tested on a benchmark test with a mixed rectangular-triangular channel cross section. Using a Monte-Carlo approach, we employ extended sensitivity analysis by simultaneously varying the input discharge, longitudinal and lateral gradients and roughness coefficients, as well as the grid cell size. Based on statistical analysis of three output variables of interest, i.e. water depths at the inflow and outflow locations and total flood volume, we investigate the uncertainty enclosed in different model configurations and flow conditions, without the influence of errors and other assumptions on topography, channel geometry and boundary conditions. Moreover, we estimate the uncertainty associated with each input variable and compare it to the overall uncertainty. The outcomes of the benchmark analysis are further highlighted by applying the three models to real-world flood propagation problems, in the context of two challenging case studies in Greece.
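The Monte-Carlo procedure of varying inputs simultaneously and then comparing per-input uncertainty with the overall one can be sketched in a few lines. Note the substitution: instead of HEC-RAS, LISFLOOD-FP or FLO-2d, the sketch uses Manning's normal-depth formula for a wide rectangular channel as a toy stand-in model, and the input ranges are invented for illustration.

```python
import random
import statistics

# Hypothetical input ranges (SI units); not the ranges used in the study.
RANGES = {
    "q": (1.0, 5.0),          # unit discharge, m^2/s
    "n": (0.02, 0.06),        # Manning roughness coefficient
    "slope": (0.0005, 0.005), # longitudinal gradient
}


def normal_depth(q, n, slope):
    """Manning normal depth for a wide rectangular channel:
    q = (1/n) * h**(5/3) * sqrt(slope)  =>  h = (n*q/sqrt(slope))**(3/5)."""
    return (n * q / slope ** 0.5) ** 0.6


def sample_depths(n_runs, vary=None, seed=0):
    """Run the toy model n_runs times. With vary=None all inputs are sampled
    simultaneously; otherwise only the named input varies and the others
    are held at the midpoint of their range."""
    rng = random.Random(seed)
    depths = []
    for _ in range(n_runs):
        params = {}
        for name, (lo, hi) in RANGES.items():
            if vary is None or name == vary:
                params[name] = rng.uniform(lo, hi)
            else:
                params[name] = (lo + hi) / 2
        depths.append(normal_depth(**params))
    return depths


def spread_report(n_runs=2000, seed=0):
    """Standard deviation of depth overall and per varied input."""
    report = {"all": statistics.stdev(sample_depths(n_runs, None, seed))}
    for name in RANGES:
        report[name] = statistics.stdev(sample_depths(n_runs, name, seed))
    return report
```

Calling `spread_report()` returns one spread value per input next to the overall spread, which mirrors the comparison described in the abstract on a much simpler model.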
On the predictability of land surface fluxes from meteorological variables
NASA Astrophysics Data System (ADS)
Haughton, Ned; Abramowitz, Gab; Pitman, Andy J.
2018-01-01
Previous research has shown that land surface models (LSMs) are performing poorly when compared with relatively simple empirical models over a wide range of metrics and environments. Atmospheric driving data appear to provide information about land surface fluxes that LSMs are not fully utilising. Here, we further quantify the information available in the meteorological forcing data that are used by LSMs for predicting land surface fluxes, by interrogating FLUXNET data, and extending the benchmarking methodology used in previous experiments. We show that substantial performance improvement is possible for empirical models using meteorological data alone, with no explicit vegetation or soil properties, thus setting lower bounds on a priori expectations on LSM performance. The process also identifies key meteorological variables that provide predictive power. We provide an ensemble of empirical benchmarks that are simple to reproduce and provide a range of behaviours and predictive performance, acting as a baseline benchmark set for future studies. We reanalyse previously published LSM simulations and show that there is more diversity between LSMs than previously indicated, although it remains unclear why LSMs are broadly performing so much worse than simple empirical models.
Extending TOPS: Ontology-driven Anomaly Detection and Analysis System
NASA Astrophysics Data System (ADS)
Votava, P.; Nemani, R. R.; Michaelis, A.
2010-12-01
Terrestrial Observation and Prediction System (TOPS) is a flexible modeling software system that integrates ecosystem models with frequent satellite and surface weather observations to produce ecosystem nowcasts (assessments of current conditions) and forecasts useful in natural resources management, public health and disaster management. We have been extending the Terrestrial Observation and Prediction System (TOPS) to include a capability for automated anomaly detection and analysis of both on-line (streaming) and off-line data. In order to best capture the knowledge about data hierarchies, Earth science models and implied dependencies between anomalies and occurrences of observable events such as urbanization, deforestation, or fires, we have developed an ontology to serve as a knowledge base. We can query the knowledge base and answer questions about dataset compatibilities, similarities and dependencies so that we can, for example, automatically analyze similar datasets in order to verify a given anomaly occurrence in multiple data sources. We are further extending the system to go beyond anomaly detection towards reasoning about possible causes of anomalies that are also encoded in the knowledge base as either learned or implied knowledge. This enables us to scale up the analysis by eliminating a large number of anomalies early on during the processing by either failure to verify them from other sources, or matching them directly with other observable events without having to perform an extensive and time-consuming exploration and analysis. The knowledge is captured using OWL ontology language, where connections are defined in a schema that is later extended by including specific instances of datasets and models. The information is stored using Sesame server and is accessible through both Java API and web services using SeRQL and SPARQL query languages. Inference is provided using OWLIM component integrated with Sesame.
Zhang, Haitao; Wu, Chenxue; Chen, Zewei; Liu, Zhao; Zhu, Yunhong
2017-01-01
Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. 
Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules. PMID:28767687
V-FOR-WaTer - a new virtual research environment for environmental research
NASA Astrophysics Data System (ADS)
Strobl, Marcus; Azmi, Elnaz; Hassler, Sibylle; Mälicke, Mirko; Meyer, Jörg; Zehe, Erwin
2017-04-01
The preparation of heterogeneous datasets for scientific analysis is still a demanding task. Data preprocessing for hydrological models typically involves gathering datasets from different sources, extensive work within geoinformation systems, data transformation, the generation of computational grids and the definition of initial and boundary conditions. V-FOR-WaTer, a standardized and scalable data hub with compatible analysis tools, will ease comprehensive studies and significantly reduce data preparation time. The idea behind V-FOR-WaTer is to bring together various datasets (e.g. point measurements, 2D/3D data, time series data) from different sources (e.g. gathered in research projects, or as part of regular monitoring of state offices) and to provide common as well as innovative scaling tools in space and time to generate a coherent data grid. Each dataset holds detailed standardized metadata to ensure usability of the data, offer a comprehensive search function and provide reference information for appropriate citation of the dataset creators. V-FOR-WaTer includes a basis of data and tools, but its purpose is to grow by users who extend the virtual research environment with their own tools and research data. Researchers who upload new data or tools can receive a digital object identifier, or protect their data and tools from others until publication. Access to data and tools provided from V-FOR-WaTer happens via an easy-to-use web portal. Due to its modular architecture the portal is ready to be extended with new tools and features and also offers interfaces to Matlab, Python and R.
Comprehensive decision tree models in bioinformatics.
Stiglic, Gregor; Kocbek, Simon; Pernek, Igor; Kokol, Peter
2012-01-01
Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of reasoning behind the classification model are possible. This paper presents an extension to an existing machine learning environment and a study on visual tuning of decision tree classifiers. The motivation for this research comes from the need to build effective and easily interpretable decision tree models by a so-called one-button data mining approach where no parameter tuning is needed. To avoid bias in classification, no classification performance measure is used during the tuning of the model, which is constrained exclusively by the dimensions of the produced decision tree. The proposed visual tuning of decision trees was evaluated on 40 datasets containing classical machine learning problems and 31 datasets from the field of bioinformatics. Although we did not expect significant differences in classification performance, the results demonstrate a significant increase of accuracy in less complex visually tuned decision trees. In contrast to classical machine learning benchmarking datasets, we observe higher accuracy gains in bioinformatics datasets. Additionally, a user study was carried out to confirm the assumption that the tree tuning times are significantly lower for the proposed method in comparison to manual tuning of the decision tree. The empirical results demonstrate that by building simple models constrained by predefined visual boundaries, one not only achieves good comprehensibility, but also very good classification performance that does not differ from usually more complex models built using default settings of the classical decision tree algorithm.
In addition, our study demonstrates the suitability of visually tuned decision trees for datasets with binary class attributes and a high number of possibly redundant attributes that are very common in bioinformatics. PMID:22479449
Scalable and cost-effective NGS genotyping in the cloud.
Souilmi, Yassine; Lancaster, Alex K; Jung, Jae-Yoon; Rizzo, Ettore; Hawkins, Jared B; Powles, Ryan; Amzazi, Saaïd; Ghazal, Hassan; Tonellato, Peter J; Wall, Dennis P
2015-10-15
While next-generation sequencing (NGS) costs have plummeted in recent years, the cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at a cost in the tens of dollars. We take a step towards addressing this challenge by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows making optimal use of high-performance compute clusters. Here we show that the Amazon Web Services (AWS) implementation of GenomeKey via COSMOS provides a fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Our systematic benchmarking reveals important new insights and considerations for achieving clinical turn-around of whole genome analysis, optimization and workflow management, including strategic batching of individual genomes and efficient cluster resource configuration.
Big heart data: advancing health informatics through data sharing in cardiovascular imaging.
Suinesiaputra, Avan; Medrano-Gracia, Pau; Cowan, Brett R; Young, Alistair A
2015-07-01
The burden of heart disease is rapidly worsening due to the increasing prevalence of obesity and diabetes. Data sharing and open database resources for heart health informatics are important for advancing our understanding of cardiovascular function, disease progression and therapeutics. Data sharing enables valuable information, often obtained at considerable expense and effort, to be reused beyond the specific objectives of the original study. Many government funding agencies and journal publishers are requiring data reuse, and are providing mechanisms for data curation and archival. Tools and infrastructure are available to archive anonymous data from a wide range of studies, from descriptive epidemiological data to gigabytes of imaging data. Meta-analyses can be performed to combine raw data from disparate studies to obtain unique comparisons or to enhance statistical power. Open benchmark datasets are invaluable for validating data analysis algorithms and objectively comparing results. This review provides a rationale for increased data sharing and surveys recent progress in the cardiovascular domain. We also highlight the potential of recent large cardiovascular epidemiological studies enabling collaborative efforts to facilitate data sharing, algorithms benchmarking, disease modeling and statistical atlases.
DeltaSA tool for source apportionment benchmarking, description and sensitivity analysis
NASA Astrophysics Data System (ADS)
Pernigotti, D.; Belis, C. A.
2018-05-01
DeltaSA is an R-package and a Java on-line tool developed at the EC-Joint Research Centre to assist and benchmark source apportionment applications. Its key functionalities support two critical tasks in this kind of study: the assignment of a factor to a source in factor analytical models (source identification) and the model performance evaluation. The source identification is based on the similarity between a given factor and source chemical profiles from public databases. The model performance evaluation is based on statistical indicators used to compare model output with reference values generated in intercomparison exercises. The reference values are calculated as the ensemble average of the results reported by participants that have passed a set of testing criteria based on chemical profiles and time series similarity. In this study, a sensitivity analysis of the model performance criteria is carried out using the results of a synthetic dataset where "a priori" references are available. The consensus modulated standard deviation punc gives the best choice for the model performance evaluation when a conservative approach is adopted.
Object Detection and Classification by Decision-Level Fusion for Intelligent Vehicle Systems.
Oh, Sang-Il; Kang, Hang-Bong
2017-01-22
To understand driving environments effectively, sensor-based intelligent vehicle systems must detect and classify objects accurately. Object detection is performed for the localization of objects, whereas object classification recognizes object classes from detected object regions. For accurate object detection and classification, fusing multiple sensor information into a key component of the representation and perception processes is necessary. In this paper, we propose a new object-detection and classification method using decision-level fusion. We fuse the classification outputs from independent unary classifiers, such as 3D point clouds and image data using a convolutional neural network (CNN). The unary classifiers for the two sensors are the CNN with five layers, which use more than two pre-trained convolutional layers to consider local to global features as data representation. To represent data using convolutional layers, we apply region of interest (ROI) pooling to the outputs of each layer on the object candidate regions generated using object proposal generation to realize color flattening and semantic grouping for charge-coupled device and Light Detection And Ranging (LiDAR) sensors. We evaluate our proposed method on a KITTI benchmark dataset to detect and classify three object classes: cars, pedestrians and cyclists. The evaluation results show that the proposed method achieves better performance than the previous methods. Our proposed method extracted approximately 500 proposals on a 1226 × 370 image, whereas the original selective search method extracted approximately 10^6 × n proposals. We obtained classification performance with 77.72% mean average precision over the entirety of the classes in the moderate detection level of the KITTI benchmark dataset.
Lenselink, Eelke B; Ten Dijke, Niels; Bongers, Brandon; Papadatos, George; van Vlijmen, Herman W T; Kowalczyk, Wojtek; IJzerman, Adriaan P; van Westen, Gerard J P
2017-08-14
The increase of publicly available bioactivity data in recent years has fueled and catalyzed research in chemogenomics, data mining, and modeling approaches. As a direct result, over the past few years a multitude of different methods have been reported and evaluated, such as target fishing, nearest neighbor similarity-based methods, and Quantitative Structure Activity Relationship (QSAR)-based protocols. However, such studies are typically conducted on different datasets, using different validation strategies, and different metrics. In this study, different methods were compared using one single standardized dataset obtained from ChEMBL, which is made available to the public, using standardized metrics (BEDROC and Matthews Correlation Coefficient). Specifically, the performance of Naïve Bayes, Random Forests, Support Vector Machines, Logistic Regression, and Deep Neural Networks was assessed using QSAR and proteochemometric (PCM) methods. All methods were validated using both a random split validation and a temporal validation, with the latter being a more realistic benchmark of expected prospective execution. Deep Neural Networks are the top performing classifiers, highlighting the added value of Deep Neural Networks over other more conventional methods. Moreover, the best method ('DNN_PCM') performed significantly better at almost one standard deviation higher than the mean performance. Furthermore, Multi-task and PCM implementations were shown to improve performance over single task Deep Neural Networks. Conversely, target prediction performed almost two standard deviations under the mean performance. Random Forests, Support Vector Machines, and Logistic Regression performed around mean performance. Finally, using an ensemble of DNNs, alongside additional tuning, enhanced the relative performance by another 27% (compared with unoptimized 'DNN_PCM'). 
Here, a standardized set to test and evaluate different machine learning algorithms in the context of multi-task learning is offered by providing the data and the protocols.
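Two ingredients of the benchmark described above, the Matthews Correlation Coefficient and the distinction between a random split and a temporal split, can be sketched briefly. The function names, the `(year, item)` record layout and the cutoff year are assumptions for illustration; the abstract does not specify how the temporal split was implemented.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (0/1):
    (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t and p:
            tp += 1
        elif p and not t:
            fp += 1
        elif t and not p:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def temporal_split(records, cutoff_year):
    """Split (year, item) records so that the model is trained only on data
    published before cutoff_year and tested on data from that year onward."""
    train = [r for r in records if r[0] < cutoff_year]
    test = [r for r in records if r[0] >= cutoff_year]
    return train, test
```

A temporal split keeps newer records out of training, so the test set resembles the prospective setting in which a model is applied to compounds it has not yet seen. That is why the abstract treats it as the more realistic benchmark.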
Object Detection and Classification by Decision-Level Fusion for Intelligent Vehicle Systems
Oh, Sang-Il; Kang, Hang-Bong
2017-01-01
To understand driving environments effectively, sensor-based intelligent vehicle systems must detect and classify objects accurately. Object detection is performed for the localization of objects, whereas object classification recognizes object classes from detected object regions. For accurate object detection and classification, fusing multiple sensor information into a key component of the representation and perception processes is necessary. In this paper, we propose a new object-detection and classification method using decision-level fusion. We fuse the classification outputs from independent unary classifiers, such as 3D point clouds and image data using a convolutional neural network (CNN). The unary classifiers for the two sensors are the CNN with five layers, which use more than two pre-trained convolutional layers to consider local to global features as data representation. To represent data using convolutional layers, we apply region of interest (ROI) pooling to the outputs of each layer on the object candidate regions generated using object proposal generation to realize color flattening and semantic grouping for charge-coupled device and Light Detection And Ranging (LiDAR) sensors. We evaluate our proposed method on a KITTI benchmark dataset to detect and classify three object classes: cars, pedestrians and cyclists. The evaluation results show that the proposed method achieves better performance than the previous methods. Our proposed method extracted approximately 500 proposals on a 1226 × 370 image, whereas the original selective search method extracted approximately 10^6 × n proposals. We obtained classification performance with 77.72% mean average precision over the entirety of the classes in the moderate detection level of the KITTI benchmark dataset. PMID:28117742
SparseBeads data: benchmarking sparsity-regularized computed tomography
NASA Astrophysics Data System (ADS)
Jørgensen, Jakob S.; Coban, Sophia B.; Lionheart, William R. B.; McDonald, Samuel A.; Withers, Philip J.
2017-12-01
Sparsity regularization (SR) such as total variation (TV) minimization allows accurate image reconstruction in x-ray computed tomography (CT) from fewer projections than analytical methods. Exactly how few projections suffice and how this number may depend on the image remain poorly understood. Compressive sensing connects the critical number of projections to the image sparsity but does not cover CT; however, empirical results suggest a similar connection. The present work establishes for real CT data a connection between gradient sparsity and the sufficient number of projections for accurate TV-regularized reconstruction. A collection of 48 x-ray CT datasets called SparseBeads was designed for benchmarking SR reconstruction algorithms. Beadpacks comprising glass beads of five different sizes as well as mixtures were scanned in a micro-CT scanner to provide structured datasets with variable image sparsity levels, number of projections and noise levels to allow the systematic assessment of parameters affecting performance of SR reconstruction algorithms. Using the SparseBeads data, TV-regularized reconstruction quality was assessed as a function of numbers of projections and gradient sparsity. The critical number of projections for satisfactory TV-regularized reconstruction increased almost linearly with the gradient sparsity. This establishes a quantitative guideline from which one may predict how few projections to acquire based on expected sample sparsity level as an aid in planning of dose- or time-critical experiments. The results are expected to hold for samples of similar characteristics, i.e. consisting of few, distinct phases with relatively simple structure. Such cases are plentiful in porous media, composite materials, foams, as well as non-destructive testing and metrology. For samples of other characteristics the proposed methodology may be used to investigate similar relations.
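To make the two quantities in this abstract concrete, the following is a minimal sketch of how gradient sparsity and isotropic total variation might be computed for a 2D image. The abstract does not give the authors' exact discretization, so the forward-difference scheme, the tolerance `tol`, and the function names are assumptions made for illustration only.

```python
import numpy as np

def _forward_gradient_magnitude(image):
    """Gradient magnitude from forward differences (assumed discretization)."""
    img = np.asarray(image, dtype=float)
    # Replicate the last column/row so the output keeps the image shape
    # and the boundary differences are zero.
    dx = np.diff(img, axis=1, append=img[:, -1:])
    dy = np.diff(img, axis=0, append=img[-1:, :])
    return np.hypot(dx, dy)

def gradient_sparsity(image, tol=1e-8):
    """Number of pixels whose gradient magnitude exceeds tol."""
    return int(np.count_nonzero(_forward_gradient_magnitude(image) > tol))

def isotropic_tv(image):
    """Isotropic total variation: sum of gradient magnitudes over all pixels."""
    return float(_forward_gradient_magnitude(image).sum())

# A piecewise-constant test image: a 2x2 block of ones on a 4x4 background.
block = np.zeros((4, 4))
block[1:3, 1:3] = 1.0
sparsity = gradient_sparsity(block)
```

A piecewise-constant image such as `block` has few nonzero gradient pixels, which is the regime in which the abstract reports that few projections suffice.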
NASA Astrophysics Data System (ADS)
Zhang, Z.; Zimmermann, N. E.; Poulter, B.
2015-11-01
Simulations of the spatial-temporal dynamics of wetlands are key to understanding the role of wetland biogeochemistry under past and future climate variability. Hydrologic inundation models, such as TOPMODEL, are based on a fundamental parameter known as the compound topographic index (CTI) and provide a computationally cost-efficient approach to simulate wetland dynamics at global scales. However, there remains a large discrepancy in the implementations of TOPMODEL in land-surface models (LSMs) and thus in their performance against observations. This study describes new improvements to TOPMODEL implementation and estimates of global wetland dynamics using the LPJ-wsl dynamic global vegetation model (DGVM), and quantifies uncertainties by comparing the effects of three digital elevation model products (HYDRO1k, GMTED, and HydroSHEDS) of different spatial resolution and accuracy on simulated inundation dynamics. In addition, we found that calibrating TOPMODEL with a benchmark wetland dataset can help to successfully delineate the seasonal and interannual variations of wetlands, as well as improve the spatial distribution of wetlands to be consistent with inventories. The HydroSHEDS DEM, using a river-basin scheme for aggregating the CTI, shows the best accuracy for capturing the spatio-temporal dynamics of wetlands among the three DEM products. The estimate of global wetland potential/maximum is ∼10.3 Mkm² (10⁶ km²), with a mean annual maximum of ∼5.17 Mkm² for 1980-2010. This study demonstrates the feasibility of capturing the spatial heterogeneity of inundation and of estimating seasonal and interannual variations in wetland by coupling a hydrological module in LSMs with appropriate benchmark datasets. It additionally highlights the importance of an adequate investigation of topographic indices for simulating global wetlands and shows the opportunity to converge wetland estimates across LSMs by identifying the uncertainty associated with existing wetland products.
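For readers unfamiliar with the CTI named in this abstract, the commonly used definition is CTI = ln(a / tan β), where a is the upslope contributing area per unit contour length and β is the local slope. The sketch below evaluates that formula for a single cell; the function name and the `min_slope_deg` floor used to avoid division by zero on flat terrain are assumptions for illustration, not part of the LPJ-wsl implementation.

```python
import math

def compound_topographic_index(specific_catchment_area, slope_deg, min_slope_deg=0.01):
    """CTI = ln(a / tan(beta)) for one grid cell.

    specific_catchment_area: upslope area per unit contour length (e.g. m^2 per m).
    slope_deg: local slope in degrees; floored at min_slope_deg (an assumed
    safeguard) so that flat cells do not divide by zero.
    """
    slope = math.radians(max(slope_deg, min_slope_deg))
    return math.log(specific_catchment_area / math.tan(slope))

# Flat, high-accumulation cells get large CTI values and are more likely to flood.
flat_cell = compound_topographic_index(100.0, 1.0)
steep_cell = compound_topographic_index(100.0, 30.0)
```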
A novel binary shape context for 3D local surface description
NASA Astrophysics Data System (ADS)
Dong, Zhen; Yang, Bisheng; Liu, Yuan; Liang, Fuxun; Li, Bijun; Zang, Yufu
2017-08-01
3D local surface description is now at the core of many computer vision technologies, such as 3D object recognition, intelligent driving, and 3D model reconstruction. However, most of the existing 3D feature descriptors still suffer from low descriptiveness, weak robustness, and inefficiency in both time and memory. To overcome these challenges, this paper presents a robust and descriptive 3D Binary Shape Context (BSC) descriptor with high efficiency in both time and memory. First, a novel BSC descriptor is generated for 3D local surface description, and the performance of the BSC descriptor under different settings of its parameters is analyzed. Next, the descriptiveness, robustness, and efficiency in both time and memory of the BSC descriptor are evaluated and compared to those of several state-of-the-art 3D feature descriptors. Finally, the performance of the BSC descriptor for 3D object recognition is also evaluated on a number of popular benchmark datasets and on an urban-scene dataset collected by a terrestrial laser scanner system. Comprehensive experiments demonstrate that the proposed BSC descriptor obtained high descriptiveness, strong robustness, and high efficiency in both time and memory and achieved high recognition rates of 94.8%, 94.1% and 82.1% on the considered UWA, Queen, and WHU datasets, respectively.
Zhang, Jian; Gao, Bo; Chai, Haiting; Ma, Zhiqiang; Yang, Guifu
2016-08-26
DNA-binding proteins (DBPs) play fundamental roles in many biological processes. Therefore, the development of effective computational tools for identifying DBPs is becoming highly desirable. In this study, we proposed an accurate method for the prediction of DBPs. Firstly, we focused on the challenge of improving DBP prediction accuracy with information solely from the sequence. Secondly, we used multiple informative features to encode the protein. These features included evolutionary conservation profile, secondary structure motifs, and physicochemical properties. Thirdly, we introduced a novel improved Binary Firefly Algorithm (BFA) to remove redundant or noisy features as well as select optimal parameters for the classifier. In experiments on two benchmark datasets, our predictor outperformed many state-of-the-art predictors, which revealed the effectiveness of our method. The promising prediction performance on a newly compiled independent testing dataset from PDB and a large-scale dataset from UniProt proved the good generalization ability of our method. In addition, the BFA developed in this research would be of great potential in practical applications in optimization fields, especially in feature selection problems. A highly accurate method was proposed for the identification of DBPs. A user-friendly web-server named iDbP (identification of DNA-binding Proteins) was constructed and provided for academic use.
BMDExpress Data Viewer: A Visualization Tool to Analyze ...
Regulatory agencies increasingly apply benchmark dose (BMD) modeling to determine points of departure in human risk assessments. BMDExpress applies BMD modeling to transcriptomics datasets and groups genes to biological processes and pathways for rapid assessment of doses at which biological perturbations occur. However, graphing and analytical capabilities within BMDExpress are limited, and the analysis of output files is challenging. We developed a web-based application, BMDExpress Data Viewer, for visualization and graphical analyses of BMDExpress output files. The software application consists of two main components: ‘Summary Visualization Tools’ and ‘Dataset Exploratory Tools’. We demonstrate through two case studies that the ‘Summary Visualization Tools’ can be used to examine and assess the distributions of probe and pathway BMD outputs, as well as derive a potential regulatory BMD through the modes or means of the distributions. The ‘Functional Enrichment Analysis’ tool presents biological processes in a two-dimensional bubble chart view. By applying filters of pathway enrichment p-value and minimum number of significant genes, we showed that the Functional Enrichment Analysis tool can be applied to select pathways that are potentially sensitive to chemical perturbations. The ‘Multiple Dataset Comparison’ tool enables comparison of BMDs across multiple experiments (e.g., across time points, tissues, or organisms, etc.). The ‘BMDL-BM
Booma, P. M.; Prabhakaran, S.; Dhanalakshmi, R.
2014-01-01
Microarray gene expression datasets have attracted great attention among molecular biologists, statisticians, and computer scientists. Data mining that extracts hidden and useful information from datasets fails to identify the most significant biological associations between genes. A heuristic search for a standard biological process measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process is not efficiently addressed. To monitor a higher rate of expression levels between genes, a hierarchical clustering model is proposed, where the biological association between genes is measured simultaneously using a proximity measure of improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify associations between the clusters. Moreover, GL-PCPHC applies a pattern-growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for the GL-PCPHC model with standard benchmark gene expression datasets extracted from the UCI repository and GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality. PMID:25136661
Automatic building extraction from LiDAR data fusion of point and grid-based features
NASA Astrophysics Data System (ADS)
Du, Shouji; Zhang, Yunsheng; Zou, Zhengrong; Xu, Shenghua; He, Xue; Chen, Siyang
2017-08-01
This paper proposes a method for extracting buildings from LiDAR point cloud data by combining point-based and grid-based features. To accurately discriminate buildings from vegetation, a point feature based on the variance of normal vectors is proposed. For robust building extraction, a graph cuts algorithm is employed to combine the features used and to consider the neighborhood context information. As grid feature computation and the graph cuts algorithm are performed on a grid structure, a feature-retained DSM interpolation method is proposed in this paper. The proposed method is validated on the benchmark ISPRS Test Project on Urban Classification and 3D Building Reconstruction and compared to state-of-the-art methods. The evaluation shows that the proposed method can obtain a promising result both at area-level and at object-level. The method is further applied to the entire ISPRS dataset and to a real dataset of Wuhan City. The results show a completeness of 94.9% and a correctness of 92.2% at the per-area level for the former dataset and a completeness of 94.4% and a correctness of 95.8% for the latter one. The proposed method has good potential for large-size LiDAR data.
Monroe, J Grey; Allen, Zachariah A; Tanger, Paul; Mullen, Jack L; Lovell, John T; Moyers, Brook T; Whitley, Darrell; McKay, John K
2017-01-01
Recent advances in nucleic acid sequencing technologies have led to a dramatic increase in the number of markers available to generate genetic linkage maps. This increased marker density can be used to improve genome assemblies as well as add much needed resolution for loci controlling variation in ecologically and agriculturally important traits. However, traditional genetic map construction methods from these large marker datasets can be computationally prohibitive and highly error prone. We present TSPmap, a method which implements both approximate and exact Traveling Salesperson Problem solvers to generate linkage maps. We demonstrate that for datasets with large numbers of genomic markers (e.g. 10,000) and in multiple population types generated from inbred parents, TSPmap can rapidly produce high quality linkage maps with low sensitivity to missing and erroneous genotyping data compared to two other benchmark methods, JoinMap and MSTmap. TSPmap is open source and freely available as an R package. With the advancement of low cost sequencing technologies, the number of markers used in the generation of genetic maps is expected to continue to rise. TSPmap will be a useful tool to handle such large datasets into the future, quickly producing high quality maps using a large number of genomic markers.
SPICE for ESA Planetary Missions
NASA Astrophysics Data System (ADS)
Costa, M.
2017-09-01
SPICE is an information system that provides the geometry needed to plan scientific observations and to analyze the data obtained. The ESA SPICE Service generates the SPICE Kernel datasets for all the active ESA planetary missions. This contribution describes the current status of the datasets, the extended services and the SPICE support provided to the ESA Planetary Missions (Mars-Express, ExoMars2016, BepiColombo, JUICE, Rosetta, Venus-Express and SMART-1) for the benefit of the science community.
cellVIEW: a Tool for Illustrative and Multi-Scale Rendering of Large Biomolecular Datasets
Le Muzic, Mathieu; Autin, Ludovic; Parulek, Julius; Viola, Ivan
2017-01-01
In this article we introduce cellVIEW, a new system to interactively visualize large biomolecular datasets on the atomic level. Our tool is unique and has been specifically designed to match the ambitions of our domain experts to model and interactively visualize structures comprising several billion atoms. The cellVIEW system integrates acceleration techniques to allow for real-time graphics performance of 60 Hz display rate on datasets representing large viruses and bacterial organisms. Inspired by the work of scientific illustrators, we propose a level-of-detail scheme whose purpose is two-fold: accelerating the rendering and reducing visual clutter. The main part of our datasets is made up of macromolecules, but it also comprises nucleic acid strands, which are stored as sets of control points. For that specific case, we extend our rendering method to support the dynamic generation of DNA strands directly on the GPU. It is noteworthy that our tool has been directly implemented inside a game engine. We chose to rely on a third party engine to reduce software development work-load and to make bleeding-edge graphics techniques more accessible to the end-users. To our knowledge cellVIEW is the only suitable solution for interactive visualization of large biomolecular landscapes on the atomic level and is freely available to use and extend. PMID:29291131
Robust Tomography using Randomized Benchmarking
NASA Astrophysics Data System (ADS)
Silva, Marcus; Kimmel, Shelby; Johnson, Blake; Ryan, Colm; Ohki, Thomas
2013-03-01
Conventional randomized benchmarking (RB) can be used to estimate the fidelity of Clifford operations in a manner that is robust against preparation and measurement errors -- thus allowing for a more accurate and relevant characterization of the average error in Clifford gates compared to standard tomography protocols. Interleaved RB (IRB) extends this result to the extraction of error rates for individual Clifford gates. In this talk we will show how to combine multiple IRB experiments to extract all information about the unital part of any trace preserving quantum process. Consequently, one can compute the average fidelity to any unitary, not just the Clifford group, with tighter bounds than IRB. Moreover, the additional information can be used to design improvements in control. MS, BJ, CR and TO acknowledge support from IARPA under contract W911NF-10-1-0324.
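As background for the RB and IRB quantities mentioned in this abstract, standard RB fits the average sequence fidelity to a decay of the form A·p^m + B over sequence length m and converts the decay parameter p into an average error rate; interleaved RB compares the decay with and without the gate of interest. The sketch below shows only these two textbook conversions, not the tomography procedure described in the talk. The function names are hypothetical, and the decay parameters are assumed to have been fitted already.

```python
def rb_average_error(decay_p, n_qubits=1):
    """Average error per Clifford from the RB decay parameter p.

    Uses r = (d - 1) * (1 - p) / d with d = 2 ** n_qubits.
    """
    d = 2 ** n_qubits
    return (d - 1) * (1.0 - decay_p) / d

def irb_gate_error(reference_p, interleaved_p, n_qubits=1):
    """Point estimate of one gate's error from interleaved RB.

    reference_p: decay parameter of the plain RB experiment.
    interleaved_p: decay parameter with the gate interleaved.
    Uses r_gate = (d - 1) * (1 - interleaved_p / reference_p) / d.
    """
    d = 2 ** n_qubits
    return (d - 1) * (1.0 - interleaved_p / reference_p) / d

# Illustrative numbers only (not taken from the talk).
clifford_error = rb_average_error(0.98)
gate_error = irb_gate_error(0.98, 0.97)
```

The IRB value is a point estimate; the abstract's claim is that combining several IRB experiments gives tighter bounds than a single one.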
Online collaboration and model sharing in volcanology via VHub.org
NASA Astrophysics Data System (ADS)
Valentine, G.; Patra, A. K.; Bajo, J. V.; Bursik, M. I.; Calder, E.; Carn, S. A.; Charbonnier, S. J.; Connor, C.; Connor, L.; Courtland, L. M.; Gallo, S.; Jones, M.; Palma Lizana, J. L.; Moore-Russo, D.; Renschler, C. S.; Rose, W. I.
2013-12-01
VHub (short for VolcanoHub, and accessible at vhub.org) is an online platform for barrier free access to high end modeling and simulation and collaboration in research and training related to volcanoes, the hazards they pose, and risk mitigation. The underlying concept is to provide a platform, building upon the successful HUBzero software infrastructure (hubzero.org), that enables workers to collaborate online and to easily share information, modeling and analysis tools, and educational materials with colleagues around the globe. Collaboration occurs around several different points: (1) modeling and simulation; (2) data sharing; (3) education and training; (4) volcano observatories; and (5) project-specific groups. VHub promotes modeling and simulation in two ways: (1) some models can be implemented on VHub for online execution, so that VHub can provide a central warehouse for such models that should result in broader dissemination; and (2) VHub provides a platform that supports the more complex CFD models by enabling the sharing of code development and problem-solving knowledge, benchmarking datasets, and the development of validation exercises. VHub also provides a platform for sharing of data and datasets. The VHub development team is implementing the iRODS data sharing middleware (see irods.org). iRODS allows a researcher to access data that are located at participating data sources around the world (a cloud of data) as if the data were housed in a single virtual database. Projects associated with VHub are also going to introduce the use of data driven workflow tools to support the use of multistage analysis processes where computing and data are integrated for model validation, hazard analysis etc. Audio-video recordings of seminars, PowerPoint slide sets, and educational simulations are all items that can be placed onto VHub for use by the community or by selected collaborators.
An important point is that the manager of a given educational resource (or any other resource, such as a dataset or a model) can control the privacy of that resource, ranging from private (only accessible by, and known to, specific collaborators) to completely public. VHub is a very useful platform for project-specific collaborations. With a group site on VHub, collaborators share documents, datasets, and maps, and hold ongoing discussions using the discussion board function. VHub is funded by the U.S. National Science Foundation, and is participating in development of larger earth-science cyberinfrastructure initiatives (EarthCube), as well as supporting efforts such as the Global Volcano Model. Emerging VHub-facilitated efforts include model benchmarking, collaborative code development, and growth in online modeling tools.
Using random forest for reliable classification and cost-sensitive learning for medical diagnosis.
Yang, Fan; Wang, Hua-zhen; Mi, Hong; Lin, Cheng-de; Cai, Wei-wen
2009-01-30
Most machine-learning classifiers output label predictions for new instances without indicating how reliable the predictions are. The applicability of these classifiers is limited in critical domains where incorrect predictions have serious consequences, like medical diagnosis. Further, the default assumption of equal misclassification costs is most likely violated in medical diagnosis. In this paper, we present a modified random forest classifier which is incorporated into the conformal predictor scheme. A conformal predictor is a transductive learning scheme, using Kolmogorov complexity to test the randomness of a particular sample with respect to the training sets. Our method is well calibrated: the performance can be set prior to classification, and the accuracy rate is exactly equal to the predefined confidence level. Further, to address the cost-sensitive problem, we extend our method to a label-conditional predictor which takes into account different costs for misclassifications in different classes and allows a different confidence level to be specified for each class. Intensive experiments on benchmark datasets and real world applications show that the resultant classifier is well calibrated and able to control the specific risk of different classes. The method of using the RF outlier measure to design a nonconformity measure benefits the resultant predictor. Further, a label-conditional classifier is developed and turns out to be an alternative approach to the cost-sensitive learning problem that relies on label-wise predefined confidence levels. The target of minimizing the risk of misclassification is achieved by specifying a different confidence level for each class.
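The label-conditional idea in this abstract can be illustrated with a small sketch. It assumes that nonconformity scores have already been computed (the paper derives them from a random forest outlier measure, which is not reproduced here) and it uses a simplified calibration-set formulation rather than the transductive scheme the authors describe. All function and variable names are hypothetical.

```python
def label_conditional_p_value(cal_scores, cal_labels, test_score, label):
    """p-value of a candidate label, using only calibration examples of that label.

    A higher nonconformity score means the example looks stranger for its label.
    """
    same_label = [s for s, y in zip(cal_scores, cal_labels) if y == label]
    at_least_as_strange = sum(1 for s in same_label if s >= test_score)
    return (at_least_as_strange + 1) / (len(same_label) + 1)

def prediction_set(cal_scores, cal_labels, test_scores_by_label, significance_by_label):
    """Keep every label whose p-value exceeds that label's own significance level."""
    kept = set()
    for label, score in test_scores_by_label.items():
        p = label_conditional_p_value(cal_scores, cal_labels, score, label)
        if p > significance_by_label[label]:
            kept.add(label)
    return kept

# Illustrative scores only: three calibration examples of class "a", two of class "b".
cal_scores = [0.1, 0.2, 0.9, 0.3, 0.8]
cal_labels = ["a", "a", "a", "b", "b"]
result = prediction_set(
    cal_scores, cal_labels,
    test_scores_by_label={"a": 0.5, "b": 0.95},
    significance_by_label={"a": 0.2, "b": 0.4},
)
```

Setting a smaller significance level for the class whose misclassification is more costly makes the predictor more reluctant to exclude that class, which is the mechanism the abstract describes for controlling class-specific risk.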
A Variational Approach to Video Registration with Subspace Constraints.
Garg, Ravi; Roussos, Anastasios; Agapito, Lourdes
2013-01-01
This paper addresses the problem of non-rigid video registration, or the computation of optical flow from a reference frame to each of the subsequent images in a sequence, when the camera views deformable objects. We exploit the high correlation between 2D trajectories of different points on the same non-rigid surface by assuming that the displacement of any point throughout the sequence can be expressed in a compact way as a linear combination of a low-rank motion basis. This subspace constraint effectively acts as a trajectory regularization term leading to temporally consistent optical flow. We formulate it as a robust soft constraint within a variational framework by penalizing flow fields that lie outside the low-rank manifold. The resulting energy functional can be decoupled into the optimization of the brightness constancy and spatial regularization terms, leading to an efficient optimization scheme. Additionally, we propose a novel optimization scheme for the case of vector valued images, based on the dualization of the data term. This allows us to extend our approach to deal with colour images which results in significant improvements on the registration results. Finally, we provide a new benchmark dataset, based on motion capture data of a flag waving in the wind, with dense ground truth optical flow for evaluation of multi-frame optical flow algorithms for non-rigid surfaces. Our experiments show that our proposed approach outperforms state of the art optical flow and dense non-rigid registration algorithms.
Basis for substrate recognition and distinction by matrix metalloproteinases
Ratnikov, Boris I.; Cieplak, Piotr; Gramatikoff, Kosi; Pierce, James; Eroshkin, Alexey; Igarashi, Yoshinobu; Kazanov, Marat; Sun, Qing; Godzik, Adam; Osterman, Andrei; Stec, Boguslaw; Strongin, Alex; Smith, Jeffrey W.
2014-01-01
Genomic sequencing and structural genomics produced a vast amount of sequence and structural data, creating an opportunity for structure–function analysis in silico [Radivojac P, et al. (2013) Nat Methods 10(3):221–227]. Unfortunately, only a few large experimental datasets exist to serve as benchmarks for function-related predictions. Furthermore, currently there are no reliable means to predict the extent of functional similarity among proteins. Here, we quantify structure–function relationships among three phylogenetic branches of the matrix metalloproteinase (MMP) family by comparing their cleavage efficiencies toward an extended set of phage peptide substrates that were selected from ∼64 million peptide sequences (i.e., a large unbiased representation of substrate space). The observed second-order rate constants [k(obs)] across the substrate space provide a distance measure of functional similarity among the MMPs. These functional distances directly correlate with MMP phylogenetic distance. There is also a remarkable and near-perfect correlation between the MMP substrate preference and sequence identity of 50–57 discontinuous residues surrounding the catalytic groove. We conclude that these residues represent the specificity-determining positions (SDPs) that allowed for the expansion of MMP proteolytic function during evolution. A transmutation of only a few selected SDPs proximal to the bound substrate peptide, and contributing the most to selectivity among the MMPs, is sufficient to enact a global change in the substrate preference of one MMP to that of another, indicating the potential for the rational and focused redesign of cleavage specificity in MMPs. PMID:25246591
McKinney, Bill; Meyer, Peter A.; Crosas, Mercè; Sliz, Piotr
2016-01-01
Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension—functionality supporting preservation of filesystem structure within Dataverse—which is essential for both in-place computation and supporting non-http data transfers. PMID:27862010
SWAT use of gridded observations for simulating runoff - a Vietnam river basin study
NASA Astrophysics Data System (ADS)
Vu, M. T.; Raghavan, S. V.; Liong, S. Y.
2011-12-01
Many research studies that focus on basin hydrology have used the SWAT model to simulate runoff. One common practice in calibrating the SWAT model is the application of station rainfall data to simulate runoff. But over regions lacking robust station data, it is difficult to apply the model to study the hydrological responses. For some countries and remote areas, rainfall data availability might be constrained for many different reasons, such as lack of technology, wartime and financial limitations, which makes it difficult to construct the runoff data. To overcome such a limitation, this research study uses some of the available globally gridded high resolution precipitation datasets to simulate runoff. Five popular gridded observation precipitation datasets, (1) Asian Precipitation Highly Resolved Observational Data Integration Towards the Evaluation of Water Resources (APHRODITE), (2) Tropical Rainfall Measuring Mission (TRMM), (3) Precipitation Estimation from Remote Sensing Information using Artificial Neural Network (PERSIANN), (4) Global Precipitation Climatology Project (GPCP), and (5) modified Global Historical Climatology Network version 2 (GHCN2), and one reanalysis dataset, National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR), are used to simulate runoff over the Dakbla River (a small tributary of the Mekong River) in Vietnam. Wherever possible, available station data are also used for comparison. Bilinear interpolation of these gridded datasets is used to input the precipitation data at the closest grid points to the station locations. Sensitivity Analysis and Auto-calibration are performed for the SWAT model. The Nash-Sutcliffe Efficiency (NSE) and Coefficient of Determination (R²) indices are used to benchmark the model performance. This entails a good understanding of the response of the hydrological model to different datasets and a quantification of the uncertainties in these datasets. 
Such a methodology is also useful for planning of rainfall-runoff and even reservoir/river management at both rural and urban scales.
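The two skill scores named in this abstract have standard definitions, sketched below for observed and simulated runoff series. R² is computed here as the squared Pearson correlation, which is the usual convention in hydrological model evaluation but is an assumption about this particular study; the function names are illustrative.

```python
import numpy as np

def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2).

    1 is a perfect fit; 0 means the model is no better than the observed mean.
    """
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

def coefficient_of_determination(observed, simulated):
    """R^2 as the squared Pearson correlation between the two series."""
    r = np.corrcoef(observed, simulated)[0, 1]
    return float(r ** 2)

# Illustrative series only: the simulation doubles every observed value.
obs = [1.0, 2.0, 3.0, 4.0]
sim = [2.0, 4.0, 6.0, 8.0]
nse = nash_sutcliffe_efficiency(obs, sim)
r2 = coefficient_of_determination(obs, sim)
```

The example shows why both indices are reported: a simulation with a systematic bias can be perfectly correlated with observations (R² of 1) while still scoring a poor NSE.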
Onder, Devrim; Sarioglu, Sulen; Karacali, Bilge
2013-04-01
Quasi-supervised learning is a statistical learning algorithm that contrasts two datasets by computing an estimate of the posterior probability of each sample in either dataset. This method has not been applied to histopathological images before. The purpose of this study is to evaluate the performance of the method in identifying colorectal tissues with or without adenocarcinoma. Light microscopic digital images from histopathological sections were obtained from 30 colorectal radical surgery materials including adenocarcinoma and non-neoplastic regions. The texture features were extracted by using local histograms and co-occurrence matrices. The quasi-supervised learning algorithm operates on two datasets, one containing samples of normal tissues labelled only indirectly, and the other containing an unlabeled collection of samples of both normal and cancer tissues. As such, the algorithm eliminates the need for manually labelled samples of normal and cancer tissues for conventional supervised learning and significantly reduces the expert intervention. Several texture feature vector datasets corresponding to different extraction parameters were tested within the proposed framework. The Independent Component Analysis dimensionality reduction approach was also identified as the one improving the labelling performance evaluated in this series. In this series, the proposed method was applied to the dataset of 22,080 vectors with dimensionality reduced from 132 to 119. Regions containing cancer tissue could be identified accurately, with false and true positive rates of up to 19% and 88%, respectively, without using manually labelled ground-truth datasets in a quasi-supervised strategy. The resulting labelling performances were compared to that of a conventional powerful supervised classifier using manually labelled ground-truth data. The supervised classifier results were calculated as 3.5% and 95% for the same case. 
The results in this series, in comparison with the benchmark classifier, suggest that quasi-supervised image texture labelling may be a useful method in the analysis and classification of pathological slides, but further study is required to improve the results. Copyright © 2013 Elsevier Ltd. All rights reserved.
Missing value imputation for microarray data: a comprehensive comparison study and a web tool.
Chiu, Chia-Chun; Chan, Shih-Yao; Wang, Chung-Ching; Wu, Wei-Sheng
2013-01-01
Microarray data are usually peppered with missing values for various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there is much debate about the selection of the optimal algorithm. Studies comparing the performance of different algorithms are still not comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have different impacts on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of dataset but not on the species from which the samples come. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices to handle missing values for most of the microarray datasets. In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily.
Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses.
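The simulation-based comparison described above can be illustrated with a minimal sketch: hide known values in a complete matrix, impute them, and score the result. The row-mean imputer and the normalized RMSE below are stand-ins chosen for brevity; the abstract does not name its nine algorithms or its three measures, and all function names here are hypothetical.

```python
import math
import random

def row_mean_impute(matrix):
    """Fill None entries with the mean of the observed values in the same row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled

def nrmse(truth, imputed, hidden):
    """RMSE over the hidden cells divided by the standard deviation of their
    true values (assumes the hidden true values are not all identical)."""
    true_vals = [truth[i][j] for i, j in hidden]
    sq_errs = [(imputed[i][j] - truth[i][j]) ** 2 for i, j in hidden]
    mu = sum(true_vals) / len(true_vals)
    var = sum((v - mu) ** 2 for v in true_vals) / len(true_vals)
    return math.sqrt(sum(sq_errs) / len(sq_errs)) / math.sqrt(var)

def simulate_once(complete, missing_rate, rng):
    """Hide a random fraction of cells, impute them, and return the NRMSE."""
    cells = [(i, j) for i in range(len(complete)) for j in range(len(complete[0]))]
    hidden = rng.sample(cells, max(2, int(missing_rate * len(cells))))
    hidden_set = set(hidden)
    masked = [[None if (i, j) in hidden_set else v for j, v in enumerate(row)]
              for i, row in enumerate(complete)]
    return nrmse(complete, row_mean_impute(masked), hidden)

# Toy "complete" matrix with distinct values; real studies would use expression data.
complete = [[float(10 * i + j) for j in range(5)] for i in range(4)]
scores = [simulate_once(complete, 0.2, random.Random(seed)) for seed in range(10)]
mean_score = sum(scores) / len(scores)
```

Averaging over independent runs, as in the last two lines, mirrors the repeated-simulation design; swapping `row_mean_impute` for another imputer lets two methods be compared on the same hidden cells.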
Ren, Kai; Xu, Leiqing
2017-10-01
The data presented in this paper are related to "Environmental-behavior studies of sustainable construction of the third place - based on outdoor environment-behavior cross-feed symbiotic analysis and verification of selective activities" (Ren, 2017) [1]. The dataset came from a field sub-time extended investigation of children in the Hohhot West Inner Mongolia Electric Power Community Residential Area in Inner Mongolia, China, which belongs to the cold region of the ID area according to the Chinese design code for buildings. These field data provide descriptive statistics about outdoor time, behavior scale specificity, age exclusivity and self-centeredness for children of different ages (babies, preschool children, school-age children). The data also provide five measurement elements of child-friendly space and their weight ratio. The field dataset is made publicly available to enable critical or extended analyses.
Electro-optical seasonal weather and gender data collection
NASA Astrophysics Data System (ADS)
McCoppin, Ryan; Koester, Nathan; Rude, Howard N.; Rizki, Mateen; Tamburino, Louis; Freeman, Andrew; Mendoza-Schrock, Olga
2013-05-01
This paper describes the process used to collect the Seasonal Weather And Gender (SWAG) dataset, an electro-optical dataset of human subjects that can be used to develop advanced gender classification algorithms. Several novel features characterize this ongoing effort: (1) the human subjects self-label their gender by performing a specific action during the data collection and (2) the data collection will span months and even years, resulting in a dataset containing realistic levels and types of clothing corresponding to the various seasons and weather conditions. It is envisioned that this type of data will support the development and evaluation of more robust gender classification systems that are capable of accurate gender recognition under extended operating conditions.
SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes.
Pruesse, Elmar; Peplies, Jörg; Glöckner, Frank Oliver
2012-07-15
In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes like the ribosomal RNA (rRNA) where already millions of sequences are publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements. In this study, we present the SILVA Incremental Aligner (SINA) used to align the rRNA gene databases provided by the SILVA ribosomal RNA project. SINA uses a combination of k-mer searching and partial order alignment (POA) to maintain very high alignment accuracy while satisfying high throughput performance demands. SINA was evaluated in comparison with the commonly used high throughput MSA programs PyNAST and mothur. The three BRAliBase III benchmark MSAs could be reproduced with 99.3, 97.6 and 96.1 accuracy. A larger benchmark MSA comprising 38 772 sequences could be reproduced with 98.9 and 99.3% accuracy using reference MSAs comprising 1000 and 5000 sequences. SINA was able to achieve higher accuracy than PyNAST and mothur in all performed benchmarks. Alignment of up to 500 sequences using the latest SILVA SSU/LSU Ref datasets as reference MSA is offered at http://www.arb-silva.de/aligner. This page also links to Linux binaries, user manual and tutorial. SINA is made available under a personal use license.
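The k-mer searching step mentioned above can be sketched as ranking candidate reference sequences by the number of k-mers they share with the query. This is only an illustration of the general idea under assumed names (`kmers`, `rank_references`) and toy sequences; how SINA itself scores or selects references, and the subsequent partial order alignment, are not specified in the abstract and are not reproduced here.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def rank_references(query, references, k=4):
    """Rank reference sequences by shared k-mer count, highest first.

    `references` maps a name to a sequence; ties are broken by name.
    """
    query_kmers = kmers(query, k)
    scored = [(len(query_kmers & kmers(ref_seq, k)), name)
              for name, ref_seq in references.items()]
    return sorted(scored, key=lambda item: (-item[0], item[1]))

# Toy RNA fragments (hypothetical, for illustration only).
references = {"refA": "ACGUACGUAGCU", "refB": "GGGGCCCCAAAA"}
ranking = rank_references("ACGUACGAAGCU", references)
```

In a pipeline of this shape, only the top-ranked references would be passed on to the more expensive alignment stage.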
Mapping the Martian Meteorology
NASA Technical Reports Server (NTRS)
Allison, M.; Ross, J. D.; Solomon, N.
1999-01-01
The Mars-adapted version of the NASA/GISS general circulation model (GCM) has been applied to the hourly/daily simulation of the planet's meteorology over several seasonal orbits. The current running version of the model includes a diurnal solar cycle, CO2 sublimation, and a mature parameterization of upper level wave drag with a vertical domain extending from the surface up to the 6 μbar level. The benchmark simulations provide a four-dimensional archive for the comparative evaluation of various schemes for the retrieval of winds from anticipated polar orbiter measurements of temperatures by the Pressure Modulator Infrared Radiometer. Additional information is contained in the original extended abstract.
Extended output phasor representation of multi-spectral fluorescence lifetime imaging microscopy
Campos-Delgado, Daniel U.; Navarro, O. Gutiérrez; Arce-Santana, E. R.; Jo, Javier A.
2015-01-01
In this paper, we investigate novel low-dimensional and model-free representations for multi-spectral fluorescence lifetime imaging microscopy (m-FLIM) data. We depart from the classical definition of the phasor in the complex plane to propose the extended output phasor (EOP) and extended phasor (EP) for multi-spectral information. The frequency domain properties of the EOP and EP are analytically studied based on a multiexponential model for the impulse response of the imaged tissue. For practical implementations, the EOP is more appealing since there is no need to perform deconvolution of the instrument response from the measured m-FLIM data, as in the case of EP. Our synthetic and experimental evaluations with m-FLIM datasets of human coronary atherosclerotic plaques show that low frequency indexes have to be employed for a distinctive representation of the EOP and EP, and to reduce noise distortion. The tissue classification of the m-FLIM datasets by EOP and EP also improves with low frequency indexes, and does not present significant differences by using either phasor. PMID:26114031
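For orientation, the classical phasor that the abstract takes as its starting point can be sketched numerically. The code below computes the classical phasor coordinates (g, s) of a sampled decay at one angular frequency; it is not the EOP or EP proposed by the authors, and the lifetime, sampling step and modulation frequency are assumed values chosen for illustration.

```python
import math

def phasor(decay, dt, omega):
    """Classical phasor of a sampled decay: intensity-normalized cosine and
    sine transforms evaluated at angular frequency omega."""
    total = sum(decay)
    g = sum(y * math.cos(omega * i * dt) for i, y in enumerate(decay)) / total
    s = sum(y * math.sin(omega * i * dt) for i, y in enumerate(decay)) / total
    return g, s

tau = 2.0e-9                     # assumed lifetime: 2 ns
dt = 1.0e-11                     # assumed sampling step: 10 ps
omega = 2 * math.pi * 80e6       # assumed frequency: 80 MHz
decay = [math.exp(-i * dt / tau) for i in range(5000)]  # mono-exponential decay
g, s = phasor(decay, dt, omega)

# For an ideal mono-exponential decay the continuous-time result is
# g = 1 / (1 + (omega*tau)^2) and s = omega*tau / (1 + (omega*tau)^2),
# which lies on the semicircle (g - 0.5)^2 + s^2 = 0.25.
g_ideal = 1.0 / (1.0 + (omega * tau) ** 2)
s_ideal = omega * tau / (1.0 + (omega * tau) ** 2)
```

Multi-exponential decays map to points inside that semicircle, which is what makes the phasor a model-free, low-dimensional representation.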
NASA Astrophysics Data System (ADS)
Velioǧlu, Deniz; Cevdet Yalçıner, Ahmet; Zaytsev, Andrey
2016-04-01
Tsunamis are huge waves with long wave periods and wavelengths that can cause great devastation and loss of life when they strike a coast. The interest in experimental and numerical modeling of tsunami propagation and inundation increased considerably after the 2011 Great East Japan earthquake. In this study, two numerical codes, FLOW 3D and NAMI DANCE, that analyze tsunami propagation and inundation patterns are considered. FLOW 3D simulates linear and nonlinear propagating surface waves as well as long waves by solving three-dimensional Navier-Stokes (3D-NS) equations. NAMI DANCE uses a finite difference computational method to solve 2D depth-averaged linear and nonlinear forms of shallow water equations (NSWE) in long wave problems, specifically tsunamis. In order to validate these two codes and analyze the differences between 3D-NS and 2D depth-averaged NSWE equations, two benchmark problems are applied. One benchmark problem investigates the runup of long waves over a complex 3D beach. The experimental setup is a 1:400 scale model of Monai Valley located on the west coast of Okushiri Island, Japan. The other benchmark problem was discussed at the 2015 National Tsunami Hazard Mitigation Program (NTHMP) Annual Meeting in Portland, USA. It is a field dataset recording the Japan 2011 tsunami in Hilo Harbor, Hawaii. The computed water surface elevation and velocity data are compared with the measured data. The comparisons showed that both codes are in fairly good agreement with each other and with the benchmark data. The differences between 3D-NS and 2D depth-averaged NSWE equations are highlighted. All results are presented with discussions and comparisons.
Acknowledgements: Partial support by Japan-Turkey Joint Research Project by JICA on earthquakes and tsunamis in Marmara Region (JICA SATREPS - MarDiM Project), 603839 ASTARTE Project of EU, UDAP-C-12-14 project of AFAD Turkey, 108Y227, 113M556 and 213M534 projects of TUBITAK Turkey, RAPSODI (CONCERT_Dis-021) of CONCERT-Japan Joint Call and Istanbul Metropolitan Municipality are all acknowledged.
Assessing streamflow sensitivity to variations in glacier mass balance
O'Neel, Shad; Hood, Eran; Arendt, Anthony; Sass, Louis
2014-01-01
The purpose of this paper is to evaluate relationships among seasonal and annual glacier mass balances, glacier runoff and streamflow in two glacierized basins in different climate settings. We use long-term glacier mass balance and streamflow datasets from the United States Geological Survey (USGS) Alaska Benchmark Glacier Program to compare and contrast glacier-streamflow interactions in a maritime climate (Wolverine Glacier) with those in a continental climate (Gulkana Glacier). Our overall goal is to improve our understanding of how glacier mass balance processes impact streamflow, ultimately improving our conceptual understanding of the future evolution of glacier runoff in continental and maritime climates.
Auction dynamics: A volume constrained MBO scheme
NASA Astrophysics Data System (ADS)
Jacobs, Matt; Merkurjev, Ekaterina; Esedoǧlu, Selim
2018-02-01
We show how auction algorithms, originally developed for the assignment problem, can be utilized in Merriman, Bence, and Osher's threshold dynamics scheme to simulate multi-phase motion by mean curvature in the presence of equality and inequality volume constraints on the individual phases. The resulting algorithms are highly efficient and robust, and can be used in simulations ranging from minimal partition problems in Euclidean space to semi-supervised machine learning via clustering on graphs. In the case of the latter application, numerous experimental results on benchmark machine learning datasets show that our approach exceeds the performance of current state-of-the-art methods, while requiring a fraction of the computation time.
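As background for the auction algorithms mentioned above, here is a minimal sketch of a forward auction for the plain assignment problem, in the style usually attributed to Bertsekas. It is not the volume-constrained threshold dynamics scheme of the paper; the function name, the default `eps`, and the toy value matrix are assumptions made for illustration.

```python
def auction_assignment(values, eps=None):
    """Assign n bidders to n objects so that total value is (near-)maximal.

    values[i][j] is the value of object j to bidder i. With integer values
    and eps < 1/n, the resulting assignment is optimal.
    """
    n = len(values)
    if eps is None:
        eps = 1.0 / (n + 1)
    prices = [0.0] * n
    owner = [None] * n       # owner[j]: bidder currently holding object j
    assigned = [None] * n    # assigned[i]: object currently held by bidder i
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        net = [values[i][j] - prices[j] for j in range(n)]
        best = max(range(n), key=lambda j: net[j])
        second = max((net[j] for j in range(n) if j != best), default=net[best])
        # Raise the price by the bidding increment: best minus second-best net value, plus eps.
        prices[best] += net[best] - second + eps
        if owner[best] is not None:
            # The previous owner is outbid and re-enters the auction.
            assigned[owner[best]] = None
            unassigned.append(owner[best])
        owner[best] = i
        assigned[i] = best
    return assigned

values = [[10, 5, 8],
          [7, 9, 6],
          [8, 7, 4]]
assignment = auction_assignment(values)
total = sum(values[i][j] for i, j in enumerate(assignment))
```

The price vector plays the role of dual variables; in the paper's setting, analogous prices would enforce the volume constraints on each phase, though that extension is not shown here.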
Comparison between extreme learning machine and wavelet neural networks in data classification
NASA Astrophysics Data System (ADS)
Yahia, Siwar; Said, Salwa; Jemai, Olfa; Zaied, Mourad; Ben Amar, Chokri
2017-03-01
The Extreme Learning Machine is a well-known learning algorithm in the field of machine learning. It is a feed-forward neural network with a single hidden layer, and it is an extremely fast learning algorithm with good generalization performance. In this paper, we aim to compare the Extreme Learning Machine with wavelet neural networks, which are also widely used. We used six benchmark datasets to evaluate each technique: Wisconsin Breast Cancer, Glass Identification, Ionosphere, Pima Indians Diabetes, Wine Recognition and Iris Plant. Experimental results show that both the extreme learning machine and wavelet neural networks achieve good results.
Automatic Clustering Using FSDE-Forced Strategy Differential Evolution
NASA Astrophysics Data System (ADS)
Yasid, A.
2018-01-01
Clustering analysis is important in data mining for unsupervised data, because no adequate prior knowledge is available. One of the important tasks is defining the number of clusters without user involvement, which is known as automatic clustering. This study aims to determine the number of clusters automatically using forced strategy differential evolution (AC-FSDE). Two mutation parameters, namely a constant parameter and a variable parameter, are employed to boost differential evolution performance. Four well-known benchmark datasets were used to evaluate the algorithm. Moreover, the result is compared with other state-of-the-art automatic clustering methods. The experimental results show that AC-FSDE is better than or competitive with other existing automatic clustering algorithms.
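To make the underlying optimizer concrete, the following is a minimal sketch of standard differential evolution (the DE/rand/1/bin variant) applied to a toy clustering objective. It does not implement the forced strategy or the automatic selection of the cluster number described in the abstract: the number of clusters is fixed at two, and the data, bounds and parameter values are assumptions for illustration.

```python
import random

def differential_evolution(objective, bounds, pop_size=20, f=0.5, cr=0.9,
                           generations=100, seed=0):
    """Minimize `objective` over box `bounds` with DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fitness = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced_dim = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == forced_dim:
                    lo, hi = bounds[d]
                    mutant = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    trial.append(min(max(mutant, lo), hi))  # clip to bounds
                else:
                    trial.append(pop[i][d])
            trial_fit = objective(trial)
            if trial_fit <= fitness[i]:  # greedy selection never worsens a slot
                pop[i], fitness[i] = trial, trial_fit
    best = min(range(pop_size), key=lambda k: fitness[k])
    return pop[best], fitness[best]

# Toy 1-D data with two obvious groups; each candidate encodes two centroids.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]

def within_cluster_sse(centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min((x - m) ** 2 for m in centroids) for x in data)

bounds = [(0.0, 6.0), (0.0, 6.0)]
_, initial_fit = differential_evolution(within_cluster_sse, bounds, generations=0)
centroids, best_fit = differential_evolution(within_cluster_sse, bounds)
```

An automatic clustering variant would additionally encode which centroids are active and score candidates with a cluster validity index rather than the raw within-cluster error.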
Recalculating the quasar luminosity function of the extended Baryon Oscillation Spectroscopic Survey
NASA Astrophysics Data System (ADS)
Caditz, David M.
2017-12-01
Aims: The extended Baryon Oscillation Spectroscopic Survey (eBOSS) of the Sloan Digital Sky Survey provides a uniform sample of over 13 000 variability selected quasi-stellar objects (QSOs) in the redshift range 0.68
Sun, Xiaodian; Jin, Li; Xiong, Momiao
2008-01-01
It is system dynamics that determines the function of cells, tissues and organisms. Developing mathematical models and estimating their parameters is essential for studying the dynamic behaviors of biological systems, which include metabolic networks, genetic regulatory networks and signal transduction pathways, under perturbation by external stimuli. In general, biological dynamic systems are partially observed. Therefore, a natural way to model dynamic biological systems is to employ nonlinear state-space equations. Although statistical methods for parameter estimation of linear models in biological dynamic systems have been developed intensively in recent years, the estimation of both states and parameters of nonlinear dynamic systems remains a challenging task. In this report, we apply the extended Kalman filter (EKF) to the estimation of both states and parameters of nonlinear state-space models. To evaluate the performance of the EKF for parameter estimation, we apply the EKF to a simulation dataset and two real datasets: JAK-STAT signal transduction pathway and Ras/Raf/MEK/ERK signaling transduction pathway datasets. The preliminary results show that the EKF can accurately estimate the parameters and predict states in nonlinear state-space equations for modeling dynamic biochemical networks. PMID:19018286
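The idea of estimating states and parameters together can be sketched by augmenting the state with the unknown parameter and running an EKF on the augmented system. The toy model below, a first-order decay x[k+1] = x[k](1 - dt*theta) with x observed directly, is an assumption chosen to keep the algebra explicit; it is not one of the pathway models from the report, and the noise settings and noise-free synthetic data are illustrative.

```python
def ekf_decay(observations, dt, x0, theta0, p0, q, r):
    """Joint EKF over the augmented state [x, theta].

    p0 = (pxx, pxt, ptt) holds the entries of the symmetric 2x2 covariance;
    q is the process noise variance (added to both diagonal entries) and
    r is the measurement noise variance.
    """
    x, theta = x0, theta0
    pxx, pxt, ptt = p0
    for y in observations:
        # Predict. Jacobian of the transition: F = [[1 - dt*theta, -dt*x], [0, 1]].
        a, b = 1.0 - dt * theta, -dt * x
        x = x * a
        pxx, pxt, ptt = (a * a * pxx + 2 * a * b * pxt + b * b * ptt + q,
                         a * pxt + b * ptt,
                         ptt + q)
        # Update with measurement matrix H = [1, 0].
        s = pxx + r
        kx, kt = pxx / s, pxt / s
        innovation = y - x
        x += kx * innovation
        theta += kt * innovation
        pxx, pxt, ptt = (1 - kx) * pxx, (1 - kx) * pxt, ptt - kt * pxt
    return x, theta

# Noise-free synthetic observations generated with the "true" rate 0.5.
true_theta, dt = 0.5, 0.1
observations, x_true = [], 1.0
for _ in range(50):
    x_true *= 1.0 - dt * true_theta
    observations.append(x_true)

x_est, theta_est = ekf_decay(observations, dt, x0=1.0, theta0=0.2,
                             p0=(1e-4, 0.0, 1.0), q=1e-8, r=1e-4)
```

Giving the parameter a large initial variance (`ptt = 1.0`) tells the filter that the starting guess for theta is poor, so early innovations are attributed mostly to the parameter rather than the state.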
2013-01-01
Background The objective of screening programs is to discover life threatening diseases in as many patients as early as possible and to increase the chance of survival. To be able to compare aspects of health care quality, methods are needed for benchmarking that allow comparisons on various health care levels (regional, national, and international). Objectives Applications and extensions of algorithms can be used to link the information on disease phases with relative survival rates and to consolidate them in composite measures. The application of the developed SAS-macros will give results for benchmarking of health care quality. Data examples for breast cancer care are given. Methods A reference scale (expected, E) must be defined at a time point at which all benchmark objects (observed, O) are measured. All indices are defined as O/E, whereby the extended standardized screening-index (eSSI), the standardized case-mix-index (SCI), the work-up-index (SWI), and the treatment-index (STI) address different health care aspects. The composite measures called overall-performance evaluation (OPE) and relative overall performance indices (ROPI) link the individual indices differently for cross-sectional or longitudinal analyses. Results Algorithms allow time-point and time-interval comparisons of the benchmark objects in the indices eSSI, SCI, SWI, STI, OPE, and ROPI. Comparisons between countries, states and districts are possible. Exemplary comparisons between two countries are made. The success of early detection and screening programs as well as clinical health care quality for breast cancer can be demonstrated while the population’s background mortality is taken into account.
Conclusions If external quality assurance programs and benchmark objects are based on population-based and corresponding demographic data, information of disease phase and relative survival rates can be combined to indices which offer approaches for comparative analyses between benchmark objects. Conclusions on screening programs and health care quality are possible. The macros can be transferred to other diseases if a disease-specific phase scale of prognostic value (e.g. stage) exists. PMID:23316692
McCance, Tanya; Wilson, Val; Kornman, Kelly
2016-07-01
The aim of the Paediatric International Nursing Study was to explore the utility of key performance indicators in developing person-centred practice across a range of services provided to sick children. The objective addressed in this paper was evaluating the use of these indicators to benchmark services internationally. This study builds on primary research, which produced indicators that were considered novel both in terms of their positive orientation and use in generating data that privileges the patient voice. This study extends this research through wider testing on an international platform within paediatrics. The overall methodological approach was a realistic evaluation used to evaluate the implementation of the key performance indicators, which combined an integrated development and evaluation methodology. The study involved children's wards/hospitals in Australia (six sites across three states) and Europe (seven sites across four countries). Qualitative and quantitative methods were used during the implementation process, however, this paper reports the quantitative data only, which used survey, observations and documentary review. The findings demonstrate the quality of care being delivered to children and their families across different international sites. The benchmarking does, however, highlight some differences between paediatric and general hospitals, and between the different key performance indicators across all the sites. The findings support the use of the key performance indicators as a novel method to benchmark services internationally. Whilst the data collected across 20 paediatric sites suggest services are more similar than different, benchmarking illuminates variations that encourage a critical dialogue about what works and why. The transferability of the key performance indicators and measurement framework across different settings has significant implications for practice. 
The findings offer an approach to benchmarking and celebrating the successes within practice, while learning from partners across the globe in further developing person-centred cultures. © 2016 John Wiley & Sons Ltd.
Cheyney, Melissa; Bovbjerg, Marit; Everson, Courtney; Gordon, Wendy; Hannibal, Darcy; Vedam, Saraswathi
2014-01-01
In 2004, the Midwives Alliance of North America's (MANA's) Division of Research developed a Web-based data collection system to gather information on the practices and outcomes associated with midwife-led births in the United States. This system, called the MANA Statistics Project (MANA Stats), grew out of a widely acknowledged need for more reliable data on outcomes by intended place of birth. This article describes the history and development of the MANA Stats birth registry and provides an analysis of the 2.0 dataset's content, strengths, and limitations. Data collection and review procedures for the MANA Stats 2.0 dataset are described, along with methods for the assessment of data accuracy. We calculated descriptive statistics for client demographics and contributing midwife credentials, and assessed the quality of data by calculating point estimates, 95% confidence intervals, and kappa statistics for key outcomes on pre- and postreview samples of records. The MANA Stats 2.0 dataset (2004-2009) contains 24,848 courses of care, 20,893 of which are for women who planned a home or birth center birth at the onset of labor. The majority of these records were planned home births (81%). Births were attended primarily by certified professional midwives (73%), and clients were largely white (92%), married (87%), and college-educated (49%). Data quality analyses of 9932 records revealed no differences between pre- and postreviewed samples for 7 key benchmarking variables (kappa, 0.98-1.00). The MANA Stats 2.0 data were accurately entered by participants; any errors in this dataset are likely random and not systematic. The primary limitation of the 2.0 dataset is that the sample was captured through voluntary participation; thus, it may not accurately reflect population-based outcomes. The dataset's primary strength is that it will allow for the examination of research questions on normal physiologic birth and midwife-led birth outcomes by intended place of birth. 
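The kappa statistic used above to compare pre- and postreview records can be sketched as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The abstract does not say which kappa variant was used, so the unweighted two-rater form below, along with the function name and toy data, is an assumption.

```python
from collections import Counter

def cohens_kappa(first, second):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(first) != len(second) or not first:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first, counts_second = Counter(first), Counter(second)
    # Chance agreement: product of marginal label frequencies, summed over labels.
    expected = sum(counts_first[label] * counts_second[label]
                   for label in counts_first) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)

# Hypothetical coded outcome before and after review for four records.
pre_review = [1, 1, 0, 0]
post_review = [1, 0, 0, 0]
kappa = cohens_kappa(pre_review, post_review)
```

Values close to 1, such as the 0.98 to 1.00 range reported above, indicate that review changed almost none of the entered values.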
© 2014 by the American College of Nurse-Midwives.
Xiao, Jingjing; Stolkin, Rustam; Gao, Yuqing; Leonardis, Ales
2017-09-06
This paper presents a novel robust method for single target tracking in RGB-D images, and also contributes a substantial new benchmark dataset for evaluating RGB-D trackers. While a target object's color distribution is reasonably motion-invariant, this is not true for the target's depth distribution, which continually varies as the target moves relative to the camera. It is therefore nontrivial to design target models which can fully exploit (potentially very rich) depth information for target tracking. For this reason, much of the previous RGB-D literature relies on color information for tracking, while exploiting depth information only for occlusion reasoning. In contrast, we propose an adaptive range-invariant target depth model, and show how both depth and color information can be fully and adaptively fused during the search for the target in each new RGB-D image. We introduce a new, hierarchical, two-layered target model (comprising local and global models) which uses spatio-temporal consistency constraints to achieve stable and robust on-the-fly target relearning. In the global layer, multiple features, derived from both color and depth data, are adaptively fused to find a candidate target region. In ambiguous frames, where one or more features disagree, this global candidate region is further decomposed into smaller local candidate regions for matching to local-layer models of small target parts. We also note that conventional use of depth data, for occlusion reasoning, can easily trigger false occlusion detections when the target moves rapidly toward the camera. To overcome this problem, we show how combining target information with contextual information enables the target's depth constraint to be relaxed. Our adaptively relaxed depth constraints can robustly accommodate large and rapid target motion in the depth direction, while still enabling the use of depth data for highly accurate reasoning about occlusions. 
For evaluation, we introduce a new RGB-D benchmark dataset with per-frame annotated attributes and extensive bias analysis. Our tracker is evaluated using two different state-of-the-art methodologies, VOT and object tracking benchmark, and in both cases it significantly outperforms four other state-of-the-art RGB-D trackers from the literature.
Learning a peptide-protein binding affinity predictor with kernel ridge regression
2013-01-01
Background The cellular function of a vast majority of proteins is performed through physical interactions with other biomolecules, which, most of the time, are other proteins. Peptides represent templates of choice for mimicking a secondary structure in order to modulate protein-protein interaction. They are thus an interesting class of therapeutics since they also display strong activity, high selectivity, low toxicity and few drug-drug interactions. Furthermore, predicting peptides that would bind to specific MHC alleles would be of tremendous benefit to improve vaccine-based therapy and possibly generate antibodies with greater affinity. Modern computational methods have the potential to accelerate and lower the cost of drug and vaccine discovery by selecting potential compounds for testing in silico prior to biological validation. Results We propose a specialized string kernel for small bio-molecules, peptides and pseudo-sequences of binding interfaces. The kernel incorporates physico-chemical properties of amino acids and elegantly generalizes eight kernels, including the Oligo, the Weighted Degree, the Blended Spectrum, and the Radial Basis Function. We provide a low complexity dynamic programming algorithm for the exact computation of the kernel and a linear time algorithm for its approximation. Combined with kernel ridge regression and SupCK, a novel binding pocket kernel, the proposed kernel yields biologically relevant and good prediction accuracy on the PepX database. For the first time, a machine learning predictor is capable of predicting the binding affinity of any peptide to any protein with reasonable accuracy. The method was also applied to both single-target and pan-specific Major Histocompatibility Complex class II benchmark datasets and three Quantitative Structure Affinity Model benchmark datasets.
Conclusion On all benchmarks, our method significantly (p-value ≤ 0.057) outperforms the current state-of-the-art methods at predicting peptide-protein binding affinities. The proposed approach is flexible and can be applied to predict any quantitative biological activity. Moreover, generating reliable peptide-protein binding affinities will also improve systems biology modelling of interaction pathways. Lastly, the method should be of value to a large segment of the research community with the potential to accelerate the discovery of peptide-based drugs and facilitate vaccine development. The proposed kernel is freely available at http://graal.ift.ulaval.ca/downloads/gs-kernel/. PMID:23497081
Recent Development on the NOAA's Global Surface Temperature Dataset
NASA Astrophysics Data System (ADS)
Zhang, H. M.; Huang, B.; Boyer, T.; Lawrimore, J. H.; Menne, M. J.; Rennie, J.
2016-12-01
Global Surface Temperature (GST) is one of the most widely used indicators for climate trend and extreme analyses. A widely used GST dataset is the NOAA merged land-ocean surface temperature dataset known as NOAAGlobalTemp (formerly MLOST). NOAAGlobalTemp was recently updated from version 3.5.4 to version 4. The update includes a significant improvement in the ocean surface component (Extended Reconstructed Sea Surface Temperature or ERSST, from version 3b to version 4), which resulted in increased temperature trends in recent decades. Since then, advancements in both the ocean component (ERSST) and land component (GHCN-Monthly) have been made, including the inclusion of Argo float SSTs and expanded EOT modes in ERSST, and the use of the ISTI databank in GHCN-Monthly. In this presentation, we describe the impact of those improvements on the merged global temperature dataset, in terms of global trends and other aspects.
Check your biosignals here: a new dataset for off-the-person ECG biometrics.
da Silva, Hugo Plácido; Lourenço, André; Fred, Ana; Raposo, Nuno; Aires-de-Sousa, Marta
2014-02-01
The Check Your Biosignals Here initiative (CYBHi) was developed as a way of creating a dataset and consistently repeatable acquisition framework, to further extend research in electrocardiographic (ECG) biometrics. In particular, our work targets the novel trend towards off-the-person data acquisition, which opens a broad new set of challenges and opportunities both for research and industry. While datasets with ECG signals collected using medical grade equipment at the chest can be easily found, for off-the-person ECG data the solution is generally for each team to collect their own corpus at considerable expense of resources. In this paper we describe the context, experimental considerations, methods, and preliminary findings of two public datasets created by our team, one for short-term and another for long-term assessment, with ECG data collected at the hand palms and fingers. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Cook, M. J.; Sasagawa, G. S.; Roland, E. C.; Schmidt, D. A.; Wilcock, W. S. D.; Zumberge, M. A.
2017-12-01
Seawater pressure can be used to measure vertical seafloor deformation since small seafloor height changes produce measurable pressure changes. However, resolving secular vertical deformation near subduction zones can be difficult due to pressure gauge drift. A typical gauge drift rate of about 10 cm/year exceeds the expected secular rate of 1 cm/year or less in Cascadia. The absolute self-calibrating pressure recorder (ASCPR) was developed to solve the issue of gauge drift by using a deadweight calibrator to make campaign-style measurements of the absolute seawater pressure. Pressure gauges alternate between observing the ambient seawater pressure and the deadweight calibrator pressure, which is an accurately known reference value, every 10-20 minutes for several hours. The difference between the known reference pressure and the observed seafloor pressure allows offsets and transients to be corrected to determine the true, absolute seafloor pressure. Absolute seafloor pressure measurements are of great utility for geodetic deformation studies. The measurements provide instrument-independent benchmark values that can be used far into the future as epoch points in long-term time series or as important calibration points for other continuous pressure records. The ASCPR was first deployed in Cascadia in 2014 and 2015, when seven concrete seafloor benchmarks were placed along a trench-perpendicular profile extending from 20 km to 105 km off the central Oregon coast. Two benchmarks have ASCPR measurements that span three years, one benchmark spans two years, and four benchmarks span one year. Measurement repeatability is currently 3 to 4 cm, but we anticipate accuracy on the order of 1 cm with improvements to the instrument metrology and to the processing of tidal and non-tidal oceanographic signals.
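As a rough worked example of the height-to-pressure conversion implied above, the hydrostatic relation can be evaluated for a 1 cm height change. The seawater density of 1030 kg/m^3 and g = 9.81 m/s^2 are nominal assumed values, not figures taken from the abstract.

```latex
\Delta p = \rho \, g \, \Delta h
\approx 1030\ \mathrm{kg\,m^{-3}} \times 9.81\ \mathrm{m\,s^{-2}} \times 0.01\ \mathrm{m}
\approx 101\ \mathrm{Pa}
```

Under the same assumptions, a gauge drift equivalent to 10 cm/year corresponds to roughly 1 kPa/year, about ten times the pressure signal of a 1 cm/year secular rate, which is why uncorrected drift can mask the deformation of interest.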
mBEEF-vdW: Robust fitting of error estimation density functionals
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lundgaard, Keld T.; Wellendorff, Jess; Voss, Johannes; ...
2016-06-15
Here, we propose a general-purpose semilocal/nonlocal exchange-correlation functional approximation, named mBEEF-vdW. The exchange is a meta generalized gradient approximation, and the correlation is a semilocal and nonlocal mixture, with the Rutgers-Chalmers approximation for van der Waals (vdW) forces. The functional is fitted within the Bayesian error estimation functional (BEEF) framework. We improve the previously used fitting procedures by introducing a robust MM-estimator based loss function, reducing the sensitivity to outliers in the datasets. To more reliably determine the optimal model complexity, we furthermore introduce a generalization of the bootstrap 0.632 estimator with hierarchical bootstrap sampling and geometric mean estimator over the training datasets. Using this estimator, we show that the robust loss function leads to a 10% improvement in the estimated prediction error over the previously used least-squares loss function. The mBEEF-vdW functional is benchmarked against popular density functional approximations over a wide range of datasets relevant for heterogeneous catalysis, including datasets that were not used for its training. Overall, we find that mBEEF-vdW has a higher general accuracy than competing popular functionals, and it is one of the best performing functionals on chemisorption systems, surface energies, lattice constants, and dispersion. We also show the potential-energy curve of graphene on the nickel(111) surface, where mBEEF-vdW matches the experimental binding length. mBEEF-vdW is currently available in gpaw and other density functional theory codes through Libxc, version 3.0.0.
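For orientation, the standard bootstrap 0.632 estimator that the abstract generalizes can be sketched as below. This is not the authors' hierarchical variant with a geometric mean over datasets: it is the plain form, with a toy constant-mean model standing in for the functional fit, the out-of-bag error averaged per resample (a common simplification), and all function names hypothetical.

```python
import random


def fit_mean(values):
    """Toy model (assumption): predict the training mean for every point."""
    return sum(values) / len(values)


def squared_error(prediction, values):
    """Mean squared error of a constant prediction over the given values."""
    return sum((v - prediction) ** 2 for v in values) / len(values)


def bootstrap_632(values, n_boot=200, seed=0):
    """Plain .632 estimate: 0.368 * training error + 0.632 * out-of-bag error."""
    rng = random.Random(seed)
    n = len(values)
    train_err = squared_error(fit_mean(values), values)
    oob_errors = []
    for _ in range(n_boot):
        # Resample indices with replacement; points never drawn are out-of-bag.
        picked = [rng.randrange(n) for _ in range(n)]
        picked_set = set(picked)
        held_out = [values[i] for i in range(n) if i not in picked_set]
        if not held_out:  # rare: every point was drawn, so skip this resample
            continue
        model = fit_mean([values[i] for i in picked])
        oob_errors.append(squared_error(model, held_out))
    oob_err = sum(oob_errors) / len(oob_errors)
    return 0.368 * train_err + 0.632 * oob_err
```

The weights reflect that a bootstrap resample contains about 63.2% of the distinct training points on average, so the estimate blends the optimistic training error with the pessimistic out-of-bag error.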
NASA Technical Reports Server (NTRS)
Padovan, J.; Adams, M.; Lam, P.; Fertis, D.; Zeid, I.
1982-01-01
Second-year efforts within a three-year study to develop and extend finite element (FE) methodology to efficiently handle the transient/steady state response of rotor-bearing-stator structure associated with gas turbine engines are outlined. The two main areas aim at (1) implanting the squeeze film damper element into a general-purpose FE code for testing and evaluation; and (2) determining the numerical characteristics of the FE-generated rotor-bearing-stator simulation scheme. The governing FE field equations are set out and the solution methodology is presented. The choice of ADINA as the general-purpose FE code is explained, and the numerical operational characteristics of the direct integration approach of FE-generated rotor-bearing-stator simulations are determined, including benchmarking, comparison of explicit vs. implicit methodologies of direct integration, and demonstration problems.
Rahaman, Obaidur; Estrada, Trilce P.; Doren, Douglas J.; Taufer, Michela; Brooks, Charles L.; Armen, Roger S.
2011-01-01
The performance of several two-step scoring approaches for molecular docking was assessed for their ability to predict binding geometries and free energies. Two new scoring functions designed for “step 2 discrimination” were proposed and compared to our CHARMM implementation of the linear interaction energy (LIE) approach using the Generalized-Born with Molecular Volume (GBMV) implicit solvation model. A scoring function S1 was proposed by considering only “interacting” ligand atoms as the “effective size” of the ligand, and extended to an empirical regression-based pair potential S2. The S1 and S2 scoring schemes were trained and five-fold cross-validated on a diverse set of 259 protein-ligand complexes from the Ligand Protein Database (LPDB). The regression-based parameters for S1 and S2 also demonstrated reasonable transferability in the CSARdock 2010 benchmark using a new dataset (NRC HiQ) of diverse protein-ligand complexes. The ability of the scoring functions to accurately predict ligand geometry was evaluated by calculating the discriminative power (DP) of the scoring functions to identify native poses. The parameters for the LIE scoring function with the optimal discriminative power (DP) for geometry (step 1 discrimination) were found to be very similar to the best-fit parameters for binding free energy over a large number of protein-ligand complexes (step 2 discrimination). Reasonable performance of the scoring functions in enrichment of active compounds in four different protein target classes established that the parameters for S1 and S2 provided reasonable accuracy and transferability. Additional analysis was performed to definitively separate scoring function performance from molecular weight effects. This analysis included the prediction of ligand binding efficiencies for a subset of the CSARdock NRC HiQ dataset where the number of ligand heavy atoms ranged from 17 to 35. This range of ligand heavy atoms is where improved accuracy of predicted ligand efficiencies is most relevant to real-world drug design efforts. PMID:21644546
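As a point of reference, ligand efficiency is commonly defined as the binding free energy per heavy (non-hydrogen) atom. The sketch below uses that common definition; the abstract does not state the exact formula used in the study, so the function name and the sample values are assumptions for illustration only.

```python
def ligand_efficiency(delta_g_kcal_per_mol, n_heavy_atoms):
    """Common definition (assumption): -dG_bind divided by heavy-atom count."""
    if n_heavy_atoms <= 0:
        raise ValueError("n_heavy_atoms must be positive")
    return -delta_g_kcal_per_mol / n_heavy_atoms


# Hypothetical example: the same binding free energy spread over the two
# ends of the 17 to 35 heavy-atom range mentioned in the abstract.
small = ligand_efficiency(-8.5, 17)  # kcal/mol per heavy atom
large = ligand_efficiency(-8.5, 35)
```

Normalizing by heavy-atom count is what separates scoring performance from molecular weight effects: a larger ligand with the same free energy scores a lower efficiency.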
Parallel Visualization of Large-Scale Aerodynamics Calculations: A Case Study on the Cray T3E
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu; Crockett, Thomas W.
1999-01-01
This paper reports the performance of a parallel volume rendering algorithm for visualizing a large-scale, unstructured-grid dataset produced by a three-dimensional aerodynamics simulation. This dataset, containing over 18 million tetrahedra, allows us to extend our performance results to a problem which is more than 30 times larger than the one we examined previously. This high resolution dataset also allows us to see fine, three-dimensional features in the flow field. All our tests were performed on the Silicon Graphics Inc. (SGI)/Cray T3E operated by NASA's Goddard Space Flight Center. Using 511 processors, a rendering rate of almost 9 million tetrahedra/second was achieved with a parallel overhead of 26%.
McKinney, Bill; Meyer, Peter A; Crosas, Mercè; Sliz, Piotr
2017-01-01
Access to experimental X-ray diffraction image data is important for validation and reproduction of macromolecular models and indispensable for the development of structural biology processing methods. In response to the evolving needs of the structural biology community, we recently established a diffraction data publication system, the Structural Biology Data Grid (SBDG, data.sbgrid.org), to preserve primary experimental datasets supporting scientific publications. All datasets published through the SBDG are freely available to the research community under a public domain dedication license, with metadata compliant with the DataCite Schema (schema.datacite.org). A proof-of-concept study demonstrated community interest and utility. Publication of large datasets is a challenge shared by several fields, and the SBDG has begun collaborating with the Institute for Quantitative Social Science at Harvard University to extend the Dataverse (dataverse.org) open-source data repository system to structural biology datasets. Several extensions are necessary to support the size and metadata requirements for structural biology datasets. In this paper, we describe one such extension, functionality supporting preservation of file system structure within Dataverse, which is essential for both in-place computation and supporting non-HTTP data transfers. © 2016 New York Academy of Sciences.
Llamas, César; González, Manuel A; Hernández, Carmen; Vegas, Jesús
2016-10-01
Nearly every practical improvement in modeling human motion is well founded in a properly designed collection of data or datasets. These datasets must be made publicly available so that the community can validate and accept them. It is reasonable to concede that a collective, guided enterprise could serve to devise solid and substantial datasets, as a result of a collaborative effort, in the same sense as the open software community does. In this way datasets could be complemented, extended and expanded in size with, for example, more individuals, samples and human actions. For this to be possible, the collaborators must make some commitments, one of which is sharing the same data acquisition platform. In this paper, we offer an affordable open source hardware and software platform based on inertial wearable sensors, so that several groups can cooperate in the construction of datasets through common software suitable for collaboration. Some experimental results on the throughput of the overall system are reported, showing the feasibility of acquiring data from up to 6 sensors with a sampling frequency of no less than 118 Hz. Also, a proof-of-concept dataset is provided comprising sampled data from 12 subjects suitable for gait analysis. Copyright © 2016 Elsevier Inc. All rights reserved.
2016-04-30
previously, PALT duration was identified by customers as the single most important performance measure. In this section, we present the benchmark...decrease customer satisfaction. We identified percentage of warranted contracting officers ranging from 24% to 91% of an organization’s contracting...multiple uses of CPDO and other measures to optimize contract awards and meet the needs of procurement customers more effectively. Extending this