Sample records for population-based dataset methods

  1. Updated population metadata for United States historical climatology network stations

    USGS Publications Warehouse

    Owen, T.W.; Gallo, K.P.

    2000-01-01

    The United States Historical Climatology Network (HCN) serial temperature dataset comprises 1221 high-quality, long-term climate observing stations. The HCN dataset is available in several versions, one of which includes population-based temperature modifications to adjust urban temperatures for the "heat-island" effect. Unfortunately, the decennial population metadata file is incomplete: values are missing for 17.6% of the 12,210 population values associated with the 1221 individual stations during the 1900-90 interval. Retrospective grid-based populations within a fixed distance of an HCN station were estimated through the use of a gridded population density dataset and historically available U.S. Census county data. The grid-based populations for the HCN stations are derived from a consistent methodology, whereas the current HCN populations can vary as definitions of the area associated with a city change over time. Grid-based populations may, at a minimum, be appropriate for augmenting populations of HCN climate stations that lack any population data, and are recommended when consistent and complete population data are required. The recommended urban temperature adjustments based on the HCN and grid-based methods of estimating station population can be significantly different for individual stations within the HCN dataset.
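    A grid-based station population of this kind can be sketched as a sum of gridded population density over the cells within a fixed radius of the station. The grid values, cell area, and radius below are hypothetical illustrations, not the HCN data.

```python
import numpy as np

def station_population(density, cell_area_km2, station_rc, radius_cells):
    """Sum gridded population within a fixed radius (in cells) of a station.

    density      : 2-D array of persons per km^2 (hypothetical values).
    cell_area_km2: area of one grid cell in km^2.
    station_rc   : (row, col) of the station's cell.
    radius_cells : search radius expressed in cells.
    """
    rows, cols = np.indices(density.shape)
    r0, c0 = station_rc
    mask = (rows - r0) ** 2 + (cols - c0) ** 2 <= radius_cells ** 2
    return float((density[mask] * cell_area_km2).sum())

grid = np.full((11, 11), 100.0)                 # uniform 100 persons per km^2
pop = station_population(grid, 1.0, (5, 5), 2)  # 13 cells fall in the radius -> 1300.0
```

    Because the radius and grid are fixed, the same definition applies identically at every station and every census decade, which is the consistency property the abstract emphasizes.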

  2. High resolution population distribution maps for Southeast Asia in 2010 and 2015.

    PubMed

    Gaughan, Andrea E; Stevens, Forrest R; Linard, Catherine; Jia, Peng; Tatem, Andrew J

    2013-01-01

    Spatially accurate, contemporary data on human population distributions are vitally important to many applied and theoretical researchers. The Southeast Asia region has undergone rapid urbanization and population growth over the past decade, yet existing spatial population distribution datasets covering the region are based principally on population count data from censuses circa 2000, with often insufficient spatial resolution or input data to map settlements precisely. Here we outline approaches to construct a database of GIS-linked circa 2010 census data and methods used to construct fine-scale (∼100 meters spatial resolution) population distribution datasets for each country in the Southeast Asia region. Landsat-derived settlement maps and land cover information were combined with ancillary datasets on infrastructure to model population distributions for 2010 and 2015. These products were compared with those from two other methods used to construct commonly used global population datasets. Results indicate mapping accuracies are consistently higher when incorporating land cover and settlement information into the AsiaPop modelling process. Using existing data, it is possible to produce detailed, contemporary and easily updatable population distribution datasets for Southeast Asia. The 2010 and 2015 datasets produced are freely available as a product of the AsiaPop Project and can be downloaded from: www.asiapop.org.

  3. High Resolution Population Distribution Maps for Southeast Asia in 2010 and 2015

    PubMed Central

    Gaughan, Andrea E.; Stevens, Forrest R.; Linard, Catherine; Jia, Peng; Tatem, Andrew J.

    2013-01-01

    Spatially accurate, contemporary data on human population distributions are vitally important to many applied and theoretical researchers. The Southeast Asia region has undergone rapid urbanization and population growth over the past decade, yet existing spatial population distribution datasets covering the region are based principally on population count data from censuses circa 2000, with often insufficient spatial resolution or input data to map settlements precisely. Here we outline approaches to construct a database of GIS-linked circa 2010 census data and methods used to construct fine-scale (∼100 meters spatial resolution) population distribution datasets for each country in the Southeast Asia region. Landsat-derived settlement maps and land cover information were combined with ancillary datasets on infrastructure to model population distributions for 2010 and 2015. These products were compared with those from two other methods used to construct commonly used global population datasets. Results indicate mapping accuracies are consistently higher when incorporating land cover and settlement information into the AsiaPop modelling process. Using existing data, it is possible to produce detailed, contemporary and easily updatable population distribution datasets for Southeast Asia. The 2010 and 2015 datasets produced are freely available as a product of the AsiaPop Project and can be downloaded from: www.asiapop.org. PMID:23418469

  4. Knowledge-Guided Robust MRI Brain Extraction for Diverse Large-Scale Neuroimaging Studies on Humans and Non-Human Primates

    PubMed Central

    Wang, Yaping; Nie, Jingxin; Yap, Pew-Thian; Li, Gang; Shi, Feng; Geng, Xiujuan; Guo, Lei; Shen, Dinggang

    2014-01-01

    Accurate and robust brain extraction is a critical step in most neuroimaging analysis pipelines. In particular, for large-scale multi-site neuroimaging studies involving many subjects across diverse age and diagnostic groups, automatic, consistent, accurate, and robust brain extraction is highly desirable. In this paper, we introduce population-specific probability maps to guide the brain extraction of diverse subject groups, including both healthy and diseased adult human populations, both developing and aging human populations, as well as non-human primates. Specifically, the proposed method combines an atlas-based approach, for coarse skull-stripping, with a deformable-surface-based approach that is guided by local intensity information and population-specific prior information learned from a set of real brain images for more localized refinement. Comprehensive quantitative evaluations were performed on the diverse large-scale populations of the ADNI dataset with over 800 subjects (55∼90 years of age, multi-site, various diagnosis groups), the OASIS dataset with over 400 subjects (18∼96 years of age, wide age range, various diagnosis groups), and the NIH pediatrics dataset with 150 subjects (5∼18 years of age, multi-site, wide age range as a complementary age group to the adult datasets). The results demonstrate that our method consistently yields the best overall results across almost the entire human life span, with only a single set of parameters. To demonstrate its capability to work on non-human primates, the proposed method is further evaluated using a rhesus macaque dataset with 20 subjects. Quantitative comparisons with popular state-of-the-art methods, including BET, Two-pass BET, BET-B, BSE, HWA, ROBEX and AFNI, demonstrate that the proposed method performs favorably, with superior performance on all testing datasets, indicating its robustness and effectiveness. PMID:24489639

  5. The effects of spatial population dataset choice on estimates of population at risk of disease

    PubMed Central

    2011-01-01

    Background The spatial modeling of infectious disease distributions and dynamics is increasingly being undertaken for health services planning and disease control monitoring, implementation, and evaluation. Where risks are heterogeneous in space or dependent on person-to-person transmission, spatial data on human population distributions are required to estimate infectious disease risks, burdens, and dynamics. Several different modeled human population distribution datasets are available and widely used, but the disparities among them and the implications for enumerating disease burdens and populations at risk have not been considered systematically. Here, we quantify some of these effects using global estimates of populations at risk (PAR) of P. falciparum malaria as an example. Methods The recent construction of a global map of P. falciparum malaria endemicity enabled the testing of different gridded population datasets for providing estimates of PAR by endemicity class. The estimated population numbers within each class were calculated for each country using four different global gridded human population datasets: GRUMP (~1 km spatial resolution), LandScan (~1 km), UNEP Global Population Databases (~5 km), and GPW3 (~5 km). More detailed assessments of PAR variation and accuracy were conducted for three African countries where census data were available at a higher administrative-unit level than used by any of the four gridded population datasets. Results The estimates of PAR based on the datasets varied by more than 10 million people for some countries, even accounting for the fact that estimates of population totals made by different agencies are used to correct national totals in these datasets and can vary by more than 5% for many low-income countries. In many cases, these variations in PAR estimates comprised more than 10% of the total national population. 
The detailed country-level assessments suggested that none of the datasets was consistently more accurate than the others in estimating PAR. The sizes of such differences among modeled human populations were related to variations in the methods, input resolution, and date of the census data underlying each dataset. Data quality varied from country to country within the spatial population datasets. Conclusions Detailed, highly spatially resolved human population data are an essential resource for planning health service delivery for disease control, for the spatial modeling of epidemics, and for decision-making processes related to public health. However, our results highlight that for the low-income regions of the world where disease burden is greatest, existing datasets display substantial variations in estimated population distributions, resulting in uncertainty in disease assessments that utilize them. Increased efforts are required to gather contemporary and spatially detailed demographic data to reduce this uncertainty, particularly in Africa, and to develop population distribution modeling methods that match the rigor, sophistication, and ability to handle uncertainty of contemporary disease mapping and spread modeling. In the meantime, studies that utilize a particular spatial population dataset need to acknowledge the uncertainties inherent within them and consider how the methods and data that comprise each will affect conclusions. PMID:21299885
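    The core PAR computation described above is a zonal sum: total the gridded population whose cell falls in each endemicity class. The tiny grids and class codes below are hypothetical stand-ins for the gridded datasets named in the abstract.

```python
import numpy as np

def population_at_risk(population, endemicity, classes):
    """Zonal sum: total population whose cell falls in each endemicity class."""
    return {c: float(population[endemicity == c].sum()) for c in classes}

pop_grid = np.array([[10.0, 20.0],
                     [30.0, 40.0]])
endemicity = np.array([[0, 1],
                       [1, 2]])   # hypothetical class codes: 0 = free, 1, 2 = endemic classes
par = population_at_risk(pop_grid, endemicity, [0, 1, 2])
```

    Running the same zonal sum against different population grids (GRUMP, LandScan, UNEP, GPW3) while holding the endemicity surface fixed is exactly how dataset choice produces the divergent PAR estimates the study reports.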

  6. A comprehensive population dataset for Afghanistan constructed using GIS-based dasymetric mapping methods

    USGS Publications Warehouse

    Thompson, Allyson L.; Hubbard, Bernard E.

    2014-01-01

    This report summarizes the application of dasymetric methods for mapping the distribution of population throughout Afghanistan. Because Afghanistan's population has constantly changed through decades of war and conflict, existing vector and raster GIS datasets (such as point settlement densities and intensities of lights at night) do not adequately reflect the changes. The purposes of this report are (1) to provide historic population data at the provincial and district levels that can be used to chart population growth and migration trends within the country and (2) to provide baseline information that can be used for other types of spatial analyses of Afghanistan, such as resource and hazard assessments; infrastructure and capacity rebuilding; and assisting with international, regional, and local planning.
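    The essence of dasymetric mapping is to spread a census-unit population total across the unit's grid cells in proportion to ancillary weights such as land-cover class. The class codes and weights below are hypothetical, not the report's calibrated values.

```python
import numpy as np

def dasymetric_redistribute(unit_total, landcover, class_weights):
    """Spread a census-unit population total across grid cells in proportion
    to per-land-cover-class weights (hypothetical illustration)."""
    w = np.vectorize(class_weights.get)(landcover).astype(float)
    if w.sum() == 0:
        raise ValueError("no habitable cells in this unit")
    return unit_total * w / w.sum()

landcover = np.array([[1, 1, 2],
                      [2, 0, 0]])       # 1 = built-up, 2 = cropland, 0 = water
cells = dasymetric_redistribute(900, landcover, {0: 0.0, 1: 3.0, 2: 1.0})
```

    By construction the redistributed cells sum back to the unit total, so province- and district-level counts are preserved while the within-unit distribution follows the ancillary data.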

  7. Investigating population continuity with ancient DNA under a spatially explicit simulation framework.

    PubMed

    Silva, Nuno Miguel; Rio, Jeremy; Currat, Mathias

    2017-12-15

    Recent advances in sequencing technologies have allowed for the retrieval of ancient DNA data (aDNA) from skeletal remains, providing direct genetic snapshots from diverse periods of human prehistory. Comparing samples taken in the same region but at different times, hereafter called "serial samples", may indicate whether there is continuity in the peopling history of that area or whether an immigration of a genetically different population has occurred between the two sampling times. However, the exploration of genetic relationships between serial samples generally ignores their geographical locations and the spatiotemporal dynamics of populations. Here, we present a new coalescent-based, spatially explicit modelling approach to investigate population continuity using aDNA, which includes two fundamental elements neglected in previous methods: population structure and migration. The approach also considers the extensive temporal and geographical variance that is commonly found in aDNA population samples. We first showed that our spatially explicit approach is more conservative than the previous (panmictic) approach and should be preferred to test for population continuity, especially when small and isolated populations are considered. We then applied our method to two mitochondrial datasets from Germany and France, both including modern and ancient lineages dating from the early Neolithic. The results clearly reject population continuity for the maternal line over the last 7500 years for the German dataset but not for the French dataset, suggesting regional heterogeneity in post-Neolithic migratory processes. Here, we demonstrate the benefits of using a spatially explicit method when investigating population continuity with aDNA. It constitutes an improvement over panmictic methods by considering the spatiotemporal dynamics of genetic lineages and the precise location of ancient samples. 
The method can be used to investigate population continuity between any pair of serial samples (ancient-ancient or ancient-modern) and to investigate more complex evolutionary scenarios. Although we based our study on mitochondrial DNA sequences, diploid molecular markers of different types (DNA, SNP, STR) can also be simulated with our approach. It thus constitutes a promising tool for the analysis of the numerous aDNA datasets being produced, including genome-wide data, in humans but also in many other species.

  8. Differential privacy based on importance weighting

    PubMed Central

    Ji, Zhanglong

    2014-01-01

    This paper analyzes a novel method for publishing data while still protecting privacy. The method is based on computing weights that make an existing dataset, for which there are no confidentiality issues, analogous to the dataset that must be kept private. The existing dataset may be genuine but public already, or it may be synthetic. The weights are importance sampling weights, but to protect privacy, they are regularized and have noise added. The weights allow statistical queries to be answered approximately while provably guaranteeing differential privacy. We derive an expression for the asymptotic variance of the approximate answers. Experiments show that the new mechanism performs well even when the privacy budget is small, and when the public and private datasets are drawn from different populations. PMID:24482559
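    The mechanism described can be sketched as follows: clip (regularize) the importance weights so each record's influence is bounded, add Laplace noise scaled to that bound, and answer the query with the noisy weights. This is a simplified illustration of the idea only, not the paper's exact mechanism or its noise calibration; the data and weights are toy values.

```python
import numpy as np

rng = np.random.default_rng(42)

def private_weighted_mean(public_x, raw_weights, clip=5.0, epsilon=1.0):
    """Answer a mean query on public data reweighted toward a private
    distribution. Clipping bounds the sensitivity of each weight; Laplace
    noise scaled to that bound is added before release."""
    w = np.clip(raw_weights, 0.0, clip)
    noisy_w = w + rng.laplace(scale=clip / (epsilon * len(w)), size=w.shape)
    return float(np.mean(noisy_w * public_x))

x_public = np.linspace(0.0, 1.0, 1000)  # hypothetical public sample
raw_w = np.exp(2.0 * x_public - 1.0)    # toy importance ratios
answer = private_weighted_mean(x_public, raw_w)
```

    With many records the added noise averages out, which is why, as the abstract notes, the mechanism can perform well even for small privacy budgets.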

  9. Analysis of genetic population structure in Acacia caven (Leguminosae, Mimosoideae), comparing one exploratory and two Bayesian-model-based methods.

    PubMed

    Pometti, Carolina L; Bessega, Cecilia F; Saidman, Beatriz O; Vilardi, Juan C

    2014-03-01

    Bayesian clustering as implemented in the STRUCTURE or GENELAND software is widely used to form genetic groups of populations or individuals. On the other hand, to satisfy the need for less computer-intensive approaches, multivariate analyses are specifically devoted to extracting information from large datasets. In this paper, we report the use of a dataset of AFLP markers from 15 sampling sites of Acacia caven to study genetic structure and to compare the consistency of three methods: STRUCTURE, GENELAND and DAPC. Of these, DAPC was the fastest and accurately inferred the number of populations K (K = 12 using the find.clusters option and K = 15 with a priori population information). GENELAND, in turn, maps membership probabilities for individuals or populations in space when coordinates are specified (K = 12). STRUCTURE also inferred the number of populations K and the membership probabilities of individuals based on ancestry, giving K = 11 without prior population information and K = 15 using the LOCPRIOR option. Finally, all three methods showed high consistency in estimating population structure, inferring similar numbers of populations and similar membership probabilities of individuals to each group, with high correlation among them.

  10. SHIPS: Spectral Hierarchical Clustering for the Inference of Population Structure in Genetic Studies

    PubMed Central

    Bouaziz, Matthieu; Paccard, Caroline; Guedj, Mickael; Ambroise, Christophe

    2012-01-01

    Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. The parametric algorithms, such as Structure, are very popular but their underlying complexity and their high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches. Among this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and with the use of the gap statistic estimates the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets, that are data from the HapMap and Pan-Asian SNP consortium. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the more consistent with the population labels or those produced by the Admixture program. 
The performances of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use make this method a promising solution to infer fine-scale genetic patterns. PMID:23077494
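    SHIPS estimates the optimal number of sub-populations with the gap statistic: compare the log within-cluster dispersion of the data against the same quantity on uniform reference datasets. The sketch below illustrates that step with a deterministic toy k-means on synthetic 2-D points; it is not the SHIPS spectral algorithm itself, and all data are made up.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Toy Lloyd's k-means with a deterministic, spread-out initialization."""
    idx = np.linspace(0, len(X) - 1, k).astype(int)
    centers = X[idx].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

def within_dispersion(X, labels, centers):
    return sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(len(centers)))

def gap_statistic(X, k, n_ref=10, seed=0):
    """Gap = mean log dispersion of uniform reference data - log dispersion of X."""
    rng = np.random.default_rng(seed)
    labels, centers = kmeans(X, k)
    log_w = np.log(within_dispersion(X, labels, centers))
    lo, hi = X.min(0), X.max(0)
    refs = []
    for _ in range(n_ref):
        R = rng.uniform(lo, hi, size=X.shape)
        rl, rc = kmeans(R, k)
        refs.append(np.log(within_dispersion(R, rl, rc)))
    return float(np.mean(refs) - log_w)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)),      # two well-separated synthetic groups
               rng.normal(100, 1, (20, 2))])
```

    For two well-separated groups the gap at k = 2 clearly exceeds the gap at k = 1, which is how the statistic selects the number of sub-populations.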

  11. Analysis of genetic population structure in Acacia caven (Leguminosae, Mimosoideae), comparing one exploratory and two Bayesian-model-based methods

    PubMed Central

    Pometti, Carolina L.; Bessega, Cecilia F.; Saidman, Beatriz O.; Vilardi, Juan C.

    2014-01-01

    Bayesian clustering as implemented in the STRUCTURE or GENELAND software is widely used to form genetic groups of populations or individuals. On the other hand, to satisfy the need for less computer-intensive approaches, multivariate analyses are specifically devoted to extracting information from large datasets. In this paper, we report the use of a dataset of AFLP markers from 15 sampling sites of Acacia caven to study genetic structure and to compare the consistency of three methods: STRUCTURE, GENELAND and DAPC. Of these, DAPC was the fastest and accurately inferred the number of populations K (K = 12 using the find.clusters option and K = 15 with a priori population information). GENELAND, in turn, maps membership probabilities for individuals or populations in space when coordinates are specified (K = 12). STRUCTURE also inferred the number of populations K and the membership probabilities of individuals based on ancestry, giving K = 11 without prior population information and K = 15 using the LOCPRIOR option. Finally, all three methods showed high consistency in estimating population structure, inferring similar numbers of populations and similar membership probabilities of individuals to each group, with high correlation among them. PMID:24688293

  12. A window-based time series feature extraction method.

    PubMed

    Katircioglu-Öztürk, Deniz; Güvenir, H Altay; Ravens, Ursula; Baykal, Nazife

    2017-10-01

    This study proposes a robust similarity score-based time series feature extraction method, termed Window-based Time series Feature ExtraCtion (WTC). Specifically, WTC generates domain-interpretable results and has significantly low computational complexity, rendering it useful for densely sampled and populated time series datasets. In this study, WTC is applied to a proprietary action potential (AP) time series dataset on human cardiomyocytes and three precordial leads from a publicly available electrocardiogram (ECG) dataset. WTC is then compared, in terms of predictive accuracy and computational complexity, with the shapelet transform and the fast shapelet transform (an accelerated variant of the shapelet transform). The results indicate that WTC achieves slightly higher classification performance with significantly lower execution time than its shapelet-based alternatives. With respect to its interpretable features, WTC has the potential to enable medical experts to explore definitive common trends in novel datasets.
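    The window-based idea can be sketched as sliding a fixed-width window over a series and emitting simple, domain-interpretable per-window statistics. The feature choices (mean, standard deviation, range) and the sine-wave input below are illustrative assumptions, not the published WTC feature set.

```python
import numpy as np

def window_features(series, width, step):
    """Slide a fixed-width window over the series and emit interpretable
    per-window features: (mean, std, range)."""
    feats = []
    for start in range(0, len(series) - width + 1, step):
        w = series[start:start + width]
        feats.append((w.mean(), w.std(), w.max() - w.min()))
    return np.array(feats)

x = np.sin(np.linspace(0, 2 * np.pi, 100))   # toy signal standing in for an AP/ECG trace
F = window_features(x, width=25, step=25)    # 4 non-overlapping windows, 3 features each
```

    Each row of `F` describes one window in units of the original signal, which is what makes such features directly readable by domain experts.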

  13. FIFS: A data mining method for informative marker selection in high dimensional population genomic data.

    PubMed

    Kavakiotis, Ioannis; Samaras, Patroklos; Triantafyllidis, Alexandros; Vlahavas, Ioannis

    2017-11-01

    Single Nucleotide Polymorphisms (SNPs) are nowadays becoming the marker of choice for biological analyses involving a wide range of applications of great medical, biological, economic and environmental interest. Classification tasks, i.e. the assignment of individuals to groups of origin based on their (multi-locus) genotypes, are performed in many fields such as forensic investigations, discrimination between wild and/or farmed populations, and others. These tasks should be performed with a small number of loci, for computational as well as biological reasons. Thus, feature selection should precede classification tasks, especially for SNP datasets, where the number of features can amount to hundreds of thousands or millions. In this paper, we present a novel data mining approach, called FIFS (Frequent Item Feature Selection), based on the use of frequent items for selection of the most informative markers from population genomic data. It is a modular method consisting of two main components. The first identifies the most frequent and unique genotypes for each sampled population. The second selects the most appropriate among them in order to create the informative SNP subsets to be returned. FIFS was tested on a real dataset comprising a comprehensive coverage of pig breed types present in Britain, consisting of 446 individuals divided into 14 sub-populations and genotyped at 59,436 SNPs. Our method outperforms the state-of-the-art and baseline methods in every case. More specifically, it surpassed the assignment accuracy threshold of 95% needing only half the number of SNPs selected by other methods (FIFS: 28 SNPs; Delta: 70 SNPs; pairwise FST: 70 SNPs; In: 100 SNPs). Conclusion: Our approach successfully deals with the problem of informative marker selection in high-dimensional genomic datasets. It offers better results than existing approaches and can aid biologists in selecting the most informative markers with maximum discrimination power for optimizing cost-effective panels, with applications in species identification, wildlife management, and forensics.
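    The frequent-item idea can be illustrated as: for each population, find each locus's majority genotype, keep it if it is frequent enough, and flag loci where that majority differs from every other population's majority. This is a toy rendition of the concept, not the published FIFS implementation; the population names and genotypes are invented.

```python
from collections import Counter

def informative_loci(pops, min_freq=0.6):
    """Toy frequent-item marker selection. pops maps a population name to a
    list of individuals, each a list of per-locus genotype strings."""
    majority = {}
    for name, individuals in pops.items():
        majority[name] = {}
        for locus, column in enumerate(zip(*individuals)):
            genotype, count = Counter(column).most_common(1)[0]
            if count / len(column) >= min_freq:      # frequent item at this locus
                majority[name][locus] = genotype
    return {
        name: [
            locus for locus, g in majority[name].items()
            if all(majority[other].get(locus) != g
                   for other in pops if other != name)  # unique to this population
        ]
        for name in pops
    }

pops = {
    "wild":   [["AA", "CC"], ["AA", "CT"], ["AA", "CC"]],
    "farmed": [["AG", "CC"], ["AA", "CC"], ["AG", "CC"]],
}
panel = informative_loci(pops)
```

    Locus 0 separates the two toy populations (majority AA vs AG) while locus 1 does not (CC in both), so only locus 0 enters each panel.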

  14. Please Don't Move-Evaluating Motion Artifact From Peripheral Quantitative Computed Tomography Scans Using Textural Features.

    PubMed

    Rantalainen, Timo; Chivers, Paola; Beck, Belinda R; Robertson, Sam; Hart, Nicolas H; Nimphius, Sophia; Weeks, Benjamin K; McIntyre, Fleur; Hands, Beth; Siafarikas, Aris

    Most imaging methods, including peripheral quantitative computed tomography (pQCT), are susceptible to motion artifacts, particularly in fidgety pediatric populations. Methods currently used to address motion artifact include manual screening (visual inspection) and objective assessments of the scans. However, previously reported objective methods either cannot be applied to the reconstructed image or have not been tested for distal bone sites. Therefore, the purpose of the present study was to develop and validate motion artifact classifiers to quantify motion artifact in pQCT scans. We tested whether textural features could provide adequate motion artifact classification performance in two adolescent datasets with pQCT scans of tibial and radial diaphyses and epiphyses. The first dataset was split into training (66% of sample) and validation (33% of sample) datasets. Visual classification was used as the ground truth. Moderate to substantial classification performance (J48 classifier, kappa coefficients from 0.57 to 0.80) was observed in the validation dataset with the novel texture-based classifier. In applying the same classifier to the second, cross-sectional dataset, slight-to-fair (κ = 0.01-0.39) classification performance was observed. Overall, this novel textural-analysis-based classifier provided moderate-to-substantial classification of motion artifact when the classifier was specifically trained for the measurement device and population. Classification based on textural features may be used to prescreen obviously acceptable and unacceptable scans, with subsequent human-operated visual classification of any remaining scans.
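    Agreement against the visual ground truth above is reported as Cohen's kappa, i.e. chance-corrected agreement between two classifications. A minimal implementation (toy labels, not the study's data):

```python
def cohens_kappa(y_true, y_pred):
    """Chance-corrected agreement between two label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_expected = sum(
        (y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)

kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])   # 0.5: "moderate" on the common scale
```

    Values around 0.57-0.80, as in the validation dataset, fall in the moderate-to-substantial band of the commonly used Landis-Koch interpretation, while 0.01-0.39 is slight to fair.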

  15. A multitask clustering approach for single-cell RNA-seq analysis in Recessive Dystrophic Epidermolysis Bullosa

    PubMed Central

    Petegrosso, Raphael; Tolar, Jakub

    2018-01-01

    Single-cell RNA sequencing (scRNA-seq) has been widely applied to discover new cell types by detecting sub-populations in a heterogeneous group of cells. Since scRNA-seq experiments have lower read coverage/tag counts and introduce more technical biases compared to bulk RNA-seq experiments, the limited number of sampled cells combined with the experimental biases and other dataset specific variations presents a challenge to cross-dataset analysis and discovery of relevant biological variations across multiple cell populations. In this paper, we introduce a method of variance-driven multitask clustering of single-cell RNA-seq data (scVDMC) that utilizes multiple single-cell populations from biological replicates or different samples. scVDMC clusters single cells in multiple scRNA-seq experiments of similar cell types and markers but varying expression patterns such that the scRNA-seq data are better integrated than typical pooled analyses which only increase the sample size. By controlling the variance among the cell clusters within each dataset and across all the datasets, scVDMC detects cell sub-populations in each individual experiment with shared cell-type markers but varying cluster centers among all the experiments. Applied to two real scRNA-seq datasets with several replicates and one large-scale droplet-based dataset on three patient samples, scVDMC more accurately detected cell populations and known cell markers than pooled clustering and other recently proposed scRNA-seq clustering methods. In the case study applied to in-house Recessive Dystrophic Epidermolysis Bullosa (RDEB) scRNA-seq data, scVDMC revealed several new cell types and unknown markers validated by flow cytometry. MATLAB/Octave code available at https://github.com/kuanglab/scVDMC. PMID:29630593

  16. Recommended GIS Analysis Methods for Global Gridded Population Data

    NASA Astrophysics Data System (ADS)

    Frye, C. E.; Sorichetta, A.; Rose, A.

    2017-12-01

    When using geographic information systems (GIS) to analyze gridded, i.e., raster, population data, analysts need a detailed understanding of several factors that affect raster data processing and, thus, the accuracy of the results. Global raster data are most often provided in an unprojected state, usually in the WGS 1984 geographic coordinate system. Most GIS functions and tools evaluate data based on overlay relationships (area) or proximity (distance). Area and distance for global raster data can be calculated either directly on the various earth ellipsoids or after transforming the data to equal-area/equidistant projected coordinate systems so that all locations are analyzed equally. However, unlike when projecting vector data, not all projected coordinate systems can support such analyses equally, and transforming raster data from one coordinate space to another often results in unmanaged loss of data through a process called resampling. Resampling determines which values to use in the result dataset given an imperfect locational match in the input dataset(s). Cell size or resolution, registration, resampling method, statistical type, and whether the raster represents continuous or discrete information all potentially influence the quality of the result. Gridded population data represent estimates of population in each raster cell, and this presentation will provide guidelines for accurately transforming population rasters for analysis in GIS. Resampling also affects the display of high-resolution global gridded population data, and we will discuss how to properly handle pyramid creation using the Aggregate tool with the sum option to create overviews for mosaic datasets.
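    The reason the abstract insists on SUM aggregation for population rasters can be shown in a few lines: coarsening cells by summing block totals preserves the population count, whereas mean or nearest-neighbour resampling would not. The toy grid below stands in for a population raster.

```python
import numpy as np

def aggregate_sum(raster, factor):
    """Coarsen a count raster by an integer factor using SUM so that the
    population total is preserved."""
    r, c = raster.shape
    if r % factor or c % factor:
        raise ValueError("raster dimensions must be divisible by factor")
    return raster.reshape(r // factor, factor, c // factor, factor).sum(axis=(1, 3))

grid = np.arange(16, dtype=float).reshape(4, 4)   # toy population counts
coarse = aggregate_sum(grid, 2)                   # 4x4 -> 2x2, total preserved
```

    Every output cell is the total of its 2 x 2 block, so overviews built this way remain valid population counts at every pyramid level.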

  17. A high resolution spatial population database of Somalia for disease risk mapping.

    PubMed

    Linard, Catherine; Alegana, Victor A; Noor, Abdisalan M; Snow, Robert W; Tatem, Andrew J

    2010-09-14

    Millions of Somalis have been deprived of basic health services due to the unstable political situation of their country. Attempts are being made to reconstruct the health sector, in particular to estimate the extent of infectious disease burden. However, any approach that requires the use of modelled disease rates requires reasonable information on population distribution. In a low-income country such as Somalia, population data are lacking, are of poor quality, or become outdated rapidly. Modelling methods are therefore needed for the production of contemporary and spatially detailed population data. Here land cover information derived from satellite imagery and existing settlement point datasets were used for the spatial reallocation of populations within census units. We used simple and semi-automated methods that can be implemented with free image processing software to produce an easily updatable gridded population dataset at 100 × 100 meters spatial resolution. The 2010 population dataset was matched to administrative population totals projected by the UN. Comparison tests between the new dataset and existing population datasets revealed important differences in population size distributions, and in population at risk of malaria estimates. These differences are particularly important in more densely populated areas and strongly depend on the settlement data used in the modelling approach. The results show that it is possible to produce detailed, contemporary and easily updatable settlement and population distribution datasets of Somalia using existing data. The 2010 population dataset produced is freely available as a product of the AfriPop Project and can be downloaded from: http://www.afripop.org.

  18. A high resolution spatial population database of Somalia for disease risk mapping

    PubMed Central

    2010-01-01

    Background Millions of Somalis have been deprived of basic health services due to the unstable political situation of their country. Attempts are being made to reconstruct the health sector, in particular to estimate the extent of infectious disease burden. However, any approach that requires the use of modelled disease rates requires reasonable information on population distribution. In a low-income country such as Somalia, population data are lacking, are of poor quality, or become outdated rapidly. Modelling methods are therefore needed for the production of contemporary and spatially detailed population data. Results Here land cover information derived from satellite imagery and existing settlement point datasets were used for the spatial reallocation of populations within census units. We used simple and semi-automated methods that can be implemented with free image processing software to produce an easily updatable gridded population dataset at 100 × 100 meters spatial resolution. The 2010 population dataset was matched to administrative population totals projected by the UN. Comparison tests between the new dataset and existing population datasets revealed important differences in population size distributions and in estimates of the population at risk of malaria. These differences are particularly important in more densely populated areas and strongly depend on the settlement data used in the modelling approach. Conclusions The results show that it is possible to produce detailed, contemporary and easily updatable settlement and population distribution datasets of Somalia using existing data. The 2010 population dataset produced is freely available as a product of the AfriPop Project and can be downloaded from: http://www.afripop.org. PMID:20840751

  19. Population Health Metrics Research Consortium gold standard verbal autopsy validation study: design, implementation, and development of analysis datasets

    PubMed Central

    2011-01-01

    Background Verbal autopsy methods are critically important for evaluating the leading causes of death in populations without adequate vital registration systems. With a myriad of analytical and data collection approaches, it is essential to create a high quality validation dataset from different populations to evaluate comparative method performance and make recommendations for future verbal autopsy implementation. This study was undertaken to compile a set of strictly defined gold standard deaths for which verbal autopsies were collected to validate the accuracy of different methods of verbal autopsy cause of death assignment. Methods Data collection was implemented in six sites in four countries: Andhra Pradesh, India; Bohol, Philippines; Dar es Salaam, Tanzania; Mexico City, Mexico; Pemba Island, Tanzania; and Uttar Pradesh, India. The Population Health Metrics Research Consortium (PHMRC) developed stringent diagnostic criteria including laboratory, pathology, and medical imaging findings to identify gold standard deaths in health facilities as well as an enhanced verbal autopsy instrument based on World Health Organization (WHO) standards. A cause list was constructed based on the WHO Global Burden of Disease estimates of the leading causes of death, potential to identify unique signs and symptoms, and the likely existence of sufficient medical technology to ascertain gold standard cases. Blinded verbal autopsies were collected on all gold standard deaths. Results Over 12,000 verbal autopsies on deaths with gold standard diagnoses were collected (7,836 adults, 2,075 children, 1,629 neonates, and 1,002 stillbirths). Difficulties in finding sufficient cases to meet gold standard criteria as well as problems with misclassification for certain causes meant that the target list of causes for analysis was reduced to 34 for adults, 21 for children, and 10 for neonates, excluding stillbirths. 
To ensure strict independence for the validation of methods and assessment of comparative performance, 500 test-train datasets were created from the universe of cases, covering a range of cause-specific compositions. Conclusions This unique, robust validation dataset will allow scholars to evaluate the performance of different verbal autopsy analytic methods as well as instrument design. This dataset can be used to inform the implementation of verbal autopsies to more reliably ascertain cause of death in national health information systems. PMID:21816095

  20. Decision tree methods: applications for classification and prediction.

    PubMed

    Song, Yan-Yan; Lu, Ying

    2015-04-25

    Decision tree methodology is a commonly used data mining method for establishing classification systems based on multiple covariates or for developing prediction algorithms for a target variable. This method classifies a population into branch-like segments that form an inverted tree with a root node, internal nodes, and leaf nodes. The algorithm is non-parametric and can efficiently handle large, complicated datasets without imposing a complicated parametric structure. When the sample size is large enough, the study data can be divided into training and validation datasets: the training dataset is used to build the decision tree model, and the validation dataset is used to decide on the appropriate tree size needed to achieve the optimal final model. This paper introduces the algorithms most frequently used to develop decision trees (including CART, C4.5, CHAID, and QUEST) and describes the SPSS and SAS programs that can be used to visualize tree structure.
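    The CART-style splitting that underlies such trees can be shown with a short, self-contained sketch that finds the single best threshold on one covariate by minimizing weighted Gini impurity (a simplification of a full tree-growing algorithm, not the paper's code):

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(x, y):
    """Find the threshold on a single covariate that minimizes the
    weighted Gini impurity of the two child nodes (CART-style)."""
    best = (None, float("inf"))
    for t in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

x = [1, 2, 3, 10, 11, 12]
y = ["a", "a", "a", "b", "b", "b"]
t, score = best_split(x, y)
assert t == 3 and score == 0.0   # perfect split at x <= 3
```

    A full tree grows by applying this search recursively to each child node until a stopping rule (the tree size chosen on the validation data) is met.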

  1. Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification.

    PubMed

    Li, Jinyan; Fong, Simon; Sung, Yunsick; Cho, Kyungeun; Wong, Raymond; Wong, Kelvin K L

    2016-01-01

    An imbalanced dataset is defined as a training dataset that has imbalanced proportions of data in the interesting and uninteresting classes. Often in biomedical applications, samples from the class of interest are rare in a population, such as medical anomalies, positive clinical tests, and particular diseases. Although the target samples in the primitive dataset are small in number, the induction of a classification model over such training data leads to poor prediction performance due to insufficient training from the minority class. In this paper, we use a novel class-balancing method named adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique (ASCB_DmSMOTE) to solve this imbalanced dataset problem, which is common in biomedical applications. The proposed method combines under-sampling and over-sampling into a swarm optimisation algorithm. It adaptively selects suitable parameters for the rebalancing algorithm to find the best solution. Compared with the other versions of the SMOTE algorithm, significant improvements, which include higher accuracy and credibility, are observed with ASCB_DmSMOTE. Our proposed method tactfully combines the two rebalancing techniques: it reasonably re-allocates the majority class at a fine-grained level and dynamically optimises the two parameters of SMOTE to synthesise a reasonable number of minority-class samples for each clustered sub-imbalanced dataset. The proposed method ultimately outperforms the conventional methods and attains higher credibility with even greater accuracy of the classification model.
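    ASCB_DmSMOTE builds on the base SMOTE idea of synthesizing minority samples by interpolating between neighbours; that core step (not the paper's swarm-optimised variant) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(minority, n_new, k=3):
    """Minimal SMOTE: synthesize n_new minority samples by interpolating
    between each seed point and one of its k nearest minority neighbours."""
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]       # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
new = smote(minority, 10)
assert new.shape == (10, 2)
# Interpolants stay inside the minority class's bounding box:
assert new.min() >= 0 and new.max() <= 1
```

    The dynamic variant described above additionally tunes the amount of oversampling and the neighbourhood size per cluster rather than using fixed values.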

  2. Parameter-expanded data augmentation for Bayesian analysis of capture-recapture models

    USGS Publications Warehouse

    Royle, J. Andrew; Dorazio, Robert M.

    2012-01-01

    Data augmentation (DA) is a flexible tool for analyzing closed and open population models of capture-recapture data, especially models which include sources of heterogeneity among individuals. The essential concept underlying DA, as we use the term, is based on adding "observations" to create a dataset composed of a known number of individuals. This new (augmented) dataset, which includes the unknown number of individuals N in the population, is then analyzed using a new model that includes a reformulation of the parameter N in the conventional model of the observed (unaugmented) data. In the context of capture-recapture models, we add a set of "all zero" encounter histories which are not, in practice, observable. The model of the augmented dataset is a zero-inflated version of either a binomial or a multinomial base model. Thus, our use of DA provides a general approach for analyzing both closed and open population models of all types. In doing so, this approach provides a unified framework for the analysis of a huge range of models that are treated as unrelated "black boxes" and named procedures in the classical literature. As a practical matter, analysis of the augmented dataset by MCMC is greatly simplified compared to other methods that require specialized algorithms. For example, complex capture-recapture models of an augmented dataset can be fitted with popular MCMC software packages (WinBUGS or JAGS) by providing a concise statement of the model's assumptions that usually involves only a few lines of pseudocode. In this paper, we review the basic technical concepts of data augmentation, and we provide examples of analyses of closed-population models (M0, Mh, distance sampling, and spatial capture-recapture models) and open-population models (Jolly-Seber) with individual effects.
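    The augmentation step itself, padding the observed encounter histories with "all zero" pseudo-individuals up to a fixed size M, is simple to sketch (the zero-inflated model fitting in WinBUGS/JAGS is not shown):

```python
import numpy as np

def augment(encounter_histories, M):
    """Pad observed capture histories with all-zero rows up to a
    fixed superpopulation size M, as in parameter-expanded DA."""
    obs = np.asarray(encounter_histories)
    n, T = obs.shape
    assert M >= n
    zeros = np.zeros((M - n, T), dtype=obs.dtype)
    return np.vstack([obs, zeros])

# 3 observed individuals over 4 capture occasions, augmented to M = 6:
obs = np.array([[1, 0, 1, 0],
                [0, 1, 0, 0],
                [1, 1, 1, 1]])
aug = augment(obs, 6)
assert aug.shape == (6, 4)
assert aug[3:].sum() == 0   # appended rows are all-zero pseudo-individuals
```

    In the zero-inflated model each row then carries a latent inclusion indicator, and N is estimated as the number of rows with indicator equal to one.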

  3. Demonstrating the robustness of population surveillance data: implications of error rates on demographic and mortality estimates.

    PubMed

    Fottrell, Edward; Byass, Peter; Berhane, Yemane

    2008-03-25

    As in any measurement process, a certain amount of error may be expected in routine population surveillance operations such as those in demographic surveillance sites (DSSs). Vital events are likely to be missed and errors made no matter what method of data capture is used or what quality control procedures are in place. The extent to which random errors in large, longitudinal datasets affect overall health and demographic profiles has important implications for the role of DSSs as platforms for public health research and clinical trials. Such knowledge is also of particular importance if the outputs of DSSs are to be extrapolated and aggregated with realistic margins of error and validity. This study uses the first 10-year dataset from the Butajira Rural Health Project (BRHP) DSS, Ethiopia, covering approximately 336,000 person-years of data. Simple programmes were written to introduce random errors and omissions into new versions of the definitive 10-year Butajira dataset. Key parameters of sex, age, death, literacy and roof material (an indicator of poverty) were selected for the introduction of errors based on their obvious importance in demographic and health surveillance and their established significant associations with mortality. Defining the original 10-year dataset as the 'gold standard' for the purposes of this investigation, population, age and sex compositions and Poisson regression models of mortality rate ratios were compared between each of the intentionally erroneous datasets and the original 'gold standard' 10-year data. The composition of the Butajira population was well represented despite introducing random errors, and differences between population pyramids based on the derived datasets were subtle. Regression analyses of well-established mortality risk factors were largely unaffected even by relatively high levels of random errors in the data. 
The low sensitivity of parameter estimates and regression analyses to significant amounts of randomly introduced errors indicates a high level of robustness of the dataset. This apparent inertia of population parameter estimates to simulated errors is largely due to the size of the dataset. Tolerable margins of random error in DSS data may exceed 20%. While this is not an argument in favour of poor quality data, reducing the time and valuable resources spent on detecting and correcting random errors in routine DSS operations may be justifiable as the returns from such procedures diminish with increasing overall accuracy. The money and effort currently spent on endlessly correcting DSS datasets would perhaps be better spent on increasing the surveillance population size and geographic spread of DSSs and analysing and disseminating research findings.
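    The error-introduction procedure can be mimicked with a few lines of Python (a toy version with hypothetical records and fields, not the BRHP programmes):

```python
import random

def corrupt(records, field, error_rate, value_pool, seed=1):
    """Return a copy of the dataset with a given fraction of one field's
    values replaced by random draws, mimicking transcription errors."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]
    for r in out:
        if rng.random() < error_rate:
            r[field] = rng.choice(value_pool)
    return out

records = [{"sex": "m", "age": a} for a in range(1000)]
noisy = corrupt(records, "sex", 0.2, ["m", "f"])
changed = sum(r["sex"] != "m" for r in noisy)
# Roughly 10% flip: 20% of records are touched, and about half of the
# redraws land on the original value by chance.
assert 0 < changed < 200
```

    Comparing summary statistics and regression fits between the original and corrupted copies is then exactly the robustness check the study performs at scale.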

  4. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks

    PubMed Central

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H

    2015-01-01

    Objective The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug–drug interactions, and learning used-to-treat relationships between drugs and indications. Materials We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. Results There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. Conclusions For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice. PMID:25336595

  5. Computer systems and methods for visualizing data

    DOEpatents

    Stolte, Chris; Hanrahan, Patrick

    2010-07-13

    A method for forming a visual plot using a hierarchical structure of a dataset. The dataset comprises a measure and a dimension. The dimension consists of a plurality of levels. The plurality of levels form a dimension hierarchy. The visual plot is constructed based on a specification. A first level from the plurality of levels is represented by a first component of the visual plot. A second level from the plurality of levels is represented by a second component of the visual plot. The dataset is queried to retrieve data in accordance with the specification. The data includes all or a portion of the dimension and all or a portion of the measure. The visual plot is populated with the retrieved data in accordance with the specification.

  6. Computer systems and methods for visualizing data

    DOEpatents

    Stolte, Chris; Hanrahan, Patrick

    2013-01-29

    A method for forming a visual plot using a hierarchical structure of a dataset. The dataset comprises a measure and a dimension. The dimension consists of a plurality of levels. The plurality of levels form a dimension hierarchy. The visual plot is constructed based on a specification. A first level from the plurality of levels is represented by a first component of the visual plot. A second level from the plurality of levels is represented by a second component of the visual plot. The dataset is queried to retrieve data in accordance with the specification. The data includes all or a portion of the dimension and all or a portion of the measure. The visual plot is populated with the retrieved data in accordance with the specification.

  7. Classification of autistic individuals and controls using cross-task characterization of fMRI activity

    PubMed Central

    Chanel, Guillaume; Pichon, Swann; Conty, Laurence; Berthoz, Sylvie; Chevallier, Coralie; Grèzes, Julie

    2015-01-01

    Multivariate pattern analysis (MVPA) has been applied successfully to task-based and resting-based fMRI recordings to investigate which neural markers distinguish individuals with autistic spectrum disorders (ASD) from controls. While most studies have focused on brain connectivity during resting state episodes and regions of interest approaches (ROI), a wealth of task-based fMRI datasets have been acquired in these populations in the last decade. This calls for techniques that can leverage information not only from a single dataset, but from several existing datasets that might share some common features and biomarkers. We propose a fully data-driven (voxel-based) approach that we apply to two different fMRI experiments with social stimuli (faces and bodies). The method, based on Support Vector Machines (SVMs) and Recursive Feature Elimination (RFE), is first trained for each experiment independently and each output is then combined to obtain a final classification output. Second, this RFE output is used to determine which voxels are most often selected for classification to generate maps of significant discriminative activity. Finally, to further explore the clinical validity of the approach, we correlate phenotypic information with obtained classifier scores. The results reveal good classification accuracy (range between 69% and 92.3%). Moreover, we were able to identify discriminative activity patterns pertaining to the social brain without relying on a priori ROI definitions. Finally, social motivation was the only dimension which correlated with classifier scores, suggesting that it is the main dimension captured by the classifiers. Altogether, we believe that the present RFE method proves to be efficient and may help identify relevant biomarkers by taking advantage of acquired task-based fMRI datasets in psychiatric populations. PMID:26793434

  8. Status update: is smoke on your mind? Using social media to assess smoke exposure

    NASA Astrophysics Data System (ADS)

    Ford, Bonne; Burke, Moira; Lassman, William; Pfister, Gabriele; Pierce, Jeffrey R.

    2017-06-01

    Exposure to wildland fire smoke is associated with negative effects on human health. However, these effects are poorly quantified. Accurately attributing health endpoints to wildland fire smoke requires determining the locations, concentrations, and durations of smoke events. Most current methods for assessing these smoke events (ground-based measurements, satellite observations, and chemical transport modeling) are limited temporally, spatially, and/or by their level of accuracy. In this work, we explore using daily social media posts from Facebook regarding smoke, haze, and air quality to assess population-level exposure for the summer of 2015 in the western US. We compare this de-identified, aggregated Facebook dataset to several other datasets that are commonly used for estimating exposure, such as satellite observations (MODIS aerosol optical depth and Hazard Mapping System smoke plumes), daily (24 h) average surface particulate matter measurements, and model-simulated (WRF-Chem) surface concentrations. After adding population-weighted spatial smoothing to the Facebook data, this dataset is well correlated (R2 generally above 0.5) with the other methods in smoke-impacted regions. The Facebook dataset is better correlated with surface measurements of PM2.5 at a majority of monitoring sites (163 of 293 sites) than the satellite observations and our model simulation. We also present an example case for Washington state in 2015, for which we combine this Facebook dataset with MODIS observations and WRF-Chem-simulated PM2.5 in a regression model. We show that the addition of the Facebook data improves the regression model's ability to predict surface concentrations. This high correlation of the Facebook data with surface monitors and our Washington state example suggests that this social-media-based proxy can be used to estimate smoke exposure in locations without direct ground-based particulate matter measurements.
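    The regression combining the three predictor sources can be sketched with ordinary least squares on simulated data (all values below are synthetic, not the study's):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical daily predictors for one site: model-simulated PM2.5,
# satellite AOD, and a social-media smoke-report index; the target is
# the monitored surface PM2.5 concentration.
n = 200
model_pm, aod, posts = rng.random(n), rng.random(n), rng.random(n)
obs = 5 + 2 * model_pm + 3 * aod + 4 * posts + rng.normal(0, 0.1, n)

# Fit observed PM2.5 on the three predictors by least squares:
X = np.column_stack([np.ones(n), model_pm, aod, posts])
beta, *_ = np.linalg.lstsq(X, obs, rcond=None)
pred = X @ beta
r2 = 1 - ((obs - pred) ** 2).sum() / ((obs - obs.mean()) ** 2).sum()
assert r2 > 0.9
```

    In this framing, the value of the social-media term is measured by how much the fit degrades when that column is dropped from X.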

  9. Data preparation techniques for a perinatal psychiatric study based on linked data.

    PubMed

    Xu, Fenglian; Hilder, Lisa; Austin, Marie-Paule; Sullivan, Elizabeth A

    2012-06-08

    In recent years there has been an increase in the use of population-based linked data. However, there is little literature that describes the method of linked data preparation. This paper describes the method for merging data, calculating the statistical variable (SV), recoding psychiatric diagnoses and summarizing hospital admissions for a perinatal psychiatric study. The data preparation techniques described in this paper are based on linked birth data from the New South Wales (NSW) Midwives Data Collection (MDC), the Register of Congenital Conditions (RCC), the Admitted Patient Data Collection (APDC) and the Pharmaceutical Drugs of Addiction System (PHDAS). The master dataset is the meaningfully linked dataset that includes all, or at least the major, study data collections. The master dataset can be used to improve the data quality, calculate the SV and can be tailored for different analyses. To identify hospital admissions in the periods before pregnancy, during pregnancy and after birth, a statistical variable of time interval (SVTI) needs to be calculated. The methods and SPSS syntax for building a master dataset, calculating the SVTI, recoding the principal diagnoses of mental illness and summarizing hospital admissions are described. Linked data preparation, including building the master dataset and calculating the SV, can improve data quality and enhance data function.
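    The SVTI-based classification of admissions relative to a pregnancy can be sketched as follows (illustrative dates and function names; the paper's implementation is SPSS syntax):

```python
from datetime import date

def admission_period(admit, conception, birth):
    """Classify a hospital admission relative to a pregnancy using the
    time interval between the admission date and the key dates."""
    if admit < conception:
        return "before pregnancy"
    if admit <= birth:
        return "during pregnancy"
    return "after birth"

conception, birth = date(2010, 1, 10), date(2010, 10, 17)
assert admission_period(date(2009, 6, 1), conception, birth) == "before pregnancy"
assert admission_period(date(2010, 5, 2), conception, birth) == "during pregnancy"
assert admission_period(date(2011, 1, 3), conception, birth) == "after birth"
```

    Once each linked admission record carries this label, summarizing admissions per period reduces to a grouped count.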

  10. Inference for multivariate regression model based on multiply imputed synthetic data generated via posterior predictive sampling

    NASA Astrophysics Data System (ADS)

    Moura, Ricardo; Sinha, Bimal; Coelho, Carlos A.

    2017-06-01

    The recent popularity of synthetic data as a Statistical Disclosure Control technique has enabled the development of several methods for generating and analyzing such data, but these almost always rely on asymptotic distributions and are in consequence not adequate for small-sample datasets. Thus, a likelihood-based exact inference procedure is derived for the matrix of regression coefficients of the multivariate regression model, for multiply imputed synthetic data generated via Posterior Predictive Sampling. Since it is based on exact distributions, this procedure may be used even in small-sample datasets. Simulation studies compare the results obtained from the proposed exact inferential procedure with those obtained from an adaptation of Reiter's combination rule to multiply imputed synthetic datasets, and an application to the 2000 Current Population Survey is discussed.
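    Posterior predictive sampling for a simple normal model (a univariate toy version of the multivariate regression setting, under the standard noninformative prior) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(50, 5, 100)          # hypothetical confidential variable

# Posterior Predictive Sampling: draw (mu, sigma^2) from the posterior,
# then draw a synthetic dataset from the model at those parameter values.
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)
m = 5                               # number of synthetic copies
synthetic = []
for _ in range(m):
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)   # scaled inverse-chi2 draw
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    synthetic.append(rng.normal(mu, np.sqrt(sigma2), n))

# Combining rule (Reiter-style): average the m point estimates.
est = np.mean([s.mean() for s in synthetic])
assert abs(est - 50) < 2
```

    The exact procedure derived in the paper replaces the asymptotic variance formulas attached to such combined estimates with finite-sample distributions.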

  11. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks.

    PubMed

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H

    2015-01-01

    The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications. We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.

  12. High-resolution gridded population datasets for Latin America and the Caribbean in 2010, 2015, and 2020

    PubMed Central

    Sorichetta, Alessandro; Hornby, Graeme M.; Stevens, Forrest R.; Gaughan, Andrea E.; Linard, Catherine; Tatem, Andrew J.

    2015-01-01

    The Latin America and the Caribbean region is one of the most urbanized regions in the world, with a total population of around 630 million that is expected to increase by 25% by 2050. In this context, detailed and contemporary datasets accurately describing the distribution of residential population in the region are required for measuring the impacts of population growth, monitoring changes, supporting environmental and health applications, and planning interventions. To support these needs, an open access archive of high-resolution gridded population datasets was created through disaggregation of the most recent official population count data available for 28 countries located in the region. These datasets are described here along with the approach and methods used to create and validate them. For each country, population distribution datasets, having a resolution of 3 arc seconds (approximately 100 m at the equator), were produced for the population count year, as well as for 2010, 2015, and 2020. All these products are available both through the WorldPop Project website and the WorldPop Dataverse Repository. PMID:26347245

  13. High-resolution gridded population datasets for Latin America and the Caribbean in 2010, 2015, and 2020.

    PubMed

    Sorichetta, Alessandro; Hornby, Graeme M; Stevens, Forrest R; Gaughan, Andrea E; Linard, Catherine; Tatem, Andrew J

    2015-01-01

    The Latin America and the Caribbean region is one of the most urbanized regions in the world, with a total population of around 630 million that is expected to increase by 25% by 2050. In this context, detailed and contemporary datasets accurately describing the distribution of residential population in the region are required for measuring the impacts of population growth, monitoring changes, supporting environmental and health applications, and planning interventions. To support these needs, an open access archive of high-resolution gridded population datasets was created through disaggregation of the most recent official population count data available for 28 countries located in the region. These datasets are described here along with the approach and methods used to create and validate them. For each country, population distribution datasets, having a resolution of 3 arc seconds (approximately 100 m at the equator), were produced for the population count year, as well as for 2010, 2015, and 2020. All these products are available both through the WorldPop Project website and the WorldPop Dataverse Repository.

  14. An opinion formation based binary optimization approach for feature selection

    NASA Astrophysics Data System (ADS)

    Hamedmoghadam, Homayoun; Jalili, Mahdi; Yu, Xinghuo

    2018-02-01

    This paper proposes a novel optimization method based on opinion formation in complex network systems. The proposed technique mimics human-human interaction mechanisms based on a mathematical model derived from the social sciences. Our method encodes a subset of selected features as the opinion of an artificial agent and simulates the opinion formation process among a population of agents to solve the feature selection problem. The agents interact over an underlying interaction network structure and reach consensus in their opinions while finding better solutions to the problem. A number of mechanisms are employed to avoid getting trapped in local minima. We compare the performance of the proposed method with a number of classical population-based optimization methods and a state-of-the-art opinion formation based method. Our experiments on a number of high-dimensional datasets show that the proposed algorithm outperforms the others.

  15. flowVS: channel-specific variance stabilization in flow cytometry.

    PubMed

    Azad, Ariful; Rajwa, Bartek; Pothen, Alex

    2016-07-28

    Comparing phenotypes of heterogeneous cell populations from multiple biological conditions is at the heart of scientific discovery based on flow cytometry (FC). When the biological signal is measured by the average expression of a biomarker, standard statistical methods require that variance be approximately stabilized in populations to be compared. Since the mean and variance of a cell population are often correlated in fluorescence-based FC measurements, a preprocessing step is needed to stabilize the within-population variances. We present a variance-stabilization algorithm, called flowVS, that removes the mean-variance correlations from cell populations identified in each fluorescence channel. flowVS transforms each channel from all samples of a data set by the inverse hyperbolic sine (asinh) transformation. For each channel, the parameters of the transformation are optimally selected by Bartlett's likelihood-ratio test so that the populations attain homogeneous variances. The optimum parameters are then used to transform the corresponding channels in every sample. flowVS is therefore an explicit variance-stabilization method that stabilizes within-population variances in each channel by evaluating the homoskedasticity of clusters with a likelihood-ratio test. With two publicly available datasets, we show that flowVS removes the mean-variance dependence from raw FC data and makes the within-population variance relatively homogeneous. We demonstrate that alternative transformation techniques such as flowTrans, flowScape, logicle, and FCSTrans might not stabilize variance. Besides flow cytometry, flowVS can also be applied to stabilize variance in microarray data. With a publicly available data set we demonstrate that flowVS performs as well as the VSN software, a state-of-the-art approach developed for microarrays. 
The homogeneity of variance in cell populations across FC samples is desirable when extracting features uniformly and comparing cell populations with different levels of marker expressions. The newly developed flowVS algorithm solves the variance-stabilization problem in FC and microarrays by optimally transforming data with the help of Bartlett's likelihood-ratio test. On two publicly available FC datasets, flowVS stabilizes within-population variances more evenly than the available transformation and normalization techniques. flowVS-based variance stabilization can help in performing comparison and alignment of phenotypically identical cell populations across different samples. flowVS and the datasets used in this paper are publicly available in Bioconductor.
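    The core flowVS idea, choosing an asinh cofactor so that Bartlett's statistic indicates homogeneous within-population variances, can be sketched on synthetic data (an illustration of the principle, not the flowVS implementation):

```python
import numpy as np

def bartlett_stat(groups):
    """Bartlett's test statistic; small values mean the group
    variances are close to homogeneous."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    N = n.sum()
    sp2 = ((n - 1) * v).sum() / (N - k)             # pooled variance
    num = (N - k) * np.log(sp2) - ((n - 1) * np.log(v)).sum()
    den = 1 + ((1.0 / (n - 1)).sum() - 1.0 / (N - k)) / (3 * (k - 1))
    return num / den

rng = np.random.default_rng(0)
# Two "populations" whose variance grows with the mean, as in raw FC data:
raw = [rng.normal(m, 0.1 * m, 500) for m in (100.0, 1000.0)]

# Pick the asinh cofactor that best equalizes within-population variances:
cofactors = [1, 10, 100, 1000]
best = min(cofactors,
           key=lambda c: abs(bartlett_stat([np.arcsinh(g / c) for g in raw])))
transformed = [np.arcsinh(g / best) for g in raw]
assert bartlett_stat(transformed) < bartlett_stat(raw)
```

    flowVS performs this per fluorescence channel over populations pooled from all samples, then applies the optimal cofactor to that channel in every sample.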

  16. Novel probabilistic models of spatial genetic ancestry with applications to stratification correction in genome-wide association studies.

    PubMed

    Bhaskar, Anand; Javanmard, Adel; Courtade, Thomas A; Tse, David

    2017-03-15

    Genetic variation in human populations is influenced by geographic ancestry due to spatial locality in historical mating and migration patterns. Spatial population structure in genetic datasets has traditionally been analyzed using either model-free algorithms, such as principal components analysis (PCA) and multidimensional scaling, or explicit spatial probabilistic models of allele frequency evolution. We develop a general probabilistic model and an associated inference algorithm that unify the model-based and data-driven approaches to visualizing and inferring population structure. Our spatial inference algorithm can also be effectively applied to the problem of population stratification in genome-wide association studies (GWAS), where hidden population structure can create spurious associations when population ancestry is correlated with both the genotype and the trait. Our algorithm, Geographic Ancestry Positioning (GAP), relates local genetic distances between samples to their spatial distances, and can be used for visually discerning population structure as well as accurately inferring the spatial origin of individuals on a two-dimensional continuum. On both simulated and several real datasets from diverse human populations, GAP exhibits substantially lower error in reconstructing spatial ancestry coordinates compared to PCA. We also develop an association test that uses the ancestry coordinates inferred by GAP to accurately account for ancestry-induced correlations in GWAS. Based on simulations and analysis of a dataset of 10 metabolic traits measured in a Northern Finland cohort, which is known to exhibit significant population structure, we find that our method has superior power to current approaches. Our software is available at https://github.com/anand-bhaskar/gap. Supplementary data are available at Bioinformatics online.

  17. Data Analysis and Statistical Methods for the Assessment and Interpretation of Geochronologic Data

    NASA Astrophysics Data System (ADS)

    Reno, B. L.; Brown, M.; Piccoli, P. M.

    2007-12-01

    Ages are traditionally reported as a weighted mean with an uncertainty based on least squares analysis of analytical error on individual dates. This method does not take into account geological uncertainties and cannot accommodate asymmetries in the data. In most instances, this method will understate the uncertainty on a given age, which may lead to over-interpretation of age data. Geologic uncertainty is difficult to quantify, but is typically greater than analytical uncertainty. These factors make traditional statistical approaches inadequate to fully evaluate geochronologic data. We propose a protocol to assess populations within multi-event datasets and to calculate age and uncertainty from each population of dates interpreted to represent a single geologic event using robust and resistant statistical methods. To assess whether populations thought to represent different events are statistically separate, exploratory data analysis is undertaken using a box plot, where the range of the data is represented by a 'box' of length given by the interquartile range, divided at the median of the data, with 'whiskers' that extend to the furthest datapoint that lies within 1.5 times the interquartile range beyond the box. If the boxes representing the populations do not overlap, they are interpreted to represent statistically different sets of dates. Ages are calculated from statistically distinct populations using a robust tool such as the tanh method of Kelsey et al. (2003, CMP, 146, 326-340), which is insensitive to any assumptions about the underlying probability distribution from which the data are drawn. Therefore, this method takes into account the full range of data and is not drastically affected by outliers. The interquartile range of each population of dates gives a first pass at expressing uncertainty, which accommodates asymmetry in the dataset; outliers have a minor effect on the uncertainty. 
To better quantify the uncertainty, a resistant tool that is insensitive to local misbehavior of data is preferred, such as the normalized median absolute deviation proposed by Powell et al. (2002, Chem Geol, 185, 191-204). We illustrate the method using a dataset of 152 monazite dates determined using EPMA chemical data from a single sample from the Neoproterozoic Brasília Belt, Brazil. Results are compared with ages and uncertainties calculated using traditional methods to demonstrate the differences. The dataset was manually culled into three populations representing discrete compositional domains within chemically-zoned monazite grains. The weighted mean ages and least squares uncertainties for these populations are 633±6 (2σ) Ma for a core domain, 614±5 (2σ) Ma for an intermediate domain and 595±6 (2σ) Ma for a rim domain. Probability distribution plots indicate asymmetric distributions for all populations, which cannot be accounted for with traditional statistical tools. These three domains record distinct ages outside the interquartile range for each population of dates, with the core domain lying in the subrange 642-624 Ma, the intermediate domain 617-609 Ma and the rim domain 606-589 Ma. The tanh estimator yields ages of 631±7 (2σ) Ma for the core domain, 616±7 (2σ) Ma for the intermediate domain and 601±8 (2σ) Ma for the rim domain. Although the uncertainties derived using a resistant statistical tool are larger than those derived from traditional statistical tools, they are more realistic, better address the spread in the dataset and account for asymmetry in the data.
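Two of the building blocks above, the box-plot overlap check and the normalized median absolute deviation, can be sketched as follows. This is an illustrative sketch only (the tanh age estimator of Kelsey et al. is not reproduced, and the synthetic date populations are assumptions, not the monazite data):

```python
import numpy as np

def box(dates):
    """Quartiles of a population of dates (the box of a box plot)."""
    q1, med, q3 = np.percentile(dates, [25, 50, 75])
    return q1, med, q3

def boxes_overlap(a, b):
    """Two populations are treated as statistically distinct events
    when their interquartile boxes do not overlap."""
    a1, _, a3 = box(a)
    b1, _, b3 = box(b)
    return not (a3 < b1 or b3 < a1)

def nmad(dates):
    """Normalized median absolute deviation: a resistant spread
    estimate (1.4826 scales the MAD to sigma for normal data)."""
    med = np.median(dates)
    return 1.4826 * np.median(np.abs(dates - med))
```

For two date populations centered near 633 Ma and 595 Ma with a few Ma of scatter, `boxes_overlap` would report them as distinct, and `nmad` would recover a spread estimate that is insensitive to a handful of outlying dates.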

  18. AMP: Assembly Matching Pursuit.

    PubMed

    Biswas, S; Jojic, V

    2013-01-01

    Metagenomics, the study of the total genetic material isolated from a biological host, promises to reveal host-microbe or microbe-microbe interactions that may help to personalize medicine or improve agronomic practice. We introduce a method that discovers metagenomic units (MGUs) relevant for phenotype prediction through sequence-based dictionary learning. The method aggregates patient-specific dictionaries and estimates MGU abundances in order to summarize a whole population and yield universally predictive biomarkers. We analyze the impact of Gaussian, Poisson, and Negative Binomial read count models in guiding dictionary construction by examining classification efficiency on a number of synthetic datasets and a real dataset from Ref. 1. Each outperforms standard methods of dictionary composition, such as random projection and orthogonal matching pursuit. Additionally, the predictive MGUs they recover are biologically relevant.

  19. Analysis of energy-based algorithms for RNA secondary structure prediction

    PubMed Central

    2012-01-01

    Background RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since RNA function depends in large part on its folded structures, much effort has been invested in developing accurate methods for prediction of RNA secondary structure from the base sequence. Minimum free energy (MFE) predictions are widely used, based on nearest neighbor thermodynamic parameters of Mathews, Turner et al. or those of Andronescu et al. Some recently proposed alternatives that leverage partition function calculations find the structure with maximum expected accuracy (MEA) or pseudo-expected accuracy (pseudo-MEA) methods. Advances in prediction methods are typically benchmarked using sensitivity, positive predictive value and their harmonic mean, namely F-measure, on datasets of known reference structures. Since such benchmarks document progress in improving accuracy of computational prediction methods, it is important to understand how measures of accuracy vary as a function of the reference datasets and whether advances in algorithms or thermodynamic parameters yield statistically significant improvements. Our work advances such understanding for the MFE and (pseudo-)MEA-based methods, with respect to the latest datasets and energy parameters. Results We present three main findings. First, using the bootstrap percentile method, we show that the average F-measure accuracy of the MFE and (pseudo-)MEA-based algorithms, as measured on our largest datasets with over 2000 RNAs from diverse families, is a reliable estimate (within a 2% range with high confidence) of the accuracy of a population of RNA molecules represented by this set. However, average accuracy on smaller classes of RNAs such as a class of 89 Group I introns used previously in benchmarking algorithm accuracy is not reliable enough to draw meaningful conclusions about the relative merits of the MFE and MEA-based algorithms. 
Second, on our large datasets, the algorithm with best overall accuracy is a pseudo MEA-based algorithm of Hamada et al. that uses a generalized centroid estimator of base pairs. However, between MFE and other MEA-based methods, there is no clear winner in the sense that the relative accuracy of the MFE versus MEA-based algorithms changes depending on the underlying energy parameters. Third, of the four parameter sets we considered, the best accuracy for the MFE-, MEA-based, and pseudo-MEA-based methods is 0.686, 0.680, and 0.711, respectively (on a scale from 0 to 1 with 1 meaning perfect structure predictions) and is obtained with a thermodynamic parameter set obtained by Andronescu et al. called BL* (named after the Boltzmann likelihood method by which the parameters were derived). Conclusions Large datasets should be used to obtain reliable measures of the accuracy of RNA structure prediction algorithms, and average accuracies on specific classes (such as Group I introns and Transfer RNAs) should be interpreted with caution, considering the relatively small size of currently available datasets for such classes. The accuracy of the MEA-based methods is significantly higher when using the BL* parameter set of Andronescu et al. than when using the parameters of Mathews and Turner, and there is no significant difference between the accuracy of MEA-based methods and MFE when using the BL* parameters. The pseudo-MEA-based method of Hamada et al. with the BL* parameter set significantly outperforms all other MFE and MEA-based algorithms on our large data sets. PMID:22296803
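The bootstrap percentile method used in the first finding can be sketched as below: resample the benchmark RNAs with replacement, recompute the mean F-measure for each resample, and read the confidence interval off the quantiles of the resampled means. The beta-distributed toy scores are an assumption for illustration, not the paper's data.

```python
import numpy as np

def bootstrap_percentile_ci(f_measures, n_boot=10000, alpha=0.05, seed=0):
    """Percentile-method bootstrap CI for the mean F-measure of a
    benchmark set of RNAs."""
    rng = np.random.default_rng(seed)
    n = len(f_measures)
    means = np.array([
        np.mean(rng.choice(f_measures, size=n, replace=True))
        for _ in range(n_boot)
    ])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])
```

Run on a set of ~2000 scores, the interval is narrow; run on a class of only 89 scores, it is several times wider, which is the paper's point about small RNA classes.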

  20. Analysis of energy-based algorithms for RNA secondary structure prediction.

    PubMed

    Hajiaghayi, Monir; Condon, Anne; Hoos, Holger H

    2012-02-01

    RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since RNA function depends in large part on its folded structures, much effort has been invested in developing accurate methods for prediction of RNA secondary structure from the base sequence. Minimum free energy (MFE) predictions are widely used, based on nearest neighbor thermodynamic parameters of Mathews, Turner et al. or those of Andronescu et al. Some recently proposed alternatives that leverage partition function calculations find the structure with maximum expected accuracy (MEA) or pseudo-expected accuracy (pseudo-MEA) methods. Advances in prediction methods are typically benchmarked using sensitivity, positive predictive value and their harmonic mean, namely F-measure, on datasets of known reference structures. Since such benchmarks document progress in improving accuracy of computational prediction methods, it is important to understand how measures of accuracy vary as a function of the reference datasets and whether advances in algorithms or thermodynamic parameters yield statistically significant improvements. Our work advances such understanding for the MFE and (pseudo-)MEA-based methods, with respect to the latest datasets and energy parameters. We present three main findings. First, using the bootstrap percentile method, we show that the average F-measure accuracy of the MFE and (pseudo-)MEA-based algorithms, as measured on our largest datasets with over 2000 RNAs from diverse families, is a reliable estimate (within a 2% range with high confidence) of the accuracy of a population of RNA molecules represented by this set. However, average accuracy on smaller classes of RNAs such as a class of 89 Group I introns used previously in benchmarking algorithm accuracy is not reliable enough to draw meaningful conclusions about the relative merits of the MFE and MEA-based algorithms. 
Second, on our large datasets, the algorithm with best overall accuracy is a pseudo MEA-based algorithm of Hamada et al. that uses a generalized centroid estimator of base pairs. However, between MFE and other MEA-based methods, there is no clear winner in the sense that the relative accuracy of the MFE versus MEA-based algorithms changes depending on the underlying energy parameters. Third, of the four parameter sets we considered, the best accuracy for the MFE-, MEA-based, and pseudo-MEA-based methods is 0.686, 0.680, and 0.711, respectively (on a scale from 0 to 1 with 1 meaning perfect structure predictions) and is obtained with a thermodynamic parameter set obtained by Andronescu et al. called BL* (named after the Boltzmann likelihood method by which the parameters were derived). Large datasets should be used to obtain reliable measures of the accuracy of RNA structure prediction algorithms, and average accuracies on specific classes (such as Group I introns and Transfer RNAs) should be interpreted with caution, considering the relatively small size of currently available datasets for such classes. The accuracy of the MEA-based methods is significantly higher when using the BL* parameter set of Andronescu et al. than when using the parameters of Mathews and Turner, and there is no significant difference between the accuracy of MEA-based methods and MFE when using the BL* parameters. The pseudo-MEA-based method of Hamada et al. with the BL* parameter set significantly outperforms all other MFE and MEA-based algorithms on our large data sets.

  1. Climate-Related Hazards: A Method for Global Assessment of Urban and Rural Population Exposure to Cyclones, Droughts, and Floods

    PubMed Central

    Christenson, Elizabeth; Elliott, Mark; Banerjee, Ovik; Hamrick, Laura; Bartram, Jamie

    2014-01-01

    Global climate change (GCC) has led to increased focus on the occurrence of, and preparation for, climate-related extremes and hazards. Population exposure, the relative likelihood that a person in a given location was exposed to a given hazard event(s) in a given period of time, was the outcome for this analysis. Our objectives were to develop a method for estimating the population exposure at the country level to the climate-related hazards cyclone, drought, and flood; develop a method that readily allows the addition of better datasets to an automated model; differentiate population exposure of urban and rural populations; and calculate and present the results of exposure scores and ranking of countries based on the country-wide, urban, and rural population exposures to cyclone, drought, and flood. Gridded global datasets on cyclone, drought and flood occurrence as well as population density were combined and analysis was carried out using ArcGIS. Results presented include global maps of ranked country-level population exposure to cyclone, drought, flood and multiple hazards. Analyses by geography and human development index (HDI) are also included. The results and analyses of this exposure assessment have implications for country-level adaptation. It can also be used to help prioritize aid decisions and allocation of adaptation resources between countries and within a country. This model is designed to allow flexibility in applying cyclone, drought and flood exposure to a range of outcomes and adaptation measures. PMID:24566046
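The core of such an exposure score can be sketched as a population-weighted average of gridded hazard frequency. This is a simplified sketch, not the paper's ArcGIS workflow; the function names and toy grids are assumptions.

```python
import numpy as np

def population_exposure(hazard_freq, pop_density, cell_area=1.0):
    """Population-weighted exposure to one hazard: the event frequency
    of each grid cell weighted by the population living there,
    normalized by total population."""
    pop = np.asarray(pop_density, dtype=float) * cell_area
    hz = np.asarray(hazard_freq, dtype=float)
    return float((pop * hz).sum() / pop.sum())

def rank_countries(scores):
    """Rank countries from most to least exposed."""
    return sorted(scores, key=scores.get, reverse=True)
```

Computing the score separately over a country's urban and rural population-density grids gives the urban/rural differentiation described above, and `rank_countries` gives the country ranking.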

  2. Federated Tensor Factorization for Computational Phenotyping

    PubMed Central

    Kim, Yejin; Sun, Jimeng; Yu, Hwanjo; Jiang, Xiaoqian

    2017-01-01

    Tensor factorization models offer an effective approach to convert massive electronic health records into meaningful clinical concepts (phenotypes) for data analysis. These models need a large amount of diverse samples to avoid population bias. An open challenge is how to derive phenotypes jointly across multiple hospitals, in which direct patient-level data sharing is not possible (e.g., due to institutional policies). In this paper, we developed a novel solution to enable federated tensor factorization for computational phenotyping without sharing patient-level data. We developed secure data harmonization and federated computation procedures based on alternating direction method of multipliers (ADMM). Using this method, the multiple hospitals iteratively update tensors and transfer secure summarized information to a central server, and the server aggregates the information to generate phenotypes. We demonstrated with real medical datasets that our method resembles the centralized training model (based on combined datasets) in terms of accuracy and phenotypes discovery while respecting privacy. PMID:29071165
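The federated update pattern described above follows the ADMM consensus template: each site updates a local variable against a shared global one, only summaries reach the server, and the server aggregates. The sketch below is that generic skeleton only, with a quadratic toy objective per site; it is not the paper's secure harmonization protocol or tensor model, and all names are assumptions.

```python
import numpy as np

def admm_consensus(local_solves, dim, n_iter=50):
    """ADMM consensus skeleton: site i updates its local variable x_i
    against the shared z, the server averages the summaries x_i + u_i,
    and the duals u_i track each site's disagreement with z.
    `local_solves[i](v)` returns argmin_x f_i(x) + (rho/2)||x - v||^2
    for that site's private objective f_i (rho folded into the solve)."""
    k = len(local_solves)
    u = [np.zeros(dim) for _ in range(k)]
    z = np.zeros(dim)
    for _ in range(n_iter):
        x = [solve(z - u[i]) for i, solve in enumerate(local_solves)]
        z = np.mean([x[i] + u[i] for i in range(k)], axis=0)  # server step
        u = [u[i] + x[i] - z for i in range(k)]
    return z
```

With f_i(x) = ||x - a_i||^2 and rho = 1, each local solve is (2*a_i + v) / 3, and z converges to the mean of the private a_i without any site revealing its raw data.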

  3. Efficient, graph-based white matter connectivity from orientation distribution functions via multi-directional graph propagation

    NASA Astrophysics Data System (ADS)

    Boucharin, Alexis; Oguz, Ipek; Vachet, Clement; Shi, Yundi; Sanchez, Mar; Styner, Martin

    2011-03-01

    The use of regional connectivity measurements derived from diffusion imaging datasets has become of considerable interest in the neuroimaging community in order to better understand cortical and subcortical white matter connectivity. Current connectivity assessment methods are based on streamline fiber tractography, usually applied in a Monte-Carlo fashion. In this work we present a novel, graph-based method that performs a fully deterministic, efficient and stable connectivity computation. The method handles crossing fibers and deals well with multiple seed regions. The computation is based on a multi-directional graph propagation method applied to sampled orientation distribution functions (ODFs), which can be computed directly from the original diffusion imaging data. We show early results of our method on synthetic and real datasets. The results illustrate the potential of our method towards subject-specific connectivity measurements that are performed in an efficient, stable and reproducible manner. Such individual connectivity measurements would be well suited for application in population studies of neuropathology, such as Autism, Huntington's Disease, Multiple Sclerosis or leukodystrophies. The proposed method is generic and could easily be applied to non-diffusion data as long as local directional data can be derived.

  4. Deciphering the Routes of invasion of Drosophila suzukii by Means of ABC Random Forest.

    PubMed

    Fraimout, Antoine; Debat, Vincent; Fellous, Simon; Hufbauer, Ruth A; Foucaud, Julien; Pudlo, Pierre; Marin, Jean-Michel; Price, Donald K; Cattel, Julien; Chen, Xiao; Deprá, Marindia; François Duyck, Pierre; Guedot, Christelle; Kenis, Marc; Kimura, Masahito T; Loeb, Gregory; Loiseau, Anne; Martinez-Sañudo, Isabel; Pascual, Marta; Polihronakis Richmond, Maxi; Shearer, Peter; Singh, Nadia; Tamura, Koichiro; Xuéreb, Anne; Zhang, Jinping; Estoup, Arnaud

    2017-04-01

    Deciphering invasion routes from molecular data is crucial to understanding biological invasions, including identifying bottlenecks in population size and admixture among distinct populations. Here, we unravel the invasion routes of the invasive pest Drosophila suzukii using a multi-locus microsatellite dataset (25 loci on 23 worldwide sampling locations). To do this, we use approximate Bayesian computation (ABC), which has improved the reconstruction of invasion routes, but can be computationally expensive. We use our study to illustrate the use of a new, more efficient, ABC method, ABC random forest (ABC-RF), and compare it to a standard ABC method (ABC-LDA). We find that Japan emerges as the most probable source of the earliest recorded invasion into Hawaii. Southeast China and Hawaii together are the most probable sources of populations in western North America, which then in turn served as sources for those in eastern North America. European populations are genetically more homogeneous than North American populations, and their most probable source is northeast China, with evidence of limited gene flow from the eastern US as well. All introduced populations passed through bottlenecks, and analyses reveal five distinct admixture events. These findings can inform hypotheses concerning how this species evolved between different and independent source and invasive populations. Methodological comparisons indicate that ABC-RF and ABC-LDA show concordant results if ABC-LDA is based on a large number of simulated datasets, but that ABC-RF outperforms ABC-LDA when using a comparable and more manageable number of simulated datasets, especially when analyzing complex introduction scenarios. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  5. Training set optimization under population structure in genomic selection.

    PubMed

    Isidro, Julio; Jannink, Jean-Luc; Akdemir, Deniz; Poland, Jesse; Heslot, Nicolas; Sorrells, Mark E

    2015-01-01

    Population structure must be evaluated before optimization of the training set population. Maximizing the phenotypic variance captured by the training set is important for optimal performance. The optimization of the training set (TRS) in genomic selection has received much interest in both animal and plant breeding, because it is critical to the accuracy of the prediction models. In this study, five different TRS sampling algorithms, stratified sampling, mean of the coefficient of determination (CDmean), mean of predictor error variance (PEVmean), stratified CDmean (StratCDmean) and random sampling, were evaluated for prediction accuracy in the presence of different levels of population structure. In the presence of population structure, a sampling method that captures the most phenotypic variation in the TRS is desirable. The wheat dataset showed mild population structure, and the CDmean and stratified CDmean methods showed the highest accuracies for all traits except test weight and heading date. The rice dataset had strong population structure and the approach based on stratified sampling showed the highest accuracies for all traits. In general, CDmean minimized the relationship between genotypes in the TRS, maximizing the relationship between the TRS and the test set. This makes it suitable as an optimization criterion for long-term selection. Our results indicated that the best selection criterion used to optimize the TRS seems to depend on the interaction of trait architecture and population structure.

  6. Classifier performance prediction for computer-aided diagnosis using a limited dataset.

    PubMed

    Sahiner, Berkman; Chan, Heang-Ping; Hadjiiski, Lubomir

    2008-04-01

    In a practical classifier design problem, the true population is generally unknown and the available sample is finite-sized. A common approach is to use a resampling technique to estimate the performance of the classifier that will be trained with the available sample. We conducted a Monte Carlo simulation study to compare the ability of the different resampling techniques in training the classifier and predicting its performance under the constraint of a finite-sized sample. The true population for the two classes was assumed to be multivariate normal distributions with known covariance matrices. Finite sets of sample vectors were drawn from the population. The true performance of the classifier is defined as the area under the receiver operating characteristic curve (AUC) when the classifier designed with the specific sample is applied to the true population. We investigated methods based on the Fukunaga-Hayes and the leave-one-out techniques, as well as three different types of bootstrap methods, namely, the ordinary, 0.632, and 0.632+ bootstrap. The Fisher's linear discriminant analysis was used as the classifier. The dimensionality of the feature space was varied from 3 to 15. The sample size n2 from the positive class was varied between 25 and 60, while the number of cases from the negative class was either equal to n2 or 3n2. Each experiment was performed with an independent dataset randomly drawn from the true population. Using a total of 1000 experiments for each simulation condition, we compared the bias, the variance, and the root-mean-squared error (RMSE) of the AUC estimated using the different resampling techniques relative to the true AUC (obtained from training on a finite dataset and testing on the population). 
Our results indicated that, under the study conditions, there can be a large difference in the RMSE obtained using different resampling methods, especially when the feature space dimensionality is relatively large and the sample size is small. Under these conditions, the 0.632 and 0.632+ bootstrap methods have the lowest RMSE, indicating that the difference between the estimated and the true performances obtained using the 0.632 and 0.632+ bootstrap will be statistically smaller than those obtained using the other three resampling methods. Of the three bootstrap methods, the 0.632+ bootstrap provides the lowest bias. Although this investigation is performed under some specific conditions, it reveals important trends for the problem of classifier performance prediction under the constraint of a limited dataset.
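One of the resampling schemes compared above, the 0.632 bootstrap, can be sketched for the AUC of a Fisher linear discriminant: it mixes the optimistic resubstitution AUC with the pessimistic out-of-bag AUC. This is a simplified sketch under assumed toy data, not the study's simulation protocol (the 0.632+ variant, which adapts the weights, is not shown).

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U statistic / (n_pos * n_neg))."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n1, n0 = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def fisher_lda(X, y):
    """Fisher linear discriminant direction w = S_w^-1 (mu1 - mu0)."""
    X0, X1 = X[y == 0], X[y == 1]
    Sw = np.cov(X0.T) + np.cov(X1.T)
    return np.linalg.solve(Sw, X1.mean(0) - X0.mean(0))

def auc_632(X, y, n_boot=100, seed=0):
    """0.632 bootstrap AUC: 0.368 * resubstitution + 0.632 * out-of-bag."""
    rng = np.random.default_rng(seed)
    n = len(y)
    auc_resub = auc(X @ fisher_lda(X, y), y)
    oob_aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                 # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)       # out-of-bag cases
        if len(set(y[idx])) < 2 or len(set(y[oob])) < 2:
            continue  # need both classes on each side
        w_b = fisher_lda(X[idx], y[idx])
        oob_aucs.append(auc(X[oob] @ w_b, y[oob]))
    return 0.368 * auc_resub + 0.632 * np.mean(oob_aucs)
```

On a small, high-dimensional sample the out-of-bag term pulls the estimate below the optimistic resubstitution AUC, which is why this family of estimators had low RMSE in the study's small-sample regime.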

  7. Discovering network behind infectious disease outbreak

    NASA Astrophysics Data System (ADS)

    Maeno, Yoshiharu

    2010-11-01

    Stochasticity and spatial heterogeneity have recently attracted great interest in studies of the spread of infectious disease. The presented method solves an inverse problem to discover the effectively decisive topology of a heterogeneous network and reveal the transmission parameters that govern the stochastic spreads over the network from a dataset on an infectious disease outbreak in its early growth phase. Populations in a combination of epidemiological compartment models and a meta-population network model are described by stochastic differential equations. Probability density functions are derived from the equations and used for maximum likelihood estimation of the topology and parameters. The method is tested with computationally synthesized datasets and the WHO dataset on the SARS outbreak.

  8. Medicalising normality? Using a simulated dataset to assess the performance of different diagnostic criteria of HIV-associated cognitive impairment

    PubMed Central

    De Francesco, Davide; Leech, Robert; Sabin, Caroline A.; Winston, Alan

    2018-01-01

    Objective The reported prevalence of cognitive impairment remains similar to that reported in the pre-antiretroviral therapy era. This may be partially artefactual due to the methods used to diagnose impairment. In this study, we evaluated the diagnostic performance of the HIV-associated neurocognitive disorder (Frascati criteria) and global deficit score (GDS) methods in comparison to a new, multivariate method of diagnosis. Methods Using a simulated ‘normative’ dataset informed by real-world cognitive data from the observational Pharmacokinetic and Clinical Observations in PeoPle Over fiftY (POPPY) cohort study, we evaluated the apparent prevalence of cognitive impairment using the Frascati and GDS definitions, as well as a novel multivariate method based on the Mahalanobis distance. We then quantified the diagnostic properties (including positive and negative predictive values and accuracy) of each method, using bootstrapping with 10,000 replicates, with a separate ‘test’ dataset to which a pre-defined proportion of ‘impaired’ individuals had been added. Results The simulated normative dataset demonstrated that up to ~26% of a normative control population would be diagnosed with cognitive impairment with the Frascati criteria and ~20% with the GDS. In contrast, the multivariate Mahalanobis distance method identified impairment in ~5%. Using the test dataset, diagnostic accuracy [95% confidence intervals] and positive predictive value (PPV) was best for the multivariate method vs. Frascati and GDS (accuracy: 92.8% [90.3–95.2%] vs. 76.1% [72.1–80.0%] and 80.6% [76.6–84.5%] respectively; PPV: 61.2% [48.3–72.2%] vs. 29.4% [22.2–36.8%] and 33.9% [25.6–42.3%] respectively). Increasing the a priori false positive rate for the multivariate Mahalanobis distance method from 5% to 15% resulted in an increase in sensitivity from 77.4% (64.5–89.4%) to 92.2% (83.3–100%) at the cost of a reduction in specificity from 94.5% (92.8–95.2%) to 85.0% (81.2–88.5%). 
Conclusion Our simulations suggest that the commonly used diagnostic criteria of HIV-associated cognitive impairment label a significant proportion of a normative reference population as cognitively impaired, which will likely lead to a substantial over-estimate of the true proportion in a study population, due to their lower than expected specificity. These findings have important implications for clinical research regarding cognitive health in people living with HIV. More accurate methods of diagnosis should be implemented, with multivariate techniques offering a promising solution. PMID:29641619
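The multivariate criterion described above can be sketched as follows: under a multivariate-normal normative model, the squared Mahalanobis distance of a subject's test profile follows a chi-square distribution, so a cutoff at the desired false-positive rate yields an impairment flag with known specificity. This is a sketch of the general technique under assumed synthetic data, not the POPPY analysis.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_impairment(scores, norm_mean, norm_cov, fpr=0.05):
    """Flag impairment when the squared Mahalanobis distance of a
    subject's cognitive test profile from the normative mean exceeds
    the chi-square cutoff for the chosen false-positive rate."""
    inv = np.linalg.inv(norm_cov)
    diff = scores - norm_mean
    d2 = np.einsum('...i,ij,...j->...', diff, inv, diff)
    cutoff = chi2.ppf(1 - fpr, df=len(norm_mean))
    return d2 > cutoff
```

By construction, applying this to a purely normative population flags roughly `fpr` of subjects, which is why the simulated ~5% rate above matches the nominal false-positive rate, unlike the Frascati and GDS definitions.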

  9. The Living Planet Index: using species population time series to track trends in biodiversity

    PubMed Central

    Loh, Jonathan; Green, Rhys E; Ricketts, Taylor; Lamoreux, John; Jenkins, Martin; Kapos, Valerie; Randers, Jorgen

    2005-01-01

    The Living Planet Index was developed to measure the changing state of the world's biodiversity over time. It uses time-series data to calculate average rates of change in a large number of populations of terrestrial, freshwater and marine vertebrate species. The dataset contains about 3000 population time series for over 1100 species. Two methods of calculating the index are outlined: the chain method and a method based on linear modelling of log-transformed data. The dataset is analysed to compare the relative representation of biogeographic realms, ecoregional biomes, threat status and taxonomic groups among species contributing to the index. The two methods show very similar results: terrestrial species declined on average by 25% from 1970 to 2000. Birds and mammals are over-represented in comparison with other vertebrate classes, and temperate species are over-represented compared with tropical species, but there is little difference in representation between threatened and non-threatened species. Some of the problems arising from over-representation are reduced by the way in which the index is calculated. It may be possible to reduce this further by post-stratification and weighting, but new information would first need to be collected for data-poor classes, realms and biomes. PMID:15814346
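The chain method mentioned above can be sketched as follows: for each year, average the log annual rates of change across all populations with data (a geometric mean of rates), then chain the index forward from a base year. This is a simplified sketch; the published index also involves interpolation of gappy series and taxonomic weighting, which are omitted here.

```python
import numpy as np

def lpi_chain(series):
    """Chain-method index sketch. `series` is a (populations x years)
    array of abundance counts; NaN marks missing years, which nanmean
    skips when averaging log-rates across populations."""
    counts = np.asarray(series, dtype=float)
    log_rates = np.diff(np.log(counts), axis=1)   # log N_t / N_{t-1}
    mean_log = np.nanmean(log_rates, axis=0)      # geometric-mean rate per year
    return np.concatenate([[1.0], np.exp(np.cumsum(mean_log))])
```

For example, if every population declines by 10% per year, the index falls from 1.0 to 0.9 to 0.81 over two intervals, regardless of the populations' absolute sizes.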

  10. Probabilistic atlas and geometric variability estimation to drive tissue segmentation.

    PubMed

    Xu, Hao; Thirion, Bertrand; Allassonnière, Stéphanie

    2014-09-10

    Computerized anatomical atlases play an important role in medical image analysis. While an atlas usually refers to a standard or mean image (also called a template) that presumably represents a given population well, a template alone is not enough to characterize the observed population in detail. A template image should be learned jointly with the geometric variability of the shapes represented in the observations. Together, these two quantities form the atlas of the corresponding population. The geometric variability is modeled as deformations of the template image so that it fits the observations. In this paper, we provide a detailed analysis of a new generative statistical model based on dense deformable templates that represents several tissue types observed in medical images. Our atlas contains both an estimation of probability maps of each tissue (called a class) and the deformation metric. We use a stochastic algorithm to estimate the probabilistic atlas given a dataset. This atlas is then used in an atlas-based segmentation method to segment new images. Experiments are shown on brain T1 MRI datasets. Copyright © 2014 John Wiley & Sons, Ltd.

  11. Semi-supervised manifold learning with affinity regularization for Alzheimer's disease identification using positron emission tomography imaging.

    PubMed

    Lu, Shen; Xia, Yong; Cai, Tom Weidong; Feng, David Dagan

    2015-01-01

    Dementia, and Alzheimer's disease (AD) in particular, is a global problem and a major threat to the aging population. An image-based computer-aided dementia diagnosis method is needed to assist doctors during medical image examination. Many machine-learning-based dementia classification methods using medical imaging have been proposed, and most of them achieve accurate results. However, most of these methods rely on supervised learning, which requires a fully labeled image dataset that is usually not practical in a real clinical environment. Using a large amount of unlabeled images can improve dementia classification performance. In this study we propose a new semi-supervised dementia classification method based on random manifold learning with affinity regularization. Three groups of spatial features are extracted from positron emission tomography (PET) images to construct an unsupervised random forest, which is then used to regularize the manifold learning objective function. The proposed method, the state-of-the-art Laplacian support vector machine (LapSVM) and a supervised SVM are applied to classify AD patients and normal controls (NC). The experimental results show that learning with unlabeled images indeed improves classification performance, and that our method outperforms LapSVM on the same dataset.

  12. Optical pre-screening in breast screening programs: Can we identify women who benefit most from limited mammography resources?

    NASA Astrophysics Data System (ADS)

    Walter, Jane; Loshchenov, Maxim; Zhilkin, Vladimir; Peake, Rachel; Stone, Jennifer; Lilge, Lothar

    2017-04-01

    Background: In excess of 60% of all cancers are detected in low- and middle-income countries, with breast cancer (BC) the dominant malignancy for women. Incidence rates continue to climb, most noticeably in the under-50 population. Expansion of mammography infrastructure and resources is lacking, resulting in over 60% of women being diagnosed with stage III/IV BC in the majority of these countries. Optical Breast Spectroscopy (OBS) was shown to correlate well with mammographic breast density (MBD). OBS could aid breast screening programs in low- and middle-income countries by lowering the number of mammograms required for complete population coverage. However, its performance needs to be tested in large population trials to ensure high sensitivity and acceptable specificity. Methods: For the planned studies in low- and middle-income countries on different continents, online methods need to be implemented to monitor the performance and data collection of these devices, operated by trained nurses. Based on existing datasets, procedures were developed to validate an individual woman's data integrity and to identify operator errors versus system malfunctions. Results: Using a dataset comprising spectra from 360 women collected by 2 instruments in different locations and with 3 different trained operators, automated methods were developed to identify 100% of the source or photodetector malfunctions as well as incorrect calibrations, and 96% of instances of insufficient tissue contact. Conclusions: Implementing the dataset validation locally in each instrument, tethered to a cloud database, will allow the planned clinical trials to proceed.
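
    As a rough illustration of the kind of automated data-integrity checks described, simple rules over a recorded spectrum can separate instrument malfunction from poor tissue contact. The function, flag names and thresholds below are invented for illustration and are not the study's actual criteria:

```python
def validate_spectrum(intensities, dark_level=0.01, contact_level=0.1):
    """Toy data-integrity checks for one optical spectrum (arbitrary
    units). Returns a list of flags; an empty list means the scan passes."""
    flags = []
    peak = max(intensities)
    if peak - min(intensities) < dark_level:
        # a (near-)flat trace suggests a source or photodetector malfunction
        flags.append("detector_or_source_malfunction")
    elif peak < contact_level:
        # a uniformly weak signal suggests insufficient tissue contact
        flags.append("insufficient_tissue_contact")
    return flags
```

    A flat trace (no dynamic range) is flagged as a hardware malfunction, while a uniformly weak but varying signal is flagged as insufficient tissue contact; running such rules on-device, as the abstract proposes, lets operators repeat a failed scan immediately.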

  13. Predicting Virtual World User Population Fluctuations with Deep Learning

    PubMed Central

    Park, Nuri; Zhang, Qimeng; Kim, Jun Gi; Kang, Shin Jin; Kim, Chang Hun

    2016-01-01

    This paper proposes a system for predicting increases in virtual world user actions. The virtual world user population is a very important aspect of these worlds; however, methods for predicting fluctuations in these populations have not been well documented. Therefore, we attempt to predict changes in virtual world user populations with deep learning, using easily accessible online data, including formal datasets from Google Trends, Wikipedia, and online communities, as well as informal datasets collected from online forums. We use the proposed system to analyze the user population of EVE Online, one of the largest virtual worlds. PMID:27936009

  14. Predicting Virtual World User Population Fluctuations with Deep Learning.

    PubMed

    Kim, Young Bin; Park, Nuri; Zhang, Qimeng; Kim, Jun Gi; Kang, Shin Jin; Kim, Chang Hun

    2016-01-01

    This paper proposes a system for predicting increases in virtual world user actions. The virtual world user population is a very important aspect of these worlds; however, methods for predicting fluctuations in these populations have not been well documented. Therefore, we attempt to predict changes in virtual world user populations with deep learning, using easily accessible online data, including formal datasets from Google Trends, Wikipedia, and online communities, as well as informal datasets collected from online forums. We use the proposed system to analyze the user population of EVE Online, one of the largest virtual worlds.

  15. Comparing the Performance of NoSQL Approaches for Managing Archetype-Based Electronic Health Record Data

    PubMed Central

    Freire, Sergio Miranda; Teodoro, Douglas; Wei-Kleiner, Fang; Sundvall, Erik; Karlsson, Daniel; Lambrix, Patrick

    2016-01-01

    This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest. PMID:26958859

  16. Comparing the Performance of NoSQL Approaches for Managing Archetype-Based Electronic Health Record Data.

    PubMed

    Freire, Sergio Miranda; Teodoro, Douglas; Wei-Kleiner, Fang; Sundvall, Erik; Karlsson, Daniel; Lambrix, Patrick

    2016-01-01

    This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest.

  17. SNPs and Haplotypes in Native American Populations

    PubMed Central

    Kidd, Judith R.; Friedlaender, Françoise; Pakstis, Andrew J.; Furtado, Manohar; Fang, Rixun; Wang, Xudong; Nievergelt, Caroline M.; Kidd, Kenneth K.

    2013-01-01

    Autosomal DNA polymorphisms can provide new information and understanding of both the origins of and relationships among modern Native American populations. At the same time that autosomal markers can be highly informative, they are also susceptible to ascertainment biases in the selection of the markers to use. Identifying markers that can be used for ancestry inference among Native American populations can be considered separate from identifying markers to further the quest for history. In the current study we use data on nine Native American populations to compare the results based on a large haplotype-based dataset with those from relatively small independent sets of SNPs. We are interested in which types of limited datasets, of the kind an individual laboratory might be able to collect, are best for addressing two different questions of interest. First, how well can we differentiate the Native American populations and/or infer ancestry by assigning an individual to her population(s) of origin? Second, how well can we infer the historical/evolutionary relationships among Native American populations and their Eurasian origins? We conclude that only a large comprehensive dataset involving multiple autosomal markers on multiple populations will be able to answer both questions; different small sets of markers are able to answer only one or the other of these questions. Using our largest dataset we see a generally increasing distance from Old World populations from North to South in the New World, except for an unexplained close relationship between our Maya and Quechua samples. PMID:21913176

  18. Human neutral genetic variation and forensic STR data.

    PubMed

    Silva, Nuno M; Pereira, Luísa; Poloni, Estella S; Currat, Mathias

    2012-01-01

    The forensic genetics field is generating extensive population data on polymorphism of short tandem repeats (STR) markers in globally distributed samples. In this study we explored and quantified the informative power of these datasets to address issues related to human evolution and diversity, by using two online resources: an allele frequency dataset representing 141 populations summing up to almost 26 thousand individuals; a genotype dataset consisting of 42 populations and more than 11 thousand individuals. We show that the genetic relationships between populations based on forensic STRs are best explained by geography, as observed when analysing other worldwide datasets generated specifically to study human diversity. However, the global level of genetic differentiation between populations (as measured by a fixation index) is about half the value estimated with those other datasets, which contain a much higher number of markers but much less individuals. We suggest that the main factor explaining this difference is an ascertainment bias in forensics data resulting from the choice of markers for individual identification. We show that this choice results in average low variance of heterozygosity across world regions, and hence in low differentiation among populations. Thus, the forensic genetic markers currently produced for the purpose of individual assignment and identification allow the detection of the patterns of neutral genetic structure that characterize the human population but they do underestimate the levels of this genetic structure compared to the datasets of STRs (or other kinds of markers) generated specifically to study the diversity of human populations.
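
    The fixation index used above to quantify differentiation can be illustrated for a single biallelic locus: it compares the average within-population heterozygosity with the heterozygosity expected if the populations were pooled. This is the textbook Wright's Fst, shown here without the sample-size corrections a real estimator would apply:

```python
def fst_biallelic(freqs):
    """Wright's Fst for one biallelic locus, given the frequency of one
    allele in each population (equal weights, no sample-size correction)."""
    p_bar = sum(freqs) / len(freqs)
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total
```

    Identical allele frequencies give Fst = 0; the further the frequencies diverge, the larger the share of total variation that lies between populations. Low-variance forensic markers, as the abstract notes, push this statistic downward.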

  19. Ancestral haplotype-based association mapping with generalized linear mixed models accounting for stratification.

    PubMed

    Zhang, Z; Guillaume, F; Sartelet, A; Charlier, C; Georges, M; Farnir, F; Druet, T

    2012-10-01

    In many situations, genome-wide association studies are performed in populations presenting stratification. Mixed models including a kinship matrix accounting for genetic relatedness among individuals have been shown to correct for population and/or family structure. Here we extend this methodology to generalized linear mixed models, which properly model data under various distributions. In addition, we perform association with ancestral haplotypes inferred using a hidden Markov model. The method was shown to properly account for stratification under various simulated scenarios presenting population and/or family structure. Use of ancestral haplotypes resulted in higher power than SNPs on simulated datasets. Application to real data demonstrates the usefulness of the developed model. Full analysis of a dataset with 4600 individuals and 500 000 SNPs was performed in 2 h 36 min and required 2.28 Gb of RAM. The software GLASCOW can be freely downloaded from www.giga.ulg.ac.be/jcms/prod_381171/software. Contact: francois.guillaume@jouy.inra.fr. Supplementary data are available at Bioinformatics online.

  20. CoMet: a workflow using contig coverage and composition for binning a metagenomic sample with high precision.

    PubMed

    Herath, Damayanthi; Tang, Sen-Lin; Tandon, Kshitij; Ackland, David; Halgamuge, Saman Kumara

    2017-12-28

    In metagenomics, the separation of nucleotide sequences belonging to an individual or closely matched populations is termed binning. Binning helps the evaluation of underlying microbial population structure as well as the recovery of individual genomes from a sample of uncultivable microbial organisms. Both supervised and unsupervised learning methods have been employed in binning; however, characterizing a metagenomic sample containing multiple strains remains a significant challenge. In this study, we designed and implemented a new workflow, Coverage and composition based binning of Metagenomes (CoMet), for binning contigs in a single metagenomic sample. CoMet utilizes coverage values and the compositional features of metagenomic contigs. The binning strategy in CoMet includes the initial grouping of contigs in guanine-cytosine (GC) content-coverage space and refinement of bins in tetranucleotide frequencies space in a purely unsupervised manner. With CoMet, the clustering algorithm DBSCAN is employed for binning contigs. The performances of CoMet were compared against four existing approaches for binning a single metagenomic sample, including MaxBin, Metawatt, MyCC (default) and MyCC (coverage), using multiple datasets including a sample comprised of multiple strains. Binning methods based on both compositional features and coverages of contigs had higher performances than the method based only on compositional features of contigs. CoMet yielded higher or comparable precision in comparison to the existing binning methods on benchmark datasets of varying complexities. MyCC (coverage) had the highest ranking score in F1-score. However, the performances of CoMet were higher than MyCC (coverage) on the dataset containing multiple strains. Furthermore, CoMet recovered contigs of more species and was 18-39% higher in precision than the compared existing methods in discriminating species from the sample of multiple strains. CoMet resulted in higher precision than MyCC (default) and MyCC (coverage) on a real metagenome. The approach proposed with CoMet for binning contigs improves the precision of binning while characterizing more species in a single metagenomic sample and in a sample containing multiple strains. The F1-scores obtained from different binning strategies vary with different datasets; however, CoMet yields the highest F1-score with a sample comprised of multiple strains.
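
    CoMet's initial grouping step, contigs clustered in GC-coverage space with DBSCAN, can be sketched with a minimal pure-Python DBSCAN. The toy contig features and the eps and min_pts values below are illustrative and are not CoMet's defaults:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over 2-D points. Returns one label per point;
    -1 marks noise. `min_pts` counts the point itself, as is standard."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core -> border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:  # j is itself a core point: expand
                queue.extend(j_seeds)
    return labels

# Toy contigs as (GC fraction, log coverage): two groups and one outlier.
contigs = [(0.40, 1.0), (0.41, 1.1), (0.39, 0.9),
           (0.60, 3.0), (0.61, 3.1), (0.59, 2.9),
           (0.90, 6.0)]
labels = dbscan(contigs, eps=0.3, min_pts=2)
```

    Density-based clustering suits this setting because the number of bins need not be fixed in advance and isolated low-coverage contigs are simply left unassigned as noise, rather than forced into a bin.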

  1. EnviroAtlas - Percentage of Working Age Population Who Are Employed by Block Group for the Conterminous United States

    EPA Pesticide Factsheets

    This EnviroAtlas dataset shows the employment rate, or the percent of the population aged 16-64 who have worked in the past 12 months. The employment rate is a measure of the percent of the working-age population who are employed. It is an indicator of the prevalence of unemployment, which is often used by economists to assess labor market conditions. It is a widely used metric to evaluate the sustainable development of communities (NRC, 2011; UNECE, 2009). This dataset is based on the American Community Survey 5-year data for 2008-2012. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  2. PASTA for Proteins.

    PubMed

    Collins, Kodi; Warnow, Tandy

    2018-06-19

    PASTA is a multiple sequence alignment method that uses divide-and-conquer plus iteration to enable base alignment methods to scale with high accuracy to large sequence datasets. By default, PASTA uses MAFFT L-INS-i; our new extension of PASTA enables the use of MAFFT G-INS-i, MAFFT Homologs, CONTRAlign, and ProbCons. We analyzed the performance of each base method and of PASTA using these base methods on 224 datasets from BAliBASE 4 with at least 50 sequences. We show that PASTA enables the most accurate base methods to scale to larger datasets at reduced computational effort, and generally improves alignment and tree accuracy on the largest BAliBASE datasets. PASTA is available at https://github.com/kodicollins/pasta and has also been integrated into the original PASTA repository at https://github.com/smirarab/pasta. Supplementary data are available at Bioinformatics online.

  3. Automated grouping of action potentials of human embryonic stem cell-derived cardiomyocytes.

    PubMed

    Gorospe, Giann; Zhu, Renjun; Millrod, Michal A; Zambidis, Elias T; Tung, Leslie; Vidal, Rene

    2014-09-01

    Methods for obtaining cardiomyocytes from human embryonic stem cells (hESCs) are improving at a significant rate. However, the characterization of these cardiomyocytes (CMs) is evolving at a relatively slower rate. In particular, there is still uncertainty in classifying the phenotype (ventricular-like, atrial-like, nodal-like, etc.) of an hESC-derived cardiomyocyte (hESC-CM). While previous studies identified the phenotype of a CM based on electrophysiological features of its action potential, the criteria for classification were typically subjective and differed across studies. In this paper, we use techniques from signal processing and machine learning to develop an automated approach to discriminate the electrophysiological differences between hESC-CMs. Specifically, we propose a spectral grouping-based algorithm to separate a population of CMs into distinct groups based on the similarity of their action potential shapes. We applied this method to a dataset of optical maps of cardiac cell clusters dissected from human embryoid bodies. While some of the nine cell clusters in the dataset presented with just one phenotype, the majority of the cell clusters presented with multiple phenotypes. The proposed algorithm is generally applicable to other action potential datasets and could prove useful in investigating the purification of specific types of CMs from an electrophysiological perspective.

  4. Automated Grouping of Action Potentials of Human Embryonic Stem Cell-Derived Cardiomyocytes

    PubMed Central

    Gorospe, Giann; Zhu, Renjun; Millrod, Michal A.; Zambidis, Elias T.; Tung, Leslie; Vidal, René

    2015-01-01

    Methods for obtaining cardiomyocytes from human embryonic stem cells (hESCs) are improving at a significant rate. However, the characterization of these cardiomyocytes is evolving at a relatively slower rate. In particular, there is still uncertainty in classifying the phenotype (ventricular-like, atrial-like, nodal-like, etc.) of an hESC-derived cardiomyocyte (hESC-CM). While previous studies identified the phenotype of a cardiomyocyte based on electrophysiological features of its action potential, the criteria for classification were typically subjective and differed across studies. In this paper, we use techniques from signal processing and machine learning to develop an automated approach to discriminate the electrophysiological differences between hESC-CMs. Specifically, we propose a spectral grouping-based algorithm to separate a population of cardiomyocytes into distinct groups based on the similarity of their action potential shapes. We applied this method to a dataset of optical maps of cardiac cell clusters dissected from human embryoid bodies (hEBs). While some of the 9 cell clusters in the dataset presented with just one phenotype, the majority of the cell clusters presented with multiple phenotypes. The proposed algorithm is generally applicable to other action potential datasets and could prove useful in investigating the purification of specific types of cardiomyocytes from an electrophysiological perspective. PMID:25148658
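
    The spectral grouping itself involves an eigendecomposition of a waveform-similarity matrix; as a much-simplified stand-in, the sketch below groups amplitude-normalized action potentials greedily by mean absolute difference of shape. All names, traces and thresholds are illustrative assumptions, not the paper's algorithm:

```python
def normalize(trace):
    """Scale a trace to [0, 1] so grouping reflects shape, not amplitude."""
    lo, hi = min(trace), max(trace)
    return [(v - lo) / (hi - lo) for v in trace]

def group_by_shape(traces, threshold):
    """Greedy grouping: assign each trace to the first group whose
    representative is within `threshold` mean absolute difference."""
    reps, labels = [], []
    for t in traces:
        n = normalize(t)
        for g, r in enumerate(reps):
            if sum(abs(a - b) for a, b in zip(n, r)) / len(n) < threshold:
                labels.append(g)
                break
        else:
            reps.append(n)           # no close group: start a new one
            labels.append(len(reps) - 1)
    return labels

# Toy "action potentials": two plateau-like traces and one spike-like trace.
aps = [[0, 10, 9, 8, 7, 0], [0, 12, 11, 10, 8, 0], [0, 10, 3, 1, 0.5, 0]]
labels = group_by_shape(aps, threshold=0.15)
```

    Normalizing before comparison matters: a ventricular-like plateau and a nodal-like spike differ in shape even when their amplitudes happen to match.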

  5. The French-Canadian data set of Demirjian for dental age estimation: a systematic review and meta-analysis.

    PubMed

    Jayaraman, Jayakumar; Wong, Hai Ming; King, Nigel M; Roberts, Graham J

    2013-07-01

    Estimation of the age of an individual can be performed by evaluating the pattern of dental development. A dataset for age estimation based on the dental maturity of a French-Canadian population was published over 35 years ago and has become the most widely accepted dataset. The applicability of this dataset has been tested on different population groups. The aim of this review was to estimate the observed differences between chronological age (CA) and dental age (DA) when the French-Canadian dataset was used to estimate the age of different population groups. A systematic search of the literature for papers utilizing the French-Canadian dataset for age estimation was performed. All-language articles from the PubMed, Embase and Cochrane databases were electronically searched for the terms 'Demirjian' and 'Dental age' published between January 1973 and December 2011. A hand search of articles was also conducted. A total of 274 studies were identified, from which 34 studies were included for qualitative analysis and 12 studies were included for quantitative assessment and meta-analysis. When synthesizing the estimation results from different population groups, on average, the Demirjian dataset overestimated the age of females by 0.65 years (-0.10 years to +2.82 years) and of males by 0.60 years (-0.23 years to +3.04 years). The French-Canadian dataset overestimates the age of subjects by more than six months, and hence should be used only with considerable caution when estimating the age of groups of subjects from any global population. Copyright © 2013 Elsevier Ltd and Faculty of Forensic and Legal Medicine. All rights reserved.
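
    The quantitative synthesis step of such a meta-analysis can be sketched as a fixed-effect, inverse-variance pooled mean of the per-study CA-DA differences. The study values below are invented for illustration, not those of the twelve included studies:

```python
def pooled_mean_difference(diffs, variances):
    """Fixed-effect (inverse-variance) pooled mean of per-study mean
    differences (here, CA-DA overestimation in years). Studies with
    smaller variance receive proportionally more weight."""
    weights = [1 / v for v in variances]
    return sum(w * d for w, d in zip(weights, diffs)) / sum(weights)

# Three hypothetical studies: overestimations of 0.3, 0.8 and 0.6 years,
# with the last study the most precise (smallest variance).
pooled = pooled_mean_difference([0.3, 0.8, 0.6], [0.04, 0.09, 0.01])
```

    The pooled value lands closest to the most precise study, which is the point of inverse-variance weighting; a random-effects model would additionally allow for between-population heterogeneity.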

  6. UNCLES: method for the identification of genes differentially consistently co-expressed in a specific subset of datasets.

    PubMed

    Abu-Jamous, Basel; Fa, Rui; Roberts, David J; Nandi, Asoke K

    2015-06-04

    Collective analysis of the increasingly numerous gene expression datasets is required. The recently proposed binarisation of consensus partition matrices (Bi-CoPaM) method can combine clustering results from multiple datasets to identify the subsets of genes which are consistently co-expressed in all of the provided datasets in a tuneable manner. However, results validation and parameter setting are issues that complicate the design of such methods. Moreover, although it is a common practice to test methods by application to synthetic datasets, the mathematical models used to synthesise such datasets are usually based on approximations which may not always be sufficiently representative of real datasets. Here, we propose an unsupervised method for the unification of clustering results from multiple datasets using external specifications (UNCLES). This method has the ability to identify the subsets of genes consistently co-expressed in a subset of datasets while being poorly co-expressed in another subset of datasets, and to identify the subsets of genes consistently co-expressed in all given datasets. We also propose the M-N scatter plots validation technique and adopt it to set the parameters of UNCLES, such as the number of clusters, automatically. Additionally, we propose an approach for the synthesis of gene expression datasets using real data profiles in a way which combines the ground-truth-knowledge of synthetic data and the realistic expression values of real data, and therefore overcomes the problem of faithfulness of synthetic expression data modelling. By application to those datasets, we validate UNCLES while comparing it with other conventional clustering methods, and of particular relevance, biclustering methods. We further validate UNCLES by application to a set of 14 real genome-wide yeast datasets as it produces focused clusters that conform well to known biological facts. Furthermore, in-silico-based hypotheses regarding the function of a few previously unknown genes in those focused clusters are drawn. The UNCLES method, the M-N scatter plots technique, and the expression data synthesis approach will have wide application for the comprehensive analysis of genomic and other sources of multiple complex biological datasets. Moreover, the derived in-silico-based biological hypotheses represent subjects for future functional studies.

  7. Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples

    PubMed Central

    White, James Robert; Nagarajan, Niranjan; Pop, Mihai

    2009-01-01

    Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries is computational tools that are able to rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them. We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/. PMID:19360128
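
    Metastats itself is available at the URL above; its two key ingredients for sparse features, Fisher's exact test and false discovery rate control, can be sketched in standard-library Python. Benjamini-Hochberg is used here as a generic FDR procedure and is an illustrative stand-in rather than the paper's permutation-based FDR estimate; the counts are invented:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table."""
    row1, col1, n = a + b, a + c, a + b + c + d
    def prob(x):  # P(first cell == x) under fixed margins
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)
    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    return sum(pr for pr in probs if pr <= p_obs + 1e-12)

def benjamini_hochberg(pvals, q=0.05):
    """Return a reject/accept flag per p-value at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank r with p_(r) <= (r / m) * q
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject

# A perfectly separated sparse feature: 5 of 5 counts in one population.
p = fisher_exact_two_sided(5, 0, 0, 5)
flags = benjamini_hochberg([0.001, 0.2, 0.01, 0.04], q=0.05)
```

    The exact test keeps its nominal level even when a feature appears in only a handful of reads, which is exactly the sparse-count regime where normal-approximation tests break down.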

  8. Combined proportional and additive residual error models in population pharmacokinetic modelling.

    PubMed

    Proost, Johannes H

    2017-11-15

    In pharmacokinetic modelling, a combined proportional and additive residual error model is often preferred over a purely proportional or additive residual error model. Different approaches have been proposed, but a comparison between approaches is still lacking. The theoretical background of the methods is described. Method VAR assumes that the variance of the residual error is the sum of the statistically independent proportional and additive components; this method can be coded in three ways. Method SD assumes that the standard deviation of the residual error is the sum of the proportional and additive components. Using datasets from the literature and simulations based on these datasets, the methods are compared using NONMEM. The three codings of method VAR yield identical results. Using method SD, the values of the parameters describing the residual error are lower than for method VAR, but the values of the structural parameters and their inter-individual variability are hardly affected by the choice of method. Both methods are valid approaches to combined proportional and additive residual error modelling, and selection may be based on the objective function value (OFV). When the result of an analysis is used for simulation purposes, it is essential that the simulation tool uses the same method as used during analysis. Copyright © 2017 Elsevier B.V. All rights reserved.
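
    The distinction between the two methods comes down to how the additive component a and the proportional component b·f (where f is the model prediction) are combined into a residual standard deviation, as a sketch (generic symbols, not NONMEM code):

```python
import math

def residual_sd_var(f, a, b):
    """Method VAR: independent components; variance = a^2 + (b*f)^2."""
    return math.sqrt(a ** 2 + (b * f) ** 2)

def residual_sd_sd(f, a, b):
    """Method SD: components add on the SD scale; sd = a + b*f."""
    return a + b * f
```

    Because (a + b·f)² = a² + (b·f)² + 2ab·f ≥ a² + (b·f)² for non-negative a, b and f, method SD gives the larger residual SD for the same parameter values, which is consistent with its fitted error parameters coming out lower, as the abstract reports.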

  9. Leveling data in geochemical mapping: scope of application, pros and cons of existing methods

    NASA Astrophysics Data System (ADS)

    Pereira, Benoît; Vandeuren, Aubry; Sonnet, Philippe

    2017-04-01

    Geochemical mapping has successfully met a range of needs, from mineral exploration to environmental management. In Europe and around the world, numerous geochemical datasets already exist. These datasets may originate from geochemical mapping projects or from the collection of sample analyses requested by environmental protection regulatory bodies. Combining datasets can be highly beneficial for establishing geochemical maps with increased resolution and/or coverage area. However, this practice requires assessing the equivalence between datasets and, if needed, applying data leveling to remove possible biases between datasets. In the literature, several procedures for assessing dataset equivalence and leveling data have been proposed. Daneshfar & Cameron (1998) proposed a method for the leveling of two adjacent datasets, while Pereira et al. (2016) proposed two methods for the leveling of datasets that contain records located within the same geographical area. Each method requires its own set of assumptions (underlying populations of data, spatial distribution of data, etc.). Here we discuss the scope of application, pros, cons and practical recommendations for each method. This work is illustrated with several case studies in Wallonia (Southern Belgium) and in Europe involving trace element geochemical datasets. References: Daneshfar, B. & Cameron, E. (1998), Leveling geochemical data between map sheets, Journal of Geochemical Exploration 63(3), 189-201. Pereira, B.; Vandeuren, A.; Govaerts, B. B. & Sonnet, P. (2016), Assessing dataset equivalence and leveling data in geochemical mapping, Journal of Geochemical Exploration 168, 36-48.
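
    One elementary leveling scheme of the kind these procedures formalize is an additive median shift estimated over the area shared by two datasets. This sketch, with invented concentration values, is illustrative only; the cited methods are considerably more elaborate:

```python
from statistics import median

def level_to_reference(reference_overlap, target_overlap, target_all):
    """Additive leveling: shift the target dataset so that, over the area
    shared with the reference dataset, the two medians coincide."""
    shift = median(reference_overlap) - median(target_overlap)
    return [v + shift for v in target_all]

# Target concentrations run 2 units low over the shared area; leveling
# removes that bias from the whole target dataset.
leveled = level_to_reference([10, 12, 14], [8, 10, 12], [8, 10, 12, 20])
```

    Medians are preferred over means here because trace-element concentrations are typically skewed with outliers; an additive shift only corrects a constant bias, which is one of the assumptions a practitioner must check before merging datasets.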

  10. CrossLink: a novel method for cross-condition classification of cancer subtypes.

    PubMed

    Ma, Chifeng; Sastry, Konduru S; Flore, Mario; Gehani, Salah; Al-Bozom, Issam; Feng, Yusheng; Serpedin, Erchin; Chouchane, Lotfi; Chen, Yidong; Huang, Yufei

    2016-08-22

    We considered the prediction of cancer classes (e.g. subtypes) using patient gene expression profiles that contain both systematic and condition-specific biases when compared with the training reference dataset. Conventional normalization-based approaches cannot guarantee that the gene signatures in the reference and prediction datasets always have the same distribution across conditions, as the class-specific gene signatures change with the condition. Therefore, a trained classifier may work well under one condition but not under another. To address this limitation of current normalization approaches, we propose a novel algorithm called CrossLink (CL). CL recognizes that there is no universal, condition-independent normalization mapping of signatures. Instead, it exploits the fact that the signature is unique to its associated class under any condition and thus employs an unsupervised clustering algorithm to discover this unique signature. We assessed the performance of CL for cross-condition prediction of PAM50 subtypes of breast cancer by using a simulated dataset modeled after TCGA BRCA tumor samples with a cross-validation scheme, as well as datasets with known and unknown PAM50 classification. CL achieved prediction accuracy >73%, the highest among the methods we evaluated. We also applied the algorithm to a set of breast cancer tumors from an Arab population to assign a PAM50 classification to each tumor based on its gene expression profile. In all test datasets, CL showed robust and consistent improvement in prediction performance over other state-of-the-art normalization and classification algorithms.

  11. Progeny Clustering: A Method to Identify Biological Phenotypes

    PubMed Central

    Hu, Chenyue W.; Kornblau, Steven M.; Slater, John H.; Qutub, Amina A.

    2015-01-01

    Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and computationally efficient, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method proved successful and robust when applied to two synthetic datasets (a two-dimensional dataset and a ten-dimensional dataset containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and the Rat CNS dataset) and two further biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed several popular clustering evaluation methods on the ten-dimensional synthetic dataset as well as on the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset. PMID:26267476
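
    The co-occurrence probability matrix at the heart of such stability-based methods can be illustrated with a short sketch (this shows only the co-occurrence idea, not the authors' Progeny Sampling procedure; the labelings are toy data):

```python
import numpy as np

def cooccurrence_matrix(labelings):
    # Given several cluster labelings of the same items, estimate the
    # probability that each pair of items lands in the same cluster.
    L = np.asarray(labelings)
    n_runs, n_items = L.shape
    co = np.zeros((n_items, n_items))
    for run in L:
        co += run[:, None] == run[None, :]
    return co / n_runs

# Two toy labelings of five items: items 0-1 always co-cluster, item 2 flips.
runs = [[0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1]]
M = cooccurrence_matrix(runs)
print(M[0, 1], M[3, 4], M[1, 2])  # 1.0 1.0 0.5
```

    Entries near 0 or 1 indicate stable pair assignments; intermediate values signal instability, which stability-based criteria penalize when choosing the number of clusters.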

  12. Mapping for maternal and newborn health: the distributions of women of childbearing age, pregnancies and births

    PubMed Central

    2014-01-01

    Background The health and survival of women and their newborn babies in low income countries has been a key priority in public health since the 1990s. However, basic planning data, such as numbers of pregnancies and births, remain difficult to obtain and information is also lacking on geographic access to key services, such as facilities with skilled health workers. For maternal and newborn health and survival, planning for safer births and healthier newborns could be improved by more accurate estimations of the distributions of women of childbearing age. Moreover, subnational estimates of projected future numbers of pregnancies are needed for more effective strategies on human resources and infrastructure, and information on pregnancies needs to be linked to better information on health facilities in districts and regions so that coverage of services can be assessed. Methods This paper outlines demographic mapping methods based on freely available data for the production of high resolution datasets depicting estimates of numbers of people, women of childbearing age, live births and pregnancies, and the distribution of comprehensive EmONC facilities in four large high-burden countries: Afghanistan, Bangladesh, Ethiopia and Tanzania. Satellite-derived maps of settlements and land cover were constructed and used to redistribute areal census counts to produce detailed maps of the distributions of women of childbearing age. Household survey data, UN statistics and other sources on growth rates, age-specific fertility rates, live births, stillbirths and abortions were then integrated to convert the population distribution datasets to gridded estimates of births and pregnancies. Results and conclusions These estimates, which can be produced for current, past or future years based on standard demographic projections, can provide the basis for strategic intelligence and planning services, and provide denominators for subnational indicators to track progress.
The datasets produced are part of national midwifery workforce assessments conducted in collaboration with the respective Ministries of Health and the United Nations Population Fund (UNFPA) to identify disparities between population needs, health infrastructure and workforce supply. The datasets are available to the respective Ministries as part of the UNFPA programme to inform midwifery workforce planning and also publicly available through the WorldPop population mapping project. PMID:24387010
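
    The core redistribution step described above (areal census counts spread over grid cells according to settlement weights, then converted to births) can be sketched as follows; all numbers are hypothetical, not values from the study:

```python
import numpy as np

def gridded_estimates(county_women, weights, births_per_woman):
    # Dasymetric redistribution: spread a county count of women of
    # childbearing age over grid cells in proportion to settlement weights,
    # then scale by a birth rate to obtain a gridded estimate of births.
    w = np.asarray(weights, dtype=float)
    women_grid = county_women * w / w.sum()
    births_grid = women_grid * births_per_woman
    return women_grid, births_grid

# Hypothetical county: 10,000 women over four cells; the weights and the
# birth rate are illustrative only.
women, births = gridded_estimates(10_000, [5, 1, 0, 4], 0.08)
print(women.sum(), births.sum())  # county totals are preserved
```

    In practice the weights come from satellite-derived settlement and land cover layers, and the rates from household surveys and UN statistics, as described in the abstract.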

  13. A cautionary note on Bayesian estimation of population size by removal sampling with diffuse priors.

    PubMed

    Bord, Séverine; Bioche, Christèle; Druilhet, Pierre

    2018-05-01

    We consider the problem of estimating a population size by removal sampling when the sampling rate is unknown. Bayesian methods are now widespread and allow prior knowledge to be included in the analysis. However, we show that Bayes estimates based on default improper priors lead to improper posteriors or infinite estimates. Similarly, weakly informative priors give unstable estimators that are sensitive to the choice of hyperparameters. By examining the likelihood, we show that population size estimates can be stabilized by penalizing small values of the sampling rate or large values of the population size. Based on theoretical results and simulation studies, we propose some recommendations on the choice of the prior. We then apply our results to real datasets. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
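
    A sense of why these estimates become unstable can be had from the classical two-occasion removal estimator (a standard textbook estimator, not the Bayesian procedure of the paper):

```python
def removal_estimate(c1, c2):
    # Classical two-occasion removal estimator, assuming a constant capture
    # probability p on each pass: N_hat = c1^2 / (c1 - c2), p_hat = (c1 - c2) / c1.
    if c2 >= c1:
        raise ValueError("estimator undefined when c2 >= c1")
    return c1 ** 2 / (c1 - c2), (c1 - c2) / c1

print(removal_estimate(60, 30))  # (120.0, 0.5)
# Near-equal catches leave the sampling rate poorly identified and the
# population estimate explodes, mirroring the instability discussed above:
print(removal_estimate(60, 55))  # (720.0, 0.083...)
```

    When successive catches barely decline, the data carry little information about the sampling rate, which is exactly the regime where diffuse priors produce improper posteriors or infinite estimates.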

  14. Population of 224 realistic human subject-based computational breast phantoms

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Erickson, David W.; Wells, Jered R., E-mail: jered.wells@duke.edu; Sturgeon, Gregory M.

    Purpose: To create a database of highly realistic and anatomically variable 3D virtual breast phantoms based on dedicated breast computed tomography (bCT) data. Methods: A tissue classification and segmentation algorithm was used to create realistic and detailed 3D computational breast phantoms based on 230+ dedicated bCT datasets from normal human subjects. The breast volume was identified using a coarse three-class fuzzy C-means segmentation algorithm which accounted for and removed motion blur at the breast periphery. Noise in the bCT data was reduced through application of a postreconstruction 3D bilateral filter. A 3D adipose nonuniformity (bias field) correction was then applied followed by glandular segmentation using a 3D bias-corrected fuzzy C-means algorithm. Multiple tissue classes were defined including skin, adipose, and several fractional glandular densities. Following segmentation, a skin mask was produced which preserved the interdigitated skin, adipose, and glandular boundaries of the skin interior. Finally, surface modeling was used to produce digital phantoms with methods complementary to the XCAT suite of digital human phantoms. Results: After rejecting some datasets due to artifacts, 224 virtual breast phantoms were created which emulate the complex breast parenchyma of actual human subjects. The volume breast density (with skin) ranged from 5.5% to 66.3% with a mean value of 25.3% ± 13.2%. Breast volumes ranged from 25.0 to 2099.6 ml with a mean value of 716.3 ± 386.5 ml. Three breast phantoms were selected for imaging with digital compression (using finite element modeling) and simple ray-tracing, and the results show promise in their potential to produce realistic simulated mammograms. Conclusions: This work provides a new population of 224 breast phantoms based on in vivo bCT data for imaging research.
Compared to previous studies based on only a few prototype cases, this dataset provides a rich source of new cases spanning a wide range of breast types, volumes, densities, and parenchymal patterns.

  15. Population of 224 realistic human subject-based computational breast phantoms

    PubMed Central

    Erickson, David W.; Wells, Jered R.; Sturgeon, Gregory M.; Dobbins, James T.; Segars, W. Paul; Lo, Joseph Y.

    2016-01-01

    Purpose: To create a database of highly realistic and anatomically variable 3D virtual breast phantoms based on dedicated breast computed tomography (bCT) data. Methods: A tissue classification and segmentation algorithm was used to create realistic and detailed 3D computational breast phantoms based on 230+ dedicated bCT datasets from normal human subjects. The breast volume was identified using a coarse three-class fuzzy C-means segmentation algorithm which accounted for and removed motion blur at the breast periphery. Noise in the bCT data was reduced through application of a postreconstruction 3D bilateral filter. A 3D adipose nonuniformity (bias field) correction was then applied followed by glandular segmentation using a 3D bias-corrected fuzzy C-means algorithm. Multiple tissue classes were defined including skin, adipose, and several fractional glandular densities. Following segmentation, a skin mask was produced which preserved the interdigitated skin, adipose, and glandular boundaries of the skin interior. Finally, surface modeling was used to produce digital phantoms with methods complementary to the XCAT suite of digital human phantoms. Results: After rejecting some datasets due to artifacts, 224 virtual breast phantoms were created which emulate the complex breast parenchyma of actual human subjects. The volume breast density (with skin) ranged from 5.5% to 66.3% with a mean value of 25.3% ± 13.2%. Breast volumes ranged from 25.0 to 2099.6 ml with a mean value of 716.3 ± 386.5 ml. Three breast phantoms were selected for imaging with digital compression (using finite element modeling) and simple ray-tracing, and the results show promise in their potential to produce realistic simulated mammograms. Conclusions: This work provides a new population of 224 breast phantoms based on in vivo bCT data for imaging research.
Compared to previous studies based on only a few prototype cases, this dataset provides a rich source of new cases spanning a wide range of breast types, volumes, densities, and parenchymal patterns. PMID:26745896

  16. Identifying injection drug use and estimating population size of people who inject drugs using healthcare administrative datasets.

    PubMed

    Janjua, Naveed Zafar; Islam, Nazrul; Kuo, Margot; Yu, Amanda; Wong, Stanley; Butt, Zahid A; Gilbert, Mark; Buxton, Jane; Chapinal, Nuria; Samji, Hasina; Chong, Mei; Alvarez, Maria; Wong, Jason; Tyndall, Mark W; Krajden, Mel

    2018-05-01

    Large linked healthcare administrative datasets could be used to monitor programs providing prevention and treatment services to people who inject drugs (PWID). However, diagnostic codes in administrative datasets do not differentiate non-injection from injection drug use (IDU). We validated algorithms based on diagnostic codes and prescription records representing IDU in administrative datasets against interview-based IDU data. The British Columbia Hepatitis Testers Cohort (BC-HTC) includes ∼1.7 million individuals tested for HCV/HIV or reported HBV/HCV/HIV/tuberculosis cases in BC from 1990 to 2015, linked to administrative datasets including physician visit, hospitalization and prescription drug records. IDU, assessed through interviews as part of enhanced surveillance at the time of HIV or HCV/HBV diagnosis for a subset of cases included in the BC-HTC (n = 6559), was used as the gold standard. ICD-9/ICD-10 codes for IDU and injecting-related infections (IRI) were grouped with records of opioid substitution therapy (OST) into multiple IDU algorithms in administrative datasets. We assessed the performance of the IDU algorithms by calculating sensitivity, specificity, positive predictive value, and negative predictive value. Sensitivity was highest (90-94%), and specificity was lowest (42-73%), for algorithms based either on IDU or on IRI and drug misuse codes. Algorithms requiring both drug misuse and IRI had lower sensitivity (57-60%) and higher specificity (90-92%). An optimal sensitivity and specificity combination was found with two medical visits or a single hospitalization for injectable drugs, with OST (83%/82%) and without OST (78%/83%), respectively. Based on algorithms that included two medical visits, a single hospitalization or OST records, there were 41,358 recent PWID in BC (1.2% of individuals aged 11-65 years) based on health encounters during the 3-year period 2013-2015. Algorithms for identifying PWID using diagnostic codes in linked administrative data could be used to track the progress of programming aimed at PWID. With population-based datasets, this tool can be used to inform much-needed estimates of PWID population size. Copyright © 2018 Elsevier B.V. All rights reserved.
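
    The validation metrics reported above are the standard confusion-matrix quantities; a minimal sketch (the counts below are hypothetical, chosen only to illustrate the calculation, not taken from the BC-HTC study):

```python
def validation_metrics(tp, fp, fn, tn):
    # Confusion-matrix metrics for an administrative-data algorithm
    # validated against interview-based IDU status (the gold standard).
    sensitivity = tp / (tp + fn)  # true PWID correctly flagged
    specificity = tn / (tn + fp)  # non-PWID correctly not flagged
    ppv = tp / (tp + fp)          # positive predictive value
    npv = tn / (tn + fn)          # negative predictive value
    return sensitivity, specificity, ppv, npv

# Hypothetical counts for illustration only.
sens, spec, ppv, npv = validation_metrics(tp=830, fp=180, fn=170, tn=820)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} ppv={ppv:.2f} npv={npv:.2f}")
```

    Algorithm choice trades these quantities off: broad code lists raise sensitivity at the cost of specificity, while requiring multiple encounter types does the reverse, as the reported ranges show.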

  17. Resolving the multiple sequence alignment problem using biogeography-based optimization with multiple populations.

    PubMed

    Zemali, El-Amine; Boukra, Abdelmadjid

    2015-08-01

    The multiple sequence alignment (MSA) problem is one of the most challenging in bioinformatics; it involves discovering similarities among a set of protein or DNA sequences. This paper introduces a new method for the MSA problem called biogeography-based optimization with multiple populations (BBOMP). It is based on a recent metaheuristic inspired by the mathematics of biogeography, named biogeography-based optimization (BBO). To improve the exploration ability of BBO, we introduce a new concept that allows better exploration of the search space: manipulating multiple populations, each with its own parameters. These parameters are used to build up progressive alignments, allowing more diversity. At each iteration, the best solution found is injected into each population. Moreover, to improve solution quality, six operators are defined; these operators are selected with a dynamic probability that changes according to each operator's efficiency. To test the performance of the proposed approach, we considered a set of datasets from BAliBASE 2.0 and compared the approach with recent algorithms such as GAPAM, MSA-GA, QEAMSA and RBT-GA. The results show that the proposed approach achieves better average scores than the previously cited methods.

  18. A Population Genetic Signal of Polygenic Adaptation

    PubMed Central

    Berg, Jeremy J.; Coop, Graham

    2014-01-01

    Adaptation in response to selection on polygenic phenotypes may occur via subtle allele frequency shifts at many loci. Current population genomic techniques are not well suited to identifying such signals. In the past decade, detailed knowledge about the specific loci underlying polygenic traits has begun to emerge from genome-wide association studies (GWAS). Here we combine this knowledge from GWAS with robust population genetic modeling to identify traits that may have been influenced by local adaptation. We exploit the fact that GWAS provide an estimate of the additive effect size of many loci to estimate the mean additive genetic value for a given phenotype across many populations as simple weighted sums of allele frequencies. We use a general model of neutral genetic value drift for an arbitrary number of populations with an arbitrary relatedness structure. Based on this model, we develop methods for detecting unusually strong correlations between genetic values and specific environmental variables, as well as a generalization of QST/FST comparisons to test for overdispersion of genetic values among populations. Finally, we lay out a framework to identify the individual populations or groups of populations that contribute to the signal of overdispersion. These tests have considerably greater power than their single-locus equivalents because they look for positive covariance between like-effect alleles, and they also significantly outperform methods that do not account for population structure. We apply our tests to the Human Genome Diversity Panel (HGDP) dataset using GWAS data for height, skin pigmentation, type 2 diabetes, body mass index, and two inflammatory bowel disease datasets. This analysis uncovers a number of putative signals of local adaptation, and we discuss the biological interpretation and caveats of these results. PMID:25102153
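
    The mean additive genetic value as a weighted sum of allele frequencies can be written in a few lines (the effect sizes and frequencies below are hypothetical, chosen only to illustrate the calculation):

```python
import numpy as np

def mean_genetic_value(effect_sizes, allele_freqs):
    # Mean additive genetic value of a population: twice the
    # effect-size-weighted sum of allele frequencies (factor 2 for diploids).
    return 2.0 * np.dot(effect_sizes, allele_freqs)

# Hypothetical GWAS effect sizes and allele frequencies in two populations.
beta = np.array([0.10, -0.05, 0.20])
freq_pop_a = np.array([0.30, 0.60, 0.50])
freq_pop_b = np.array([0.50, 0.40, 0.70])
print(mean_genetic_value(beta, freq_pop_a))  # ≈ 0.20
print(mean_genetic_value(beta, freq_pop_b))  # ≈ 0.34
```

    The tests described above then ask whether such values, computed across many populations, covary with environmental variables or are more dispersed than neutral drift would predict.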

  19. Stature estimation equations for South Asian skeletons based on DXA scans of contemporary adults.

    PubMed

    Pomeroy, Emma; Mushrif-Tripathy, Veena; Wells, Jonathan C K; Kulkarni, Bharati; Kinra, Sanjay; Stock, Jay T

    2018-05-03

    Stature estimation from the skeleton is a classic anthropological problem, and recent years have seen the proliferation of population-specific regression equations. Many rely on the anatomical reconstruction of stature from archaeological skeletons to derive regression equations based on long bone lengths, but this requires a collection with very good preservation. In some regions, for example South Asia, typical environmental conditions preclude the sufficient preservation of skeletal remains. Large-scale epidemiological studies that include medical imaging of the skeleton by techniques such as dual-energy X-ray absorptiometry (DXA) offer new potential datasets for developing such equations. We derived estimation equations based on known height and bone lengths measured from DXA scans from the Andhra Pradesh Children and Parents Study (Hyderabad, India). Given debates on the most appropriate regression model to use, multiple methods were compared, and the performance of the equations was tested on a published skeletal dataset of individuals with known stature. The equations have standard errors of estimate and prediction errors similar to those derived using anatomical reconstruction or from cadaveric datasets. As measured by the number of significant differences between true and estimated stature, and by the prediction errors, the new equations perform as well as, and generally better than, published equations commonly used on South Asian skeletons or based on Indian cadaveric datasets. This study demonstrates the utility of DXA scans as a data source for developing stature estimation equations and offers a new set of equations for use with South Asian datasets. © 2018 Wiley Periodicals, Inc.
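
    A minimal sketch of the kind of regression involved (ordinary least squares with a standard error of estimate; the bone lengths and statures below are invented, and the study itself compared several regression models):

```python
import numpy as np

def fit_stature_equation(bone_lengths, statures):
    # Ordinary least squares fit of stature = a + b * bone_length, with the
    # standard error of estimate (SEE) customarily reported for such equations.
    b, a = np.polyfit(bone_lengths, statures, 1)
    residuals = statures - (a + b * bone_lengths)
    see = np.sqrt(np.sum(residuals ** 2) / (len(statures) - 2))
    return a, b, see

# Invented femur lengths (cm) and statures (cm), for illustration only.
femur = np.array([42.0, 44.5, 45.1, 46.8, 48.2, 50.0])
stature = np.array([158.0, 164.0, 166.5, 169.0, 173.5, 178.0])
a, b, see = fit_stature_equation(femur, stature)
print(f"stature = {a:.1f} + {b:.2f} * femur (SEE = {see:.2f} cm)")
```

    The SEE is what allows the DXA-derived equations above to be compared directly against equations built on anatomical reconstruction or cadaveric data.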

  20. Quantification of HTLV-1 Clonality and TCR Diversity

    PubMed Central

    Laydon, Daniel J.; Melamed, Anat; Sim, Aaron; Gillet, Nicolas A.; Sim, Kathleen; Darko, Sam; Kroll, J. Simon; Douek, Daniel C.; Price, David A.; Bangham, Charles R. M.; Asquith, Becca

    2014-01-01

    Estimation of immunological and microbiological diversity is vital to our understanding of infection and the immune response. For instance, what is the diversity of the T cell repertoire? Such questions are partially addressed by high-throughput sequencing techniques that enable identification of immunological and microbiological “species” in a sample. Estimators of the number of unseen species are needed to estimate population diversity from sample diversity. Here we test five widely used non-parametric estimators, and develop and validate a novel method, DivE, to estimate species richness and distribution. We used three independent datasets: (i) viral populations from subjects infected with human T-lymphotropic virus type 1; (ii) T cell antigen receptor clonotype repertoires; and (iii) microbial data from infant faecal samples. When applied to datasets with rarefaction curves that did not plateau, existing estimators systematically increased with sample size. In contrast, DivE consistently and accurately estimated diversity for all datasets. We identify conditions that limit the application of DivE. We also show that DivE can be used to accurately estimate the underlying population frequency distribution. We have developed a novel method that is significantly more accurate than commonly used biodiversity estimators in microbiological and immunological populations. PMID:24945836
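
    One of the classical non-parametric richness estimators evaluated in this line of work, Chao1, is compact enough to sketch (this is not DivE, whose fitting procedure is not reproduced here; the abundances are toy data):

```python
def chao1(abundances):
    # Chao1 lower-bound estimate of species richness:
    # S_obs + f1^2 / (2 * f2), where f1 and f2 are the numbers of species
    # observed exactly once and exactly twice (bias-corrected form if f2 == 0).
    s_obs = len(abundances)
    f1 = sum(1 for c in abundances if c == 1)
    f2 = sum(1 for c in abundances if c == 2)
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2

# Hypothetical clone abundances: many singletons imply many unseen species.
print(chao1([1, 1, 1, 1, 2, 2, 3, 5, 8]))  # 9 + 16/4 = 13.0
```

    Because the correction depends only on singleton and doubleton counts, such estimators keep growing as deeper sampling reveals new rare species, which is the sample-size dependence the abstract describes for non-plateauing rarefaction curves.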

  1. Personalized Modeling for Prediction with Decision-Path Models

    PubMed Central

    Visweswaran, Shyam; Ferreira, Antonio; Ribeiro, Guilherme A.; Oliveira, Alexandre C.; Cooper, Gregory F.

    2015-01-01

    Deriving predictive models in medicine typically relies on a population approach, in which a single model is developed from a dataset of individuals. In this paper we describe and evaluate a personalized approach in which we construct a new type of decision tree model, called a decision-path model, that takes advantage of the particular features of a given person of interest. We introduce three personalized methods that derive personalized decision-path models. We compared the performance of these methods to that of Classification And Regression Trees (CART), a population decision-tree method, in predicting seven different outcomes in five medical datasets. Two of the three personalized methods performed statistically significantly better on area under the ROC curve (AUC) and Brier skill score compared to CART. The personalized approach of learning decision-path models is a new approach for predictive modeling that can perform better than a population approach. PMID:26098570

  2. A Markovian Entropy Measure for the Analysis of Calcium Activity Time Series.

    PubMed

    Marken, John P; Halleran, Andrew D; Rahman, Atiqur; Odorizzi, Laura; LeFew, Michael C; Golino, Caroline A; Kemper, Peter; Saha, Margaret S

    2016-01-01

    Methods to analyze the dynamics of calcium activity often rely on visually distinguishable features in time series data such as spikes, waves, or oscillations. However, systems such as the developing nervous system display a complex, irregular type of calcium activity which makes the use of such methods less appropriate. Instead, for such systems there exists a class of methods (including information theoretic, power spectral, and fractal analysis approaches) which use more fundamental properties of the time series to analyze the observed calcium dynamics. We present a new analysis method in this class, the Markovian Entropy measure: an easily implementable calcium time series analysis method that represents the observed calcium activity as a realization of a Markov process and describes its dynamics in terms of the level of predictability underlying the transitions between the states of the process. We applied our method and other commonly used calcium analysis methods to a dataset from Xenopus laevis neural progenitors, which displays irregular calcium activity, and a dataset from murine synaptic neurons, which displays activity time series that are well described by visually distinguishable features. We find that the Markovian Entropy measure is able to distinguish between biologically distinct populations in both datasets, and that it separates biologically distinct populations to a greater extent than other methods in the dataset exhibiting irregular calcium activity. These results support the benefit of using the Markovian Entropy measure to analyze calcium dynamics, particularly for studies using time series data that do not exhibit easily distinguishable features.
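
    The entropy-rate idea can be sketched as follows (an illustration of the general approach, not the authors' exact implementation; the bin count and test signals are arbitrary choices):

```python
import numpy as np

def markovian_entropy(series, n_states=4):
    # Discretize amplitudes into equal-width bins -> sequence of states.
    edges = np.linspace(series.min(), series.max(), n_states + 1)[1:-1]
    states = np.digitize(series, edges)
    # Estimate the transition matrix from consecutive state pairs.
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    P = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    pi = counts.sum(axis=1) / counts.sum()  # empirical state occupancy
    logP = np.zeros_like(P)
    np.log2(P, out=logP, where=P > 0)
    # Entropy rate H = -sum_i pi_i sum_j P_ij log2(P_ij), in bits.
    return float(-np.sum(pi[:, None] * P * logP))

rng = np.random.default_rng(0)
noise = rng.normal(size=2000)                       # irregular activity
regular = np.sin(np.linspace(0, 40 * np.pi, 2000))  # predictable oscillation
print(markovian_entropy(noise), markovian_entropy(regular))
```

    A highly predictable signal yields near-deterministic transitions and a low entropy rate, while irregular activity yields a higher one, which is the distinction the measure exploits.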

  3. Using iterative cluster merging with improved gap statistics to perform online phenotype discovery in the context of high-throughput RNAi screens

    PubMed Central

    Yin, Zheng; Zhou, Xiaobo; Bakal, Chris; Li, Fuhai; Sun, Youxian; Perrimon, Norbert; Wong, Stephen TC

    2008-01-01

    Background The recent emergence of high-throughput automated image acquisition technologies has forever changed how cell biologists collect and analyze data. Historically, the interpretation of cellular phenotypes in different experimental conditions has been dependent upon the expert opinions of well-trained biologists. Such qualitative analysis is particularly effective in detecting subtle, but important, deviations in phenotypes. However, while the rapid and continuing development of automated microscope-based technologies now facilitates the acquisition of trillions of cells in thousands of diverse experimental conditions, such as in the context of RNA interference (RNAi) or small-molecule screens, the massive size of these datasets precludes human analysis. Thus, the development of automated methods which aim to identify novel and biologically relevant phenotypes online is one of the major challenges in high-throughput image-based screening. Ideally, phenotype discovery methods should be designed to utilize prior/existing information and tackle three challenging tasks, i.e. restoring pre-defined biologically meaningful phenotypes, differentiating novel phenotypes from known ones and clarifying novel phenotypes from each other. Arbitrarily extracted information causes biased analysis, while combining the complete existing datasets with each new image is intractable in high-throughput screens. Results Here we present the design and implementation of a novel and robust online phenotype discovery method with broad applicability that can be used in diverse experimental contexts, especially high-throughput RNAi screens. This method features phenotype modelling and iterative cluster merging using improved gap statistics. A Gaussian Mixture Model (GMM) is employed to estimate the distribution of each existing phenotype, and then used as reference distribution in gap statistics.
This method is broadly applicable to a number of different types of image-based datasets derived from a wide spectrum of experimental conditions and is suitable for adaptively processing new images which are continuously added to existing datasets. Validations were carried out on different datasets, including a published RNAi screen using Drosophila embryos [Additional files 1, 2], a dataset for cell cycle phase identification using HeLa cells [Additional files 1, 3, 4] and a synthetic dataset using polygons; our method tackled the three aforementioned tasks effectively with an accuracy range of 85%–90%. When our method is implemented in the context of a Drosophila genome-scale RNAi image-based screen of cultured cells aimed at identifying the contribution of individual genes towards the regulation of cell shape, it efficiently discovers meaningful new phenotypes and provides novel biological insight. We also propose a two-step procedure to modify the novelty detection method based on one-class SVM, so that it can be used for online phenotype discovery. Under different conditions, we compared the SVM-based method with our method using various datasets, and our method consistently outperformed the SVM-based method in at least two of three tasks by 2% to 5%. These results demonstrate that our method can be used to better identify novel phenotypes in image-based datasets from a wide range of conditions and organisms. Conclusion We demonstrate that our method can detect various novel phenotypes effectively in complex datasets. Experimental results also validate that our method performs consistently under different orders of image input, variations of starting conditions including the number and composition of existing phenotypes, and datasets from different screens. In our findings, the proposed method is suitable for online phenotype discovery in diverse high-throughput image-based genetic and chemical screens. PMID:18534020

  4. Automatic labeling of MR brain images through extensible learning and atlas forests.

    PubMed

    Xu, Lijun; Liu, Hong; Song, Enmin; Yan, Meng; Jin, Renchao; Hung, Chih-Cheng

    2017-12-01

    The multiatlas-based method is extensively used in MR brain image segmentation because of its simplicity and robustness. It provides excellent accuracy, although it is time consuming and limited in its ability to incorporate new atlases. In this study, an automatic labeling of MR brain images through extensible learning and atlas forests is presented to address these limitations. We propose an extensible learning model which makes the multiatlas-based framework capable of managing datasets with numerous atlases, or dynamic atlas datasets, while ensuring the accuracy of automatic labeling. Two new strategies are used to reduce the time and space complexity and improve the efficiency of the automatic labeling of brain MR images. First, atlases are encoded into atlas forests through random forest technology to reduce the time consumed by cross-registration between atlases and the target image, and a scatter spatial vector is designed to eliminate errors caused by inaccurate registration. Second, an atlas selection method based on the extensible learning model is used to select atlases for the target image without traversing the entire dataset and thereby obtain accurate labeling. The labeling results of the proposed method were evaluated on three public datasets, namely, IBSR, LONI LPBA40, and ADNI. With the proposed method, the Dice coefficient values on the three datasets were 84.17 ± 4.61%, 83.25 ± 4.29%, and 81.88 ± 4.53%, approximately 5% higher than those of the conventional method. The efficiency of the extensible learning model was evaluated against state-of-the-art methods for the labeling of MR brain images. Experimental results showed that the proposed method could achieve accurate labeling of MR brain images without traversing the entire datasets. In the proposed multiatlas-based method, extensible learning and atlas forests were applied to manage the automatic labeling of brain anatomies on large or dynamic atlas datasets and to obtain accurate results. © 2017 American Association of Physicists in Medicine.
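
    The Dice coefficient used to score the labelings is straightforward to compute (a generic implementation with toy masks, not code from the study):

```python
import numpy as np

def dice_coefficient(seg_a, seg_b):
    # Dice = 2|A ∩ B| / (|A| + |B|) for two binary label masks.
    a = np.asarray(seg_a, dtype=bool)
    b = np.asarray(seg_b, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Toy 2 x 3 masks standing in for one labeled brain structure.
auto = np.array([[0, 1, 1], [0, 1, 0]])
manual = np.array([[0, 1, 0], [1, 1, 0]])
print(dice_coefficient(auto, manual))  # 2*2 / (3+3) ≈ 0.667
```

    A value of 1 means perfect overlap with the reference labeling, so the ~0.82-0.84 scores above indicate substantial but imperfect agreement per structure.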

  5. From conservation genetics to conservation genomics: a genome-wide assessment of blue whales (Balaenoptera musculus) in Australian feeding aggregations

    PubMed Central

    Sandoval-Castillo, Jonathan; Jenner, K. Curt S.; Gill, Peter C.; Jenner, Micheline-Nicole M.; Morrice, Margaret G.

    2018-01-01

    Genetic datasets of tens of markers have been superseded by genome-wide datasets of thousands of markers made possible by next-generation sequencing technology. Genomic datasets improve our power to detect low levels of population structure and to identify adaptive divergence. The increased population-level knowledge can inform the conservation management of endangered species, such as the blue whale (Balaenoptera musculus). In Australia, there are two known feeding aggregations of the pygmy blue whale (B. m. brevicauda), which have shown no evidence of genetic structure based on a small dataset of 10 microsatellites and mtDNA. Here, we develop and implement a high-resolution dataset of 8294 genome-wide filtered single nucleotide polymorphisms, the first of its kind for blue whales. We use these data to assess whether the Australian feeding aggregations constitute one population and to test for the first time whether there is adaptive divergence between the feeding aggregations. We found no evidence of neutral population structure and negligible evidence of adaptive divergence. We propose that individuals likely travel widely between feeding areas and to breeding areas, which would require them to be adapted to a wide range of environmental conditions. This has important implications for their conservation, as this blue whale population is likely vulnerable to a range of anthropogenic threats both off Australia and elsewhere. PMID:29410806

  6. Efficient segmentation of 3D fluoroscopic datasets from mobile C-arm

    NASA Astrophysics Data System (ADS)

    Styner, Martin A.; Talib, Haydar; Singh, Digvijay; Nolte, Lutz-Peter

    2004-05-01

    The emerging mobile fluoroscopic 3D technology linked with a navigation system combines the advantages of CT-based and C-arm-based navigation. The intra-operative, automatic segmentation of 3D fluoroscopy datasets enables the combined visualization of surgical instruments and anatomical structures for enhanced planning, surgical eye-navigation and landmark digitization. We performed a thorough evaluation of several segmentation algorithms using a large set of data from different anatomical regions and man-made phantom objects. The analyzed segmentation methods include automatic thresholding, morphological operations, an adapted region growing method and an implicit 3D geodesic snake method. In regard to computational efficiency, all methods performed within acceptable limits on a standard desktop PC (30 s to 5 min). In general, the best results were obtained with datasets from long bones, followed by extremities. The segmentations of spine, pelvis and shoulder datasets were generally of poorer quality. As expected, the threshold-based methods produced the worst results. The combined thresholding and morphological operations methods were considered appropriate for a smaller set of clean images. The region growing method generally performed much better in regard to computational efficiency and segmentation correctness, especially for datasets of joints, and lumbar and cervical spine regions. The less efficient implicit snake method was additionally able to remove wrongly segmented skin tissue regions. This study presents a step towards efficient intra-operative segmentation of 3D fluoroscopy datasets, but there is room for improvement. Next, we plan to study model-based approaches for datasets from the knee and hip joint region, which would then be applied to all anatomical regions in our continuing development of an ideal segmentation procedure for 3D fluoroscopic images.

  7. Mapping for maternal and newborn health: the distributions of women of childbearing age, pregnancies and births.

    PubMed

    Tatem, Andrew J; Campbell, James; Guerra-Arias, Maria; de Bernis, Luc; Moran, Allisyn; Matthews, Zoë

    2014-01-04

    The health and survival of women and their newborn babies in low income countries has been a key priority in public health since the 1990s. However, basic planning data, such as numbers of pregnancies and births, remain difficult to obtain and information is also lacking on geographic access to key services, such as facilities with skilled health workers. For maternal and newborn health and survival, planning for safer births and healthier newborns could be improved by more accurate estimations of the distributions of women of childbearing age. Moreover, subnational estimates of projected future numbers of pregnancies are needed for more effective strategies on human resources and infrastructure, while there is a need to link information on pregnancies to better information on health facilities in districts and regions so that coverage of services can be assessed. This paper outlines demographic mapping methods based on freely available data for the production of high resolution datasets depicting estimates of numbers of people, women of childbearing age, live births and pregnancies, and distribution of comprehensive emergency obstetric and newborn care (EmONC) facilities in four large high burden countries: Afghanistan, Bangladesh, Ethiopia and Tanzania. Satellite derived maps of settlements and land cover were constructed and used to redistribute areal census counts to produce detailed maps of the distributions of women of childbearing age. Household survey data, UN statistics and other sources on growth rates, age specific fertility rates, live births, stillbirths and abortions were then integrated to convert the population distribution datasets to gridded estimates of births and pregnancies. These estimates, which can be produced for current, past or future years based on standard demographic projections, can provide the basis for strategic intelligence, planning services, and provide denominators for subnational indicators to track progress.
The datasets produced are part of national midwifery workforce assessments conducted in collaboration with the respective Ministries of Health and the United Nations Population Fund (UNFPA) to identify disparities between population needs, health infrastructure and workforce supply. The datasets are available to the respective Ministries as part of the UNFPA programme to inform midwifery workforce planning and also publicly available through the WorldPop population mapping project.

  8. Molecular diversification of Trichuris spp. from Sigmodontinae (Cricetidae) rodents from Argentina based on mitochondrial DNA sequences.

    PubMed

    Callejón, Rocío; Robles, María Del Rosario; Panei, Carlos Javier; Cutillas, Cristina

    2016-08-01

    A molecular phylogenetic hypothesis is presented for the genus Trichuris based on sequence data from mitochondrial cytochrome c oxidase 1 (cox1) and cytochrome b (cob). The taxa consisted of nine populations of whipworm from five species of Sigmodontinae rodents from Argentina. Bayesian Inference, Maximum Parsimony, and Maximum Likelihood methods were used to infer phylogenies for each gene separately, as well as for the combined mitochondrial data and the combined mitochondrial and nuclear dataset. Phylogenetic results based on cox1 and cob mitochondrial DNA (mtDNA) revealed three strongly resolved clades corresponding to three different species (Trichuris navonae, Trichuris bainae, and Trichuris pardinasi) showing phylogeographic variation, but relationships among Trichuris species were poorly resolved. Phylogenetic reconstruction based on concatenated sequences had greater resolution for delimiting Trichuris species and intra-specific populations than reconstructions based on partitioned genes. Thus, populations of T. bainae and T. pardinasi could be affected by geographical factors and parasite-host co-divergence.

  9. Simulating a base population in honey bee for molecular genetic studies

    PubMed Central

    2012-01-01

    Background Over the past years, reports have indicated that honey bee populations are declining and that infestation by an ecto-parasitic mite (Varroa destructor) is one of the main causes. Selective breeding of resistant bees can help to prevent losses due to the parasite, but it requires that a robust breeding program and genetic evaluation are implemented. Genomic selection has emerged as an important tool in animal breeding programs and simulation studies have shown that it yields more accurate breeding value estimates, higher genetic gain and low rates of inbreeding. Since genomic selection relies on marker data, simulations conducted on a genomic dataset are a pre-requisite before selection can be implemented. Although genomic datasets have been simulated in other species undergoing genetic evaluation, simulation of a genomic dataset specific to the honey bee is required since this species has a distinct genetic and reproductive biology. Our software program was aimed at constructing a base population by simulating a random mating honey bee population. A forward-time population simulation approach was applied since it allows modeling of genetic characteristics and reproductive behavior specific to the honey bee. Results Our software program yielded a genomic dataset for a base population in linkage disequilibrium. In addition, information was obtained on (1) the position of markers on each chromosome, (2) allele frequency, (3) χ² statistics for Hardy-Weinberg equilibrium, (4) a sorted list of markers with a minor allele frequency less than or equal to the input value, (5) average r² values of linkage disequilibrium between all simulated marker locus pairs for all generations and (6) the average r² value of linkage disequilibrium in the last generation for selected markers with the highest minor allele frequency.
Conclusion We developed a software program that takes into account the genetic and reproductive biology specific to the honey bee and that can be used to constitute a genomic dataset compatible with the simulation studies necessary to optimize breeding programs. The source code together with an instruction file is freely accessible at http://msproteomics.org/Research/Misc/honeybeepopulationsimulator.html PMID:22520469

  10. Simulating a base population in honey bee for molecular genetic studies.

    PubMed

    Gupta, Pooja; Conrad, Tim; Spötter, Andreas; Reinsch, Norbert; Bienefeld, Kaspar

    2012-06-27

    Over the past years, reports have indicated that honey bee populations are declining and that infestation by an ecto-parasitic mite (Varroa destructor) is one of the main causes. Selective breeding of resistant bees can help to prevent losses due to the parasite, but it requires that a robust breeding program and genetic evaluation are implemented. Genomic selection has emerged as an important tool in animal breeding programs and simulation studies have shown that it yields more accurate breeding value estimates, higher genetic gain and low rates of inbreeding. Since genomic selection relies on marker data, simulations conducted on a genomic dataset are a pre-requisite before selection can be implemented. Although genomic datasets have been simulated in other species undergoing genetic evaluation, simulation of a genomic dataset specific to the honey bee is required since this species has a distinct genetic and reproductive biology. Our software program was aimed at constructing a base population by simulating a random mating honey bee population. A forward-time population simulation approach was applied since it allows modeling of genetic characteristics and reproductive behavior specific to the honey bee. Our software program yielded a genomic dataset for a base population in linkage disequilibrium. In addition, information was obtained on (1) the position of markers on each chromosome, (2) allele frequency, (3) χ² statistics for Hardy-Weinberg equilibrium, (4) a sorted list of markers with a minor allele frequency less than or equal to the input value, (5) average r² values of linkage disequilibrium between all simulated marker locus pairs for all generations and (6) the average r² value of linkage disequilibrium in the last generation for selected markers with the highest minor allele frequency.
We developed a software program that takes into account the genetic and reproductive biology specific to the honey bee and that can be used to constitute a genomic dataset compatible with the simulation studies necessary to optimize breeding programs. The source code together with an instruction file is freely accessible at http://msproteomics.org/Research/Misc/honeybeepopulationsimulator.html.
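
As an aside, the linkage disequilibrium statistic reported in points (5) and (6) can be sketched in a few lines. This is an illustrative computation of r² between two biallelic loci with 0/1-coded alleles, not the simulator's own code, and the haplotypes below are made up:

```python
def ld_r2(hap_a, hap_b):
    """Squared allelic correlation r^2 between two biallelic loci.

    hap_a, hap_b: lists of 0/1 allele states, one entry per haplotype.
    r^2 = D^2 / (pA (1-pA) pB (1-pB)), with D = pAB - pA * pB.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n                      # frequency of allele 1 at locus A
    p_b = sum(hap_b) / n                      # frequency of allele 1 at locus B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # joint haplotype freq.
    d = p_ab - p_a * p_b                      # disequilibrium coefficient
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Perfectly linked loci give r^2 = 1; loci in equilibrium give r^2 = 0.
linked = [0, 1, 1, 0, 1, 0, 0, 1]
print(ld_r2(linked, linked))                  # 1.0
print(ld_r2(linked, [0, 1, 0, 1, 0, 1, 0, 1]))  # 0.0
```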

  11. A photogrammetric technique for generation of an accurate multispectral optical flow dataset

    NASA Astrophysics Data System (ADS)

    Kniaz, V. V.

    2017-06-01

    The presence of an accurate dataset is the key requirement for the successful development of an optical flow estimation algorithm. A large number of freely available optical flow datasets were developed in recent years and gave rise to many powerful algorithms. However, most of these datasets include only images captured in the visible spectrum. This paper is focused on the creation of a multispectral optical flow dataset with an accurate ground truth. The generation of an accurate ground truth optical flow is a rather complex problem, as no device for error-free optical flow measurement has been developed to date. Existing methods for ground truth optical flow estimation are based on hidden textures, 3D modelling or laser scanning. Such techniques either work only with synthetic optical flow or provide only a sparse ground truth optical flow. In this paper a new photogrammetric method for generation of an accurate ground truth optical flow is proposed. The method combines the accuracy and density of synthetic optical flow datasets with the flexibility of laser scanning based techniques. A multispectral dataset including various image sequences was generated using the developed method. The dataset is freely available on the accompanying web site.

  12. Dental age assessment of southern Chinese using the United Kingdom Caucasian reference dataset.

    PubMed

    Jayaraman, Jayakumar; Roberts, Graham J; King, Nigel M; Wong, Hai Ming

    2012-03-10

    Dental age assessment is one of the most accurate methods for estimating the age of an unknown person. Demirjian's dataset on a French-Canadian population has been widely tested for its applicability to various ethnic groups, including southern Chinese. Following inaccurate results from these studies, investigators are now confronted with using alternative datasets for comparison. Testing the applicability of other reliable datasets that yield accurate findings might limit the need to develop population-specific standards. Recently, a Reference Data Set (RDS) similar to Demirjian's was prepared in the United Kingdom (UK) and has been subsequently validated. The advantages of the UK Caucasian RDS include coverage of both the maxillary and mandibular dentitions, evaluation of subjects across a wide age range, and the possibility of precise age estimation with the mathematical technique of meta-analysis. The aim of this study was to evaluate the applicability of the United Kingdom Caucasian RDS to southern Chinese subjects. Dental panoramic tomographs (DPT) of 266 subjects (133 males and 133 females) aged 2-21 years that were previously taken for clinical diagnostic purposes were selected and scored by a single calibrated examiner based on Demirjian's classification of tooth developmental stages (A-H). The ages corresponding to each tooth developmental stage were obtained from the UK dataset. Intra-examiner reproducibility was tested and the Cohen kappa (0.88) showed that the level of agreement was 'almost perfect'. The estimated dental age was then compared with the chronological age using a paired t-test, with statistical significance set at p<0.01. The results showed that the UK dataset underestimated the age of southern Chinese subjects by 0.24 years, but the difference was not statistically significant.
In conclusion, the UK Caucasian RDS may not be suitable for estimating the age of southern Chinese subjects and there is a need for an ethnic specific reference dataset for southern Chinese. Copyright © 2011. Published by Elsevier Ireland Ltd.
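
The paired comparison of estimated dental age with chronological age described above can be sketched as follows; `paired_t` is a hypothetical helper and the ages are invented for illustration, not the study's data or analysis code:

```python
import math
from statistics import mean, stdev

def paired_t(chronological, dental):
    """Paired t statistic for per-subject differences (dental - chronological).

    t = mean(d) / (sd(d) / sqrt(n)), computed on the paired differences d.
    """
    diffs = [d - c for c, d in zip(chronological, dental)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Invented ages (years) with a slight systematic underestimate,
# mirroring the direction of the reported result.
chron = [10.0, 12.0, 8.0, 15.0, 11.0, 13.0]
dental = [9.8, 11.5, 7.9, 14.8, 10.7, 12.8]
t = paired_t(chron, dental)
print(t < 0)  # negative t: dental age underestimates chronological age
```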

  13. Comparison of Bayesian clustering and edge detection methods for inferring boundaries in landscape genetics

    USGS Publications Warehouse

    Safner, T.; Miller, M.P.; McRae, B.H.; Fortin, M.-J.; Manel, S.

    2011-01-01

    Recently, techniques available for identifying clusters of individuals or boundaries between clusters using genetic data from natural populations have expanded rapidly. Consequently, there is a need to evaluate these different techniques. We used spatially-explicit simulation models to compare three spatial Bayesian clustering programs and two edge detection methods. Spatially-structured populations were simulated where a continuous population was subdivided by barriers. We evaluated the ability of each method to correctly identify boundary locations while varying: (i) time after divergence, (ii) strength of isolation by distance, (iii) level of genetic diversity, and (iv) amount of gene flow across barriers. To further evaluate the methods' effectiveness at detecting genetic clusters in natural populations, we used previously published data on North American pumas and a European shrub. Our results show that with simulated and empirical data, the Bayesian spatial clustering algorithms outperformed direct edge detection methods. All methods incorrectly detected boundaries in the presence of strong patterns of isolation by distance. Based on this finding, we support the application of Bayesian spatial clustering algorithms for boundary detection in empirical datasets, with necessary tests for the influence of isolation by distance. © 2011 by the authors; licensee MDPI, Basel, Switzerland.

  14. A Markovian Entropy Measure for the Analysis of Calcium Activity Time Series

    PubMed Central

    Rahman, Atiqur; Odorizzi, Laura; LeFew, Michael C.; Golino, Caroline A.; Kemper, Peter; Saha, Margaret S.

    2016-01-01

    Methods to analyze the dynamics of calcium activity often rely on visually distinguishable features in time series data such as spikes, waves, or oscillations. However, systems such as the developing nervous system display a complex, irregular type of calcium activity which makes the use of such methods less appropriate. Instead, for such systems there exists a class of methods (including information theoretic, power spectral, and fractal analysis approaches) which use more fundamental properties of the time series to analyze the observed calcium dynamics. We present a new analysis method in this class, the Markovian Entropy measure: an easily implemented approach that represents the observed calcium activity as a realization of a Markov process and describes its dynamics in terms of the level of predictability underlying the transitions between the states of the process. We applied our method, alongside other commonly used calcium analysis methods, to a dataset from Xenopus laevis neural progenitors which displays irregular calcium activity and a dataset from murine synaptic neurons which displays activity time series that are well-described by visually-distinguishable features. We find that the Markovian Entropy measure is able to distinguish between biologically distinct populations in both datasets, and that it can separate biologically distinct populations to a greater extent than other methods in the dataset exhibiting irregular calcium activity. These results support the benefit of using the Markovian Entropy measure to analyze calcium dynamics, particularly for studies using time series data which do not exhibit easily distinguishable features. PMID:27977764
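
The core idea of describing dynamics by the predictability of state transitions can be sketched as below. This is a minimal illustration of a first-order Markov entropy rate on a discretized signal, not the authors' implementation; the bin count and test signals are arbitrary choices:

```python
import numpy as np

def markov_entropy(series, n_states=4):
    """Entropy rate (bits) of a first-order Markov chain fitted to a
    discretized time series: H = -sum_i pi_i sum_j P_ij log2 P_ij."""
    # Discretize the signal into equal-width amplitude bins (states).
    edges = np.linspace(series.min(), series.max(), n_states + 1)
    states = np.clip(np.digitize(series, edges) - 1, 0, n_states - 1)
    # Count transitions between consecutive states.
    counts = np.zeros((n_states, n_states))
    for s, t in zip(states[:-1], states[1:]):
        counts[s, t] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    pi = row_sums.ravel() / row_sums.sum()        # state occupancy weights
    P = np.divide(counts, row_sums,
                  out=np.zeros_like(counts), where=row_sums > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        logP = np.where(P > 0, np.log2(P), 0.0)
    return float(-np.sum(pi[:, None] * P * logP))

rng = np.random.default_rng(0)
noisy = rng.normal(size=2000)                       # irregular activity
regular = np.sin(np.linspace(0, 40 * np.pi, 2000))  # smooth oscillation
# Unpredictable transitions yield higher entropy than a smooth oscillation.
print(markov_entropy(noisy) > markov_entropy(regular))  # True
```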

  15. Tensor Analysis Reveals Distinct Population Structure that Parallels the Different Computational Roles of Areas M1 and V1

    PubMed Central

    Ryu, Stephen I.; Shenoy, Krishna V.; Cunningham, John P.; Churchland, Mark M.

    2016-01-01

    Cortical firing rates frequently display elaborate and heterogeneous temporal structure. One often wishes to compute quantitative summaries of such structure—a basic example is the frequency spectrum—and compare with model-based predictions. The advent of large-scale population recordings affords the opportunity to do so in new ways, with the hope of distinguishing between potential explanations for why responses vary with time. We introduce a method that assesses a basic but previously unexplored form of population-level structure: when data contain responses across multiple neurons, conditions, and times, they are naturally expressed as a third-order tensor. We examined tensor structure for multiple datasets from primary visual cortex (V1) and primary motor cortex (M1). All V1 datasets were ‘simplest’ (there were relatively few degrees of freedom) along the neuron mode, while all M1 datasets were simplest along the condition mode. These differences could not be inferred from surface-level response features. Formal considerations suggest why tensor structure might differ across modes. For idealized linear models, structure is simplest across the neuron mode when responses reflect external variables, and simplest across the condition mode when responses reflect population dynamics. This same pattern was present for existing models that seek to explain motor cortex responses. Critically, only dynamical models displayed tensor structure that agreed with the empirical M1 data. These results illustrate that tensor structure is a basic feature of the data. For M1 the tensor structure was compatible with only a subset of existing models. PMID:27814353

  16. Data-driven decision support for radiologists: re-using the National Lung Screening Trial dataset for pulmonary nodule management.

    PubMed

    Morrison, James J; Hostetter, Jason; Wang, Kenneth; Siegel, Eliot L

    2015-02-01

    Real-time mining of large research trial datasets enables development of case-based clinical decision support tools. Several applicable research datasets exist including the National Lung Screening Trial (NLST), a dataset unparalleled in size and scope for studying population-based lung cancer screening. Using these data, a clinical decision support tool was developed which matches patient demographics and lung nodule characteristics to a cohort of similar patients. The NLST dataset was converted into Structured Query Language (SQL) tables hosted on a web server, and a web-based JavaScript application was developed which performs real-time queries. JavaScript is used for both the server-side and client-side language, allowing for rapid development of a robust client interface and server-side data layer. Real-time data mining of user-specified patient cohorts achieved a rapid return of cohort cancer statistics and lung nodule distribution information. This system demonstrates the potential of individualized real-time data mining using large high-quality clinical trial datasets to drive evidence-based clinical decision-making.
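
The cohort-matching query pattern described above can be illustrated with Python's built-in sqlite3 rather than the JavaScript stack the authors used; the schema, column names, tolerances, and rows below are hypothetical stand-ins, not the actual NLST tables:

```python
import sqlite3

# Hypothetical schema and synthetic rows -- not the actual NLST data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE participants (
    age INTEGER, pack_years INTEGER, nodule_mm REAL, cancer INTEGER)""")
conn.executemany("INSERT INTO participants VALUES (?, ?, ?, ?)", [
    (62, 40, 8.0, 1), (65, 35, 6.5, 0), (61, 45, 7.2, 0),
    (70, 50, 12.0, 1), (55, 30, 4.0, 0), (63, 38, 7.8, 1),
])

def cohort_stats(age, pack_years, nodule_mm, age_tol=5, py_tol=10, mm_tol=2.0):
    """Return (cancer rate, cohort size) for patients similar to the query case."""
    rate, n = conn.execute(
        """SELECT AVG(cancer), COUNT(*) FROM participants
           WHERE ABS(age - ?) <= ? AND ABS(pack_years - ?) <= ?
             AND ABS(nodule_mm - ?) <= ?""",
        (age, age_tol, pack_years, py_tol, nodule_mm, mm_tol)).fetchone()
    return rate, n

rate, n = cohort_stats(age=64, pack_years=40, nodule_mm=7.5)
print(n, rate)  # 4 similar patients, half with a cancer outcome
```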

  17. Spatially Explicit Models to Investigate Geographic Patterns in the Distribution of Forensic STRs: Application to the North-Eastern Mediterranean.

    PubMed

    Messina, Francesco; Finocchio, Andrea; Akar, Nejat; Loutradis, Aphrodite; Michalodimitrakis, Emmanuel I; Brdicka, Radim; Jodice, Carla; Novelletto, Andrea

    2016-01-01

    Human forensic STRs used for individual identification have been reported to have little power for inter-population analyses. Several methods have been developed which incorporate information on the spatial distribution of individuals to arrive at a description of the arrangement of diversity. We genotyped at 16 forensic STRs a large population sample obtained from many locations in Italy, Greece and Turkey, i.e. three countries crucial to the understanding of discontinuities at the European/Asian junction and the genetic legacy of ancient migrations, but seldom represented together in previous studies. Using spatial PCA on the full dataset, we detected patterns of population affinities in the area. Additionally, we devised objective criteria to reduce the overall complexity into reduced datasets. Independent spatially explicit methods applied to these latter datasets converged in showing that the extraction of information on long- to medium-range geographical trends and structuring from the overall diversity is possible. All analyses returned the picture of a background clinal variation, with regional discontinuities captured by each of the reduced datasets. Several aspects of our results are confirmed on external STR datasets and replicate those of genome-wide SNP typings. High levels of gene flow were inferred within the main continental areas by coalescent simulations. These results are promising from a microevolutionary perspective, in view of the fast pace at which forensic data are being accumulated for many locales. It is foreseeable that this will allow the exploitation of an invaluable genotypic resource, assembled for other (forensic) purposes, to clarify important aspects in the formation of local gene pools.

  18. Population entropies estimates of proteins

    NASA Astrophysics Data System (ADS)

    Low, Wai Yee

    2017-05-01

    The Shannon entropy equation provides a way to estimate the variability of amino acid sequences in a multiple sequence alignment of proteins. Knowledge of protein variability is useful in many areas such as vaccine design, identification of antibody binding sites, and exploration of protein 3D structural properties. In cases where the population entropies of a protein are of interest but only a small sample size can be obtained, a method based on linear regression and random subsampling can be used to estimate the population entropy. This method is useful for comparisons of entropies where the actual sequence counts differ and thus correction for alignment size bias is needed. In the current work, an R-based package named EntropyCorrect that enables estimation of population entropy is presented, and an empirical study of how well this new algorithm performs on simulated datasets with various combinations of population and sample sizes is discussed. The package is available at https://github.com/lloydlow/EntropyCorrect. This article, which was originally published online on 12 May 2017, contained an error in Eq. (1), where the summation sign was missing. The corrected equation appears in the Corrigendum attached to the pdf.
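
The column-wise Shannon entropy that the package corrects can be sketched directly from its definition. This is the plain (uncorrected) estimator, not the EntropyCorrect regression adjustment; the example columns are invented:

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy (bits) of one alignment column: H = -sum_i p_i log2 p_i."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("AAAAAAAA"))  # 0 bits: perfectly conserved column
print(shannon_entropy("ACDEACDE"))  # 2 bits: four equally frequent residues
```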

  19. Evaluation of Deep Learning Based Stereo Matching Methods: from Ground to Aerial Images

    NASA Astrophysics Data System (ADS)

    Liu, J.; Ji, S.; Zhang, C.; Qin, Z.

    2018-05-01

    Dense stereo matching has been extensively studied in photogrammetry and computer vision. In this paper we evaluate the application of deep learning based stereo methods, which emerged in 2016 and spread rapidly, to aerial stereo pairs rather than the ground images commonly used in the computer vision community. Two popular methods are evaluated. One learns the matching cost with a convolutional neural network (known as MC-CNN); the other produces a disparity map in an end-to-end manner by utilizing both geometry and context (known as GC-Net). First, we evaluate the performance of the deep learning based methods on aerial stereo images by direct model reuse. Models pre-trained separately on the KITTI 2012, KITTI 2015 and Driving datasets are applied directly to three aerial datasets. We also give the results of direct training on the target aerial datasets. Second, the deep learning based methods are compared to the classic stereo matching method, Semi-Global Matching (SGM), and a photogrammetric software package, SURE, on the same aerial datasets. Third, a transfer learning strategy is introduced to aerial image matching, under the assumption that a few target samples are available for model fine-tuning. Experiments showed that the conventional methods and the deep learning based methods performed similarly, with the latter having greater potential to be explored.

  20. ANTONIA perfusion and stroke. A software tool for the multi-purpose analysis of MR perfusion-weighted datasets and quantitative ischemic stroke assessment.

    PubMed

    Forkert, N D; Cheng, B; Kemmling, A; Thomalla, G; Fiehler, J

    2014-01-01

    The objective of this work is to present the software tool ANTONIA, which has been developed to facilitate a quantitative analysis of perfusion-weighted MRI (PWI) datasets in general as well as the subsequent multi-parametric analysis of additional datasets for the specific purpose of acute ischemic stroke patient dataset evaluation. Three different methods for the analysis of DSC or DCE PWI datasets are currently implemented in ANTONIA, which can be case-specifically selected based on the study protocol. These methods comprise a curve fitting method as well as a deconvolution-based and deconvolution-free method integrating a previously defined arterial input function. The perfusion analysis is extended for the purpose of acute ischemic stroke analysis by additional methods that enable an automatic atlas-based selection of the arterial input function, an analysis of the perfusion-diffusion and DWI-FLAIR mismatch as well as segmentation-based volumetric analyses. For reliability evaluation, the described software tool was used by two observers for quantitative analysis of 15 datasets from acute ischemic stroke patients to extract the acute lesion core volume, FLAIR ratio, perfusion-diffusion mismatch volume with manually as well as automatically selected arterial input functions, and follow-up lesion volume. The results of this evaluation revealed that the described software tool leads to highly reproducible results for all parameters if the automatic arterial input function selection method is used. Due to the broad selection of processing methods that are available in the software tool, ANTONIA is especially helpful to support image-based perfusion and acute ischemic stroke research projects.

  1. HEp-2 cell image classification method based on very deep convolutional networks with small datasets

    NASA Astrophysics Data System (ADS)

    Lu, Mengchi; Gao, Long; Guo, Xifeng; Liu, Qiang; Yin, Jianping

    2017-07-01

    Classification of Human Epithelial-2 (HEp-2) cell image staining patterns has been widely used to identify autoimmune diseases via the anti-nuclear antibody (ANA) test in the Indirect Immunofluorescence (IIF) protocol. Because manual testing is time-consuming, subjective and labor intensive, image-based Computer Aided Diagnosis (CAD) systems for HEp-2 cell classification are being developed. However, recently proposed methods mostly rely on manual feature extraction and achieve low accuracy. Besides, the scale of available benchmark datasets is small, which is not well suited to deep learning methods. This issue directly affects the accuracy of cell classification, even after data augmentation. To address these issues, this paper presents a high-accuracy automatic HEp-2 cell classification method for small datasets, utilizing very deep convolutional networks (VGGNet). Specifically, the proposed method consists of three main phases, namely image preprocessing, feature extraction and classification. Moreover, an improved VGGNet is presented to address the challenges of small-scale datasets. Experimental results over two benchmark datasets demonstrate that the proposed method achieves superior performance in terms of accuracy compared with existing methods.

  2. Numeric model to predict the location of market demand and economic order quantity for retailers of supply chain

    NASA Astrophysics Data System (ADS)

    Fradinata, Edy; Marli Kesuma, Zurnila

    2018-05-01

    Polynomial and spline regression are the numeric models used here to compare method performance, model distance relationships between cement retailers in Banda Aceh, predict the market area for retailers, and determine the economic order quantity (EOQ). These numeric models differ in accuracy as measured by the mean square error (MSE). The distance relationships between retailers identify the density of retailers in the town. The dataset was collected from cement retailer sales records together with global positioning system (GPS) coordinates. The characteristics of the sales dataset were plotted to obtain the goodness of fit of quadratic, cubic, and fourth-order polynomial methods; on the real sales dataset, the polynomials use the relationship between x-abscissa and y-ordinate to obtain the models. This research yields several results: four models useful for predicting a retailer's market area under competition, a comparison of method performance, the distance relationships between retailers, and finally an inventory policy based on the economic order quantity. The high-density retailer areas coincide with population growth and construction projects. The spline performs better than the quadratic, cubic, and fourth-order polynomials, as indicated by its smaller MSE. The inventory policy uses a periodic review policy.
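
The economic order quantity mentioned above follows the classic Wilson formula, Q* = sqrt(2DS/H). The demand and cost figures below are invented for illustration, not the paper's retailer data:

```python
import math

def eoq(annual_demand, order_cost, holding_cost):
    """Classic economic order quantity: Q* = sqrt(2 D S / H),
    where D is annual demand (units), S is the fixed cost per order,
    and H is the holding cost per unit per year."""
    return math.sqrt(2 * annual_demand * order_cost / holding_cost)

# e.g. 12,000 bags of cement per year, $50 per order, $3 holding cost/bag/year
q = eoq(12000, 50, 3)
print(round(q))  # 632 bags per order
```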

  3. Using natural language processing for identification of herpes zoster ophthalmicus cases to support population-based study.

    PubMed

    Zheng, Chengyi; Luo, Yi; Mercado, Cheryl; Sy, Lina; Jacobsen, Steven J; Ackerson, Brad; Lewin, Bruno; Tseng, Hung Fu

    2018-06-19

    Diagnosis codes are inadequate for accurately identifying herpes zoster ophthalmicus (HZO). There is a significant lack of population-based studies on HZO due to the high expense of manually reviewing medical records. Our aims were to assess whether HZO can be identified from clinical notes using natural language processing (NLP) and to investigate the epidemiology of HZO within the HZ population using the developed approach. We conducted a retrospective cohort analysis of 49,914 southern California residents aged over 18 years who had a new diagnosis of HZ. An NLP-based algorithm was developed and validated against a manually curated validation dataset (n=461), then applied to over 1 million clinical notes associated with the study population. HZO and non-HZO cases were compared by age, sex, race, and comorbidities. We measured the accuracy of the NLP algorithm, which achieved 95.6% sensitivity and 99.3% specificity. Compared to diagnosis codes, NLP identified significantly more HZO cases among the HZ population (13.9% versus 1.7%). Compared to the non-HZO group, the HZO group was older and included more males, more Whites, and more outpatient visits. We developed and validated an automatic method that identifies HZO cases with high accuracy. As one of the largest studies on HZO, our findings emphasize the importance of preventing HZ in the elderly population. This method can be a valuable tool to support population-based studies and clinical care of HZO in the era of big data. This article is protected by copyright. All rights reserved.
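
    The reported sensitivity and specificity follow from a validation confusion matrix in the standard way; a minimal sketch (the counts below are hypothetical, chosen only for illustration, not the study's actual cell counts):

    ```python
    def sensitivity_specificity(tp, fn, tn, fp):
        """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
        return tp / (tp + fn), tn / (tn + fp)

    # Hypothetical validation counts for illustration only.
    sens, spec = sensitivity_specificity(tp=86, fn=4, tn=360, fp=3)
    ```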

  4. Tensor-driven extraction of developmental features from varying paediatric EEG datasets.

    PubMed

    Kinney-Lang, Eli; Spyrou, Loukianos; Ebied, Ahmed; Chin, Richard Fm; Escudero, Javier

    2018-05-21

    Constant changes in developing children's brains can pose a challenge for EEG-dependent technologies. Advancing signal processing methods to identify developmental differences in paediatric populations could help improve the function and usability of such technologies. Taking advantage of the multi-dimensional structure of EEG data through tensor analysis may offer a framework for extracting relevant developmental features from paediatric datasets. A proof of concept is demonstrated through identifying latent developmental features in resting-state EEG. Approach. Three paediatric datasets (n = 50, 17, 44) were analyzed using a two-step constrained parallel factor (PARAFAC) tensor decomposition. Subject age was used as a proxy measure of development. Classification used support vector machines (SVM) to test whether the PARAFAC-identified features could predict subject age. The results were cross-validated within each dataset. Classification analysis was complemented by visualization of the high-dimensional feature structures using t-distributed Stochastic Neighbour Embedding (t-SNE) maps. Main Results. Development-related features were successfully identified for the developmental conditions of each dataset. SVM classification showed the identified features could predict subject age at a level significantly above chance for both healthy and impaired populations. t-SNE maps revealed that suitable tensor factorization was key in extracting the developmental features. Significance. The described methods are a promising tool for identifying latent developmental features occurring throughout childhood EEG. © 2018 IOP Publishing Ltd.
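
    The PARAFAC decomposition at the heart of such pipelines factorizes a three-way tensor (e.g. subjects × channels × frequencies) into rank-1 components; a minimal numpy sketch of the rank-1 case via alternating least squares (the paper's constrained, two-step variant is considerably more involved):

    ```python
    import numpy as np

    def parafac_rank1(T, iters=50):
        """Fit a rank-1 PARAFAC model T ~ a (x) b (x) c by alternating least
        squares: each factor is the least-squares solution given the other two."""
        a = np.ones(T.shape[0])
        b = np.ones(T.shape[1])
        c = np.ones(T.shape[2])
        for _ in range(iters):
            a = np.einsum('ijk,j,k->i', T, b, c) / ((b @ b) * (c @ c))
            b = np.einsum('ijk,i,k->j', T, a, c) / ((a @ a) * (c @ c))
            c = np.einsum('ijk,i,j->k', T, a, b) / ((a @ a) * (b @ b))
        return a, b, c

    # Recover the factors of an exactly rank-1 tensor.
    T = np.einsum('i,j,k->ijk', [1.0, 2.0], [3.0, 4.0, 5.0], [2.0, 1.0])
    a, b, c = parafac_rank1(T)
    ```

    The recovered factor loadings (here a, b, c) are what would be fed to the SVM as candidate developmental features.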

  5. Approximation-based common principal component for feature extraction in multi-class brain-computer interfaces.

    PubMed

    Hoang, Tuan; Tran, Dat; Huang, Xu

    2013-01-01

    Common Spatial Pattern (CSP) is a state-of-the-art method for feature extraction in Brain-Computer Interface (BCI) systems, but it is designed for 2-class BCI classification problems. Current extensions of this method to multiple classes, based on subspace union and covariance matrix similarity, do not provide high performance. This paper presents a new approach to solving multi-class BCI classification problems by forming a subspace from the original subspaces; the proposed method for this approach is called Approximation-based Common Principal Component (ACPC). We perform experiments on Dataset 2a of BCI Competition IV, which was designed for motor imagery classification with 4 classes, to evaluate the proposed method. Preliminary experiments show that the proposed ACPC feature extraction method, when combined with Support Vector Machines, outperforms CSP-based feature extraction methods on the experimental dataset.

  6. Application of Taxonomic Modeling to Microbiota Data Mining for Detection of Helminth Infection in Global Populations.

    PubMed

    Torbati, Mahbaneh Eshaghzadeh; Mitreva, Makedonka; Gopalakrishnan, Vanathi

    2016-12-01

    Human microbiome data from genomic sequencing technologies are fast accumulating, giving us insights into bacterial taxa that contribute to health and disease. Predictive modeling of such microbiota count data for the classification of human infection by parasitic worms, such as helminths, can support detection and management across global populations. Real-world datasets from microbiome experiments are typically sparse, containing hundreds of measurements for bacterial species, of which only a few are detected in the bio-specimens that are analyzed. This property of microbiome data creates a need for more observations for accurate predictive modeling and has previously been dealt with using different methods of feature reduction. To our knowledge, integrative methods such as transfer learning have not yet been explored in the microbiome domain as a way to deal with data sparsity by incorporating knowledge from different but related datasets. One way of incorporating this knowledge is a meaningful mapping among the features of these datasets. In this paper, we claim that such a mapping exists among the members of each individual cluster, grouped on the basis of phylogenetic dependency among taxa and their association with the phenotype. We validate our claim by showing that models incorporating associations in such a grouped feature space suffer no performance deterioration on the given classification task. We test our hypothesis using classification models that detect helminth infection in microbiota of human fecal samples obtained from Indonesia and Liberia. In our experiments, we first learn binary classifiers for helminth infection detection using Naive Bayes, Support Vector Machines, Multilayer Perceptrons, and Random Forest methods. We then add taxonomic modeling, using the SMART-scan module to group the data, and learn classifiers with the same four methods to test the validity of the resulting groupings. We observed performance improvements of 6% to 23% in the Area Under the receiver operating characteristic (ROC) Curve (AUC) and 7% to 26% in Balanced Accuracy (Bacc), respectively, over 10 runs of 10-fold cross-validation. These results show that using phylogenetic dependency to group our microbiota data yields a noticeable improvement in classification performance for helminth infection detection. These promising feasibility results demonstrate that methods such as SMART-scan could be used in the future for knowledge transfer between different but related microbiome datasets via phylogenetically-related functional mapping, enabling novel integrative biomarker discovery.
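
    The two performance measures used above can be computed directly; a minimal numpy sketch (AUC via the Mann-Whitney statistic; the scores and labels below are made up):

    ```python
    import numpy as np

    def auc(scores, labels):
        """Area under the ROC curve: the probability that a random positive
        scores above a random negative (Mann-Whitney statistic, ties count 1/2)."""
        pos = scores[labels == 1]
        neg = scores[labels == 0]
        greater = (pos[:, None] > neg[None, :]).mean()
        ties = (pos[:, None] == neg[None, :]).mean()
        return greater + 0.5 * ties

    def balanced_accuracy(pred, labels):
        """Mean of sensitivity and specificity; robust to class imbalance."""
        sens = (pred[labels == 1] == 1).mean()
        spec = (pred[labels == 0] == 0).mean()
        return (sens + spec) / 2

    scores = np.array([0.9, 0.8, 0.6, 0.3, 0.2])
    labels = np.array([1, 1, 0, 0, 0])
    ```

    Balanced accuracy is the more informative of the two when, as in sparse microbiome data, one class dominates the samples.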

  7. Systems and methods for displaying data in split dimension levels

    DOEpatents

    Stolte, Chris; Hanrahan, Patrick

    2015-07-28

    Systems and methods for displaying data in split dimension levels are disclosed. In some implementations, a method includes: at a computer, obtaining a dimensional hierarchy associated with a dataset, wherein the dimensional hierarchy includes at least one dimension and a sub-dimension of the at least one dimension; and populating information representing data included in the dataset into a visual table having a first axis and a second axis, wherein the first axis corresponds to the at least one dimension and the second axis corresponds to the sub-dimension of the at least one dimension.
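
    A dimensional hierarchy split across the two axes of a visual table, as described, can be sketched with pandas (the Year → Quarter hierarchy and the sales figures here are hypothetical, not taken from the patent):

    ```python
    import pandas as pd

    # Hypothetical dataset with a Year -> Quarter dimensional hierarchy.
    df = pd.DataFrame({
        "year":    [2014, 2014, 2015, 2015],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "sales":   [10, 12, 14, 16],
    })

    # First axis: the dimension (year); second axis: its sub-dimension (quarter).
    table = df.pivot_table(values="sales", index="year", columns="quarter")
    ```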

  8. Survey-based socio-economic data from slums in Bangalore, India

    NASA Astrophysics Data System (ADS)

    Roy, Debraj; Palavalli, Bharath; Menon, Niveditha; King, Robin; Pfeffer, Karin; Lees, Michael; Sloot, Peter M. A.

    2018-01-01

    In 2010, an estimated 860 million people were living in slums worldwide, with around 60 million added to the slum population between 2000 and 2010. In 2011, 200 million people in urban Indian households were considered to live in slums. In order to design slum development programmes and poverty alleviation methods, it is necessary to understand the needs of these communities, which requires data with high granularity in the Indian context. Unfortunately, there is a paucity of highly granular data at the level of individual slums. We collected the data presented in this paper in partnership with the slum dwellers in order to overcome challenges such as the validity and efficacy of self-reported data. Our survey of Bangalore covered 36 slums across the city, chosen using stratification criteria that included the geographical location of the slum, whether the slum was resettled or rehabilitated, the notification status of the slum, the size of the slum and the religious profile. This paper describes the relational model of the slum dataset, the variables in the dataset, the variables constructed for analysis and the issues identified with the dataset. The data collected comprise around 267,894 data points spread over 242 questions for 1,107 households. The dataset can facilitate interdisciplinary research on the spatial and temporal dynamics of urban poverty and well-being in the context of the rapid urbanization of cities in developing countries.

  9. Spatially Explicit Models to Investigate Geographic Patterns in the Distribution of Forensic STRs: Application to the North-Eastern Mediterranean

    PubMed Central

    Messina, Francesco; Finocchio, Andrea; Akar, Nejat; Loutradis, Aphrodite; Michalodimitrakis, Emmanuel I.; Brdicka, Radim; Jodice, Carla

    2016-01-01

    Human forensic STRs used for individual identification have been reported to have little power for inter-population analyses. Several methods have been developed that incorporate information on the spatial distribution of individuals to arrive at a description of the arrangement of diversity. We genotyped, at 16 forensic STRs, a large population sample obtained from many locations in Italy, Greece and Turkey, i.e. three countries crucial to understanding discontinuities at the European/Asian junction and the genetic legacy of ancient migrations, but seldom represented together in previous studies. Using spatial PCA on the full dataset, we detected patterns of population affinities in the area. Additionally, we devised objective criteria to reduce the overall complexity into reduced datasets. Independent spatially explicit methods applied to these latter datasets converged in showing that it is possible to extract information on long- to medium-range geographical trends and structuring from the overall diversity. All analyses returned the picture of a background clinal variation, with regional discontinuities captured by each of the reduced datasets. Several aspects of our results are confirmed on external STR datasets and replicate those of genome-wide SNP typings. High levels of gene flow were inferred within the main continental areas by coalescent simulations. These results are promising from a microevolutionary perspective, in view of the fast pace at which forensic data are being accumulated for many locales. It is foreseeable that this will allow the exploitation of an invaluable genotypic resource, assembled for other (forensic) purposes, to clarify important aspects in the formation of local gene pools. PMID:27898725

  10. Hard exudates segmentation based on learned initial seeds and iterative graph cut.

    PubMed

    Kusakunniran, Worapan; Wu, Qiang; Ritthipravat, Panrasee; Zhang, Jian

    2018-05-01

    (Background and Objective): The occurrence of hard exudates is one of the early signs of diabetic retinopathy, which is one of the leading causes of blindness. Many patients with diabetic retinopathy lose their vision because of late detection of the disease. Thus, this paper proposes a novel method for automatically segmenting hard exudates in retinal images. (Methods): Existing methods are based on either supervised or unsupervised learning techniques, and the learned segmentation models often miss or falsely detect hard exudates, owing to the lack of rich characteristics, intra-variations, and similarity with other components in the retinal image. Thus, in this paper, supervised learning based on the multilayer perceptron (MLP) is used only to identify initial seeds with high confidence of being hard exudates. The segmentation is then finalized by unsupervised learning based on iterative graph cut (GC) using clusters of initial seeds. In addition, to reduce the color intra-variations of hard exudates across retinal images, color transfer (CT) is applied in the pre-processing step to normalize their color information. (Results): The experiments and comparisons with other existing methods are based on two well-known datasets, e_ophtha EX and DIARETDB1. The proposed method outperforms the existing methods in the literature, with a pixel-level sensitivity of 0.891 on the DIARETDB1 dataset and 0.564 on the e_ophtha EX dataset. Cross-dataset validation, in which training is performed on one dataset and testing on another, is also evaluated in this paper to illustrate the robustness of the proposed method. (Conclusions): This newly proposed method integrates supervised and unsupervised learning techniques. It achieves improved performance compared with existing methods in the literature. The robustness of the proposed method under cross-dataset validation could enhance its practical usage: the trained model could be more practical for unseen data in real-world situations, especially when the capturing environments of training and testing images differ. Copyright © 2018 Elsevier B.V. All rights reserved.

  11. Assessing deep and shallow learning methods for quantitative prediction of acute chemical toxicity.

    PubMed

    Liu, Ruifeng; Madore, Michael; Glover, Kyle P; Feasel, Michael G; Wallqvist, Anders

    2018-05-02

    Animal-based methods for assessing chemical toxicity are struggling to meet testing demands. In silico approaches, including machine-learning methods, are promising alternatives. Recently, deep neural networks (DNNs) were evaluated and reported to outperform other machine-learning methods for quantitative structure-activity relationship modeling of molecular properties. However, most of the reported performance evaluations relied on global performance metrics, such as the root mean squared error (RMSE) between the predicted and experimental values of all samples, without considering the impact of sample distribution across the activity spectrum. Here, we carried out an in-depth analysis of DNN performance for quantitative prediction of acute chemical toxicity using several datasets. We found that the overall performance of DNN models on datasets of up to 30,000 compounds was similar to that of random forest (RF) models, as measured by the RMSE and correlation coefficients between the predicted and experimental results. However, our detailed analyses demonstrated that global performance metrics are inappropriate for datasets with a highly uneven sample distribution, because they show a strong bias for the most populous compounds along the toxicity spectrum. For highly toxic compounds, DNN and RF models trained on all samples performed much worse than the global performance metrics indicated. Surprisingly, our variable nearest neighbor method, which utilizes only structurally similar compounds to make predictions, performed reasonably well, suggesting that information of close near neighbors in the training sets is a key determinant of acute toxicity predictions.
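
    The bias described above is easy to reproduce: when low-toxicity samples dominate, a good global RMSE can coexist with poor predictions for the highly toxic minority. A toy numpy illustration (the values are invented, not taken from the paper's datasets):

    ```python
    import numpy as np

    def rmse(y_true, y_pred):
        """Root mean squared error."""
        return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

    # Many well-predicted low-toxicity compounds, a few badly-predicted toxic ones.
    y_true = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 6.0, 7.0])
    y_pred = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 5.0])

    global_rmse = rmse(y_true, y_pred)               # dominated by the easy majority
    toxic = y_true > 5.0
    toxic_rmse = rmse(y_true[toxic], y_pred[toxic])  # much worse on the toxic tail
    ```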

  12. Discriminant locality preserving projections based on L1-norm maximization.

    PubMed

    Zhong, Fujin; Zhang, Jiashu; Li, Defang

    2014-11-01

    Conventional discriminant locality preserving projection (DLPP) is a dimensionality reduction technique based on manifold learning, which has demonstrated good performance in pattern recognition. However, because its objective function is based on the distance criterion using L2-norm, conventional DLPP is not robust to outliers which are present in many applications. This paper proposes an effective and robust DLPP version based on L1-norm maximization, which learns a set of local optimal projection vectors by maximizing the ratio of the L1-norm-based locality preserving between-class dispersion and the L1-norm-based locality preserving within-class dispersion. The proposed method is proven to be feasible and also robust to outliers while overcoming the small sample size problem. The experimental results on artificial datasets, Binary Alphadigits dataset, FERET face dataset and PolyU palmprint dataset have demonstrated the effectiveness of the proposed method.

  13. GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies

    PubMed Central

    Jia, Erik; Chen, Tianlu

    2018-01-01

    Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered as missing not at random (MNAR). Improper data processing procedures for missing values will adversely affect subsequent statistical analyses. However, few imputation methods have been developed and applied to the MNAR situation in the field of metabolomics, so a practical left-censored missing value imputation method is urgently needed. We developed an iterative Gibbs sampler based left-censored missing value imputation approach (GSimp). We compared GSimp with three other imputation methods on two real-world targeted metabolomics datasets and one simulated dataset using our imputation evaluation pipeline. The results show that GSimp outperforms the other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. Additionally, a parallel version of GSimp was developed for dealing with large-scale metabolomics datasets. The R code for GSimp, the evaluation pipeline, a tutorial, and the real-world and simulated targeted metabolomics datasets are available at: https://github.com/WandeRum/GSimp. PMID:29385130
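
    The core draw in a Gibbs-style left-censored imputation replaces each missing value with a sample from a fitted distribution truncated above at the detection limit; a minimal rejection-sampling sketch (a single fixed normal, not GSimp's full iterative conditional model):

    ```python
    import random

    def draw_left_censored(mean, sd, detection_limit, rng):
        """Sample from Normal(mean, sd) truncated to values below the detection
        limit -- the kind of conditional draw iterated for each missing value."""
        while True:
            x = rng.gauss(mean, sd)
            if x < detection_limit:
                return x

    rng = random.Random(0)
    imputed = [draw_left_censored(1.0, 1.0, 0.5, rng) for _ in range(200)]
    ```

    In a full Gibbs sampler, the mean and standard deviation would themselves be re-estimated from the completed data at each iteration rather than held fixed.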

  14. Metabarcoding of marine nematodes – evaluation of reference datasets used in tree-based taxonomy assignment approach

    PubMed Central

    2016-01-01

    Background. Metabarcoding is becoming a common tool used to assess and compare diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process and several taxonomy assignment methods were proposed to accomplish this task. This publication evaluates the quality of reference datasets, alongside with several alignment and phylogeny inference methods used in one of the taxonomy assignment methods, called tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on relative placements of OTUs and reference sequences on the cladogram and support that these placements receive. New information. In tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires high quality reference dataset to be used. Resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have detrimental effect on the resolution of cladograms used in tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives highest resolution for the particular reference dataset. Completing the above mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach. PMID:27932919

  15. Metabarcoding of marine nematodes - evaluation of reference datasets used in tree-based taxonomy assignment approach.

    PubMed

    Holovachov, Oleksandr

    2016-01-01

    Metabarcoding is becoming a common tool used to assess and compare diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process and several taxonomy assignment methods were proposed to accomplish this task. This publication evaluates the quality of reference datasets, alongside with several alignment and phylogeny inference methods used in one of the taxonomy assignment methods, called tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on relative placements of OTUs and reference sequences on the cladogram and support that these placements receive. In tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires high quality reference dataset to be used. Resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have detrimental effect on the resolution of cladograms used in tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives highest resolution for the particular reference dataset. Completing the above mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach.

  16. Detection of genomic loci associated with environmental variables using generalized linear mixed models.

    PubMed

    Lobréaux, Stéphane; Melodelima, Christelle

    2015-02-01

    We tested the use of Generalized Linear Mixed Models to detect associations between genetic loci and environmental variables, taking into account the population structure of sampled individuals. We used a simulation approach to generate datasets under demographically and selectively explicit models. These datasets were used to analyze and optimize GLMM capacity to detect the association between markers and selective coefficients as environmental data in terms of false and true positive rates. Different sampling strategies were tested, maximizing the number of populations sampled, sites sampled per population, or individuals sampled per site, and the effect of different selective intensities on the efficiency of the method was determined. Finally, we apply these models to an Arabidopsis thaliana SNP dataset from different accessions, looking for loci associated with spring minimal temperature. We identified 25 regions that exhibit unusual correlations with the climatic variable and contain genes with functions related to temperature stress. Copyright © 2014 Elsevier Inc. All rights reserved.
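
    A simplified stand-in for the idea behind the GLMM, assuming nothing from the paper's actual pipeline: center both the marker and the environmental variable within each sampled population (a fixed-effect analogue of the random population effect), then test the residual correlation, so that a neutral marker whose allele frequency merely tracks population structure shows no spurious association:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    pop = np.repeat([0, 1, 2], 50)               # three sampled populations
    env = 2.0 * pop + rng.normal(0.0, 0.1, 150)  # environment tracks population
    # Neutral biallelic marker whose allele frequency differs between populations:
    geno = rng.binomial(2, 0.2 + 0.3 * pop).astype(float)

    def center_within(x, groups):
        """Subtract each group's mean (fixed-effect analogue of the random
        population effect in the GLMM)."""
        out = x.astype(float).copy()
        for g in np.unique(groups):
            out[groups == g] -= x[groups == g].mean()
        return out

    naive_r = float(np.corrcoef(env, geno)[0, 1])  # inflated by structure
    adj_r = float(np.corrcoef(center_within(env, pop),
                              center_within(geno, pop))[0, 1])
    ```

    A true GLMM additionally models population as a random effect and accommodates non-Gaussian responses, but the contrast between naive_r and adj_r captures why accounting for structure matters.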

  17. Adapt-Mix: learning local genetic correlation structure improves summary statistics-based analyses

    PubMed Central

    Park, Danny S.; Brown, Brielin; Eng, Celeste; Huntsman, Scott; Hu, Donglei; Torgerson, Dara G.; Burchard, Esteban G.; Zaitlen, Noah

    2015-01-01

    Motivation: Approaches to identifying new risk loci, training risk prediction models, imputing untyped variants and fine-mapping causal variants from summary statistics of genome-wide association studies are playing an increasingly important role in the human genetics community. Current summary statistics-based methods rely on global ‘best guess’ reference panels to model the genetic correlation structure of the dataset being studied. This approach, especially in admixed populations, has the potential to produce misleading results, ignores variation in local structure and is not feasible when appropriate reference panels are missing or small. Here, we develop a method, Adapt-Mix, that combines information across all available reference panels to produce estimates of local genetic correlation structure for summary statistics-based methods in arbitrary populations. Results: We applied Adapt-Mix to estimate the genetic correlation structure of both admixed and non-admixed individuals using simulated and real data. We evaluated our method by measuring the performance of two summary statistics-based methods: imputation and joint-testing. When using our method as opposed to the current standard of ‘best guess’ reference panels, we observed a 28% decrease in mean-squared error for imputation and a 73.7% decrease in mean-squared error for joint-testing. Availability and implementation: Our method is publicly available in a software package called ADAPT-Mix available at https://github.com/dpark27/adapt_mix. Contact: noah.zaitlen@ucsf.edu PMID:26072481

  18. Assessing the phylogeographic history of the montane caddisfly Thremma gallicum using mitochondrial and restriction-site-associated DNA (RAD) markers

    PubMed Central

    Macher, Jan-Niklas; Rozenberg, Andrey; Pauls, Steffen U; Tollrian, Ralph; Wagner, Rüdiger; Leese, Florian

    2015-01-01

    Repeated Quaternary glaciations have significantly shaped the present distribution and diversity of several European species in aquatic and terrestrial habitats. To study the phylogeography of freshwater invertebrates, patterns of intraspecific variation have been examined primarily using mitochondrial DNA markers that may yield results unrepresentative of the true species history. Here, population genetic parameters were inferred for a montane aquatic caddisfly, Thremma gallicum, by sequencing a 658-bp fragment of the mitochondrial CO1 gene, and 12,514 nuclear RAD loci. T. gallicum has a highly disjunct distribution in southern and central Europe, with known populations in the Cantabrian Mountains, Pyrenees, Massif Central, and Black Forest. Both datasets represented rangewide sampling of T. gallicum. For the CO1 dataset, this included 352 specimens from 26 populations, and for the RAD dataset, 17 specimens from eight populations. We tested 20 competing phylogeographic scenarios using approximate Bayesian computation (ABC) and estimated genetic diversity patterns. Support for phylogeographic scenarios and diversity estimates differed between datasets with the RAD data favouring a southern origin of extant populations and indicating the Cantabrian Mountains and Massif Central populations to represent highly diverse populations as compared with the Pyrenees and Black Forest populations. The CO1 data supported a vicariance scenario (north–south) and yielded inconsistent diversity estimates. Permutation tests suggest that a few hundred polymorphic RAD SNPs are necessary for reliable parameter estimates. Our results highlight the potential of RAD and ABC-based hypothesis testing to complement phylogeographic studies on non-model species. PMID:25691988

  19. Data harmonization and federated analysis of population-based studies: the BioSHaRE project

    PubMed Central

    2013-01-01

    Abstract Background Individual-level data pooling of large population-based studies across research centres in international research projects faces many hurdles. The BioSHaRE (Biobank Standardisation and Harmonisation for Research Excellence in the European Union) project aims to address these issues by building a collaborative group of investigators and developing tools for data harmonization, database integration and federated data analyses. Methods Eight population-based studies in six European countries were recruited to participate in the BioSHaRE project. Through workshops, teleconferences and electronic communications, participating investigators identified a set of 96 variables targeted for harmonization to answer research questions of interest. Using each study’s questionnaires, standard operating procedures, and data dictionaries, harmonization potential was assessed. Whenever harmonization was deemed possible, processing algorithms were developed and implemented in an open-source software infrastructure to transform study-specific data into the target (i.e. harmonized) format. Harmonized datasets located on servers in each research centre across Europe were interconnected through a federated database system to perform statistical analyses. Results Retrospective harmonization led to the generation of common-format variables for 73% of the matches considered (96 targeted variables across 8 studies). Authenticated investigators can now perform complex statistical analyses of harmonized datasets stored on distributed servers without actually sharing individual-level data, using the DataSHIELD method. Conclusion New Internet-based networking technologies and database management systems are providing the means to support collaborative, multi-center research in an efficient and secure manner. 
The results from this pilot project show that, given a strong collaborative relationship between participating studies, it is possible to seamlessly co-analyse internationally harmonized research databases while allowing each study to retain full control over individual-level data. We encourage additional collaborative research networks in epidemiology, public health, and the social sciences to make use of the open source tools presented herein. PMID:24257327

  20. HGDP and HapMap Analysis by Ancestry Mapper Reveals Local and Global Population Relationships

    PubMed Central

    Magalhães, Tiago R.; Casey, Jillian P.; Conroy, Judith; Regan, Regina; Fitzpatrick, Darren J.; Shah, Naisha; Sobral, João; Ennis, Sean

    2012-01-01

    Knowledge of human origins, migrations, and expansions is greatly enhanced by the availability of large datasets of genetic information from different populations and by the development of bioinformatic tools used to analyze the data. We present Ancestry Mapper, which we believe improves on existing methods, for the assignment of genetic ancestry to an individual and to study the relationships between local and global populations. The principal function of the method, named Ancestry Mapper, is to give each individual analyzed a genetic identifier, made up of just 51 genetic coordinates, that corresponds to its relationship to the HGDP reference population. As a consequence, the Ancestry Mapper Id (AMid) has intrinsic biological meaning and provides a tool to measure similarity between world populations. We applied Ancestry Mapper to a dataset comprised of the HGDP and HapMap data. The results show distinctions at the continental level, while simultaneously giving details at the population level. We clustered AMids of HGDP/HapMap and observe a recapitulation of human migrations: for a small number of clusters, individuals are grouped according to continental origins; for a larger number of clusters, regional and population distinctions are evident. Calculating distances between AMids allows us to infer ancestry. The number of coordinates is expandable, increasing the power of Ancestry Mapper. An R package called Ancestry Mapper is available to apply this method to any high-density genomic data set. PMID:23189146
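
    Inferring ancestry by "calculating distances between AMids", as the abstract puts it, reduces to a distance in the 51-dimensional coordinate space; a small sketch with made-up AMids (real AMids come from the Ancestry Mapper R package, not from this code):

    ```python
    import numpy as np

    # Made-up AMids: 51 coordinates, one per HGDP reference population.
    rng = np.random.default_rng(0)
    amid_a = rng.random(51)
    amid_b = amid_a + 0.01   # an individual genetically close to amid_a
    amid_c = 1.0 - amid_a    # a genetically distant individual

    def amid_distance(x, y):
        """Euclidean distance between two Ancestry Mapper Ids."""
        return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))
    ```

    Clustering these distance vectors (e.g. hierarchically) is what recapitulates the continental and regional groupings described above.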

  2. Genovo: De Novo Assembly for Metagenomes

    NASA Astrophysics Data System (ADS)

    Laserson, Jonathan; Jojic, Vladimir; Koller, Daphne

    Next-generation sequencing technologies produce a large number of noisy reads from the DNA in a sample. Metagenomics and population sequencing aim to recover the genomic sequences of the species in the sample, which could be of high diversity. Methods geared towards single sequence reconstruction are not sensitive enough when applied in this setting. We introduce a generative probabilistic model of read generation from environmental samples and present Genovo, a novel de novo sequence assembler that discovers likely sequence reconstructions under the model. A Chinese restaurant process prior accounts for the unknown number of genomes in the sample. Inference is made by applying a series of hill-climbing steps iteratively until convergence. We compare the performance of Genovo to three other short read assembly programs across one synthetic dataset and eight metagenomic datasets created using the 454 platform, the largest of which has 311k reads. Genovo's reconstructions cover more bases and recover more genes than the other methods, and yield a higher assembly score.
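
The Chinese restaurant process prior mentioned above lets the number of genomes grow with the data instead of being fixed in advance. A minimal sketch of CRP cluster assignment, with a hypothetical concentration parameter `alpha` and read count (this is the generic prior, not Genovo's actual inference code):

```python
import random

def crp_assign(n_reads, alpha, seed=0):
    """Assign reads to an unbounded number of genome clusters via a
    Chinese restaurant process: read i joins an existing cluster with
    probability proportional to its size, or opens a new cluster with
    probability proportional to alpha."""
    rng = random.Random(seed)
    cluster_sizes = []
    assignments = []
    for _ in range(n_reads):
        # Weights: existing cluster sizes, plus alpha for a new cluster.
        weights = cluster_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(cluster_sizes):
            cluster_sizes.append(1)       # open a new cluster
        else:
            cluster_sizes[choice] += 1    # join an existing one
        assignments.append(choice)
    return assignments, cluster_sizes

assignments, sizes = crp_assign(n_reads=1000, alpha=2.0)
print(len(sizes))  # number of clusters grows with the data, roughly as alpha*log(n)
```

In Genovo this prior is combined with a read-generation likelihood, and the assignments are refined by iterative hill-climbing rather than sampled once as here.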

  3. U.S. Maternally Linked Birth Records May Be Biased for Hispanics and Other Population Groups

    PubMed Central

    LEISS, JACK K.; GILES, DENISE; SULLIVAN, KRISTIN M.; MATHEWS, RAHEL; SENTELLE, GLENDA; TOMASHEK, KAY M.

    2010-01-01

    Purpose To advance understanding of linkage error in U.S. maternally linked datasets, and how the error may affect results of studies based on the linked data. Methods North Carolina birth and fetal death records for 1988-1997 were maternally linked (n=1,030,029). The maternal set probability, defined as the probability that all records assigned to the same maternal set do in fact represent events to the same woman, was used to assess differential maternal linkage error across race/ethnic groups. Results Maternal set probabilities were lower for records specifying Asian or Hispanic race/ethnicity, suggesting greater maternal linkage error. The lower probabilities for Hispanics were concentrated in women of Mexican origin who were not born in the United States. Conclusions Differential maternal linkage error may be a source of bias in studies using U.S. maternally linked datasets to make comparisons between Hispanics and other groups or among Hispanic subgroups. Methods to quantify and adjust for this potential bias are needed. PMID:20006273

  4. Bayes and empirical Bayes estimators of abundance and density from spatial capture-recapture data

    USGS Publications Warehouse

    Dorazio, Robert M.

    2013-01-01

In capture-recapture and mark-resight surveys, movements of individuals both within and between sampling periods can alter the susceptibility of individuals to detection over the region of sampling. In these circumstances spatially explicit capture-recapture (SECR) models, which incorporate the observed locations of individuals, allow population density and abundance to be estimated while accounting for differences in detectability of individuals. In this paper I propose two Bayesian SECR models, one for the analysis of recaptures observed in trapping arrays and another for the analysis of recaptures observed in area searches. In formulating these models I used distinct submodels to specify the distribution of individual home-range centers and the observable recaptures associated with these individuals. This separation of ecological and observational processes allowed me to derive a formal connection between Bayes and empirical Bayes estimators of population abundance that has not been established previously. I showed that this connection applies to every Poisson point-process model of SECR data and provides theoretical support for a previously proposed estimator of abundance based on recaptures in trapping arrays. To illustrate results of both classical and Bayesian methods of analysis, I compared Bayes and empirical Bayes estimates of abundance and density using recaptures from simulated and real populations of animals. Real populations included two iconic datasets: recaptures of tigers detected in camera-trap surveys and recaptures of lizards detected in area-search surveys. In the datasets I analyzed, classical and Bayesian methods provided similar – and often identical – inferences, which is not surprising given the sample sizes and the noninformative priors used in the analyses.

  6. Recovering complete and draft population genomes from metagenome datasets

    DOE PAGES

    Sangwan, Naseer; Xia, Fangfang; Gilbert, Jack A.

    2016-03-08

Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost on application to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
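
The coverage-covariation idea can be illustrated with a toy example: contigs from the same genome rise and fall together in abundance across samples, so clustering contigs by their coverage profiles across multiple datasets separates species. A minimal sketch on hypothetical coverage vectors (real binners also use composition signatures and far more robust clustering than this greedy pass):

```python
def correlation(x, y):
    """Pearson correlation between two coverage profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def bin_contigs(coverages, threshold=0.95):
    """Greedy single-linkage binning: a contig joins the first bin whose
    seed contig has a highly correlated coverage profile."""
    bins = []  # list of (seed_profile, [contig_names])
    for name, profile in coverages.items():
        for seed, members in bins:
            if correlation(seed, profile) >= threshold:
                members.append(name)
                break
        else:
            bins.append((profile, [name]))
    return [members for _, members in bins]

# Hypothetical mean coverages for five contigs across four datasets.
coverages = {
    "contig_1": [10.0, 40.0, 5.0, 20.0],   # species A
    "contig_2": [11.0, 42.0, 5.5, 21.0],   # species A
    "contig_3": [30.0, 3.0, 60.0, 8.0],    # species B
    "contig_4": [29.0, 3.5, 58.0, 9.0],    # species B
    "contig_5": [5.0, 5.0, 45.0, 45.0],    # distinct profile, its own bin
}
print(bin_contigs(coverages))
# -> [['contig_1', 'contig_2'], ['contig_3', 'contig_4'], ['contig_5']]
```

Using coverage profiles across several datasets, rather than a single depth value, is what makes the covariation signal strong enough to separate genomes with similar overall abundance.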

  8. Systems-based biological concordance and predictive reproducibility of gene set discovery methods in cardiovascular disease.

    PubMed

    Azuaje, Francisco; Zheng, Huiru; Camargo, Anyela; Wang, Haiying

    2011-08-01

The discovery of novel disease biomarkers is a crucial challenge for translational bioinformatics. Demonstrating both their classification power and their reproducibility across independent datasets is an essential requirement for assessing their potential clinical relevance. Small datasets and the multiplicity of putative biomarker sets may explain the lack of predictive reproducibility. Studies based on pathway-driven discovery approaches have suggested that, despite such discrepancies, the resulting putative biomarkers tend to be implicated in common biological processes. Investigations of this problem have mainly focused on datasets derived from cancer research. We investigated the predictive and functional concordance of five methods for discovering putative biomarkers in four independently generated datasets from the cardiovascular disease domain. A diversity of biosignatures was identified by the different methods. However, we found strong biological process concordance between them, especially in the case of methods based on gene set analysis. With a few exceptions, we observed a lack of classification reproducibility using independent datasets. Partial overlaps exist between our putative sets of biomarkers and those of the primary studies. Despite the observed limitations, pathway-driven or gene set analysis can predict potentially novel biomarkers and can jointly point to biomedically relevant underlying molecular mechanisms. Copyright © 2011 Elsevier Inc. All rights reserved.

  9. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes

    PubMed Central

    2011-01-01

Background Knowing the phase of marker genotype data can be useful in genome-wide association studies, because it makes it possible to use analysis frameworks that account for identity by descent or parent of origin of alleles, and it can lead to a large increase in data quantities via genotype or sequence imputation. Long-range phasing and haplotype library imputation constitute a fast and accurate method to impute phase for SNP data. Methods A long-range phasing and haplotype library imputation algorithm was developed. It combines information from surrogate parents and long haplotypes to resolve phase in a manner that is not dependent on the family structure of a dataset or on the presence of pedigree information. Results The algorithm performed well in both simulated and real livestock and human datasets in terms of both phasing accuracy and computational efficiency. The percentage of alleles that could be phased in both simulated and real datasets of varying size generally exceeded 98%, while the percentage of alleles incorrectly phased in simulated data was generally less than 0.5%. The accuracy of phasing was affected by dataset size, with lower accuracy for dataset sizes less than 1000, but was not affected by effective population size, family data structure, presence or absence of pedigree information, or SNP density. The method was computationally fast. In comparison to a commonly used statistical method (fastPHASE), the current method made about 8% fewer phasing mistakes and ran about 26 times faster for a small dataset. For larger datasets, the differences in computational time are expected to be even greater. A computer program implementing these methods has been made available. Conclusions The algorithm and software developed in this study make feasible the routine phasing of high-density SNP chips in large datasets. PMID:21388557

  10. The Affective Norms for Polish Short Texts (ANPST) Database Properties and Impact of Participants’ Population and Sex on Affective Ratings

    PubMed Central

    Imbir, Kamil K.

    2017-01-01

The Affective Norms for Polish Short Texts (ANPST) dataset (Imbir, 2016d) is a list of 718 affective sentence stimuli with known affective properties with respect to subjectively perceived valence, arousal, dominance, origin, subjective significance, and source. This article examines the reliability of the ANPST and the impact of population type and sex on affective ratings. The ANPST dataset was introduced to provide a recognized method of eliciting affective states with linguistic stimuli that are more complex than single words, include contextual information, and are therefore less ambiguous to interpret than single words. Analysis of the properties of the ANPST dataset showed that the collected norms are reliable in terms of split-half estimation and that the distributions of ratings are similar to those obtained in other affective norms studies. The pattern of correlations was the same as that found in analysis of an affective norms dataset for words based on the same six variables. Female psychology students' valence ratings were also more polarized than those of their female peers studying other subjects, but arousal ratings were only higher for negative words. Differences also appeared for all other measured dimensions. Women's valence ratings were found to be more polarized and their arousal ratings higher than those made by men, and differences were also present for dominance, origin, and subjective significance. The ANPST is the first Polish-language list of sentence stimuli and could easily be adapted for other languages and cultures. PMID:28611707
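
Split-half reliability, used above to validate the norms, splits the raters into two halves, correlates the halves' mean ratings per stimulus, and corrects for the shortened test length with the Spearman-Brown formula. A minimal sketch on hypothetical ratings (the ANPST's actual splitting procedure may differ):

```python
def pearson(x, y):
    """Pearson correlation between two rating vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(ratings):
    """ratings: one row per rater, one column per stimulus.
    Split raters into two halves, average each half's ratings per
    stimulus, correlate the halves, then apply the Spearman-Brown
    correction r_sb = 2r / (1 + r)."""
    half_a, half_b = ratings[0::2], ratings[1::2]
    mean_a = [sum(col) / len(col) for col in zip(*half_a)]
    mean_b = [sum(col) / len(col) for col in zip(*half_b)]
    r = pearson(mean_a, mean_b)
    return 2 * r / (1 + r)

# Hypothetical valence ratings: 4 raters x 6 sentence stimuli (1-9 scale).
ratings = [
    [2, 7, 5, 8, 3, 6],
    [1, 8, 4, 9, 2, 7],
    [2, 6, 5, 8, 3, 5],
    [3, 8, 4, 7, 2, 6],
]
print(round(split_half_reliability(ratings), 3))
```

Values close to 1 indicate that the two halves of the rater pool agree, i.e., that the norms are stable.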

  11. Improved detection of CXCR4-using HIV by V3 genotyping: application of population-based and "deep" sequencing to plasma RNA and proviral DNA.

    PubMed

    Swenson, Luke C; Moores, Andrew; Low, Andrew J; Thielen, Alexander; Dong, Winnie; Woods, Conan; Jensen, Mark A; Wynhoven, Brian; Chan, Dennison; Glascock, Christopher; Harrigan, P Richard

    2010-08-01

    Tropism testing should rule out CXCR4-using HIV before treatment with CCR5 antagonists. Currently, the recombinant phenotypic Trofile assay (Monogram) is most widely utilized; however, genotypic tests may represent alternative methods. Independent triplicate amplifications of the HIV gp120 V3 region were made from either plasma HIV RNA or proviral DNA. These underwent standard, population-based sequencing with an ABI3730 (RNA n = 63; DNA n = 40), or "deep" sequencing with a Roche/454 Genome Sequencer-FLX (RNA n = 12; DNA n = 12). Position-specific scoring matrices (PSSMX4/R5) (-6.96 cutoff) and geno2pheno[coreceptor] (5% false-positive rate) inferred tropism from V3 sequence. These methods were then independently validated with a separate, blinded dataset (n = 278) of screening samples from the maraviroc MOTIVATE trials. Standard sequencing of HIV RNA with PSSM yielded 69% sensitivity and 91% specificity, relative to Trofile. The validation dataset gave 75% sensitivity and 83% specificity. Proviral DNA plus PSSM gave 77% sensitivity and 71% specificity. "Deep" sequencing of HIV RNA detected >2% inferred-CXCR4-using virus in 8/8 samples called non-R5 by Trofile, and <2% in 4/4 samples called R5. Triplicate analyses of V3 standard sequence data detect greater proportions of CXCR4-using samples than previously achieved. Sequencing proviral DNA and "deep" V3 sequencing may also be useful tools for assessing tropism.
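
PSSM-based tropism calls like those above score each V3 amino acid against a position-specific matrix and compare the summed score to a cutoff (the study used -6.96 for PSSMX4/R5). A minimal sketch with a hypothetical 3-position matrix and cutoff rather than a real 35-position V3 PSSM:

```python
# Hypothetical position-specific scoring matrix: score per amino acid
# at each of 3 positions (a real V3 PSSM covers all 35 positions).
pssm = [
    {"R": 1.2, "K": 0.9, "S": -0.8, "T": -0.6},
    {"R": 0.7, "G": -0.5, "A": -0.4, "K": 0.8},
    {"R": 1.0, "E": -0.9, "D": -0.7, "Q": -0.2},
]
CUTOFF = -0.5  # hypothetical; the study's PSSMX4/R5 cutoff was -6.96

def score(seq, matrix, default=-1.0):
    """Sum per-position scores; unlisted residues get a default penalty."""
    return sum(col.get(aa, default) for aa, col in zip(seq, matrix))

def predict_tropism(seq):
    """Scores above the cutoff are called X4 (CXCR4-using), else R5."""
    return "X4" if score(seq, pssm) > CUTOFF else "R5"

print(predict_tropism("RKR"))  # basic residues push the score up -> "X4"
print(predict_tropism("SGE"))  # -> "R5"
```

Running such a scorer over every read in a "deep" sequencing run, rather than over one population-based consensus, is what allows minority CXCR4-using variants below ~20% prevalence to be detected.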

  12. Energy Partitioning of Seismic Phases: Current Datasets and Techniques Aimed at Improving the Future of Event Identification

    NASA Astrophysics Data System (ADS)

    Bonner, J.

    2006-05-01

Differences in energy partitioning of seismic phases from earthquakes and explosions provide the opportunity for event identification. In this talk, I will briefly review teleseismic Ms:mb and P/S ratio techniques that help identify events based on differences in compressional, shear, and surface wave energy generation from explosions and earthquakes. With the push to identify smaller yield explosions, the identification process has become increasingly complex, as varied types of explosions, including chemical, mining, and nuclear, must be identified at regional distances. Thus, I will highlight some of the current views and problems associated with the energy partitioning of seismic phases from single- and delay-fired chemical explosions. One problem yet to have a universally accepted answer is whether the explosion and earthquake populations, based on the Ms:mb discriminants, should be separated at smaller magnitudes. I will briefly describe the datasets and theory that support either converging or parallel behavior of these populations. Also, I will discuss improvements to the currently used methods that will better constrain this problem in the future. I will also discuss the role of regional P/S ratios in identifying explosions. In particular, recent datasets from South Africa, Scandinavia, and the Western United States collected from earthquakes, single-fired chemical explosions, and/or delay-fired mining explosions have provided new insight into regional P, S, Lg, and Rg energy partitioning. Data from co-located mining and chemical explosions suggest that some mining explosions may be used for limited calibration of regional discriminants in regions where no historical explosion data are available.

  13. Sorting protein decoys by machine-learning-to-rank

    PubMed Central

    Jing, Xiaoyang; Wang, Kai; Lu, Ruqian; Dong, Qiwen

    2016-01-01

Much progress has been made in protein structure prediction during the last few decades. As predicted models can span a broad accuracy spectrum, accurate quality estimation has become one of the key elements of successful protein structure prediction. Over the past years, a number of methods have been developed to address this issue, and these methods can be roughly divided into three categories: single-model methods, clustering-based methods, and quasi-single-model methods. In this study, we first develop a single-model method, MQAPRank, based on a learning-to-rank algorithm, and then implement a quasi-single-model method, Quasi-MQAPRank. The proposed methods are benchmarked on the 3DRobot and CASP11 datasets. Five-fold cross-validation on the 3DRobot dataset shows that the proposed single-model method outperforms the other methods whose outputs are taken as its features, and that the quasi-single-model method further enhances performance. On the CASP11 dataset, the proposed methods also perform well compared with other leading methods in the corresponding categories. In particular, the Quasi-MQAPRank method achieves considerable performance on the CASP11 Best150 dataset. PMID:27530967

  15. Protein Structure Classification and Loop Modeling Using Multiple Ramachandran Distributions.

    PubMed

    Najibi, Seyed Morteza; Maadooliat, Mehdi; Zhou, Lan; Huang, Jianhua Z; Gao, Xin

    2017-01-01

Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of protein structures based on the differences and similarities between different Ramachandran plots. Despite the presence of statistical methods for modeling angular data of proteins, there is still a substantial need for more sophisticated and faster statistical tools to model large-scale circular datasets. To address this need, we have developed a nonparametric method for collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. The proposed method takes into account the circular nature of the angular data using trigonometric splines, which is more efficient than existing methods. This collective density estimation approach is widely applicable when there is a need to estimate multiple density functions from different populations with common features. Moreover, the coefficients of the adaptive basis expansion for the fitted densities provide a low-dimensional representation that is useful for visualization, clustering, and classification of the densities. The proposed method provides a novel and unique perspective on two important and challenging problems in protein structure research: structure-based protein classification and angular-sampling-based protein loop structure prediction.

  16. Exact Calculation of the Joint Allele Frequency Spectrum for Isolation with Migration Models.

    PubMed

    Kern, Andrew D; Hey, Jody

    2017-09-01

Population genomic datasets collected over the past decade have spurred interest in developing methods that can utilize massive numbers of loci for inference of demographic and selective histories of populations. The allele frequency spectrum (AFS) provides a convenient statistic for such analysis, and, accordingly, much attention has been paid to predicting theoretical expectations of the AFS under a number of different models. However, to date, exact solutions for the joint AFS of two or more populations under models of migration and divergence have not been found. Here, we present a novel Markov chain representation of the coalescent on the state space of the joint AFS that allows for rapid, exact calculation of the joint AFS under isolation with migration (IM) models. In turn, we show how our Markov chain method, in the context of composite likelihood estimation, can be used for accurate inference of parameters of the IM model using SNP data. Lastly, we apply our method to recent whole genome datasets from African Drosophila melanogaster. Copyright © 2017 Kern and Hey.
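
For intuition about the AFS itself (not the paper's Markov chain method, which handles the much harder joint two-population case), the classic single-population neutral expectation is that the number of segregating sites with derived-allele count i is proportional to 1/i. A minimal sketch of the expected proportions for a sample of n chromosomes:

```python
def expected_neutral_afs(n):
    """Expected proportion of segregating sites with derived-allele
    count i (1 <= i <= n-1) in a sample of n chromosomes, under the
    standard neutral model: E[xi_i] proportional to 1/i."""
    raw = [1.0 / i for i in range(1, n)]
    total = sum(raw)
    return [x / total for x in raw]

afs = expected_neutral_afs(10)
# Singletons are the most common frequency class under neutrality.
assert afs[0] == max(afs)
print([round(p, 3) for p in afs])
```

Demographic processes such as migration and divergence distort this spectrum in characteristic ways, which is why the joint AFS carries the information the IM inference exploits.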

  17. A three-step approach for the derivation and validation of high-performing predictive models using an operational dataset: congestive heart failure readmission case study.

    PubMed

    AbdelRahman, Samir E; Zhang, Mingyuan; Bray, Bruce E; Kawamoto, Kensaku

    2014-05-27

The aim of this study was to propose an analytical approach to develop high-performing predictive models for congestive heart failure (CHF) readmission using an operational dataset with incomplete records and changing data over time. Our analytical approach involves three steps: pre-processing, systematic model development, and risk factor analysis. For pre-processing, variables that were absent in >50% of records were removed. Moreover, the dataset was divided into a validation dataset and derivation datasets, which were separated into three temporal subsets based on changes to the data over time. For systematic model development, using the different temporal datasets and the remaining explanatory variables, the models were developed by combining the use of various (i) statistical analyses to explore the relationships between the validation and the derivation datasets; (ii) adjustment methods for handling missing values; (iii) classifiers; (iv) feature selection methods; and (v) discretization methods. We then selected the best derivation dataset and the models with the highest predictive performance. For risk factor analysis, factors in the highest-performing predictive models were analyzed and ranked using (i) statistical analyses of the best derivation dataset, (ii) feature rankers, and (iii) a newly developed algorithm to categorize risk factors as being strong, regular, or weak. The analysis dataset consisted of 2,787 CHF hospitalizations at University of Utah Health Care from January 2003 to June 2013. In this study, we used the complete-case analysis and mean-based imputation adjustment methods; the wrapper subset feature selection method; and four ranking strategies based on information gain, gain ratio, symmetrical uncertainty, and wrapper subset feature evaluators. 
The best-performing models resulted from the use of a complete-case analysis derivation dataset combined with the Class-Attribute Contingency Coefficient discretization method and a voting classifier which averaged the results of multi-nominal logistic regression and voting feature intervals classifiers. Of 42 final model risk factors, discharge disposition, discretized age, and indicators of anemia were the most significant. This model achieved a c-statistic of 86.8%. The proposed three-step analytical approach enhanced predictive model performance for CHF readmissions. It could potentially be leveraged to improve predictive model performance in other areas of clinical medicine.
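
The two missing-value adjustment methods named above, complete-case analysis and mean-based imputation, can be sketched as follows on a hypothetical toy table (illustrative only, not the study's actual pre-processing code):

```python
def complete_case(rows):
    """Complete-case analysis: drop any record with a missing value (None)."""
    return [r for r in rows if None not in r.values()]

def mean_impute(rows):
    """Mean-based imputation: replace each missing numeric value with
    the column mean computed over the observed values."""
    cols = rows[0].keys()
    means = {}
    for c in cols:
        observed = [r[c] for r in rows if r[c] is not None]
        means[c] = sum(observed) / len(observed)
    return [{c: (r[c] if r[c] is not None else means[c]) for c in cols}
            for r in rows]

# Hypothetical CHF-style records with a missing ejection fraction.
rows = [
    {"age": 71, "ejection_fraction": 35.0},
    {"age": 64, "ejection_fraction": None},
    {"age": 80, "ejection_fraction": 25.0},
]
print(len(complete_case(rows)))                   # 2 records survive
print(mean_impute(rows)[1]["ejection_fraction"])  # imputed to 30.0
```

The trade-off mirrors the study's design: complete-case analysis discards information but introduces no fabricated values, while mean imputation keeps every record at the cost of shrinking variance in the imputed column.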

  18. LOSITAN: a workbench to detect molecular adaptation based on a Fst-outlier method.

    PubMed

    Antao, Tiago; Lopes, Ana; Lopes, Ricardo J; Beja-Pereira, Albano; Luikart, Gordon

    2008-07-28

Testing for selection is becoming one of the most important steps in the analysis of multilocus population genetics datasets. Existing applications are difficult to use, leaving many non-trivial, error-prone tasks to the user. Here we present LOSITAN, a selection detection workbench based on a well-evaluated Fst-outlier detection method. LOSITAN greatly facilitates correct approximation of model parameters (e.g., genome-wide average, neutral Fst), and provides data import and export functions, iterative contour smoothing, and generation of graphics in an easy-to-use graphical user interface. LOSITAN is able to exploit modern multi-core processor architectures by locally parallelizing fdist, reducing computation time by half on current dual-core machines, with almost linear performance gains on machines with more cores. LOSITAN makes selection detection feasible for a much wider range of users, even for large population genomic datasets, by providing both an easy-to-use interface and the essential functionality to complete the whole selection detection process.
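
Fst-outlier methods like the one LOSITAN wraps flag loci whose Fst is extreme relative to the neutral expectation for their heterozygosity. A minimal sketch of a basic two-population, per-locus Fst from allele frequencies (Wright's formulation; fdist's actual estimator and its coalescent simulation machinery are more involved):

```python
def fst(p1, p2):
    """Wright's Fst for one biallelic locus from allele frequencies in
    two populations: Fst = (Ht - Hs) / Ht, where Ht is the expected
    heterozygosity of the pooled population and Hs the mean
    within-population heterozygosity."""
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return 0.0 if ht == 0 else (ht - hs) / ht

# A locus with very different frequencies is a candidate selection outlier.
print(round(fst(0.9, 0.1), 3))    # strong differentiation: 0.64
print(round(fst(0.52, 0.48), 3))  # near zero: little differentiation
```

An outlier scan computes such values genome-wide, simulates the neutral Fst distribution conditional on heterozygosity, and flags loci falling outside the simulated envelope.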

  19. Large Scale Flood Risk Analysis using a New Hyper-resolution Population Dataset

    NASA Astrophysics Data System (ADS)

    Smith, A.; Neal, J. C.; Bates, P. D.; Quinn, N.; Wing, O.

    2017-12-01

Here we present the first national-scale flood risk analyses using high-resolution Facebook Connectivity Lab population data and data from a hyper-resolution flood hazard model. In recent years the field of large-scale hydraulic modelling has been transformed by new remotely sensed datasets, improved process representation, highly efficient flow algorithms, and increases in computational power. These developments have allowed flood risk analysis to be undertaken in previously unmodelled territories and at continental to global scales. Flood risk analyses are typically conducted via the integration of modelled water depths with an exposure dataset. Over large scales and in data-poor areas, these exposure data typically take the form of a gridded population dataset, estimating population density using remotely sensed data and/or locally available census data. The local nature of flooding dictates that for robust flood risk analysis to be undertaken, both hazard and exposure data should sufficiently resolve local-scale features. Global flood frameworks now enable flood hazard data to be produced at 90 m resolution, resulting in a mismatch with available population datasets, which are typically more coarsely resolved. Moreover, these exposure data are typically focused on urban areas and struggle to represent rural populations. In this study we integrate a new population dataset with a global flood hazard model. The population dataset was produced by the Connectivity Lab at Facebook and provides gridded population data at 5 m resolution, a resolution increase of multiple orders of magnitude over previous countrywide datasets. Flood risk analyses undertaken over a number of developing countries are presented, along with a comparison of flood risk analyses undertaken using pre-existing population datasets.
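
The hazard-exposure integration described above reduces, in its simplest form, to overlaying a water-depth grid on a population grid and summing the population in wet cells. A minimal sketch on hypothetical 3x3 grids (real analyses also account for depth-damage relationships and return periods):

```python
def exposed_population(depth_grid, population_grid, threshold=0.0):
    """Sum population in cells where modelled water depth (metres)
    exceeds the threshold. Grids must share shape and alignment."""
    total = 0.0
    for depth_row, pop_row in zip(depth_grid, population_grid):
        for depth, pop in zip(depth_row, pop_row):
            if depth > threshold:
                total += pop
    return total

# Hypothetical co-registered grids: water depths (m) and people per cell.
depths = [
    [0.0, 0.2, 1.5],
    [0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0],
]
population = [
    [120, 300, 45],
    [80,  250, 60],
    [500, 900, 10],
]
print(exposed_population(depths, population))  # -> 405.0 (300 + 45 + 60)
```

This is also why the resolution mismatch matters: if the population grid is much coarser than the depth grid, wet and dry populations within one coarse cell cannot be distinguished, biasing the exposure estimate.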

  20. Comparison of Prediction Model for Cardiovascular Autonomic Dysfunction Using Artificial Neural Network and Logistic Regression Analysis

    PubMed Central

    Zeng, Fangfang; Li, Zhongtao; Yu, Xiaoling; Zhou, Linuo

    2013-01-01

    Background This study aimed to develop prediction models for cardiovascular autonomic (CA) dysfunction in the general population using artificial neural network (ANN) and multivariable logistic regression (LR) analyses, and to compare the models produced by the two approaches. Methods and Materials We analyzed a previous dataset based on a Chinese population sample consisting of 2,092 individuals aged 30–80 years. The prediction models were derived from an exploratory set using ANN and LR analysis, and were tested in the validation set. The performances of these prediction models were then compared. Results Univariate analysis indicated that 14 risk factors showed statistically significant associations with the prevalence of CA dysfunction (P<0.05). The mean area under the receiver-operating curve was 0.758 (95% CI 0.724–0.793) for LR and 0.762 (95% CI 0.732–0.793) for ANN analysis, and a noninferiority result was found (P<0.001). Similar results were found in comparisons of sensitivity, specificity, and predictive values between the LR and ANN prediction models. Conclusion Prediction models for CA dysfunction were developed using ANN and LR. Both are effective tools for developing prediction models based on our dataset. PMID:23940593

  1. Learning to recognize rat social behavior: Novel dataset and cross-dataset application.

    PubMed

    Lorbach, Malte; Kyriakou, Elisavet I; Poppe, Ronald; van Dam, Elsbeth A; Noldus, Lucas P J J; Veltkamp, Remco C

    2018-04-15

    Social behavior is an important aspect of rodent models. Automated measuring tools that make use of video analysis and machine learning are an increasingly attractive alternative to manual annotation. Because machine learning-based methods need to be trained, it is important that they are validated using data from different experimental settings. To develop and validate automated measuring tools, there is a need for annotated rodent interaction datasets. Currently, the availability of such datasets is limited to two mouse datasets. We introduce the first publicly available rat social interaction dataset, RatSI. We demonstrate the practical value of the novel dataset by using it as the training set for a rat interaction recognition method. We show that behavior variations induced by the experimental setting can lead to reduced performance, which illustrates the importance of cross-dataset validation. Consequently, we add a simple adaptation step to our method and improve the recognition performance. Most existing methods are trained and evaluated in one experimental setting, which limits the predictive power of the evaluation to that particular setting. We demonstrate that cross-dataset experiments provide more insight into the performance of classifiers. With our novel, public dataset we encourage the development and validation of automated recognition methods. We are convinced that cross-dataset validation enhances our understanding of rodent interactions and facilitates the development of more sophisticated recognition methods. Combining them with adaptation techniques may enable us to apply automated recognition methods to a variety of animals and experimental settings. Copyright © 2017 Elsevier B.V. All rights reserved.

  2. A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks.

    PubMed

    Zhang, Haitao; Wu, Chenxue; Chen, Zewei; Liu, Zhao; Zhu, Yunhong

    2017-01-01

    Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, we first defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large-scale anonymity datasets. We then proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are identified by correlating privacy-sensitive spatial regions with the spatial grid cells appearing in the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method and to explore the influence of the parameter K. The results demonstrated that our proposed approach is faster and more effective at hiding privacy-sensitive sequence rules, and thus at eliminating inference attacks, achieving higher hidden-sensitive-rule ratios than the traditional spatial-temporal k-anonymity method. Our method also had fewer side effects in terms of the ratio of newly generated sensitive rules, and essentially the same side effects in terms of the variation ratio of non-sensitive rules. Furthermore, we characterized how performance varies with the parameter K, which helps hide the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules.

  3. A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks

    PubMed Central

    Wu, Chenxue; Liu, Zhao; Zhu, Yunhong

    2017-01-01

    Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, we first defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large-scale anonymity datasets. We then proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are identified by correlating privacy-sensitive spatial regions with the spatial grid cells appearing in the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method and to explore the influence of the parameter K. The results demonstrated that our proposed approach is faster and more effective at hiding privacy-sensitive sequence rules, and thus at eliminating inference attacks, achieving higher hidden-sensitive-rule ratios than the traditional spatial-temporal k-anonymity method. Our method also had fewer side effects in terms of the ratio of newly generated sensitive rules, and essentially the same side effects in terms of the variation ratio of non-sensitive rules. Furthermore, we characterized how performance varies with the parameter K, which helps hide the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules. PMID:28767687
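    The sequence rules at the heart of the attack model are characterised by support and confidence. As a minimal illustration with hypothetical grid-cell IDs (two-item rules only, far simpler than the full sequence-rule mining described above):

```python
def sequence_rules(trajs, min_sup=0.3, min_conf=0.6):
    """Mine simple two-item sequence rules a -> b from cell-ID trajectories.

    support(a -> b) = fraction of trajectories where a occurs before b
    (not necessarily adjacent); confidence = support(a -> b) / support(a).
    """
    n = len(trajs)

    def sup_pair(a, b):
        return sum(any(x == a and b in t[i + 1:] for i, x in enumerate(t))
                   for t in trajs) / n

    def sup_item(a):
        return sum(a in t for t in trajs) / n

    cells = {c for t in trajs for c in t}
    rules = {}
    for a in cells:
        for b in cells:
            if a == b:
                continue
            s = sup_pair(a, b)
            if s >= min_sup and s / sup_item(a) >= min_conf:
                rules[(a, b)] = (s, s / sup_item(a))
    return rules

# Hypothetical anonymised trajectories; "clinic" plays the role of a
# privacy-sensitive region that a mined rule such as g12 -> clinic exposes.
trajs = [["home", "g12", "clinic"], ["home", "g12", "mall"],
         ["home", "g12", "clinic"], ["g07", "mall"]]
print(sequence_rules(trajs))
```

    A rule like ("g12", "clinic") is exactly the kind the generalization/avoidance step would then hide by coarsening the cells of some trajectories until its support or confidence drops below threshold.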

  4. C-A1-03: Considerations in the Design and Use of an Oracle-based Virtual Data Warehouse

    PubMed Central

    Bredfeldt, Christine; McFarland, Lela

    2011-01-01

    Background/Aims The amount of clinical data available for research is growing exponentially. As it grows, increasing the efficiency of both data storage and data access becomes critical. Relational database management systems (rDBMS) such as Oracle are ideal solutions for managing longitudinal clinical data because they support large-scale data storage and highly efficient data retrieval. In addition, they can greatly simplify the management of large data warehouses, including security management and regular data refreshes. However, the HMORN Virtual Data Warehouse (VDW) was originally designed based on SAS datasets, and this design choice has a number of implications for both the design and use of an Oracle-based VDW. From a design standpoint, VDW tables are designed as flat SAS datasets, which do not take full advantage of Oracle indexing capabilities. From a data retrieval standpoint, standard VDW SAS scripts do not take advantage of SAS pass-through SQL capabilities to enable Oracle to perform the processing required to narrow datasets to the population of interest. Methods Beginning in 2009, the research department at Kaiser Permanente in the Mid-Atlantic States (KPMA) has developed an Oracle-based VDW according to the HMORN v3 specifications. In order to take advantage of the strengths of relational databases, KPMA introduced an interface layer to the VDW data, using views to provide access to standardized VDW variables. In addition, KPMA has developed SAS programs that provide access to SQL pass-through processing for first-pass data extraction into SAS VDW datasets for processing by standard VDW scripts. Results We discuss both the design and performance considerations specific to the KPMA Oracle-based VDW. We benchmarked performance of the Oracle-based VDW using both standard VDW scripts and an initial pre-processing layer to evaluate speed and accuracy of data return. 
Conclusions Adapting the VDW for deployment in an Oracle environment required minor changes to the underlying structure of the data. Further modifications of the underlying data structure would lead to performance enhancements. Maximally efficient data access for standard VDW scripts requires an extra step that involves restricting the data to the population of interest at the data server level prior to standard processing.

  5. Attribute Weighting Based K-Nearest Neighbor Using Gain Ratio

    NASA Astrophysics Data System (ADS)

    Nababan, A. A.; Sitompul, O. S.; Tulus

    2018-04-01

    K-Nearest Neighbor (KNN) is a good classifier, but several studies have found that its classification accuracy is still lower than that of other methods. One cause of this low accuracy is that every attribute has the same influence on the classification process, so less relevant attributes can lead to misclassification of new data. In this research, we propose an attribute-weighted K-Nearest Neighbor that uses the Gain Ratio both to measure the correlation between each attribute and the class and as the basis for weighting each attribute of the dataset. The resulting accuracy is compared with that of the original KNN method using 10-fold cross-validation on several datasets from the UCI Machine Learning Repository and the KEEL-Dataset Repository: abalone, glass identification, haberman, hayes-roth and water quality status. Based on the test results, the proposed method was able to increase the classification accuracy of KNN; the largest improvement, 12.73%, was obtained on the hayes-roth dataset and the smallest, 0.07%, on the abalone dataset. Averaged over all datasets, accuracy increased by 5.33%.
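    The weighting idea can be sketched in a few lines, assuming discretised attributes and taking gain ratio as information gain divided by split information; the authors' exact weighting scheme may differ:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, labels):
    """Information gain of a discrete attribute over the class labels,
    normalised by the attribute's own entropy (split information)."""
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    cond = sum(w * entropy(labels[feature == v]) for v, w in zip(values, weights))
    split_info = -(weights * np.log2(weights)).sum()
    return (entropy(labels) - cond) / split_info if split_info > 0 else 0.0

def weighted_knn_predict(X_train, y_train, x, weights, k=3):
    """Plain KNN, except each attribute's squared difference is scaled by
    its gain-ratio weight before neighbours are ranked."""
    d2 = ((X_train - x) ** 2 * weights).sum(axis=1)
    nearest = y_train[np.argsort(d2)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy discretised data: the class copies attribute 0; attribute 1 is noise,
# so it receives a much smaller weight and barely affects distances.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 0]])
y = np.array([0, 0, 1, 1, 0, 1])
w = np.array([gain_ratio(X[:, j], y) for j in range(X.shape[1])])
print(w, weighted_knn_predict(X, y, np.array([1, 1]), w))
```

    Here the informative attribute gets weight 1.0 while the noisy one gets a near-zero weight, which is the mechanism by which less relevant attributes stop pulling in wrong neighbours.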

  6. Topic modeling for cluster analysis of large biological and medical datasets

    PubMed Central

    2014-01-01

    Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyperdimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. The resulting clusters represent true groupings and subgroupings in the data more faithfully than those from traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets. PMID:25350106

  7. Topic modeling for cluster analysis of large biological and medical datasets.

    PubMed

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyperdimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. The resulting clusters represent true groupings and subgroupings in the data more faithfully than those from traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
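    Of the three methods, highest probable topic assignment is the simplest to sketch: fit a topic model to the count data, then assign each sample to its most probable topic. To keep the example self-contained, the sketch below substitutes a small mixture of multinomials fitted by EM for a full LDA implementation (an assumed simplification; the per-sample responsibilities play the role of LDA's document-topic distribution):

```python
import numpy as np

def mixture_of_multinomials(X, K, iters=200, seed=0):
    """Cluster count vectors with a mixture of multinomials fitted by EM,
    then assign each sample its highest-probability component -- a
    simplified stand-in for 'highest probable topic assignment'."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                     # mixing weights
    phi = rng.dirichlet(np.ones(d), size=K)      # K x d feature distributions
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for stability.
        log_r = np.log(pi) + X @ np.log(phi).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and smoothed feature distributions.
        pi = r.mean(axis=0)
        phi = r.T @ X + 1e-6
        phi /= phi.sum(axis=1, keepdims=True)
    return r.argmax(axis=1)

# Two planted groups with disjoint feature support (e.g. PFGE band counts).
rng = np.random.default_rng(1)
X = np.vstack([rng.poisson([5, 5, 5, 0, 0, 0], size=(10, 6)),
               rng.poisson([0, 0, 0, 5, 5, 5], size=(10, 6))])
labels = mixture_of_multinomials(X, K=2)
print(labels)
```

    The argmax over per-sample component probabilities is the clustering step; feature selection and feature extraction would instead feed topic-derived features into a conventional clustering algorithm.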

  8. A Simple Sampling Method for Estimating the Accuracy of Large Scale Record Linkage Projects.

    PubMed

    Boyd, James H; Guiver, Tenniel; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Anderson, Phil; Dickinson, Teresa

    2016-05-17

    Record linkage techniques allow different data collections to be brought together to provide a wider picture of the health status of individuals. Ensuring high linkage quality is important to guarantee the quality and integrity of research. Current methods for measuring linkage quality typically focus on precision (the proportion of accepted links that are correct), given the difficulty of measuring the proportion of false negatives. The aim of this work is to introduce and evaluate a sampling-based method to estimate both precision and recall following record linkage. In the sampling-based method, record pairs from each threshold (including those below the identified cut-off for acceptance) are sampled and clerically reviewed. These results are then applied to the entire set of record pairs, providing estimates of false positives and false negatives. This method was evaluated on a synthetically generated dataset, where the true match status (which records belonged to the same person) was known. The sampled estimates of linkage quality were relatively close to the actual linkage quality metrics calculated for the whole synthetic dataset. The precision and recall measures for seven reviewers were very consistent, with little variation in the clerical assessment results (overall agreement using Fleiss' kappa statistic was 0.601). This method presents as a possible means of accurately estimating matching quality and refining linkages in population-level linkage studies. The sampling approach is especially important for large project linkages where the number of record pairs produced may be very large, often running into the millions.
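    The stratified-sampling scheme can be sketched as follows, with a truth lookup standing in for clerical review; the bins, sample size and cutoff here are illustrative, not those of the study:

```python
import random

def estimate_linkage_quality(pairs, truth, cutoff, bins, sample_size, seed=0):
    """Stratified-sampling estimate of linkage precision and recall.

    pairs: list of (score, pair_id); truth[pair_id] stands in for clerical
    review of a sampled pair. Pairs are grouped into score bins (including
    bins below the acceptance cutoff), a sample from each bin is 'reviewed',
    and the per-bin match rate is scaled up to the whole bin.
    """
    rng = random.Random(seed)
    est_above = est_below = 0.0
    links_above = 0
    for lo, hi in bins:
        stratum = [pid for score, pid in pairs if lo <= score < hi]
        if not stratum:
            continue
        sample = rng.sample(stratum, min(sample_size, len(stratum)))
        match_rate = sum(truth[pid] for pid in sample) / len(sample)
        if lo >= cutoff:                       # bin of accepted links
            est_above += match_rate * len(stratum)
            links_above += len(stratum)
        else:                                  # bin of rejected pairs
            est_below += match_rate * len(stratum)
    precision = est_above / links_above if links_above else 0.0
    total = est_above + est_below
    recall = est_above / total if total else 0.0
    return precision, recall

# Synthetic example: 100 accepted pairs (80 true), 100 rejected (10 true).
pairs = [(0.9, i) for i in range(100)] + [(0.1, 100 + i) for i in range(100)]
truth = {i: i < 80 for i in range(100)}
truth.update({100 + i: i < 10 for i in range(100)})
p, r = estimate_linkage_quality(pairs, truth, cutoff=0.5,
                                bins=[(0.0, 0.5), (0.5, 1.01)],
                                sample_size=100)
print(round(p, 3), round(r, 3))   # 0.8 0.889
```

    Sampling below the cutoff is what makes the recall estimate possible: the matches found in rejected bins are the otherwise-unmeasurable false negatives.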

  9. A fast least-squares algorithm for population inference

    PubMed Central

    2013-01-01

    Background Population inference is an important problem in genetics used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual’s genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling methods such as Markov Chain Monte Carlo methods or faster gradient based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model motivated by a Euclidean interpretation of the genotype feature space. This results in a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning. Results We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The Least-squares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, small number of samples, or greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than Least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. 
Significantly, the least-squares approach nearly always converges 1.5 to 6 times faster. Conclusions The computational advantage of the least-squares approach, along with its good estimation performance, warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate. PMID:23343408

  10. A fast least-squares algorithm for population inference.

    PubMed

    Parry, R Mitchell; Wang, May D

    2013-01-23

    Population inference is an important problem in genetics used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual's genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling methods such as Markov Chain Monte Carlo methods or faster gradient based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model motivated by a Euclidean interpretation of the genotype feature space. This results in a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning. We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The Least-squares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, small number of samples, or greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than Least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. 
Significantly, the least-squares approach nearly always converges 1.5 to 6 times faster. The computational advantage of the least-squares approach, along with its good estimation performance, warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.

  11. Cluster ensemble based on Random Forests for genetic data.

    PubMed

    Alhusain, Luluah; Hafez, Alaaeldin M

    2017-01-01

    Clustering plays a crucial role in several application domains, such as bioinformatics. In bioinformatics, clustering has been extensively used as an approach for detecting interesting patterns in genetic data. One application is population structure analysis, which aims to group individuals into subpopulations based on shared genetic variations, such as single nucleotide polymorphisms. Advances in DNA sequencing technology have facilitated the obtainment of genetic datasets with exceptional sizes. Genetic data usually contain hundreds of thousands of genetic markers genotyped for thousands of individuals, making an efficient means for handling such data desirable. Random Forests (RFs) has emerged as an efficient algorithm capable of handling high-dimensional data. RFs provides a proximity measure that can capture different levels of co-occurring relationships between variables. RFs has been widely considered a supervised learning method, although it can be converted into an unsupervised learning method. Therefore, an RF-derived proximity measure combined with a clustering technique may be well suited for determining the underlying structure of unlabeled data. This paper proposes RFcluE, a cluster ensemble approach for determining the underlying structure of genetic data based on RFs. The approach comprises a cluster ensemble framework to combine multiple runs of RF clustering. Experiments were conducted on a high-dimensional, real genetic dataset to evaluate the proposed approach. The experiments included an examination of the impact of parameter changes, comparing RFcluE performance against other clustering methods, and an assessment of the relationship between the diversity and quality of the ensemble and its effect on RFcluE performance. This paper proposes RFcluE, a cluster ensemble approach based on RF clustering, to address the problem of population structure analysis and demonstrates the effectiveness of the approach. 
The paper also illustrates that applying a cluster ensemble approach, combining multiple RF clusterings, produces more robust and higher-quality results as a consequence of feeding the ensemble with diverse views of high-dimensional genetic data obtained through bagging and random subspace, the two key features of the RF algorithm.

  12. Privacy-preserving record linkage on large real world datasets.

    PubMed

    Randall, Sean M; Ferrante, Anna M; Boyd, James H; Bauer, Jacqueline K; Semmens, James B

    2014-08-01

    Record linkage typically involves the use of dedicated linkage units, which are supplied with personally identifying information to determine individuals from within and across datasets. The personally identifying information supplied to linkage units is separated from clinical information prior to release by data custodians. While this substantially reduces the risk of disclosure of sensitive information, some residual risks still exist and remain a concern for some custodians. In this paper we trial, on large real-world administrative data, a method of record linkage which reduces privacy risk still further. The method uses encrypted personal identifying information (bloom filters) in a probability-based linkage framework. The privacy-preserving linkage method was tested on ten years of New South Wales (NSW) and Western Australian (WA) hospital admissions data, comprising in total over 26 million records. No difference in linkage quality was found when the results were compared to traditional probabilistic methods using full unencrypted personal identifiers. This presents as a possible means of reducing privacy risks related to record linkage in population-level research studies. It is hoped that through adaptations of this method or similar privacy-preserving methods, risks related to information disclosure can be reduced so that the benefits of linked research can be fully realised. Copyright © 2013 Elsevier Inc. All rights reserved.
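    The bloom-filter encoding used in this style of privacy-preserving linkage can be sketched as follows: each identifier is split into bigrams, each bigram is hashed into a fixed-length bit set, and two encodings are compared with the Dice coefficient. The filter length m, hash count k and double-hashing scheme below are illustrative assumptions, not the study's parameters:

```python
import hashlib

def bigrams(s):
    s = f"_{s.lower()}_"            # pad so first/last letters form bigrams
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(value, m=100, k=4):
    """Hash each bigram of `value` into a length-m bit set using k hash
    functions derived by double hashing (a common Bloom-filter scheme)."""
    bits = set()
    for gram in bigrams(value):
        h1 = int(hashlib.md5(gram.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(gram.encode()).hexdigest(), 16)
        for i in range(k):
            bits.add((h1 + i * h2) % m)
    return bits

def dice(a, b):
    """Dice similarity of two bit sets: 2|A & B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0

# Similar names score highly without the raw values ever being exchanged.
print(dice(bloom_encode("catherine"), bloom_encode("katherine")))
print(dice(bloom_encode("catherine"), bloom_encode("zhang")))
```

    Because similar strings share most bigrams, their filters overlap heavily, so probabilistic linkage can still tolerate typographical variation while the linkage unit never sees the plaintext identifiers.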

  13. Semi-automatic segmentation of brain tumors using population and individual information.

    PubMed

    Wu, Yao; Yang, Wei; Jiang, Jun; Li, Shuanqian; Feng, Qianjin; Chen, Wufan

    2013-08-01

    Efficient segmentation of tumors in medical images is of great practical importance in early diagnosis and radiation planning. This paper proposes a novel semi-automatic segmentation method based on population and individual statistical information to segment brain tumors in magnetic resonance (MR) images. First, high-dimensional image features are extracted. Neighborhood components analysis is proposed to learn two optimal distance metrics, which contain population and patient-specific information, respectively. The probability of each pixel belonging to the foreground (tumor) and the background is estimated by the k-nearest neighbor classifier under the learned optimal distance metrics. A cost function for segmentation is constructed through these probabilities and is optimized using graph cuts. Finally, some morphological operations are performed to improve the achieved segmentation results. Our dataset consists of 137 brain MR images, including 68 for training and 69 for testing. The proposed method overcomes segmentation difficulties caused by the uneven gray-level distribution of the tumors and can obtain satisfactory results even when the tumors have fuzzy edges. Experimental results demonstrate that the proposed method is robust for brain tumor segmentation.

  14. Survey-based socio-economic data from slums in Bangalore, India

    PubMed Central

    Roy, Debraj; Palavalli, Bharath; Menon, Niveditha; King, Robin; Pfeffer, Karin; Lees, Michael; Sloot, Peter M. A.

    2018-01-01

    In 2010, an estimated 860 million people were living in slums worldwide, with around 60 million added to the slum population between 2000 and 2010. In 2011, 200 million people in urban Indian households were considered to live in slums. In order to design slum development programmes and poverty alleviation methods, it is necessary to understand the needs of these communities. Therefore, we require data with high granularity in the Indian context. Unfortunately, there is a paucity of highly granular data at the level of individual slums. We collected the data presented in this paper in partnership with the slum dwellers in order to overcome challenges such as the validity and efficacy of self-reported data. Our survey of Bangalore covered 36 slums across the city. The slums were chosen based on stratification criteria, which included the geographical location of the slum, whether the slum was resettled or rehabilitated, the notification status of the slum, the size of the slum and the religious profile. This paper describes the relational model of the slum dataset, the variables in the dataset, the variables constructed for analysis and the issues identified with the dataset. The data collected include around 267,894 data points spread over 242 questions for 1,107 households. The dataset can facilitate interdisciplinary research on spatial and temporal dynamics of urban poverty and well-being in the context of rapid urbanization of cities in developing countries. PMID:29313840

  15. Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades.

    PubMed

    Orchard, Garrick; Jayawant, Ajinkya; Cohen, Gregory K; Thakor, Nitish

    2015-01-01

    Creating datasets for Neuromorphic Vision is a challenging task. A lack of available recordings from Neuromorphic Vision sensors means that data must typically be recorded specifically for dataset creation rather than collecting and labeling existing data. The task is further complicated by a desire to simultaneously provide traditional frame-based recordings to allow for direct comparison with traditional Computer Vision algorithms. Here we propose a method for converting existing Computer Vision static image datasets into Neuromorphic Vision datasets using an actuated pan-tilt camera platform. Moving the sensor rather than the scene or image is a more biologically realistic approach to sensing and eliminates timing artifacts introduced by monitor updates when simulating motion on a computer monitor. We present conversion of two popular image datasets (MNIST and Caltech101) which have played important roles in the development of Computer Vision, and we provide performance metrics on these datasets using spike-based recognition algorithms. This work contributes datasets for future use in the field, as well as results from spike-based algorithms against which future works can compare. Furthermore, by converting datasets already popular in Computer Vision, we enable more direct comparison with frame-based approaches.

  16. Global patterns of current and future road infrastructure

    NASA Astrophysics Data System (ADS)

    Meijer, Johan R.; Huijbregts, Mark A. J.; Schotten, Kees C. G. J.; Schipper, Aafke M.

    2018-06-01

    Georeferenced information on road infrastructure is essential for spatial planning, socio-economic assessments and environmental impact analyses. Yet current global road maps are typically outdated or characterized by spatial bias in coverage. In the Global Roads Inventory Project we gathered, harmonized and integrated nearly 60 geospatial datasets on road infrastructure into a global roads dataset. The resulting dataset covers 222 countries and includes over 21 million km of roads, which is two to three times the total length in the currently best available country-based global roads datasets. We then related total road length per country to country area, population density, GDP and OECD membership, resulting in a regression model with an adjusted R² of 0.90, and found that the highest road densities are associated with densely populated and wealthier countries. Applying our regression model to future population densities and GDP estimates from the Shared Socioeconomic Pathway (SSP) scenarios, we obtained a tentative estimate of 3.0–4.7 million km additional road length for the year 2050. Large increases in road length were projected for developing nations in some of the world’s last remaining wilderness areas, such as the Amazon, the Congo basin and New Guinea. This highlights the need for accurate spatial road datasets to underpin strategic spatial planning in order to reduce the impacts of roads in remaining pristine ecosystems.
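
The kind of cross-country regression described above can be sketched with synthetic data. Everything below is illustrative: the sample size, predictors, and coefficients are invented stand-ins, not the GRIP inventory or its fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical number of countries

# Synthetic predictors: log country area (km^2), log population density, log GDP
log_area = rng.normal(12, 1.5, n)
log_popden = rng.normal(4, 1.0, n)
log_gdp = rng.normal(9, 1.0, n)

# Synthetic response: log total road length, generated from a known linear model
log_roads = 0.8 * log_area + 0.4 * log_popden + 0.3 * log_gdp + rng.normal(0, 0.3, n)

# Ordinary least squares fit
X = np.column_stack([np.ones(n), log_area, log_popden, log_gdp])
beta, *_ = np.linalg.lstsq(X, log_roads, rcond=None)

# Adjusted R^2 of the fitted model
resid = log_roads - X @ beta
ss_res = resid @ resid
ss_tot = ((log_roads - log_roads.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - X.shape[1])
```

In the paper's setting, a model of this form is then applied to projected population densities and GDP from the SSP scenarios to extrapolate future road length.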

  17. Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets.

    PubMed

    Marston, Louise; Peacock, Janet L; Yu, Keming; Brocklehurst, Peter; Calvert, Sandra A; Greenough, Anne; Marlow, Neil

    2009-07-01

    Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except maximum likelihood multilevel modelling with Gauss-Hermite quadrature (ML GH 'xtlogit' in Stata), gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that this be accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children in which there are few multiples), there appears to be less need to adjust for clustering.
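
Why the clustering matters can be illustrated with a small numpy sketch comparing naive standard errors against cluster-robust "sandwich" standard errors, a linear analogue of the adjusted-standard-error approach the authors recommend. The data below are synthetic, not the paediatric datasets.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical neonatal dataset: clusters are births (singletons, twins, triplets)
n_clusters = 300
sizes = rng.choice([1, 1, 1, 2, 2, 3], size=n_clusters)
cluster = np.repeat(np.arange(n_clusters), sizes)
n = cluster.size

x = rng.normal(size=n)                             # e.g. standardized gestational age
u = rng.normal(size=n_clusters)[cluster]           # shared within-cluster effect
y = 1.0 + 0.5 * x + 1.5 * u + rng.normal(size=n)   # continuous outcome

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Cluster-robust (sandwich) variance: outer products of per-cluster score sums
XtX_inv = np.linalg.inv(X.T @ X)
meat = np.zeros((2, 2))
for g in range(n_clusters):
    idx = cluster == g
    s = X[idx].T @ resid[idx]
    meat += np.outer(s, s)
se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# Naive standard errors that wrongly assume independent observations
sigma2 = resid @ resid / (n - 2)
se_naive = np.sqrt(np.diag(XtX_inv) * sigma2)
```

With a shared within-cluster effect present, the naive standard errors understate the uncertainty, which is exactly the inferential error the abstract warns about.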

  18. Shrinkage regression-based methods for microarray missing value imputation.

    PubMed

    Wang, Hsiuying; Chiu, Chia-Chun; Wu, Yi-Ching; Wu, Wei-Sheng

    2013-01-01

    Missing values commonly occur in microarray data, which usually contain more than 5% missing values, with up to 90% of genes affected. Inaccurate missing value estimation reduces the power of downstream microarray data analyses. Many types of methods have been developed to estimate missing values. Among them, regression-based methods are very popular and have been shown to perform better than other types of methods on many testing microarray datasets. To further improve the performance of regression-based methods, we propose shrinkage regression-based methods. Our methods take advantage of the correlation structure in microarray data and select similar genes for the target gene by Pearson correlation coefficients. In addition, our methods incorporate the least squares principle, utilize a shrinkage estimation approach to adjust the coefficients of the regression model, and then use the new coefficients to estimate missing values. Simulation results show that the proposed methods provide more accurate missing value estimation in six testing microarray datasets than the existing regression-based methods do. Imputation of missing values is a very important aspect of microarray data analysis because most downstream analyses require a complete dataset. Therefore, exploring accurate and efficient methods for estimating missing values has become an essential issue. Since our proposed shrinkage regression-based methods provide accurate missing value estimation, they are competitive alternatives to the existing regression-based methods.
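
A rough sketch of the regression-with-shrinkage idea follows. The expression matrix is synthetic, and the fixed shrinkage factor is an illustrative stand-in for the paper's data-driven shrinkage estimate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical expression matrix: genes x samples, with one missing entry
n_genes, n_samples = 100, 20
data = rng.normal(size=(n_genes, n_samples))
data[1:10] += data[0]  # make genes 1-9 correlated with gene 0

target, miss = 0, 5    # gene 0 is missing in sample 5
truth = data[target, miss]
observed = np.delete(np.arange(n_samples), miss)

# 1. Select the k genes most correlated with the target (over observed samples)
cors = np.array([np.corrcoef(data[target, observed], data[g, observed])[0, 1]
                 for g in range(1, n_genes)])
k = 5
similar = 1 + np.argsort(-np.abs(cors))[:k]

# 2. Least-squares regression of the target gene on the similar genes
X = data[similar][:, observed].T
y = data[target, observed]
X1 = np.column_stack([np.ones(len(observed)), X])
coef = np.linalg.lstsq(X1, y, rcond=None)[0]

# 3. Shrink the regression coefficients toward zero (fixed factor for illustration)
shrink = 0.9
coef_shrunk = np.concatenate([[coef[0]], shrink * coef[1:]])

# 4. Impute the missing value from the shrunken model
x_miss = np.concatenate([[1.0], data[similar, miss]])
estimate = x_miss @ coef_shrunk
```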

  19. ETHNOPRED: a novel machine learning method for accurate continental and sub-continental ancestry identification and population stratification correction.

    PubMed

    Hajiloo, Mohsen; Sapkota, Yadav; Mackey, John R; Robson, Paula; Greiner, Russell; Damaraju, Sambasivarao

    2013-02-22

    Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case-control genome-wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification, but each has limitations. We provide an alternative technique to address population stratification. We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual's continental and sub-continental ancestry. To predict an individual's continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross validation accuracy of 100% using the HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control's λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). 
In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively, involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross validation accuracy of 86.5% ± 2.4%, 95.6% ± 3.9%, 95.6% ± 2.1%, 98.3% ± 2.0%, and 95.9% ± 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver. ETHNOPRED is a novel technique for producing classifiers that can identify an individual's continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large scale GWASs, and robust to missing values.

  20. Attributes for NHDplus Catchments (Version 1.1) for the Conterminous United States: Population Density, 2000

    USGS Publications Warehouse

    Wieczorek, Michael; LaMottem, Andrew E.

    2010-01-01

    This data set represents the average population density, in number of people per square kilometer multiplied by 10 for the year 2000, compiled for every catchment of NHDPlus for the conterminous United States. The source data set is the 2000 Population Density by Block Group for the Conterminous United States (Hitt, 2003). The NHDPlus Version 1.1 is an integrated suite of application-ready geospatial datasets that incorporates many of the best features of the National Hydrography Dataset (NHD) and the National Elevation Dataset (NED). The NHDPlus includes a stream network (based on the 1:100,000-scale NHD), improved networking, naming, and value-added attributes (VAAs). NHDPlus also includes elevation-derived catchments (drainage areas) produced using a drainage enforcement technique first widely used in New England, and thus referred to as "the New England Method." This technique involves "burning in" the 1:100,000-scale NHD and, when available, building "walls" using the National Watershed Boundary Dataset (WBD). The resulting modified digital elevation model (HydroDEM) is used to produce hydrologic derivatives that agree with the NHD and WBD. Over the past two years, an interdisciplinary team from the U.S. Geological Survey (USGS), the U.S. Environmental Protection Agency (USEPA), and contractors found that this method produces the best quality NHD catchments using an automated process (USEPA, 2007). The NHDPlus dataset is organized by 18 Production Units that cover the conterminous United States. The NHDPlus version 1.1 data are grouped by the U.S. Geological Survey's Major River Basins (MRBs, Crawford and others, 2006). MRB1, covering the New England and Mid-Atlantic River basins, contains NHDPlus Production Units 1 and 2. MRB2, covering the South Atlantic-Gulf and Tennessee River basins, contains NHDPlus Production Units 3 and 6. 
MRB3, covering the Great Lakes, Ohio, Upper Mississippi, and Souris-Red-Rainy River basins, contains NHDPlus Production Units 4, 5, 7 and 9. MRB4, covering the Missouri River basins, contains NHDPlus Production Units 10-lower and 10-upper. MRB5, covering the Lower Mississippi, Arkansas-White-Red, and Texas-Gulf River basins, contains NHDPlus Production Units 8, 11 and 12. MRB6, covering the Rio Grande, Colorado and Great Basin River basins, contains NHDPlus Production Units 13, 14, 15 and 16. MRB7, covering the Pacific Northwest River basins, contains NHDPlus Production Unit 17. MRB8, covering California River basins, contains NHDPlus Production Unit 18.

  1. Influence analysis in quantitative trait loci detection.

    PubMed

    Dou, Xiaoling; Kuriki, Satoshi; Maeno, Akiteru; Takada, Toyoyuki; Shiroishi, Toshihiko

    2014-07-01

    This paper presents systematic methods for the detection of influential individuals that affect the log odds (LOD) score curve. We derive general formulas of influence functions for profile likelihoods and introduce them into two standard quantitative trait locus detection methods: the interval mapping method and single marker analysis. Besides influence analysis on specific LOD scores, we also develop influence analysis methods on the shape of the LOD score curves. A simulation-based method is proposed to assess the significance of the influence of the individuals. These methods are shown to be useful in the influence analysis of a real dataset of an experimental population from an F2 mouse cross. By receiver operating characteristic analysis, we confirm that the proposed methods show better performance than existing diagnostics. © 2014 The Author. Biometrical Journal published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  2. Interactive visualization and analysis of multimodal datasets for surgical applications.

    PubMed

    Kirmizibayrak, Can; Yim, Yeny; Wakid, Mike; Hahn, James

    2012-12-01

    Surgeons use information from multiple sources when making surgical decisions. These include volumetric datasets (such as CT, PET, MRI, and their variants), 2D datasets (such as endoscopic videos), and vector-valued datasets (such as computer simulations). Presenting all the information to the user in an effective manner is a challenging problem. In this paper, we present a visualization approach that displays the information from various sources in a single coherent view. The system allows the user to explore and manipulate volumetric datasets, display analysis of dataset values in local regions, combine 2D and 3D imaging modalities and display results of vector-based computer simulations. Several interaction methods are discussed: in addition to traditional interfaces including mouse and trackers, gesture-based natural interaction methods are shown to control these visualizations with real-time performance. An example of a medical application (medialization laryngoplasty) is presented to demonstrate how the combination of different modalities can be used in a surgical setting with our approach.

  3. Patterns of genetic diversity in the polymorphic ground snake (Sonora semiannulata).

    PubMed

    Cox, Christian L; Chippindale, Paul T

    2014-08-01

    We evaluated the genetic diversity of a snake species with color polymorphism to understand the evolutionary processes that drive genetic structure across a large geographic region. Specifically, we analyzed the genetic structure of the highly polymorphic ground snake, Sonora semiannulata: (1) among populations, (2) among color morphs, and (3) at regional and local spatial scales, using an amplified fragment length polymorphism dataset and multiple population genetic analyses, including FST-based and clustering analytical techniques. Based upon these methods, we found that there was moderate to low genetic structure among populations. However, this diversity was not associated with geographic locality at either spatial scale. Similarly, we found no evidence for genetic divergence among color morphs at either spatial scale. These results suggest that despite dramatic color polymorphism, this phenotypic diversity is not a major driver of genetic diversity within or among populations of ground snakes. We suggest that there are two mechanisms that could explain existing genetic diversity in ground snakes: recent range expansion from a genetically diverse founder population and current or recent gene flow among populations. Our findings have further implications for the types of color polymorphism that may generate genetic diversity in snakes.

  4. Mapping the spatial distribution of global anthropogenic mercury atmospheric emission inventories

    NASA Astrophysics Data System (ADS)

    Wilson, Simon J.; Steenhuisen, Frits; Pacyna, Jozef M.; Pacyna, Elisabeth G.

    This paper describes the procedures employed to spatially distribute global inventories of anthropogenic emissions of mercury to the atmosphere, prepared by Pacyna, E.G., Pacyna, J.M., Steenhuisen, F., Wilson, S. [2006. Global anthropogenic mercury emission inventory for 2000. Atmospheric Environment, this issue, doi:10.1016/j.atmosenv.2006.03.041], and briefly discusses the results of this work. A new spatially distributed global emission inventory for the (nominal) year 2000, and a revised version of the 1995 inventory are presented. Emissions estimates for total mercury and major species groups are distributed within latitude/longitude-based grids with resolutions of 1°×1° and 0.5°×0.5°. A key component in the spatial distribution procedure is the use of population distribution as a surrogate parameter to distribute emissions from sources that cannot be accurately geographically located. In this connection, new gridded population datasets were prepared, based on the CIESIN GPW3 datasets (CIESIN, 2004. Gridded Population of the World (GPW), Version 3. Center for International Earth Science Information Network (CIESIN), Columbia University and Centro Internacional de Agricultura Tropical (CIAT). GPW3 data are available at http://beta.sedac.ciesin.columbia.edu/gpw/index.jsp). The spatially distributed emissions inventories and population datasets prepared in the course of this work are available on the Internet at www.amap.no/Resources/HgEmissions/

  5. Integrative Spatial Data Analytics for Public Health Studies of New York State

    PubMed Central

    Chen, Xin; Wang, Fusheng

    2016-01-01

    Increased accessibility of health data made available by the government provides a unique opportunity for spatial analytics at much higher resolution to discover patterns of diseases and their correlation with spatial impact indicators. This paper demonstrates our vision of integrative spatial analytics for public health by linking the New York Cancer Mapping Dataset with datasets containing potential spatial impact indicators. We performed spatially based discovery of disease patterns and variations across New York State, and identified potential correlations between diseases and demographic, socio-economic and environmental indicators. Our methods were validated by three correlation studies: the correlation between stomach cancer and Asian race, the correlation between breast cancer and a highly educated population, and the correlation between lung cancer and air toxics. Our work will allow public health researchers, government officials and other practitioners to adequately identify, analyze, and monitor health problems at the community or neighborhood level for New York State. PMID:28269834

  6. Comparing Performance of Methods to Deal with Differential Attrition in Lottery Based Evaluations

    ERIC Educational Resources Information Center

    Zamarro, Gema; Anderson, Kaitlin; Steele, Jennifer; Miller, Trey

    2016-01-01

    The purpose of this study is to examine the performance of different methods (inverse probability weighting and estimation of informative bounds) for controlling differential attrition by comparing the results obtained from two datasets: an original dataset from Portland Public Schools (PPS) subject to high rates of differential…

  7. A Method to Estimate the Size and Characteristics of HIV-positive Populations Using an Individual-based Stochastic Simulation Model.

    PubMed

    Nakagawa, Fumiyo; van Sighem, Ard; Thiebaut, Rodolphe; Smith, Colette; Ratmann, Oliver; Cambiano, Valentina; Albert, Jan; Amato-Gauci, Andrew; Bezemer, Daniela; Campbell, Colin; Commenges, Daniel; Donoghoe, Martin; Ford, Deborah; Kouyos, Roger; Lodwick, Rebecca; Lundgren, Jens; Pantazis, Nikos; Pharris, Anastasia; Quinten, Chantal; Thorne, Claire; Touloumi, Giota; Delpech, Valerie; Phillips, Andrew

    2016-03-01

    It is important not only to collect epidemiologic data on HIV but also to fully utilize such information to understand the epidemic over time and to help inform and monitor the impact of policies and interventions. We describe and apply a novel method to estimate the size and characteristics of HIV-positive populations. The method was applied to data on men who have sex with men living in the UK and to a pseudo dataset to assess performance for different data availability. The individual-based simulation model was calibrated using an approximate Bayesian computation-based approach. In 2013, 48,310 (90% plausibility range: 39,900-45,560) men who have sex with men were estimated to be living with HIV in the UK, of whom 10,400 (6,160-17,350) were undiagnosed. There were an estimated 3,210 (1,730-5,350) infections per year on average between 2010 and 2013. Sixty-two percent of the total HIV-positive population are thought to have viral load <500 copies/ml. In the pseudo-epidemic example, HIV estimates have narrower plausibility ranges and are closer to the true number, the greater the data availability to calibrate the model. We demonstrate that our method can be applied to settings with less data; however, plausibility ranges for estimates will be wider, reflecting the greater uncertainty of the data used to fit the model.
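
The calibration idea can be illustrated with a toy rejection-ABC sketch. The simulator, prior, tolerance and observed count below are invented for illustration and are far simpler than the individual-based model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in: one unknown parameter (mean annual infections) and an
# observed cumulative diagnosis count (all numbers invented)
def simulate(rate, years=10, p_diag=0.8):
    infections = rng.poisson(rate, size=years).sum()
    return rng.binomial(infections, p_diag)   # diagnosed cases

observed = 25600

# Rejection ABC: draw parameters from the prior and keep those whose
# simulated output lands close to the observed data
prior_draws = rng.uniform(1000, 6000, size=5000)
accepted = [r for r in prior_draws
            if abs(simulate(r) - observed) / observed < 0.05]

estimate = float(np.median(accepted))        # point estimate
lo, hi = np.percentile(accepted, [5, 95])    # 90% plausibility range
```

As in the abstract, the spread of the accepted parameter values plays the role of the plausibility range: the tighter the data constrain the simulator, the narrower it becomes.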

  8. DeepPap: Deep Convolutional Networks for Cervical Cell Classification.

    PubMed

    Zhang, Ling; Le Lu; Nogues, Isabella; Summers, Ronald M; Liu, Shaoxiong; Yao, Jianhua

    2017-11-01

    Automation-assisted cervical screening via Pap smear or liquid-based cytology (LBC) is a highly effective cell imaging based cancer detection tool, where cells are partitioned into "abnormal" and "normal" categories. However, the success of most traditional classification methods relies on the presence of accurate cell segmentations. Despite sixty years of research in this field, accurate segmentation remains a challenge in the presence of cell clusters and pathologies. Moreover, previous classification methods are only built upon the extraction of hand-crafted features, such as morphology and texture. This paper addresses these limitations by proposing a method to directly classify cervical cells-without prior segmentation-based on deep features, using convolutional neural networks (ConvNets). First, the ConvNet is pretrained on a natural image dataset. It is subsequently fine-tuned on a cervical cell dataset consisting of adaptively resampled image patches coarsely centered on the nuclei. In the testing phase, aggregation is used to average the prediction scores of a similar set of image patches. The proposed method is evaluated on both Pap smear and LBC datasets. Results show that our method outperforms previous algorithms in classification accuracy (98.3%), area under the curve (0.99) values, and especially specificity (98.3%), when applied to the Herlev benchmark Pap smear dataset and evaluated using five-fold cross validation. Similar superior performances are also achieved on the HEMLBC (H&E stained manual LBC) dataset. Our method is promising for the development of automation-assisted reading systems in primary cervical screening.

  9. A fast boosting-based screening method for large-scale association study in complex traits with genetic heterogeneity.

    PubMed

    Wang, Lu-Yong; Fasulo, D

    2006-01-01

    Genome-wide association studies of complex diseases generate massive amounts of single nucleotide polymorphism (SNP) data. Univariate statistical tests (e.g. Fisher's exact test) are typically used to single out non-associated SNPs. However, disease-susceptible SNPs may have little marginal effect in the population and are unlikely to be retained after univariate tests. Model-based methods, meanwhile, are impractical for large-scale datasets, and genetic heterogeneity makes it harder for traditional methods to identify the genetic causes of disease. The more recent random forest method provides a robust way to screen SNPs at the scale of thousands. For larger-scale data, however, such as Affymetrix Human Mapping 100K GeneChip data, a faster screening method is required for whole-genome association analysis in the presence of genetic heterogeneity. We propose a boosting-based method for rapid screening in large-scale analysis of complex traits in the presence of genetic heterogeneity. It provides a relatively fast and fairly good tool for screening and limiting the candidate SNPs for further, more complex computational modeling.
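
A hedged sketch of boosting-based SNP screening on synthetic case-control data, using scikit-learn's AdaBoost rather than the authors' implementation. Genetic heterogeneity is mimicked by making disease risk depend on either of two SNPs, so each has only a modest marginal effect.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(4)

# Synthetic case-control data: 1,000 subjects x 500 SNPs coded 0/1/2
n, p = 1000, 500
X = rng.integers(0, 3, size=(n, p)).astype(float)

# Heterogeneity: two risk alleles at either SNP 0 or SNP 1 raise disease risk
risk = np.where((X[:, 0] == 2) | (X[:, 1] == 2), 0.5, 0.15)
y = (rng.random(n) < risk).astype(int)

# Boosting over decision stumps; feature importances rank the SNPs
clf = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(-clf.feature_importances_)

# Screening step: keep only the top-ranked SNPs as candidates for
# more complex downstream modelling
candidates = set(int(i) for i in ranking[:20])
```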

  10. Deformable Image Registration based on Similarity-Steered CNN Regression.

    PubMed

    Cao, Xiaohuan; Yang, Jianhua; Zhang, Jun; Nie, Dong; Kim, Min-Jeong; Wang, Qian; Shen, Dinggang

    2017-09-01

    Existing deformable registration methods require exhaustively iterative optimization, along with careful parameter tuning, to estimate the deformation field between images. Although some learning-based methods have been proposed for initiating deformation estimation, they are often template-specific and not flexible in practical use. In this paper, we propose a convolutional neural network (CNN) based regression model to directly learn the complex mapping from the input image pair (i.e., a pair of template and subject) to their corresponding deformation field. Specifically, our CNN architecture is designed in a patch-based manner to learn the complex mapping from the input patch pairs to their respective deformation field. First, the equalized active-points guided sampling strategy is introduced to facilitate accurate CNN model learning upon a limited image dataset. Then, the similarity-steered CNN architecture is designed, where we propose to add the auxiliary contextual cue, i.e., the similarity between input patches, to more directly guide the learning process. Experiments on different brain image datasets demonstrate promising registration performance based on our CNN model. Furthermore, it is found that the trained CNN model from one dataset can be successfully transferred to another dataset, although brain appearances across datasets are quite variable.

  11. Controlling the joint local false discovery rate is more powerful than meta-analysis methods in joint analysis of summary statistics from multiple genome-wide association studies.

    PubMed

    Jiang, Wei; Yu, Weichuan

    2017-02-15

    In genome-wide association studies (GWASs) of common diseases/traits, we often analyze multiple GWASs with the same phenotype together to discover associated genetic variants with higher power. Since it is difficult to access data with detailed individual measurements, summary-statistics-based meta-analysis methods have become popular to jointly analyze datasets from multiple GWASs. In this paper, we propose a novel summary-statistics-based joint analysis method based on controlling the joint local false discovery rate (Jlfdr). We prove that our method is the most powerful summary-statistics-based joint analysis method when controlling the false discovery rate at a certain level. In particular, the Jlfdr-based method achieves higher power than commonly used meta-analysis methods when analyzing heterogeneous datasets from multiple GWASs. Simulation experiments demonstrate the superior power of our method over meta-analysis methods. Also, our method discovers more associations than meta-analysis methods from empirical datasets of four phenotypes. The R-package is available at: http://bioinformatics.ust.hk/Jlfdr.html . eeyu@ust.hk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  12. Using imputed genotype data in the joint score tests for genetic association and gene-environment interactions in case-control studies.

    PubMed

    Song, Minsun; Wheeler, William; Caporaso, Neil E; Landi, Maria Teresa; Chatterjee, Nilanjan

    2018-03-01

    Genome-wide association studies (GWAS) are now routinely imputed for untyped single nucleotide polymorphisms (SNPs) based on various powerful statistical algorithms for imputation trained on reference datasets. The use of predicted allele counts for imputed SNPs as the dosage variable is known to produce a valid score test for genetic association. In this paper, we investigate how to best handle imputed SNPs in various modern complex tests for genetic associations incorporating gene-environment interactions. We focus on case-control association studies where inference for an underlying logistic regression model can be performed using alternative methods that rely to varying degrees on an assumption of gene-environment independence in the underlying population. As increasingly large-scale GWAS are being performed through consortia efforts where it is preferable to share only summary-level information across studies, we also describe simple mechanisms for implementing score tests based on standard meta-analysis of "one-step" maximum-likelihood estimates across studies. Applications of the methods in simulation studies and a dataset from a GWAS of lung cancer illustrate the ability of the proposed methods to maintain type-I error rates for the underlying testing procedures. For analysis of imputed SNPs, similar to typed SNPs, the retrospective methods can lead to considerable efficiency gain for modeling of gene-environment interactions under the assumption of gene-environment independence. Methods are made available for public use through the CGEN R software package. © 2017 WILEY PERIODICALS, INC.

  13. GiniClust: detecting rare cell types from single-cell gene expression data with Gini index.

    PubMed

    Jiang, Lan; Chen, Huidong; Pinello, Luca; Yuan, Guo-Cheng

    2016-07-01

    High-throughput single-cell technologies have great potential to discover new cell types; however, it remains challenging to detect rare cell types that are distinct from a large population. We present a novel computational method, called GiniClust, to overcome this challenge. Validation against a benchmark dataset indicates that GiniClust achieves high sensitivity and specificity. Application of GiniClust to public single-cell RNA-seq datasets uncovers previously unrecognized rare cell types, including Zscan4-expressing cells within mouse embryonic stem cells and hemoglobin-expressing cells in the mouse cortex and hippocampus. GiniClust also correctly detects a small number of normal cells that are mixed in a cancer cell population.
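
The core idea, ranking genes by the Gini index of their expression across cells so that markers of rare subpopulations score highest, can be sketched as follows. The counts are synthetic, not the datasets analysed in the paper, and the clustering step that follows gene selection in GiniClust is omitted.

```python
import numpy as np

def gini(x):
    """Gini index of a non-negative vector: 0 for perfectly even expression,
    approaching 1 when expression is concentrated in a few cells."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0
    cum = np.cumsum(x) / total
    return (n + 1 - 2 * cum.sum()) / n

rng = np.random.default_rng(5)
n_genes, n_cells = 200, 500
expr = rng.poisson(5, size=(n_genes, n_cells)).astype(float)

# Marker gene: silent in most cells, high in a rare subpopulation of 10 cells
expr[0, :] = 0.0
expr[0, :10] = 50.0

scores = np.array([gini(expr[g]) for g in range(n_genes)])
top_gene = int(np.argmax(scores))  # gene with the most concentrated expression
```

A uniformly expressed housekeeping gene has a Gini index near 0, while the rare-subpopulation marker scores near 1, which is why the index is a useful filter before clustering.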

  14. A Semantic Transformation Methodology for the Secondary Use of Observational Healthcare Data in Postmarketing Safety Studies.

    PubMed

    Pacaci, Anil; Gonul, Suat; Sinaci, A Anil; Yuksel, Mustafa; Laleci Erturkmen, Gokce B

    2018-01-01

    Background: Utilization of the available observational healthcare datasets is key to complementing and strengthening postmarketing safety studies. Use of common data models (CDM) is the predominant approach to enabling large-scale systematic analyses on disparate data models and vocabularies. Current CDM transformation practices depend on proprietarily developed Extract-Transform-Load (ETL) procedures, which require knowledge of both the semantics and the technical characteristics of the source datasets and the target CDM. Purpose: In this study, our aim is to develop a modular but coordinated transformation approach that separates the semantic and technical steps of the transformation process, which do not have a strict separation in traditional ETL approaches. Such an approach discretizes the operations to extract data from source electronic health record systems, align the source and target models on the semantic level, and populate target common data repositories. Approach: In order to separate the activities required to transform heterogeneous data sources to a target CDM, we introduce a semantic transformation approach composed of three steps: (1) transformation of source datasets to Resource Description Framework (RDF) format, (2) application of semantic conversion rules to obtain the data as instances of an ontological model of the target CDM, and (3) population of repositories complying with the specifications of the CDM by processing the RDF instances from step 2. The proposed approach has been implemented in real healthcare settings, where the Observational Medical Outcomes Partnership (OMOP) CDM was chosen as the common data model, and a comprehensive comparative analysis between the native and transformed data has been conducted. Results: Health records of ~1 million patients have been successfully transformed from the source database to an OMOP CDM based database. 
Descriptive statistics obtained from the source and target databases present analogous and consistent results. Discussion and Conclusion: Our method goes beyond the traditional ETL approaches by being more declarative and rigorous. Declarative because the use of RDF based mapping rules makes each mapping more transparent and understandable to humans while retaining logic-based computability. Rigorous because the mappings would be based on computer readable semantics which are amenable to validation through logic-based inference methods.
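
A minimal sketch of the three-step idea, with plain Python tuples standing in for RDF triples. The namespaces, field names and mapping rules are invented for illustration (8507/8532 are OMOP's standard male/female gender concept ids).

```python
# Hypothetical patient record from a source EHR system (field names invented)
record = {"patient_id": "p42", "gender_code": "F", "birth_year": 1980}

EX = "http://example.org/source#"    # assumed source namespace
OMOP = "http://example.org/omop#"    # assumed target (OMOP-like) namespace

# Step 1: express the source record as RDF-style (subject, predicate, object) triples
subject = EX + record["patient_id"]
triples = {(subject, EX + key, value)
           for key, value in record.items() if key != "patient_id"}

# Step 2: semantic conversion rules aligning source predicates and values
# with the target model
predicate_map = {EX + "gender_code": OMOP + "gender_concept_id",
                 EX + "birth_year": OMOP + "year_of_birth"}
value_map = {(EX + "gender_code", "M"): 8507, (EX + "gender_code", "F"): 8532}

converted = {(s, predicate_map[p], value_map.get((p, o), o))
             for s, p, o in triples}

# Step 3: populate a row of the target person table from the converted triples
person = {p.split("#")[1]: o for _, p, o in converted}
```

Keeping the mapping rules as data (here, two dictionaries) rather than burying them in ETL code is what makes the mappings transparent and amenable to validation, as the conclusion argues.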

  15. Population Estimation Using a 3D City Model: A Multi-Scale Country-Wide Study in the Netherlands

    PubMed Central

    Arroyo Ohori, Ken; Ledoux, Hugo; Peters, Ravi; Stoter, Jantien

    2016-01-01

    The remote estimation of a region’s population has for decades been a key application of geographic information science in demography. Most studies have used 2D data (maps, satellite imagery) to estimate population while avoiding field surveys and questionnaires. As the availability of semantic 3D city models is constantly increasing, we investigate to what extent they can be used for the same purpose. Based on the assumption that housing space is a proxy for the number of its residents, we use two methods to estimate the population with 3D city models in two directions: (1) disaggregation (areal interpolation) to estimate the population of small administrative entities (e.g. neighbourhoods) from that of larger ones (e.g. municipalities); and (2) a statistical modelling approach to estimate the population of large entities from a sample composed of their smaller ones (e.g. one acquired by a government register). Starting from a complete Dutch census dataset at the neighbourhood level and a 3D model of all 9.9 million buildings in the Netherlands, we compare the population estimates obtained by both methods with the actual population as reported in the census, and use this comparison to evaluate the quality that can be achieved by estimations at different administrative levels. We also analyse how the volume-based estimation enabled by 3D city models fares in comparison to 2D methods using building footprints and floor areas, as well as how it is affected by different levels of semantic detail in a 3D city model. We conclude that 3D city models are useful for estimations of large areas (e.g. for a country), and that the 3D approach has clear advantages over the 2D approach. PMID:27254151
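    The disaggregation direction (1) reduces to proportional allocation: a larger entity's known population is distributed over its smaller entities in proportion to the building volume each contains. A minimal sketch, with invented numbers:

```python
# Sketch of volume-based areal interpolation (disaggregation): a
# municipality's known population is allocated to its neighbourhoods in
# proportion to total building volume. All figures are illustrative.

def disaggregate(total_population, volumes):
    """Allocate total_population to units proportionally to building volume."""
    total_volume = sum(volumes.values())
    return {unit: total_population * v / total_volume
            for unit, v in volumes.items()}

# Hypothetical municipality of 10,000 residents with three neighbourhoods,
# building volumes in cubic metres:
estimates = disaggregate(10_000, {"A": 500_000, "B": 300_000, "C": 200_000})
```

    The statistical-modelling direction (2) would instead fit the population-to-volume relationship on a sample of small entities and apply it upward.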

  16. Compatibility of pedigree-based and marker-based relationship matrices for single-step genetic evaluation.

    PubMed

    Christensen, Ole F

    2012-12-03

    Single-step methods provide a coherent and conceptually simple approach to incorporate genomic information into genetic evaluations. An issue with single-step methods is compatibility between the marker-based relationship matrix for genotyped animals and the pedigree-based relationship matrix. Therefore, it is necessary to adjust the marker-based relationship matrix to the pedigree-based relationship matrix. Moreover, with data from routine evaluations, this adjustment should in principle be based on both observed marker genotypes and observed phenotypes, but until now this has been overlooked. In this paper, I propose a new method to address this issue by 1) adjusting the pedigree-based relationship matrix to be compatible with the marker-based relationship matrix instead of the reverse and 2) extending the single-step genetic evaluation using a joint likelihood of observed phenotypes and observed marker genotypes. The performance of this method is then evaluated using two simulated datasets. The method derived here is a single-step method in which the marker-based relationship matrix is constructed assuming all allele frequencies equal to 0.5 and the pedigree-based relationship matrix is constructed using the unusual assumption that animals in the base population are related and inbred with a relationship coefficient γ and an inbreeding coefficient γ / 2. Taken together, this γ parameter and a parameter that scales the marker-based relationship matrix can handle the issue of compatibility between marker-based and pedigree-based relationship matrices. The full log-likelihood function used for parameter inference contains two terms. The first term is the REML-log-likelihood for the phenotypes conditional on the observed marker genotypes, whereas the second term is the log-likelihood for the observed marker genotypes. 
Analyses of the two simulated datasets with this new method showed that 1) the parameters involved in adjusting marker-based and pedigree-based relationship matrices can depend on both observed phenotypes and observed marker genotypes and 2) a strong association between these two parameters exists. Finally, this method performed at least as well as a method based on adjusting the marker-based relationship matrix. Using the full log-likelihood and adjusting the pedigree-based relationship matrix to be compatible with the marker-based relationship matrix provides a new and interesting approach to handle the issue of compatibility between the two matrices in single-step genetic evaluation.
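    The two matrices at the heart of the method can be sketched as follows. G uses the stated assumption of all allele frequencies equal to 0.5, under which the usual VanRaden denominator 2∑p(1-p) reduces to m/2. The γ-adjustment shown is one common formulation of making base animals related by γ and inbred by γ/2; in practice γ would be estimated from the joint likelihood rather than fixed:

```python
import numpy as np

# Sketch of the marker-based matrix G (allele frequencies all 0.5) and a
# gamma-adjusted pedigree-based matrix A_gamma = gamma*J + (1 - gamma/2)*A,
# under which base animals are related by gamma and inbred by gamma/2.
# The genotype matrix and gamma value below are illustrative.

def marker_G(M):
    """Marker relationship matrix from 0/1/2 codes with all freqs = 0.5."""
    m = M.shape[1]            # number of markers
    Z = M - 1.0               # centre genotype codes at p = 0.5
    return Z @ Z.T / (m / 2.0)

def adjusted_A(A, gamma):
    """Pedigree matrix rescaled so base animals have relationship gamma."""
    n = A.shape[0]
    return gamma * np.ones((n, n)) + (1.0 - gamma / 2.0) * A

M = np.array([[0, 1, 2, 1], [2, 1, 0, 1], [1, 1, 1, 2]], dtype=float)
G = marker_G(M)
A = np.eye(3)                 # unrelated, non-inbred base animals
A_gamma = adjusted_A(A, gamma=0.2)
```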

  17. Phylogeographic patterns of Lygus pratensis (Hemiptera: Miridae): Evidence for weak genetic structure and recent expansion in northwest China.

    PubMed

    Zhang, Li-Juan; Cai, Wan-Zhi; Luo, Jun-Yu; Zhang, Shuai; Wang, Chun-Yi; Lv, Li-Min; Zhu, Xiang-Zhen; Wang, Li; Cui, Jin-Jie

    2017-01-01

    Lygus pratensis (L.) is an important cotton pest in China, especially in the northwest region. Nymphs and adults cause serious quality and yield losses. However, the genetic structure and geographic distribution of L. pratensis are not well known. We analyzed genetic diversity, geographical structure, gene flow, and population dynamics of L. pratensis in northwest China using mitochondrial and nuclear sequence datasets to study phylogeographical patterns and demographic history. L. pratensis (n = 286) were collected at sites across an area spanning 2,180,000 km², including the Xinjiang and Gansu-Ningxia regions. Populations in the two regions could be distinguished based on mitochondrial criteria, but the overall genetic structure was weak. The nuclear dataset revealed a lack of diagnostic genetic structure across sample areas. Phylogenetic analysis indicated a lack of population-level monophyly that may have been caused by incomplete lineage sorting. The Mantel test showed a significant correlation between genetic and geographic distances among the populations based on the mtDNA data. However, the nuclear dataset showed no significant correlation. A high level of gene flow among populations was indicated by migration analysis; human activities may have also facilitated insect movement. The availability of irrigation water and ample cotton hosts makes the Xinjiang region well suited for L. pratensis reproduction. Bayesian skyline plot analysis, a star-shaped network, and neutrality tests all indicated that L. pratensis has experienced recent population expansion. Climatic changes and extensive areas occupied by host plants have led to population expansion of L. pratensis. In conclusion, the present distribution and phylogeographic pattern of L. pratensis was influenced by climate, human activities, and availability of plant hosts.

  18. Artificial neural network models for prediction of cardiovascular autonomic dysfunction in general Chinese population

    PubMed Central

    2013-01-01

    Background The present study aimed to develop an artificial neural network (ANN) based prediction model for cardiovascular autonomic (CA) dysfunction in the general population. Methods We analyzed a previous dataset based on a population sample consisting of 2,092 individuals aged 30–80 years. The prediction models were derived from an exploratory set using ANN analysis. Performance of these prediction models was evaluated in the validation set. Results Univariate analysis indicated that 14 risk factors showed statistically significant association with CA dysfunction (P < 0.05). The mean area under the receiver-operating characteristic curve was 0.762 (95% CI 0.732–0.793) for the prediction model developed using ANN analysis. The mean sensitivity, specificity, and positive and negative predictive values of the prediction models were 0.751, 0.665, 0.330, and 0.924, respectively. All Hosmer-Lemeshow (HL) statistics were less than 15.0. Conclusion ANN is an effective tool for developing prediction models with high value for predicting CA dysfunction among the general population. PMID:23902963
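    The abstract's headline figure is the area under the ROC curve. A minimal sketch of how that statistic can be computed from predicted risks: it equals the probability that a randomly chosen case is scored higher than a randomly chosen control, with ties counted as one half. The scores and labels below are invented for illustration:

```python
# Rank-based AUC: P(score_case > score_control), ties counted as 1/2.
# Data are hypothetical predicted probabilities of CA dysfunction.

def auc(scores, labels):
    """Area under the ROC curve via the case-vs-control rank comparison."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))

value = auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0])
```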

  19. A spline-based regression parameter set for creating customized DARTEL MRI brain templates from infancy to old age.

    PubMed

    Wilke, Marko

    2018-02-01

    This dataset contains the regression parameters derived by analyzing segmented brain MRI images (gray matter and white matter) from a large population of healthy subjects, using a multivariate adaptive regression splines approach. A total of 1919 MRI datasets ranging in age from 1-75 years from four publicly available datasets (NIH, C-MIND, fCONN, and IXI) were segmented using the CAT12 segmentation framework, writing out gray matter and white matter images normalized using an affine-only spatial normalization approach. These images were then subjected to a six-step DARTEL procedure, employing an iterative non-linear registration approach and yielding increasingly crisp intermediate images. The resulting six datasets per tissue class were then analyzed using multivariate adaptive regression splines, using the CerebroMatic toolbox. This approach allows for flexibly modelling smoothly varying trajectories while taking into account demographic (age, gender) as well as technical (field strength, data quality) predictors. The resulting regression parameters described here can be used to generate matched DARTEL or SHOOT templates for a given population under study, from infancy to old age. The dataset and the algorithm used to generate it are publicly available at https://irc.cchmc.org/software/cerebromatic.php.

  20. Assessment of phylogenetic sensitivity for reconstructing HIV-1 epidemiological relationships.

    PubMed

    Beloukas, Apostolos; Magiorkinis, Emmanouil; Magiorkinis, Gkikas; Zavitsanou, Asimina; Karamitros, Timokratis; Hatzakis, Angelos; Paraskevis, Dimitrios

    2012-06-01

    Phylogenetic analysis has been extensively used as a tool for the reconstruction of epidemiological relations for research or for forensic purposes. Our objective was to assess the sensitivity of different phylogenetic methods and various phylogenetic programs in reconstructing epidemiological links among HIV-1 infected patients, that is, the probability of revealing a true transmission relationship. Multiple datasets (90) were prepared consisting of HIV-1 sequences in protease (PR) and partial reverse transcriptase (RT) sampled from patients with a documented epidemiological relationship (target population) and from unrelated individuals (control population) belonging to the same HIV-1 subtype as the target population. Each dataset varied regarding the number, the geographic origin and the transmission risk groups of the sequences in the control population. Phylogenetic trees were inferred by neighbor-joining (NJ), maximum likelihood heuristics (hML) and Bayesian methods. All clusters of sequences belonging to the target population were correctly reconstructed by NJ and Bayesian methods, receiving high bootstrap and posterior probability (PP) support, respectively. On the other hand, TreePuzzle failed to reconstruct or provide significant support for several clusters; high puzzling step support was associated with the inclusion of control sequences from the same geographic area as the target population. In contrast, all clusters were correctly reconstructed by hML as implemented in PhyML 3.0, receiving high bootstrap support. We report that under the conditions of our study, hML using PhyML, NJ and Bayesian methods were the most sensitive for the reconstruction of epidemiological links mostly from sexually infected individuals. Copyright © 2012 Elsevier B.V. All rights reserved.

  1. High-throughput ocular artifact reduction in multichannel electroencephalography (EEG) using component subspace projection.

    PubMed

    Ma, Junshui; Bayram, Sevinç; Tao, Peining; Svetnik, Vladimir

    2011-03-15

    After a review of the ocular artifact reduction literature, a high-throughput method designed to reduce the ocular artifacts in multichannel continuous EEG recordings acquired at clinical EEG laboratories worldwide is proposed. The proposed method belongs to the category of component-based methods, and does not rely on any electrooculography (EOG) signals. Based on a concept that all ocular artifact components exist in a signal component subspace, the method can uniformly handle all types of ocular artifacts, including eye-blinks, saccades, and other eye movements, by automatically identifying ocular components from decomposed signal components. This study also proposes an improved strategy to objectively and quantitatively evaluate artifact reduction methods. The evaluation strategy uses real EEG signals to synthesize realistic simulated datasets with different amounts of ocular artifacts. The simulated datasets enable us to objectively demonstrate that the proposed method outperforms some existing methods when no high-quality EOG signals are available. Moreover, the results of the simulated datasets improve our understanding of the involved signal decomposition algorithms, and provide us with insights into the inconsistency regarding the performance of different methods in the literature. The proposed method was also applied to two independent clinical EEG datasets involving 28 volunteers and over 1000 EEG recordings. This effort further confirms that the proposed method can effectively reduce ocular artifacts in large clinical EEG datasets in a high-throughput fashion. Copyright © 2011 Elsevier B.V. All rights reserved.
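    The component-subspace idea can be sketched on synthetic data: decompose the multichannel recording into components, drop the ones lying in the ocular subspace, and reconstruct. In the sketch below, PCA via SVD stands in for the decomposition, and the ocular component is flagged by correlation with a known blink waveform purely for illustration; the proposed method identifies ocular components automatically and uses no EOG reference:

```python
import numpy as np

# Synthetic 8-channel EEG with an eye-blink mixed in more strongly at
# frontal channels. One component is flagged as ocular (here, by highest
# correlation with the blink template) and its subspace contribution removed.

rng = np.random.default_rng(0)
n_channels, n_samples = 8, 1000
blink = np.zeros(n_samples)
blink[100:130] = 5.0                                       # synthetic blink
eeg = rng.standard_normal((n_channels, n_samples))
eeg += np.linspace(1.0, 0.1, n_channels)[:, None] * blink  # frontal gradient

U, s, Vt = np.linalg.svd(eeg, full_matrices=False)         # decomposition
corr = [abs(np.corrcoef(v, blink)[0, 1]) for v in Vt]
ocular = int(np.argmax(corr))                              # flag component
s_clean = s.copy()
s_clean[ocular] = 0.0                                      # remove subspace
cleaned = U @ np.diag(s_clean) @ Vt                        # reconstruct
```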

  2. Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation.

    PubMed

    Cornuet, Jean-Marie; Santos, Filipe; Beaumont, Mark A; Robert, Christian P; Marin, Jean-Michel; Balding, David J; Guillemaud, Thomas; Estoup, Arnaud

    2008-12-01

    Genetic data obtained on population samples convey information about their evolutionary history. Inference methods can extract part of this information but they require sophisticated statistical techniques that have been made available to the biologist community (through computer programs) only for simple and standard situations typically involving a small number of samples. We propose here a computer program (DIY ABC) for inference based on approximate Bayesian computation (ABC), in which scenarios can be customized by the user to fit many complex situations involving any number of populations and samples. Such scenarios involve any combination of population divergences, admixtures and population size changes. DIY ABC can be used to compare competing scenarios, estimate parameters for one or more scenarios and compute bias and precision measures for a given scenario and known values of parameters (the current version applies to unlinked microsatellite data). This article describes key methods used in the program and provides its main features. The analysis of one simulated and one real dataset, both with complex evolutionary scenarios, illustrates the main possibilities of DIY ABC. The software DIY ABC is freely available at http://www.montpellier.inra.fr/CBGP/diyabc.
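    The machinery underlying ABC can be illustrated with the simplest rejection sampler: draw a parameter from its prior, simulate data under the scenario, and keep the draw if a summary statistic falls within a tolerance of the observed one. The toy model below (estimating a success probability from 100 trials) is purely illustrative and far simpler than the microsatellite scenarios DIY ABC handles:

```python
import random

# Rejection ABC on a toy model: Uniform(0, 1) prior on theta, binomial
# simulation, acceptance when the simulated count is within epsilon of the
# observed count. Accepted draws approximate the posterior.

random.seed(1)

def simulate(theta, n=100):
    """Simulated summary statistic: number of successes out of n trials."""
    return sum(random.random() < theta for _ in range(n))

observed = 62                                  # observed summary statistic
accepted = []
for _ in range(20_000):
    theta = random.random()                    # draw from the prior
    if abs(simulate(theta) - observed) <= 2:   # tolerance epsilon = 2
        accepted.append(theta)

posterior_mean = sum(accepted) / len(accepted)
```

    With an exact likelihood the posterior here would be Beta(63, 39), mean ≈ 0.617; the accepted sample approximates it without ever evaluating that likelihood, which is the point of ABC.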

  3. Machine learning methods can replace 3D profile method in classification of amyloidogenic hexapeptides.

    PubMed

    Stanislawski, Jerzy; Kotulska, Malgorzata; Unold, Olgierd

    2013-01-17

    Amyloids are proteins capable of forming fibrils. Many of them underlie serious diseases, such as Alzheimer's disease. The number of amyloid-associated diseases is constantly increasing. Recent studies indicate that amyloidogenic properties can be associated with short segments of amino acids, which transform the structure when exposed. A few hundred such peptides have been experimentally found. Experimental testing of all possible amino acid combinations is currently not feasible. Instead, they can be predicted by computational methods. 3D profile is a method based on physicochemical properties that has generated the largest such dataset, ZipperDB. However, it is computationally very demanding. Here, we show that dataset generation can be accelerated. Two methods to increase the classification efficiency of amyloidogenic candidates are presented and tested: simplified 3D profile generation and machine learning methods. We generated a new dataset of hexapeptides using a more economical 3D profile algorithm, which showed very good classification overlap with ZipperDB (93.5%). The new part of our dataset contains 1779 segments, with 204 classified as amyloidogenic. The dataset of 6-residue sequences with their binary classification, based on the energy of the segment, was applied for training machine learning methods. A separate set of sequences from ZipperDB was used as a test set. The most effective methods were Alternating Decision Tree and Multilayer Perceptron. Both methods obtained an area under the ROC curve of 0.96, accuracy 91%, true positive rate ca. 78%, and true negative rate 95%. A few other machine learning methods also achieved good performance. The computational time was reduced from 18-20 CPU-hours (full 3D profile) to 0.5 CPU-hours (simplified 3D profile) to seconds (machine learning). We showed that the simplified profile generation method does not introduce an error with regard to the original method, while increasing the computational efficiency.
Our new dataset proved representative enough to use simple statistical methods for testing amyloidogenicity based only on six-letter sequences. Statistical machine learning methods such as Alternating Decision Tree and Multilayer Perceptron can replace the energy-based classifier, with the advantage of significantly reduced computational time and simplicity of analysis. Additionally, a decision tree provides a set of very easily interpretable rules.

  4. A comparative analysis of chaotic particle swarm optimizations for detecting single nucleotide polymorphism barcodes.

    PubMed

    Chuang, Li-Yeh; Moi, Sin-Hua; Lin, Yu-Da; Yang, Cheng-Hong

    2016-10-01

    Evolutionary algorithms can overcome the computational limitations of statistically evaluating large datasets for high-order single nucleotide polymorphism (SNP) barcodes. Previous studies have proposed several chaotic particle swarm optimization (CPSO) methods to detect SNP barcodes for disease analysis (e.g., for breast cancer and chronic diseases). This work evaluated additional chaotic maps combined with the particle swarm optimization (PSO) method to detect SNP barcodes using a high-dimensional dataset. Nine chaotic maps were used to improve PSO results, and the searching ability of all CPSO methods was compared. The XOR and ZZ disease models were used to compare all chaotic maps combined with the PSO method. Efficacy evaluations of the CPSO methods were based on statistical values from the chi-square test (χ²). The results showed that chaotic maps can improve the searching ability of the PSO method when the population is trapped in a local optimum. Across minor allele frequencies (MAF), numbers of SNPs, and sample sizes, the highest χ² values in all datasets were obtained by the Sinai chaotic map combined with the PSO method. We used simple linear regression on the gbest values over all generations to compare all the methods. The Sinai chaotic map combined with the PSO method provided the highest β values (β≥0.32 in the XOR disease model and β≥0.04 in the ZZ disease model) and significant p-values (p<0.001 in both the XOR and ZZ disease models). The Sinai chaotic map was found to effectively enhance the fitness values (χ²) of the PSO method, indicating that the Sinai chaotic map combined with the PSO method is more effective at detecting potential SNP barcodes in both the XOR and ZZ disease models. Copyright © 2016 Elsevier B.V. All rights reserved.
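    The core mechanism of a CPSO is replacing PSO's pseudo-random coefficients with a deterministic chaotic sequence. The sketch below uses the logistic map as one familiar example of such a sequence (the study's best performer was the Sinai map); the velocity update and constants are the standard PSO form, not the paper's exact configuration:

```python
# Chaotic PSO ingredient: a chaotic map supplies the r1, r2 coefficients in
# the standard PSO velocity update instead of random(). The logistic map at
# r = 4 is used as an illustrative chaotic sequence.

def logistic_map(x):
    """One step of the logistic map, chaotic for r = 4."""
    return 4.0 * x * (1.0 - x)

def pso_velocity(v, x, pbest, gbest, chaos_state, w=0.7, c1=2.0, c2=2.0):
    """Standard PSO velocity update with chaotic r1, r2 coefficients."""
    r1 = chaos_state = logistic_map(chaos_state)
    r2 = chaos_state = logistic_map(chaos_state)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return v_new, chaos_state

v, state = 0.0, 0.7
v, state = pso_velocity(v, x=1.0, pbest=0.5, gbest=0.2, chaos_state=state)
```

    Because the chaotic sequence is dense and non-repeating, it can nudge particles out of a local optimum where a stalled pseudo-random stream would not, which is the effect the study evaluates across nine maps.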

  5. Supervised Variational Relevance Learning, An Analytic Geometric Feature Selection with Applications to Omic Datasets.

    PubMed

    Boareto, Marcelo; Cesar, Jonatas; Leite, Vitor B P; Caticha, Nestor

    2015-01-01

    We introduce Supervised Variational Relevance Learning (Suvrel), a variational method to determine metric tensors that define distance-based similarity in pattern classification, inspired by relevance learning. The variational method is applied to a cost function that penalizes large intraclass distances and favors large interclass distances. We find analytically the metric tensor that minimizes the cost function. Preprocessing the patterns with linear transformations derived from the metric tensor yields a dataset that can be more efficiently classified. We test our methods using publicly available datasets, for some standard classifiers. Among these datasets, two were tested by the MAQC-II project and, even without the use of further preprocessing, our results improve on their performance.
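    The general idea, learning a metric that shrinks intraclass distances relative to interclass ones and transforming the data before classification, can be sketched as follows. The diagonal between/within-variance weighting below is an illustrative stand-in, not the paper's analytically derived tensor:

```python
import numpy as np

# Illustrative metric-tensor preprocessing: weight each feature by its
# between-class to within-class variance ratio, then rescale the data so a
# plain Euclidean classifier sees class-relevant features magnified.

def diagonal_metric(X, y):
    """Per-feature weights: between-class over within-class variance."""
    classes = np.unique(y)
    grand = X.mean(axis=0)
    within = np.zeros(X.shape[1])
    between = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
        between += len(Xc) * (Xc.mean(axis=0) - grand) ** 2
    return between / (within + 1e-12)

# Two features: the first separates the classes, the second is pure noise.
X = np.array([[0.0, 5.0], [0.2, -3.0], [1.0, 4.0], [1.2, -4.0]])
y = np.array([0, 0, 1, 1])
w = diagonal_metric(X, y)
X_transformed = X * np.sqrt(w)   # linear transformation induced by the metric
```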

  6. Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ)

    PubMed Central

    Mascher, Martin; Muehlbauer, Gary J; Rokhsar, Daniel S; Chapman, Jarrod; Schmutz, Jeremy; Barry, Kerrie; Muñoz-Amatriaín, María; Close, Timothy J; Wise, Roger P; Schulman, Alan H; Himmelbach, Axel; Mayer, Klaus FX; Scholz, Uwe; Poland, Jesse A; Stein, Nils; Waugh, Robbie

    2013-01-01

    Next-generation whole-genome shotgun assemblies of complex genomes are highly useful, but fail to link nearby sequence contigs with each other or provide a linear order of contigs along individual chromosomes. Here, we introduce a strategy based on sequencing progeny of a segregating population that allows de novo production of a genetically anchored linear assembly of the gene space of an organism. We demonstrate the power of the approach by reconstructing the chromosomal organization of the gene space of barley, a large, complex and highly repetitive 5.1 Gb genome. We evaluate the robustness of the new assembly by comparison to a recently released physical and genetic framework of the barley genome, and to various genetically ordered sequence-based genotypic datasets. The method is independent of the need for any prior sequence resources, and will enable rapid and cost-efficient establishment of powerful genomic information for many species. PMID:23998490

  7. Endmember extraction from hyperspectral image based on discrete firefly algorithm (EE-DFA)

    NASA Astrophysics Data System (ADS)

    Zhang, Chengye; Qin, Qiming; Zhang, Tianyuan; Sun, Yuanheng; Chen, Chao

    2017-04-01

    This study proposed a novel method to extract endmembers from hyperspectral images based on the discrete firefly algorithm (EE-DFA). Endmembers are the input of many spectral unmixing algorithms. Hence, in this paper, endmember extraction from a hyperspectral image is regarded as a combinatorial optimization problem of finding the best spectral unmixing results, which can be solved by the discrete firefly algorithm. Two series of experiments were conducted on synthetic hyperspectral datasets with different SNR and on the AVIRIS Cuprite dataset, respectively. The experimental results were compared with the endmembers extracted by four popular methods: the sequential maximum angle convex cone (SMACC), N-FINDR, Vertex Component Analysis (VCA), and Minimum Volume Constrained Nonnegative Matrix Factorization (MVC-NMF). Moreover, the effect of the parameters in the proposed method was tested on both the synthetic hyperspectral datasets and the AVIRIS Cuprite dataset, and a recommended parameter setting was proposed. The results demonstrated that the proposed EE-DFA method shows better performance than the existing popular methods. Moreover, EE-DFA is robust under different SNR conditions.

  8. A hybrid approach for efficient anomaly detection using metaheuristic methods

    PubMed Central

    Ghanem, Tamer F.; Elkilani, Wail S.; Abdul-kader, Hatem M.

    2014-01-01

    Network intrusion detection based on anomaly detection techniques has a significant role in protecting networks and systems against harmful activities. Different metaheuristic techniques have been used for anomaly detector generation. Yet, the reported literature has not studied the use of the multi-start metaheuristic method for detector generation. This paper proposes a hybrid approach for anomaly detection in large-scale datasets using detectors generated by the multi-start metaheuristic method and genetic algorithms. The proposed approach takes some inspiration from negative selection-based detector generation. The evaluation of this approach is performed using the NSL-KDD dataset, a modified version of the widely used KDD CUP 99 dataset. The results show its effectiveness in generating a suitable number of detectors with an accuracy of 96.1% compared to competing machine learning algorithms. PMID:26199752

  9. A hybrid approach for efficient anomaly detection using metaheuristic methods.

    PubMed

    Ghanem, Tamer F; Elkilani, Wail S; Abdul-Kader, Hatem M

    2015-07-01

    Network intrusion detection based on anomaly detection techniques has a significant role in protecting networks and systems against harmful activities. Different metaheuristic techniques have been used for anomaly detector generation. Yet, the reported literature has not studied the use of the multi-start metaheuristic method for detector generation. This paper proposes a hybrid approach for anomaly detection in large-scale datasets using detectors generated by the multi-start metaheuristic method and genetic algorithms. The proposed approach takes some inspiration from negative selection-based detector generation. The evaluation of this approach is performed using the NSL-KDD dataset, a modified version of the widely used KDD CUP 99 dataset. The results show its effectiveness in generating a suitable number of detectors with an accuracy of 96.1% compared to competing machine learning algorithms.
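    The negative-selection idea the detector generation draws on can be sketched in a few lines: candidate detectors are kept only if they fail to match any normal ("self") sample, so the surviving detectors cover the anomalous regions of feature space. The interval detectors and feature values below are toy illustrations; in the paper, detector placement is tuned by the metaheuristics rather than pure random sampling:

```python
import random

# Toy negative selection on a 1-D feature in [0, 1]: detectors are points
# with a fixed matching radius; candidates overlapping any self sample are
# discarded, and surviving detectors flag anomalous inputs.

random.seed(7)

SELF = [0.10, 0.15, 0.22, 0.48, 0.51, 0.55]   # normal traffic features
RADIUS = 0.05                                  # detector matching radius

def matches(detector, sample):
    return abs(detector - sample) <= RADIUS

detectors = []
while len(detectors) < 10:
    candidate = random.random()
    if not any(matches(candidate, s) for s in SELF):   # negative selection
        detectors.append(candidate)

def is_anomalous(sample):
    return any(matches(d, sample) for d in detectors)
```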

  10. Research on cross - Project software defect prediction based on transfer learning

    NASA Astrophysics Data System (ADS)

    Chen, Ya; Ding, Xiaoming

    2018-04-01

    To address two challenges in cross-project software defect prediction, namely the distribution differences between the source and target project datasets and the class imbalance within the datasets, we propose a cross-project software defect prediction method based on transfer learning, named NTrA. First, the class imbalance of the source project data is resolved with the Augmented Neighborhood Cleaning Algorithm. Second, the data gravitation method is used to assign different weights based on the attribute similarity between the source and target project data. Finally, a defect prediction model is constructed using the TrAdaBoost algorithm. Experiments were conducted on NASA and SOFTLAB data from the published PROMISE repository. The results show that the method achieves good recall and F-measure values and good overall prediction results.

  11. Developing population models with data from marked individuals

    USGS Publications Warehouse

    Hae Yeong Ryu,; Kevin T. Shoemaker,; Eva Kneip,; Anna Pidgeon,; Patricia Heglund,; Brooke Bateman,; Thogmartin, Wayne E.; Reşit Akçakaya,

    2016-01-01

    Population viability analysis (PVA) is a powerful tool for biodiversity assessments, but its use has been limited because of the requirements for fully specified population models such as demographic structure, density-dependence, environmental stochasticity, and specification of uncertainties. Developing a fully specified population model from commonly available data sources – notably, mark–recapture studies – remains complicated due to lack of practical methods for estimating fecundity, true survival (as opposed to apparent survival), natural temporal variability in both survival and fecundity, density-dependence in the demographic parameters, and uncertainty in model parameters. We present a general method that estimates all the key parameters required to specify a stochastic, matrix-based population model, constructed using a long-term mark–recapture dataset. Unlike standard mark–recapture analyses, our approach provides estimates of true survival rates and fecundities, their respective natural temporal variabilities, and density-dependence functions, making it possible to construct a population model for long-term projection of population dynamics. Furthermore, our method includes a formal quantification of parameter uncertainty for global (multivariate) sensitivity analysis. We apply this approach to 9 bird species and demonstrate the feasibility of using data from the Monitoring Avian Productivity and Survivorship (MAPS) program. Bias-correction factors for raw estimates of survival and fecundity derived from mark–recapture data (apparent survival and juvenile:adult ratio, respectively) were non-negligible, and corrected parameters were generally more biologically reasonable than their uncorrected counterparts. Our method allows the development of fully specified stochastic population models using a single, widely available data source, substantially reducing the barriers that have until now limited the widespread application of PVA. 
This method is expected to greatly enhance our understanding of the processes underlying population dynamics and our ability to analyze viability and project trends for species of conservation concern.
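    Once survival, fecundity, and their temporal variabilities are estimated, long-term projection reduces to iterating a stage-structured matrix whose entries are redrawn each year. The two-stage sketch below is a generic illustration of such a stochastic, matrix-based model; the parameter values are invented and the real models also include density-dependence and bias corrections:

```python
import random

# Stochastic projection of a two-stage (juvenile/adult) population model:
# each year, fecundity and stage-specific survival are drawn from their
# estimated means and temporal SDs, then applied as a Leslie-style update.
# All parameter values are illustrative.

random.seed(3)

MEAN_FECUNDITY, SD_FECUNDITY = 1.2, 0.2    # offspring per adult per year
MEAN_SURV_J, SD_SURV_J = 0.35, 0.05        # juvenile survival
MEAN_SURV_A, SD_SURV_A = 0.60, 0.05        # adult survival

def project(juveniles, adults, years):
    """Return the total-population trajectory over the projection horizon."""
    trajectory = [juveniles + adults]
    for _ in range(years):
        f = max(0.0, random.gauss(MEAN_FECUNDITY, SD_FECUNDITY))
        sj = min(1.0, max(0.0, random.gauss(MEAN_SURV_J, SD_SURV_J)))
        sa = min(1.0, max(0.0, random.gauss(MEAN_SURV_A, SD_SURV_A)))
        juveniles, adults = f * adults, sj * juveniles + sa * adults
        trajectory.append(juveniles + adults)
    return trajectory

trajectory = project(juveniles=50.0, adults=100.0, years=20)
```

    A PVA would run many such replicate trajectories and summarize, e.g., the proportion falling below a quasi-extinction threshold.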

  12. New Systematic Review Methodology for Visual Impairment and Blindness for the 2010 Global Burden of Disease Study

    PubMed Central

    Bourne, Rupert; Price, Holly; Taylor, Hugh; Leasher, Janet; Keeffe, Jill; Glanville, Julie; Sieving, Pamela C; Khairallah, Moncef; Wong, Tien Yin; Zheng, Yingfeng; Mathew, Anu; Katiyar, Suchitra; Mascarenhas, Maya; Stevens, Gretchen A; Resnikoff, Serge; Gichuhi, Stephen; Naidoo, Kovin; Wallace, Diane; Kymes, Steven; Peters, Colleen; Pesudovs, Konrad; Braithwaite, Tasanee; Limburg, Hans

    2014-01-01

    Purpose To describe a systematic review of population-based prevalence studies of visual impairment (VI) and blindness worldwide over the past 32 years that informs the Global Burden of Diseases, Injuries and Risk Factors Study. Methods A systematic review (Stage 1) of medical literature from 1 January 1980 to 31 January 2012 identified indexed articles containing data on incidence, prevalence and causes of blindness and VI. Only cross-sectional population-based representative studies were selected from which to extract data for a database of age- and sex-specific prevalence data for four distance and one near visual loss categories (presenting and best-corrected). Unpublished data and data from studies using ‘rapid assessment’ methodology were later added (Stage 2). Results Stage 1 identified 14,908 references, of which 204 articles met the inclusion criteria. Stage 2 added unpublished data from 44 ‘rapid assessment’ studies and 4 other surveys. This resulted in a final dataset of 252 articles from 243 studies, of which 238 (98%) reported distance vision loss categories. Thirty-seven studies of the final dataset reported prevalence of mild VI and 4 reported near vision impairment. Conclusion We report a comprehensive systematic review of over 30 years of VI/blindness studies. While there has been an increase in population-based studies conducted in the 2000s compared to previous decades, there is limited information from certain regions (e.g. Central Africa, Central and Eastern Europe, and the Caribbean and Latin America) and younger age groups, and minimal data regarding the prevalence of near vision and mild distance visual impairment. PMID:23350553

  13. GSNFS: Gene subnetwork biomarker identification of lung cancer expression data.

    PubMed

    Doungpan, Narumol; Engchuan, Worrawat; Chan, Jonathan H; Meechai, Asawin

    2016-12-05

    Gene expression has been used to identify disease gene biomarkers, but there are ongoing challenges. Single-gene or gene-set biomarkers are inadequate to provide sufficient understanding of complex disease mechanisms and the relationships among those genes. Network-based methods have thus been considered for inferring the interactions within a group of genes to further study the disease mechanism. Recently, the Gene-Network-based Feature Set (GNFS), which is capable of handling case-control and multiclass expression data for gene biomarker identification, has been proposed, partly taking network topology into account. However, its performance relies on a greedy search for building subnetworks and thus requires further improvement. In this work, we establish a new approach named Gene Sub-Network-based Feature Selection (GSNFS) by implementing the GNFS framework with two proposed searching and scoring algorithms, namely gene-set-based (GS) search and parent-node-based (PN) search, to identify subnetworks. An additional dataset is used to validate the results. The two proposed searching algorithms of the GSNFS method for subnetwork expansion are concerned with the degree of connectivity and the scoring scheme for building subnetworks and their topology. For each iteration of expansion, the neighbour genes of the current subnetwork whose expression data improve the overall subnetwork score are recruited. While the GS search calculates the subnetwork score using an activity score of the current subnetwork and the gene expression values of its neighbours, the PN search uses the expression value of the corresponding parent of each neighbour gene. Four lung cancer expression datasets were used for subnetwork identification. In addition, the use of pathway data and protein-protein interactions as network data to capture the interactions among significant genes is discussed.
Classification was performed to compare the performance of the identified gene subnetworks with three subnetwork identification algorithms. The two searching algorithms resulted in better classification and gene/gene-set agreement compared to the original greedy search of the GNFS method. The identified lung cancer subnetwork using the proposed searching algorithm resulted in an improvement of the cross-dataset validation and an increase in the consistency of findings between two independent datasets. The homogeneity measurement of the datasets was conducted to assess dataset compatibility in cross-dataset validation. The lung cancer dataset with higher homogeneity showed a better result when using the GS search while the dataset with low homogeneity showed a better result when using the PN search. The 10-fold cross-dataset validation on the independent lung cancer datasets showed higher classification performance of the proposed algorithms when compared with the greedy search in the original GNFS method. The proposed searching algorithms provide a higher number of genes in the subnetwork expansion step than the greedy algorithm. As a result, the performance of the subnetworks identified from the GSNFS method was improved in terms of classification performance and gene/gene-set level agreement depending on the homogeneity of the datasets used in the analysis. Some common genes obtained from the four datasets using different searching algorithms are genes known to play a role in lung cancer. The improvement of classification performance and the gene/gene-set level agreement, and the biological relevance indicated the effectiveness of the GSNFS method for gene subnetwork identification using expression data.
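The greedy recruit-a-neighbour loop at the heart of such subnetwork expansion can be sketched as follows; the toy graph, expression values, and the use of a two-sample t statistic as the subnetwork score are illustrative assumptions, not the GSNFS implementation.

```python
import math

def t_stat(values, labels):
    """Two-sample t statistic of per-sample values split by binary labels."""
    a = [v for v, l in zip(values, labels) if l == 0]
    b = [v for v, l in zip(values, labels) if l == 1]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def activity(members, expr):
    """Subnetwork activity per sample: mean expression of member genes."""
    n = len(next(iter(expr.values())))
    return [sum(expr[g][i] for g in members) / len(members) for i in range(n)]

def expand(seed, graph, expr, labels):
    """Recruit, per iteration, the neighbour gene that most improves the
    discriminative score of the subnetwork activity; stop when none does."""
    members = {seed}
    score = abs(t_stat(activity(members, expr), labels))
    while True:
        frontier = {n for g in members for n in graph[g]} - members
        best, best_score = None, score
        for cand in frontier:
            s = abs(t_stat(activity(members | {cand}, expr), labels))
            if s > best_score:
                best, best_score = cand, s
        if best is None:
            return members, score
        members, score = members | {best}, best_score

# toy two-class dataset: genes A and B discriminate, C does not
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "A": [0.1, 0.0, 0.2, 1.0, 1.1, 0.9],
    "B": [0.0, 0.2, 0.1, 0.9, 1.0, 1.1],
    "C": [0.5, 0.4, 0.6, 0.5, 0.6, 0.4],
}
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
members, score = expand("A", graph, expr, labels)
print(sorted(members))  # ['A', 'B'] — C never improves the score
```

The GS and PN variants differ only in how the candidate's score is computed; the stopping rule (no neighbour improves the score) is the same.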

  14. Stochastic models to demonstrate the effect of motivated testing on HIV incidence estimates using the serological testing algorithm for recent HIV seroconversion (STARHS).

    PubMed

    White, Edward W; Lumley, Thomas; Goodreau, Steven M; Goldbaum, Gary; Hawes, Stephen E

    2010-12-01

    To produce valid seroincidence estimates, the serological testing algorithm for recent HIV seroconversion (STARHS) assumes independence between infection and testing, which may be absent in clinical data. STARHS estimates are generally greater than cohort-based estimates of incidence from observable person-time and diagnosis dates. The authors constructed a series of partial stochastic models to examine whether testing motivated by suspicion of infection could bias STARHS. One thousand Monte Carlo simulations of 10,000 men who have sex with men were generated using parameters for HIV incidence and testing frequency from data from a clinical testing population in Seattle. In one set of simulations, infection and testing dates were independent. In another set, some intertest intervals were abbreviated to reflect the distribution of intervals between suspected HIV exposure and testing in a group of Seattle men who have sex with men recently diagnosed as having HIV. Both estimation methods were applied to the simulated datasets. Both cohort-based and STARHS incidence estimates were calculated using the simulated data and compared with previously calculated, empirical cohort-based and STARHS seroincidence estimates from the clinical testing population. Under simulated independence between infection and testing, cohort-based and STARHS incidence estimates resembled cohort estimates from the clinical dataset. Under simulated motivated testing, cohort-based estimates remained unchanged, but STARHS estimates were inflated similar to empirical STARHS estimates. Varying motivation parameters appreciably affected STARHS incidence estimates, but not cohort-based estimates. Cohort-based incidence estimates are robust against dependence between testing and acquisition of infection, whereas STARHS incidence estimates are not.
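The core of the motivated-testing bias can be sketched with a minimal stdlib simulation; the recency window and the 30-day testing delay below are illustrative parameters, not the study's fitted values.

```python
import random

random.seed(42)

WINDOW = 170 / 365.0  # assumed mean recency window (~170 days), in years

def recent_fraction(n, motivated):
    """Fraction of HIV-positive testers classified 'recent' by the assay.
    Infections occur uniformly over one year; each person tests once."""
    recent = 0
    for _ in range(n):
        if motivated:
            # testing motivated by suspected exposure: short delay (mean 30 days)
            delay = random.expovariate(365.0 / 30.0)
        else:
            # testing independent of infection: uniform delay up to one year
            delay = random.uniform(0.0, 1.0)
        if delay < WINDOW:
            recent += 1
    return recent / n

frac_indep = recent_fraction(20000, motivated=False)
frac_motiv = recent_fraction(20000, motivated=True)
# motivated testing inflates the 'recent' fraction, and with it the STARHS
# incidence estimate, while person-time (cohort) estimates are unaffected
print(frac_motiv > frac_indep)  # True
```

Under independence the recent fraction is roughly WINDOW / 1 year; with motivated testing nearly every positive test falls inside the window, which is the inflation mechanism the paper models.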

  15. Development and application of a novel method for regional assessment of groundwater contamination risk in the Songhua River Basin.

    PubMed

    Nixdorf, Erik; Sun, Yuanyuan; Lin, Mao; Kolditz, Olaf

    2017-12-15

The main objective of this study is to quantify the groundwater contamination risk of the Songhua River Basin by applying a novel approach integrating public datasets, web services and numerical modelling techniques. To our knowledge, this study is the first to establish groundwater risk maps for the entire Songhua River Basin, one of the largest and most contamination-endangered river basins in China. Index-based groundwater risk maps were created with GIS tools at a spatial resolution of 30 arc-seconds by combining the results of groundwater vulnerability and hazard assessment. Groundwater vulnerability was evaluated using the DRASTIC index method based on public datasets at the highest available resolution in combination with numerical groundwater modelling. As a novel approach to overcome data scarcity at large scales, a web-mapping-service-based data query was applied to obtain an inventory of potentially hazardous sites within the basin. The groundwater risk assessment demonstrated that <1% of the Songhua River Basin is at high or very high contamination risk. These areas were mainly located in the vast plain areas, with hotspots particularly in the Changchun metropolitan area. Moreover, groundwater levels and pollution point sources were found to have a significantly larger influence on the assessment of these areas than originally assumed by the index scheme. Moderate contamination risk was assigned to 27% of the aquifers, predominantly associated with less densely populated agricultural areas. However, the majority of aquifer area in the sparsely populated mountain ranges displayed low groundwater contamination risk. Sensitivity analysis demonstrated that this novel method is valid for regional assessments of groundwater contamination risk.
Despite limitations in resolution and input data consistency, the obtained groundwater contamination risk maps will be beneficial for regional and local decision-making processes with regard to groundwater protection measures, particularly if other data availability is limited.
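The DRASTIC vulnerability score itself is a simple weighted sum of seven rated parameters; a minimal sketch using the standard weights (the per-cell ratings below are hypothetical):

```python
# Standard DRASTIC weights; per-cell ratings (1-10) here are hypothetical.
WEIGHTS = {
    "D": 5,  # Depth to water table
    "R": 4,  # net Recharge
    "A": 3,  # Aquifer media
    "S": 2,  # Soil media
    "T": 1,  # Topography (slope)
    "I": 5,  # Impact of the vadose zone
    "C": 3,  # hydraulic Conductivity
}

def drastic_index(ratings):
    """Vulnerability index for one grid cell: weighted sum of the 7 ratings.
    Possible range: 23 (all ratings 1) to 230 (all ratings 10)."""
    return sum(WEIGHTS[k] * ratings[k] for k in WEIGHTS)

# a flat plain cell with a shallow water table and high recharge
cell = {"D": 9, "R": 8, "A": 6, "S": 5, "T": 10, "I": 7, "C": 4}
print(drastic_index(cell))  # 162
```

Risk maps then combine this vulnerability index with the hazard ratings of nearby potential pollution sources.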

  16. Comparison of parametric and bootstrap method in bioequivalence test.

    PubMed

    Ahn, Byung-Jin; Yim, Dong-Seok

    2009-10-01

The estimation of 90% parametric confidence intervals (CIs) of mean AUC and Cmax ratios in bioequivalence (BE) tests is based upon the assumption that formulation effects in log-transformed data are normally distributed. To compare the parametric CIs with those obtained from nonparametric methods, we performed repeated estimation on bootstrap-resampled datasets. The AUC and Cmax values from 3 archived datasets were used. BE tests on 1,000 resampled datasets from each archived dataset were performed using SAS (Enterprise Guide Ver. 3). Bootstrap nonparametric 90% CIs of formulation effects were then compared with the parametric 90% CIs of the original datasets. The 90% CIs of formulation effects estimated from the 3 archived datasets were slightly different from the nonparametric 90% CIs obtained from BE tests on resampled datasets. Histograms and density curves of formulation effects obtained from resampled datasets were similar to those of a normal distribution. However, in 2 of 3 resampled log(AUC) datasets, the estimates of formulation effects did not follow the Gaussian distribution. Bias-corrected and accelerated (BCa) CIs, one kind of nonparametric CI of formulation effects, shifted outside the parametric 90% CIs of the archived datasets in these 2 non-normally distributed resampled log(AUC) datasets. Currently, the 80-125% rule based upon the parametric 90% CIs is widely accepted under the assumption of normally distributed formulation effects in log-transformed data. However, nonparametric CIs may be a better choice when data do not follow this assumption.
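The bootstrap comparison can be sketched in a few lines; the log(AUC) differences below are hypothetical, and a simple percentile CI stands in for the BCa interval used in the paper.

```python
import math
import random
import statistics

random.seed(1)

# hypothetical within-subject log(AUC) differences (test - reference)
diffs = [0.05, -0.12, 0.20, 0.08, -0.03, 0.15, 0.02, -0.07, 0.11, 0.04,
         0.09, -0.01, 0.06, 0.13, -0.05, 0.10, 0.03, 0.07, -0.02, 0.08]

def bootstrap_ci(data, n_boot=5000, alpha=0.10):
    """Nonparametric bootstrap percentile 90% CI for the mean log-ratio."""
    means = sorted(
        statistics.mean(random.choices(data, k=len(data)))
        for _ in range(n_boot)
    )
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

lo, hi = bootstrap_ci(diffs)
# back-transform to the ratio scale and apply the 80-125% acceptance rule
bioequivalent = 0.80 <= math.exp(lo) and math.exp(hi) <= 1.25
print(bioequivalent)  # True for this toy dataset
```

Because the percentile CI makes no normality assumption, it remains usable when the log-transformed formulation effects are visibly non-Gaussian.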

  17. Comparison of Parametric and Bootstrap Method in Bioequivalence Test

    PubMed Central

    Ahn, Byung-Jin

    2009-01-01

The estimation of 90% parametric confidence intervals (CIs) of mean AUC and Cmax ratios in bioequivalence (BE) tests is based upon the assumption that formulation effects in log-transformed data are normally distributed. To compare the parametric CIs with those obtained from nonparametric methods, we performed repeated estimation on bootstrap-resampled datasets. The AUC and Cmax values from 3 archived datasets were used. BE tests on 1,000 resampled datasets from each archived dataset were performed using SAS (Enterprise Guide Ver. 3). Bootstrap nonparametric 90% CIs of formulation effects were then compared with the parametric 90% CIs of the original datasets. The 90% CIs of formulation effects estimated from the 3 archived datasets were slightly different from the nonparametric 90% CIs obtained from BE tests on resampled datasets. Histograms and density curves of formulation effects obtained from resampled datasets were similar to those of a normal distribution. However, in 2 of 3 resampled log(AUC) datasets, the estimates of formulation effects did not follow the Gaussian distribution. Bias-corrected and accelerated (BCa) CIs, one kind of nonparametric CI of formulation effects, shifted outside the parametric 90% CIs of the archived datasets in these 2 non-normally distributed resampled log(AUC) datasets. Currently, the 80-125% rule based upon the parametric 90% CIs is widely accepted under the assumption of normally distributed formulation effects in log-transformed data. However, nonparametric CIs may be a better choice when data do not follow this assumption. PMID:19915699

  18. Differential expression analysis for RNAseq using Poisson mixed models

    PubMed Central

    Sun, Shiquan; Hood, Michelle; Scott, Laura; Peng, Qinke; Mukherjee, Sayan; Tung, Jenny

    2017-01-01

Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g. negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders. Here, we present a Poisson mixed model with two random-effects terms that account for both independent over-dispersion and sample non-independence. We also develop a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. With simulations, we show that our method properly controls for type I error and is generally more powerful than other widely used approaches, except in small samples (n < 15) with other unfavorable properties (e.g. small effect sizes). We also apply our method to three real datasets that contain related individuals, population stratification or hidden confounders. Our results show that our method increases power in all three datasets compared with other approaches, though the power gain is smallest in the smallest sample (n = 6). Our method is implemented in MACAU, freely available at www.xzlab.org/software.html. PMID:28369632
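Why a random-effects term produces over-dispersion can be shown with a minimal stdlib simulation; this is only the generative model, not the paper's MACAU inference algorithm.

```python
import math
import random
import statistics

random.seed(7)

def poisson(lam):
    """Knuth's Poisson sampler (adequate for moderate lambda)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

mu, sigma = math.log(10.0), 0.5  # baseline log-mean and random-effect SD

# counts with a per-sample random effect b_i: y_i ~ Poisson(exp(mu + b_i))
counts = [poisson(math.exp(mu + random.gauss(0.0, sigma))) for _ in range(5000)]

m, v = statistics.mean(counts), statistics.variance(counts)
print(v > m)  # True: the random effect induces over-dispersion
```

A pure Poisson model has variance equal to its mean; the latent Gaussian effect inflates the variance well beyond it, which is the behaviour the paper's first random-effects term captures (the second term additionally correlates samples through a relatedness matrix).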

  19. Determining Cutoff Point of Ensemble Trees Based on Sample Size in Predicting Clinical Dose with DNA Microarray Data.

    PubMed

    Yılmaz Isıkhan, Selen; Karabulut, Erdem; Alpar, Celal Reha

    2016-01-01

Background/Aim. Evaluating the success of dose prediction based on genetic or clinical data has substantially advanced recently. The aim of this study is to predict various clinical dose values from DNA gene expression datasets using data mining techniques. Materials and Methods. Eleven real gene expression datasets containing dose values were included. First, important genes for dose prediction were selected using iterative sure independence screening. Then, the performances of regression trees (RTs), support vector regression (SVR), RT bagging, SVR bagging, and RT boosting were examined. Results. The results demonstrated that a regression-based feature selection method substantially reduced the number of irrelevant genes from the raw datasets. Overall, the best prediction performance in nine of the 11 datasets was achieved using SVR; the second most accurate performance was provided by a gradient-boosting machine (GBM). Conclusion. Analysis of various dose values based on microarray gene expression data identified genes common to our study and the referenced studies. According to our findings, SVR and GBM can be good predictors for dose-gene datasets. Another result of the study was the identification of a sample size of n = 25 as the cutoff point for RT bagging to outperform a single RT.
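RT bagging can be sketched with one-dimensional regression stumps; the dose-expression data below are hypothetical, and real pipelines would use full regression trees over many genes.

```python
import random

random.seed(0)

def fit_stump(xs, ys):
    """1-D regression stump: pick the threshold minimizing squared error."""
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for j in range(1, len(xs)):
        thr = (xs[order[j - 1]] + xs[order[j]]) / 2
        left = [ys[i] for i in order[:j]]
        right = [ys[i] for i in order[j:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    _, thr, ml, mr = best
    return lambda x: ml if x < thr else mr

def bagged(xs, ys, n_trees=25):
    """RT bagging: average stumps fitted to bootstrap resamples."""
    stumps = []
    for _ in range(n_trees):
        idx = [random.randrange(len(xs)) for _ in range(len(xs))]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

# toy dose-response data: dose rises with a (hypothetical) expression level
xs = [0.1, 0.4, 0.9, 1.3, 1.8, 2.2, 2.7, 3.1, 3.6, 4.0]
ys = [1.0, 1.2, 1.1, 1.9, 2.4, 2.6, 2.5, 3.4, 3.6, 3.7]
model = bagged(xs, ys)
```

Averaging over resampled stumps smooths the single tree's hard threshold, which is why bagging tends to win once the sample is large enough for the resamples to differ meaningfully.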

  20. Efficient genotype compression and analysis of large genetic variation datasets

    PubMed Central

    Layer, Ryan M.; Kindlon, Neil; Karczewski, Konrad J.; Quinlan, Aaron R.

    2015-01-01

Genotype Query Tools (GQT) is a new indexing strategy that expedites analyses of genome variation datasets in VCF format based on sample genotypes, phenotypes and relationships. GQT’s compressed genotype index minimizes decompression for analysis, and performance relative to existing methods improves with cohort size. We show substantial (up to 443-fold) performance gains over existing methods and demonstrate GQT’s utility for exploring massive datasets involving thousands to millions of genomes. PMID:26550772
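The core idea of a genotype index, reducing sample queries to bitwise AND plus popcount, can be sketched without compression (GQT itself compresses these bitmaps; the data below are illustrative).

```python
# Toy genotype-presence index: one bitmap per variant per genotype class,
# so sample-level queries reduce to AND + popcount.
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]

# genotypes per variant: 0=hom-ref, 1=het, 2=hom-alt (illustrative data)
variants = {
    "var1": [0, 1, 1, 0, 2, 0],
    "var2": [0, 0, 1, 1, 0, 2],
}

def build_index(genos):
    """Bitmaps keyed by genotype class; bit i set if sample i has that class."""
    bitmaps = {0: 0, 1: 0, 2: 0}
    for i, g in enumerate(genos):
        bitmaps[g] |= 1 << i
    return bitmaps

index = {v: build_index(g) for v, g in variants.items()}

# which samples carry a non-reference allele at BOTH variants?
carriers = ((index["var1"][1] | index["var1"][2]) &
            (index["var2"][1] | index["var2"][2]))
hits = [s for i, s in enumerate(samples) if carriers >> i & 1]
print(hits)  # ['s3'] — only s3 is non-ref at both variants
```

Because each query touches only fixed-size bitmaps rather than the full VCF, throughput grows with cohort size, which matches the scaling behaviour the abstract reports.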

  1. Multi-objective based spectral unmixing for hyperspectral images

    NASA Astrophysics Data System (ADS)

    Xu, Xia; Shi, Zhenwei

    2017-02-01

Sparse hyperspectral unmixing assumes that each observed pixel can be expressed as a linear combination of several pure spectra in an a priori library. Sparse unmixing is challenging, since it is usually transformed into an NP-hard l0-norm-based optimization problem. Existing methods usually utilize a relaxation of the original l0 norm. However, the relaxation may introduce sensitive weighting parameters and additional calculation error. In this paper, we propose a novel multi-objective-based algorithm to solve the sparse unmixing problem without any relaxation. We transform sparse unmixing into a multi-objective optimization problem, which contains two correlated objectives: minimizing the reconstruction error and controlling the endmember sparsity. To improve the efficiency of multi-objective optimization, a population-based random flipping strategy is designed. Moreover, we theoretically prove that the proposed method is able to recover a guaranteed approximate solution from the spectral library within limited iterations. The proposed method can directly deal with the l0 norm via binary coding of the spectral signatures in the library. Experiments on both synthetic and real hyperspectral datasets demonstrate the effectiveness of the proposed method.
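The population-based random-flipping search can be sketched on a toy library; the signatures, the unit-abundance reconstruction, and the dominance-based archive below are illustrative simplifications of the paper's algorithm.

```python
import random

random.seed(3)

# tiny spectral library: 4 signatures, 3 bands each (hypothetical numbers)
library = [
    [1.0, 0.2, 0.1],
    [0.1, 1.0, 0.3],
    [0.2, 0.1, 1.0],
    [0.5, 0.5, 0.5],
]
pixel = [1.2, 0.3, 1.1]  # mixture of signatures 0 and 2

def objectives(mask):
    """(reconstruction error, sparsity): both objectives to be minimized."""
    recon = [sum(library[j][b] for j in range(4) if mask[j]) for b in range(3)]
    err = sum((p - r) ** 2 for p, r in zip(pixel, recon))
    return err, sum(mask)

def dominates(a, b):
    return a[0] <= b[0] and a[1] <= b[1] and a != b

# population-based search via random bit flips; archive every visited solution
population = [[random.randint(0, 1) for _ in range(4)] for _ in range(8)]
archive = {}
for _ in range(300):
    for mask in population:
        mask[random.randrange(4)] ^= 1  # flip one endmember in/out of the mix
        archive[tuple(mask)] = objectives(mask)
# keep only the non-dominated (Pareto) front of the archive
pareto = {m: o for m, o in archive.items()
          if not any(dominates(o2, o) for o2 in archive.values())}
```

The binary mask handles the l0 norm directly, with no relaxation: sparsity is simply the number of set bits, and the Pareto front trades it off against reconstruction error.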

  2. EnviroAtlas - Austin, TX - Residents with Minimal Potential Window Views of Trees by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset shows the total block group population and the percentage of the block group population that has little access to potential window views of trees at home. Having little potential access to window views of trees is defined as having no trees & forest land cover within 50 meters. The window views are considered potential because the procedure does not account for presence or directionality of windows in one's home. Forest is defined as Trees & Forest. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  3. Sensing Urban Patterns with Antenna Mappings: The Case of Santiago, Chile †

    PubMed Central

    Graells-Garrido, Eduardo; Peredo, Oscar; García, José

    2016-01-01

Mobile data have allowed us to sense urban dynamics at scales and granularities not known before, helping urban planners to cope with urban growth. A frequently used kind of dataset is Call Detail Records (CDR), used by telecommunication operators for billing purposes. Being an already extracted and processed dataset, it is inexpensive and reliable. A common assumption with respect to geography when working with CDR data is that the position of a device is the same as that of the Base Transceiver Station (BTS) it is connected to. Because the city is divided into a square grid, or into coverage zones approximated by Voronoi tessellations, CDR network events are assigned to the corresponding areas according to BTS position. This geolocation may suffer from non-negligible error in almost all cases. In this paper we propose “Antenna Virtual Placement” (AVP), a method to geolocate mobile devices according to their connections to BTS, based on decoupling antennas from their corresponding BTS according to their physical configuration (height, downtilt, and azimuth). We use AVP applied to CDR data as input for two different tasks: first, from an individual perspective, what places are meaningful to users? And second, from a global perspective, how can city areas be clustered to understand land use using floating population flows? For both tasks we propose methods that complement or improve prior work in the literature. Our proposed methods are simple, yet not trivial, and work with daily CDR data from the biggest telecommunication operator in Chile. We evaluate them in Santiago, the capital of Chile, with data from working days in June 2015.
We find that: (1) AVP improves the city coverage of CDR data by geolocating devices to more city areas than standard methods do; (2) we find important places (home and work) for 10% of the sample using just daily information, and recreate the population distribution as well as commuting trips; (3) the daily rhythms of the floating population allow us to cluster areas of the city, and to explain them from a land-use perspective by finding signature points of interest from crowdsourced geographical information. These results have implications for the design of applications based on CDR data, such as recommendation of places and routes, retail store placement, and estimation of transport effects from pollution alerts. PMID:27428979
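The geometric intuition of decoupling an antenna from its BTS can be sketched as projecting the main beam onto the ground; this is an illustrative calculation from height, azimuth, and downtilt, not the paper's exact AVP procedure.

```python
import math

def virtual_placement(y_m, x_m, height_m, azimuth_deg, downtilt_deg):
    """Project an antenna's main beam onto the ground plane.

    Works in local metric coordinates (x east, y north). The beam centre
    lands at distance height / tan(downtilt) along the azimuth direction.
    """
    if downtilt_deg <= 0:
        raise ValueError("need a positive downtilt to hit the ground")
    d = height_m / math.tan(math.radians(downtilt_deg))
    az = math.radians(azimuth_deg)  # 0 deg = north, measured clockwise
    return x_m + d * math.sin(az), y_m + d * math.cos(az)

# 30 m mast with a 6-degree downtilt, pointing due east
x, y = virtual_placement(0.0, 0.0, 30.0, 90.0, 6.0)
print(round(x, 1), round(y, 1))  # beam centre ~285.4 m east of the mast
```

Assigning events to the beam centre rather than the mast location is what spreads CDR events over more city areas than the standard BTS/Voronoi assignment.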

  4. Sensing Urban Patterns with Antenna Mappings: The Case of Santiago, Chile.

    PubMed

    Graells-Garrido, Eduardo; Peredo, Oscar; García, José

    2016-07-15

Mobile data have allowed us to sense urban dynamics at scales and granularities not known before, helping urban planners to cope with urban growth. A frequently used kind of dataset is Call Detail Records (CDR), used by telecommunication operators for billing purposes. Being an already extracted and processed dataset, it is inexpensive and reliable. A common assumption with respect to geography when working with CDR data is that the position of a device is the same as that of the Base Transceiver Station (BTS) it is connected to. Because the city is divided into a square grid, or into coverage zones approximated by Voronoi tessellations, CDR network events are assigned to the corresponding areas according to BTS position. This geolocation may suffer from non-negligible error in almost all cases. In this paper we propose "Antenna Virtual Placement" (AVP), a method to geolocate mobile devices according to their connections to BTS, based on decoupling antennas from their corresponding BTS according to their physical configuration (height, downtilt, and azimuth). We use AVP applied to CDR data as input for two different tasks: first, from an individual perspective, what places are meaningful to users? And second, from a global perspective, how can city areas be clustered to understand land use using floating population flows? For both tasks we propose methods that complement or improve prior work in the literature. Our proposed methods are simple, yet not trivial, and work with daily CDR data from the biggest telecommunication operator in Chile. We evaluate them in Santiago, the capital of Chile, with data from working days in June 2015.
We find that: (1) AVP improves the city coverage of CDR data by geolocating devices to more city areas than standard methods do; (2) we find important places (home and work) for 10% of the sample using just daily information, and recreate the population distribution as well as commuting trips; (3) the daily rhythms of the floating population allow us to cluster areas of the city, and to explain them from a land-use perspective by finding signature points of interest from crowdsourced geographical information. These results have implications for the design of applications based on CDR data, such as recommendation of places and routes, retail store placement, and estimation of transport effects from pollution alerts.

  5. Fixing Dataset Search

    NASA Technical Reports Server (NTRS)

    Lynnes, Chris

    2014-01-01

    Three current search engines are queried for ozone data at the GES DISC. The results range from sub-optimal to counter-intuitive. We propose a method to fix dataset search by implementing a robust relevancy ranking scheme. The relevancy ranking scheme is based on several heuristics culled from more than 20 years of helping users select datasets.
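A heuristic relevancy ranking of this kind can be sketched as a weighted scoring function; the field weights, the processing-level boost, and the example records below are all hypothetical, not the GES DISC heuristics.

```python
# Hypothetical heuristic relevancy score for dataset search: weighted term
# matches in title and summary, plus small domain-specific boosts.
def relevance(query_terms, dataset):
    score = 0.0
    title = dataset["title"].lower()
    summary = dataset["summary"].lower()
    for term in query_terms:
        if term in title:
            score += 3.0  # title matches matter most
        if term in summary:
            score += 1.0
    if dataset["level"] == 3:
        score += 2.0  # e.g. prefer derived (Level 3) products
    score += 0.5 / (2024 - dataset["year"] + 1)  # mild recency boost
    return score

datasets = [
    {"title": "OMI Ozone Profile", "summary": "ozone vertical profiles",
     "level": 2, "year": 2012},
    {"title": "Aerosol Index", "summary": "includes some ozone flags",
     "level": 3, "year": 2015},
]
ranked = sorted(datasets, key=lambda d: relevance(["ozone"], d), reverse=True)
print(ranked[0]["title"])  # 'OMI Ozone Profile'
```

The value of such a scheme lies in tuning the weights against known-good query/dataset pairs, which is where the two decades of user-support experience come in.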

  6. Water use trends in Washington, 1985-2005

    USGS Publications Warehouse

    Lane, R.C.

    2010-01-01

Since 1950, the U.S. Geological Survey Washington Water Science Center (USGS-WAWSC) has collected, compiled, and published, at 5-year intervals, statewide estimates of the amounts of water withdrawn and used for various purposes in Washington State. As new data and methods became available, some of the original datasets were recompiled. The most recent versions of these datasets were used in this fact sheet. The datasets are available online along with other USGS-WAWSC water-use publications at the USGS-WAWSC water-use web page: http://wa.water.usgs.gov/data/wuse/. Values in these datasets and in this fact sheet may not sum to the indicated totals due to independent rounding. Due to variations in data requirements, collection methods, terminology, and data sources, direct assessment of water-use trends between compilations is difficult. This fact sheet focuses on the trends in total State and public-supplied populations, freshwater withdrawals and use, public-supply withdrawals and deliveries, and crop irrigation withdrawals and acreage in Washington from 1985 through 2005. These four categories were included in all five compilations and were the most stable in terms of data requirements, collection methods, terminology, and data sources.

  7. Texture Descriptors Ensembles Enable Image-Based Classification of Maturation of Human Stem Cell-Derived Retinal Pigmented Epithelium

    PubMed Central

    Caetano dos Santos, Florentino Luciano; Skottman, Heli; Juuti-Uusitalo, Kati; Hyttinen, Jari

    2016-01-01

Aims. A fast, non-invasive and observer-independent method to analyze the homogeneity and maturity of human pluripotent stem cell (hPSC) derived retinal pigment epithelial (RPE) cells is warranted to assess the suitability of hPSC-RPE cells for implantation or in vitro use. The aim of this work was to develop and validate methods to create ensembles of state-of-the-art texture descriptors and to provide a robust classification tool to separate three different maturation stages of RPE cells by using phase contrast microscopy images. The same methods were also validated on a wide variety of biological image classification problems, such as histological or virus image classification. Methods. For image classification we used different texture descriptors, descriptor ensembles and preprocessing techniques. Three new methods were also tested. The first approach was an ensemble of preprocessing methods, used to create an additional set of images. The second was a region-based approach, where saliency detection and wavelet decomposition divide each image into two different regions, from which features were extracted through different descriptors. The third method was an ensemble of Binarized Statistical Image Features, based on different sizes and thresholds. A Support Vector Machine (SVM) was trained for each descriptor histogram, and the set of SVMs was combined by the sum rule. The accuracy of the computer vision tool was verified in classifying the hPSC-RPE cell maturation level. Dataset and Results. The RPE dataset contains 1862 subwindows from 195 phase contrast images. The final descriptor ensemble outperformed the most recent stand-alone texture descriptors, obtaining, for the RPE dataset, an area under the ROC curve (AUC) of 86.49% with 10-fold cross validation and 91.98% with the leave-one-image-out protocol. The generality of the three proposed approaches was ascertained with 10 more biological image datasets, obtaining an average AUC greater than 97%.
Conclusions. Here we showed that the developed ensembles of texture descriptors are able to classify the RPE cell maturation stage. Moreover, we proved that preprocessing and region-based decomposition improve many descriptors’ accuracy in biological dataset classification. Finally, we built the first public dataset of stem cell-derived RPE cells, which is available to the scientific community for classification studies. The proposed tool is available at https://www.dei.unipd.it/node/2357 and the RPE dataset at http://www.biomeditech.fi/data/RPE_dataset/. Both are available at https://figshare.com/s/d6fb591f1beb4f8efa6f. PMID:26895509
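Sum-rule fusion of the per-descriptor SVMs can be sketched as follows; the min-max normalization and the example scores are illustrative assumptions.

```python
# Sum-rule fusion of per-descriptor classifiers: each SVM yields one score
# per class; scores are min-max normalized and summed, and the argmax wins.
def normalize(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def sum_rule(per_descriptor_scores):
    """per_descriptor_scores: one per-class score list per trained SVM."""
    n_classes = len(per_descriptor_scores[0])
    fused = [0.0] * n_classes
    for scores in per_descriptor_scores:
        for c, s in enumerate(normalize(scores)):
            fused[c] += s
    return max(range(n_classes), key=fused.__getitem__)

# three hypothetical descriptor SVMs scoring three maturation stages
scores = [[0.2, 0.9, 0.1],
          [0.4, 0.7, 0.3],
          [0.8, 0.6, 0.2]]
print(sum_rule(scores))  # stage 1 wins the fused vote
```

Normalizing before summing keeps any single descriptor's score scale from dominating the ensemble, which is the usual rationale for the sum rule.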

  8. Genotype-based gene signature of glioma risk.

    PubMed

    Huang, Yen-Tsung; Zhang, Yi; Wu, Zhijin; Michaud, Dominique S

    2017-07-01

Glioma accounts for 80% of malignant brain tumors, but its etiologic determinants remain elusive. Despite genetic susceptibility loci identified by genome-wide association study (GWAS), the agnostic approach leaves open the possibility that other susceptibility genes remain to be discovered. Here we conduct a gene-centric integrative GWAS (iGWAS) of glioma risk that combines transcriptomics and genetics. We synthesized a brain transcriptomics dataset (n = 354), a GWAS dataset (n = 4203), and an advanced glioma tumor transcriptomic dataset (n = 483) to conduct an iGWAS. Using the expression quantitative trait loci (eQTL) dataset, we built models to predict gene expression for the GWAS data, based on eQTL genotypes. With the predicted gene expression, iGWAS analyses were performed using a novel statistical method. Gene signature risk score was constructed using a penalized logistic regression model. A total of 30527 transcripts were analyzed using the iGWAS approach. Four novel glioma susceptibility genes were identified with internal and external validation, including DRD5 (P = 3.0 × 10-79), WDR1 (P = 8.4 × 10-77), NOMO1 (P = 1.3 × 10-25), and PDXDC1 (P = 8.3 × 10-24). The genotype-predicted transcription pattern between cases and controls is consistent with that between tumor and its matched normal tissue. The genotype-based 4-gene signature improved the classification between glioma cases and controls based on age, gender, and population stratification, with area under the receiver operating characteristic curve increasing from 0.77 to 0.85 (P = 8.1 × 10-23). A new genotype-based gene signature of glioma was identified using a novel iGWAS approach, which integrates multiplatform genomic data as well as different genetic association studies.

  9. Genotype Imputation with Thousands of Genomes

    PubMed Central

    Howie, Bryan; Marchini, Jonathan; Stephens, Matthew

    2011-01-01

    Genotype imputation is a statistical technique that is often used to increase the power and resolution of genetic association studies. Imputation methods work by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset, and a number of approaches have been proposed for choosing subsets of reference haplotypes that will maximize accuracy in a given study population. These panel selection strategies become harder to apply and interpret as sequencing efforts like the 1000 Genomes Project produce larger and more diverse reference sets, which led us to develop an alternative framework. Our approach is built around a new approximation that uses local sequence similarity to choose a custom reference panel for each study haplotype in each region of the genome. This approximation makes it computationally efficient to use all available reference haplotypes, which allows us to bypass the panel selection step and to improve accuracy at low-frequency variants by capturing unexpected allele sharing among populations. Using data from HapMap 3, we show that our framework produces accurate results in a wide range of human populations. We also use data from the Malaria Genetic Epidemiology Network (MalariaGEN) to provide recommendations for imputation-based studies in Africa. We demonstrate that our approximation improves efficiency in large, sequence-based reference panels, and we discuss general computational strategies for modern reference datasets. Genome-wide association studies will soon be able to harness the power of thousands of reference genomes, and our work provides a practical way for investigators to use this rich information. New methodology from this study is implemented in the IMPUTE2 software package. PMID:22384356
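Choosing a custom panel by local sequence similarity can be sketched with Hamming distance over a genomic window; this is an illustrative simplification, not IMPUTE2's actual approximation, and the haplotypes below are toy data.

```python
def hamming(a, b):
    """Number of differing sites between two equal-length haplotypes."""
    return sum(x != y for x, y in zip(a, b))

def custom_panel(study_hap, reference, k=2):
    """Pick the k reference haplotypes most similar to the study haplotype
    in this region, forming a per-haplotype, per-region custom panel."""
    return sorted(reference, key=lambda ref: hamming(study_hap, ref))[:k]

# toy haplotypes over 8 biallelic sites (0/1 alleles)
reference = [
    [0, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1, 1, 0, 0],
]
study = [0, 0, 1, 1, 0, 1, 0, 1]
panel = custom_panel(study, reference)
```

Because the panel is chosen per region, a study haplotype can borrow from different reference populations in different parts of the genome, which is how unexpected allele sharing at low-frequency variants gets captured.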

  10. Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000

    NASA Astrophysics Data System (ADS)

    Reba, Meredith; Reitsma, Femke; Seto, Karen C.

    2016-06-01

    How were cities distributed globally in the past? How many people lived in these cities? How did cities influence their local and regional environments? In order to understand the current era of urbanization, we must understand long-term historical urbanization trends and patterns. However, to date there is no comprehensive record of spatially explicit, historic, city-level population data at the global scale. Here, we developed the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000, by digitizing, transcribing, and geocoding historical, archaeological, and census-based urban population data previously published in tabular form by Chandler and Modelski. The dataset creation process also required data cleaning and harmonization procedures to make the data internally consistent. Additionally, we created a reliability ranking for each geocoded location to assess the geographic uncertainty of each data point. The dataset provides the first spatially explicit archive of the location and size of urban populations over the last 6,000 years and can contribute to an improved understanding of contemporary and historical urbanization trends.

  11. Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000

    PubMed Central

    Reba, Meredith; Reitsma, Femke; Seto, Karen C.

    2016-01-01

    How were cities distributed globally in the past? How many people lived in these cities? How did cities influence their local and regional environments? In order to understand the current era of urbanization, we must understand long-term historical urbanization trends and patterns. However, to date there is no comprehensive record of spatially explicit, historic, city-level population data at the global scale. Here, we developed the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000, by digitizing, transcribing, and geocoding historical, archaeological, and census-based urban population data previously published in tabular form by Chandler and Modelski. The dataset creation process also required data cleaning and harmonization procedures to make the data internally consistent. Additionally, we created a reliability ranking for each geocoded location to assess the geographic uncertainty of each data point. The dataset provides the first spatially explicit archive of the location and size of urban populations over the last 6,000 years and can contribute to an improved understanding of contemporary and historical urbanization trends. PMID:27271481

  12. Weakly supervised image semantic segmentation based on clustering superpixels

    NASA Astrophysics Data System (ADS)

    Yan, Xiong; Liu, Xiaohua

    2018-04-01

    In this paper, we propose an image semantic segmentation model that is trained from image-level labeled images. The proposed model starts with superpixel segmentation, and features of the superpixels are extracted by a trained CNN. We introduce a superpixel-based graph and apply a graph partition method to group correlated superpixels into clusters. To acquire inter-label correlations between the image-level labels in the dataset, we utilize label co-occurrence statistics and visual contextual cues simultaneously. Finally, we formulate the task of mapping appropriate image-level labels to the detected clusters as a convex minimization problem. Experimental results on the MSRC-21 and LabelMe datasets show that the proposed method performs better than most weakly supervised methods and is even comparable to fully supervised methods.
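
    The grouping step can be sketched in miniature: connect superpixels whose extracted features are sufficiently similar and take connected components as clusters. This is an illustrative stand-in for the paper's graph-partition method; the `cosine` and `cluster_superpixels` helpers are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_superpixels(features, threshold=0.9):
    """Group superpixels: connect pairs whose feature similarity exceeds
    `threshold`, then take connected components via union-find."""
    n = len(features)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(features[i], features[j]) > threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

    A real pipeline would use CNN feature vectors per superpixel; here any numeric vectors work.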

  13. Construction of brain atlases based on a multi-center MRI dataset of 2020 Chinese adults

    PubMed Central

    Liang, Peipeng; Shi, Lin; Chen, Nan; Luo, Yishan; Wang, Xing; Liu, Kai; Mok, Vincent CT; Chu, Winnie CW; Wang, Defeng; Li, Kuncheng

    2015-01-01

    Despite known morphological differences (e.g., brain shape and size) between populations of different origins (e.g., age and race), the Chinese brain atlas is comparatively understudied. In the current study, we developed a statistical brain atlas based on a multi-center, high-quality magnetic resonance imaging (MRI) dataset of 2020 Chinese adults (18–76 years old). We constructed 12 Chinese brain atlases from age 20 to age 75 at 5-year intervals. A new Chinese brain standard space, coordinates, and brain area labels were further defined. The new Chinese brain atlas was validated in brain registration and segmentation. We found that, in contrast to the MNI152 template, the proposed Chinese atlas showed higher accuracy in hippocampus segmentation and relatively smaller shape deformations during registration. These results indicate that a population-specific, time-varying brain atlas may be more appropriate for studies involving Chinese populations. PMID:26678304
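
    The 5-year atlas spacing implies a simple lookup from a subject's age to the nearest atlas anchor. A minimal sketch of that bookkeeping (the `atlas_for_age` helper is hypothetical, not from the paper):

```python
def atlas_for_age(age, start=20, stop=75, step=5):
    """Map a subject's age to the nearest 5-year atlas anchor
    (20, 25, ..., 75), clamping ages outside the covered range."""
    age = max(start, min(stop, age))
    return start + step * round((age - start) / step)
```

    With anchors 20, 25, ..., 75 this yields exactly the 12 atlases described in the abstract.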

  14. Who, What, When, Where? Determining the Health Implications of Wildfire Smoke Exposure

    NASA Astrophysics Data System (ADS)

    Ford, B.; Lassman, W.; Gan, R.; Burke, M.; Pfister, G.; Magzamen, S.; Fischer, E. V.; Volckens, J.; Pierce, J. R.

    2016-12-01

    Exposure to poor air quality is associated with negative impacts on human health. A large natural source of PM in the western U.S. is wildland fires. Accurately attributing health endpoints to wildland-fire smoke requires determining the exposed population, which is difficult because most current methods for monitoring air quality lack high temporal and spatial resolution. Therefore, there is a growing effort to combine multiple datasets into blended products of smoke exposure that exploit the strengths of each dataset. In this work, we combine model (WRF-Chem) simulations, NASA satellite (MODIS) observations, and in-situ surface monitors to improve exposure estimates. We also introduce a social-media dataset of self-reported smoke/haze/pollution to improve population-level exposure estimates for the summer of 2015. Finally, we use these detailed exposure estimates in different epidemiologic study designs to provide an in-depth understanding of the role wildfire exposure plays in health outcomes.
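
    One common way to blend model, satellite, and monitor estimates into a single exposure value is inverse-variance weighting; the sketch below assumes that scheme, since the abstract does not specify how the datasets are combined.

```python
def blend_estimates(estimates):
    """Combine PM estimates from several sources (e.g. model, satellite,
    monitor) by inverse-variance weighting: sources with smaller
    uncertainty (variance) contribute more to the blended value.

    `estimates` is a list of (value, variance) pairs."""
    num = sum(value / var for value, var in estimates)
    den = sum(1.0 / var for _, var in estimates)
    return num / den
```

    Equal variances reduce this to a plain average; a noisier source is down-weighted in proportion to its variance.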

  15. GenoCore: A simple and fast algorithm for core subset selection from large genotype datasets.

    PubMed

    Jeong, Seongmun; Kim, Jae-Yoon; Jeong, Soon-Chun; Kang, Sung-Taeg; Moon, Jung-Kyung; Kim, Namshin

    2017-01-01

    Selecting core subsets from plant genotype datasets is important for enhancing cost-effectiveness and for shortening the time required for genome-wide association studies (GWAS), genomics-assisted breeding of crop species, and similar analyses. Recently, large numbers of genetic markers (>100,000 single nucleotide polymorphisms) have been identified from high-density single nucleotide polymorphism (SNP) arrays and next-generation sequencing (NGS) data. However, no software has been available for picking an efficient and consistent core subset from such a huge dataset; software is needed that can coherently extract the genetically important samples in a population. We here present a new program, GenoCore, which can quickly and efficiently find a core subset representing the entire population. We introduce simple coverage and diversity scores, which reflect genotype errors and genetic variation, and which help to select samples rapidly and accurately for crop genotype datasets. Comparison of our method to other core collection software on example datasets validates its performance with respect to genetic distance, diversity, coverage, required system resources, and the number of selected samples. GenoCore selects the smallest, most consistent, and most representative core collection from all samples, uses less memory with more efficient scores, and shows greater genetic coverage than the other software tested. GenoCore was written in R and can be accessed online, with an example dataset and test results, at https://github.com/lovemun/Genocore.
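
    A toy version of coverage-driven core selection: greedily add the sample that covers the most not-yet-seen (marker, genotype) pairs. This illustrates the idea of a coverage score only; it is not GenoCore's actual algorithm, and the helper name is invented.

```python
def select_core(genotypes, target_coverage=1.0):
    """Greedy core selection: repeatedly add the sample contributing the
    most new (marker, genotype) pairs, until the core covers the target
    fraction of all pairs present in the full population."""
    universe = {(m, g) for sample in genotypes for m, g in enumerate(sample)}
    covered, core = set(), []
    remaining = list(range(len(genotypes)))
    while len(covered) < target_coverage * len(universe) and remaining:
        best = max(remaining,
                   key=lambda i: len({(m, g) for m, g in enumerate(genotypes[i])} - covered))
        core.append(best)
        covered |= {(m, g) for m, g in enumerate(genotypes[best])}
        remaining.remove(best)
    return core
```

    On four two-marker samples carrying complementary alleles, two samples suffice for full coverage.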

  16. ESSG-based global spatial reference frame for datasets interrelation

    NASA Astrophysics Data System (ADS)

    Yu, J. Q.; Wu, L. X.; Jia, Y. J.

    2013-10-01

    To understand the highly complex Earth system, a large volume, and a large variety, of datasets on the planet Earth are being obtained, distributed, and shared worldwide every day. However, few existing systems concentrate on the distribution and interrelation of different datasets in a common Global Spatial Reference Frame (GSRF), which poses an invisible obstacle to data sharing and scientific collaboration. The Group on Earth Observations (GEO) has recently established a new GSRF, named the Earth System Spatial Grid (ESSG), for global dataset distribution, sharing, and interrelation in its 2012-2015 work plan. The ESSG may bridge the gap among different spatial datasets and hence overcome these obstacles. This paper presents the implementation of the ESSG-based GSRF, which requires a reference spheroid, a grid subdivision scheme, and a suitable encoding system. The radius of the ESSG reference spheroid was set to double the approximate Earth radius so that datasets from the different domains of earth system science are covered. The same positioning and orientation parameters as Earth-Centred Earth-Fixed (ECEF) were adopted for the ESSG reference spheroid so that any other GSRF can be freely transformed into the ESSG-based GSRF. The spheroid degenerated octree grid with radius refinement (SDOG-R) and its encoding method were taken as the grid subdivision and encoding scheme for their good performance in many respects. A triple (C, T, A) model is introduced to represent and link different datasets based on the ESSG-based GSRF. Finally, methods of coordinate transformation between the ESSG-based GSRF and other GSRFs are presented to make the ESSG-based GSRF operable and propagable.

  17. Integrative analysis of gene expression and DNA methylation using unsupervised feature extraction for detecting candidate cancer biomarkers.

    PubMed

    Moon, Myungjin; Nakai, Kenta

    2018-04-01

    Cancer biomarker discovery is currently one of the most important research topics worldwide. In particular, detecting significant genes related to cancer is an important task for early diagnosis and treatment. Conventional studies mostly focus on genes that are differentially expressed in different states of cancer; however, noise in gene expression datasets and the limited information in small datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation using normalization and unsupervised feature extraction to identify candidate cancer biomarkers from renal cell carcinoma RNA-seq datasets. Gene expression and DNA methylation datasets are normalized by the Box-Cox transformation and integrated into a one-dimensional dataset that retains the major characteristics of the original datasets via unsupervised feature extraction, and differentially expressed genes are selected from the integrated dataset. Use of the integrated dataset demonstrated improved performance compared with conventional approaches that utilize gene expression or DNA methylation datasets alone. Validation against the literature showed that a considerable number of the top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired with the proposed analysis method. Furthermore, we expect that the proposed method can be extended to various types of multi-omics datasets.
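
    The Box-Cox normalization step has a closed form; a minimal sketch of the per-observation transform (choosing the parameter λ by maximum likelihood is omitted here):

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive observation x:
    (x**lam - 1) / lam for lam != 0, and log(x) in the limit lam == 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam
```

    λ = 1 leaves the data shifted but unrescaled, while λ → 0 recovers the log transform, which is why Box-Cox smoothly interpolates between the two.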

  18. A Lightweight Remote Parallel Visualization Platform for Interactive Massive Time-varying Climate Data Analysis

    NASA Astrophysics Data System (ADS)

    Li, J.; Zhang, T.; Huang, Q.; Liu, Q.

    2014-12-01

    Today's climate datasets are characterized by large volume, a high degree of spatiotemporal complexity, and fast evolution over time. Because visualizing large, distributed climate datasets is computationally intensive, traditional desktop-based visualization applications cannot handle the load. Recently, scientists have developed remote visualization techniques to address this computational issue. Remote visualization techniques usually leverage server-side parallel computing capabilities to perform visualization tasks and deliver the results to clients over the network. In this research, we aim to build a remote parallel visualization platform for visualizing and analyzing massive climate data. Our platform is built on ParaView, one of the most popular open-source remote visualization and analysis applications. To further enhance the scalability and stability of the platform, we employ cloud computing techniques to support its deployment. In this platform, all climate datasets are regular grid data stored in NetCDF format. Three types of data access are supported: remote datasets served by OPeNDAP servers, datasets hosted on the web visualization server, and local datasets. Regardless of the data access method, all visualization tasks are completed on the server side to reduce the workload on clients. As a proof of concept, we have implemented a set of scientific visualization methods to show the feasibility of the platform. Preliminary results indicate that the framework can address the computational limitations of desktop-based visualization applications.

  19. Heterogeneous Ensemble Combination Search Using Genetic Algorithm for Class Imbalanced Data Classification.

    PubMed

    Haque, Mohammad Nazmul; Noman, Nasimul; Berretta, Regina; Moscato, Pablo

    2016-01-01

    Classification of datasets with imbalanced sample distributions has always been a challenge. In general, a popular approach for enhancing classification performance is the construction of an ensemble of classifiers. However, the performance of an ensemble depends on the choice of constituent base classifiers. Therefore, we propose a genetic algorithm-based search method for finding the optimum combination from a pool of base classifiers to form a heterogeneous ensemble. The algorithm, called GA-EoC, utilises 10-fold cross-validation on training data to evaluate the quality of each candidate ensemble. To combine the base classifiers' decisions into the ensemble's output, we used the simple and widely used majority voting approach. The proposed algorithm, along with a random sub-sampling approach to balance the class distribution, has been used for classifying class-imbalanced datasets. Additionally, if a feature set was not available, we used the (α, β)-k Feature Set method to select a better subset of features for classification. We have tested GA-EoC with three benchmark datasets from the UCI Machine Learning Repository, one Alzheimer's disease dataset and a subset of the PubFig database of Columbia University. In general, the performance of the proposed method on the chosen datasets is robust and better than that of the constituent base classifiers and many other well-known ensembles. Based on our empirical study we claim that a genetic algorithm is a superior and reliable approach to heterogeneous ensemble construction, and we expect that the proposed GA-EoC would perform consistently in other cases.
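
    The chromosome encoding and majority-vote fitness can be sketched in a few lines. This is a toy, mutation-only stand-in for the published GA-EoC (no crossover, no 10-fold cross-validation); all helper names are invented.

```python
import random

def majority_vote(preds):
    """Combine base-classifier predictions for one sample by majority vote."""
    return max(set(preds), key=preds.count)

def fitness(mask, all_preds, labels):
    """Accuracy of the ensemble encoded by a 0/1 chromosome `mask`
    selecting classifiers from the pool `all_preds`."""
    chosen = [p for p, bit in zip(all_preds, mask) if bit]
    if not chosen:
        return 0.0
    correct = sum(majority_vote([c[i] for c in chosen]) == y
                  for i, y in enumerate(labels))
    return correct / len(labels)

def ga_search(all_preds, labels, pop=20, gens=30, seed=0):
    """Tiny elitist GA over classifier-subset chromosomes."""
    rng = random.Random(seed)
    n = len(all_preds)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda m: fitness(m, all_preds, labels), reverse=True)
        elite = population[: pop // 2]          # keep the best half
        children = []
        for parent in elite:
            child = parent[:]
            child[rng.randrange(n)] ^= 1        # flip one gene
            children.append(child)
        population = elite + children
    return max(population, key=lambda m: fitness(m, all_preds, labels))
```

    With two accurate classifiers and one inverted one, any chromosome keeping the accurate pair (or either alone) reaches perfect fitness, and the search converges to one of those masks.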

  20. Heterogeneous Ensemble Combination Search Using Genetic Algorithm for Class Imbalanced Data Classification

    PubMed Central

    Haque, Mohammad Nazmul; Noman, Nasimul; Berretta, Regina; Moscato, Pablo

    2016-01-01

    Classification of datasets with imbalanced sample distributions has always been a challenge. In general, a popular approach for enhancing classification performance is the construction of an ensemble of classifiers. However, the performance of an ensemble depends on the choice of constituent base classifiers. Therefore, we propose a genetic algorithm-based search method for finding the optimum combination from a pool of base classifiers to form a heterogeneous ensemble. The algorithm, called GA-EoC, utilises 10-fold cross-validation on training data to evaluate the quality of each candidate ensemble. To combine the base classifiers’ decisions into the ensemble’s output, we used the simple and widely used majority voting approach. The proposed algorithm, along with a random sub-sampling approach to balance the class distribution, has been used for classifying class-imbalanced datasets. Additionally, if a feature set was not available, we used the (α, β)-k Feature Set method to select a better subset of features for classification. We have tested GA-EoC with three benchmark datasets from the UCI Machine Learning Repository, one Alzheimer’s disease dataset and a subset of the PubFig database of Columbia University. In general, the performance of the proposed method on the chosen datasets is robust and better than that of the constituent base classifiers and many other well-known ensembles. Based on our empirical study we claim that a genetic algorithm is a superior and reliable approach to heterogeneous ensemble construction, and we expect that the proposed GA-EoC would perform consistently in other cases. PMID:26764911

  1. Plant selection for ethnobotanical uses on the Amalfi Coast (Southern Italy).

    PubMed

    Savo, V; Joy, R; Caneva, G; McClatchey, W C

    2015-07-15

    Many ethnobotanical studies have investigated selection criteria for medicinal and non-medicinal plants. In this paper we test several statistical methods on different ethnobotanical datasets in order to 1) define to what extent the nature of the datasets can affect the interpretation of results; and 2) determine whether selection for different plant uses is based on phylogeny or on other selection criteria. We considered three different ethnobotanical datasets: two datasets of medicinal plants and a dataset of non-medicinal plants (handicraft production, domestic and agro-pastoral practices), together with two floras of the Amalfi Coast. We performed residual analysis from linear regression, the binomial test and a Bayesian approach for identifying under-used and over-used plant families within the ethnobotanical datasets. Percentages of agreement were calculated to compare the results of the analyses. We also analyzed the relationship between plant selection and phylogeny, chorology, life form and habitat using the chi-square test. Pearson's residuals for each of the significant chi-square analyses were examined to investigate alternative hypotheses of plant selection criteria. The results of the three statistical methods differed within the same dataset, and between datasets and floras, but with some similarities. In the two medicinal datasets, only Lamiaceae was identified in both floras as an over-used family by all three statistical methods. All statistical methods in one flora agreed that Malvaceae was over-used and Poaceae under-used, but this was not consistent with the second flora, in which one statistical result was non-significant. All other families showed some discrepancy in significance across methods or floras. Significant over- or under-use was observed in only a minority of cases. The chi-square analyses were significant for phylogeny, life form and habitat.
Pearson's residuals indicated a non-random selection of woody species for non-medicinal uses and an under-use of plants of temperate forests for medicinal uses. Our study showed that selection criteria for plant uses (including medicinal) are not always based on phylogeny. The comparison of different statistical methods (regression, binomial and Bayesian) under different conditions led to the conclusion that the most conservative results are obtained using regression analysis.
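
    Of the three approaches compared, the binomial test is the simplest to reproduce: under random selection, the number of used species in a family of n species follows Binomial(n, p), with p the flora-wide usage rate. A sketch under that reading (hypothetical helper names; the authors' exact test setup may differ):

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value that a
    family contributes at least k used species purely by chance."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def over_used(used_in_family, family_size, overall_rate, alpha=0.05):
    """Flag a family as over-used when its count of used species is
    significantly above the flora-wide usage rate."""
    return binom_sf(used_in_family, family_size, overall_rate) < alpha
```

    A family with 9 of 10 species used against a 10% background rate is flagged; 1 of 10 is not.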

  2. Genomic BLUP including additive and dominant variation in purebreds and F1 crossbreds, with an application in pigs.

    PubMed

    Vitezica, Zulma G; Varona, Luis; Elsen, Jean-Michel; Misztal, Ignacy; Herring, William; Legarra, Andrès

    2016-01-29

    Most developments in quantitative genetics theory focus on the study of intra-breed/line concepts. With the availability of massive genomic information, it becomes necessary to revisit the theory for crossbred populations. We propose methods to construct genomic covariances with additive and non-additive (dominance) inheritance in the case of pure lines and crossbred populations. We describe substitution effects and dominant deviations across two pure parental populations and the crossbred population. Gene effects are assumed to be independent of the origin of alleles and allelic frequencies can differ between parental populations. Based on these assumptions, the theoretical variance components (additive and dominant) are obtained as a function of marker effects and allelic frequencies. The additive genetic variance in the crossbred population includes the biological additive and dominant effects of a gene and a covariance term. Dominance variance in the crossbred population is proportional to the product of the heterozygosity coefficients of both parental populations. A genomic BLUP (best linear unbiased prediction) equivalent model is presented. We illustrate this approach by using pig data (two pure lines and their cross, including 8265 phenotyped and genotyped sows). For the total number of piglets born, the dominance variance in the crossbred population represented about 13 % of the total genetic variance. Dominance variation is only marginally important for litter size in the crossbred population. We present a coherent marker-based model that includes purebred and crossbred data and additive and dominant actions. Using this model, it is possible to estimate breeding values, dominant deviations and variance components in a dataset that comprises data on purebred and crossbred individuals. These methods can be exploited to plan assortative mating in pig, maize or other species, in order to generate superior crossbred individuals in terms of performance.
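
    The statement that crossbred dominance variance is proportional to the product of the parental heterozygosity coefficients can be read per locus as σ²_D = Σⱼ 2p₁ⱼq₁ⱼ · 2p₂ⱼq₂ⱼ · dⱼ². The sketch below implements that reading as an illustration only; consult the paper for the exact derivation and notation.

```python
def crossbred_dominance_variance(freqs_line1, freqs_line2, dom_effects):
    """Sum over markers of 2*p1*q1 * 2*p2*q2 * d**2 for an F1 cross:
    each locus contributes in proportion to the product of the two
    parental heterozygosity coefficients (2pq) and the squared
    dominance effect d."""
    total = 0.0
    for p1, p2, d in zip(freqs_line1, freqs_line2, dom_effects):
        total += (2.0 * p1 * (1.0 - p1)) * (2.0 * p2 * (1.0 - p2)) * d * d
    return total
```

    A marker fixed in either parental line (p = 0 or 1) contributes nothing, which matches the intuition that dominance variance requires heterozygosity in both parents.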

  3. CT-based manual segmentation and evaluation of paranasal sinuses.

    PubMed

    Pirner, S; Tingelhoff, K; Wagner, I; Westphal, R; Rilk, M; Wahl, F M; Bootz, F; Eichhorn, Klaus W G

    2009-04-01

    Manual segmentation of computed tomography (CT) datasets was performed to support robot-assisted endoscope movement during functional endoscopic sinus surgery (FESS); segmented 3D models are needed to define the robot's workspace. A total of 50 preselected CT datasets were each segmented in 150-200 coronal slices, with 24 landmarks set. Three different segmentation colors represent distinct risk areas. Extension and volumetric measurements were performed, and a three-dimensional reconstruction was generated after segmentation. Manual segmentation took 8-10 h per CT dataset. The mean volumes were: right maxillary sinus 17.4 cm³, left side 17.9 cm³; right frontal sinus 4.2 cm³, left side 4.0 cm³, total frontal sinuses 7.9 cm³; sphenoid sinus right side 5.3 cm³, left side 5.5 cm³, total sphenoid sinus volume 11.2 cm³. Our manually segmented 3D models present each patient's individual anatomy, with a special focus on structures at risk according to the colored risk areas. For safe robot assistance, the high-accuracy models represent an average of the population with respect to anatomical variation, extension and volumetric measurements, and they can be used as a database for automatic model-based segmentation. None of the segmentation methods described so far provides risk segmentation. The robot's maximum distance to the segmented border can be adjusted according to the differently colored areas.
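
    The volumetric measurements reduce to counting segmented voxels and multiplying by the voxel size. A minimal sketch (hypothetical helper with an invented 0/1 mask layout; real CT data would come from DICOM headers):

```python
def segmented_volume_cm3(mask, voxel_mm):
    """Volume of a segmented structure: number of labeled voxels times
    the voxel volume, converted from mm^3 to cm^3.

    `mask` is a 3D nested list of 0/1 labels; `voxel_mm` gives the
    voxel edge lengths (dx, dy, dz) in millimetres."""
    dx, dy, dz = voxel_mm
    n = sum(sum(row) for slice_ in mask for row in slice_)
    return n * dx * dy * dz / 1000.0
```

    Eight voxels of 5 mm edge length make exactly 1 cm³.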

  4. Development and validation of a national data registry for midwife-led births: the Midwives Alliance of North America Statistics Project 2.0 dataset.

    PubMed

    Cheyney, Melissa; Bovbjerg, Marit; Everson, Courtney; Gordon, Wendy; Hannibal, Darcy; Vedam, Saraswathi

    2014-01-01

    In 2004, the Midwives Alliance of North America's (MANA's) Division of Research developed a Web-based data collection system to gather information on the practices and outcomes associated with midwife-led births in the United States. This system, called the MANA Statistics Project (MANA Stats), grew out of a widely acknowledged need for more reliable data on outcomes by intended place of birth. This article describes the history and development of the MANA Stats birth registry and provides an analysis of the 2.0 dataset's content, strengths, and limitations. Data collection and review procedures for the MANA Stats 2.0 dataset are described, along with methods for the assessment of data accuracy. We calculated descriptive statistics for client demographics and contributing midwife credentials, and assessed the quality of data by calculating point estimates, 95% confidence intervals, and kappa statistics for key outcomes on pre- and postreview samples of records. The MANA Stats 2.0 dataset (2004-2009) contains 24,848 courses of care, 20,893 of which are for women who planned a home or birth center birth at the onset of labor. The majority of these records were planned home births (81%). Births were attended primarily by certified professional midwives (73%), and clients were largely white (92%), married (87%), and college-educated (49%). Data quality analyses of 9932 records revealed no differences between pre- and postreviewed samples for 7 key benchmarking variables (kappa, 0.98-1.00). The MANA Stats 2.0 data were accurately entered by participants; any errors in this dataset are likely random and not systematic. The primary limitation of the 2.0 dataset is that the sample was captured through voluntary participation; thus, it may not accurately reflect population-based outcomes. The dataset's primary strength is that it will allow for the examination of research questions on normal physiologic birth and midwife-led birth outcomes by intended place of birth. 
© 2014 by the American College of Nurse-Midwives.
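
    The kappa statistics used in the pre-/post-review comparison correct raw agreement for agreement expected by chance. A sketch of Cohen's kappa for two codings of the same records:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' codings of the same records:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                 # observed
    cats = set(a) | set(b)
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)  # chance
    return 1.0 if pe == 1.0 else (po - pe) / (1 - pe)
```

    Identical codings give kappa 1.0; agreement no better than chance gives 0.0, so the reported 0.98-1.00 range indicates near-perfect data entry.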

  5. An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier.

    PubMed

    Xia, Jiaqi; Peng, Zhenling; Qi, Dawei; Mu, Hongbo; Yang, Jianyi

    2017-03-15

    Protein fold classification is a critical step in protein structure prediction. There are two possible ways to classify protein folds: template-based fold assignment, and ab-initio prediction using machine learning algorithms. Combining both solutions to improve prediction accuracy had never been explored before. We developed two algorithms, HH-fold and SVM-fold, for protein fold classification. HH-fold is a template-based fold assignment algorithm using the HHsearch program. SVM-fold is a support vector machine-based ab-initio classification algorithm, in which a comprehensive set of features is extracted from three complementary sequence profiles. These two algorithms are then combined, resulting in the ensemble approach TA-fold. We performed a comprehensive assessment of the proposed methods by comparing with ab-initio methods and template-based threading methods on six benchmark datasets. An accuracy of 0.799 was achieved by TA-fold on the DD dataset, which consists of proteins from 27 folds; this represents an improvement of 5.4-11.7% over ab-initio methods. After updating this dataset to include more proteins from the same folds, the accuracy increased to 0.971. In addition, TA-fold achieved >0.9 accuracy on a large dataset consisting of 6451 proteins from 184 folds. Experiments on the LE dataset show that TA-fold consistently outperforms other threading methods at the family, superfamily and fold levels. The success of TA-fold is attributed to the combination of template-based fold assignment and ab-initio classification using features from complementary sequence profiles that contain rich evolutionary information. Availability: http://yanglab.nankai.edu.cn/TA-fold/. Contact: yangjy@nankai.edu.cn or mhb-506@163.com. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  6. Computer-aided classification of mammographic masses using the deep learning technology: a preliminary study

    NASA Astrophysics Data System (ADS)

    Qiu, Yuchen; Yan, Shiju; Tan, Maxine; Cheng, Samuel; Liu, Hong; Zheng, Bin

    2016-03-01

    Although mammography is the only clinically accepted imaging modality for population-based breast cancer screening, its efficacy remains controversial. One of the major challenges is how to help radiologists more accurately classify benign versus malignant lesions. The purpose of this study is to investigate a new mammographic mass classification scheme based on a deep learning method. In this study, we used an image dataset of 560 regions of interest (ROIs) extracted from digital mammograms, comprising 280 malignant and 280 benign mass ROIs. An eight-layer deep learning network was applied, which employs three pairs of convolution-max-pooling layers for automatic feature extraction and a multilayer perceptron (MLP) classifier for feature categorization. In order to improve the robustness of the selected features, each convolution layer is connected to a max-pooling layer. The first, second, and third convolution layers used 20, 10, and 5 feature maps, respectively. The convolution networks are followed by the MLP classifier, which generates a classification score predicting the likelihood that an ROI depicts a malignant mass. Among the 560 ROIs, 420 were used for training and the remaining 140 for validation. The results show that the new deep learning based classifier yielded an area under the receiver operating characteristic curve (AUC) of 0.810 ± 0.036. This study demonstrates the potential superiority of a deep learning based classifier in distinguishing malignant from benign breast masses without segmenting the lesions or extracting pre-defined image features.
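
    The reported AUC has a direct probabilistic reading: the chance that a randomly chosen malignant ROI receives a higher classification score than a randomly chosen benign one. A sketch of that rank-statistic computation:

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve as the Mann-Whitney statistic: the
    probability that a positive (malignant) score exceeds a negative
    (benign) score, counting ties as one half."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

    Perfect separation gives 1.0, indistinguishable scores give 0.5, and fully inverted scores give 0.0.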

  7. A semiparametric graphical modelling approach for large-scale equity selection.

    PubMed

    Liu, Han; Mulvey, John; Zhao, Tianqi

    2016-01-01

    We propose a new stock selection strategy that exploits rebalancing returns and improves portfolio performance. To effectively harvest rebalancing gains, we apply ideas from elliptical-copula graphical modelling and stability inference to select stocks that are as independent as possible. The proposed elliptical-copula graphical model has a latent Gaussian representation; its structure can be effectively inferred using regularized rank-based estimators. The resulting algorithm is computationally efficient and scales to large datasets. To show the efficacy of the proposed method, we apply it to equity selection based on a 16-year health care stock dataset and a large 34-year stock dataset. Empirical tests show that the proposed method is superior to alternative strategies, including a principal component analysis-based approach and the classical Markowitz strategy based on the traditional buy-and-hold assumption.

  8. Selection-Fusion Approach for Classification of Datasets with Missing Values

    PubMed Central

    Ghannad-Rezaie, Mostafa; Soltanian-Zadeh, Hamid; Ying, Hao; Dong, Ming

    2010-01-01

    This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. This approach is particularly designed for classification of datasets with a small number of samples and a high percentage of missing values, where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset; it then combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing derivation of a mathematical framework for it. A trade-off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier. To manage this trade-off, a numerical criterion is proposed for predicting overall performance. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (eight datasets in total). Experimental results show that the classification accuracy of the proposed method is superior to that of the widely used multiple imputation method and four other methods. They also show that the level of superiority depends on the pattern and percentage of missing values. PMID:20212921
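
    The subset-construction step keys on missing-value patterns. A minimal sketch of the grouping, using None for a missing entry (the paper's clustering of patterns is more elaborate than this exact-match version):

```python
def group_by_missing_pattern(rows):
    """Group sample indices by which features are present (None means
    missing); each group can then train its own classifier on the
    features that all of its samples share."""
    groups = {}
    for idx, row in enumerate(rows):
        pattern = tuple(v is not None for v in row)
        groups.setdefault(pattern, []).append(idx)
    return groups
```

    Samples with identical missingness land in the same subset, which is the precondition for training a per-subset classifier.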

  9. Detection of stable community structures within gut microbiota co-occurrence networks from different human populations.

    PubMed

    Jackson, Matthew A; Bonder, Marc Jan; Kuncheva, Zhana; Zierer, Jonas; Fu, Jingyuan; Kurilshikov, Alexander; Wijmenga, Cisca; Zhernakova, Alexandra; Bell, Jordana T; Spector, Tim D; Steves, Claire J

    2018-01-01

    Microbes in the gut microbiome form sub-communities based on shared niche specialisations and specific interactions between individual taxa. The inter-microbial relationships that define these communities can be inferred from the co-occurrence of taxa across multiple samples. Here, we present an approach to identify comparable communities within different gut microbiota co-occurrence networks, and demonstrate its use by comparing the gut microbiota community structures of three geographically diverse populations. We combine gut microbiota profiles from 2,764 British, 1,023 Dutch, and 639 Israeli individuals, derive co-occurrence networks between their operational taxonomic units, and detect comparable communities within them. Comparing populations we find that community structure is significantly more similar between datasets than expected by chance. Mapping communities across the datasets, we also show that communities can have similar associations to host phenotypes in different populations. This study shows that the community structure within the gut microbiota is stable across populations, and describes a novel approach that facilitates comparative community-centric microbiome analyses.
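
    Co-occurrence networks of this kind start from how often two taxa appear in the same samples. A toy sketch using Jaccard overlap as the edge criterion (the study infers associations with more careful statistics; the threshold here is arbitrary):

```python
def cooccurrence_edges(presence, min_jaccard=0.5):
    """Build co-occurrence edges between taxa: connect two OTUs when the
    Jaccard overlap of the sample sets they occur in is high enough.

    `presence` maps taxon -> set of sample ids where it was observed."""
    taxa = sorted(presence)
    edges = []
    for i, t1 in enumerate(taxa):
        for t2 in taxa[i + 1:]:
            s1, s2 = presence[t1], presence[t2]
            union = s1 | s2
            if union and len(s1 & s2) / len(union) >= min_jaccard:
                edges.append((t1, t2))
    return edges
```

    Community detection (e.g. connected components or modularity methods) would then run on the resulting edge list.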

  10. An orientation soil survey at the Pebble Cu-Au-Mo porphyry deposit, Alaska

    USGS Publications Warehouse

    Smith, Steven M.; Eppinger, Robert G.; Fey, David L.; Kelley, Karen D.; Giles, S.A.

    2009-01-01

    Soil samples were collected in 2007 and 2008 along three traverses across the giant Pebble Cu-Au-Mo porphyry deposit. Within each soil pit, four subsamples were collected following recommended protocols for each of ten commonly-used and proprietary leach/digestion techniques. The significance of geochemical patterns generated by these techniques was classified by visual inspection of plots showing individual element concentrations by each analytical method along the 2007 traverse. A simple matrix of element versus method, populated with values based on this significance classification, provides a means of ranking the utility of the methods and elements at this deposit. The interpretation of a complex multi-element dataset derived from multiple analytical techniques is challenging. An example of vanadium results from a single leach technique is used to illustrate several possible interpretations of the data.

  11. Cadastral Database Positional Accuracy Improvement

    NASA Astrophysics Data System (ADS)

    Hashim, N. M.; Omar, A. H.; Ramli, S. N. M.; Omar, K. M.; Din, N.

    2017-10-01

    Positional Accuracy Improvement (PAI) is the process of refining the geometry of features in a geospatial dataset to improve their actual positions. The actual position relates to the absolute position in a specific coordinate system and to the relation with neighbouring features. With the growth of spatially based technologies, especially Geographical Information Systems (GIS) and Global Navigation Satellite Systems (GNSS), a PAI campaign is inevitable, especially for legacy cadastral databases. Integrating a legacy dataset with a higher-accuracy dataset such as GNSS observations is a potential solution for improving the legacy dataset. However, merely merging both datasets will distort the relative geometry; the improved dataset must be further treated to minimize inherent errors and to fit the new, accurate dataset. The main focus of this study is to describe a method of angular-based Least Squares Adjustment (LSA) for the PAI of a legacy dataset. The existing high-accuracy dataset, known as the National Digital Cadastral Database (NDCDB), is then used as a benchmark to validate the results. It was found that the proposed technique is highly suitable for positional accuracy improvement of legacy spatial datasets.

  12. Simulation of Smart Home Activity Datasets

    PubMed Central

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-01-01

    A globally ageing population is resulting in an increased prevalence of chronic conditions that affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches, yet access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation. PMID:26087371

  13. Simulation of Smart Home Activity Datasets.

    PubMed

    Synnott, Jonathan; Nugent, Chris; Jeffers, Paul

    2015-06-16

    A globally ageing population is resulting in an increased prevalence of chronic conditions that affect older adults. Such conditions require long-term care and management to maximize quality of life, placing an increasing strain on healthcare resources. Intelligent environments such as smart homes facilitate long-term monitoring of activities in the home through the use of sensor technology. Access to sensor datasets is necessary for the development of novel activity monitoring and recognition approaches, yet access to such datasets is limited due to issues such as sensor cost, availability and deployment time. The use of simulated environments and sensors may address these issues and facilitate the generation of comprehensive datasets. This paper provides a review of existing approaches for the generation of simulated smart home activity datasets, including model-based approaches and interactive approaches which implement virtual sensors, environments and avatars. The paper also provides recommendations for future work in intelligent environment simulation.

  14. Using neutral, selected, and hitchhiker loci to assess connectivity of marine populations in the genomic era

    PubMed Central

    Gagnaire, Pierre-Alexandre; Broquet, Thomas; Aurelle, Didier; Viard, Frédérique; Souissi, Ahmed; Bonhomme, François; Arnaud-Haond, Sophie; Bierne, Nicolas

    2015-01-01

    Estimating the rate of exchange of individuals among populations is a central concern of evolutionary ecology and its applications to conservation and management. For instance, the efficiency of protected areas in sustaining locally endangered populations and ecosystems depends on reserve network connectivity. Population genetics theory offers a powerful framework for estimating dispersal distances and migration rates from molecular data. In the marine realm, however, decades of molecular studies have met with limited success in inferring genetic connectivity, due to the frequent lack of spatial genetic structure in species exhibiting high fecundity and dispersal capabilities. This is especially true within biogeographic regions bounded by well-known hotspots of genetic differentiation. Here, we provide an overview of the current methods for estimating genetic connectivity using molecular markers and propose several directions for improving existing approaches using large population genomic datasets. We highlight several issues that limit the effectiveness of methods based on neutral markers when there is virtually no genetic differentiation among samples. We then focus on alternative methods based on markers influenced by selection. Although some of these methodologies are still underexplored, our aim was to stimulate new research to test how broadly they are applicable to nonmodel marine species. We argue that the increased ability to apply the concepts of cline analyses will improve dispersal inferences across physical and ecological barriers that reduce connectivity locally. We finally present how neutral markers hitchhiking with selected loci can also provide information about connectivity patterns within apparently well-mixed biogeographic regions. We contend that one of the most promising applications of population genomics is the use of outlier loci to delineate relevant conservation units and related eco-geographic features across which connectivity can be measured. PMID:26366195

  15. Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation.

    PubMed

    Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi

    2015-01-01

    Most popular clustering methods make strong assumptions about the dataset. For example, k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm, which uses a new isolation criterion called centroid distance. Compared with other density-based isolation criteria, the proposed centroid distance isolation criterion addresses the problems caused by high dimensionality and varying density. An experiment on a designed two-dimensional benchmark dataset shows that the proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method in separating naturally isolated clusters but can also identify clusters which are adjacent, overlapping, or under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records containing demographic and behavioral information. The results show that the LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it.

  16. Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation

    PubMed Central

    Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi

    2015-01-01

    Most popular clustering methods make strong assumptions about the dataset. For example, k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named the localized ambient solidity separation (LASS) algorithm, which uses a new isolation criterion called centroid distance. Compared with other density-based isolation criteria, the proposed centroid distance isolation criterion addresses the problems caused by high dimensionality and varying density. An experiment on a designed two-dimensional benchmark dataset shows that the proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method in separating naturally isolated clusters but can also identify clusters which are adjacent, overlapping, or under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records containing demographic and behavioral information. The results show that the LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it. PMID:26221133

  17. Identifying key genes in glaucoma based on a benchmarked dataset and the gene regulatory network.

    PubMed

    Chen, Xi; Wang, Qiao-Ling; Zhang, Meng-Hui

    2017-10-01

    The current study aimed to identify key genes in glaucoma based on a benchmarked dataset and a gene regulatory network (GRN). Local and global noise was added to the gene expression dataset to produce a benchmarked dataset. Differentially-expressed genes (DEGs) between patients with glaucoma and normal controls were identified utilizing the Linear Models for Microarray Data (Limma) package based on the benchmarked dataset. A total of 5 GRN inference methods, including Zscore, GeneNet, the context likelihood of relatedness (CLR) algorithm, Partial Correlation coefficient with Information Theory (PCIT) and GEne Network Inference with Ensemble of Trees (Genie3), were evaluated using receiver operating characteristic (ROC) and precision and recall (PR) curves. The inference method with the best performance was selected to construct the GRN. Subsequently, topological centrality analysis (degree, closeness and betweenness) was conducted to identify key genes in the GRN of glaucoma. Finally, the key genes were validated by performing reverse transcription-quantitative polymerase chain reaction (RT-qPCR). A total of 176 DEGs were detected from the benchmarked dataset. The ROC and PR curves of the 5 methods were analyzed and it was determined that Genie3 had a clear advantage over the other methods; thus, Genie3 was used to construct the GRN. Following topological centrality analysis, 14 key genes for glaucoma were identified, including IL6, EPHA2 and GSTT1, and 5 of these 14 key genes were validated by RT-qPCR. Therefore, the current study identified 14 key genes in glaucoma, which may be potential biomarkers for use in the diagnosis of glaucoma and aid in identifying the molecular mechanism of this disease.
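
    The topological centrality step described above can be sketched in a few lines; degree and closeness centrality are shown below on a hypothetical hub-gene network (betweenness is omitted). The gene names and edges are illustrative only, not the paper's inferred GRN.

```python
from collections import deque

def bfs_distances(adj, src):
    """Shortest-path (hop) distances from src via breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def degree_centrality(adj):
    return {n: len(nbrs) for n, nbrs in adj.items()}

def closeness_centrality(adj):
    """(n - 1) divided by the sum of distances to all reachable nodes."""
    out = {}
    for n in adj:
        d = bfs_distances(adj, n)
        total = sum(d.values())
        out[n] = (len(d) - 1) / total if total else 0.0
    return out

# Hypothetical star-shaped network around a hub gene.
net = {"IL6": {"A", "B", "C"}, "A": {"IL6"}, "B": {"IL6"}, "C": {"IL6"}}
deg = degree_centrality(net)    # the hub has the highest degree
clo = closeness_centrality(net)
```

    Genes that score highly on several such centrality measures are the "key gene" candidates the study carries forward to RT-qPCR validation.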

  18. DCS-SVM: a novel semi-automated method for human brain MR image segmentation.

    PubMed

    Ahmadvand, Ali; Daliri, Mohammad Reza; Hajiali, Mohammadtaghi

    2017-11-27

    In this paper, a novel method is proposed that appropriately segments magnetic resonance (MR) brain images into three main tissues. This paper proposes an extension of our previous work, in which we suggested a combination of multiple classifiers (CMC)-based method named dynamic classifier selection-dynamic local training local Tanimoto index (DCS-DLTLTI) for MR brain image segmentation into three main cerebral tissues. This idea is used here to develop a novel method that uses more complex and accurate classifiers like the support vector machine (SVM) in the ensemble. This work is challenging because CMC-based methods are time-consuming, especially on huge datasets like three-dimensional (3D) brain MR images. Moreover, SVM is a powerful method for modeling datasets with complex feature spaces, but it also has a huge computational cost for big datasets, especially those with strong interclass variability problems and with more than two classes, such as 3D brain images; SVM therefore cannot be used directly in DCS-DLTLTI. We thus propose a novel approach named "DCS-SVM" to use SVM in DCS-DLTLTI and improve the accuracy of the segmentation results. The proposed method is applied to the well-known datasets of the Internet Brain Segmentation Repository (IBSR) and promising results are obtained.

  19. A New Combinatorial Optimization Approach for Integrated Feature Selection Using Different Datasets: A Prostate Cancer Transcriptomic Study

    PubMed Central

    Puthiyedth, Nisha; Riveros, Carlos; Berretta, Regina; Moscato, Pablo

    2015-01-01

    Background: The joint study of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers obtained from smaller studies. The approach generally followed is based on the fact that as the total number of samples increases, we expect to have greater power to detect associations of interest. This methodology has been applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. While this approach is well established in biostatistics, the introduction of new combinatorial optimization models to address this issue has not been explored in depth. In this study, we introduce a new model for the integration of multiple datasets and we show its application in transcriptomics. Methods: We propose a new combinatorial optimization problem that addresses the core issue of biomarker detection in integrated datasets. Optimal solutions for this model deliver a feature selection from a panel of prospective biomarkers. The model we propose is a generalised version of the (α,β)-k-Feature Set problem. We illustrate the performance of this new methodology via a challenging meta-analysis task involving six prostate cancer microarray datasets. The results are then compared to the popular RankProd meta-analysis tool and to what can be obtained by analysing the individual datasets by statistical and combinatorial methods alone. Results: Application of the integrated method resulted in a more informative signature than the rank-based meta-analysis or individual dataset results, and overcomes problems arising from real world datasets. The set of genes identified is highly significant in the context of prostate cancer. The method used does not rely on homogenisation or transformation of values to a common scale, and at the same time is able to capture markers associated with subgroups of the disease. PMID:26106884

  20. Robust semi-automatic segmentation of pulmonary subsolid nodules in chest computed tomography scans

    NASA Astrophysics Data System (ADS)

    Lassen, B. C.; Jacobs, C.; Kuhnigk, J.-M.; van Ginneken, B.; van Rikxoort, E. M.

    2015-02-01

    The malignancy of lung nodules is most often detected by analyzing changes in the nodule diameter in follow-up scans. A recent study showed that comparing the volume or the mass of a nodule over time is much more significant than comparing the diameter. Since the survival rate is higher when the disease is still at an early stage, it is important to detect the growth rate as soon as possible; however, manual segmentation of a volume is time-consuming. While there are several well-evaluated methods for the segmentation of solid nodules, less work has been done on subsolid nodules, which actually show a higher malignancy rate than solid nodules. In this work we present a fast, semi-automatic method for the segmentation of subsolid nodules. As minimal user interaction, the method expects a user-drawn stroke on the largest diameter of the nodule. First, threshold-based region growing is performed based on intensity analysis of the nodule region and surrounding parenchyma. In the next step, the chest wall is removed by a combination of connected component analysis and convex hull calculation. Finally, attached vessels are detached by morphological operations. The method was evaluated on all nodules of the publicly available LIDC/IDRI database that were manually segmented and rated as non-solid or part-solid by four radiologists (Dataset 1) and three radiologists (Dataset 2). For these 59 nodules, the Jaccard index for the agreement of the proposed method with the manual reference segmentations was 0.52/0.50 (Dataset 1/Dataset 2), compared to an inter-observer agreement of the manual segmentations of 0.54/0.58 (Dataset 1/Dataset 2). Furthermore, the inter-observer agreement using the proposed method (i.e. different input strokes) was analyzed and gave a Jaccard index of 0.74/0.74 (Dataset 1/Dataset 2). The presented method provides satisfactory segmentation results with minimal observer effort in minimal time and can reduce the inter-observer variability for the segmentation of subsolid nodules in clinical routine.
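
    The Jaccard index used for the agreement figures above is straightforward to compute once segmentations are represented as sets of voxel coordinates; this is a generic sketch with toy data, not the paper's evaluation code.

```python
def jaccard(a, b):
    """Jaccard index between two segmentations given as sets of voxel ids:
    |intersection| / |union|."""
    if not a and not b:
        return 1.0  # two empty segmentations agree perfectly
    return len(a & b) / len(a | b)

ref = {(0, 0), (0, 1), (1, 0), (1, 1)}   # toy reference segmentation
seg = {(0, 1), (1, 0), (1, 1), (2, 1)}   # toy method segmentation
score = jaccard(ref, seg)  # 3 shared voxels / 5 total -> 0.6
```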

  1. Stratification of American hearing aid users by age and audiometric characteristics: a method for representative sampling.

    PubMed

    Aronoff, Justin M; Yoon, Yang-soo; Soli, Sigfrid D

    2010-06-01

    Stratified sampling plans can increase the accuracy and facilitate the interpretation of a dataset characterizing a large population. However, such sampling plans have found minimal use in hearing aid (HA) research, in part because of a paucity of quantitative data on the characteristics of HA users. The goal of this study was to devise a quantitatively derived stratified sampling plan for HA research, so that such studies will be more representative and generalizable, and the results obtained using this method are more easily reinterpreted as the population changes. Pure-tone average (PTA) and age information were collected for 84,200 HAs acquired in 2006 and 2007. The distribution of PTA and age was quantified for each HA type and for a composite of all HA users. Based on their respective distributions, PTA and age were each divided into three groups, the combination of which defined the stratification plan. The most populous PTA and age group was also subdivided, allowing greater homogeneity within strata. Finally, the percentage of users in each stratum was calculated. This article provides a stratified sampling plan for HA research, based on a quantitative analysis of the distribution of PTA and age for HA users. Adopting such a sampling plan will make HA research results more representative and generalizable. In addition, data acquired using such plans can be reinterpreted as the HA population changes.
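
    A proportional stratified allocation of the kind described above can be sketched as follows; the stratum labels and percentages are hypothetical, not the paper's PTA-by-age figures.

```python
def proportional_allocation(strata_pct, n_total):
    """Allocate a total sample size across strata in proportion to their
    population percentages, giving leftover units to the largest
    fractional remainders so the allocations sum to n_total."""
    raw = {s: n_total * p / 100.0 for s, p in strata_pct.items()}
    alloc = {s: int(r) for s, r in raw.items()}
    leftover = n_total - sum(alloc.values())
    for s in sorted(raw, key=lambda s: raw[s] - alloc[s], reverse=True)[:leftover]:
        alloc[s] += 1
    return alloc

# Hypothetical hearing-loss-by-age strata percentages.
strata = {"mild/young": 15.0, "mild/old": 35.0,
          "moderate/old": 40.0, "severe/old": 10.0}
plan = proportional_allocation(strata, 60)
# A 60-subject study would recruit 9, 21, 24 and 6 subjects respectively.
```

    As the abstract notes, re-deriving the percentages from updated dispensing data lets results collected under an old plan be reweighted as the HA population changes.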

  2. Correlation set analysis: detecting active regulators in disease populations using prior causal knowledge

    PubMed Central

    2012-01-01

    Background: Identification of active causal regulators is a crucial problem in understanding the mechanism of diseases or finding drug targets. Methods that infer causal regulators directly from primary data have been proposed and successfully validated in some cases. These methods necessarily require very large sample sizes or a mix of different data types. Recent studies have shown that prior biological knowledge can successfully boost a method's ability to find regulators. Results: We present a simple data-driven method, Correlation Set Analysis (CSA), for comprehensively detecting active regulators in disease populations by integrating co-expression analysis and a specific type of literature-derived causal relationships. Instead of investigating the co-expression level between regulators and their regulatees, we focus on the coherence of the regulatees of a regulator. Using simulated datasets we show that our method performs very well at recovering even weak regulatory relationships with a low false discovery rate. Using three separate real biological datasets we were able to recover both well-known and as-yet-undescribed active regulators for each disease population. The results are represented as a rank-ordered list of regulators and reveal both single and higher-order regulatory relationships. Conclusions: CSA is an intuitive data-driven way of selecting directed perturbation experiments that are relevant to a disease population of interest and represents a starting point for further investigation. Our findings demonstrate that combining co-expression analysis on regulatee sets with a literature-derived network can successfully identify causal regulators and help develop possible hypotheses to explain disease progression. PMID:22443377
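
    The coherence-of-regulatees idea can be sketched as the mean pairwise Pearson correlation among a regulator's regulatee expression profiles: a regulator whose targets move together across samples scores highly. The gene names and expression values below are invented for illustration; CSA's scoring and ranking machinery is omitted.

```python
from math import sqrt
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def regulatee_coherence(expr, regulatees):
    """Mean pairwise correlation among a regulator's regulatees."""
    pairs = list(combinations(regulatees, 2))
    return sum(pearson(expr[a], expr[b]) for a, b in pairs) / len(pairs)

# Hypothetical expression profiles across four disease samples.
expr = {"g1": [1.0, 2.0, 3.0, 4.0],
        "g2": [2.0, 4.0, 6.0, 8.0],   # moves with g1
        "g3": [4.0, 3.0, 2.0, 1.0]}   # anti-correlated with g1
score = regulatee_coherence(expr, ["g1", "g2"])  # coherent set -> 1.0
```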

  3. Roads Data Conflation Using Update High Resolution Satellite Images

    NASA Astrophysics Data System (ADS)

    Abdollahi, A.; Riyahi Bakhtiari, H. R.

    2017-11-01

    Urbanization, industrialization and modernization are rapidly growing in developing countries. New industrial cities, with all the problems brought on by rapid population growth, need infrastructure to support that growth, which has led to the expansion and development of the road network. A great deal of road network data was produced using traditional methods in past years. Over time, a large amount of descriptive information has been assigned to these map data, but their geometric accuracy and precision are no longer adequate for today's needs. In this regard, it is necessary to improve the geometric accuracy of road network data while preserving the descriptive data attributed to them, and to update the existing geodatabases. Due to the size and extent of the country, updating the road network maps using traditional methods is time consuming and costly. Conversely, using remote sensing technology and geographic information systems can reduce costs, save time and increase accuracy and speed. With the increasing availability of high resolution satellite imagery and geospatial datasets, there is an urgent need to combine geographic information from overlapping sources to retain accurate data, minimize redundancy, and reconcile data conflicts. In this research, an innovative method for vector-to-imagery conflation is presented that integrates several image-based and vector-based algorithms. The SVM method was used for image classification and the Level Set method to extract the roads, while the different types of road intersections were extracted from the imagery using morphological operators. To find the corresponding points between the extracted intersections and the vector data, a matching function based on the nearest-neighbor method was applied. Finally, after identifying the matching points, the rubber-sheeting method was used to align the two datasets. Two criteria, residuals and RMSE, were used to evaluate accuracy. The results demonstrated excellent performance: the average root-mean-square error decreased from 11.8 to 4.1 m.
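
    The RMSE criterion used to evaluate the conflation can be sketched as follows, with hypothetical matched intersection coordinates (this is the generic error measure, not the paper's evaluation code):

```python
from math import sqrt, hypot

def rmse(matched_pairs):
    """Root-mean-square error over matched point pairs ((x, y), (x, y))."""
    residuals = [hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in matched_pairs]
    return sqrt(sum(r * r for r in residuals) / len(residuals))

# Hypothetical legacy vs. image-derived intersection coordinates (metres).
pairs = [((100.0, 200.0), (103.0, 204.0)),   # residual 5 m
         ((500.0, 120.0), (500.0, 132.0))]   # residual 12 m
error = rmse(pairs)  # sqrt((25 + 144) / 2) ~ 9.19 m
```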

  4. Fast and accurate inference of local ancestry in Latino populations

    PubMed Central

    Baran, Yael; Pasaniuc, Bogdan; Sankararaman, Sriram; Torgerson, Dara G.; Gignoux, Christopher; Eng, Celeste; Rodriguez-Cintron, William; Chapela, Rocio; Ford, Jean G.; Avila, Pedro C.; Rodriguez-Santana, Jose; Burchard, Esteban Gonzàlez; Halperin, Eran

    2012-01-01

    Motivation: It is becoming increasingly evident that the analysis of genotype data from recently admixed populations is providing important insights into medical genetics and population history. Such analyses have been used to identify novel disease loci, to understand recombination rate variation and to detect recent selection events. The utility of such studies crucially depends on accurate and unbiased estimation of the ancestry at every genomic locus in recently admixed populations. Although various methods have been proposed and shown to be extremely accurate in two-way admixtures (e.g. African Americans), only a few approaches have been proposed and thoroughly benchmarked on multi-way admixtures (e.g. Latino populations of the Americas). Results: To address these challenges we introduce here methods for local ancestry inference which leverage the structure of linkage disequilibrium in the ancestral population (LAMP-LD), and incorporate the constraint of Mendelian segregation when inferring local ancestry in nuclear family trios (LAMP-HAP). Our algorithms uniquely combine hidden Markov models (HMMs) of haplotype diversity within a novel window-based framework to achieve superior accuracy as compared with published methods. Further, unlike previous methods, the structure of our HMM does not depend on the number of reference haplotypes but on a fixed constant, and it is thereby capable of utilizing large datasets while remaining highly efficient and robust to over-fitting. Through simulations and analysis of real data from 489 nuclear trio families from the mainland US, Puerto Rico and Mexico, we demonstrate that our methods achieve superior accuracy compared with published methods for local ancestry inference in Latinos. Availability: http://lamp.icsi.berkeley.edu/lamp/lampld/ Contact: bpasaniu@hsph.harvard.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22495753

  5. Binomial outcomes in dataset with some clusters of size two: can the dependence of twins be accounted for? A simulation study comparing the reliability of statistical methods based on a dataset of preterm infants.

    PubMed

    Sauzet, Odile; Peacock, Janet L

    2017-07-20

    The analysis of perinatal outcomes often involves datasets with some multiple births. These are datasets mostly formed of independent observations and a limited number of clusters of size two (twins) and maybe of size three or more. This non-independence needs to be accounted for in the statistical analysis. Using simulated data based on a dataset of preterm infants we have previously investigated the performance of several approaches to the analysis of continuous outcomes in the presence of some clusters of size two. Mixed models have been developed for binomial outcomes but very little is known about their reliability when only a limited number of small clusters are present. Using simulated data based on a dataset of preterm infants we investigated the performance of several approaches to the analysis of binomial outcomes in the presence of some clusters of size two. Logistic models, several methods of estimation for the logistic random intercept models and generalised estimating equations were compared. The presence of even a small percentage of twins means that a logistic regression model will underestimate all parameters but a logistic random intercept model fails to estimate the correlation between siblings if the percentage of twins is too small and will provide similar estimates to logistic regression. The method which seems to provide the best balance between estimation of the standard error and the parameter for any percentage of twins is the generalised estimating equations. This study has shown that the number of covariates or the level two variance do not necessarily affect the performance of the various methods used to analyse datasets containing twins but when the percentage of small clusters is too small, mixed models cannot capture the dependence between siblings.
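
    A hedged sketch of the kind of simulation described above: binary outcomes are generated with a shared random intercept per cluster, so twins are correlated while singletons form clusters of size one. The parameter values and function names are illustrative assumptions; fitting and comparing the logistic, random-intercept, and GEE models is omitted.

```python
import random
from math import exp

def simulate(n_singletons, n_twin_pairs, beta0=-1.0, sigma_u=1.0, seed=1):
    """Simulate binary outcomes with a cluster random intercept:
    logit(p_i) = beta0 + u_i, with u_i ~ N(0, sigma_u^2).
    Twins share the same u_i; singletons are clusters of size one.
    Returns a list of (cluster_id, outcome) rows."""
    rng = random.Random(seed)
    rows, cluster = [], 0
    for size in [1] * n_singletons + [2] * n_twin_pairs:
        u = rng.gauss(0.0, sigma_u)                 # cluster effect
        p = 1.0 / (1.0 + exp(-(beta0 + u)))         # inverse logit
        for _ in range(size):
            rows.append((cluster, int(rng.random() < p)))
        cluster += 1
    return rows

# 90 singleton infants plus 10 twin pairs -> 110 observations, 100 clusters.
data = simulate(n_singletons=90, n_twin_pairs=10)
```

    Varying the share of twin pairs in such simulated datasets is what lets the study compare how logistic regression, random-intercept models, and GEE behave as the percentage of small clusters changes.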

  6. Analyzing contentious relationships and outlier genes in phylogenomics.

    PubMed

    Walker, Joseph F; Brown, Joseph W; Smith, Stephen A

    2018-06-08

    Recent studies have demonstrated that conflict is common among gene trees in phylogenomic studies, and that less than one percent of genes may ultimately drive species tree inference in supermatrix analyses. Here, we examined two datasets where supermatrix and coalescent-based species trees conflict. We identified two highly influential "outlier" genes in each dataset. When removed from each dataset, the inferred supermatrix trees matched the topologies obtained from coalescent analyses. We also demonstrate that, while the outlier genes in the vertebrate dataset have been shown in a previous study to be the result of errors in orthology detection, the outlier genes from a plant dataset did not exhibit any obvious systematic error and therefore may be the result of some biological process yet to be determined. While topological comparisons among a small set of alternate topologies can be helpful in discovering outlier genes, they can be limited in several ways, such as assuming all genes share the same topology. Coalescent species tree methods relax this assumption but do not explicitly facilitate the examination of specific edges. Coalescent methods often also assume that conflict is the result of incomplete lineage sorting (ILS). Here we explored a framework that allows for quickly examining alternative edges and support for large phylogenomic datasets that does not assume a single topology for all genes. For both datasets, these analyses provided detailed results confirming the support for coalescent-based topologies. This framework suggests that we can improve our understanding of the underlying signal in phylogenomic datasets by asking more targeted edge-based questions.

  7. A comparison of two global datasets of extreme sea levels and resulting flood exposure

    NASA Astrophysics Data System (ADS)

    Muis, Sanne; Verlaan, Martin; Nicholls, Robert J.; Brown, Sally; Hinkel, Jochen; Lincke, Daniel; Vafeidis, Athanasios T.; Scussolini, Paolo; Winsemius, Hessel C.; Ward, Philip J.

    2017-04-01

    Estimating the current risk of coastal flooding requires adequate information on extreme sea levels. For over a decade, the only global dataset available was the DINAS-COAST Extreme Sea Levels (DCESL) dataset, which applies a static approximation to estimate extreme sea levels. Recently, a dynamically derived dataset was developed: the Global Tide and Surge Reanalysis (GTSR) dataset. Here, we compare the two datasets. The differences between DCESL and GTSR are generally larger than the confidence intervals of GTSR. Compared to observed extremes, DCESL generally overestimates extremes, with a mean bias of 0.6 m. With a mean bias of -0.2 m, GTSR generally underestimates extremes, particularly in the tropics. The Dynamic Interactive Vulnerability Assessment model is applied to calculate the present-day flood exposure in terms of the land area and the population below the 1 in 100-year sea levels. Global exposed population is 28% lower when based on GTSR instead of DCESL. Considering the limited data available at the time, DCESL provides a good estimate of the spatial variation in extremes around the world. However, GTSR allows for an improved assessment of the impacts of coastal floods, including confidence bounds. We further improve the assessment of coastal impacts by correcting for the conflicting vertical datums of sea-level extremes and land elevation, which have not been accounted for in previous global assessments. Converting the extreme sea levels to the same vertical reference used for the elevation data is shown to be a critical step, resulting in a 39-59% higher estimate of population exposure.

  8. CIFAR10-DVS: An Event-Stream Dataset for Object Classification

    PubMed Central

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task that requires recording with neuromorphic cameras, and only a limited number of event-stream datasets are currently available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named “CIFAR10-DVS.” The conversion was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversions that move the camera, moving the image is more realistic with respect to practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark for event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification. PMID:28611582
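    A DVS pixel fires an event each time its log intensity drifts by a fixed contrast threshold. A minimal software sketch of this quantization (an idealized noiseless sensor model, not the CIFAR10-DVS recording pipeline; the threshold value is an assumption):

    ```python
    import math

    def dvs_events(frames, threshold=0.2):
        """Emit (t, x, y, polarity) events whenever a pixel's log intensity
        has changed by at least `threshold` since its last event."""
        h, w = len(frames[0]), len(frames[0][0])
        ref = [[math.log(frames[0][y][x] + 1e-6) for x in range(w)] for y in range(h)]
        events = []
        for t, frame in enumerate(frames[1:], start=1):
            for y in range(h):
                for x in range(w):
                    d = math.log(frame[y][x] + 1e-6) - ref[y][x]
                    while abs(d) >= threshold:        # one event per threshold crossing
                        pol = 1 if d > 0 else -1
                        events.append((t, x, y, pol))
                        ref[y][x] += pol * threshold
                        d -= pol * threshold
        return events

    # Doubling one pixel's intensity (log change ~0.69) yields three ON events:
    print(dvs_events([[[1.0, 1.0]], [[2.0, 1.0]]]))
    ```

    Moving an image under a real DVS produces exactly such per-pixel intensity trajectories, which is why the RCLS movement generates rich event streams.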

  9. CIFAR10-DVS: An Event-Stream Dataset for Object Classification.

    PubMed

    Li, Hongmin; Liu, Hanchao; Ji, Xiangyang; Li, Guoqi; Shi, Luping

    2017-01-01

    Neuromorphic vision research requires high-quality and appropriately challenging event-stream datasets to support continuous improvement of algorithms and methods. However, creating event-stream datasets is a time-consuming task that requires recording with neuromorphic cameras, and only a limited number of event-stream datasets are currently available. In this work, by utilizing the popular computer vision dataset CIFAR-10, we converted 10,000 frame-based images into 10,000 event streams using a dynamic vision sensor (DVS), providing an event-stream dataset of intermediate difficulty in 10 different classes, named "CIFAR10-DVS." The conversion was implemented by a repeated closed-loop smooth (RCLS) movement of the frame-based images. Unlike conversions that move the camera, moving the image is more realistic with respect to practical applications. The repeated closed-loop image movement generates rich local intensity changes in continuous time, which are quantized by each pixel of the DVS camera to generate events. Furthermore, a performance benchmark for event-driven object classification is provided based on state-of-the-art classification algorithms. This work provides a large event-stream dataset and an initial benchmark for comparison, which may boost algorithm development in event-driven pattern recognition and object classification.

  10. ETHNOPRED: a novel machine learning method for accurate continental and sub-continental ancestry identification and population stratification correction

    PubMed Central

    2013-01-01

    Background Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case-control genome-wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification, but each has limitations. We provide an alternative technique to address population stratification. Results We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees capable of accurately predicting an individual's continental and sub-continental ancestry. To predict an individual's continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross-validation accuracy of 100% on the HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values are missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control's λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively, involving 31, 502, 526, 242, and 271 SNPs, with 10-fold cross-validation accuracies of 86.5% ± 2.4%, 95.6% ± 3.9%, 95.6% ± 2.1%, 98.3% ± 2.0%, and 95.9% ± 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing from Chinese in Denver. Conclusions ETHNOPRED is a novel technique for producing classifiers that can identify an individual's continental and sub-continental heritage based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large-scale GWASs, and robust to missing values. PMID:23432980
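    The robustness to missing genotypes comes from the trees being disjoint: each uses its own SNPs, so a missing value silences only one voter. A hedged sketch of the idea (not the published ETHNOPRED code) with single-SNP decision stumps; the rs-identifiers, cutoffs and labels are invented:

    ```python
    def predict_ancestry(genotype, stumps):
        """genotype: dict SNP -> minor-allele count 0/1/2 (SNPs may be missing).
        stumps: list of (snp, cutoff, label_if_ge, label_if_lt). Majority vote."""
        votes = []
        for snp, cutoff, hi, lo in stumps:
            if snp in genotype:                      # missing SNPs simply abstain
                votes.append(hi if genotype[snp] >= cutoff else lo)
        return max(set(votes), key=votes.count)

    stumps = [("rs1", 1, "EUR", "EAS"),              # hypothetical classifiers
              ("rs2", 2, "EUR", "EAS"),
              ("rs3", 1, "EAS", "EUR")]
    print(predict_ancestry({"rs1": 2, "rs3": 0}, stumps))  # EUR, even with rs2 missing
    ```

    Real ETHNOPRED trees are deeper than one SNP, but the disjointness-plus-voting structure is what the abstract's missing-value claim rests on.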

  11. ddClone: joint statistical inference of clonal populations from single cell and bulk tumour sequencing data.

    PubMed

    Salehi, Sohrab; Steif, Adi; Roth, Andrew; Aparicio, Samuel; Bouchard-Côté, Alexandre; Shah, Sohrab P

    2017-03-01

    Next-generation sequencing (NGS) of bulk tumour tissue can identify constituent cell populations in cancers and measure their abundance. This requires computational deconvolution of allelic counts from somatic mutations, which may be incapable of fully resolving the underlying population structure. Single cell sequencing (SCS) is a more direct method, although its replacement of NGS is impeded by technical noise and sampling limitations. We propose ddClone, which analytically integrates NGS and SCS data, leveraging their complementary attributes through joint statistical inference. We show on real and simulated datasets that ddClone produces more accurate results than can be achieved by either method alone.

  12. Automatic building extraction from LiDAR data fusion of point and grid-based features

    NASA Astrophysics Data System (ADS)

    Du, Shouji; Zhang, Yunsheng; Zou, Zhengrong; Xu, Shenghua; He, Xue; Chen, Siyang

    2017-08-01

    This paper proposes a method for extracting buildings from LiDAR point cloud data by combining point-based and grid-based features. To accurately discriminate buildings from vegetation, a point feature based on the variance of normal vectors is proposed. For robust building extraction, a graph cuts algorithm is employed to combine the features and take neighborhood context information into account. As grid-feature computation and the graph cuts algorithm are performed on a grid structure, a feature-retained DSM interpolation method is also proposed. The proposed method is validated on the benchmark ISPRS Test Project on Urban Classification and 3D Building Reconstruction and compared to state-of-the-art methods. The evaluation shows that the proposed method obtains promising results at both the area level and the object level. The method is further applied to the entire ISPRS dataset and to a real dataset of Wuhan City. The results show a completeness of 94.9% and a correctness of 92.2% at the per-area level for the former dataset, and a completeness of 94.4% and a correctness of 95.8% for the latter. The proposed method shows good potential for large-scale LiDAR data.
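    The vegetation-versus-building cue exploits the fact that roof points have consistent local normals while foliage normals point every which way. A minimal sketch of such a feature (not the paper's exact formulation), assuming unit normals have already been estimated per point:

    ```python
    def normal_variance(normals):
        """normals: unit (nx, ny, nz) vectors in a point's neighborhood.
        Returns 1 - |mean normal|: ~0 for planar patches (roofs), larger
        for erratic neighborhoods (vegetation)."""
        n = len(normals)
        mx = sum(v[0] for v in normals) / n
        my = sum(v[1] for v in normals) / n
        mz = sum(v[2] for v in normals) / n
        return 1.0 - (mx * mx + my * my + mz * mz) ** 0.5

    roof = [(0.0, 0.0, 1.0)] * 8                     # perfectly consistent normals
    print(normal_variance(roof))                     # 0.0
    ```

    Such a per-point score becomes the unary term in a graph cuts energy, with the pairwise term encouraging neighboring points to share a label.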

  13. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay

    PubMed Central

    Bray, Mark-Anthony; Gustafsdottir, Sigrun M; Rohban, Mohammad H; Singh, Shantanu; Ljosa, Vebjorn; Sokolnicki, Katherine L; Bittker, Joshua A; Bodycombe, Nicole E; Dančík, Vlado; Hasaka, Thomas P; Hon, Cindy S; Kemp, Melissa M; Li, Kejie; Walpita, Deepika; Wawer, Mathias J; Golub, Todd R; Schreiber, Stuart L; Clemons, Paul A; Shamji, Alykhan F

    2017-01-01

    Background Large-scale image sets acquired by automated microscopy of perturbed samples enable a detailed comparison of cell states induced by each perturbation, such as a small molecule from a diverse library. Highly multiplexed measurements of cellular morphology can be extracted from each image and subsequently mined for a number of applications. Findings This microscopy dataset includes 919 265 five-channel fields of view, representing 30 616 tested compounds, available at “The Cell Image Library” (CIL) repository. It also includes data files containing morphological features derived from each cell in each image, both at the single-cell level and the population-averaged (i.e., per-well) level; the image analysis workflows that generated the morphological features are also provided. Quality-control metrics are provided as metadata, indicating fields of view that are out of focus or contain highly fluorescent material or debris. Lastly, chemical annotations are supplied for the compound treatments applied. Conclusions Because computational algorithms and methods for handling single-cell morphological measurements are not yet routine, the dataset serves as a useful resource for the wider scientific community applying morphological (image-based) profiling. The dataset can be mined for many purposes, including small-molecule library enrichment and chemical mechanism-of-action studies, such as target identification. Integration with genetically perturbed datasets could enable identification of small-molecule mimetics of particular disease- or gene-related phenotypes that could be useful as probes or potential starting points for development of future therapeutics. PMID:28327978
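    The relationship between the two feature levels in such datasets is simple aggregation: a per-well profile is a robust summary (commonly the median) of the single-cell rows for that well. A hedged sketch of that step (illustrative feature names and values, not the dataset's actual schema):

    ```python
    import statistics
    from collections import defaultdict

    def per_well_profiles(cells):
        """cells: list of (well, {feature: value}) single-cell measurements.
        Returns {well: {feature: median}}: population-averaged profiles."""
        by_well = defaultdict(lambda: defaultdict(list))
        for well, feats in cells:
            for name, val in feats.items():
                by_well[well][name].append(val)
        return {w: {f: statistics.median(vals) for f, vals in fdict.items()}
                for w, fdict in by_well.items()}

    cells = [("A01", {"area": 100}), ("A01", {"area": 120}), ("A01", {"area": 400})]
    print(per_well_profiles(cells))  # {'A01': {'area': 120}}
    ```

    The median is the usual choice here because single-cell morphology distributions are heavy-tailed (debris, mitotic cells), as the 400-pixel outlier above illustrates.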

  14. A Monocular Vision Sensor-Based Obstacle Detection Algorithm for Autonomous Robots.

    PubMed

    Lee, Tae-Jae; Yi, Dong-Hoon; Cho, Dong-Il Dan

    2016-03-01

    This paper presents a monocular vision sensor-based obstacle detection algorithm for autonomous robots. Each individual image pixel at the bottom region of interest is labeled as belonging either to an obstacle or the floor. While conventional methods depend on point tracking for geometric cues for obstacle detection, the proposed algorithm uses the inverse perspective mapping (IPM) method. This method is much more advantageous when the camera is not high off the floor, which makes point tracking near the floor difficult. Markov random field-based obstacle segmentation is then performed using the IPM results and a floor appearance model. Next, the shortest distance between the robot and the obstacle is calculated. The algorithm is tested by applying it to 70 datasets, 20 of which include nonobstacle images where considerable changes in floor appearance occur. The obstacle segmentation accuracies and the distance estimation error are quantitatively analyzed. For obstacle datasets, the segmentation precision and the average distance estimation error of the proposed method are 81.4% and 1.6 cm, respectively, whereas those for a conventional method are 57.5% and 9.9 cm, respectively. For nonobstacle datasets, the proposed method gives 0.0% false positive rates, while the conventional method gives 17.6%.
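    Inverse perspective mapping works because, under a flat-floor assumption, every image row below the horizon corresponds to a unique forward distance on the floor. A simplified flat-floor IPM sketch (not the paper's implementation; the camera height, focal length and pitch below are invented, and the lateral term uses a small-angle simplification):

    ```python
    import math

    def ipm_floor_point(u, v, h, f, pitch, cu, cv):
        """Map image pixel (u, v) to floor coordinates (x_forward, y_lateral),
        assuming a pinhole camera at height h (m), focal length f (px),
        tilted down by `pitch` (rad), with principal point (cu, cv)."""
        ang = pitch + math.atan2(v - cv, f)        # ray angle below the horizon
        if ang <= 0:
            raise ValueError("pixel above the horizon; no floor intersection")
        x = h / math.tan(ang)                      # forward distance to the floor
        y = x * (u - cu) / f                       # lateral offset (small-angle)
        return x, y

    # Hypothetical robot camera: 0.3 m high, f = 500 px, pitched down 0.3 rad.
    x, y = ipm_floor_point(320, 240, 0.3, 500, 0.3, 320, 240)
    print(round(x, 2), round(y, 2))  # 0.97 0.0
    ```

    Pixels that violate the flat-floor assumption (i.e., obstacles) map inconsistently under this warp, which is the geometric cue the segmentation exploits.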

  15. Characterization and prediction of residues determining protein functional specificity.

    PubMed

    Capra, John A; Singh, Mona

    2008-07-01

    Within a homologous protein family, proteins may be grouped into subtypes that share specific functions that are not common to the entire family. Often, the amino acids present in a small number of sequence positions determine each protein's particular functional specificity. Knowledge of these specificity determining positions (SDPs) aids in protein function prediction, drug design and experimental analysis. A number of sequence-based computational methods have been introduced for identifying SDPs; however, their further development and evaluation have been hindered by the limited number of known experimentally determined SDPs. We combine several bioinformatics resources to automate a process, typically undertaken manually, to build a dataset of SDPs. The resulting large dataset, which consists of SDPs in enzymes, enables us to characterize SDPs in terms of their physicochemical and evolutionary properties. It also facilitates the large-scale evaluation of sequence-based SDP prediction methods. We present a simple sequence-based SDP prediction method, GroupSim, and show that, surprisingly, it is competitive with a representative set of current methods. We also describe ConsWin, a heuristic that considers sequence conservation of neighboring amino acids, and demonstrate that it improves the performance of all methods tested on our large dataset of enzyme SDPs. Datasets and GroupSim code are available online at http://compbio.cs.princeton.edu/specificity/. Supplementary data are available at Bioinformatics online.
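    The intuition behind SDP scoring is that a specificity-determining column is conserved within each functional subtype but differs between subtypes. A hedged toy scorer in that spirit (not the published GroupSim implementation, which uses amino-acid similarity rather than strict identity):

    ```python
    def sdp_score(column, groups):
        """column: residues at one alignment position; groups: subtype label
        per sequence. Score = within-group identity minus between-group identity;
        near 1 for a perfect SDP, near 0 for uniformly conserved columns."""
        def identity(pairs):
            pairs = list(pairs)
            return sum(a == b for a, b in pairs) / len(pairs)
        within, between = [], []
        n = len(column)
        for i in range(n):
            for j in range(i + 1, n):
                (within if groups[i] == groups[j] else between).append(
                    (column[i], column[j]))
        return identity(within) - identity(between)

    # A perfect SDP: conserved within each subtype, different between them.
    print(sdp_score(list("DDDKKK"), [0, 0, 0, 1, 1, 1]))  # 1.0
    ```

    A fully conserved column (e.g., a catalytic residue shared by the whole family) scores 0 under this scheme, which is why it separates specificity from mere conservation.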

  16. Prediction of Body Fluids where Proteins are Secreted into Based on Protein Interaction Network

    PubMed Central

    Hu, Le-Le; Huang, Tao; Cai, Yu-Dong; Chou, Kuo-Chen

    2011-01-01

    Determining which body fluids proteins can be secreted into is important for protein function annotation and disease biomarker discovery. In this study, we developed a network-based method to predict which kinds of body fluids human proteins can be secreted into. For a newly constructed benchmark dataset of 529 human secreted proteins, the prediction accuracy for the most likely body fluid location, evaluated by the jackknife test, was 79.02%, significantly higher than the success rate of a random guess (29.36%). The likelihood that the top four predicted body fluids contain all the true body fluids a protein can be secreted into is 62.94%. Our method was further demonstrated on two independent datasets: one contains 57 proteins that can be secreted into blood, while the other contains 61 proteins that can be secreted into plasma/serum and are possible biomarkers associated with various cancers. For the 57 proteins in the first dataset, 55 were correctly predicted as blood-secreted proteins. For the 61 proteins in the second dataset, 58 were predicted as most likely to be in plasma/serum. These encouraging results indicate that the network-based prediction method is quite promising. It is anticipated that the method will benefit the relevant areas of both basic research and drug development. PMID:21829572

  17. CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets

    PubMed Central

    Nowicka, Malgorzata; Krieg, Carsten; Weber, Lukas M.; Hartmann, Felix J.; Guglietta, Silvia; Becher, Burkhard; Levesque, Mitchell P.; Robinson, Mark D.

    2017-01-01

    High-dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high-throughput interrogation and characterization of cell populations. Here, we present an R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks in which the HDCyto data are the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell counts or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g., multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g., plots of aggregated signals). PMID:28663787
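    The response variable in such differential-abundance models is simply the per-sample table of cluster counts or proportions built after clustering. A minimal sketch of that tabulation step (in Python rather than the workflow's R, with invented sample and cluster labels):

    ```python
    from collections import Counter

    def abundance_table(cell_assignments):
        """cell_assignments: list of (sample_id, cluster_id), one per cell.
        Returns {sample: {cluster: proportion}}: the response variable for
        downstream (e.g. mixed-model) differential abundance tests."""
        counts = {}
        for sample, cluster in cell_assignments:
            counts.setdefault(sample, Counter())[cluster] += 1
        return {s: {c: n / sum(cnt.values()) for c, n in cnt.items()}
                for s, cnt in counts.items()}

    cells = [("s1", "B"), ("s1", "B"), ("s1", "T"), ("s2", "T")]
    print(abundance_table(cells))  # per-sample cluster proportions
    ```

    Modeling the underlying counts with a mixed model (rather than testing proportions directly) is what lets the workflow absorb sample-to-sample overdispersion.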

  18. Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation

    PubMed Central

    Cornuet, Jean-Marie; Santos, Filipe; Beaumont, Mark A.; Robert, Christian P.; Marin, Jean-Michel; Balding, David J.; Guillemaud, Thomas; Estoup, Arnaud

    2008-01-01

    Summary: Genetic data obtained on population samples convey information about their evolutionary history. Inference methods can extract part of this information but they require sophisticated statistical techniques that have been made available to the biologist community (through computer programs) only for simple and standard situations typically involving a small number of samples. We propose here a computer program (DIY ABC) for inference based on approximate Bayesian computation (ABC), in which scenarios can be customized by the user to fit many complex situations involving any number of populations and samples. Such scenarios involve any combination of population divergences, admixtures and population size changes. DIY ABC can be used to compare competing scenarios, estimate parameters for one or more scenarios and compute bias and precision measures for a given scenario and known values of parameters (the current version applies to unlinked microsatellite data). This article describes the key methods used in the program and outlines its main features. The analysis of one simulated and one real dataset, both with complex evolutionary scenarios, illustrates the main possibilities of DIY ABC. Availability: The software DIY ABC is freely available at http://www.montpellier.inra.fr/CBGP/diyabc. Contact: j.cornuet@imperial.ac.uk Supplementary information: Supplementary data are also available at http://www.montpellier.inra.fr/CBGP/diyabc PMID:18842597
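    At its core, ABC replaces an intractable likelihood with simulation: draw parameters from the prior, simulate data, and keep draws whose summary statistics resemble the observed ones. A minimal rejection-sampling sketch (a toy one-parameter scenario, not a DIY ABC population model):

    ```python
    import random, statistics

    def abc_rejection(observed_stat, simulate, prior, n=20000, tol=0.05):
        """Rejection-sampling ABC: draw parameters from the prior, simulate a
        summary statistic, and keep draws landing within `tol` of the data.
        The kept draws approximate the posterior."""
        return [theta for theta in (prior() for _ in range(n))
                if abs(simulate(theta) - observed_stat) < tol]

    random.seed(1)
    # Toy scenario: infer a mean from a single noisy summary statistic.
    post = abc_rejection(0.7,
                         simulate=lambda th: th + random.gauss(0, 0.1),
                         prior=lambda: random.uniform(0, 1))
    print(round(statistics.mean(post), 1))
    ```

    DIY ABC's contribution is letting users compose the `simulate` step from divergences, admixtures and size changes, and compare scenarios, without writing such code themselves.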

  19. [Spatial domain display for interference image dataset].

    PubMed

    Wang, Cai-Ling; Li, Yu-Shan; Liu, Xue-Bin; Hu, Bing-Liang; Jing, Juan-Juan; Wen, Jia

    2011-11-01

    Users performing image interpretation and information extraction urgently need visualization tools for imaging interferometers. However, existing visualization research has focused only on spectral image datasets in the spectral domain, so quick display of interference spectral image datasets remains a bottleneck in interference image processing. The conventional approach first applies a Fourier transformation and then uses classical spectral-image display methods, but the transformation is computationally expensive, and the problem worsens as dataset size increases. In the present paper, the problem of quickly viewing interferometer imagery in the image domain is addressed and an algorithm that simplifies the matter is proposed. The proposed algorithm, named interference weighted envelopes, dispenses with the transformation entirely. The authors derive three interference weighted envelopes based on the Fourier transformation, the features of interference data, and the human visual system, respectively. Comparison of the proposed method with conventional methods shows a large reduction in display time.

  20. A semiparametric graphical modelling approach for large-scale equity selection

    PubMed Central

    Liu, Han; Mulvey, John; Zhao, Tianqi

    2016-01-01

    We propose a new stock selection strategy that exploits rebalancing returns and improves portfolio performance. To effectively harvest rebalancing gains, we apply ideas from elliptical-copula graphical modelling and stability inference to select stocks that are as independent as possible. The proposed elliptical-copula graphical model has a latent Gaussian representation; its structure can be effectively inferred using regularized rank-based estimators. The resulting algorithm is computationally efficient and scales to large datasets. To show the efficacy of the proposed method, we apply it to conduct equity selection based on a 16-year health care stock dataset and a large 34-year stock dataset. Empirical tests show that the proposed method is superior to alternative strategies including a principal component analysis-based approach and the classical Markowitz strategy based on the traditional buy-and-hold assumption. PMID:28316507
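    The "rank-based estimator" at the heart of elliptical-copula models rests on a classical identity: under an elliptical copula, the latent Pearson correlation equals sin(π/2 · τ), where τ is Kendall's rank correlation. A minimal sketch (toy return series, ties ignored):

    ```python
    import math

    def kendall_tau(x, y):
        """Kendall rank correlation (assumes no ties)."""
        n = len(x)
        s = sum(1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
                for i in range(n) for j in range(i + 1, n))
        return s / (n * (n - 1) / 2)

    def latent_correlation(x, y):
        """Rank-based latent-correlation estimator: sin(pi/2 * tau).
        Invariant to monotone marginal transforms, robust to heavy tails."""
        return math.sin(math.pi / 2 * kendall_tau(x, y))

    returns_a = [0.01, 0.03, -0.02, 0.04, 0.00]
    returns_b = [0.02, 0.05, -0.01, 0.07, 0.01]
    print(latent_correlation(returns_a, returns_b))  # 1.0: perfectly concordant
    ```

    Plugging such pairwise estimates into a regularized precision-matrix estimator yields the graphical structure from which near-independent stocks are selected.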

  1. HormoneBase, a population-level database of steroid hormone levels across vertebrates

    PubMed Central

    Vitousek, Maren N.; Johnson, Michele A.; Donald, Jeremy W.; Francis, Clinton D.; Fuxjager, Matthew J.; Goymann, Wolfgang; Hau, Michaela; Husak, Jerry F.; Kircher, Bonnie K.; Knapp, Rosemary; Martin, Lynn B.; Miller, Eliot T.; Schoenle, Laura A.; Uehling, Jennifer J.; Williams, Tony D.

    2018-01-01

    Hormones are central regulators of organismal function and flexibility that mediate a diversity of phenotypic traits from early development through senescence. Yet despite these important roles, basic questions about how and why hormone systems vary within and across species remain unanswered. Here we describe HormoneBase, a database of circulating steroid hormone levels and their variation across vertebrates. This database aims to provide all available data on the mean, variation, and range of plasma glucocorticoids (both baseline and stress-induced) and androgens in free-living and un-manipulated adult vertebrates. HormoneBase (www.HormoneBase.org) currently includes >6,580 entries from 476 species, reported in 648 publications from 1967 to 2015, and unpublished datasets. Entries are associated with data on the species and population, sex, year and month of study, geographic coordinates, life history stage, method and latency of hormone sampling, and analysis technique. This novel resource could be used for analyses of the function and evolution of hormone systems, and the relationships between hormonal variation and a variety of processes including phenotypic variation, fitness, and species distributions. PMID:29786693

  2. The diffusion tensor imaging (DTI) component of the NIH MRI study of normal brain development (PedsDTI).

    PubMed

    Walker, Lindsay; Chang, Lin-Ching; Nayak, Amritha; Irfanoglu, M Okan; Botteron, Kelly N; McCracken, James; McKinstry, Robert C; Rivkin, Michael J; Wang, Dah-Jyuu; Rumsey, Judith; Pierpaoli, Carlo

    2016-01-01

    The NIH MRI Study of normal brain development sought to characterize typical brain development in a population of infants, toddlers, children and adolescents/young adults, covering the socio-economic and ethnic diversity of the population of the United States. The study began in 1999 with data collection commencing in 2001 and concluding in 2007. The study was designed with the final goal of providing a controlled-access database; open to qualified researchers and clinicians, which could serve as a powerful tool for elucidating typical brain development and identifying deviations associated with brain-based disorders and diseases, and as a resource for developing computational methods and image processing tools. This paper focuses on the DTI component of the NIH MRI study of normal brain development. In this work, we describe the DTI data acquisition protocols, data processing steps, quality assessment procedures, and data included in the database, along with database access requirements. For more details, visit http://www.pediatricmri.nih.gov. This longitudinal DTI dataset includes raw and processed diffusion data from 498 low resolution (3 mm) DTI datasets from 274 unique subjects, and 193 high resolution (2.5 mm) DTI datasets from 152 unique subjects. Subjects range in age from 10 days (from date of birth) through 22 years. Additionally, a set of age-specific DTI templates are included. This forms one component of the larger NIH MRI study of normal brain development which also includes T1-, T2-, proton density-weighted, and proton magnetic resonance spectroscopy (MRS) imaging data, and demographic, clinical and behavioral data.

  3. Fuzzy neural network technique for system state forecasting.

    PubMed

    Li, Dezhi; Wang, Wilson; Ismail, Fathy

    2013-10-01

    In many system state forecasting applications, the prediction is performed based on multiple datasets, each corresponding to a distinct system condition. The traditional methods dealing with multiple datasets (e.g., vector autoregressive moving average models and neural networks) have some shortcomings, such as limited modeling capability and opaque reasoning operations. To tackle these problems, a novel fuzzy neural network (FNN) is proposed in this paper to effectively extract information from multiple datasets, so as to improve forecasting accuracy. The proposed predictor consists of both autoregressive (AR) node modeling and nonlinear node modeling; AR models/nodes are used to capture the linear correlation of the datasets, and the nonlinear correlation of the datasets is modeled with nonlinear neuron nodes. A novel particle swarm technique [i.e., the Laplace particle swarm (LPS) method] is proposed to facilitate parameter estimation for the predictor and improve modeling accuracy. The effectiveness of the developed FNN predictor and the associated LPS method is verified by a series of tests related to Mackey-Glass data forecast, exchange rate data prediction, and gear system prognosis. Test results show that the developed FNN predictor and the LPS method can capture the dynamics of multiple datasets effectively and track system characteristics accurately.
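    The AR nodes in such a hybrid predictor are just linear least-squares fits of each series on its own lags. A hedged sketch of one AR(2) node (illustrative only; the paper's network combines these with nonlinear neuron nodes and LPS-based training):

    ```python
    def fit_ar2(series):
        """Least-squares fit of x_t = a1*x_{t-1} + a2*x_{t-2} (one "AR node"),
        solved via the 2x2 normal equations."""
        s11 = s22 = s12 = b1 = b2 = 0.0
        for t in range(2, len(series)):
            x1, x2, y = series[t - 1], series[t - 2], series[t]
            s11 += x1 * x1; s22 += x2 * x2; s12 += x1 * x2
            b1 += x1 * y; b2 += x2 * y
        det = s11 * s22 - s12 * s12
        return (s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det

    # A noise-free series generated with a1=0.5, a2=0.3 is recovered exactly:
    series = [1.0, 2.0]
    for _ in range(10):
        series.append(0.5 * series[-1] + 0.3 * series[-2])
    a1, a2 = fit_ar2(series)
    print(round(a1, 3), round(a2, 3))  # 0.5 0.3
    ```

    Whatever structure the AR nodes cannot explain (the residual, nonlinear correlation) is what the network's nonlinear nodes are left to model.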

  4. OpenCL based machine learning labeling of biomedical datasets

    NASA Astrophysics Data System (ADS)

    Amoros, Oscar; Escalera, Sergio; Puig, Anna

    2011-03-01

    In this paper, we propose a two-stage labeling method for large biomedical datasets through a parallel approach on a single GPU. Diagnostic methods, structure volume measurements, and visualization systems are of major importance for surgery planning, intra-operative imaging and image-guided surgery. In all cases, providing an automatic and interactive method to label or tag the different structures contained in the input data becomes imperative. Several approaches to label or segment biomedical datasets have been proposed to discriminate different anatomical structures in an output tagged dataset. Among existing methods, supervised learning methods for segmentation have been devised to let a non-expert user easily analyze biomedical datasets. However, they still have some problems concerning practical application, such as slow learning and testing speeds. In addition, recent technological developments have led to widespread availability of multi-core CPUs and GPUs, as well as new software languages, such as NVIDIA's CUDA and OpenCL, allowing parallel programming paradigms to be applied on conventional personal computers. The Adaboost classifier is one of the most widely applied methods for labeling in the machine learning community. In a first stage, Adaboost trains a binary classifier from a set of pre-labeled samples described by a set of features. This binary classifier is defined as a weighted combination of weak classifiers, each of which is a simple decision function estimated on a single feature value. Then, at the testing stage, each weak classifier is independently applied to the features of a set of unlabeled samples. In this work, we propose an alternative representation of the Adaboost binary classifier, and we use this representation to define a new GPU-based parallelized Adaboost testing stage using OpenCL. We provide numerical experiments based on large available datasets and compare our results to CPU-based strategies in terms of time and labeling speed.
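    A hedged sketch of the Adaboost testing stage described above (a generic formulation, not the paper's OpenCL representation): each weak classifier thresholds one feature, and the strong classifier is the sign of the weighted vote. Every (sample, weak classifier) pair is independent, which is what makes this stage GPU-friendly.

    ```python
    def adaboost_predict(sample, weak_classifiers):
        """sample: feature vector. weak_classifiers: list of
        (feature_index, threshold, alpha) learned at the training stage.
        Returns the sign of the weighted weak-classifier vote."""
        score = sum(alpha if sample[i] >= thr else -alpha
                    for i, thr, alpha in weak_classifiers)
        return 1 if score >= 0 else -1

    # Hypothetical trained ensemble over two features:
    clf = [(0, 0.5, 1.2), (1, 0.3, 0.7), (0, 0.9, 0.4)]
    print(adaboost_predict([0.6, 0.1], clf))  # 1
    ```

    On a GPU, the inner sum is evaluated for millions of voxels at once; each work-item computes one sample's (or one sample-classifier pair's) contribution with no data dependencies.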

  5. Status and interconnections of selected environmental issues in the global coastal zones

    USGS Publications Warehouse

    Shi, Hua; Singh, Ashbindu

    2003-01-01

    This study focuses on assessing the state of population distribution, land cover distribution, biodiversity hotspots, and protected areas in global coastal zones. The coastal zone is defined as land within 100 km of the coastline. This study attempts to answer such questions as: how crowded are the coastal zones, what is the pattern of land cover distribution in these areas, how much of this area is designated as protected, what is the state of the biodiversity hotspots, and what are the interconnections between people and the coastal environment. This study uses globally consistent and comprehensive geospatial datasets based on remote sensing and other sources. The application of Geographic Information System (GIS) layering methods and consistent datasets has made it possible to identify and quantify selected coastal zone environmental issues and their interconnections. It is expected that such information provides a scientific basis for global coastal zone management and assists in policy formulation at the national and international levels.

  6. Size matters: How population size influences genotype–phenotype association studies in anonymized data

    PubMed Central

    Denny, Joshua C.; Haines, Jonathan L.; Roden, Dan M.; Malin, Bradley A.

    2014-01-01

    Objective Electronic medical record (EMR) data are increasingly incorporated into genome-phenome association studies. Investigators hope to share data, but there are concerns it may be “re-identified” through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. We systematically investigate this issue using a large-scale biorepository and EMR system through which we evaluate the ability of researchers to learn from anonymized data for genome-phenome association studies under various conditions. Methods We use a k-anonymization strategy to simulate a data protection process (on datasets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome-phenome association study and compare the discoveries using the protected data and the original data through the correlation (r2) of the p-values of association significance. Results Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than small study-specific datasets anonymized on their own. When evaluated using the correlation of genome-phenome association strengths on anonymized versus original data, at all nine simulated sites the largest-scale anonymizations (population ∼100,000) retained better utility than those at smaller sizes (population ∼6,000-75,000). We observed a general trend of increasing r2 for larger dataset sizes: r2 = 0.9481 for small-sized datasets, r2 = 0.9493 for moderately-sized datasets, and r2 = 0.9934 for large-sized datasets.
Conclusions This research implies that regardless of the overall size of an institution's data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients. PMID:25038554
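The protection step above relies on k-anonymity over combinations of clinical codes. A minimal sketch of that guarantee, using record suppression rather than the code generalization a real k-anonymizer would perform, and with hypothetical ICD-style codes:

```python
from collections import Counter

def suppress_to_k_anonymity(records, k):
    """Keep only records whose exact code combination occurs at least k
    times, so every released combination is shared by >= k patients.
    (Real k-anonymization generalizes codes instead of dropping records;
    suppression is the simplest possible stand-in.)"""
    counts = Counter(frozenset(r) for r in records)
    return [r for r in records if counts[frozenset(r)] >= k]

# Hypothetical clinical-code sets for five patients
records = [{"250.0", "401.9"}, {"250.0", "401.9"},
           {"250.0", "401.9"}, {"714.0"}, {"714.0"}]
released = suppress_to_k_anonymity(records, k=3)
# only the combination shared by >= 3 patients survives
```

Under this sketch, a combination held by fewer than k patients never appears in the release, which is the property the study's anonymization strategy enforces at scale.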

  7. RchyOptimyx: Cellular Hierarchy Optimization for Flow Cytometry

    PubMed Central

    Aghaeepour, Nima; Jalali, Adrin; O’Neill, Kieran; Chattopadhyay, Pratip K.; Roederer, Mario; Hoos, Holger H.; Brinkman, Ryan R.

    2013-01-01

    Analysis of high-dimensional flow cytometry datasets can reveal novel cell populations with poorly understood biology. Following discovery, characterization of these populations in terms of the critical markers involved is an important step, as this can help to both better understand the biology of these populations and aid in designing simpler marker panels to identify them on simpler instruments and with fewer reagents (i.e., in resource poor or highly regulated clinical settings). However, current tools to design panels based on the biological characteristics of the target cell populations work exclusively based on technical parameters (e.g., instrument configurations, spectral overlap, and reagent availability). To address this shortcoming, we developed RchyOptimyx (cellular hieraRCHY OPTIMization), a computational tool that constructs cellular hierarchies by combining automated gating with dynamic programming and graph theory to provide the best gating strategies to identify a target population to a desired level of purity or correlation with a clinical outcome, using the simplest possible marker panels. RchyOptimyx can assess and graphically present the trade-offs between marker choice and population specificity in high-dimensional flow or mass cytometry datasets. We present three proof-of-concept use cases for RchyOptimyx that involve 1) designing a panel of surface markers for identification of rare populations that are primarily characterized using their intracellular signature; 2) simplifying the gating strategy for identification of a target cell population; 3) identification of a non-redundant marker set to identify a target cell population. PMID:23044634

  8. A comparison of public datasets for acceleration-based fall detection.

    PubMed

    Igual, Raul; Medrano, Carlos; Plaza, Inmaculada

    2015-09-01

Falls are one of the leading causes of mortality among the older population, and rapid detection of a fall is a key factor in mitigating its main adverse health consequences. In this context, several authors have conducted studies on acceleration-based fall detection using external accelerometers or smartphones. The published detection rates are diverse, sometimes close to a perfect detector. This divergence may be explained by the difficulty of comparing different fall detection studies on a fair basis, since each study uses its own dataset obtained under different conditions. In this regard, several datasets have recently been made publicly available. This paper presents a comparison, to the best of our knowledge for the first time, of these public fall detection datasets in order to determine whether they have an influence on the declared performances. Using two different detection algorithms, the study shows that the performances of the fall detection techniques are affected, to a greater or lesser extent, by the specific datasets used to validate them. We also found large differences in the generalization capability of a fall detector depending on the dataset used for training. In fact, performance decreases dramatically when the algorithms are tested on a dataset different from the one used for training. Other characteristics of the datasets, such as the number of training samples, also influence performance, while the algorithms seem less sensitive to the sampling frequency or the acceleration range. Copyright © 2015 IPEM. Published by Elsevier Ltd. All rights reserved.
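The simplest acceleration-based detectors of the kind compared here flag a fall when the acceleration magnitude spikes on impact. A minimal sketch, with a hypothetical 2.5 g threshold not taken from any of the compared datasets or algorithms:

```python
import math

def detect_fall(samples, threshold_g=2.5):
    """Flag a fall when the acceleration magnitude exceeds a threshold.
    `samples` is a list of (ax, ay, az) tuples in units of g; the 2.5 g
    threshold is illustrative only."""
    return any(math.sqrt(ax*ax + ay*ay + az*az) > threshold_g
               for ax, ay, az in samples)

walking = [(0.1, 0.2, 1.0), (0.0, 0.1, 1.1)]   # near 1 g: normal activity
impact  = [(0.1, 0.2, 1.0), (2.5, 1.8, 2.0)]   # spike well above 2.5 g
```

The paper's point is precisely that such a detector, tuned on one dataset, can generalize poorly to another recorded under different conditions.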

  9. New Data Pre-processing on Assessing of Obstructive Sleep Apnea Syndrome: Line Based Normalization Method (LBNM)

    NASA Astrophysics Data System (ADS)

    Akdemir, Bayram; Güneş, Salih; Yosunkaya, Şebnem

Sleep disorders are very common among the public yet often go unrecognized. Obstructive Sleep Apnea Syndrome (OSAS) is characterized by a decreased oxygen saturation level and repetitive upper respiratory tract obstruction episodes during a full night's sleep. In the present study, we propose a novel data normalization method called the Line Based Normalization Method (LBNM) to evaluate OSAS using a real dataset obtained from a polysomnography device, used as a diagnostic tool in patients clinically suspected of suffering from OSAS. Here, we combine LBNM with classification methods, comprising the C4.5 decision tree classifier and an Artificial Neural Network (ANN), to diagnose OSAS. Firstly, each clinical feature in the OSAS dataset is scaled by LBNM to the range [0,1]. Secondly, the normalized OSAS dataset is classified using the different classifier algorithms, including the C4.5 decision tree classifier and ANN, respectively. The proposed normalization method was compared with the min-max normalization, z-score normalization, and decimal scaling methods existing in the literature on the diagnosis of OSAS. LBNM produced very promising results in the assessment of OSAS. This method could also be applied to other biomedical datasets.
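The abstract does not specify the LBNM transformation itself, but it names min-max normalization as the main baseline for scaling each feature to [0,1]. A sketch of that baseline, with a hypothetical oxygen-saturation feature:

```python
def min_max_scale(values):
    """Scale one feature to [0, 1]; this is the min-max baseline the study
    compares against (the LBNM transformation is not specified here)."""
    lo, hi = min(values), max(values)
    if hi == lo:                     # constant feature: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

spo2 = [88, 92, 96, 90]             # hypothetical SpO2 readings (percent)
scaled = min_max_scale(spo2)        # each value now lies in [0, 1]
```

Any normalization of this family is applied per feature before the scaled dataset is passed to the C4.5 or ANN classifier.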

  10. Size-based trends and management implications of microhabitat utilization by Brown Treesnakes, with an emphasis on juvenile snakes

    USGS Publications Warehouse

    Rodda, Gordon H.; Reed, Robert N.

    2007-01-01

The brown treesnake (Boiga irregularis, or BTS), a costly invasive species, has been the subject of intensive research on Guam over the past two decades. The behavior and habitat use of hatchling and juvenile snakes, however, remain largely unknown. We used a long-term dataset of BTS captures (N = 2,415) and a dataset resulting from intensive sampling within and immediately around a 5-ha fenced population (N = 2,541) to examine habitat use of BTS. Small snakes were almost exclusively arboreal and appeared to prefer tangantangan (Leucaena leucocephala) habitats. In contrast, large snakes used arboreal and terrestrial habitats in roughly equal proportion, and were less frequently found in tangantangan. Among snakes found in trees, there were no clear size-based preferences for certain heights above ground, nor for size-based choice of perch diameters. We discuss these results as they relate to management and interdiction implications for brown treesnakes on Guam and in potential incipient populations on other islands.

  11. A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation

    PubMed Central

    2012-01-01

    Background Efficient, robust, and accurate genotype imputation algorithms make large-scale application of genomic selection cost effective. An algorithm that imputes alleles or allele probabilities for all animals in the pedigree and for all genotyped single nucleotide polymorphisms (SNP) provides a framework to combine all pedigree, genomic, and phenotypic information into a single-stage genomic evaluation. Methods An algorithm was developed for imputation of genotypes in pedigreed populations that allows imputation for completely ungenotyped animals and for low-density genotyped animals, accommodates a wide variety of pedigree structures for genotyped animals, imputes unmapped SNP, and works for large datasets. The method involves simple phasing rules, long-range phasing and haplotype library imputation and segregation analysis. Results Imputation accuracy was high and computational cost was feasible for datasets with pedigrees of up to 25 000 animals. The resulting single-stage genomic evaluation increased the accuracy of estimated genomic breeding values compared to a scenario in which phenotypes on relatives that were not genotyped were ignored. Conclusions The developed imputation algorithm and software and the resulting single-stage genomic evaluation method provide powerful new ways to exploit imputation and to obtain more accurate genetic evaluations. PMID:22462519

  12. Differentially Private Histogram Publication For Dynamic Datasets: An Adaptive Sampling Approach

    PubMed Central

    Li, Haoran; Jiang, Xiaoqian; Xiong, Li; Liu, Jinfei

    2016-01-01

Differential privacy has recently become a de facto standard for private statistical data release. Many algorithms have been proposed to generate differentially private histograms or synthetic data. However, most of them focus on “one-time” release of a static dataset and do not adequately address the increasing need for releasing series of dynamic datasets in real time. A straightforward application of existing histogram methods on each snapshot of such dynamic datasets will incur high accumulated error due to the composability of differential privacy and correlations or overlapping users between the snapshots. In this paper, we address the problem of releasing series of dynamic datasets in real time with differential privacy, using a novel adaptive distance-based sampling approach. Our first method, DSFT, uses a fixed distance threshold and releases a differentially private histogram only when the current snapshot is sufficiently different from the previous one, i.e., with a distance greater than a predefined threshold. Our second method, DSAT, further improves DSFT and uses a dynamic threshold adaptively adjusted by a feedback control mechanism to capture the data dynamics. Extensive experiments on real and synthetic datasets demonstrate that our approach achieves better utility than baseline methods and existing state-of-the-art methods. PMID:26973795
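The fixed-threshold idea behind DSFT can be sketched as follows. This is a simplified stand-in, not the paper's algorithm: the privacy budget spent on the distance test itself, and its split with the release step, are omitted, and the L1 distance and Laplace scale are assumptions:

```python
import math
import random

def laplace(scale):
    # Inverse-transform sampling of Laplace(0, scale) noise
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dsft_release(snapshots, threshold, epsilon):
    """Fixed-threshold release sketch: publish a fresh noisy histogram only
    when the current snapshot's L1 distance from the last *released*
    snapshot exceeds `threshold`; otherwise re-use the previous release,
    spending no additional budget on it."""
    released, last = [], None
    for hist in snapshots:
        if last is None or sum(abs(a - b) for a, b in zip(hist, last)) > threshold:
            noisy = [c + laplace(1.0 / epsilon) for c in hist]
            released.append(noisy)
            last = hist
        else:
            released.append(released[-1])   # re-use the previous release
    return released

series = [[10, 5], [10, 5], [20, 1]]        # hypothetical histogram snapshots
out = dsft_release(series, threshold=3, epsilon=1.0)
```

Skipping near-identical snapshots is what keeps the accumulated error low compared with naively releasing every snapshot.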

  13. Ensemble stacking mitigates biases in inference of synaptic connectivity.

    PubMed

    Chambers, Brendan; Levy, Maayan; Dechery, Joseph B; MacLean, Jason N

    2018-01-01

    A promising alternative to directly measuring the anatomical connections in a neuronal population is inferring the connections from the activity. We employ simulated spiking neuronal networks to compare and contrast commonly used inference methods that identify likely excitatory synaptic connections using statistical regularities in spike timing. We find that simple adjustments to standard algorithms improve inference accuracy: A signing procedure improves the power of unsigned mutual-information-based approaches and a correction that accounts for differences in mean and variance of background timing relationships, such as those expected to be induced by heterogeneous firing rates, increases the sensitivity of frequency-based methods. We also find that different inference methods reveal distinct subsets of the synaptic network and each method exhibits different biases in the accurate detection of reciprocity and local clustering. To correct for errors and biases specific to single inference algorithms, we combine methods into an ensemble. Ensemble predictions, generated as a linear combination of multiple inference algorithms, are more sensitive than the best individual measures alone, and are more faithful to ground-truth statistics of connectivity, mitigating biases specific to single inference methods. These weightings generalize across simulated datasets, emphasizing the potential for the broad utility of ensemble-based approaches.
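The ensemble step described above is a linear combination of per-method scores. A minimal sketch for a single candidate edge, with hypothetical scores and weights (in the study the weights are fit against ground-truth connectivity):

```python
def ensemble_score(scores, weights):
    """Combine per-method connectivity scores for one candidate edge as a
    weighted sum; the weights would normally be learned on simulated
    networks with known ground truth."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical scores for one candidate synapse from three inference methods
per_method = [0.8, 0.3, 0.6]   # e.g. signed MI, frequency-based, correlation
weights    = [0.5, 0.2, 0.3]   # hypothetical learned weights
combined   = ensemble_score(per_method, weights)
is_edge    = combined > 0.5    # decision threshold chosen for illustration
```

Because each method detects a different subset of the network, the weighted combination can be more sensitive than any single score thresholded alone.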

  14. Groupwise shape analysis of the hippocampus using spectral matching

    NASA Astrophysics Data System (ADS)

    Shakeri, Mahsa; Lombaert, Hervé; Lippé, Sarah; Kadoury, Samuel

    2014-03-01

The hippocampus is a prominent subcortical feature of interest in many neuroscience studies. Its subtle morphological changes often presage illnesses, including Alzheimer's disease, schizophrenia, and epilepsy. Precisely locating structural differences requires a reliable correspondence between shapes across a population. In this paper, we propose an automated method for groupwise hippocampal shape analysis based on a spectral decomposition of a group of shapes to solve the correspondence problem between sets of meshes. The framework generates diffeomorphic correspondence maps across a population, which enables us to create a mean shape. Morphological changes are then located between two groups of subjects. The performance of the proposed method was evaluated on a dataset of 42 hippocampus shapes and compared with a state-of-the-art structural shape analysis approach using spherical harmonics. Difference maps between the mean shapes of two test groups demonstrate that the two approaches yielded results with insignificant differences, while Gaussian curvature measures calculated between matched vertices showed a better fit and reduced variability with spectral matching.

  15. Mean composite fire severity metrics computed with Google Earth Engine offer improved accuracy and expanded mapping potential

    USGS Publications Warehouse

    Parks, Sean; Holsinger, Lisa M.; Voss, Morgan; Loehman, Rachel A.; Robinson, Nathaniel P.

    2018-01-01

Landsat-based fire severity datasets are an invaluable resource for monitoring and research purposes. These gridded fire severity datasets are generally produced with pre- and post-fire imagery to estimate the degree of fire-induced ecological change. Here, we introduce methods to produce three Landsat-based fire severity metrics using the Google Earth Engine (GEE) platform: the delta normalized burn ratio (dNBR), the relativized delta normalized burn ratio (RdNBR), and the relativized burn ratio (RBR). Our methods do not rely on time-consuming a priori scene selection and instead use a mean compositing approach in which all valid pixels (e.g. cloud-free) over a pre-specified date range (pre- and post-fire) are stacked and the mean value for each pixel over each stack is used to produce the resulting fire severity datasets. This approach demonstrates that fire severity datasets can be produced with relative ease and speed compared to the standard approach, in which one pre-fire and one post-fire scene are judiciously identified and used to produce fire severity datasets. We also validate the GEE-derived fire severity metrics using field-based fire severity plots for 18 fires in the western US. These validations are compared to Landsat-based fire severity datasets produced using only one pre- and one post-fire scene, which has been the standard approach in producing such datasets since their inception. Results indicate that the GEE-derived fire severity datasets show improved validation statistics compared to parallel versions in which only one pre-fire and one post-fire scene are used. We provide code and a sample geospatial fire history layer to produce dNBR, RdNBR, and RBR for the 18 fires we evaluated. Although our approach requires that a geospatial fire history layer (i.e. fire perimeters) be produced independently and prior to applying our methods, we suggest our GEE methodology can reasonably be implemented on hundreds to thousands of fires, thereby increasing opportunities for fire severity monitoring and research across the globe.
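Once the pre- and post-fire mean composites exist, two of the named metrics reduce to simple per-pixel arithmetic. A sketch under the conventional formulas (dNBR scaled by 1000; RBR = dNBR / (preNBR + 1.001), following Parks et al.), with hypothetical reflectance values:

```python
def nbr(nir, swir):
    """Normalized burn ratio from mean-composite NIR and SWIR reflectance."""
    return (nir - swir) / (nir + swir)

def severity_metrics(pre_nir, pre_swir, post_nir, post_swir):
    """Per-pixel dNBR and RBR; dNBR is conventionally scaled by 1000."""
    pre, post = nbr(pre_nir, pre_swir), nbr(post_nir, post_swir)
    dnbr = (pre - post) * 1000.0
    rbr = dnbr / (pre + 1.001)
    return dnbr, rbr

# Hypothetical composite reflectances for one severely burned pixel
dnbr, rbr = severity_metrics(pre_nir=0.5, pre_swir=0.2,
                             post_nir=0.2, post_swir=0.4)
```

The GEE contribution in the abstract is upstream of this arithmetic: building the cloud-free mean composites that feed `pre_*` and `post_*` without hand-picking scenes.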

  16. EnviroAtlas - Austin, TX - Potential Window Views of Water by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset describes the block group population and the percentage of the block group population that has potential views of water bodies. A potential view of water is defined as having a body of water that is greater than 300m2 within 50m of a residential location. The window views are considered potential because the procedure does not account for presence or directionality of windows in one's home. The residential locations are defined using the EnviroAtlas Dasymetric (2011/October 2015 version) map. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  17. EnviroAtlas - Austin, TX - BenMAP Results by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset demonstrates the effect of changes in pollution concentration on local populations in 750 block groups in Austin, Texas. The US EPA's Environmental Benefits Mapping and Analysis Program (BenMAP) was used to estimate the incidence of adverse health effects (i.e., mortality and morbidity) and associated monetary value that result from changes in pollution concentrations for Travis and Williamson Counties, TX. Incidence and value estimates for the block groups are calculated using i-Tree models (www.itreetools.org), local weather data, pollution data, and U.S. Census derived population data. This dataset was produced by the US Forest Service to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  18. Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

    PubMed Central

    Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene

    2018-01-01

    Abstract The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie PMID:29688379

  19. Toward robust estimation of the components of forest population change: simulation results

    Treesearch

    Francis A. Roesch

    2014-01-01

    This report presents the full simulation results of the work described in Roesch (2014), in which multiple levels of simulation were used to test the robustness of estimators for the components of forest change. In that study, a variety of spatial-temporal populations were created based on, but more variable than, an actual forest monitoring dataset, and then those...

  20. [Theory, method and application of method R on estimation of (co)variance components].

    PubMed

    Liu, Wen-Zhong

    2004-07-01

The theory, method, and application of Method R for estimation of (co)variance components are reviewed in order to promote reasonable use of the method. Estimation requires R values, which are regressions of predicted random effects calculated using the complete dataset on predicted random effects calculated using random subsets of the same data. By using a multivariate iteration algorithm based on a transformation matrix, combined with the preconditioned conjugate gradient method to solve the mixed model equations, the computational efficiency of Method R is much improved. Method R is computationally inexpensive, and the sampling errors and approximate credible intervals of estimates can be obtained. Disadvantages of Method R include a larger sampling variance than other methods for the same data, and biased estimates in small datasets. As an alternative method, Method R can be used on larger datasets. It is necessary to study its theoretical properties and broaden its application range further.
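The R value at the heart of the method is an ordinary least-squares regression coefficient between two sets of predictions. A minimal sketch (at the true (co)variance components the expected R is 1, which is the property the estimation exploits):

```python
def method_r_value(full_preds, subset_preds):
    """Regression coefficient of full-data predictions of random effects on
    predictions from a random data subset; Method R searches for the
    (co)variance components at which this coefficient equals 1."""
    n = len(full_preds)
    mx = sum(subset_preds) / n
    my = sum(full_preds) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(subset_preds, full_preds))
    sxx = sum((x - mx) ** 2 for x in subset_preds)
    return sxy / sxx

# Hypothetical predicted breeding values for four animals
full   = [1.0, 2.0, 3.0, 4.0]
subset = [0.8, 1.9, 3.1, 4.2]
r = method_r_value(full, subset)
```

In practice both prediction vectors come from solving the mixed model equations, which is where the preconditioned conjugate gradient solver mentioned above enters.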

  1. Calibration of limited-area ensemble precipitation forecasts for hydrological predictions

    NASA Astrophysics Data System (ADS)

    Diomede, Tommaso; Marsigli, Chiara; Montani, Andrea; Nerozzi, Fabrizio; Paccagnella, Tiziana

    2015-04-01

The main objective of this study is to investigate the impact of calibration on limited-area ensemble precipitation forecasts used to drive discharge predictions up to 5 days in advance. A reforecast dataset spanning 30 years, based on the Consortium for Small Scale Modeling Limited-Area Ensemble Prediction System (COSMO-LEPS), was used for testing the calibration strategy. Three calibration techniques were applied: quantile-to-quantile mapping, linear regression, and analogs. The performance of these methodologies was evaluated in terms of statistical scores for the precipitation forecasts operationally provided by COSMO-LEPS in the years 2003-2007 over Germany, Switzerland, and the Emilia-Romagna region (northern Italy). The analog-based method seemed preferable because of its capability to correct position errors and spread deficiencies. A suitable spatial domain for the analog search can help to handle model spatial errors as systematic errors. However, the performance of the analog-based method may degrade when only a limited training dataset is available. A sensitivity test on the length of the training dataset over which to perform the analog search was carried out. The quantile-to-quantile mapping and linear regression methods were less effective, mainly because the forecast-analysis relation was not strong for the available training dataset. A comparison between calibration based on the deterministic reforecast and calibration based on the full operational ensemble used as a training dataset was considered, with the aim of evaluating whether reforecasts are really worthwhile for calibration, given their remarkable computational cost. The verification of the calibration process was then performed by coupling ensemble precipitation forecasts with a distributed rainfall-runoff model. This test was carried out for a medium-sized catchment located in Emilia-Romagna, showing a beneficial impact of the analog-based method on the reduction of missed events for discharge predictions.
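The analog method favored above can be sketched very compactly: find the past forecasts most similar to the current one and use their verifying observations as the calibrated ensemble. The scalar inputs, the absolute-difference similarity measure, and the choice of three analogs are all simplifying assumptions; a real implementation compares spatial precipitation fields over a search domain:

```python
def analog_calibration(current_fcst, reforecasts, observations, n_analogs=3):
    """Analog sketch: rank past forecasts by closeness to the current
    forecast and return the observations that verified the closest ones
    as the calibrated ensemble members."""
    ranked = sorted(range(len(reforecasts)),
                    key=lambda i: abs(reforecasts[i] - current_fcst))
    return [observations[i] for i in ranked[:n_analogs]]

# Hypothetical training sample drawn from a 30-year reforecast (mm of rain)
past_fcst = [2.0, 15.0, 7.5, 8.0, 30.0]
past_obs  = [1.5, 20.0, 6.0, 10.0, 25.0]
members   = analog_calibration(8.2, past_fcst, past_obs)
```

This also makes the abstract's caveat concrete: with a short training record, good analogs of an unusual forecast may simply not exist.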

  2. Assessment of the effects and limitations of the 1998 to 2008 Abbreviated Injury Scale map using a large population-based dataset

    PubMed Central

    2011-01-01

Background Trauma systems should consistently monitor a given trauma population over a period of time. The Abbreviated Injury Scale (AIS) and derived scores such as the Injury Severity Score (ISS) are commonly used to quantify injury severities in trauma registries. To reflect contemporary trauma management and treatment, the most recent version of the AIS (AIS08) contains many codes which differ in severity from their equivalents in the earlier 1998 version (AIS98). Consequently, the adoption of AIS08 may impede comparisons between data coded using different AIS versions. It may also affect the number of patients classified as major trauma. Methods The entire AIS98-coded injury dataset of a large population-based trauma registry was retrieved and mapped to AIS08 using the currently available AIS98-AIS08 dictionary map. The percentage of codes which had increased or decreased in severity, or could not be mapped, was examined in conjunction with the effect of these changes on the calculated ISS. The potential for free text information accompanying AIS coding to improve the quality of AIS mapping was explored. Results A total of 128280 AIS98-coded injuries were evaluated in 32134 patients, 15471 of whom were classified as major trauma. Although only 4.5% of dictionary codes decreased in severity from AIS98 to AIS08, this represented almost 13% of injuries in the registry. In 4.9% of patients, no injuries could be mapped. ISS was potentially unreliable in one-third of patients, as they had at least one AIS98 code which could not be mapped. Using AIS08, the number of patients classified as major trauma decreased by between 17.3% and 30.3%. Evaluation of free text descriptions for some injuries demonstrated the potential to improve mapping between AIS versions. Conclusions Converting AIS98-coded data to AIS08 results in a significant decrease in the number of patients classified as major trauma. 
Many AIS98 codes are missing from the existing AIS map, and across a trauma population the AIS08 dataset estimates which it produces are of insufficient quality to be used in practice. However, it may be possible to improve AIS98 to AIS08 mapping to the point where it is useful to established registries. PMID:21214906

  3. Genomic prediction based on data from three layer lines using non-linear regression models.

    PubMed

    Huang, Heyun; Windig, Jack J; Vereijken, Addie; Calus, Mario P L

    2014-11-06

    Most studies on genomic prediction with reference populations that include multiple lines or breeds have used linear models. Data heterogeneity due to using multiple populations may conflict with model assumptions used in linear regression methods. In an attempt to alleviate potential discrepancies between assumptions of linear models and multi-population data, two types of alternative models were used: (1) a multi-trait genomic best linear unbiased prediction (GBLUP) model that modelled trait by line combinations as separate but correlated traits and (2) non-linear models based on kernel learning. These models were compared to conventional linear models for genomic prediction for two lines of brown layer hens (B1 and B2) and one line of white hens (W1). The three lines each had 1004 to 1023 training and 238 to 240 validation animals. Prediction accuracy was evaluated by estimating the correlation between observed phenotypes and predicted breeding values. When the training dataset included only data from the evaluated line, non-linear models yielded at best a similar accuracy as linear models. In some cases, when adding a distantly related line, the linear models showed a slight decrease in performance, while non-linear models generally showed no change in accuracy. When only information from a closely related line was used for training, linear models and non-linear radial basis function (RBF) kernel models performed similarly. The multi-trait GBLUP model took advantage of the estimated genetic correlations between the lines. Combining linear and non-linear models improved the accuracy of multi-line genomic prediction. Linear models and non-linear RBF models performed very similarly for genomic prediction, despite the expectation that non-linear models could deal better with the heterogeneous multi-population data. 
This heterogeneity of the data can be overcome by modelling trait by line combinations as separate but correlated traits, which avoids the occasional occurrence of large negative accuracies when the evaluated line was not included in the training dataset. Furthermore, when using a multi-line training dataset, non-linear models provided information on the genotype data that was complementary to the linear models, which indicates that the underlying data distributions of the three studied lines were indeed heterogeneous.

  4. Information Theory and Voting Based Consensus Clustering for Combining Multiple Clusterings of Chemical Structures.

    PubMed

    Saeed, Faisal; Salim, Naomie; Abdo, Ammar

    2013-07-01

Many consensus clustering methods have been applied in different areas such as pattern recognition, machine learning, information theory and bioinformatics. However, few methods have been used for clustering chemical compounds. In this paper, an information theory and voting based algorithm (Adaptive Cumulative Voting-based Aggregation Algorithm, A-CVAA) was examined for combining multiple clusterings of chemical structures. The effectiveness of clusterings was evaluated based on the ability of the clustering method to separate active from inactive molecules in each cluster, and the results were compared with Ward's method. The chemical dataset MDL Drug Data Report (MDDR) and the Maximum Unbiased Validation (MUV) dataset were used. Experiments suggest that the adaptive cumulative voting-based consensus method can improve the effectiveness of combining multiple clusterings of chemical structures. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
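Consensus clustering methods of this family aggregate evidence across base clusterings. A-CVAA itself is a label-alignment and cumulative-voting scheme; as a simpler stand-in, the sketch below accumulates pairwise co-association evidence, i.e., the fraction of base clusterings that place two compounds in the same cluster:

```python
from collections import defaultdict

def co_association(clusterings, n_items):
    """Co-association sketch (a stand-in for A-CVAA's voting): for every
    item pair, the fraction of base clusterings assigning both items to
    the same cluster. Pairs never co-clustered are simply absent."""
    pair_counts = defaultdict(int)
    for labels in clusterings:
        for i in range(n_items):
            for j in range(i + 1, n_items):
                if labels[i] == labels[j]:
                    pair_counts[(i, j)] += 1
    m = len(clusterings)
    return {pair: c / m for pair, c in pair_counts.items()}

# Three hypothetical base clusterings of four compounds
base = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
evidence = co_association(base, 4)
```

A final consensus partition can then be obtained by clustering this evidence matrix, e.g. by linking pairs whose co-association exceeds a threshold.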

  5. Differential expression analysis for RNAseq using Poisson mixed models.

    PubMed

    Sun, Shiquan; Hood, Michelle; Scott, Laura; Peng, Qinke; Mukherjee, Sayan; Tung, Jenny; Zhou, Xiang

    2017-06-20

Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g. negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders. Here, we present a Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence. We also develop a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. With simulations, we show that our method properly controls for type I error and is generally more powerful than other widely used approaches, except in small samples (n < 15) with other unfavorable properties (e.g. small effect sizes). We also apply our method to three real datasets that contain related individuals, population stratification or hidden confounders. Our results show that our method increases power in all three datasets compared to other approaches, though the power gain is smallest in the smallest sample (n = 6). Our method is implemented in MACAU, freely available at www.xzlab.org/software.html. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  6. SEER Data & Software

    Cancer.gov

    Options for accessing datasets for incidence, mortality, county populations, standard populations, expected survival, and SEER-linked and specialized data. Plus variable definitions, documentation for reporting and using datasets, statistical software (SEER*Stat), and observational research resources.

  7. Developing approaches for linear mixed modeling in landscape genetics through landscape-directed dispersal simulations

    USGS Publications Warehouse

    Row, Jeffrey R.; Knick, Steven T.; Oyler-McCance, Sara J.; Lougheed, Stephen C.; Fedy, Bradley C.

    2017-01-01

    Dispersal can impact population dynamics and geographic variation, and thus, genetic approaches that can establish which landscape factors influence population connectivity have ecological and evolutionary importance. Mixed models that account for the error structure of pairwise datasets are increasingly used to compare models relating genetic differentiation to pairwise measures of landscape resistance. A model selection framework based on information criteria metrics or explained variance may help disentangle the ecological and landscape factors influencing genetic structure, yet there is currently no consensus on the best protocols. Here, we develop landscape-directed simulations and test a series of replicates that emulate independent empirical datasets of two species with different life history characteristics (greater sage-grouse; eastern foxsnake). We determined that in our simulated scenarios, AIC and BIC were the best model selection indices and that marginal R² values were biased toward more complex models. The model coefficients for landscape variables generally reflected the underlying dispersal model, with confidence intervals that did not overlap zero across the entire model set. When we controlled for geographic distance, variables not in the underlying dispersal models (i.e., non-true variables) typically overlapped zero. Our study helps establish methods for using linear mixed models to identify the features underlying patterns of dispersal across a variety of landscapes.
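
    The AIC/BIC comparison this record relies on can be sketched with the standard definitions and invented log-likelihoods (the numbers are not from the study):

```python
import numpy as np

# AIC and BIC trade goodness of fit against parameter count, so an
# overfit model with a slightly better likelihood can still lose.
def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    return k * np.log(n) - 2 * loglik

n = 120  # hypothetical number of pairwise observations
simple = aic(-250.0, k=3), bic(-250.0, k=3, n=n)     # few landscape terms
complex_ = aic(-248.5, k=8), bic(-248.5, k=8, n=n)   # many landscape terms
print(simple, complex_)  # lower is better; the simple model wins here
```

    BIC's log(n) penalty grows with sample size, which is why it tends to favor simpler models than marginal-R²-style criteria on large pairwise datasets.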

  8. Validation for Vegetation Green-up Date Extracted from GIMMS NDVI and NDVI3g Using Variety of Methods

    NASA Astrophysics Data System (ADS)

    Chang, Q.; Jiao, W.

    2017-12-01

    Phenology is a sensitive and critical feature of vegetation change that has been regarded as a good indicator in climate change studies. A variety of remote sensing data sources and phenology extraction methods have been developed to study the spatial-temporal dynamics of vegetation phenology. However, the differences between phenology results caused by the various satellite datasets and extraction methods are not clear, and the reliability of the phenology results extracted from remote sensing datasets has not been verified and compared using ground observation data. Using the three most popular remote sensing phenology extraction methods, this research calculated the start of the growing season (SOS) for each pixel in the Northern Hemisphere from two long time series satellite datasets: GIMMS NDVIg (SOSg) and GIMMS NDVI3g (SOS3g). The three methods are the maximum increase method, the dynamic threshold method and the midpoint method. This study then used SOS calculated from NEE datasets (SOS_NEE) monitored at 48 eddy flux tower sites from the global flux website to validate the reliability of the six phenology results calculated from the remote sensing datasets. Results showed that neither SOSg nor SOS3g extracted by the maximum increase method was correlated with the ground-observed phenology metrics. SOSg and SOS3g extracted by the dynamic threshold method and the midpoint method were both significantly correlated with SOS_NEE. Compared with SOSg extracted by the dynamic threshold method, SOSg extracted by the midpoint method had a stronger correlation with SOS_NEE; the same held for SOS3g. Additionally, SOSg showed a stronger correlation with SOS_NEE than SOS3g extracted by the same method. SOS extracted by the midpoint method from the GIMMS NDVIg dataset thus appears to be the most reliable result when validated against SOS_NEE. These results can serve as a reference for data and method selection in future phenology studies.
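
    The dynamic threshold idea mentioned above can be sketched as follows; the threshold fraction and the toy NDVI curve are illustrative assumptions, not the study's settings:

```python
import numpy as np

# SOS is taken as the first day of year at which NDVI rises past a fixed
# fraction of its seasonal amplitude (min + fraction * (max - min)).
def sos_dynamic_threshold(ndvi, doy, fraction=0.5):
    amplitude = ndvi.max() - ndvi.min()
    threshold = ndvi.min() + fraction * amplitude
    crossing = np.argmax(ndvi >= threshold)  # first index at/above threshold
    return doy[crossing]

doy = np.arange(1, 366, 8)                          # 8-day composite dates
ndvi = 0.2 + 0.5 / (1 + np.exp(-(doy - 120) / 15))  # toy green-up curve
print(sos_dynamic_threshold(ndvi, doy))
```

    The maximum increase and midpoint methods differ only in how the crossing point is defined on the same fitted seasonal curve.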

  9. Predicting Overall Survival After Stereotactic Ablative Radiation Therapy in Early-Stage Lung Cancer: Development and External Validation of the Amsterdam Prognostic Model

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Louie, Alexander V., E-mail: Dr.alexlouie@gmail.com; Department of Radiation Oncology, London Regional Cancer Program, University of Western Ontario, London, Ontario; Department of Epidemiology, Harvard School of Public Health, Harvard University, Boston, Massachusetts

    Purpose: A prognostic model for 5-year overall survival (OS), consisting of recursive partitioning analysis (RPA) and a nomogram, was developed for patients with early-stage non-small cell lung cancer (ES-NSCLC) treated with stereotactic ablative radiation therapy (SABR). Methods and Materials: A primary dataset of 703 ES-NSCLC SABR patients was randomly divided into a training (67%) and an internal validation (33%) dataset. In the former group, 21 unique parameters consisting of patient, treatment, and tumor factors were entered into an RPA model to predict OS. Univariate and multivariate models were constructed for RPA-selected factors to evaluate their relationship with OS. A nomogram for OS was constructed based on factors significant in multivariate modeling and validated with calibration plots. Both the RPA and the nomogram were externally validated in independent surgical (n=193) and SABR (n=543) datasets. Results: RPA identified 2 distinct risk classes based on tumor diameter, age, World Health Organization performance status (PS) and Charlson comorbidity index. This RPA had moderate discrimination in SABR datasets (c-index range: 0.52-0.60) but was of limited value in the surgical validation cohort. The nomogram predicting OS included smoking history in addition to the RPA-identified factors. In contrast to the RPA, the nomogram performed well in internal validation (r²=0.97) and in the external SABR (r²=0.79) and surgical (r²=0.91) cohorts. Conclusions: The Amsterdam prognostic model is the first externally validated prognostication tool for OS in ES-NSCLC treated with SABR, available to individualize patient decision making. The nomogram retained strong performance across the surgical and SABR external validation datasets. RPA performance was poor in surgical patients, suggesting that 2 distinct patient populations are being treated with these 2 effective modalities.

  10. A robust and efficient statistical method for genetic association studies using case and control samples from multiple cohorts

    PubMed Central

    2013-01-01

    Background The theoretical basis of genome-wide association studies (GWAS) is statistical inference of linkage disequilibrium (LD) between any polymorphic marker and a putative disease locus. Most methods widely implemented for such analyses are vulnerable to several key demographic factors, delivering poor statistical power for detecting genuine associations as well as a high false positive rate. Here, we present a likelihood-based statistical approach that properly accounts for the non-random nature of case–control samples with regard to the genotypic distribution at the loci in the populations under study, and that confers flexibility to test for genetic association in the presence of different confounding factors such as population structure and non-random sampling. Results We implemented this novel method, together with several popular methods from the GWAS literature, to re-analyze recently published Parkinson’s disease (PD) case–control samples. The real data analysis and computer simulations show that the new method confers not only significantly improved statistical power for detecting associations but also robustness to the difficulties stemming from non-random sampling and genetic structure, when compared to its rivals. In particular, the new method detected 44 significant SNPs within 25 chromosomal regions of size < 1 Mb, whereas only 6 SNPs in two of these regions were previously detected by trend-test-based methods. It discovered two SNPs located 1.18 Mb and 0.18 Mb from the PD candidate genes FGF20 and PARK8, respectively, without incurring false positive risk. Conclusions We developed a novel likelihood-based method that provides adequate estimation of LD and other population model parameters from case and control samples, eases the integration of such samples from multiple genetically divergent populations, and thus confers statistically robust and powerful GWAS analyses. On the basis of simulation studies and analysis of real datasets, we demonstrated significant improvement of the new method over the non-parametric trend test, the most widely used method in the GWAS literature. PMID:23394771
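
    The baseline trend test this record compares against can be sketched as a standard Cochran-Armitage statistic; the genotype counts below are invented for illustration:

```python
import numpy as np
from scipy.stats import norm

# Cochran-Armitage trend test for a biallelic marker: genotype counts in
# cases (r) and controls (s) are weighted by allele dosage t = (0, 1, 2).
def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    r = np.asarray(cases, dtype=float)
    s = np.asarray(controls, dtype=float)
    t = np.asarray(weights, dtype=float)
    n = r + s
    R, S, N = r.sum(), s.sum(), (r + s).sum()
    T = np.sum(t * (r * S - s * R))
    var = (R * S / N) * (np.sum(t ** 2 * n * (N - n))
                         - 2 * sum(t[i] * t[j] * n[i] * n[j]
                                   for i in range(len(t))
                                   for j in range(i + 1, len(t))))
    z = T / np.sqrt(var)
    return z, 2 * norm.sf(abs(z))  # two-sided p-value

z, p = cochran_armitage(cases=[100, 200, 100], controls=[150, 200, 50])
print(z, p)
```

    The likelihood-based approach in the record replaces this marginal statistic with a full model of the case–control sampling process, which is where its robustness to population structure comes from.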

  11. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations.

    PubMed

    Li, Liqi; Cui, Xiang; Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi

    2014-01-01

    Protein structure prediction is critical to the functional annotation of massively accumulating biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction has become an increasingly challenging task. Among most homology-based approaches, the accuracies of protein structural class prediction are sufficiently high for high-similarity datasets, but still far from satisfactory for low-similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high- and low-similarity datasets. This method is based on a Support Vector Machine (SVM) in conjunction with integrated features from the position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors by recursively removing the feature with the lowest ranking score. The top features selected by SVM-RFE are then input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low-similarity datasets.
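
    The SVM-RFE ranking step can be sketched with scikit-learn on synthetic data (the real feature vectors integrate PSSM, PROFEAT and GO information; the sizes here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in for integrated feature vectors.
X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

# A linear SVM supplies the ranking criterion; RFE drops the
# lowest-weighted feature each round until 10 remain.
selector = RFE(SVC(kernel="linear"), n_features_to_select=10, step=1).fit(X, y)

top_features = [i for i, kept in enumerate(selector.support_) if kept]
print(top_features)
```

    In the record's pipeline, the surviving features would then be passed to the final SVM engines for structural class prediction.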

  12. Independent Demographic Responses to Climate Change among Temperate and Tropical Milksnakes (Colubridae: Genus Lampropeltis)

    PubMed Central

    Ruane, Sara; Torres-Carvajal, Omar; Burbrink, Frank T.

    2015-01-01

    The effects of Late Quaternary climate change have been examined for many temperate New World taxa, but the impact of Pleistocene glacial cycles on Neotropical taxa is less well understood, specifically with respect to changes in population demography. Here, we examine historical demographic trends for six species of milksnake with representatives in both the temperate and tropical Americas to determine if species share responses to climate change as a taxon or by area (i.e., temperate versus tropical environments). Using a multilocus dataset, we test for the demographic signature of population expansion and decline using non-genealogical summary statistics, as well as coalescent-based methods. In addition, we determine whether range sizes are correlated with effective population sizes for milksnakes. Results indicate that there are no identifiable trends with respect to demographic response based on location, and that species responded to changing climates independently, with tropical taxa showing greater instability. There is also no correlation between range size and effective population size, with the largest population size belonging to the species with the smallest geographic distribution. Our study highlights the importance of not generalizing the demographic histories of taxa by region and further illustrates that the New World tropics may not have been a stable refuge during the Pleistocene. PMID:26083467

  14. Effects of Sample Selection Bias on the Accuracy of Population Structure and Ancestry Inference

    PubMed Central

    Shringarpure, Suyash; Xing, Eric P.

    2014-01-01

    Population stratification is an important task in genetic analyses. It provides information about the ancestry of individuals and can be an important confounder in genome-wide association studies. Public genotyping projects have made a large number of datasets available for study. However, practical constraints dictate that only a small number of individuals from a geographical or ethnic population are genotyped; the resulting data are a sample from the entire population. If the distribution of sample sizes is not representative of the populations being sampled, the accuracy of population stratification analyses of the data could be affected. We attempt to understand the effect of biased sampling on the accuracy of population structure analysis and individual ancestry recovery. We examined two methods commonly used for analyses of such datasets, ADMIXTURE and EIGENSOFT, and found that the accuracy of recovery of population structure is affected to a large extent by the sample used for analysis and how representative it is of the underlying populations. Using simulated data and real genotype data from cattle, we show that sample selection bias can affect the results of population structure analyses. We develop a mathematical framework for sample selection bias in models for population structure and propose a correction for sample selection bias using auxiliary information about the sample. We demonstrate that such a correction is effective in practice using simulated and real data. PMID:24637351

  15. Supervised learning based multimodal MRI brain tumour segmentation using texture features from supervoxels.

    PubMed

    Soltaninejad, Mohammadreza; Yang, Guang; Lambrou, Tryphon; Allinson, Nigel; Jones, Timothy L; Barrick, Thomas R; Howe, Franklyn A; Ye, Xujiong

    2018-04-01

    Accurate segmentation of brain tumours in magnetic resonance images (MRI) is a difficult task due to the variety of tumour types. Using information and features from multimodal MRI, including structural MRI and the isotropic (p) and anisotropic (q) components derived from diffusion tensor imaging (DTI), may result in a more accurate analysis of brain images. We propose a novel 3D supervoxel-based learning method for segmentation of tumour in multimodal MRI brain images (conventional MRI and DTI). Supervoxels are generated using the information across the multimodal MRI dataset. For each supervoxel, a variety of features are extracted, including histograms of a texton descriptor, calculated using a set of Gabor filters with different sizes and orientations, and first-order intensity statistical features. These features are fed into a random forests (RF) classifier to classify each supervoxel as tumour core, oedema or healthy brain tissue. The method is evaluated on two datasets: 1) our clinical dataset of 11 multimodal images of patients, and 2) the BRATS 2013 clinical dataset of 30 multimodal images. For our clinical dataset, the average detection sensitivity of tumour (including tumour core and oedema) using multimodal MRI is 86% with a balanced error rate (BER) of 7%, while the Dice score for automatic tumour segmentation against ground truth is 0.84. The corresponding results for the BRATS 2013 dataset are 96%, 2% and 0.89, respectively. The method demonstrates promising results in the segmentation of brain tumour. Adding features from multimodal MRI images can largely increase the segmentation accuracy. The method provides a close match to expert delineation across all tumour grades, leading to a faster and more reproducible method of brain tumour detection and delineation to aid patient management. Copyright © 2018 Elsevier B.V. All rights reserved.
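
    The final per-supervoxel classification step can be sketched with a random forest on stand-in feature vectors (random numbers here, not real texton histograms or intensity statistics):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_supervoxels, n_features = 300, 40

# Stand-in per-supervoxel feature vectors and class labels.
X = rng.normal(size=(n_supervoxels, n_features))
y = rng.integers(0, 3, size=n_supervoxels)  # 0=healthy, 1=oedema, 2=core

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
labels = clf.predict(X)
print(labels[:10])
```

    Classifying supervoxels rather than raw voxels is what keeps the feature extraction and the RF tractable on full multimodal volumes.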

  16. U.S. Datasets

    Cancer.gov

    Datasets for U.S. mortality, U.S. populations, standard populations, county attributes, and expected survival. Plus SEER-linked databases (SEER-Medicare, SEER-Medicare Health Outcomes Survey [SEER-MHOS], SEER-Consumer Assessment of Healthcare Providers and Systems [SEER-CAHPS]).

  17. Allele Frequencies Net Database: Improvements for storage of individual genotypes and analysis of existing data.

    PubMed

    Santos, Eduardo Jose Melos Dos; McCabe, Antony; Gonzalez-Galarza, Faviel F; Jones, Andrew R; Middleton, Derek

    2016-03-01

    The Allele Frequencies Net Database (AFND) is a freely accessible database which stores population frequencies for alleles or genes of the immune system in worldwide populations. Herein we introduce two new tools. We have defined new classifications of data (gold, silver and bronze) to assist users in identifying the most suitable populations for their tasks. The gold standard datasets are defined by allele frequencies summing to 1, sample sizes >50 and high resolution genotyping, while silver standard datasets do not meet gold standard genotyping resolution and/or sample size criteria. The bronze standard datasets are those that could not be classified under the silver or gold standards. The gold standard includes >500 datasets covering over 3 million individuals from >100 countries at one or more of the following loci: HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1 and -DRB1 - with all loci except DPA1 present in more than 220 datasets. Three out of 12 geographic regions have low representation (the majority of their countries having less than five datasets) and the Central Asia region has no representation. There are 18 countries that are not represented by any gold standard datasets but are represented by at least one dataset that is either silver or bronze standard. We also briefly summarize the data held by AFND for KIR genes, alleles and their ligands. Our second new component is a data submission tool to assist users in the collection of the genotypes of the individuals (raw data), facilitating submission of short population reports to Human Immunology, as well as simplifying the submission of population demographics and frequency data. Copyright © 2015 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.
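
    One possible reading of the gold/silver/bronze tiers described above can be sketched as a simple classifier; the function name and the exact silver rule are assumptions, not AFND's implementation:

```python
# Gold: allele frequencies sum to 1, sample size > 50, high-resolution
# genotyping. Silver: complete frequencies but failing the size and/or
# resolution criteria. Bronze: everything else.
def classify_dataset(freq_sum, sample_size, high_resolution):
    if abs(freq_sum - 1.0) < 1e-6 and sample_size > 50 and high_resolution:
        return "gold"
    if abs(freq_sum - 1.0) < 1e-6:
        return "silver"
    return "bronze"

print(classify_dataset(1.0, 120, True))   # meets all gold criteria
print(classify_dataset(1.0, 30, True))    # complete but too small
```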

  18. The Isprs Benchmark on Indoor Modelling

    NASA Astrophysics Data System (ADS)

    Khoshelham, K.; Díaz Vilariño, L.; Peter, M.; Kang, Z.; Acharya, D.

    2017-09-01

    Automated generation of 3D indoor models from point cloud data has been a topic of intensive research in recent years. While results on various datasets have been reported in the literature, a comparison of the performance of different methods has not been possible due to the lack of benchmark datasets and a common evaluation framework. The ISPRS benchmark on indoor modelling aims to address this issue by providing a public benchmark dataset and an evaluation framework for performance comparison of indoor modelling methods. In this paper, we present the benchmark dataset, comprising several point clouds of indoor environments captured by different sensors. We also discuss the evaluation and comparison of indoor modelling methods based on manually created reference models and appropriate quality evaluation criteria. The benchmark dataset is available for download at: http://www2.isprs.org/commissions/comm4/wg5/benchmark-on-indoor-modelling.html.

  19. A framework for automatic creation of gold-standard rigid 3D-2D registration datasets.

    PubMed

    Madan, Hennadii; Pernuš, Franjo; Likar, Boštjan; Špiclin, Žiga

    2017-02-01

    Advanced image-guided medical procedures incorporate 2D intra-interventional information into the pre-interventional 3D image and procedure plan through 3D/2D image registration (32R). To enter clinical use, and even for publication purposes, novel and existing 32R methods have to be rigorously validated. The performance of a 32R method can be estimated by comparing it to an accurate reference or gold standard method (usually based on fiducial markers) on the same set of images (a gold standard dataset). Objective validation and comparison of methods are possible only if the evaluation methodology is standardized and the gold standard dataset is made publicly available. Currently, very few such datasets exist, and only one contains images of multiple patients acquired during a procedure. To encourage the creation of gold standard 32R datasets, we propose an automatic framework based on rigid registration of fiducial markers. The main novelty is the spatial grouping of fiducial markers on the carrier device, which enables automatic marker localization and identification across the 3D and 2D images. The proposed framework was demonstrated on clinical angiograms of 20 patients. Rigid 32R computed by the framework was more accurate than that obtained manually, with a target registration error below 0.027 mm, compared to 0.040 mm for the manual approach. The framework is applicable for gold standard setup on any rigid anatomy, provided that the acquired images contain spatially grouped fiducial markers. The gold standard datasets and software will be made publicly available.
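
    The target registration error figures quoted above can be illustrated with a minimal synthetic computation (points and transforms are invented; this is not the framework's code):

```python
import numpy as np

# TRE: mean distance between target points mapped by the estimated rigid
# transform and by the gold-standard one.
def tre(targets, R_est, t_est, R_gold, t_gold):
    mapped_est = targets @ R_est.T + t_est
    mapped_gold = targets @ R_gold.T + t_gold
    return np.linalg.norm(mapped_est - mapped_gold, axis=1).mean()

rng = np.random.default_rng(1)
targets = rng.uniform(-50, 50, size=(100, 3))  # synthetic anatomy, in mm
R = np.eye(3)                                  # identical rotations here

# Estimated registration off by a 0.02 mm translation along x.
err = tre(targets, R, np.array([0.02, 0.0, 0.0]), R, np.zeros(3))
print(err)
```

    Because TRE is measured at clinically relevant target points rather than at the fiducials themselves, it is the standard figure of merit for such gold standard datasets.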

  20. The StreamCat Dataset: Accumulated Attributes for NHDPlusV2 (Version 2.1) Catchments Riparian Buffer for the Conterminous United States: 2010 US Census Housing Unit and Population Density

    EPA Pesticide Factsheets

    This dataset represents the population and housing unit density within individual, local NHDPlusV2 catchments and upstream, contributing watersheds' riparian buffers, based on 2010 US Census data. Densities are calculated for every block group, and watershed averages are calculated for every local NHDPlusV2 catchment (see Data Sources for links to NHDPlusV2 data and Census data). This dataset is derived from the TIGER/Line files and related database (.dbf) files for the conterminous USA, downloaded as Block Group-Level Census 2010 SF1 data in File Geodatabase format (ArcGIS version 10.0). The landscape raster (LR) was produced from data compiled from the questions asked of all people and about every housing unit. Block-group population density (block-group population / block-group area) and housing-unit density (block-group housing units / block-group area) were summarized by local catchment and by watershed to produce local catchment-level and watershed-level metrics as a continuous data type (see Data Structure and Attribute Information for a description).
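
    The block-group-to-watershed summary described above can be sketched with invented numbers, assuming simple areal weighting (the actual StreamCat processing accumulates raster cells):

```python
# Each block group contributes population in proportion to the area it
# overlaps with the watershed; the watershed density is total population
# over total overlapping area. All values are illustrative.
block_groups = [
    # (population, block-group area km^2, km^2 overlapping the watershed)
    (1200, 2.0, 1.0),
    (800,  4.0, 4.0),
    (3000, 1.5, 0.5),
]

pop_in_watershed = sum(pop * overlap / area for pop, area, overlap in block_groups)
watershed_area = sum(overlap for _, _, overlap in block_groups)
density = pop_in_watershed / watershed_area  # people per km^2
print(density)
```

    Areal weighting assumes population is spread uniformly within each block group, which is the usual caveat for census-to-watershed crosswalks.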

  1. EBprot: Statistical analysis of labeling-based quantitative proteomics data.

    PubMed

    Koh, Hiromi W L; Swa, Hannah L F; Fermin, Damian; Ler, Siok Ghee; Gunaratne, Jayantha; Choi, Hyungwon

    2015-08-01

    Labeling-based proteomics is a powerful method for the detection of differentially expressed proteins (DEPs). Current data analysis platforms typically rely on protein-level ratios, which are obtained by summarizing peptide-level ratios for each protein. In shotgun proteomics, however, some proteins are quantified with more peptides than others, and this reproducibility information is not incorporated into the differential expression (DE) analysis. Here, we propose a novel probabilistic framework, EBprot, that directly models the peptide-protein hierarchy and rewards proteins with reproducible evidence of DE over multiple peptides. To evaluate its performance with known DE states, we conducted a simulation study showing that the peptide-level analysis of EBprot provides better receiver operating characteristics and more accurate estimation of false discovery rates than methods based on protein-level ratios. We also demonstrate the superior classification performance of peptide-level EBprot analysis in a spike-in dataset. To illustrate the wide applicability of EBprot across experimental designs, we applied it to a dataset for lung cancer subtype analysis with biological replicates and to another dataset for time-course phosphoproteome analysis of EGF-stimulated HeLa cells with multiplexed labeling. Through these examples, we show that the peptide-level analysis of EBprot is a robust alternative to existing statistical methods for the DE analysis of labeling-based quantitative datasets. The software suite is freely available on the Sourceforge website http://ebprot.sourceforge.net/. All MS data have been deposited in the ProteomeXchange with identifier PXD001426 (http://proteomecentral.proteomexchange.org/dataset/PXD001426/). © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  2. Risk behaviours among internet-facilitated sex workers: evidence from two new datasets.

    PubMed

    Cunningham, Scott; Kendall, Todd D

    2010-12-01

    Sex workers have historically played a central role in STI outbreaks by forming a core group for transmission and due to their higher rates of concurrency and inconsistent condom usage. Over the past 15 years, North American commercial sex markets have been radically reorganised by internet technologies that channelled a sizeable share of the marketplace online. These changes may have had a meaningful impact on the role that sex workers play in STI epidemics. In this study, two new datasets documenting the characteristics and practices of internet-facilitated sex workers are presented and analysed. The first dataset comes from a ratings website where clients share detailed information on over 94,000 sex workers in over 40 cities between 1999 and 2008. The second dataset reflects a year-long field survey of 685 sex workers who advertise online. Evidence from these datasets suggests that internet-facilitated sex workers are dissimilar from the street-based workers who largely populated the marketplace in earlier eras. Differences in characteristics and practices were found which suggest a lower potential for the spread of STIs among internet-facilitated sex workers. The internet-facilitated population appears to include a high proportion of sex workers who are well-educated, hold health insurance and operate only part time. They also engage in relatively low levels of risky sexual practices.

  3. US geographic distribution of rt-PA utilization by hospital for acute ischemic stroke.

    PubMed

    Kleindorfer, Dawn; Xu, Yingying; Moomaw, Charles J; Khatri, Pooja; Adeoye, Opeolu; Hornung, Richard

    2009-11-01

    Previously, we estimated US national rates of recombinant tissue plasminogen activator (rt-PA) use to be 1.8% to 3.0% of all ischemic stroke patients. However, we hypothesized that the rate of rt-PA use may vary widely by region, and that a large percentage of the US population likely does not have access to hospitals using rt-PA regularly. We describe the US geographic distribution of hospitals using rt-PA for acute ischemic stroke. This analysis used the MEDPAR database, a claims-based dataset that contains every fee-for-service Medicare-eligible hospital discharge in the US. Cases potentially eligible for rt-PA treatment based on diagnosis were defined as those with a hospital DRG code of 14, 15, or 559 that also had an ICD-9 code of 433, 434, or 436. Thrombolysis use was defined as an ICD-9 code of 99.1. The study interval was July 1, 2005 to June 30, 2007. Hospital locations were mapped using ArcView software; population densities and regions of the US are based on the US Census 2000. There were 4750 hospitals in the MEDPAR database, which included 495 186 ischemic stroke admissions during the study period. Of these hospitals, 64% had no reported treatments with rt-PA for ischemic stroke, and 0.9% reported >10% treatment rates within the MEDPAR dataset. Bed size, rural or underserved designation, and population density were significantly associated with reported rt-PA treatment rates, and remained significant in the multivariable regression. Approximately 162 million US citizens reside in counties containing a hospital reporting a ≥2.4% treatment rate within the MEDPAR dataset. We report the first description of rt-PA treatment rates by US hospital. Unfortunately, we found that 64% of US hospitals did not report giving rt-PA at all within the MEDPAR database over a 2-year period. These tended to be hospitals that were smaller (average bed size of 95), located in less densely populated areas, or located in the South or Midwest. In addition, 40% of the US population resides in counties without a hospital that administered rt-PA to at least 2.4% of ischemic stroke patients, although distinguishing transferred patients is problematic within administrative datasets. Such national resource-utilization data are important for planning at the local and national level, especially for initiatives such as telemedicine that aim to reach underserved areas.

  4. Improved Statistical Method For Hydrographic Climatic Records Quality Control

    NASA Astrophysics Data System (ADS)

    Gourrion, J.; Szekely, T.

    2016-02-01

    Climate research benefits from the continuous development of global in-situ hydrographic networks over recent decades. Beyond the increasing volume of observations available across a large range of temporal and spatial scales, a critical aspect is the ability to constantly improve the quality of the datasets. In the context of the Coriolis Dataset for ReAnalysis (CORA) version 4.2, a new quality control method based on a local comparison to the historical extreme values ever observed has been developed, implemented and validated. Temperature, salinity and potential density validity intervals are estimated directly from the minimum and maximum values in an historical reference dataset, rather than from traditional mean and standard deviation estimates. Such an approach avoids strong statistical assumptions on the data distributions, such as unimodality, absence of skewness and spatially homogeneous kurtosis. As a new feature, it also allows the two main objectives of a quality control strategy to be addressed simultaneously: maximizing the number of good detections while minimizing the number of false alarms. The reference dataset is presently built from the fusion of 1) all Argo profiles up to early 2014, 2) three historical CTD datasets and 3) the sea mammal CTD profiles from the MEOP database. All datasets are extensively and manually quality controlled. In this communication, the latest method validation results are also presented. The method has been implemented in the latest version of the CORA dataset and will benefit the next version of the Copernicus CMEMS dataset.
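
    The extreme-value validity-interval idea can be sketched in a few lines; the temperatures and the optional margin parameter are illustrative assumptions, not CORA's actual implementation:

```python
import numpy as np

# Flag an observation only if it falls outside the [min, max] envelope of
# a historical reference sample at that location, instead of a
# mean +/- k * std interval that assumes a well-behaved distribution.
def qc_flag(obs, reference, margin=0.0):
    lo, hi = reference.min() - margin, reference.max() + margin
    return not (lo <= obs <= hi)

reference_temps = np.array([12.1, 13.4, 11.8, 14.0, 12.9])  # historical °C
print(qc_flag(13.2, reference_temps))  # within observed extremes
print(qc_flag(25.0, reference_temps))  # beyond the historical maximum
```

    Because the envelope is built from observed extremes rather than moments, a skewed or multimodal local climate does not inflate the false-alarm rate.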

  5. A Hybrid Method for Endocardial Contour Extraction of Right Ventricle in 4-Slices from 3D Echocardiography Dataset.

    PubMed

    Dawood, Faten A; Rahmat, Rahmita W; Kadiman, Suhaini B; Abdullah, Lili N; Zamrin, Mohd D

    2014-01-01

    This paper presents a hybrid method to extract the endocardial contour of the right ventricle (RV) in 4 slices from a 3D echocardiography dataset. The overall framework comprises four processing phases. In Phase I, the region of interest (ROI) is identified by estimating the cavity boundary. Speckle noise reduction and contrast enhancement are implemented in Phase II as preprocessing tasks. In Phase III, the RV cavity region is segmented by generating an intensity threshold that is computed once and used for all frames. Finally, in Phase IV, the RV endocardial contour is extracted over a complete cardiac cycle using a combination of shape-based contour detection and an improved radial search algorithm. The proposed method was applied to 16 3D echocardiography datasets encompassing the RV in the long-axis view. The accuracy of the experimental results was evaluated qualitatively and quantitatively by comparing the segmentation of the RV cavity, based on the extracted endocardial contours, with the ground truth. The comparative analysis shows that the proposed method performs efficiently on all datasets, with an overall performance of 95% and a root mean square distance (RMSD) of 2.21 ± 0.35 mm (mean ± SD) for the RV endocardial contours.

  6. Reliable DNA Barcoding Performance Proved for Species and Island Populations of Comoran Squamate Reptiles

    PubMed Central

    Hawlitschek, Oliver; Nagy, Zoltán T.; Berger, Johannes; Glaw, Frank

    2013-01-01

    In the past decade, DNA barcoding became increasingly common as a method for species identification in biodiversity inventories and related studies. However, mainly due to technical obstacles, squamate reptiles have been the target of few barcoding studies. In this article, we present the results of a DNA barcoding study of squamates of the Comoros archipelago, a poorly studied group of oceanic islands close to and mostly colonized from Madagascar. The barcoding dataset presented here includes 27 of the 29 currently recognized squamate species of the Comoros, including 17 of the 18 endemic species. Some species considered endemic to the Comoros according to current taxonomy were found to cluster with non-Comoran lineages, probably due to poorly resolved taxonomy. All other species for which more than one barcode was obtained corresponded to distinct clusters useful for species identification by barcoding. In most species, even island populations could be distinguished using barcoding. Two cryptic species were identified using the DNA barcoding approach. The obtained barcoding topology, a Bayesian tree based on COI sequences of 5 genera, was compared with available multigene topologies, and in 3 cases, major incongruences between the two topologies became evident. Three of the multigene studies were initiated after initial screening of a preliminary version of the barcoding dataset presented here. We conclude that in the case of the squamates of the Comoros Islands, DNA barcoding has proven a very useful and efficient way of detecting isolated populations and promising starting points for subsequent research. PMID:24069192

  7. Evaluation of unrestrained replica-exchange simulations using dynamic walkers in temperature space for protein structure refinement.

    PubMed

    Olson, Mark A; Lee, Michael S

    2014-01-01

    A central problem of computational structural biology is the refinement of modeled protein structures taken from either comparative modeling or knowledge-based methods. Simulations are commonly used to achieve higher resolution of the structures at the all-atom level, yet methodologies that consistently yield accurate results remain elusive. In this work, we provide an assessment of an adaptive temperature-based replica exchange simulation method in which the temperature clients dynamically walk in temperature space, enriching their population and exchanges near steep energetic barriers. This approach is compared with earlier work applying the conventional method of static temperature clients to refine a dataset of conformational decoys. Our results show that, while an adaptive method has many theoretical advantages over a static distribution of client temperatures, only limited improvement was gained from this strategy in downhill refinement excursions that increase the fraction of native contacts. To illustrate the sampling differences between the two simulation methods, energy landscapes are presented along with their temperature client profiles.
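
    For context, the rule governing swaps between neighbouring temperatures in a conventional replica-exchange run is the standard Metropolis criterion below (in reduced units with k_B = 1). The paper's contribution, the adaptive assignment of temperature clients, is not reproduced here:

```python
import math

def exchange_probability(E_i, E_j, T_i, T_j, k_B=1.0):
    """Metropolis acceptance probability for swapping the configurations of
    two replicas at temperatures T_i and T_j with energies E_i and E_j."""
    delta = (1.0 / (k_B * T_i) - 1.0 / (k_B * T_j)) * (E_i - E_j)
    return min(1.0, math.exp(delta))
```

    Swaps that move a lower-energy configuration to a lower temperature are always accepted; the reverse move is accepted with probability exp(delta) < 1.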

  8. Modified Cut-Off Value of the Urine Protein-To-Creatinine Ratio Is Helpful for Identifying Patients at High Risk for Chronic Kidney Disease: Validation of the Revised Japanese Guideline.

    PubMed

    Yamamoto, Hiroyuki; Yamamoto, Kyoko; Yoshida, Katsumi; Shindoh, Chiyohiko; Takeda, Kyoko; Monden, Masami; Izumo, Hiroko; Niinuma, Hiroyuki; Nishi, Yutaro; Niwa, Koichiro; Komatsu, Yasuhiro

    2015-11-01

    Chronic kidney disease (CKD) is a global public health issue, and strategies for its early detection and intervention are imperative. The latest Japanese CKD guideline recommends that patients without diabetes be classified using the urine protein-to-creatinine ratio (PCR) instead of the urine albumin-to-creatinine ratio (ACR); however, no validation studies are available. This study aimed to validate the PCR-based CKD risk classification against the ACR-based classification and to explore more accurate classification methods. We analyzed two previously reported datasets that included diabetic and/or cardiovascular patients classified into early CKD stages. In total, 860 patients (131 diabetic patients and 729 cardiovascular patients, including 193 diabetic patients) were enrolled. We assessed the CKD risk classification of each patient according to the estimated glomerular filtration rate and the ACR-based or PCR-based classification. The use of the cut-off value recommended in the current guideline (PCR 0.15 g/g creatinine) resulted in risk misclassification rates of 26.0% and 16.6% in the two datasets. The misclassification was primarily caused by underestimation. Moderate to substantial agreement between the two classifications was achieved: Cohen's kappa, 0.56 (95% confidence interval, 0.45-0.69) and 0.72 (0.67-0.76) in each dataset, respectively. To improve accuracy, we tested various candidate PCR cut-off values, showing that a PCR cut-off of 0.08-0.10 g/g creatinine improved both the misclassification rates and the kappa values. Modification of the PCR cut-off value would improve its efficacy in identifying high-risk populations who will benefit from early intervention.
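
    Cohen's kappa, the agreement statistic used above, can be computed directly from two label sequences; a minimal stdlib sketch with hypothetical risk labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two classifications:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from the marginal frequencies."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    cats = sorted(set(labels_a) | set(labels_b))
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)
```

    Identical label sequences give kappa = 1; agreement no better than chance gives kappa = 0.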

  9. Wildlife disease ecology from the individual to the population: Insights from a long-term study of a naturally infected European badger population.

    PubMed

    McDonald, Jenni L; Robertson, Andrew; Silk, Matthew J

    2018-01-01

    Long-term individual-based datasets on host-pathogen systems are a rare and valuable resource for understanding infectious disease dynamics in wildlife. A study of European badgers (Meles meles) naturally infected with bovine tuberculosis (bTB) at Woodchester Park in Gloucestershire (UK) has produced a unique dataset, facilitating investigation of a diverse range of epidemiological and ecological questions with implications for disease management. Since the 1970s, this badger population has been monitored with a systematic mark-recapture regime, yielding a dataset of >15,000 captures of >3,000 individuals that provides detailed individual life-history, morphometric, genetic, reproductive and disease data. The annual prevalence of bTB in the Woodchester Park badger population exhibits no straightforward relationship with population density, and both the incidence and prevalence of Mycobacterium bovis show marked variation in space. The study has revealed phenotypic traits that are critical for understanding the social structure of badger populations, along with mechanisms vital for understanding disease spread at different spatial resolutions. Woodchester-based studies have provided key insights into how host ecology can influence infection at different spatial and temporal scales. Specifically, they have revealed heterogeneity in epidemiological parameters and in the intrinsic and extrinsic factors affecting population dynamics; provided insights into senescence and individual life histories; and demonstrated consistent individual variation in foraging patterns, refuge use and social interactions. An improved understanding of ecological and epidemiological processes is imperative for effective disease management. Woodchester Park research has provided information of direct relevance to bTB management, and a better appreciation of the role of individual heterogeneity in disease transmission can contribute further in this regard.
The Woodchester Park study system now offers a rare opportunity to seek a dynamic understanding of how individual-, group- and population-level processes interact. The wealth of existing data makes it possible to take a more integrative approach to examining how the consequences of individual heterogeneity scale to determine population-level pathogen dynamics and help advance our understanding of the ecological drivers of host-pathogen systems. © 2017 The Authors. Journal of Animal Ecology published by John Wiley & Sons Ltd on behalf of British Ecological Society.

  10. Cetacean Density Estimation from Novel Acoustic Datasets by Acoustic Propagation Modeling

    DTIC Science & Technology

    2014-09-30

    hydrophone, to estimate the population density of false killer whales (Pseudorca crassidens) off the Kona coast of the Island of Hawai’i... killer whale, suffers from interaction with the fisheries industry and its population has been reported to have declined in the past 20 years. Studies...of abundance estimate of false killer whales in Hawai’i through mark-recapture methods will provide comparable results to the ones obtained by this

  11. Study on a pattern classification method of soil quality based on simplified learning sample dataset

    USGS Publications Warehouse

    Zhang, Jiahua; Liu, S.; Hu, Y.; Tian, Y.

    2011-01-01

    Based on the massive soil information involved in current soil quality grade evaluation, this paper constructed an intelligent classification approach for soil quality grade relying on classical sampling techniques and a disordered (nominal) multiclass logistic regression model. As a case study, the learning sample capacity was determined under a given confidence level and estimation accuracy, and a c-means algorithm was used to automatically extract a simplified learning sample dataset from the cultivated soil quality grade evaluation database for the study area, Longchuan County in Guangdong Province; a disordered logistic classifier model was then built and the calculation and analysis steps of intelligent soil quality grade classification were given. The results indicated that soil quality grade can be effectively learned and predicted from the extracted simplified dataset by this method, which changes the traditional approach to soil quality grade evaluation. © 2011 IEEE.

  12. Deep learning based beat event detection in action movie franchises

    NASA Astrophysics Data System (ADS)

    Ejaz, N.; Khan, U. A.; Martínez-del-Amor, M. A.; Sparenberg, H.

    2018-04-01

    Automatic understanding and interpretation of movies can be used in a variety of ways to semantically manage massive volumes of movie data. The "Action Movie Franchises" dataset is a collection of twenty Hollywood action movies from five famous franchises, with ground truth annotations at the shot and beat level of each movie. In this dataset, annotations are provided for eleven semantic beat categories. In this work, we propose a deep-learning-based method to classify shots and beat events on this dataset. A training dataset for each of the eleven beat categories is developed and a Convolutional Neural Network is then trained. After finding the shot boundaries, key frames are extracted for each shot, and three classification labels are assigned to each key frame. The classification labels of the key frames in a particular shot are then used to assign a unique label to the shot. A simple sliding-window-based method is then used to group adjacent shots having the same label in order to find a particular beat event. The results of beat event classification are presented in terms of precision, recall, and F-measure, and comparison with the existing technique records significant improvements.
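
    The grouping of adjacent same-label shots into beat events can be sketched as a single pass over the shot labels; a toy illustration with hypothetical labels, not the authors' implementation:

```python
def group_shots(shot_labels):
    """Merge runs of adjacent shots that carry the same predicted label
    into beat events, returned as (label, first_shot, last_shot) tuples."""
    events, start = [], 0
    for i in range(1, len(shot_labels) + 1):
        # Close the current run at the end of the list or on a label change.
        if i == len(shot_labels) or shot_labels[i] != shot_labels[start]:
            events.append((shot_labels[start], start, i - 1))
            start = i
    return events

# group_shots(["chase", "chase", "fight", "fight", "fight", "chase"])
# -> [("chase", 0, 1), ("fight", 2, 4), ("chase", 5, 5)]
```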

  13. Accurate age estimation in small-scale societies

    PubMed Central

    Smith, Daniel; Gerbault, Pascale; Dyble, Mark; Migliano, Andrea Bamberg; Thomas, Mark G.

    2017-01-01

    Precise estimation of age is essential in evolutionary anthropology, especially to infer population age structures and understand the evolution of human life history diversity. However, in small-scale societies, such as hunter-gatherer populations, time is often not referred to in calendar years, and accurate age estimation remains a challenge. We address this issue by proposing a Bayesian approach that accounts for age uncertainty inherent to fieldwork data. We developed a Gibbs sampling Markov chain Monte Carlo algorithm that produces posterior distributions of ages for each individual, based on a ranking order of individuals from youngest to oldest and age ranges for each individual. We first validate our method on 65 Agta foragers from the Philippines with known ages, and show that our method generates age estimations that are superior to previously published regression-based approaches. We then use data on 587 Agta collected during recent fieldwork to demonstrate how multiple partial age ranks coming from multiple camps of hunter-gatherers can be integrated. Finally, we exemplify how the distributions generated by our method can be used to estimate important demographic parameters in small-scale societies: here, age-specific fertility patterns. Our flexible Bayesian approach will be especially useful to improve cross-cultural life history datasets for small-scale societies for which reliable age records are difficult to acquire. PMID:28696282

  14. Accurate age estimation in small-scale societies.

    PubMed

    Diekmann, Yoan; Smith, Daniel; Gerbault, Pascale; Dyble, Mark; Page, Abigail E; Chaudhary, Nikhil; Migliano, Andrea Bamberg; Thomas, Mark G

    2017-08-01

    Precise estimation of age is essential in evolutionary anthropology, especially to infer population age structures and understand the evolution of human life history diversity. However, in small-scale societies, such as hunter-gatherer populations, time is often not referred to in calendar years, and accurate age estimation remains a challenge. We address this issue by proposing a Bayesian approach that accounts for age uncertainty inherent to fieldwork data. We developed a Gibbs sampling Markov chain Monte Carlo algorithm that produces posterior distributions of ages for each individual, based on a ranking order of individuals from youngest to oldest and age ranges for each individual. We first validate our method on 65 Agta foragers from the Philippines with known ages, and show that our method generates age estimations that are superior to previously published regression-based approaches. We then use data on 587 Agta collected during recent fieldwork to demonstrate how multiple partial age ranks coming from multiple camps of hunter-gatherers can be integrated. Finally, we exemplify how the distributions generated by our method can be used to estimate important demographic parameters in small-scale societies: here, age-specific fertility patterns. Our flexible Bayesian approach will be especially useful to improve cross-cultural life history datasets for small-scale societies for which reliable age records are difficult to acquire.
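
    The core idea of sampling ages consistent with both individual age ranges and a youngest-to-oldest ranking can be illustrated with a toy Gibbs-style sampler; this is a simplified sketch under assumed uniform priors and made-up ranges, not the authors' algorithm:

```python
import random

def gibbs_ages(ranges, sweeps=1000, seed=1):
    """Toy Gibbs-style sampler. `ranges` lists (lo, hi) age bounds for
    individuals already ranked youngest to oldest; each age is repeatedly
    resampled uniformly within its own range, truncated by the current
    ages of its rank neighbours, yielding posterior draws per individual."""
    random.seed(seed)
    ages = [(lo + hi) / 2 for lo, hi in ranges]
    samples = [[] for _ in ranges]
    for _ in range(sweeps):
        for i, (lo, hi) in enumerate(ranges):
            lo_i = max(lo, ages[i - 1]) if i > 0 else lo
            hi_i = min(hi, ages[i + 1]) if i < len(ranges) - 1 else hi
            if lo_i < hi_i:  # feasible interval given the neighbours
                ages[i] = random.uniform(lo_i, hi_i)
            samples[i].append(ages[i])
    return samples
```

    Even when the raw age ranges overlap, the rank constraint forces the posterior means into the correct order.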

  15. Enhanced reporting of deaths among Aboriginal and Torres Strait Islander peoples using linked administrative health datasets.

    PubMed

    Taylor, Lee K; Bentley, Jason; Hunt, Jennifer; Madden, Richard; McKeown, Sybille; Brandt, Peter; Baker, Deborah

    2012-07-02

    Aboriginal and Torres Strait Islander peoples are under-reported in administrative health datasets in NSW, Australia. Correct reporting of Aboriginal and Torres Strait Islander peoples is essential to measure the effectiveness of policies and programmes aimed at reducing the health disadvantage experienced by Aboriginal and Torres Strait Islander peoples. This study investigates the potential of record linkage to enhance reporting of deaths among Aboriginal and Torres Strait Islander peoples in NSW, Australia. Australian Bureau of Statistics death registration data for 2007 were linked with four population health datasets relating to hospitalisations, emergency department attendances and births. Reporting of deaths was enhanced from linked records using two methods, and effects on patterns of demographic characteristics and mortality indicators were examined. Reporting of deaths increased by 34.5% using an algorithm based on a weight of evidence of a person being Aboriginal or Torres Strait Islander, and by 56.6% using an approach based on 'at least one report' of a person being Aboriginal or Torres Strait Islander. The increase was relatively greater in older persons and those living in less geographically remote areas. Enhancement resulted in a reduction in the urban-remote differential in median age at death and increases in standardised mortality ratios particularly for chronic conditions. Record linkage creates a statistical construct that helps to correct under-reporting of deaths and potential bias in mortality statistics for Aboriginal and Torres Strait Islander peoples.
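
    The two enhancement rules compared above can be illustrated schematically. Both functions are hypothetical simplifications: the study's weight-of-evidence algorithm is more involved than the simple majority threshold sketched here:

```python
def at_least_one_report(flags):
    """'At least one report' rule: a person is reported as Aboriginal and/or
    Torres Strait Islander if any linked record carries that status."""
    return any(flags)

def weight_of_evidence(flags, threshold=0.5):
    """Simplified weight-of-evidence rule: status assigned when more than
    `threshold` of the linked records carry it. (The study's algorithm
    weighs records by evidence, not by a bare proportion.)"""
    return sum(1 for f in flags if f) / len(flags) > threshold
```

    The "at least one report" rule is the more inclusive of the two, matching the larger increase in reported deaths (56.6% vs 34.5%).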

  16. Prediction of drug indications based on chemical interactions and chemical similarities.

    PubMed

    Huang, Guohua; Lu, Yin; Lu, Changhong; Zheng, Mingyue; Cai, Yu-Dong

    2015-01-01

    Discovering potential indications of novel or approved drugs is a key step in drug development. Previous computational approaches can be categorized as disease-centric or drug-centric according to the starting point of the problem, or as small-scale or large-scale applications according to the diversity of the datasets. Here, a classifier has been constructed to predict the indications of a drug from a large drug indication dataset, based on the assumption that interactive/associated drugs, or drugs with similar structures, are more likely to target the same diseases. To examine the classifier, it was run five times on a dataset of 1,573 drugs retrieved from the Comprehensive Medicinal Chemistry database and evaluated by 5-fold cross-validation, yielding five first-order prediction accuracies that were all approximately 51.48%. Meanwhile, the model yielded an accuracy of 50.00% for first-order prediction in an independent test on a dataset of 32 other drugs for which drug repositioning has been confirmed. Interestingly, some clinically repurposed drug indications that were not included in the datasets were successfully identified by our method. These results suggest that our method may become a useful tool for associating novel molecules with new indications, or alternative indications with existing drugs.

  17. Prediction of Drug Indications Based on Chemical Interactions and Chemical Similarities

    PubMed Central

    Huang, Guohua; Lu, Yin; Lu, Changhong; Cai, Yu-Dong

    2015-01-01

    Discovering potential indications of novel or approved drugs is a key step in drug development. Previous computational approaches can be categorized as disease-centric or drug-centric according to the starting point of the problem, or as small-scale or large-scale applications according to the diversity of the datasets. Here, a classifier has been constructed to predict the indications of a drug from a large drug indication dataset, based on the assumption that interactive/associated drugs, or drugs with similar structures, are more likely to target the same diseases. To examine the classifier, it was run five times on a dataset of 1,573 drugs retrieved from the Comprehensive Medicinal Chemistry database and evaluated by 5-fold cross-validation, yielding five first-order prediction accuracies that were all approximately 51.48%. Meanwhile, the model yielded an accuracy of 50.00% for first-order prediction in an independent test on a dataset of 32 other drugs for which drug repositioning has been confirmed. Interestingly, some clinically repurposed drug indications that were not included in the datasets were successfully identified by our method. These results suggest that our method may become a useful tool for associating novel molecules with new indications, or alternative indications with existing drugs. PMID:25821813

  18. Particle Swarm Optimization approach to defect detection in armour ceramics.

    PubMed

    Kesharaju, Manasa; Nagarajah, Romesh

    2017-03-01

    In this research, various extracted features were used in the development of an automated ultrasonic-sensor-based inspection system that enables defect classification in each ceramic component prior to despatch to the field. Classification is an important task, and the large number of irrelevant or redundant features commonly introduced into a dataset reduces a classifier's performance. Feature selection aims to reduce the dimensionality of the dataset while improving the performance of a classification system. For a multi-criteria optimization problem such as the one discussed in this research (minimizing the classification error rate while reducing the number of features), the literature suggests that evolutionary algorithms offer good results. Moreover, Particle Swarm Optimization (PSO) has been little explored for the classification of high-frequency ultrasonic signals. Hence, a binary-coded Particle Swarm Optimization (BPSO) technique is investigated for feature subset selection and for optimizing the classification error rate. In the proposed method, the population data are used as input to an Artificial Neural Network (ANN) based classification system to obtain the error rate, with the ANN serving as the evaluator of the PSO fitness function. Copyright © 2016. Published by Elsevier B.V.
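
    A single binary-PSO position update, in which a sigmoid maps each real-valued velocity component to the probability of selecting a feature, might look as follows; a generic sketch of binary PSO with hypothetical names, not the paper's exact implementation:

```python
import math
import random

def bpso_step(position, velocity, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One binary-PSO update for a feature-selection bit mask: velocities
    evolve as in standard PSO, then a sigmoid transfer function turns each
    velocity into the probability that the corresponding feature is kept."""
    new_pos, new_vel = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        v = w * v + c1 * random.random() * (p - x) + c2 * random.random() * (g - x)
        prob_on = 1.0 / (1.0 + math.exp(-v))   # sigmoid transfer function
        new_pos.append(1 if random.random() < prob_on else 0)
        new_vel.append(v)
    return new_pos, new_vel
```

    In the paper's setup, each candidate bit mask would be scored by the ANN classifier trained on the selected features, which serves as the PSO fitness function.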

  19. Transcriptomics hit the target: Monitoring of ligand-activated and stress response pathways for chemical testing.

    PubMed

    Limonciel, Alice; Moenks, Konrad; Stanzel, Sven; Truisi, Germaine L; Parmentier, Céline; Aschauer, Lydia; Wilmes, Anja; Richert, Lysiane; Hewitt, Philip; Mueller, Stefan O; Lukas, Arno; Kopp-Schneider, Annette; Leonard, Martin O; Jennings, Paul

    2015-12-25

    High content omic methods provide a deep insight into cellular events occurring upon chemical exposure of a cell population or tissue. However, this improvement in analytic precision is not yet matched by a thorough understanding of molecular mechanisms that would allow an optimal interpretation of these biological changes. For transcriptomics (TCX), one type of molecular effects that can be assessed already is the modulation of the transcriptional activity of a transcription factor (TF). As more ChIP-seq datasets reporting genes specifically bound by a TF become publicly available for mining, the generation of target gene lists of TFs of toxicological relevance becomes possible, based on actual protein-DNA interaction and modulation of gene expression. In this study, we generated target gene signatures for Nrf2, ATF4, XBP1, p53, HIF1a, AhR and PPAR gamma and tracked TF modulation in a large collection of in vitro TCX datasets from renal and hepatic cell models exposed to clinical nephro- and hepato-toxins. The result is a global monitoring of TF modulation with great promise as a mechanistically based tool for chemical hazard identification. Copyright © 2014 Elsevier Ltd. All rights reserved.

  20. Integrated omics analysis of specialized metabolism in medicinal plants.

    PubMed

    Rai, Amit; Saito, Kazuki; Yamazaki, Mami

    2017-05-01

    Medicinal plants are a rich source of highly diverse specialized metabolites with important pharmacological properties. Until recently, plant biologists were limited in their ability to explore the biosynthetic pathways of these metabolites, mainly due to the scarcity of plant genomics resources. However, recent advances in high-throughput large-scale analytical methods have enabled plant biologists to discover biosynthetic pathways for important plant-based medicinal metabolites. The reduced cost of generating omics datasets and the development of computational tools for their analysis and integration have led to the elucidation of biosynthetic pathways of several bioactive metabolites of plant origin. These discoveries have inspired synthetic biology approaches to develop microbial systems to produce bioactive metabolites originating from plants, an alternative sustainable source of medicinally important chemicals. Since the demand for medicinal compounds is increasing with the world's population, understanding the complete biosynthesis of specialized metabolites is important for identifying or developing reliable sources in the future. Here, we review the contributions of major omics approaches and their integration to our understanding of the biosynthetic pathways of bioactive metabolites. We briefly discuss different approaches for integrating omics datasets to extract biologically relevant knowledge and the application of omics datasets in the construction and reconstruction of metabolic models. © 2017 The Authors The Plant Journal © 2017 John Wiley & Sons Ltd.

  1. Otolith reading and multi-model inference for improved estimation of age and growth in the gilthead seabream Sparus aurata (L.)

    NASA Astrophysics Data System (ADS)

    Mercier, Lény; Panfili, Jacques; Paillon, Christelle; N'diaye, Awa; Mouillot, David; Darnaude, Audrey M.

    2011-05-01

    Accurate knowledge of fish age and growth is crucial for species conservation and for the management of exploited marine stocks. In exploited species, age estimation based on otolith reading is routinely used to build the growth curves that feed fishery management models. However, universally fitting the von Bertalanffy growth function (VBGF) to data from commercial landings can lead to uncertainty in growth parameter inference, preventing accurate comparison of growth-based life-history traits between fish populations. In the present paper, we used a comprehensive annual sample of wild gilthead seabream (Sparus aurata L.) in the Gulf of Lions (France, NW Mediterranean) to test a methodology improving growth modelling for exploited fish populations. After validating the timing of otolith annual increment formation for all life stages, a comprehensive set of growth models (including the VBGF) was fitted to the resulting age-length data, used as a whole or subdivided between group 0 individuals and those coming from commercial landings (ages 1-6). Comparisons of growth model accuracy based on the Akaike Information Criterion (AIC) allowed assessment of the best model for each dataset and, when no model correctly fitted the data, multi-model inference (MMI) based on model averaging was carried out. The results provided evidence that growth parameters inferred with the VBGF must be used with caution: the VBGF turned out to be among the least accurate models for growth prediction irrespective of the dataset, and its fits to the whole population, the juveniles, and the adults yielded different growth parameters. The best models for growth prediction were the Tanaka model for group 0 juveniles and the MMI for older fish, confirming that growth differs substantially between juveniles and adults. All asymptotic models failed to correctly describe the growth of adult S. aurata, probably because of the poor representation of old individuals in the dataset. Multi-model inference combined with separate analysis of juvenile and adult fish is therefore advised to obtain objective estimates of growth parameters when sampling cannot be corrected towards older fish.
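
    The AIC-based model comparison and model averaging described above follow a standard recipe: convert AIC differences into Akaike weights, then weight each model's parameter estimate. A minimal sketch with hypothetical AIC values:

```python
import math

def akaike_weights(aics):
    """Convert AIC scores into Akaike weights:
    w_i proportional to exp(-0.5 * delta_i), delta_i = AIC_i - min(AIC)."""
    best = min(aics)
    rel = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]

def model_averaged(estimates, weights):
    """Multi-model inference: average a parameter across models by weight."""
    return sum(e * w for e, w in zip(estimates, weights))
```

    Lower-AIC models receive exponentially larger weights, so the averaged estimate is dominated by the best-supported models without discarding the rest.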

  2. A genome scan for selection signatures comparing farmed Atlantic salmon with two wild populations: Testing colocalization among outlier markers, candidate genes, and quantitative trait loci for production traits.

    PubMed

    Liu, Lei; Ang, Keng Pee; Elliott, J A K; Kent, Matthew Peter; Lien, Sigbjørn; MacDonald, Danielle; Boulding, Elizabeth Grace

    2017-03-01

    Comparative genome scans can be used to identify chromosome regions, but not traits, that are putatively under selection. Identification of targeted traits may be more likely in recently domesticated populations under strong artificial selection for increased production. We used a North American Atlantic salmon 6K SNP dataset to locate genome regions of an aquaculture strain (Saint John River) that were highly diverged from that of its putative wild founder population (Tobique River). First, admixed individuals with partial European ancestry were detected using STRUCTURE and removed from the dataset. Outlier loci were then identified as those showing extreme differentiation between the aquaculture population and the founder population. All Arlequin methods identified an overlapping subset of 17 outlier loci, three of which were also identified by BayeScan. Many outlier loci were near candidate genes and some were near published quantitative trait loci (QTLs) for growth, appetite, maturity, or disease resistance. Parallel comparisons using a wild, nonfounder population (Stewiacke River) yielded only one overlapping outlier locus as well as a known maturity QTL. We conclude that genome scans comparing a recently domesticated strain with its wild founder population can facilitate identification of candidate genes for traits known to have been under strong artificial selection.

  3. Compression-based distance (CBD): a simple, rapid, and accurate method for microbiota composition comparison

    PubMed Central

    2013-01-01

    Background Perturbations in intestinal microbiota composition have been associated with a variety of gastrointestinal tract-related diseases. The alleviation of symptoms has been achieved using treatments that alter the gastrointestinal tract microbiota toward that of healthy individuals. Identifying differences in microbiota composition through the use of 16S rRNA gene hypervariable tag sequencing has profound health implications. Current computational methods for comparing microbial communities are usually based on multiple alignments and phylogenetic inference, making them time-consuming and demanding of exceptional expertise and computational resources. As sequencing data rapidly grow in size, simpler analysis methods are needed to meet the growing computational burden of microbiota comparisons. Thus, we have developed a simple, rapid, and accurate method, independent of multiple alignments and phylogenetic inference, to support microbiota comparisons. Results We created a metric, called compression-based distance (CBD), for quantifying the degree of similarity between microbial communities. CBD uses the repetitive nature of hypervariable tag datasets and well-established compression algorithms to approximate the total information shared between two datasets. Three published microbiota datasets were used as test cases for CBD as an applicable tool. Our study revealed that CBD recaptured 100% of the statistically significant conclusions reported in the previous studies, while requiring less computational time than comparable tools and no expert user intervention. Conclusion CBD provides a simple, rapid, and accurate method for assessing distances between gastrointestinal tract microbiota 16S hypervariable tag datasets. PMID:23617892
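
    The flavour of CBD can be conveyed with the closely related normalized compression distance, using a general-purpose compressor. This is an illustrative analogue with made-up sequences, not the paper's exact formula or compressor choice:

```python
import zlib

def cbd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: if x and y share structure, their
    concatenation compresses almost as well as the larger of the two, so
    the distance approaches 0; unrelated data pushes it toward 1."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

    Comparing a repetitive tag-like sequence against itself yields a much smaller distance than comparing it against an unrelated sequence.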

  4. Automated brain tumour detection and segmentation using superpixel-based extremely randomized trees in FLAIR MRI.

    PubMed

    Soltaninejad, Mohammadreza; Yang, Guang; Lambrou, Tryphon; Allinson, Nigel; Jones, Timothy L; Barrick, Thomas R; Howe, Franklyn A; Ye, Xujiong

    2017-02-01

    We propose a fully automated method for detection and segmentation of the abnormal tissue associated with brain tumour (tumour core and oedema) from Fluid- Attenuated Inversion Recovery (FLAIR) Magnetic Resonance Imaging (MRI). The method is based on superpixel technique and classification of each superpixel. A number of novel image features including intensity-based, Gabor textons, fractal analysis and curvatures are calculated from each superpixel within the entire brain area in FLAIR MRI to ensure a robust classification. Extremely randomized trees (ERT) classifier is compared with support vector machine (SVM) to classify each superpixel into tumour and non-tumour. The proposed method is evaluated on two datasets: (1) Our own clinical dataset: 19 MRI FLAIR images of patients with gliomas of grade II to IV, and (2) BRATS 2012 dataset: 30 FLAIR images with 10 low-grade and 20 high-grade gliomas. The experimental results demonstrate the high detection and segmentation performance of the proposed method using ERT classifier. For our own cohort, the average detection sensitivity, balanced error rate and the Dice overlap measure for the segmented tumour against the ground truth are 89.48 %, 6 % and 0.91, respectively, while, for the BRATS dataset, the corresponding evaluation results are 88.09 %, 6 % and 0.88, respectively. This provides a close match to expert delineation across all grades of glioma, leading to a faster and more reproducible method of brain tumour detection and delineation to aid patient management.

  5. Compression-based distance (CBD): a simple, rapid, and accurate method for microbiota composition comparison.

    PubMed

    Yang, Fang; Chia, Nicholas; White, Bryan A; Schook, Lawrence B

    2013-04-23

    Perturbations in intestinal microbiota composition have been associated with a variety of gastrointestinal tract-related diseases. The alleviation of symptoms has been achieved using treatments that alter the gastrointestinal tract microbiota toward that of healthy individuals. Identifying differences in microbiota composition through the use of 16S rRNA gene hypervariable tag sequencing has profound health implications. Current computational methods for comparing microbial communities are usually based on multiple alignments and phylogenetic inference, making them time consuming and requiring exceptional expertise and computational resources. As sequencing data rapidly grows in size, simpler analysis methods are needed to meet the growing computational burdens of microbiota comparisons. Thus, we have developed a simple, rapid, and accurate method, independent of multiple alignments and phylogenetic inference, to support microbiota comparisons. We create a metric, called compression-based distance (CBD) for quantifying the degree of similarity between microbial communities. CBD uses the repetitive nature of hypervariable tag datasets and well-established compression algorithms to approximate the total information shared between two datasets. Three published microbiota datasets were used as test cases for CBD as an applicable tool. Our study revealed that CBD recaptured 100% of the statistically significant conclusions reported in the previous studies, while achieving a decrease in computational time required when compared to similar tools without expert user intervention. CBD provides a simple, rapid, and accurate method for assessing distances between gastrointestinal tract microbiota 16S hypervariable tag datasets.
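
The compression principle behind CBD can be sketched with the closely related normalized compression distance (NCD): a standard compressor approximates the information shared between two datasets. The published CBD formula may differ in detail, so treat this as an illustration of the idea rather than the paper's exact metric.

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes (zlib at maximum compression level)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: if x and y share repetitive content
    (e.g. 16S hypervariable tags), compressing their concatenation costs
    little more than compressing one of them, so the distance is small."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Two identical repetitive "datasets" yield a much smaller distance than two unrelated ones, with no alignment or phylogenetic inference required.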

  6. Pollution Sources and Mortality Rates across Rural-Urban Areas in the United States

    ERIC Educational Resources Information Center

    Hendryx, Michael; Fedorko, Evan; Halverson, Joel

    2010-01-01

    Purpose: To conduct an assessment of rural environmental pollution sources and associated population mortality rates. Methods: The design is a secondary analysis of county-level data from the Environmental Protection Agency (EPA), Department of Agriculture, National Land Cover Dataset, Energy Information Administration, Centers for Disease Control…

  7. A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method.

    PubMed

    Yang, Jun-He; Cheng, Ching-Hsue; Chan, Chia-Pan

    2017-01-01

    Reservoirs are important for households and impact the national economy. This paper proposes a time-series forecasting model, based on estimating missing values followed by variable selection, to forecast a reservoir's water level. The study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015; the two datasets were concatenated by date into an integrated research dataset. The proposed forecasting model has three main steps. First, five imputation methods are used to handle missing values. Second, the key variables are identified via factor analysis, and unimportant variables are then deleted sequentially via the variable selection method. Finally, a random forest is used to build the forecasting model of the reservoir's water level, which is compared with the listed methods in terms of forecasting error. The experimental results indicate that the random forest forecasting model combined with variable selection has better forecasting performance than the listed models, and that the proposed variable selection improves the forecasting capability of the forecasting methods used here.
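
The first two steps of such a pipeline can be sketched in a few lines. Mean imputation and correlation ranking below stand in for the paper's five imputation methods and its factor-analysis step; the function names and the selection rule are illustrative assumptions, not the authors' implementation.

```python
import math

def impute_mean(column):
    """Fill missing entries (None) with the column mean -- one simple
    imputation strategy among the several such a study would compare."""
    known = [v for v in column if v is not None]
    m = sum(known) / len(known)
    return [m if v is None else v for v in column]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def select_variables(features, target, k=2):
    """Rank candidate predictors by |correlation| with the target water
    level and keep the top k (a crude stand-in for factor analysis)."""
    ranked = sorted(features, key=lambda name: -abs(pearson(features[name], target)))
    return ranked[:k]
```

The selected columns would then feed a regression model (a random forest in the paper) to produce the water-level forecast.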

  8. Controlling feeding behavior by chemical or gene-directed targeting in the brain: what's so spatial about our methods?

    PubMed Central

    Khan, Arshad M.

    2013-01-01

    Intracranial chemical injection (ICI) methods have been used to identify the locations in the brain where feeding behavior can be controlled acutely. Scientists conducting ICI studies often document their injection site locations, thereby leaving kernels of valuable location data for others to use to further characterize feeding control circuits. Unfortunately, this rich dataset has not yet been formally contextualized with other published neuroanatomical data. In particular, axonal tracing studies have delineated several neural circuits originating in the same areas where ICI injection feeding-control sites have been documented, but it remains unclear whether these circuits participate in feeding control. Comparing injection sites with other types of location data would require careful anatomical registration between the datasets. Here, a conceptual framework is presented for how such anatomical registration efforts can be performed. For example, by using a simple atlas alignment tool, a hypothalamic locus sensitive to the orexigenic effects of neuropeptide Y (NPY) can be aligned accurately with the locations of neurons labeled by anterograde tracers or those known to express NPY receptors or feeding-related peptides. This approach can also be applied to those intracranial “gene-directed” injection (IGI) methods (e.g., site-specific recombinase methods, RNA expression or interference, optogenetics, and pharmacosynthetics) that involve viral injections to targeted neuronal populations. Spatial alignment efforts can be accelerated if location data from ICI/IGI methods are mapped to stereotaxic brain atlases to allow powerful neuroinformatics tools to overlay different types of data in the same reference space. Atlas-based mapping will be critical for community-based sharing of location data for feeding control circuits, and will accelerate our understanding of structure-function relationships in the brain for mammalian models of obesity and metabolic disorders. 
PMID:24385950

  9. A Monocular Vision Sensor-Based Obstacle Detection Algorithm for Autonomous Robots

    PubMed Central

    Lee, Tae-Jae; Yi, Dong-Hoon; Cho, Dong-Il “Dan”

    2016-01-01

    This paper presents a monocular vision sensor-based obstacle detection algorithm for autonomous robots. Each individual image pixel at the bottom region of interest is labeled as belonging either to an obstacle or the floor. While conventional methods depend on point tracking for geometric cues for obstacle detection, the proposed algorithm uses the inverse perspective mapping (IPM) method. This method is much more advantageous when the camera is not high off the floor, which makes point tracking near the floor difficult. Markov random field-based obstacle segmentation is then performed using the IPM results and a floor appearance model. Next, the shortest distance between the robot and the obstacle is calculated. The algorithm is tested by applying it to 70 datasets, 20 of which include nonobstacle images where considerable changes in floor appearance occur. The obstacle segmentation accuracies and the distance estimation error are quantitatively analyzed. For obstacle datasets, the segmentation precision and the average distance estimation error of the proposed method are 81.4% and 1.6 cm, respectively, whereas those for a conventional method are 57.5% and 9.9 cm, respectively. For nonobstacle datasets, the proposed method gives 0.0% false positive rates, while the conventional method gives 17.6%. PMID:26938540
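
The geometric heart of IPM is projecting image pixels onto the floor plane. A hedged sketch under a pinhole-camera assumption follows; every parameter value (camera height, tilt, focal length, principal point) is illustrative, not taken from the paper.

```python
import math

def ground_distance(v, cam_height=0.3, tilt_deg=15.0, f=500.0, cy=240.0):
    """Inverse perspective mapping for a single image row: project pixel
    row v onto the floor plane and return the forward distance (metres)
    to that floor point.

    Assumes a pinhole camera mounted `cam_height` metres above the floor,
    pitched `tilt_deg` degrees below the horizon, with focal length `f`
    and principal-point row `cy` in pixels (all values illustrative).
    """
    # angle of the viewing ray below the horizon for this pixel row
    ray_angle = math.radians(tilt_deg) + math.atan2(v - cy, f)
    if ray_angle <= 0:
        return float("inf")  # ray at or above the horizon never meets the floor
    return cam_height / math.tan(ray_angle)
```

Under this floor model, rows lower in the image map to nearer floor points; pixels whose appearance or IPM-warped geometry violates the model are candidate obstacle pixels, which is the cue the segmentation stage exploits.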

  10. A normalization method for combination of laboratory test results from different electronic healthcare databases in a distributed research network.

    PubMed

    Yoon, Dukyong; Schuemie, Martijn J; Kim, Ju Han; Kim, Dong Ki; Park, Man Young; Ahn, Eun Kyoung; Jung, Eun-Young; Park, Dong Kyun; Cho, Soo Yeon; Shin, Dahye; Hwang, Yeonsoo; Park, Rae Woong

    2016-03-01

    Distributed research networks (DRNs) afford statistical power by integrating observational data from multiple partners for retrospective studies. However, laboratory test results across care sites are derived using different assays from varying patient populations, making it difficult to simply combine data for analysis. Additionally, existing normalization methods are not suitable for retrospective studies. We normalized laboratory results from different data sources by adjusting for heterogeneous clinico-epidemiologic characteristics of the data and called this the subgroup-adjusted normalization (SAN) method. Subgroup-adjusted normalization renders the means and standard deviations of distributions identical under population structure-adjusted conditions. To evaluate its performance, we compared SAN with existing methods for simulated and real datasets consisting of blood urea nitrogen, serum creatinine, hematocrit, hemoglobin, serum potassium, and total bilirubin. Various clinico-epidemiologic characteristics can be applied together in SAN. For simplicity of comparison, age and gender were used to adjust population heterogeneity in this study. In simulations, SAN had the lowest standardized difference in means (SDM) and Kolmogorov-Smirnov values for all tests (p < 0.05). In a real dataset, SAN had the lowest SDM and Kolmogorov-Smirnov values for blood urea nitrogen, hematocrit, hemoglobin, and serum potassium, and the lowest SDM for serum creatinine (p < 0.05). Subgroup-adjusted normalization performed better than normalization using other methods. The SAN method is applicable in a DRN environment and should facilitate analysis of data integrated across DRN partners for retrospective observational studies. Copyright © 2015 John Wiley & Sons, Ltd.
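
The core of subgroup-adjusted normalization, matching means and standard deviations within population subgroups, can be sketched as follows. The real SAN method handles richer clinico-epidemiologic strata; the subgroup labels and target-distribution interface here are simplifying assumptions.

```python
from statistics import mean, pstdev

def subgroup_normalize(values, subgroups, targets):
    """Subgroup-wise z-score rescaling: within each subgroup (e.g. an
    age/gender stratum), shift and scale laboratory values so their mean
    and standard deviation match a target distribution for that stratum.

    `targets` maps each subgroup label to a (mean, sd) pair, e.g. the
    pooled distribution across all DRN partners.
    """
    stats = {}
    for g in set(subgroups):
        vals = [v for v, s in zip(values, subgroups) if s == g]
        stats[g] = (mean(vals), pstdev(vals))
    out = []
    for v, g in zip(values, subgroups):
        m, sd = stats[g]
        tm, tsd = targets[g]
        z = 0.0 if sd == 0 else (v - m) / sd  # standardize within subgroup
        out.append(tm + z * tsd)              # rescale to the target stratum
    return out
```

After this transformation the per-subgroup means and standard deviations are identical across sites, which is the property the paper evaluates with SDM and Kolmogorov-Smirnov statistics.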

  11. Spatiotemporal dataset on Chinese population distribution and its driving factors from 1949 to 2013.

    PubMed

    Wang, Lizhe; Chen, Lajiao

    2016-07-05

    Spatio-temporal data on human population and its driving factors is critical to understanding and responding to population problems. Unfortunately, such spatio-temporal data on a large scale and over the long term are often difficult to obtain. Here, we present a dataset on Chinese population distribution and its driving factors over a remarkably long period, from 1949 to 2013. Driving factors of population distribution were selected according to the push-pull migration laws, which were summarized into four categories: natural environment, natural resources, economic factors and social factors. Natural environment and natural resources indicators were calculated using Geographic Information System (GIS) and Remote Sensing (RS) techniques, whereas economic and social factors from 1949 to 2013 were collected from the China Statistical Yearbook and China Compendium of Statistics from 1949 to 2008. All of the data were quality controlled and unified into an identical dataset with the same spatial scope and time period. The dataset is expected to be useful for understanding how population responds to and impacts environmental change.

  12. Spatiotemporal dataset on Chinese population distribution and its driving factors from 1949 to 2013

    NASA Astrophysics Data System (ADS)

    Wang, Lizhe; Chen, Lajiao

    2016-07-01

    Spatio-temporal data on human population and its driving factors is critical to understanding and responding to population problems. Unfortunately, such spatio-temporal data on a large scale and over the long term are often difficult to obtain. Here, we present a dataset on Chinese population distribution and its driving factors over a remarkably long period, from 1949 to 2013. Driving factors of population distribution were selected according to the push-pull migration laws, which were summarized into four categories: natural environment, natural resources, economic factors and social factors. Natural environment and natural resources indicators were calculated using Geographic Information System (GIS) and Remote Sensing (RS) techniques, whereas economic and social factors from 1949 to 2013 were collected from the China Statistical Yearbook and China Compendium of Statistics from 1949 to 2008. All of the data were quality controlled and unified into an identical dataset with the same spatial scope and time period. The dataset is expected to be useful for understanding how population responds to and impacts environmental change.

  13. A Review on Human Activity Recognition Using Vision-Based Method.

    PubMed

    Zhang, Shugang; Wei, Zhiqiang; Nie, Jie; Huang, Lei; Wang, Shuang; Li, Zhen

    2017-01-01

    Human activity recognition (HAR) aims to recognize activities from a series of observations on the actions of subjects and the environmental conditions. The vision-based HAR research is the basis of many applications including video surveillance, health care, and human-computer interaction (HCI). This review highlights the advances in state-of-the-art activity recognition approaches, especially for the activity representation and classification methods. For the representation methods, we sort out a chronological research trajectory from global representations to local representations, and recent depth-based representations. For the classification methods, we conform to the categorization of template-based methods, discriminative models, and generative models and review several prevalent methods. Next, representative and available datasets are introduced. Aiming to provide an overview of those methods and a convenient way of comparing them, we classify the existing literature with a detailed taxonomy including representation and classification methods, as well as the datasets they used. Finally, we investigate the directions for future research.

  14. A Review on Human Activity Recognition Using Vision-Based Method

    PubMed Central

    Nie, Jie

    2017-01-01

    Human activity recognition (HAR) aims to recognize activities from a series of observations on the actions of subjects and the environmental conditions. The vision-based HAR research is the basis of many applications including video surveillance, health care, and human-computer interaction (HCI). This review highlights the advances in state-of-the-art activity recognition approaches, especially for the activity representation and classification methods. For the representation methods, we sort out a chronological research trajectory from global representations to local representations, and recent depth-based representations. For the classification methods, we conform to the categorization of template-based methods, discriminative models, and generative models and review several prevalent methods. Next, representative and available datasets are introduced. Aiming to provide an overview of those methods and a convenient way of comparing them, we classify the existing literature with a detailed taxonomy including representation and classification methods, as well as the datasets they used. Finally, we investigate the directions for future research. PMID:29065585

  15. Population specific biomarkers of human aging: a big data study using South Korean, Canadian and Eastern European patient populations.

    PubMed

    Mamoshina, Polina; Kochetov, Kirill; Putin, Evgeny; Cortese, Franco; Aliper, Alexander; Lee, Won-Suk; Ahn, Sung-Min; Uhn, Lee; Skjodt, Neil; Kovalchuk, Olga; Scheibye-Knudsen, Morten; Zhavoronkov, Alex

    2018-01-11

    Accurate and physiologically meaningful biomarkers for human aging are key to assessing anti-aging therapies. Given ethnic differences in health, diet, lifestyle, behaviour, environmental exposures and even average rate of biological aging, it stands to reason that aging clocks trained on datasets obtained from specific ethnic populations are more likely to account for these potential confounding factors, resulting in an enhanced capacity to predict chronological age and quantify biological age. Here we present a deep learning-based hematological aging clock modeled using a large combined dataset of Canadian, South Korean and Eastern European population blood samples, which shows increased predictive accuracy in individual populations compared to population-specific hematologic aging clocks. The performance of the models was also evaluated on publicly available samples of the American population from the National Health and Nutrition Examination Survey (NHANES). In addition, we explored the association between age predicted by both population-specific and combined hematological clocks and all-cause mortality. Overall, this study suggests (a) the population specificity of aging patterns and (b) that hematologic clocks predict all-cause mortality. The proposed models were added to the freely available Aging.AI system, improving the ability to assess human aging. © The Author(s) 2018. Published by Oxford University Press on behalf of The Gerontological Society of America.

  16. Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models.

    PubMed

    Sul, Jae Hoon; Bilow, Michael; Yang, Wen-Yun; Kostem, Emrah; Furlotte, Nick; He, Dan; Eskin, Eleazar

    2016-03-01

    Although genome-wide association studies (GWASs) have discovered numerous novel genetic variants associated with many complex traits and diseases, those genetic variants typically explain only a small fraction of phenotypic variance. Factors that account for phenotypic variance include environmental factors and gene-by-environment interactions (GEIs). Recently, several studies have conducted genome-wide gene-by-environment association analyses and demonstrated important roles of GEIs in complex traits. One of the main challenges in these association studies is to control effects of population structure that may cause spurious associations. Many studies have analyzed how population structure influences statistics of genetic variants and developed several statistical approaches to correct for population structure. However, the impact of population structure on GEI statistics in GWASs has not been extensively studied, nor have there been methods designed to correct for population structure on GEI statistics. In this paper, we show both analytically and empirically that population structure may cause spurious GEIs, and use both simulations and two GWAS datasets to support our finding. We propose a statistical approach based on mixed models to account for population structure on GEI statistics. We find that our approach effectively controls the effect of population structure on statistics for GEIs as well as for genetic variants.
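
The GEI statistic in question comes from an interaction term in a per-variant regression. The sketch below fits y = b0 + b1·g + b2·e + b3·(g·e) by ordinary least squares, where b3 is the interaction effect; it deliberately omits the mixed-model random effect that is the paper's actual contribution for correcting population structure, so it illustrates only what is being tested, not how the correction works.

```python
def solve(a, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c2 in range(col, n + 1):
                m[r][c2] -= f * m[col][c2]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def gei_fit(genotype, environment, phenotype):
    """Least-squares fit of y = b0 + b1*g + b2*e + b3*(g*e); the returned
    b3 is the gene-by-environment interaction effect.  (No mixed-model
    population-structure correction is applied in this sketch.)"""
    rows = [[1.0, g, e, g * e] for g, e in zip(genotype, environment)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, phenotype)) for i in range(k)]
    return solve(xtx, xty)  # normal equations
```

On noiseless data generated from known coefficients, the fit recovers them exactly; in real GWAS data, confounding by population structure can inflate b3, which motivates the mixed-model approach.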

  17. Haplowebs as a graphical tool for delimiting species: a revival of Doyle's "field for recombination" approach and its application to the coral genus Pocillopora in Clipperton

    PubMed Central

    2010-01-01

    Background Usual methods for inferring species boundaries from molecular sequence data rely either on gene trees or on population genetic analyses. Another way of delimiting species, based on a view of species as "fields for recombination" (FFRs) characterized by mutual allelic exclusivity, was suggested in 1995 by Doyle. Here we propose to use haplowebs (haplotype networks with additional connections between haplotypes found co-occurring in heterozygous individuals) to visualize and delineate single-locus FFRs (sl-FFRs). Furthermore, we introduce a method to quantify the reliability of putative species boundaries according to the number of independent markers that support them, and illustrate this approach with a case study of taxonomically difficult corals of the genus Pocillopora collected around Clipperton Island (far eastern Pacific). Results One haploweb built from intron sequences of the ATP synthase β subunit gene revealed the presence of two sl-FFRs among our 74 coral samples, whereas a second one built from ITS sequences turned out to be composed of four sl-FFRs. As a third independent marker, we performed a combined analysis of two regions of the mitochondrial genome: since haplowebs are not suited to analyze non-recombining markers, individuals were sorted into four haplogroups according to their mitochondrial sequences. Among all possible bipartitions of our set of samples, thirteen were supported by at least one molecular dataset, none by two and only one by all three datasets: this congruent pattern obtained from independent nuclear and mitochondrial markers indicates that two species of Pocillopora are present in Clipperton. Conclusions Our approach builds on Doyle's method and extends it by introducing an intuitive, user-friendly graphical representation and by proposing a conceptual framework to analyze and quantify the congruence between sl-FFRs obtained from several independent markers. 
Like delineation methods based on population-level statistical approaches, our method can distinguish closely-related species that have not yet reached reciprocal monophyly at most or all of their loci; like tree-based approaches, it can yield meaningful conclusions using a number of independent markers as low as three. Future efforts will aim to develop programs that speed up the construction of haplowebs from FASTA sequence alignments and help perform the congruence analysis outlined in this article. PMID:21118572
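
The haploweb idea lends itself to a simple connected-components sketch: haplotypes co-occurring in a heterozygous individual are connected, and each connected component is a single-locus field for recombination (sl-FFR). This is a hedged simplification of the published approach (which also visualizes the network and compares sl-FFRs across markers); the union-find structure is an implementation choice, not the authors'.

```python
def single_locus_ffrs(individuals):
    """Group haplotypes into single-locus fields for recombination.

    `individuals` is a list of (hap_a, hap_b) pairs, one per individual at
    one locus.  Haplotypes found together in a heterozygote are connected;
    the connected components of the resulting haploweb are the sl-FFRs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for hap_a, hap_b in individuals:
        union(hap_a, hap_b)  # a homozygote just registers its single haplotype

    groups = {}
    for hap in parent:
        groups.setdefault(find(hap), set()).add(hap)
    return sorted(groups.values(), key=lambda s: sorted(s))
```

Running this independently per marker and intersecting the resulting bipartitions is the congruence analysis the paper uses to decide how many species the samples contain.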

  18. Haplowebs as a graphical tool for delimiting species: a revival of Doyle's "field for recombination" approach and its application to the coral genus Pocillopora in Clipperton.

    PubMed

    Flot, Jean-François; Couloux, Arnaud; Tillier, Simon

    2010-11-30

    Usual methods for inferring species boundaries from molecular sequence data rely either on gene trees or on population genetic analyses. Another way of delimiting species, based on a view of species as "fields for recombination" (FFRs) characterized by mutual allelic exclusivity, was suggested in 1995 by Doyle. Here we propose to use haplowebs (haplotype networks with additional connections between haplotypes found co-occurring in heterozygous individuals) to visualize and delineate single-locus FFRs (sl-FFRs). Furthermore, we introduce a method to quantify the reliability of putative species boundaries according to the number of independent markers that support them, and illustrate this approach with a case study of taxonomically difficult corals of the genus Pocillopora collected around Clipperton Island (far eastern Pacific). One haploweb built from intron sequences of the ATP synthase β subunit gene revealed the presence of two sl-FFRs among our 74 coral samples, whereas a second one built from ITS sequences turned out to be composed of four sl-FFRs. As a third independent marker, we performed a combined analysis of two regions of the mitochondrial genome: since haplowebs are not suited to analyze non-recombining markers, individuals were sorted into four haplogroups according to their mitochondrial sequences. Among all possible bipartitions of our set of samples, thirteen were supported by at least one molecular dataset, none by two and only one by all three datasets: this congruent pattern obtained from independent nuclear and mitochondrial markers indicates that two species of Pocillopora are present in Clipperton. Our approach builds on Doyle's method and extends it by introducing an intuitive, user-friendly graphical representation and by proposing a conceptual framework to analyze and quantify the congruence between sl-FFRs obtained from several independent markers. 
Like delineation methods based on population-level statistical approaches, our method can distinguish closely-related species that have not yet reached reciprocal monophyly at most or all of their loci; like tree-based approaches, it can yield meaningful conclusions using a number of independent markers as low as three. Future efforts will aim to develop programs that speed up the construction of haplowebs from FASTA sequence alignments and help perform the congruence analysis outlined in this article.

  19. Estimating parameters for probabilistic linkage of privacy-preserved datasets.

    PubMed

    Brown, Adrian P; Randall, Sean M; Ferrante, Anna M; Semmens, James B; Boyd, James H

    2017-07-10

    Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters. Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. Linkage quality using the F-measure at the estimated threshold values was also compared to the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data. Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. 
Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded results for F-measure that were only slightly below the highest possible for those probabilities. The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets.
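
Field-level Bloom filters compared with a similarity coefficient are the standard building blocks of privacy-preserved linkage. The sketch below encodes a name field into a Bloom filter and scores pairs with the Dice coefficient; the parameters (64 bits, 3 hash functions, character bigrams) are illustrative, and the EM-based probability and threshold estimation that is the paper's contribution is not shown.

```python
import hashlib

def bloom(value: str, m: int = 64, k: int = 3) -> int:
    """Encode a field into an m-bit Bloom filter: split into character
    bigrams (with padding) and set k hashed bit positions per bigram."""
    bits = 0
    padded = f"_{value}_"
    for i in range(len(padded) - 1):
        bigram = padded[i:i + 2]
        for seed in range(k):
            h = hashlib.sha1(f"{seed}:{bigram}".encode()).digest()
            bits |= 1 << (int.from_bytes(h[:4], "big") % m)
    return bits

def dice(a: int, b: int) -> float:
    """Dice coefficient of two bit vectors: 2|A & B| / (|A| + |B|).
    Similar field values share bigrams, hence bit positions, hence score."""
    inter = bin(a & b).count("1")
    total = bin(a).count("1") + bin(b).count("1")
    return 0.0 if total == 0 else 2 * inter / total
```

Identical values score 1.0 and typographic variants ("smith" vs "smyth") score well above unrelated values, which is what lets match probabilities be estimated on the encrypted encodings alone.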

  20. Plant species classification using flower images—A comparative study of local feature representations

    PubMed Central

    Seeland, Marco; Rzanny, Michael; Alaqraa, Nedal; Wäldchen, Jana; Mäder, Patrick

    2017-01-01

    Steady improvements of image description methods induced a growing interest in image-based plant species classification, a task vital to the study of biodiversity and ecological sensitivity. Various techniques have been proposed for general object classification over the past years and several of them have already been studied for plant species classification. However, results of these studies are selective in the evaluated steps of a classification pipeline, in the utilized datasets for evaluation, and in the compared baseline methods. No study is available that evaluates the main competing methods for building an image representation on the same datasets allowing for generalized findings regarding flower-based plant species classification. The aim of this paper is to comparatively evaluate methods, method combinations, and their parameters towards classification accuracy. The investigated methods span from detection, extraction, fusion, pooling, to encoding of local features for quantifying shape and color information of flower images. We selected the flower image datasets Oxford Flower 17 and Oxford Flower 102 as well as our own Jena Flower 30 dataset for our experiments. Findings show large differences among the various studied techniques, and that a wisely chosen orchestration of them allows for high accuracy in species classification. We further found that true local feature detectors in combination with advanced encoding methods yield higher classification results at lower computational costs compared to commonly used dense sampling and spatial pooling methods. Color was found to be an indispensable feature for high classification results, especially while preserving spatial correspondence to gray-level features. As a result, our study provides a comprehensive overview of competing techniques and the implications of their main parameters for flower-based plant species classification. PMID:28234999

  1. Prediction of rat protein subcellular localization with pseudo amino acid composition based on multiple sequential features.

    PubMed

    Shi, Ruijia; Xu, Cunshuan

    2011-06-01

    The study of rat proteins is an indispensable task in experimental medicine and drug development. The function of a rat protein is closely related to its subcellular location. Based on this concept, we construct a benchmark rat protein dataset and develop a combined approach for predicting the subcellular localization of rat proteins. From the protein primary sequence, multiple sequential features are obtained using discrete Fourier analysis, a position conservation scoring function and increment of diversity, and these sequential features are used as input parameters of a support vector machine. By the jackknife test, the overall success rate of prediction is 95.6% on the rat protein dataset. When our method is applied to the apoptosis protein dataset and the Gram-negative bacterial protein dataset with the jackknife test, the overall success rates are 89.9% and 96.4%, respectively. These results indicate that our proposed method is quite promising and may play a complementary role to the existing predictors in this area.
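
Of the feature types listed, the Fourier component is the easiest to sketch: a protein sequence is mapped to a numeric profile (here the standard Kyte-Doolittle hydrophobicity scale) and the magnitudes of the first discrete Fourier coefficients capture periodic sequence-order information. This is only an illustration of that one ingredient; the paper's actual feature construction, position conservation scoring, and increment of diversity are not reproduced here.

```python
import cmath

# Kyte-Doolittle hydrophobicity scale (standard published values)
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def composition(seq):
    """Amino acid composition: frequency of each residue type."""
    return {aa: seq.count(aa) / len(seq) for aa in KD}

def fourier_features(seq, n_coeffs=3):
    """Magnitudes of the first Fourier coefficients of the hydrophobicity
    profile, capturing periodic (sequence-order) information that plain
    composition features discard."""
    profile = [KD[aa] for aa in seq]
    n = len(profile)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(profile)))
            for k in range(1, n_coeffs + 1)]
```

Concatenating composition and Fourier features gives a fixed-length vector per protein, which is the kind of input an SVM-based localization predictor consumes.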

  2. Age, Gender, and Fine-Grained Ethnicity Prediction using Convolutional Neural Networks for the East Asian Face Dataset

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Srinivas, Nisha; Rose, Derek C; Bolme, David S

    This paper examines the difficulty associated with performing machine-based automatic demographic prediction on a sub-population of Asian faces. We introduce the Wild East Asian Face Dataset (WEAFD), a new and unique dataset for the research community. This dataset consists primarily of labeled face images of individuals from East Asian countries, including Vietnam, Burma, Thailand, China, Korea, Japan, Indonesia, and Malaysia. East Asian Mechanical Turk annotators were specifically used to judge the age and fine-grained ethnicity attributes, to reduce the impact of the other-race effect and improve the quality of annotations. We focus on predicting the age, gender, and fine-grained ethnicity of an individual by providing baseline results with a convolutional neural network (CNN). Fine-grained ethnicity prediction refers to predicting the ethnicity of an individual by country or sub-region (Chinese, Japanese, Korean, etc.) of the East Asian region. Performance for two CNN architectures is presented, highlighting the difficulty of these tasks and showcasing potential design considerations that ease network optimization by promoting region-based feature extraction.

  3. The global coastline dataset: the observed relation between erosion and sea-level rise

    NASA Astrophysics Data System (ADS)

    Donchyts, G.; Baart, F.; Luijendijk, A.; Hagenaars, G.

    2017-12-01

    Erosion of sandy coasts is considered one of the key risks of sea-level rise. Because the sandy coastlines of the world are often highly populated, erosive coastline trends put populations and infrastructure at risk. Most of our understanding of the relation between sea-level rise and coastal erosion is based on local or regional observations and on generalizations of numerical and physical experiments. Until recently there was no reliable global-scale assessment of the location of sandy coasts and their rates of erosion and accretion. Here we present the global coastline dataset, which covers erosion indicators on a local scale with global coverage. The dataset uses our global coastline transects grid, defined with an alongshore spacing of 250 m and a cross-shore length extending 1 km seaward and 1 km landward. This grid matches up with pre-existing local grids where available. We present the latest results on validation of coastal-erosion trends (based on optical satellites) and on classification of sandy versus non-sandy coasts. We show the relation between sea-level rise (based both on tide gauges and on multi-mission satellite altimetry) and observed erosion trends over the last decades, taking into account broken coastline trends (for example due to nourishments). An interactive web application presents the publicly accessible results using a backend based on Google Earth Engine. It allows both researchers and stakeholders to use objective estimates of coastline trends, particularly when authoritative sources are not available.

  4. Prediction of Drug-Target Interaction Networks from the Integration of Protein Sequences and Drug Chemical Structures.

    PubMed

    Meng, Fan-Rong; You, Zhu-Hong; Chen, Xing; Zhou, Yong; An, Ji-Yong

    2017-07-05

    Knowledge of drug-target interactions (DTIs) plays an important role in discovering new drug candidates. Unfortunately, the experimental approach to determining DTIs has unavoidable shortcomings, including its time-consuming and expensive nature. This motivates us to develop an effective computational method to predict DTIs based on protein sequence. In this paper, we propose a novel computational approach based on protein sequence, namely PDTPS (Predicting Drug Targets with Protein Sequence). The PDTPS method combines Bi-gram Probabilities (BIGP), Position Specific Scoring Matrix (PSSM), and Principal Component Analysis (PCA) with a Relevance Vector Machine (RVM). In order to evaluate the prediction capacity of PDTPS, experiments were carried out on enzyme, ion channel, GPCR, and nuclear receptor datasets using five-fold cross-validation tests. The proposed PDTPS method achieved average accuracies of 97.73%, 93.12%, 86.78%, and 87.78% on the enzyme, ion channel, GPCR, and nuclear receptor datasets, respectively. The experimental results show that our method has good prediction performance. Furthermore, to further evaluate its prediction performance, we compared the proposed PDTPS method with the state-of-the-art support vector machine (SVM) classifier on the enzyme and ion channel datasets, and with other existing methods on all four datasets. The promising comparison results further demonstrate the efficiency and robustness of the proposed PDTPS method, which makes it a useful tool suitable for predicting DTIs as well as for other bioinformatics tasks.
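
Of the feature types named above, the bi-gram probability (BIGP) idea is the simplest to sketch: the relative frequency of each adjacent amino-acid pair in a protein sequence. The sketch below is illustrative, not the paper's implementation (which derives bi-grams from PSSM profiles); the toy sequence is invented.

```python
# Sketch of a bi-gram probability feature vector over the 20 amino acids:
# 400 entries, one per ordered amino-acid pair, normalized to sum to 1.
import itertools

AMINO = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in itertools.product(AMINO, repeat=2)]  # 400 bi-grams

def bigram_probabilities(seq):
    # relative frequency of each adjacent amino-acid pair in the sequence
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        if a + b in counts:
            counts[a + b] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in PAIRS]

vec = bigram_probabilities("MKVLILACLVALALA")  # toy fragment, not a real protein
```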

  5. Application of an imputation method for geospatial inventory of forest structural attributes across multiple spatial scales in the Lake States, U.S.A

    NASA Astrophysics Data System (ADS)

    Deo, Ram K.

    Credible spatial information characterizing the structure and site quality of forests is critical to sustainable forest management and planning, especially given the increasing demands on and threats to forest products and services. Forest managers and planners are required to evaluate forest conditions over a broad range of scales, contingent on operational or reporting requirements. Traditionally, forest inventory estimates are generated via a design-based approach that involves generalizing sample plot measurements to characterize an unknown population across a larger area of interest. However, field plot measurements are costly and as a consequence spatial coverage is limited. Remote sensing technologies have shown remarkable success in augmenting limited sample plot data to generate stand- and landscape-level spatial predictions of forest inventory attributes. Further enhancement of forest inventory approaches that couple field measurements with cutting-edge remotely sensed and geospatial datasets is essential to sustainable forest management. We evaluated a novel Random Forest based k Nearest Neighbors (RF-kNN) imputation approach to couple remote sensing and geospatial data with field inventory collected by different sampling methods to generate forest inventory information across large spatial extents. The forest inventory data collected by the FIA program of the US Forest Service were integrated with optical remote sensing and other geospatial datasets to produce biomass distribution maps for a part of the Lake States and species-specific site index maps for the entire Lake States region. Targeting small-area applications of state-of-the-art remote sensing, LiDAR (light detection and ranging) data were integrated with field data collected by an inexpensive method, called variable plot sampling, in the Ford Forest of Michigan Tech to derive a standing-volume map in a cost-effective way. The outputs of the RF-kNN imputation were compared with independent validation datasets and extant map products based on different sampling and modeling strategies. The RF-kNN modeling approach was found to be very effective, especially for large-area estimation, and produced results statistically equivalent to the field observations or the estimates derived from secondary data sources. The models are useful to resource managers for operational and strategic purposes.
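
The imputation step described above can be sketched in miniature: attribute values for unsampled map units are borrowed from the most similar reference (field-plot) units. In this illustrative sketch, plain Euclidean distance in predictor space stands in for the Random Forest proximity the paper uses, and all numbers are invented.

```python
# Minimal k-nearest-neighbour imputation in the spirit of RF-kNN:
# impute a target unit's attribute as the mean over its k nearest plots.
import math

def knn_impute(reference, targets, k=2):
    # reference: list of (predictor_vector, attribute); targets: predictor vectors
    imputed = []
    for t in targets:
        dists = sorted((math.dist(t, x), y) for x, y in reference)
        neighbours = dists[:k]                       # k most similar plots
        imputed.append(sum(y for _, y in neighbours) / k)
    return imputed

# toy field plots: (spectral predictors, measured biomass)
plots = [((1.0, 2.0), 100.0), ((1.1, 2.1), 110.0), ((8.0, 9.0), 300.0)]
print(knn_impute(plots, [(1.05, 2.05)], k=2))  # -> [105.0]
```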

  6. Feature Representations for Neuromorphic Audio Spike Streams.

    PubMed

    Anumula, Jithendar; Neil, Daniel; Delbruck, Tobi; Liu, Shih-Chii

    2018-01-01

    Event-driven neuromorphic spiking sensors such as the silicon retina and the silicon cochlea encode external sensory stimuli as asynchronous streams of spikes across different channels or pixels. Combining state-of-the-art deep neural networks with the asynchronous outputs of these sensors has produced encouraging results on some datasets but remains challenging. While the lack of effective spiking networks to process the spike streams is one reason, the other is that the pre-processing methods required to convert the spike streams into the frame-based features needed by deep networks still require further investigation. This work investigates the effectiveness of synchronous and asynchronous frame-based features generated using spike count and constant event binning, in combination with a recurrent neural network, for solving a classification task on the N-TIDIGITS18 dataset. This spike-based dataset consists of recordings from the Dynamic Audio Sensor, a spiking silicon cochlea sensor, in response to the TIDIGITS audio dataset. We also propose a new pre-processing method which applies an exponential kernel to the output cochlea spikes so that the inter-spike timing information is better preserved. The results on the N-TIDIGITS18 dataset show that the exponential features perform better than the spike count features, with over 91% accuracy on the digit classification task. This accuracy corresponds to an improvement of at least 2.5% over the use of spike count features, establishing a new state of the art for this dataset.
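
The exponential-kernel idea described above can be sketched for a single channel: a feature trace decays exponentially between spikes and is bumped at each spike, so inter-spike timing survives the regular sampling. The time constant, sampling step, and spike times below are illustrative, not the paper's parameters.

```python
# Exponential-kernel feature for one spike channel: integrate x' = -x/tau
# with a unit increment at each spike time, sampled on a regular grid.
import math

def exponential_feature(spike_times, n_bins, tau=20e-3, dt=1e-3):
    features, x = [], 0.0
    spikes = sorted(spike_times)
    i = 0
    for n in range(n_bins):
        t = n * dt
        while i < len(spikes) and spikes[i] <= t:
            x += 1.0                  # spike arrives: bump the trace
            i += 1
        features.append(x)            # sample the trace at this bin
        x *= math.exp(-dt / tau)      # exponential decay until next sample
    return features

# two closely spaced spikes, then a late one: timing shapes the feature
f = exponential_feature([0.0051, 0.0062, 0.05], n_bins=100)
```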

  7. Feature Representations for Neuromorphic Audio Spike Streams

    PubMed Central

    Anumula, Jithendar; Neil, Daniel; Delbruck, Tobi; Liu, Shih-Chii

    2018-01-01

    Event-driven neuromorphic spiking sensors such as the silicon retina and the silicon cochlea encode external sensory stimuli as asynchronous streams of spikes across different channels or pixels. Combining state-of-the-art deep neural networks with the asynchronous outputs of these sensors has produced encouraging results on some datasets but remains challenging. While the lack of effective spiking networks to process the spike streams is one reason, the other is that the pre-processing methods required to convert the spike streams into the frame-based features needed by deep networks still require further investigation. This work investigates the effectiveness of synchronous and asynchronous frame-based features generated using spike count and constant event binning, in combination with a recurrent neural network, for solving a classification task on the N-TIDIGITS18 dataset. This spike-based dataset consists of recordings from the Dynamic Audio Sensor, a spiking silicon cochlea sensor, in response to the TIDIGITS audio dataset. We also propose a new pre-processing method which applies an exponential kernel to the output cochlea spikes so that the inter-spike timing information is better preserved. The results on the N-TIDIGITS18 dataset show that the exponential features perform better than the spike count features, with over 91% accuracy on the digit classification task. This accuracy corresponds to an improvement of at least 2.5% over the use of spike count features, establishing a new state of the art for this dataset. PMID:29479300

  8. Species limits in the Morelet's Alligator lizard (Anguidae: Gerrhonotinae).

    PubMed

    Solano-Zavaleta, Israel; Nieto-Montes de Oca, Adrián

    2018-03-01

    The widely distributed Central American anguid lizard Mesaspis moreletii is currently recognized as a polytypic species with five subspecies (M. m. fulvus, M. m. moreletii, M. m. rafaeli, M. m. salvadorensis, and M. m. temporalis). We reevaluated the species limits within Mesaspis moreletii using DNA sequences of one mitochondrial and three nuclear genes. The multilocus dataset included samples of all of the subspecies of M. moreletii, the other species of Mesaspis in Central America (M. cuchumatanus and M. monticola), and some populations assignable to M. moreletii but of uncertain subspecific identity from Honduras and Nicaragua. We first used a tree-based method for delimiting species based on mtDNA data to identify potentially evolutionarily independent lineages, and then analyzed the multilocus dataset with two species delimitation methods that use the multispecies coalescent model to evaluate competing species delimitation models: the Bayes factors species delimitation method (BFD) implemented in *BEAST, and the Bayesian Phylogenetics and Phylogeography (BP&P) method. Our results suggest that M. m. moreletii, M. m. rafaeli, M. m. salvadorensis, and M. m. temporalis represent distinct, evolutionarily independent lineages, and that the populations of uncertain status from Honduras and Nicaragua may represent additional undescribed species. Our results also suggest that M. m. fulvus is a synonym of M. m. moreletii. The biogeography of the Central American lineages of Mesaspis is discussed. Copyright © 2017 Elsevier Inc. All rights reserved.

  9. Antibody-protein interactions: benchmark datasets and prediction tools evaluation

    PubMed Central

    Ponomarenko, Julia V; Bourne, Philip E

    2007-01-01

    Background The ability to predict antibody binding sites (also known as antigenic determinants or B-cell epitopes) for a given protein is a precursor to new vaccine design and diagnostics. Among the various methods of B-cell epitope identification, X-ray crystallography is one of the most reliable. Using these experimental data, computational methods exist for B-cell epitope prediction. As the number of structures of antibody-protein complexes grows, further interest in prediction methods using 3D structure is anticipated. This work aims to establish a benchmark for 3D structure-based epitope prediction methods. Results Two B-cell epitope benchmark datasets inferred from the 3D structures of antibody-protein complexes were defined. The first is a dataset of 62 representative 3D structures of protein antigens with inferred structural epitopes. The second is a dataset of 82 structures of antibody-protein complexes containing different structural epitopes. Using these datasets, eight web servers developed for antibody and protein binding site prediction were evaluated. No method's performance exceeded 40% precision and 46% recall. The values of the area under the receiver operating characteristic curve for the evaluated methods were about 0.6 for the ConSurf, DiscoTope, and PPI-PRED methods, and above 0.65 but not exceeding 0.70 for protein-protein docking methods when the best of the top ten models for the bound docking were considered; the remaining methods performed close to random. The benchmark datasets are included as a supplement to this paper. Conclusion It may be possible to improve epitope prediction methods through training on datasets which include only immune epitopes and through utilizing more features characterizing epitopes, for example, the evolutionary conservation score. Notwithstanding, the overall poor performance may reflect the generality of antigenicity and hence the inability to decipher B-cell epitopes as an intrinsic feature of the protein. It is an open question whether ultimately discriminatory features can be found. PMID:17910770
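
The precision and recall figures quoted above are residue-level set comparisons between predicted and structurally inferred epitopes; a minimal sketch with invented residue sets:

```python
# Residue-level precision and recall for one antigen: compare the set of
# residues a server flags as epitopic with the structural epitope residues.
def precision_recall(predicted, actual):
    tp = len(predicted & actual)                       # true-positive residues
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall

pred = {12, 13, 14, 40, 41}   # residues flagged by a hypothetical server
true = {13, 14, 15, 16, 40}   # residues in the inferred structural epitope
print(precision_recall(pred, true))  # -> (0.6, 0.6)
```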

  10. Recognition of Damaged Arrow-Road Markings by Visible Light Camera Sensor Based on Convolutional Neural Network.

    PubMed

    Vokhidov, Husan; Hong, Hyung Gil; Kang, Jin Kyu; Hoang, Toan Minh; Park, Kang Ryoung

    2016-12-16

    Automobile driver information, as displayed on marked road signs, indicates the state of the road, traffic conditions, proximity to schools, etc. These signs are important to ensure the safety of drivers and pedestrians. They are also important input to the automated advanced driver assistance systems (ADAS) installed in many automobiles. Over time, arrow-road markings may be eroded or otherwise damaged by automobile contact, making it difficult for the driver to correctly identify the marking. Failure to properly identify an arrow-road marking creates a dangerous situation that may result in traffic accidents or pedestrian injury. Very little research exists on the automated identification of damaged arrow-road markings painted on the road. In this study, we propose a method that uses a convolutional neural network (CNN) to recognize six types of arrow-road markings, possibly damaged, by visible light camera sensor. Experimental results on six databases (the Road marking dataset, KITTI dataset, Málaga dataset 2009, Málaga urban dataset, Naver street view dataset, and Road/Lane detection evaluation 2013 dataset) show that our method outperforms conventional methods.

  11. Recognition of Damaged Arrow-Road Markings by Visible Light Camera Sensor Based on Convolutional Neural Network

    PubMed Central

    Vokhidov, Husan; Hong, Hyung Gil; Kang, Jin Kyu; Hoang, Toan Minh; Park, Kang Ryoung

    2016-01-01

    Automobile driver information, as displayed on marked road signs, indicates the state of the road, traffic conditions, proximity to schools, etc. These signs are important to ensure the safety of drivers and pedestrians. They are also important input to the automated advanced driver assistance systems (ADAS) installed in many automobiles. Over time, arrow-road markings may be eroded or otherwise damaged by automobile contact, making it difficult for the driver to correctly identify the marking. Failure to properly identify an arrow-road marking creates a dangerous situation that may result in traffic accidents or pedestrian injury. Very little research exists on the automated identification of damaged arrow-road markings painted on the road. In this study, we propose a method that uses a convolutional neural network (CNN) to recognize six types of arrow-road markings, possibly damaged, by visible light camera sensor. Experimental results on six databases (the Road marking dataset, KITTI dataset, Málaga dataset 2009, Málaga urban dataset, Naver street view dataset, and Road/Lane detection evaluation 2013 dataset) show that our method outperforms conventional methods. PMID:27999301

  12. EnviroAtlas - Austin, TX - Park Access by Block Group

    EPA Pesticide Factsheets

    This EnviroAtlas dataset shows the block group population that is within and beyond an easy walking distance (500m) of a park entrance. Park entrances were included in this analysis if they were within 5km of the EnviroAtlas community boundary. This dataset was produced by the US EPA to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  13. GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text.

    PubMed

    Zhu, Qile; Li, Xiaolin; Conesa, Ana; Pereira, Cécile

    2018-05-01

    Best performing named entity recognition (NER) methods for biomedical literature are based on hand-crafted features or task-specific rules, which are costly to produce and difficult to generalize to other corpora. End-to-end neural networks achieve state-of-the-art performance without hand-crafted features and task-specific knowledge in non-biomedical NER tasks. However, in the biomedical domain, using the same architecture does not yield competitive performance compared with conventional machine learning models. We propose a novel end-to-end deep learning approach for biomedical NER tasks that leverages local contexts based on n-gram character and word embeddings via a Convolutional Neural Network (CNN). We call this approach GRAM-CNN. To automatically label a word, this method uses the local information around the word. Therefore, the GRAM-CNN method does not require any specific knowledge or feature engineering and can in principle be applied to a wide range of existing NER problems. The GRAM-CNN approach was evaluated on three well-known biomedical datasets containing different BioNER entities. It obtained an F1-score of 87.26% on the Biocreative II dataset, 87.26% on the NCBI dataset and 72.57% on the JNLPBA dataset. These results place GRAM-CNN among the leading biomedical NER methods. To the best of our knowledge, we are the first to apply CNN-based structures to BioNER problems. The GRAM-CNN source code, datasets and pre-trained model are available online at: https://github.com/valdersoul/GRAM-CNN. Contact: andyli@ece.ufl.edu or aconesa@ufl.edu. Supplementary data are available at Bioinformatics online.

  14. GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text

    PubMed Central

    Zhu, Qile; Li, Xiaolin; Conesa, Ana; Pereira, Cécile

    2018-01-01

    Abstract Motivation Best performing named entity recognition (NER) methods for biomedical literature are based on hand-crafted features or task-specific rules, which are costly to produce and difficult to generalize to other corpora. End-to-end neural networks achieve state-of-the-art performance without hand-crafted features and task-specific knowledge in non-biomedical NER tasks. However, in the biomedical domain, using the same architecture does not yield competitive performance compared with conventional machine learning models. Results We propose a novel end-to-end deep learning approach for biomedical NER tasks that leverages local contexts based on n-gram character and word embeddings via a Convolutional Neural Network (CNN). We call this approach GRAM-CNN. To automatically label a word, this method uses the local information around the word. Therefore, the GRAM-CNN method does not require any specific knowledge or feature engineering and can in principle be applied to a wide range of existing NER problems. The GRAM-CNN approach was evaluated on three well-known biomedical datasets containing different BioNER entities. It obtained an F1-score of 87.26% on the Biocreative II dataset, 87.26% on the NCBI dataset and 72.57% on the JNLPBA dataset. These results place GRAM-CNN among the leading biomedical NER methods. To the best of our knowledge, we are the first to apply CNN-based structures to BioNER problems. Availability and implementation The GRAM-CNN source code, datasets and pre-trained model are available online at: https://github.com/valdersoul/GRAM-CNN. Contact andyli@ece.ufl.edu or aconesa@ufl.edu Supplementary information Supplementary data are available at Bioinformatics online. PMID:29272325

  15. Application of multivariate statistical techniques in microbial ecology

    PubMed Central

    Paliy, O.; Shankar, V.

    2016-01-01

    Recent advances in high-throughput methods of molecular analyses have led to an explosion of studies generating large-scale ecological datasets. An especially noticeable effect has been attained in the field of microbial ecology, where new experimental approaches have provided in-depth assessments of the composition, functions, and dynamic changes of complex microbial communities. Because even a single high-throughput experiment produces large amounts of data, powerful statistical techniques of multivariate analysis are well suited to analyzing and interpreting these datasets. Many different multivariate techniques are available, and it is often not clear which method should be applied to a particular dataset. In this review we describe and compare the most widely used multivariate statistical techniques, including exploratory, interpretive, and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and dataset structure. PMID:26786791
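
A canonical exploratory technique from the family reviewed above is unconstrained ordination (PCA): a samples-by-taxa abundance table is centred and projected onto its first principal axis, so that samples with similar community composition land close together. This is a self-contained sketch using power iteration and an invented abundance table.

```python
# PCA ordination sketch: first-principal-component scores of samples in a
# samples-by-taxa abundance table, via power iteration on X^T X.
def first_pc_scores(table):
    n, m = len(table), len(table[0])
    means = [sum(r[j] for r in table) / n for j in range(m)]
    X = [[r[j] - means[j] for j in range(m)] for r in table]  # centre columns
    v = [float(j + 1) for j in range(m)]          # arbitrary starting direction
    for _ in range(200):                          # power iteration on X^T X
        s = [sum(xi * vi for xi, vi in zip(row, v)) for row in X]       # X v
        w = [sum(X[i][j] * s[i] for i in range(n)) for j in range(m)]   # X^T X v
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return [sum(xi * vi for xi, vi in zip(row, v)) for row in X]  # PC1 scores

counts = [[10, 0, 1],   # two community types dominated by different taxa
          [9, 1, 0],
          [0, 10, 1],
          [1, 9, 0]]
scores = first_pc_scores(counts)
```

On this toy table the first axis separates the two community types: samples 1-2 receive scores of one sign and samples 3-4 the other.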

  16. Mean composite fire severity metrics computed with Google Earth engine offer improved accuracy and expanded mapping potential

    Treesearch

    Sean A. Parks; Lisa M. Holsinger; Morgan A. Voss; Rachel A. Loehman; Nathaniel P. Robinson

    2018-01-01

    Landsat-based fire severity datasets are an invaluable resource for monitoring and research purposes. These gridded fire severity datasets are generally produced with pre- and post-fire imagery to estimate the degree of fire-induced ecological change. Here, we introduce methods to produce three Landsat-based fire severity metrics using the Google Earth Engine (GEE)...

  17. Processing and population genetic analysis of multigenic datasets with ProSeq3 software.

    PubMed

    Filatov, Dmitry A

    2009-12-01

    The current tendency in molecular population genetics is to use increasing numbers of genes in the analysis. Here I describe a program for handling and population genetic analysis of DNA polymorphism data collected from multiple genes. The program includes a sequence/alignment editor and an internal relational database that simplify the preparation and manipulation of multigenic DNA polymorphism datasets. The most commonly used DNA polymorphism analyses are implemented in ProSeq3, facilitating population genetic analysis of large multigenic datasets. Extensive input/output options make ProSeq3 a convenient hub for sequence data processing and analysis. The program is available free of charge from http://dps.plants.ox.ac.uk/sequencing/proseq.htm.
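
One of the commonly used DNA polymorphism analyses such software implements is nucleotide diversity (pi), the average proportion of pairwise differences per site. A minimal sketch over an invented toy alignment (not ProSeq3's actual code):

```python
# Nucleotide diversity (pi): mean pairwise differences per site across
# all sequence pairs in an aligned sample.
from itertools import combinations

def nucleotide_diversity(seqs):
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)

aln = ["ACGTACGT",
       "ACGTACGA",
       "ACGAACGT"]           # toy alignment with two segregating sites
print(round(nucleotide_diversity(aln), 4))  # -> 0.1667
```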

  18. A novel method based on new adaptive LVQ neural network for predicting protein-protein interactions from protein sequences.

    PubMed

    Yousef, Abdulaziz; Moghadam Charkari, Nasrollah

    2013-11-07

    Protein-protein interaction (PPI) data are among the most important for understanding cellular processes. Many interesting methods have been proposed to predict PPIs; however, methods based solely on protein sequence as prior knowledge are more universal. In this paper, a sequence-based, fast, and adaptive PPI prediction method is introduced that assigns two proteins to an interaction class (yes, no). First, to improve the representation of the sequences, twelve physicochemical properties of amino acids are used through different representation methods to transform the sequences of protein pairs into different feature vectors. Then, to speed up the learning process and reduce the effect of noisy PPI data, principal component analysis (PCA) is carried out as a feature extraction algorithm. Finally, a new adaptive Learning Vector Quantization (LVQ) predictor is designed to deal with different models of datasets, classified into balanced and imbalanced datasets. Accuracies of 93.88%, 90.03%, and 89.72% were obtained on the S. cerevisiae, H. pylori, and independent datasets, respectively. The results of various experiments indicate the efficiency and validity of the method. © 2013 Published by Elsevier Ltd.
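
The LVQ predictor named above can be sketched in its classic LVQ1 form: class prototypes are nudged toward correctly matched samples and away from mismatched ones, and a query is assigned the label of its nearest prototype. The 2-D toy data stand in for the paper's PCA-reduced physicochemical feature vectors; the learning rate and epoch count are illustrative.

```python
# Bare-bones LVQ1: attract the winning prototype to same-class samples,
# repel it from other-class samples; classify by nearest prototype.
import math

def lvq1_train(samples, labels, prototypes, proto_labels, lr=0.2, epochs=30):
    protos = [list(p) for p in prototypes]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            j = min(range(len(protos)),
                    key=lambda k: math.dist(x, protos[k]))  # winning prototype
            sign = 1.0 if proto_labels[j] == y else -1.0    # attract / repel
            protos[j] = [p + sign * lr * (xi - p)
                         for p, xi in zip(protos[j], x)]
    return protos

def lvq_predict(protos, proto_labels, x):
    j = min(range(len(protos)), key=lambda k: math.dist(x, protos[k]))
    return proto_labels[j]

X = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]   # toy feature vectors
y = ["no", "no", "yes", "yes"]                          # interaction class
P = lvq1_train(X, y, [(0.4, 0.4), (0.7, 0.7)], ["no", "yes"])
preds = [lvq_predict(P, ["no", "yes"], x) for x in X]
```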

  19. Novel harmonic regularization approach for variable selection in Cox's proportional hazards model.

    PubMed

    Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan

    2014-01-01

    Variable selection is an important issue in regression, and a number of variable selection methods involving nonconvex penalty functions have been proposed. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in the Cox proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on artificial datasets and four real microarray gene expression datasets, including diffuse large B-cell lymphoma (DLBCL), lung cancer, and AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso-type methods.
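
In standard notation (a sketch; the symbols are not taken from the abstract), the Lq-penalized Cox objective that the harmonic regularization approximates can be written as:

```latex
% L_q-penalized Cox objective (1/2 < q < 1) approximated by the
% harmonic regularization:
\min_{\beta}\; -\ell(\beta) \;+\; \lambda \sum_{j=1}^{p} |\beta_j|^{q}
% where ell is the Cox log partial likelihood, summed over subjects with
% an observed event (delta_i = 1) with risk set R(t_i):
\ell(\beta) \;=\; \sum_{i:\,\delta_i=1}
  \Big[\, x_i^{\top}\beta \;-\; \log \sum_{j \in R(t_i)} \exp\big(x_j^{\top}\beta\big) \Big]
```

For q = 1 this reduces to the Lasso penalty; for 1/2 < q < 1 the penalty is nonconvex and tends to select sparser sets of risk factors.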

  20. Identifying Populace Susceptible to Flooding Using ArcGIS and Remote Sensing Datasets

    NASA Astrophysics Data System (ADS)

    Fernandez, Sim Joseph; Milano, Alan

    2016-07-01

    Remote sensing technologies, along with their various applications, are growing rapidly. The Department of Science and Technology (DOST), Republic of the Philippines, has undertaken projects exploiting LiDAR datasets from remote sensing technologies. The Phil-LiDAR 1 project of DOST is a flood hazard mapping project. Among the project's objectives is the identification of building features that can be associated with the flood-exposed population. The extraction of building features from the LiDAR dataset is arduous, as it requires manual identification of building features on an elevation map. Building footprints are mapped meticulously to preserve the accuracy of both building floor area and building height, which are crucial in flood decision making. A building identification method was developed to generate a LiDAR derivative which serves as a guide in mapping building footprints. The method utilizes several tools of a Geographic Information System (GIS) software package called ArcGIS, which can operate on physical attributes of buildings such as roofing curvature, slope, and blueprint area in order to obtain the LiDAR derivative from the LiDAR dataset. The method also uses an intermediary step called the building removal process, wherein buildings and other features lying below the defined minimum building height (2 meters in the case of the Phil-LiDAR 1 project) are removed. The building identification method was developed in the hope of hastening the identification of building features, especially when orthophotographs and/or satellite imagery are not available.
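
The building removal step described above is essentially a height threshold on a normalized surface model. A minimal sketch, with a nested-list grid standing in for the LiDAR raster and a conventional nodata value (the function name and values are illustrative):

```python
# Mask out cells of a normalized height surface that fall below the
# minimum building height (2 m per the Phil-LiDAR 1 convention cited above).
MIN_BUILDING_HEIGHT = 2.0  # metres

def remove_low_features(ndsm, nodata=-9999.0):
    return [[h if h >= MIN_BUILDING_HEIGHT else nodata for h in row]
            for row in ndsm]

grid = [[0.3, 1.9, 4.5],   # toy normalized heights in metres
        [2.0, 7.2, 0.0]]
print(remove_low_features(grid))
# -> [[-9999.0, -9999.0, 4.5], [2.0, 7.2, -9999.0]]
```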

  1. Benchmarking Foot Trajectory Estimation Methods for Mobile Gait Analysis

    PubMed Central

    Ollenschläger, Malte; Roth, Nils; Klucken, Jochen

    2017-01-01

    Mobile gait analysis systems based on inertial sensing on the shoe are applied in a wide range of applications. Especially for medical applications, they can give new insights into motor impairment in, e.g., neurodegenerative disease and help objectify patient assessment. One key component in these systems is the reconstruction of the foot trajectories from inertial data. In the literature, various methods for this task have been proposed. However, performance is evaluated on a variety of datasets due to the lack of large, generally accepted benchmark datasets. This hinders a fair comparison of methods. In this work, we implement three orientation estimation and three double integration schemes for use in a foot trajectory estimation pipeline. All methods are drawn from the literature and evaluated against a marker-based motion capture reference. We provide a fair comparison on the same dataset consisting of 735 strides from 16 healthy subjects. As a result, the implemented methods are ranked and we identify the most suitable processing pipeline for foot trajectory estimation in the context of mobile gait analysis. PMID:28832511
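
The double-integration step common to the evaluated pipelines can be sketched in one dimension: gravity-free acceleration is integrated twice (trapezoidal rule here) from a zero-velocity instant at the start of the stride. The constant-acceleration test signal is synthetic, and real pipelines add orientation estimation and drift correction on top of this.

```python
# Trapezoidal double integration of acceleration into velocity and position,
# starting from zero velocity (as at a detected foot-flat instant).
def double_integrate(acc, dt):
    vel, pos = [0.0], [0.0]
    for i in range(1, len(acc)):
        vel.append(vel[-1] + 0.5 * (acc[i - 1] + acc[i]) * dt)
        pos.append(pos[-1] + 0.5 * (vel[-2] + vel[-1]) * dt)
    return vel, pos

a = [1.0] * 101                      # 1 m/s^2 for 1 s, sampled at 100 Hz
v, p = double_integrate(a, 0.01)
print(round(v[-1], 3), round(p[-1], 3))  # -> 1.0 0.5  (v = a*t, p = a*t^2/2)
```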

  2. Picking vs Waveform based detection and location methods for induced seismicity monitoring

    NASA Astrophysics Data System (ADS)

    Grigoli, Francesco; Boese, Maren; Scarabello, Luca; Diehl, Tobias; Weber, Bernd; Wiemer, Stefan; Clinton, John F.

    2017-04-01

    Microseismic monitoring is a common operation in various industrial activities related to geo-resouces, such as oil and gas and mining operations or geothermal energy exploitation. In microseismic monitoring we generally deal with large datasets from dense monitoring networks that require robust automated analysis procedures. The seismic sequences being monitored are often characterized by very many events with short inter-event times that can even provide overlapped seismic signatures. In these situations, traditional approaches that identify seismic events using dense seismic networks based on detections, phase identification and event association can fail, leading to missed detections and/or reduced location resolution. In recent years, to improve the quality of automated catalogues, various waveform-based methods for the detection and location of microseismicity have been proposed. These methods exploit the coherence of the waveforms recorded at different stations and do not require any automated picking procedure. Although this family of methods have been applied to different induced seismicity datasets, an extensive comparison with sophisticated pick-based detection and location methods is still lacking. We aim here to perform a systematic comparison in term of performance using the waveform-based method LOKI and the pick-based detection and location methods (SCAUTOLOC and SCANLOC) implemented within the SeisComP3 software package. SCANLOC is a new detection and location method specifically designed for seismic monitoring at local scale. Although recent applications have proved an extensive test with induced seismicity datasets have been not yet performed. This method is based on a cluster search algorithm to associate detections to one or many potential earthquake sources. On the other hand, SCAUTOLOC is more a "conventional" method and is the basic tool for seismic event detection and location in SeisComp3. 
This approach was specifically designed for regional and teleseismic applications, thus its performance with microseismic data might be limited. We analyze the performance of the three methodologies on a synthetic dataset with realistic noise conditions as well as on the first hour of continuous waveform data, including the Ml 3.5 St. Gallen earthquake, recorded by a microseismic network deployed in the area. We finally compare the results obtained with all three methods against a manually revised catalogue.

  3. Design of an audio advertisement dataset

    NASA Astrophysics Data System (ADS)

    Fu, Yutao; Liu, Jihong; Zhang, Qi; Geng, Yuting

    2015-12-01

    As more and more advertisements flood radio broadcasts, it is necessary to establish an audio advertising dataset that can be used to analyze and classify advertisements. A method for establishing a complete audio advertising dataset is presented in this paper. The dataset is divided into four different kinds of advertisements. Each advertisement sample is given in *.wav file format and annotated with a txt file that contains its file name, sampling frequency, channel number, broadcasting time and class. The rationality of the classification scheme is verified by clustering the different advertisements based on Principal Component Analysis (PCA). The experimental results show that this audio advertisement dataset offers a reliable set of samples for related audio advertisement studies.
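    A minimal sketch of what parsing the per-sample txt annotation described above might look like; the field names and the "key: value" layout are assumptions for illustration, not the authors' exact schema.

```python
from dataclasses import dataclass

@dataclass
class AdAnnotation:
    file_name: str
    sample_rate_hz: int
    channels: int
    broadcast_time: str
    ad_class: str

def parse_annotation(text: str) -> AdAnnotation:
    """Parse one annotation record with one 'key: value' field per line."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return AdAnnotation(
        file_name=fields["file"],
        sample_rate_hz=int(fields["sample_rate"]),
        channels=int(fields["channels"]),
        broadcast_time=fields["broadcast_time"],
        ad_class=fields["class"],
    )

# Hypothetical annotation for one *.wav sample.
example = """\
file: ad_0001.wav
sample_rate: 44100
channels: 2
broadcast_time: 08:15:00
class: retail
"""
ann = parse_annotation(example)
```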

  4. Identifying Social Learning in Animal Populations: A New ‘Option-Bias’ Method

    PubMed Central

    Kendal, Rachel L.; Kendal, Jeremy R.; Hoppitt, Will; Laland, Kevin N.

    2009-01-01

    Background Studies of natural animal populations reveal widespread evidence for the diffusion of novel behaviour patterns, and for intra- and inter-population variation in behaviour. However, claims that these are manifestations of animal ‘culture’ remain controversial because alternative explanations to social learning remain difficult to refute. This inability to identify social learning in social settings has also contributed to the failure to test evolutionary hypotheses concerning the social learning strategies that animals deploy. Methodology/Principal Findings We present a solution to this problem, in the form of a new means of identifying social learning in animal populations. The method is based on the well-established premise of social learning research, that - when ecological and genetic differences are accounted for - social learning will generate greater homogeneity in behaviour between animals than expected in its absence. Our procedure compares the observed level of homogeneity to a sampling distribution generated utilizing randomization and other procedures, allowing claims of social learning to be evaluated according to consensual standards. We illustrate the method on data from groups of monkeys provided with novel two-option extractive foraging tasks, demonstrating that social learning can indeed be distinguished from unlearned processes and asocial learning, and revealing that the monkeys only employed social learning for the more difficult tasks. The method is further validated against published datasets and through simulation, and exhibits higher statistical power than conventional inferential statistics. Conclusions/Significance The method is potentially a significant technological development, which could prove of considerable value in assessing the validity of claims for culturally transmitted behaviour in animal groups. 
It will also be of value in enabling investigation of the social learning strategies deployed in captive and natural animal populations. PMID:19657389
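    The randomization logic described above can be sketched as follows: under the null hypothesis of no social learning, option choices are exchangeable across groups, so shuffling choices among groups yields a sampling distribution for a between-group homogeneity statistic. The statistic used here (mean within-group majority share) and the toy two-option data are illustrative assumptions, not the authors' exact measure.

```python
import random
from collections import Counter

def homogeneity(groups):
    """Mean within-group majority share: higher = more homogeneous behaviour."""
    shares = []
    for g in groups:
        counts = Counter(g)
        shares.append(max(counts.values()) / len(g))
    return sum(shares) / len(groups)

def option_bias_p_value(groups, n_perm=2000, seed=0):
    """Permutation p-value for observing homogeneity at least this high."""
    rng = random.Random(seed)
    observed = homogeneity(groups)
    pooled = [choice for g in groups for choice in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, i = [], 0
        for s in sizes:
            shuffled.append(pooled[i:i + s])
            i += s
        if homogeneity(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Two groups that each converged on a different option: strong option bias.
groups = [["A"] * 9 + ["B"], ["B"] * 9 + ["A"]]
p = option_bias_p_value(groups)
```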

  5. Multi-field query expansion is effective for biomedical dataset retrieval.

    PubMed

    Bouadjenek, Mohamed Reda; Verspoor, Karin

    2017-01-01

    In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical datasets with heterogeneous schemas through query reformulation. In particular, the proposed method transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% in MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive, since it does not require computation of any relevance feedback set or any initial execution of the query. Hence, in terms of the trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors. 
Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one. © The Author(s) 2017. Published by Oxford University Press.

  6. Multi-field query expansion is effective for biomedical dataset retrieval

    PubMed Central

    2017-01-01

    Abstract In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical datasets with heterogeneous schemas through query reformulation. In particular, the proposed method transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% in MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive, since it does not require computation of any relevance feedback set or any initial execution of the query. Hence, in terms of the trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors. 
Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one. PMID:29220457
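    A hedged sketch of the Rocchio-style expansion evaluated in the two records above: the query's term weights are pushed toward the centroid of (pseudo-)relevant documents, and the highest-weight non-query terms are appended. The alpha/beta weights and toy term lists are illustrative assumptions.

```python
from collections import Counter

def rocchio_expand(query_terms, relevant_docs, alpha=1.0, beta=0.75, top_k=3):
    """Return the original terms plus the top_k highest-weight expansion terms."""
    weights = Counter({t: alpha for t in query_terms})
    for doc in relevant_docs:
        for term, count in Counter(doc).items():
            # Move weights toward the centroid of the relevant documents.
            weights[term] += beta * count / len(relevant_docs)
    expansion = [t for t, _ in weights.most_common()
                 if t not in query_terms][:top_k]
    return list(query_terms) + expansion

# Toy pseudo-relevant documents represented as bags of terms.
docs = [
    ["rna", "seq", "expression", "dataset"],
    ["rna", "expression", "microarray"],
]
expanded = rocchio_expand(["gene", "expression"], docs)
```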

  7. Dynamic association rules for gene expression data analysis.

    PubMed

    Chen, Shu-Chuan; Tsai, Tsung-Hsien; Chung, Cheng-Han; Li, Wen-Hsiung

    2015-10-14

    The purpose of gene expression analysis is to look for the association between regulation of gene expression levels and phenotypic variations. This association based on gene expression profiles has been used to determine whether the induction/repression of genes corresponds to phenotypic variations, including cell regulation, clinical diagnosis and drug development. Statistical analyses of microarray data have been developed to resolve the gene selection issue. However, these methods do not inform us of causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm), which helps one efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical way, based on constructing a one-sided confidence interval and hypothesis testing, to determine if an association rule is meaningful. Based on the proposed statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to analyze four microarray datasets and one Next Generation Sequencing (NGS) dataset: the Mice Apo A1 dataset, the whole genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of leukemia patients, the Microarray Quality Control (MAQC) dataset and the RNA-seq dataset of a mouse genomic imprinting study. A comparison of the proposed method with the t-test on the expression profiling of the bone marrow of leukemia patients was conducted. We developed a statistical way, based on the concept of confidence intervals, to determine the minimum support and minimum confidence for mining association relationships among items. With the minimum support and minimum confidence, one can find significant rules in a single step. The DAR algorithm was then developed for gene expression data analysis. 
Four gene expression datasets showed that the proposed DAR algorithm not only was able to identify a set of differentially expressed genes that largely agreed with those of other methods, but also provided an efficient and accurate way to find influential genes of a disease. In this paper, the well-established association rule mining technique from marketing has been successfully modified to determine the minimum support and minimum confidence based on the concept of confidence intervals and hypothesis testing. It can be applied to gene expression data to mine significant association rules between gene regulation and phenotype. The proposed DAR algorithm provides an efficient way to find influential genes that underlie phenotypic variance.
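    The statistical idea of combining an association rule's observed confidence with a one-sided confidence interval can be sketched as follows: treat confidence P(B|A) as a binomial proportion and keep the rule only if the lower bound of a one-sided interval clears a threshold. The normal-approximation bound and toy transactions are illustrative assumptions, not the DAR algorithm's exact criterion.

```python
import math

def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ab / len(transactions)
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence, n_a

def lower_confidence_bound(confidence, n_a, z=1.645):
    """One-sided 95% lower bound for P(consequent | antecedent)."""
    if n_a == 0:
        return 0.0
    se = math.sqrt(confidence * (1 - confidence) / n_a)
    return confidence - z * se

# Toy transactions: sets of discretized gene-regulation events.
transactions = [
    {"geneA_up", "geneB_up"},
    {"geneA_up", "geneB_up"},
    {"geneA_up", "geneB_up"},
    {"geneA_up"},
    {"geneC_up"},
]
sup, conf, n_a = rule_stats(transactions, {"geneA_up"}, {"geneB_up"})
lcb = lower_confidence_bound(conf, n_a)
```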

  8. Developing a Resource for Implementing ArcSWAT Using Global Datasets

    NASA Astrophysics Data System (ADS)

    Taggart, M.; Caraballo Álvarez, I. O.; Mueller, C.; Palacios, S. L.; Schmidt, C.; Milesi, C.; Palmer-Moloney, L. J.

    2015-12-01

    This project developed a comprehensive user manual outlining methods for adapting and implementing global datasets for use within ArcSWAT for international and worldwide applications. The Soil and Water Assessment Tool (SWAT) is a hydrologic model that simulates a number of hydrologic variables, including runoff and the chemical makeup of water, at a given location on the Earth's surface using Digital Elevation Models (DEM), land cover, soil, and weather data. However, the application of ArcSWAT to projects outside of the United States is challenging, as there is no standard framework for inputting global datasets into ArcSWAT. This project aims to remove this obstacle by outlining methods for adapting and implementing these global datasets via the user manual. The manual takes the user through the processes of data conditioning while providing solutions and suggestions for common errors. The efficacy of the manual was explored using examples from watersheds located in Puerto Rico, Mexico and Western Africa. Each run explored the various options for setting up an ArcSWAT project as well as a range of satellite data products and soil databases. Future work will incorporate in-situ data for validation and calibration of the model and outline additional resources to assist future users in efficiently implementing the model for worldwide applications. The capacity to manage and monitor freshwater availability is of critical importance in both developed and developing countries. As populations grow and climate changes, both the quality and quantity of freshwater are affected, resulting in negative impacts on the health of the surrounding population. The use of hydrologic models such as ArcSWAT can help stakeholders and decision makers understand the future impacts of these changes, enabling informed and substantiated decisions.

  9. Evaluation of Normalization Methods to Pave the Way Towards Large-Scale LC-MS-Based Metabolomics Profiling Experiments

    PubMed Central

    Valkenborg, Dirk; Baggerman, Geert; Vanaerschot, Manu; Witters, Erwin; Dujardin, Jean-Claude; Burzykowski, Tomasz; Berg, Maya

    2013-01-01

    Abstract Combining liquid chromatography-mass spectrometry (LC-MS)-based metabolomics experiments that were collected over a long period of time remains problematic due to systematic variability between LC-MS measurements. Until now, most normalization methods for LC-MS data have been model-driven, based on internal standards or intermediate quality control runs, where an external model is extrapolated to the dataset of interest. In the first part of this article, we evaluate several existing data-driven normalization approaches on LC-MS metabolomics experiments, which do not require the use of internal standards. According to variability measures, each normalization method performs relatively well, showing that the use of any normalization method will greatly improve data analysis originating from multiple experimental runs. In the second part, we apply cyclic-Loess normalization to a Leishmania sample. This normalization method allows the removal of systematic variability between two measurement blocks over time while maintaining the differential metabolites. In conclusion, normalization allows for pooling datasets from different measurement blocks over time and increases the statistical power of the analysis, hence paving the way to increase the scale of LC-MS metabolomics experiments. From our investigation, we recommend data-driven normalization methods over model-driven normalization methods if only a few internal standards were used. Moreover, data-driven normalization methods are the best option to normalize datasets from untargeted LC-MS experiments. PMID:23808607

  10. Spatio-temporal synchrony of influenza in cities across Israel: the "Israel is one city" hypothesis.

    PubMed

    Barnea, Oren; Huppert, Amit; Katriel, Guy; Stone, Lewi

    2014-01-01

    We analysed an 11-year dataset (1998-2009) of Influenza-Like Illness (ILI) that was based on surveillance of ∼23% of Israel's population. We examined whether the level of synchrony of ILI epidemics in Israel's 12 largest cities is high enough to view Israel as a single epidemiological unit. Two methods were developed to assess the synchrony: (1) City-specific attack rates were fitted to a simple model in order to estimate the temporal differences in attack rates and spatial differences in reporting rates of ILI. The model showed good fit to the data (R² = 0.76) and revealed considerable differences in reporting rates of ILI in different cities (up to a factor of 2.2). (2) A statistical test was developed to examine the null hypothesis (H0) that ILI incidence curves in two cities are essentially identical, and was tested using ILI data. Upon examining all possible pairs of incidence curves, 77.4% of pairs were found not to be different (H0 was not rejected). It was concluded that all cities generally have the same attack rate and follow the same epidemic curve each season, although the attack rate changes from season to season, providing strong support for the "Israel is one city" hypothesis. The cities which were the most out of synchronization were Bnei Brak, Beersheba and Haifa, the latter two being geographically remote from all other cities in the dataset and the former geographically very close to several other cities but socially separate due to being populated almost exclusively by ultra-orthodox Jews. Further evidence of assortative mixing of the ultra-orthodox population can be found in the 2001-2002 season, when ultra-orthodox cities and neighborhoods showed distinctly different incidence curves compared to the general population.

  11. An open, multi-vendor, multi-field-strength brain MR dataset and analysis of publicly available skull stripping methods agreement.

    PubMed

    Souza, Roberto; Lucena, Oeslle; Garrafa, Julia; Gobbi, David; Saluzzi, Marina; Appenzeller, Simone; Rittner, Letícia; Frayne, Richard; Lotufo, Roberto

    2018-04-15

    This paper presents an open, multi-vendor, multi-field-strength magnetic resonance (MR) T1-weighted volumetric brain imaging dataset, named Calgary-Campinas-359 (CC-359). The dataset is composed of images of older healthy adults (29-80 years) acquired on scanners from three vendors (Siemens, Philips and General Electric) at both 1.5 T and 3 T. CC-359 comprises 359 datasets, approximately 60 subjects per vendor and magnetic field strength. The dataset is approximately age and gender balanced, subject to the constraints of the available images. It provides consensus brain extraction masks for all volumes generated using supervised classification. Manual segmentation results for twelve randomly selected subjects performed by an expert are also provided. The CC-359 dataset allows investigation of 1) the influences of both vendor and magnetic field strength on quantitative analysis of brain MR; 2) parameter optimization for automatic segmentation methods; and potentially 3) machine learning classifiers with big data, specifically those based on deep learning methods, as these approaches require a large amount of data. To illustrate the utility of this dataset, we compared the results of eight publicly available skull stripping methods and one publicly available consensus algorithm against those of a supervised classifier. A linear mixed effects model analysis indicated that vendor (p-value<0.001) and magnetic field strength (p-value<0.001) have statistically significant impacts on skull stripping results. Copyright © 2017 Elsevier Inc. All rights reserved.
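    One simple stand-in for the consensus idea mentioned above is a voxelwise majority vote over several skull-stripping masks; this is an illustrative sketch, and the consensus algorithm actually compared in the paper may differ.

```python
def majority_vote(masks):
    """Voxelwise majority vote over binary masks (flattened volumes).

    masks: list of equal-length lists of 0/1 values; a voxel is kept
    when strictly more than half of the input masks include it.
    """
    n = len(masks)
    return [1 if sum(voxel) * 2 > n else 0 for voxel in zip(*masks)]

# Three toy 4-voxel masks from hypothetical skull-stripping methods.
masks = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
consensus = majority_vote(masks)
```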

  12. Improved statistical method for temperature and salinity quality control

    NASA Astrophysics Data System (ADS)

    Gourrion, Jérôme; Szekely, Tanguy

    2017-04-01

    Climate research and ocean monitoring benefit from the continuous development of global in-situ hydrographic networks in the last decades. Apart from the increasing volume of observations available on a large range of temporal and spatial scales, a critical aspect concerns the ability to constantly improve the quality of the datasets. In the context of the Coriolis Dataset for ReAnalysis (CORA) version 4.2, a new quality control method based on a local comparison to the historical extreme values ever observed has been developed, implemented and validated. Temperature, salinity and potential density validity intervals are directly estimated from the minimum and maximum values of an historical reference dataset, rather than from traditional mean and standard deviation estimates. Such an approach avoids strong statistical assumptions on the data distributions such as unimodality, absence of skewness and spatially homogeneous kurtosis. As a new feature, it also allows simultaneously addressing the two main objectives of an automatic quality control strategy, i.e. maximizing the number of good detections while minimizing the number of false alarms. The reference dataset is presently built from the fusion of 1) all ARGO profiles up to late 2015, 2) three historical CTD datasets and 3) the sea mammal CTD profiles from the MEOP database. All datasets are extensively and manually quality controlled. In this communication, the latest method validation results are also presented. The method has already been implemented in the latest version of the delayed-time CMEMS in-situ dataset and will soon be deployed in the equivalent near-real-time products.
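    The core test, flagging a measurement that falls outside the historical [min, max] envelope for its location, can be sketched as follows; the reference extremes below are assumed numbers, not CORA values.

```python
def qc_flag(value, ref_min, ref_max):
    """Return True if the value is suspect (outside historical extremes)."""
    return not (ref_min <= value <= ref_max)

# Assumed historical temperature extremes for one grid cell / depth level.
ref_min, ref_max = 2.1, 18.4  # deg C

# Flag a batch of incoming temperature observations against the envelope.
observations = (5.0, 18.2, 25.0, 1.0)
flags = [qc_flag(v, ref_min, ref_max) for v in observations]
```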

  13. A group LASSO-based method for robustly inferring gene regulatory networks from multiple time-course datasets.

    PubMed

    Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun

    2014-01-01

    As an abstract mapping of the gene regulation in the cell, a gene regulatory network is important to both biological research and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets while taking robustness to large errors or outliers into account. To solve the optimization problem involved in the proposed method, an efficient algorithm combining the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulated datasets and real experimental datasets. It shows that Huber group LASSO outperforms the group LASSO in terms of both the area under the receiver operating characteristic curve and the area under the precision-recall curve. The convergence analysis of the algorithm theoretically shows that the sequence generated by the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
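    The robustness ingredient of the Huber group LASSO can be illustrated with the Huber loss itself: quadratic for small residuals and linear for large ones, so a single outlier contributes far less than under squared error. The choice delta = 1.0 is illustrative.

```python
def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic inside |r| <= delta, linear outside."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

small = huber_loss(0.5)    # quadratic regime: 0.5 * 0.25 = 0.125
large = huber_loss(10.0)   # linear regime: 1.0 * (10 - 0.5) = 9.5
squared = 0.5 * 10.0 ** 2  # squared error on the same outlier: 50.0
```

The outlier's Huber penalty (9.5) is far below its squared-error penalty (50.0), which is what dampens the influence of contaminated expression measurements on the fitted network.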

  14. Prediction complements explanation in understanding the developing brain.

    PubMed

    Rosenberg, Monica D; Casey, B J; Holmes, Avram J

    2018-02-21

    A central aim of human neuroscience is understanding the neurobiology of cognition and behavior. Although we have made significant progress towards this goal, reliance on group-level studies of the developed adult brain has limited our ability to explain population variability and developmental changes in neural circuitry and behavior. In this review, we suggest that predictive modeling, a method for predicting individual differences in behavior from brain features, can complement descriptive approaches and provide new ways to account for this variability. Highlighting the outsized scientific and clinical benefits of prediction in developmental populations including adolescence, we show that predictive brain-based models are already providing new insights on adolescent-specific risk-related behaviors. Together with large-scale developmental neuroimaging datasets and complementary analytic approaches, predictive modeling affords us the opportunity and obligation to identify novel treatment targets and individually tailor the course of interventions for developmental psychopathologies that impact so many young people today.

  15. A Benchmark Dataset for SSVEP-Based Brain-Computer Interfaces.

    PubMed

    Wang, Yijun; Chen, Xiaogang; Gao, Xiaorong; Gao, Shangkai

    2017-10-01

    This paper presents a benchmark steady-state visual evoked potential (SSVEP) dataset acquired with a 40-target brain-computer interface (BCI) speller. The dataset consists of 64-channel electroencephalogram (EEG) data from 35 healthy subjects (8 experienced and 27 naïve) while they performed a cue-guided target selection task. The virtual keyboard of the speller was composed of 40 visual flickers, which were coded using a joint frequency and phase modulation (JFPM) approach. The stimulation frequencies ranged from 8 Hz to 15.8 Hz with an interval of 0.2 Hz, with a fixed phase difference between two adjacent frequencies. For each subject, the data included six blocks of 40 trials corresponding to all 40 flickers indicated by a visual cue in a random order. The stimulation duration in each trial was five seconds. The dataset can be used as a benchmark to compare methods for stimulus coding and target identification in SSVEP-based BCIs. Through offline simulation, the dataset can be used to design new system diagrams and evaluate their BCI performance without collecting any new data. The dataset also provides high-quality data for computational modeling of SSVEPs. The dataset is freely available from http://bci.med.tsinghua.edu.cn/download.html.
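    A sketch of a joint frequency and phase modulation (JFPM) layout consistent with the description above: 40 targets tiled over 8-15.8 Hz in 0.2 Hz steps, each with a phase offset that grows by a fixed step between adjacent frequencies. The phase step used below is a placeholder assumption, since the exact value is not given in the abstract.

```python
import math

def jfpm_codes(n_targets=40, f0=8.0, df=0.2, phase_step=0.5 * math.pi):
    """Assign each target a (frequency, phase) code; phase_step is assumed."""
    codes = []
    for k in range(n_targets):
        freq = f0 + k * df
        phase = (k * phase_step) % (2 * math.pi)
        codes.append((round(freq, 1), phase))
    return codes

codes = jfpm_codes()
```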

  16. Investigating the Effects of Imputation Methods for Modelling Gene Networks Using a Dynamic Bayesian Network from Gene Expression Data

    PubMed Central

    CHAI, Lian En; LAW, Chow Kuan; MOHAMAD, Mohd Saberi; CHONG, Chuii Khim; CHOON, Yee Wen; DERIS, Safaai; ILLIAS, Rosli Md

    2014-01-01

    Background: Gene expression data often contain missing expression values. Therefore, several imputation methods have been applied to impute the missing values, including k-nearest neighbour (kNN), local least squares (LLS), and Bayesian principal component analysis (BPCA). However, the effects of these imputation methods on the modelling of gene regulatory networks from gene expression data have rarely been investigated and analysed using a dynamic Bayesian network (DBN). Methods: In the present study, we separately imputed datasets of the Escherichia coli S.O.S. DNA repair pathway and the Saccharomyces cerevisiae cell cycle pathway with kNN, LLS, and BPCA, and subsequently used these to generate gene regulatory networks (GRNs) using a discrete DBN. We made comparisons on the basis of previous studies in order to select the gene network with the least error. Results: We found that BPCA and LLS performed better on larger networks (based on the S. cerevisiae dataset), whereas kNN performed better on smaller networks (based on the E. coli dataset). Conclusion: The results suggest that the performance of each imputation method depends on the size of the dataset, and this subsequently affects the modelling of the resultant GRNs using a DBN. In addition, on the basis of these results, a DBN has the capacity to discover potential edges, as well as display interactions, between genes. PMID:24876803
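    A minimal sketch of kNN imputation, one of the three methods compared above: a missing value is filled with the average of that position across the k most similar complete profiles, where similarity is measured on the positions both profiles observe. The toy expression matrix and k are illustrative.

```python
def knn_impute(matrix, row, col, k=2):
    """Impute matrix[row][col] (None) from the k nearest rows observed at col."""
    def dist(a, b):
        # Squared distance over positions observed in both profiles.
        return sum((x - y) ** 2 for x, y in zip(a, b)
                   if x is not None and y is not None)

    target = matrix[row]
    candidates = [r for i, r in enumerate(matrix)
                  if i != row and r[col] is not None]
    candidates.sort(key=lambda r: dist(target, r))
    neighbors = candidates[:k]
    return sum(r[col] for r in neighbors) / len(neighbors)

# Toy expression matrix: rows are samples, columns are genes.
expr = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [5.0, 6.0, 9.0],
]
imputed = knn_impute(expr, 0, 2)
```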

  17. SisFall: A Fall and Movement Dataset

    PubMed Central

    Sucerquia, Angela; López, José David; Vargas-Bonilla, Jesús Francisco

    2017-01-01

    Research on fall and movement detection with wearable devices has witnessed promising growth. However, there are few publicly available datasets, all recorded with smartphones, which are insufficient for testing new proposals due to the absence of a target population, the limited set of performed activities, and limited accompanying information. Here, we present a dataset of falls and activities of daily living (ADLs) acquired with a self-developed device composed of two types of accelerometer and one gyroscope. It consists of 19 ADLs and 15 fall types performed by 23 young adults, 15 ADL types performed by 14 healthy and independent participants over 62 years old, and data from one 60-year-old participant who performed all ADLs and falls. These activities were selected based on a survey and a literature analysis. We test the dataset with widely used feature extraction and a simple-to-implement threshold-based classification, achieving up to 96% accuracy in fall detection. An individual activity analysis demonstrates that most errors are concentrated in a small number of activities, where new approaches could be focused. Finally, validation tests with elderly people significantly reduced the fall detection performance of the tested features. This validates the findings of other authors and encourages the development of new strategies with this new dataset as the benchmark. PMID:28117691

  18. SisFall: A Fall and Movement Dataset.

    PubMed

    Sucerquia, Angela; López, José David; Vargas-Bonilla, Jesús Francisco

    2017-01-20

    Research on fall and movement detection with wearable devices has witnessed promising growth. However, there are few publicly available datasets, all recorded with smartphones, which are insufficient for testing new proposals due to the absence of a target population, the limited set of performed activities, and limited accompanying information. Here, we present a dataset of falls and activities of daily living (ADLs) acquired with a self-developed device composed of two types of accelerometer and one gyroscope. It consists of 19 ADLs and 15 fall types performed by 23 young adults, 15 ADL types performed by 14 healthy and independent participants over 62 years old, and data from one 60-year-old participant who performed all ADLs and falls. These activities were selected based on a survey and a literature analysis. We test the dataset with widely used feature extraction and a simple-to-implement threshold-based classification, achieving up to 96% accuracy in fall detection. An individual activity analysis demonstrates that most errors are concentrated in a small number of activities, where new approaches could be focused. Finally, validation tests with elderly people significantly reduced the fall detection performance of the tested features. This validates the findings of other authors and encourages the development of new strategies with this new dataset as the benchmark.
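    The simple threshold-based classification mentioned above can be sketched as follows: flag a fall when the acceleration magnitude exceeds a threshold. The 1.8 g threshold and toy signals are illustrative assumptions, not the paper's tuned values.

```python
import math

def detect_fall(samples_g, threshold_g=1.8):
    """samples_g: list of (ax, ay, az) accelerometer samples in units of g."""
    for ax, ay, az in samples_g:
        magnitude = math.sqrt(ax * ax + ay * ay + az * az)
        if magnitude > threshold_g:
            return True
    return False

# Toy signals: quiet walking stays near 1 g; a fall shows an impact spike.
walking = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.05), (0.0, 0.1, 0.95)]
fall = [(0.0, 0.0, 1.0), (0.5, 0.3, 2.4), (0.0, 0.0, 0.2)]
```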

  19. Behavioral consequences of disasters: a five-stage model of population behavior.

    PubMed

    Rudenstine, Sasha; Galea, Sandro

    2014-12-01

    We propose a model of population behavior in the aftermath of disasters. We conducted a qualitative analysis of an empirical dataset of 339 disasters throughout the world spanning from 1950 to 2005. We developed a model of population behavior that is based on 2 fundamental assumptions: (i) behavior is predictable and (ii) population behavior will progress sequentially through 5 stages from the moment the hazard begins until it is complete. Understanding the progression of population behavior during a disaster can improve the efficiency and appropriateness of institutional efforts aimed at population preservation after large-scale traumatic events. Additionally, the opportunity for population-level intervention in the aftermath of such events will improve population health.

  20. Validation of automatic landmark identification for atlas-based segmentation for radiation treatment planning of the head-and-neck region

    NASA Astrophysics Data System (ADS)

    Leavens, Claudia; Vik, Torbjørn; Schulz, Heinrich; Allaire, Stéphane; Kim, John; Dawson, Laura; O'Sullivan, Brian; Breen, Stephen; Jaffray, David; Pekar, Vladimir

    2008-03-01

    Manual contouring of target volumes and organs at risk in radiation therapy is extremely time-consuming, in particular for treating the head-and-neck area, where a single patient treatment plan can take several hours to contour. As radiation treatment delivery moves towards adaptive treatment, the need for more efficient segmentation techniques will increase. We are developing a method for automatic model-based segmentation of the head and neck. This process can be broken down into three main steps: i) automatic landmark identification in the image dataset of interest, ii) automatic landmark-based initialization of deformable surface models to the patient image dataset, and iii) adaptation of the deformable models to the patient-specific anatomical boundaries of interest. In this paper, we focus on the validation of the first step of this method, quantifying the results of our automatic landmark identification method. We use an image atlas formed by applying thin-plate spline (TPS) interpolation to ten atlas datasets, using 27 manually identified landmarks in each atlas/training dataset. The principal variation modes returned by principal component analysis (PCA) of the landmark positions were used by an automatic registration algorithm, which sought the corresponding landmarks in the clinical dataset of interest using a controlled random search algorithm. Applying a run time of 60 seconds to the random search, a root mean square (rms) distance to the ground-truth landmark position of 9.5 ± 0.6 mm was calculated for the identified landmarks. Automatic segmentation of the brain, mandible and brain stem, using the detected landmarks, is demonstrated.

  1. Clustering Single-Cell Expression Data Using Random Forest Graphs.

    PubMed

    Pouyan, Maziyar Baran; Nourani, Mehrdad

    2017-07-01

    Complex tissues such as brain and bone marrow are made up of multiple cell types. As the study of biological tissue structure progresses, the role of cell-type-specific research becomes increasingly important. Novel sequencing technology such as single-cell cytometry provides researchers access to valuable biological data. Applying machine-learning techniques to these high-throughput datasets provides deep insights into the cellular landscape of the tissue that those cells are part of. In this paper, we propose the use of random-forest-based single-cell profiling, a new machine-learning-based technique, to profile different cell types of intricate tissues using single-cell cytometry data. Our technique utilizes random forests to capture cell marker dependencies and model the cellular populations using the cell network concept. This cellular network helps us discover what cell types are present in the tissue. Our experimental results on public-domain datasets indicate promising performance and accuracy of our technique in extracting cell populations of complex tissues.
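
    One common way to let a random forest capture marker dependencies for clustering is Breiman-style unsupervised random forests: contrast the real cells against a column-permuted copy (which destroys inter-marker dependencies), then cluster on the forest's leaf-sharing proximity. This is a generic sketch of that idea, not the authors' algorithm; all names and parameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def rf_proximity_clustering(X, n_clusters=2, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic class: each marker column independently permuted, which
    # destroys the dependencies between markers that the forest then learns.
    X_synth = np.column_stack([rng.permutation(col) for col in X.T])
    Xy = np.vstack([X, X_synth])
    y = np.r_[np.ones(len(X)), np.zeros(len(X))]
    rf = RandomForestClassifier(n_estimators=n_trees, min_samples_leaf=5,
                                random_state=seed).fit(Xy, y)
    # Proximity = fraction of trees in which two cells share a leaf.
    leaves = rf.apply(X)
    prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    dist = 1.0 - prox
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```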

  2. Multiband tangent space mapping and feature selection for classification of EEG during motor imagery.

    PubMed

    Islam, Md Rabiul; Tanaka, Toshihisa; Molla, Md Khademul Islam

    2018-05-08

    When designing a multiclass motor imagery-based brain-computer interface (MI-BCI), the so-called tangent space mapping (TSM) method, which utilizes the geometric structure of covariance matrices, is an effective technique. This paper introduces a TSM-based method for finding the operational frequency bands of the brain activities associated with MI tasks. A multichannel electroencephalogram (EEG) signal is decomposed into multiple subbands, and tangent features are then estimated on each subband. An effective algorithm based on mutual information analysis is implemented to select subbands containing features capable of improving motor imagery classification accuracy. The features thus obtained from the selected subbands are combined to form the feature space. A principal component analysis-based approach is employed to reduce the feature dimensionality, and classification is then accomplished by a support vector machine (SVM). Offline analysis demonstrates that the proposed multiband tangent space mapping with subband selection (MTSMS) approach outperforms state-of-the-art methods. It achieves the highest average classification accuracy for all datasets (BCI competition datasets 2a, IIIa and IIIb, and dataset JK-HH1). The increased classification accuracy of MI tasks with the proposed MTSMS approach can yield effective implementations of BCI. The mutual information-based subband selection method is implemented to tune the operational frequency bands to represent actual motor imagery tasks.
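
    The core tangent-space step can be sketched as follows: whiten each covariance matrix by a reference point on the SPD manifold, take the matrix logarithm, and vectorize the upper triangle. A minimal sketch under standard Riemannian-geometry conventions (the function name and the choice of reference are assumptions, not the paper's exact pipeline):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def tangent_space(cov, cov_ref):
    # Whiten the covariance matrix by the reference point, take the
    # matrix logarithm, and vectorize the upper triangle (off-diagonal
    # terms scaled by sqrt(2) so the Euclidean norm is preserved).
    ref_inv_sqrt = fractional_matrix_power(cov_ref, -0.5)
    s = np.real(logm(ref_inv_sqrt @ cov @ ref_inv_sqrt))
    idx = np.triu_indices_from(s)
    w = np.where(idx[0] == idx[1], 1.0, np.sqrt(2.0))
    return s[idx] * w
```

    Mapping the reference matrix itself yields the zero vector, as expected for a tangent space anchored at that point.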

  3. Digital Elevation Models of the Pre-Eruption 2000 Crater and 2004-07 Dome-Building Eruption at Mount St. Helens, Washington, USA

    USGS Publications Warehouse

    Messerich, J.A.; Schilling, S.P.; Thompson, R.A.

    2008-01-01

    Presented in this report are 27 digital elevation model (DEM) datasets for the crater area of Mount St. Helens. These datasets include pre-eruption baseline data collected in 2000, incremental model subsets collected during the 2004-07 dome-building eruption, and associated shaded-relief image datasets. Each dataset was collected photogrammetrically with digital softcopy methods employing a combination of manual collection and iterative compilation of x,y,z coordinate triplets utilizing autocorrelation techniques. DEM data points collected using autocorrelation methods were rigorously edited in stereo and manually corrected to ensure conformity with the ground surface. Data were first collected as a triangulated irregular network (TIN) and then interpolated to a grid format. DEM data are based on aerotriangulated photogrammetric solutions for aerial photograph strips flown at a nominal scale of 1:12,000 using a combination of surveyed ground control and photograph-identified control points. The 2000 DEM is based on aerotriangulation of four strips totaling 31 photographs. Subsequent DEMs collected during the course of the eruption are based on aerotriangulation of single aerial photograph strips consisting of between three and seven 1:12,000-scale photographs (two to six stereo pairs). Most datasets were based on three or four stereo pairs. Photogrammetric errors associated with each dataset are presented along with the ground control used in the photogrammetric aerotriangulation. The temporal increase in the area of deformation in the crater as a result of dome growth, deformation, and translation of glacial ice resulted in continual adoption of new ground control points and abandonment of others during the course of the eruption. Additionally, seasonal snow cover precluded the consistent use of some ground control points.

  4. Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies.

    PubMed

    Thorsen, Jonathan; Brejnrod, Asker; Mortensen, Martin; Rasmussen, Morten A; Stokholm, Jakob; Al-Soud, Waleed Abu; Sørensen, Søren; Bisgaard, Hans; Waage, Johannes

    2016-11-25

    There is an immense scientific interest in the human microbiome and its effects on human physiology, health, and disease. A common approach for examining bacterial communities is high-throughput sequencing of 16S rRNA gene hypervariable regions, aggregating sequence-similar amplicons into operational taxonomic units (OTUs). Strategies for detecting differential relative abundance of OTUs between sample conditions include classical statistical approaches as well as a plethora of newer methods, many borrowing from the related field of RNA-seq analysis. This effort is complicated by unique data characteristics, including sparsity, sequencing depth variation, and nonconformity of read counts to theoretical distributions, which is often exacerbated by exploratory and/or unbalanced study designs. Here, we assess the robustness of available methods for (1) inference in differential relative abundance analysis and (2) beta-diversity-based sample separation, using a rigorous benchmarking framework based on large clinical 16S microbiome datasets from different sources. Running more than 380,000 full differential relative abundance tests on real datasets with permuted case/control assignments and in silico-spiked OTUs, we identify large differences in method performance on a range of parameters, including false positive rates, sensitivity to sparsity and case/control balances, and spike-in retrieval rate. In large datasets, methods with the highest false positive rates also tend to have the best detection power. For beta-diversity-based sample separation, we show that library size normalization has very little effect and that the distance metric is the most important factor in terms of separation power. Our results, generalizable to datasets from different sequencing platforms, demonstrate how the choice of method considerably affects analysis outcome. Here, we give recommendations for tools that exhibit low false positive rates, have good retrieval power across effect sizes and case/control proportions, and have low sparsity bias. Result output from some commonly used methods should be interpreted with caution. We provide an easily extensible framework for benchmarking of new methods and future microbiome datasets.
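
    The permuted case/control benchmarking idea can be sketched generically: shuffle the labels so no true signal exists, run a per-OTU test, and record the fraction of OTUs called significant, which is then a false positive rate. This sketch uses a Mann-Whitney U test as a stand-in for the many methods benchmarked in the paper; the function name and parameters are illustrative:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def permutation_fpr(counts, n_perm=100, alpha=0.05, seed=0):
    # counts: (n_samples, n_otus) abundance matrix. Case/control labels
    # are randomly permuted so that no true signal exists; any OTU
    # called significant is therefore a false positive.
    rng = np.random.default_rng(seed)
    n, n_otus = counts.shape
    base = np.r_[np.zeros(n // 2), np.ones(n - n // 2)]
    fprs = []
    for _ in range(n_perm):
        labels = rng.permutation(base)
        a, b = counts[labels == 0], counts[labels == 1]
        pvals = np.array([mannwhitneyu(a[:, j], b[:, j]).pvalue
                          for j in range(n_otus)])
        fprs.append(np.mean(pvals < alpha))
    return float(np.mean(fprs))
```

    For a well-calibrated test, the returned rate should be at or below the nominal alpha.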

  5. 101 Labeled Brain Images and a Consistent Human Cortical Labeling Protocol

    PubMed Central

    Klein, Arno; Tourville, Jason

    2012-01-01

    We introduce the Mindboggle-101 dataset, the largest and most complete set of free, publicly accessible, manually labeled human brain images. To manually label the macroscopic anatomy in magnetic resonance images of 101 healthy participants, we created a new cortical labeling protocol that relies on robust anatomical landmarks and minimal manual edits after initialization with automated labels. The “Desikan–Killiany–Tourville” (DKT) protocol is intended to improve the ease, consistency, and accuracy of labeling human cortical areas. Given how difficult it is to label brains, the Mindboggle-101 dataset is intended to serve as a set of brain atlases for use in labeling other brains, as a normative dataset establishing morphometric variation in a healthy population for comparison against clinical populations, and as a contribution to the development, training, testing, and evaluation of automated registration and labeling algorithms. To this end, we also introduce benchmarks for the evaluation of such algorithms by comparing our manual labels with labels automatically generated by probabilistic and multi-atlas registration-based approaches. All data, related software, and updated information are available on the http://mindboggle.info/data website. PMID:23227001

  6. A modified active appearance model based on an adaptive artificial bee colony.

    PubMed

    Abdulameer, Mohammed Hasan; Sheikh Abdullah, Siti Norul Huda; Othman, Zulaiha Ali

    2014-01-01

    Active appearance model (AAM) is one of the most popular model-based approaches and has been used extensively to extract features by highly accurate modeling of human faces under various physical and environmental circumstances. However, in such an active appearance model, fitting the model to the original image is a challenging task. The state of the art shows that optimization methods are applicable to this problem, although applying optimization introduces its own difficulties. Hence, in this paper we propose an AAM-based face recognition technique that resolves the fitting problem of AAM by introducing a new adaptive artificial bee colony (ABC) algorithm. The adaptation increases the efficiency of fitting relative to the conventional ABC algorithm. We used three datasets in our experiments: the CASIA dataset, a proprietary 2.5D face dataset, and the UBIRIS v1 images dataset. The results reveal that the proposed face recognition technique performs effectively in terms of face recognition accuracy.

  7. Automatic brain tissue segmentation based on graph filter.

    PubMed

    Kong, Youyong; Chen, Xiaopeng; Wu, Jiasong; Zhang, Pinzheng; Chen, Yang; Shu, Huazhong

    2018-05-09

    Accurate segmentation of brain tissues from magnetic resonance imaging (MRI) is of significant importance in clinical applications and neuroscience research. Accurate segmentation is challenging due to tissue heterogeneity, which is caused by noise, bias field and partial volume effects. To overcome this limitation, this paper presents a novel algorithm for brain tissue segmentation based on supervoxels and a graph filter. Firstly, an effective supervoxel method is employed to generate supervoxels for the 3D MRI image. Secondly, the supervoxels are classified into different types of tissues based on filtering of graph signals. The performance is evaluated on the BrainWeb 18 dataset and the Internet Brain Segmentation Repository (IBSR) 18 dataset. The proposed method achieves a mean Dice similarity coefficient (DSC) of 0.94, 0.92 and 0.90 for the segmentation of white matter (WM), grey matter (GM) and cerebrospinal fluid (CSF) on the BrainWeb 18 dataset, and a mean DSC of 0.85, 0.87 and 0.57 for WM, GM and CSF on the IBSR 18 dataset. The proposed approach discriminates well among the different types of brain tissues in brain MRI images and has high potential for clinical applications.
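
    The Dice similarity coefficient reported above is a standard overlap measure for binary masks; a minimal reference implementation (generic, not tied to this paper's code):

```python
import numpy as np

def dice_coefficient(seg, truth):
    # DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks.
    seg = np.asarray(seg, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = seg.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(seg, truth).sum() / denom
```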

  8. A new method for detecting, quantifying and monitoring diffuse contamination

    NASA Astrophysics Data System (ADS)

    Fabian, Karl; Reimann, Clemens; de Caritat, Patrice

    2017-04-01

    A new method is presented for detecting and quantifying diffuse contamination at the regional to continental scale. It is based on the analysis of cumulative distribution functions (CDFs) in cumulative probability (CP) plots for spatially representative datasets, preferably containing >1000 samples. Simulations demonstrate how different types of contamination influence elemental CDFs of different sample media. Contrary to common belief, diffuse contamination does not result in exceedingly high element concentrations in regional- to continental-scale datasets. Instead it produces a distinctive shift of concentrations in the background distribution of the studied element, resulting in a steeper data distribution in the CP plot. Via either (1) comparing the distribution of an element in top soil samples to the distribution of the same element in bottom soil samples from the same area, taking soil forming processes into consideration, or (2) comparing the distribution of the contaminating element (e.g., Pb) to that of an element with geochemically comparable behaviour but no contamination source (e.g., Rb or Ba in the case of Pb), the relative impact of diffuse contamination on the element concentration can be estimated either graphically in the CP plot via a best-fit estimate or quantitatively via a Kolmogorov-Smirnov or Cramér-von Mises test. This is demonstrated using continental-scale geochemical soil datasets from Europe, Australia, and the USA, and a regional-scale dataset from Norway. Several different datasets from Europe deliver comparable results at regional to continental scales. The method is also suitable for monitoring diffuse contamination based on the statistical distribution of repeat datasets at the continental scale in a cost-effective manner.
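
    Comparison (1), top soil versus bottom soil, can be sketched with SciPy's two-sample Kolmogorov-Smirnov test; the wrapper below and its median-shift summary are illustrative simplifications, not the paper's quantification procedure:

```python
import numpy as np
from scipy.stats import ks_2samp

def diffuse_contamination_test(top_soil, bottom_soil, alpha=0.05):
    # Compare the CDF of an element in top-soil samples against the CDF
    # of the same element in bottom-soil samples from the same area.
    stat, p = ks_2samp(top_soil, bottom_soil)
    return {
        "ks_stat": float(stat),
        "p_value": float(p),
        # Median shift as a crude summary of the background displacement
        "median_shift": float(np.median(top_soil) - np.median(bottom_soil)),
        "distributions_differ": bool(p < alpha),
    }
```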

  9. Skimming Digits: Neuromorphic Classification of Spike-Encoded Images

    PubMed Central

    Cohen, Gregory K.; Orchard, Garrick; Leng, Sio-Hoi; Tapson, Jonathan; Benosman, Ryad B.; van Schaik, André

    2016-01-01

    The growing demands placed upon the field of computer vision have renewed the focus on alternative visual scene representations and processing paradigms. Silicon retinae provide an alternative means of imaging the visual environment, and produce frame-free spatio-temporal data. This paper presents an investigation into event-based digit classification using N-MNIST, a neuromorphic dataset created with a silicon retina, and the Synaptic Kernel Inverse Method (SKIM), a learning method based on principles of dendritic computation. As this work represents the first large-scale and multi-class classification task performed using the SKIM network, it explores different training patterns and output determination methods necessary to extend the original SKIM method to support multi-class problems. Making use of SKIM networks applied to real-world datasets, implementing the largest hidden layer sizes and simultaneously training the largest number of output neurons, the classification system achieved a best-case accuracy of 92.87% for a network containing 10,000 hidden layer neurons. These results represent the highest accuracies achieved against the dataset to date and serve to validate the application of the SKIM method to event-based visual classification tasks. Additionally, the study found that using a square pulse as the supervisory training signal produced the highest accuracy for most output determination methods, but the results also demonstrate that an exponential pattern is better suited to hardware implementations as it makes use of the simplest output determination method based on the maximum value. PMID:27199646

  10. BIANCA (Brain Intensity AbNormality Classification Algorithm): A new tool for automated segmentation of white matter hyperintensities.

    PubMed

    Griffanti, Ludovica; Zamboni, Giovanna; Khan, Aamira; Li, Linxin; Bonifacio, Guendalina; Sundaresan, Vaanathi; Schulz, Ursula G; Kuker, Wilhelm; Battaglini, Marco; Rothwell, Peter M; Jenkinson, Mark

    2016-11-01

    Reliable quantification of white matter hyperintensities of presumed vascular origin (WMHs) is increasingly needed, given the presence of these MRI findings in patients with several neurological and vascular disorders, as well as in elderly healthy subjects. We present BIANCA (Brain Intensity AbNormality Classification Algorithm), a fully automated, supervised method for WMH detection, based on the k-nearest neighbour (k-NN) algorithm. Relative to previous k-NN based segmentation methods, BIANCA offers different options for weighting the spatial information, local spatial intensity averaging, and different options for the choice of the number and location of the training points. BIANCA is multimodal and highly flexible so that the user can adapt the tool to their protocol and specific needs. We optimised and validated BIANCA on two datasets with different MRI protocols and patient populations (a "predominantly neurodegenerative" and a "predominantly vascular" cohort). BIANCA was first optimised on a subset of images for each dataset in terms of overlap and volumetric agreement with a manually segmented WMH mask. The correlation between the volumes extracted with BIANCA (using the optimised set of options), the volumes extracted from the manual masks and visual ratings showed that BIANCA is a valid alternative to manual segmentation. The optimised set of options was then applied to the whole cohorts and the resulting WMH volume estimates showed good correlations with visual ratings and with age. Finally, we performed a reproducibility test, to evaluate the robustness of BIANCA, and compared BIANCA performance against existing methods. Our findings suggest that BIANCA, which will be freely available as part of the FSL package, is a reliable method for automated WMH segmentation in large cross-sectional cohort studies. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
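
    The k-NN classification core with spatial weighting can be sketched generically with scikit-learn; this is a loose, minimal analogue of one BIANCA option, not the FSL implementation, and the function name, feature layout, and parameters are assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_lesion_segmentation(features_train, labels_train, features_test,
                            spatial_weight=1.0, k=5):
    # Feature columns: [intensities..., x, y, z]. Scaling the spatial
    # columns controls their influence on the k-NN distance, loosely
    # mirroring BIANCA's spatial-weighting option.
    ft = np.array(features_train, dtype=float)
    fs = np.array(features_test, dtype=float)
    ft[:, -3:] *= spatial_weight
    fs[:, -3:] *= spatial_weight
    clf = KNeighborsClassifier(n_neighbors=k).fit(ft, labels_train)
    return clf.predict_proba(fs)[:, 1]  # per-voxel lesion probability
```

    Thresholding the returned probabilities then gives a binary WMH mask.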

  11. Improving average ranking precision in user searches for biomedical research datasets

    PubMed Central

    Gobeill, Julien; Gaudinat, Arnaud; Vachon, Thérèse; Ruch, Patrick

    2017-01-01

    Abstract Availability of research datasets is a keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, +22.3% higher than the median infAP of the participants' best submissions. Overall, it is ranked in the top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed a positive impact on the system's performance, increasing our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance in different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have a significant impact on the system's performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data-driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results. Database URL: https://biocaddie.org/benchmark-data PMID:29220475
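
    Embedding-based query expansion typically adds each query term's nearest neighbours in embedding space. A minimal cosine-similarity sketch over a toy in-memory embedding table (the function name and the dict-of-vectors interface are illustrative assumptions, not the paper's system):

```python
import numpy as np

def expand_query(query_terms, embeddings, k=2):
    # embeddings: dict mapping term -> embedding vector. Each query term
    # contributes its k nearest neighbours by cosine similarity.
    vocab = list(embeddings)
    M = np.array([embeddings[t] for t in vocab], dtype=float)
    M /= np.linalg.norm(M, axis=1, keepdims=True)
    expanded = list(query_terms)
    for term in query_terms:
        if term not in embeddings:
            continue
        sims = M @ M[vocab.index(term)]
        neighbours = [vocab[i] for i in np.argsort(sims)[::-1]
                      if vocab[i] != term][:k]
        expanded += [t for t in neighbours if t not in expanded]
    return expanded
```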

  12. Dealing with AFLP genotyping errors to reveal genetic structure in Plukenetia volubilis (Euphorbiaceae) in the Peruvian Amazon

    PubMed Central

    Vašek, Jakub; Viehmannová, Iva; Ocelák, Martin; Cachique Huansi, Danter; Vejl, Pavel

    2017-01-01

    An analysis of the population structure and genetic diversity for any organism often depends on one or more molecular marker techniques. Nonetheless, these techniques are not absolutely reliable because of various sources of error arising during the genotyping process. Thus, a complex analysis of genotyping error was carried out with the AFLP method in 169 samples of the oil seed plant Plukenetia volubilis L. from small isolated subpopulations in the Peruvian Amazon. Samples were collected in nine localities from the region of San Martin. The analysis was done on eight datasets with a genotyping error from 0 to 5%. Using eleven primer combinations, 102 to 275 markers were obtained according to the dataset. It was found that the most reliable and robust results can only be obtained through a multiple-level filtering process. Genotyping error and software set-up influence both the estimation of population structure and genetic diversity; in our case the population number (K) varied between 2 and 9 depending on the dataset and statistical method used. Surprisingly, discrepancies in the K number were caused more by statistical approaches than by the genotyping errors themselves. However, for the estimation of genetic diversity, the degree of genotyping error was critical because the descriptive parameters (He, FST, PLP 5%) varied substantially (by at least 25%). Due to low gene flow, P. volubilis mostly consists of small isolated subpopulations (ΦPT = 0.252–0.323) with some degree of admixture given by socio-economic connectivity among the sites; a direct link between genetic and geographic distances was not confirmed. The study illustrates the successful application of AFLP to infer genetic structure in non-model plants. PMID:28910307

  13. Increasing conclusiveness of clinical breath analysis by improved baseline correction of multi capillary column - ion mobility spectrometry (MCC-IMS) data.

    PubMed

    Szymańska, Ewa; Tinnevelt, Gerjen H; Brodrick, Emma; Williams, Mark; Davies, Antony N; van Manen, Henk-Jan; Buydens, Lutgarde M C

    2016-08-05

    Current challenges of clinical breath analysis include large data size and non-clinically relevant variations observed in exhaled breath measurements, which should be urgently addressed with competent scientific data tools. In this study, three different baseline correction methods are evaluated within a previously developed data size reduction strategy for multi capillary column - ion mobility spectrometry (MCC-IMS) datasets. Introduced for the first time in breath data analysis, the Top-hat method is presented as the optimum baseline correction method. A refined data size reduction strategy is employed in the analysis of a large breathomic dataset on a healthy and respiratory disease population. New insights into MCC-IMS spectra differences associated with respiratory diseases are provided, demonstrating the additional value of the refined data analysis strategy in clinical breath analysis. Copyright © 2016 Elsevier B.V. All rights reserved.
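
    The Top-hat baseline correction named above is the morphological white top-hat transform: subtract the signal's grayscale opening from the signal itself, which removes any baseline varying more slowly than the structuring element. A minimal 1-D sketch with SciPy (the wrapper name and default element size are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import white_tophat

def tophat_baseline_correction(spectrum, structure_size=51):
    # The white top-hat transform (signal minus its morphological
    # opening) removes a slowly varying baseline while preserving
    # peaks narrower than the structuring element.
    return white_tophat(np.asarray(spectrum, dtype=float),
                        size=structure_size)
```

    The structuring element must be wider than the peaks of interest, otherwise the peaks themselves are flattened along with the baseline.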

  14. Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts.

    PubMed

    Dashtban, M; Balafar, Mohammadali

    2017-03-01

    Gene selection is a demanding task for microarray data analysis. The diverse complexity of different cancers makes this issue still challenging. In this study, a novel evolutionary method based on genetic algorithms and artificial intelligence is proposed to identify predictive genes for cancer classification. A filter method was first applied to reduce the dimensionality of the feature space, followed by an integer-coded genetic algorithm with a dynamic-length genotype, intelligent parameter settings, and modified operators. The algorithmic behaviors, including convergence trends, mutation and crossover rate changes, and running time, were studied, conceptually discussed, and shown to be coherent with literature findings. Two well-known filter methods, Laplacian and Fisher score, were examined considering their similarities, the quality of the selected genes, and their influence on the evolutionary approach. Several statistical tests concerning the choice of classifier, choice of dataset, and choice of filter method were performed, and they revealed significant differences between the performance of different classifiers and filter methods over the datasets. The proposed method was benchmarked on five popular high-dimensional cancer datasets; for each, the top explored genes were reported. Comparing the experimental results with several state-of-the-art methods revealed that the proposed method outperforms previous methods on the DLBCL dataset. Copyright © 2017 Elsevier Inc. All rights reserved.
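
    The overall filter-then-GA idea can be illustrated with a deliberately simplified integer-coded genetic algorithm (fixed-length genotype, plain elitist selection, cross-validated accuracy as fitness). This is a generic sketch, not the paper's dynamic-length algorithm; all names and parameters are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def ga_gene_selection(X, y, n_genes=5, pop_size=12, generations=8, seed=0):
    # Individuals are sets of gene indices; fitness is cross-validated
    # accuracy of a simple classifier restricted to those genes.
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]

    def fitness(idx):
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, idx], y, cv=3).mean()

    pop = [rng.choice(n_feat, n_genes, replace=False)
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = [pop[i] for i in order[:pop_size // 2]]  # elitist selection
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.choice(len(parents), 2, replace=False)
            genes = np.unique(np.r_[parents[a], parents[b]])  # crossover pool
            child = rng.choice(genes, min(n_genes, len(genes)), replace=False)
            if rng.random() < 0.3:  # mutation: swap in a random gene
                child[rng.integers(len(child))] = rng.integers(n_feat)
            children.append(np.unique(child))
        pop = parents + children
    return np.sort(max(pop, key=fitness))
```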

  15. Supervised Machine Learning for Population Genetics: A New Paradigm

    PubMed Central

    Schrider, Daniel R.; Kern, Andrew D.

    2018-01-01

    As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics. PMID:29331490

  16. Comparing ensemble learning methods based on decision tree classifiers for protein fold recognition.

    PubMed

    Bardsiri, Mahshid Khatibi; Eftekhari, Mahdi

    2014-01-01

    In this paper, some methods for ensemble learning of protein fold recognition based on decision trees (DTs) are compared and contrasted over three datasets taken from the literature. Following previously reported studies, the features of the datasets are divided into several groups. Then, for each of these groups, three ensemble classifiers, namely random forest, rotation forest and AdaBoost.M1, are employed. Also, some fusion methods are introduced for combining the ensemble classifiers obtained in the previous step. After this step, three classifiers are produced based on the combination of classifiers of types random forest, rotation forest and AdaBoost.M1. Finally, the three different classifiers thus obtained are combined to make an overall classifier. Experimental results show that the overall classifier obtained by the genetic algorithm (GA) weighting fusion method is the best in comparison to previously applied methods in terms of classification accuracy.
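
    The weighted-fusion step (with weights found, e.g., by a GA search) amounts to a weighted average of per-classifier class probabilities followed by an argmax. A minimal sketch; the function name and interface are illustrative, and the GA weight search itself is omitted:

```python
import numpy as np

def weighted_vote(prob_list, weights):
    # prob_list: one (n_samples, n_classes) probability matrix per
    # classifier; weights: one fusion weight per classifier (e.g. found
    # by a GA search). Returns the fused class predictions.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    combined = sum(wi * p for wi, p in zip(w, prob_list))
    return combined.argmax(axis=1)
```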

  17. Causal inference and the data-fusion problem

    PubMed Central

    Bareinboim, Elias; Pearl, Judea

    2016-01-01

    We review concepts, principles, and tools that unify current approaches to causal analysis and attend to new challenges presented by big data. In particular, we address the problem of data fusion—piecing together multiple datasets collected under heterogeneous conditions (i.e., different populations, regimes, and sampling methods) to obtain valid answers to queries of interest. The availability of multiple heterogeneous datasets presents new opportunities to big data analysts, because the knowledge that can be acquired from combined data would not be possible from any individual source alone. However, the biases that emerge in heterogeneous environments require new analytical tools. Some of these biases, including confounding, sampling selection, and cross-population biases, have been addressed in isolation, largely in restricted parametric models. We here present a general, nonparametric framework for handling these biases and, ultimately, a theoretical solution to the problem of data fusion in causal inference tasks. PMID:27382148

  18. A Spatiotemporal Aggregation Query Method Using a Multi-Thread Parallel Technique Based on Regional Division

    NASA Astrophysics Data System (ADS)

    Liao, S.; Chen, L.; Li, J.; Xiong, W.; Wu, Q.

    2015-07-01

    Existing spatiotemporal databases support spatiotemporal aggregation queries over massive moving-object datasets. Due to the large amount of data and single-threaded processing methods, the query speed cannot meet application requirements. On the other hand, query efficiency is more sensitive to spatial variation than to temporal variation. In this paper, we propose a spatiotemporal aggregation query method using a multi-thread parallel technique based on regional division and implement it on the server. Concretely, we divide the spatiotemporal domain into several spatiotemporal cubes, compute the spatiotemporal aggregation on all cubes using multi-thread parallel processing, and then integrate the query results. Tests and analysis on real datasets show that this method improves query speed significantly.
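
    The divide/aggregate/integrate pattern can be sketched with a thread pool: one worker per spatiotemporal cube, with a count aggregate as the example. All names and the tuple-based cube representation are illustrative assumptions, not the paper's system:

```python
from concurrent.futures import ThreadPoolExecutor

def aggregate_cube(points, cube):
    # Count the points that fall inside one spatiotemporal cube,
    # given as ((x0, x1), (y0, y1), (t0, t1)) bounds.
    (x0, x1), (y0, y1), (t0, t1) = cube
    return sum(1 for x, y, t in points
               if x0 <= x < x1 and y0 <= y < y1 and t0 <= t < t1)

def parallel_aggregate(points, cubes, workers=4):
    # Fan the per-cube aggregations out to a thread pool, then
    # integrate the partial results in cube order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: aggregate_cube(points, c), cubes))
```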

  19. The fine-scale genetic structure and evolution of the Japanese population.

    PubMed

    Takeuchi, Fumihiko; Katsuya, Tomohiro; Kimura, Ryosuke; Nabika, Toru; Isomura, Minoru; Ohkubo, Takayoshi; Tabara, Yasuharu; Yamamoto, Ken; Yokota, Mitsuhiro; Liu, Xuanyao; Saw, Woei-Yuh; Mamatyusupu, Dolikun; Yang, Wenjun; Xu, Shuhua; Teo, Yik-Ying; Kato, Norihiro

    2017-01-01

    The contemporary Japanese population largely consists of three genetically distinct groups: Hondo, Ryukyu and Ainu. By principal-component analysis, while the three groups can be clearly separated, the Hondo people, comprising 99% of the Japanese, form one almost indistinguishable cluster. To understand fine-scale genetic structure, we applied powerful haplotype-based statistical methods to genome-wide single-nucleotide polymorphism data from 1600 Japanese individuals, sampled from eight distinct regions in Japan. We then combined the Japanese data with data from 26 other Asian populations to analyze shared ancestry and genetic differentiation. We found that the Japanese could be separated into nine genetic clusters in our dataset, showing a marked concordance with geography, and that the major components of the ancestry profile of the Japanese were from the Korean and Han Chinese clusters. We also detected and dated admixture in the Japanese. While genetic differentiation between Ryukyu and Hondo was suggested to be caused in part by positive selection, genetic differentiation among the Hondo clusters appeared to result principally from genetic drift. Notably, in Asians, we found that positive selection may have accentuated genetic differentiation among distant populations but attenuated genetic differentiation among close populations. These findings are significant for studies of human evolution and medical genetics.

  20. Behavior Correlates of Post-Stroke Disability Using Data Mining and Infographics

    PubMed Central

    Yoon, Sunmoo; Gutierrez, Jose

    2015-01-01

    Purpose Disability is a potential risk for stroke survivors. This study aims to identify disability risk factors associated with stroke, and their relative importance and relationships, from a national behavioral risk factor dataset. Methods Data on post-stroke individuals in the U.S. (n=19,603), including 397 variables, were extracted from a publicly available national dataset and analyzed. Data mining algorithms, including C4.5 and linear regression with the M5 method, were applied to build association models for post-stroke disability using Weka software. The relative importance and relationships of 70 variables associated with disability are presented in infographics so that clinicians can understand them easily. Results Fifty-five percent of post-stroke patients experience disability. Exercise, employment and satisfaction with life were relatively important factors associated with disability among stroke patients. Modifiable behavior factors strongly associated with disability include exercise (OR: 0.46, P<0.01) and good rest (OR: 0.37, P<0.01). Conclusions Data mining is a promising approach for discovering factors associated with post-stroke disability from a large population dataset. The findings can be valuable for establishing priorities for clinicians and researchers and for stroke patient education. The methods may generalize to other health conditions. PMID:26835413
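
    Ranking variables by relative importance with a decision tree can be sketched with scikit-learn's CART implementation (an analogue of the C4.5 algorithm named above, not the study's Weka pipeline); the function name and toy feature names are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rank_disability_factors(X, y, feature_names, top=5):
    # Fit a CART tree to the disability outcome and rank variables
    # by impurity-based feature importance.
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
    imp = tree.feature_importances_
    order = np.argsort(imp)[::-1][:top]
    return [(feature_names[i], float(imp[i])) for i in order]
```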

  1. Development of a global historic monthly mean precipitation dataset

    NASA Astrophysics Data System (ADS)

    Yang, Su; Xu, Wenhui; Xu, Yan; Li, Qingxiang

    2016-04-01

A global historic precipitation dataset is fundamental to climate and water cycle research. Several global historic land-surface precipitation datasets have been developed by international data centers such as the US National Climatic Data Center (NCDC), the European Climate Assessment & Dataset project team, and the Met Office, but so far no such dataset has been developed by any research institute in China. In addition, each dataset has its own regional focus, and the existing global precipitation datasets contain only sparse observational stations over China, which may introduce uncertainties into East Asian precipitation studies. To take comprehensive historic information into account, users might need to employ two or more datasets; however, non-uniform data formats, data units, station IDs, and so on add extra difficulties for users exploiting these datasets. For this reason, a complete historic precipitation dataset that takes advantage of the various existing datasets has been developed and produced at the National Meteorological Information Center of China. Precipitation observations from 12 sources are aggregated, and the data formats, data units, and station IDs are unified. Duplicated stations with the same ID are identified, and duplicated observations are removed. A consistency test, a correlation coefficient test, a significance t-test at the 95% confidence level, and a significance F-test at the 95% confidence level are conducted first to ensure data reliability; only those datasets that satisfy all four criteria are integrated to produce the China Meteorological Administration global precipitation (CGP) historic precipitation dataset version 1.0. It contains observations at 31 thousand stations with 1.87 × 10^7 data records, among which 4152 precipitation time series are longer than 100 yr.
This dataset plays a critical role in climate research because of its large data volume and high station-network density compared to other datasets. Using the penalized maximal t-test method, significant inhomogeneities have been detected in the historic precipitation series at 340 stations; the ratio method is then employed to remove these change points. Global precipitation analysis based on CGP v1.0 shows that rainfall increased during 1901-2013 at a rate of 3.52 ± 0.5 mm (10 yr)^-1, slightly higher than the rate in the NCDC data. The analysis also reveals distinct long-term trends at different latitude zones.
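    The four-test screening step described above can be sketched as follows. The thresholds (e.g. the 0.8 correlation cut-off) and the exact test formulations are illustrative assumptions, not the CGP criteria.

```python
# Sketch of a four-gate quality screen: a candidate station series is merged
# only if it passes consistency, correlation, t-test, and F-test checks
# against a reference series. All thresholds here are illustrative.
import numpy as np
from scipy import stats

def passes_all_gates(candidate, reference, alpha=0.05):
    """True only if the candidate series passes all four screening tests."""
    consistent = np.isfinite(candidate).all()                    # consistency
    r, p_r = stats.pearsonr(candidate, reference)
    correlated = (r > 0.8) and (p_r < alpha)                     # correlation
    mean_ok = stats.ttest_ind(candidate, reference).pvalue > alpha  # means agree
    f = np.var(candidate, ddof=1) / np.var(reference, ddof=1)    # variance ratio
    df = len(candidate) - 1
    var_ok = 2 * min(stats.f.cdf(f, df, df), stats.f.sf(f, df, df)) > alpha
    return bool(consistent and correlated and mean_ok and var_ok)

rng = np.random.default_rng(0)
reference = rng.normal(100.0, 20.0, 120)     # reference monthly series
noise = rng.normal(0.0, 5.0, 120)
good = reference + (noise - noise.mean())    # near-duplicate candidate
bad = rng.normal(100.0, 20.0, 120)           # similar statistics, uncorrelated
```

    The uncorrelated candidate fails at the correlation gate even though its mean and variance match, which is exactly why the correlation test is needed alongside the t- and F-tests.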

  2. Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains.

    PubMed

    Bulashevska, Alla; Eils, Roland

    2006-06-14

The subcellular location of a protein is closely related to its function. It would therefore be worthwhile to develop a method to predict the subcellular location of a protein when only its amino acid sequence is known. Although many efforts have been made to predict subcellular location from sequence information alone, further research is needed to improve prediction accuracy. A novel method called HensBC is introduced to predict protein subcellular location. HensBC is a recursive algorithm which constructs a hierarchical ensemble of classifiers; the classifiers used are Bayesian classifiers based on Markov chain models. We tested our method on six different datasets, among them a Gram-negative bacteria dataset, a dataset for discriminating outer membrane proteins, and an apoptosis proteins dataset. We observed that our method can predict subcellular location with high accuracy. Another advantage of the proposed method is that it can improve prediction accuracy for classes with few training sequences and is therefore useful for datasets with an imbalanced distribution of classes. This study introduces an algorithm which uses only the primary sequence of a protein to predict its subcellular location. The proposed recursive scheme represents an interesting methodology for learning and combining classifiers. As the empirical results indicate, the method is computationally efficient and competitive with previously reported approaches in terms of prediction accuracy. The code for the software is available upon request.
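    A minimal sketch of the kind of component classifier HensBC combines: per-class first-order Markov chains over amino-acid transitions with Laplace smoothing, classifying a query by log-likelihood. The training sequences below are invented toy data, and HensBC's hierarchical ensemble construction is not reproduced.

```python
# Per-class first-order Markov chain classifier for amino-acid sequences:
# fit one transition matrix per class, predict the class whose chain gives
# the query the highest log-likelihood. Training data are toy sequences.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AA)}

def fit_chain(seqs, alpha=1.0):
    """Transition probability matrix with Laplace (add-alpha) smoothing."""
    counts = np.full((20, 20), alpha)
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[IDX[a], IDX[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_lik(seq, T):
    return sum(np.log(T[IDX[a], IDX[b]]) for a, b in zip(seq, seq[1:]))

train = {"membrane": ["LLLVVLLAVL", "VLLLAVVLLL"],      # hydrophobic toy class
         "cytoplasm": ["KDEEKDKEDK", "EKKDDEEKKD"]}     # charged toy class
chains = {c: fit_chain(s) for c, s in train.items()}
query = "LLAVLLVLLA"
pred = max(chains, key=lambda c: log_lik(query, chains[c]))
```

    A Bayesian variant would add a log class-prior to each log-likelihood before taking the maximum.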

  3. High resolution global gridded data for use in population studies

    NASA Astrophysics Data System (ADS)

    Lloyd, Christopher T.; Sorichetta, Alessandro; Tatem, Andrew J.

    2017-01-01

Recent years have seen substantial growth in openly available satellite and other geospatial data layers, which represent a range of metrics relevant to global human population mapping at fine spatial scales. The specifications of such data differ widely and therefore the harmonisation of data layers is a prerequisite to constructing detailed and contemporary spatial datasets which accurately describe population distributions. Such datasets are vital to measure impacts of population growth, monitor change, and plan interventions. To this end, the WorldPop Project has produced an open access archive of 3 and 30 arc-second resolution gridded data. Four tiled raster datasets form the basis of the archive: (i) Viewfinder Panoramas topography clipped to Global ADMinistrative area (GADM) coastlines; (ii) a matching ISO 3166 country identification grid; (iii) country area; and (iv) a slope layer. Further layers include transport networks, landcover, nightlights, precipitation, travel time to major cities, and waterways. The datasets and production methodology are described here. The archive can be downloaded both from the WorldPop Dataverse Repository and the WorldPop Project website.

  4. High resolution global gridded data for use in population studies.

    PubMed

    Lloyd, Christopher T; Sorichetta, Alessandro; Tatem, Andrew J

    2017-01-31

Recent years have seen substantial growth in openly available satellite and other geospatial data layers, which represent a range of metrics relevant to global human population mapping at fine spatial scales. The specifications of such data differ widely and therefore the harmonisation of data layers is a prerequisite to constructing detailed and contemporary spatial datasets which accurately describe population distributions. Such datasets are vital to measure impacts of population growth, monitor change, and plan interventions. To this end, the WorldPop Project has produced an open access archive of 3 and 30 arc-second resolution gridded data. Four tiled raster datasets form the basis of the archive: (i) Viewfinder Panoramas topography clipped to Global ADMinistrative area (GADM) coastlines; (ii) a matching ISO 3166 country identification grid; (iii) country area; and (iv) a slope layer. Further layers include transport networks, landcover, nightlights, precipitation, travel time to major cities, and waterways. The datasets and production methodology are described here. The archive can be downloaded both from the WorldPop Dataverse Repository and the WorldPop Project website.
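    Harmonising layers across the archive's two resolutions is, at its simplest, block aggregation: one 30 arc-second cell covers a 10 × 10 block of 3 arc-second cells. A numpy sketch on a synthetic grid (the archive's own processing chain involves more than this):

```python
# Aggregate a 3 arc-second grid to 30 arc-second resolution by 10x10
# block averaging; the input array is synthetic.
import numpy as np

fine = np.arange(100 * 100, dtype=float).reshape(100, 100)   # toy 3" grid
f = 10                                                       # 30" / 3" = 10
coarse = fine.reshape(100 // f, f, 100 // f, f).mean(axis=(1, 3))
# For count-like layers such as population, aggregate with .sum() instead
# of .mean() so that totals are conserved across resolutions.
```

    The reshape groups each 10 × 10 block along two axes, so the mean over those axes is the block average.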

  5. Quantifying Cartilage Contact Modulus, Tension Modulus, and Permeability With Hertzian Biphasic Creep

    PubMed Central

    Moore, A. C.; DeLucca, J. F.; Elliott, D. M.; Burris, D. L.

    2016-01-01

This paper describes a new method, based on a recent analytical model (Hertzian biphasic theory (HBT)), to simultaneously quantify cartilage contact modulus, tension modulus, and permeability. Standard Hertzian creep measurements were performed on 13 osteochondral samples from three mature bovine stifles. Each creep dataset was fit for material properties using HBT. A subset of the dataset (N = 4) was also fit using Oyen's method and FEBio, an open-source finite element package designed for soft tissue mechanics. The HBT method demonstrated statistically significant sensitivity to differences between cartilage from the tibial plateau and cartilage from the femoral condyle. Based on the four samples used for comparison, no statistically significant differences were detected between properties from the HBT and FEBio methods. While the finite element method is considered the gold standard for analyzing this type of contact, the expertise and time required to set up and solve the models can be prohibitive, especially for large datasets. The HBT method agreed quantitatively with FEBio but also offers ease of use by nonexperts, rapid solutions, and exceptional fit quality (R2 = 0.999 ± 0.001, N = 13). PMID:27536012
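    The HBT fitting step is, computationally, nonlinear least-squares fitting of a creep curve. The sketch below fits a generic exponential creep law to synthetic data; it is not the HBT solution itself (whose closed form, given in the paper, maps fitted parameters to contact modulus, tension modulus, and permeability).

```python
# Fit a generic exponential creep law to synthetic indentation-depth data;
# stands in for the model-fitting step, NOT for Hertzian biphasic theory.
import numpy as np
from scipy.optimize import curve_fit

def creep(t, d0, d_inf, tau):
    """Instantaneous depth d0 rising toward equilibrium depth d_inf."""
    return d_inf - (d_inf - d0) * np.exp(-t / tau)

t = np.linspace(0.0, 600.0, 200)                    # seconds
rng = np.random.default_rng(1)
data = creep(t, 2.0, 10.0, 120.0) + rng.normal(0.0, 0.05, t.size)
popt, _ = curve_fit(creep, t, data, p0=[1.0, 8.0, 60.0])
d0_fit, d_inf_fit, tau_fit = popt
```

    With low measurement noise the fit recovers the generating parameters closely, which is what the paper's very high R² values reflect for its own model.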

  6. mRAISE: an alternative algorithmic approach to ligand-based virtual screening

    NASA Astrophysics Data System (ADS)

    von Behren, Mathias M.; Bietz, Stefan; Nittinger, Eva; Rarey, Matthias

    2016-08-01

Ligand-based virtual screening is a well-established method for finding new lead molecules in today's drug discovery process. To be applicable in day-to-day practice, such methods have to meet multiple challenges. Most important is the reliability of the results, which can be shown and compared in retrospective studies. Furthermore, in the case of 3D methods, they need to provide biologically relevant molecular alignments of the ligands that can be further investigated by a medicinal chemist. Last but not least, they have to be able to screen large databases in reasonable time. Many algorithms for ligand-based virtual screening have been proposed in the past, most of them based on pairwise comparisons. Here, a new method called mRAISE is introduced. Based on structural alignments, it uses a descriptor-based bitmap search engine (RAISE) to achieve efficiency. Alignments created on the fly by the search engine are evaluated with an independent shape-based scoring function that is also used for ranking compounds. The ranking as well as the alignment quality of the method are evaluated and compared to other state-of-the-art methods. On the commonly used Directory of Useful Decoys dataset, mRAISE achieves an average area under the ROC curve of 0.76, an average enrichment factor at 1% of 20.2, and an average hit rate at 1% of 55.5. With these results, mRAISE is always among the top-performing methods for which data are available for comparison. To assess the quality of the alignments calculated by ligand-based virtual screening methods, we introduce a new dataset containing 180 prealigned ligands for 11 diverse targets. Within the top ten ranked conformations, the alignment closest to the X-ray structure calculated with mRAISE has a root-mean-square deviation of less than 2.0 Å for 80.8% of alignment pairs and achieves a median of less than 2.0 Å in eight of the 11 cases.
The dataset used to rate the quality of the calculated alignments is freely available at http://www.zbh.uni-hamburg.de/mraise-dataset.html. The table of all PDB codes contained in the ensembles can be found in the supplementary material. The software tool mRAISE is freely available for evaluation purposes and academic use (see http://www.zbh.uni-hamburg.de/raise).
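    The retrospective metrics quoted above, enrichment factor and hit rate at 1%, can be computed from any ranked screening list. A sketch on toy scores (the active/decoy proportions and score distributions are invented):

```python
# Enrichment factor and hit rate in the top 1% of a ranked screening list.
# Toy data: 5% actives whose scores are shifted upward relative to decoys.
import numpy as np

def ef_and_hit_rate(scores, is_active, frac=0.01):
    """EF and hit rate within the top `frac` of the score-ranked list."""
    order = np.argsort(scores)[::-1]               # best-scored first
    n_top = max(1, int(round(frac * len(scores))))
    hits = is_active[order][:n_top].sum()
    hit_rate = hits / n_top
    return hit_rate / is_active.mean(), hit_rate   # EF = hit rate / base rate

rng = np.random.default_rng(0)
is_active = np.zeros(1000, dtype=bool)
is_active[:50] = True                              # 5% actives
scores = rng.normal(0.0, 1.0, 1000) + 2.0 * is_active
ef, hr = ef_and_hit_rate(scores, is_active)
```

    The enrichment factor is bounded above by the reciprocal of the active fraction, which is why reported EF values must always be read against the dataset's composition.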

  7. The interplay between human population dynamics and flooding in Bangladesh: a spatial analysis

    NASA Astrophysics Data System (ADS)

    di Baldassarre, G.; Yan, K.; Ferdous, MD. R.; Brandimarte, L.

    2014-09-01

In Bangladesh, socio-economic and hydrological processes are both extremely dynamic and inter-related. Human population patterns are often explained as a response, or adaptation strategy, to physical events, e.g. flooding, salt-water intrusion, and erosion. Meanwhile, these physical processes are exacerbated, or mitigated, by diverse human interventions, e.g. river diversion, levees, and polders. In this context, this paper describes an attempt to explore the complex interplay between floods and societies in Bangladeshi floodplains. In particular, we performed a spatially distributed analysis of the interactions between the dynamics of human settlements and flood inundation patterns. To this end, we used flood simulation results from the LISFLOOD-FP inundation model as well as global population distribution datasets, namely the Gridded Population of the World (20 years, from 1990 to 2010) and HYDE (310 years, from 1700 to 2010). The outcomes of this work highlight the behaviour of Bangladeshi floodplains as complex human-water systems and indicate the need to go beyond traditional narratives based on one-way cause-effect relationships, e.g. climate change leading to migration.

  8. A Bayesian trans-dimensional approach for the fusion of multiple geophysical datasets

    NASA Astrophysics Data System (ADS)

    JafarGandomi, Arash; Binley, Andrew

    2013-09-01

We propose a Bayesian fusion approach to integrate multiple geophysical datasets with different coverage and sensitivity. The fusion strategy is based on the capability of various geophysical methods to provide enough resolution to identify either subsurface material parameters or subsurface structure, or both. We focus on electrical resistivity as the target material parameter and electrical resistivity tomography (ERT), electromagnetic induction (EMI), and ground penetrating radar (GPR) as the set of geophysical methods; however, extending the approach to different sets of geophysical parameters and methods is straightforward. The different geophysical datasets are entered into a trans-dimensional Markov chain Monte Carlo (McMC) search-based joint inversion algorithm. The trans-dimensional property of the McMC algorithm allows dynamic parameterisation of the model space, which in turn helps to avoid bias of the post-inversion results towards a particular model. Given that we are attempting to develop an approach that has practical potential, we discretize the subsurface into an array of one-dimensional earth models. Accordingly, the ERT data, which are collected using a two-dimensional acquisition geometry, are recast as a set of equivalent vertical electric soundings. Different data are inverted either individually or jointly to estimate one-dimensional subsurface models at discrete locations. We use Shannon's information measure to quantify the information obtained from the inversion of different combinations of geophysical datasets. Information from multiple methods is brought together by introducing a joint likelihood function and/or by constraining the prior information. A Bayesian maximum entropy approach is used for spatial fusion of the spatially dispersed one-dimensional model estimates and for mapping of the target parameter. We illustrate the approach with a synthetic dataset and then apply it to a field dataset.
We show that the proposed fusion strategy is successful not only in enhancing subsurface information but also as a survey design tool, identifying the appropriate combination of geophysical tools and showing whether application of an individual method for further investigation of a specific site is beneficial.
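    Shannon's information measure used above can be illustrated as the entropy drop from prior to posterior over a discretised resistivity axis. The distributions below are toy stand-ins for real McMC output:

```python
# Information gained by an inversion, in bits: entropy of a flat prior
# minus entropy of a peaked posterior over discretised resistivity values.
import numpy as np

def entropy_bits(p):
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

bins = 50
prior = np.ones(bins)                                # uninformative prior
x = np.arange(bins)
posterior = np.exp(-0.5 * ((x - 20) / 3.0) ** 2)     # peaked after inversion
gain_bits = entropy_bits(prior) - entropy_bits(posterior)
```

    A joint inversion that narrows the posterior further would show a larger gain, which is how the measure can rank combinations of methods for survey design.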

  9. Using mixture tuned match filtering to measure changes in subpixel vegetation area in Las Vegas, Nevada

    NASA Astrophysics Data System (ADS)

    Brelsford, Christa; Shepherd, Doug

    2013-09-01

In desert cities, securing sufficient water supply to meet the needs of both the existing population and future growth is a complex problem with few easy solutions. Grass lawns are a major driver of water consumption, and accurate measurements of vegetation area are necessary to understand drivers of changes in household water consumption. Measuring vegetation change in a heterogeneous urban environment requires sub-pixel estimation of vegetation area. Mixture tuned match filtering (MTMF) has been successfully applied to target detection for materials that cover only small portions of a satellite image pixel, yet there have been few successful applications of MTMF to fractional area estimation, despite theory that suggests feasibility. We use a ground truth dataset over ten times larger than that available for any previous MTMF application to estimate the bias between ground truth data and matched filter results. We find that the MTMF algorithm underestimates the fractional area of vegetation by 5-10%, and calculate that averaging over 20 to 30 pixels is necessary to correct this bias. We conclude that, with a large ground truth dataset, using MTMF for fractional area estimation is possible when results can be estimated at a lower spatial resolution than the base image. When this method is applied to estimating vegetation area in Las Vegas, NV, spatial and temporal trends are consistent with expectations from known population growth and policy goals.
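    The bias-correction-by-averaging logic can be sketched on synthetic data: per-pixel fractions carrying a constant ~7% underestimate plus random noise are corrected with a ground-truth-derived offset and averaged over 25-pixel blocks. The 5-10% bias magnitude comes from the abstract; everything else here is invented.

```python
# Correct a constant underestimation bias in per-pixel fraction estimates,
# then average over 5x5 (25-pixel) blocks to beat down the remaining noise.
import numpy as np

rng = np.random.default_rng(0)
truth = rng.uniform(0.0, 1.0, (200, 200))                # true vegetation fraction
est = truth - 0.07 + rng.normal(0.0, 0.1, truth.shape)   # biased, noisy estimates

bias = (est - truth).mean()               # estimated from ground-truth pixels
corrected = est - bias
k = 5                                     # 5 x 5 = 25-pixel neighbourhoods
block_est = corrected.reshape(40, k, 40, k).mean(axis=(1, 3))
block_truth = truth.reshape(40, k, 40, k).mean(axis=(1, 3))
rmse_pixel = np.sqrt(((corrected - truth) ** 2).mean())
rmse_block = np.sqrt(((block_est - block_truth) ** 2).mean())
```

    Averaging n independent pixels shrinks the noise by roughly 1/sqrt(n), which is the sense in which accuracy is traded for spatial resolution.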

  10. A hybrid approach for fusing 4D-MRI temporal information with 3D-CT for the study of lung and lung tumor motion

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yang, Y. X.; Van Reeth, E.; Poh, C. L., E-mail: clpoh@ntu.edu.sg

    2015-08-15

Purpose: Accurate visualization of lung motion is important in many clinical applications, such as radiotherapy of lung cancer. Advancement in imaging modalities [e.g., computed tomography (CT) and MRI] has allowed dynamic imaging of lung and lung tumor motion. However, each imaging modality has its advantages and disadvantages. The study presented in this paper aims at generating a synthetic 4D-CT dataset for lung cancer patients by combining both the continuous three-dimensional (3D) motion captured by 4D-MRI and the high spatial resolution captured by CT using the authors’ proposed approach. Methods: A novel hybrid approach based on deformable image registration (DIR) and finite element method simulation was developed to fuse a static 3D-CT volume (acquired under breath-hold) and the 3D motion information extracted from a 4D-MRI dataset, creating a synthetic 4D-CT dataset. Results: The study focuses on imaging of lung and lung tumor. Comparing the synthetic 4D-CT dataset with the acquired 4D-CT dataset of six lung cancer patients based on 420 landmarks, accurate results (average error <2 mm) were achieved using the authors’ proposed approach. Their hybrid approach achieved a 40% error reduction (based on landmark assessment) over using only DIR techniques. Conclusions: The synthetic 4D-CT dataset generated has high spatial resolution, has excellent lung details, and is able to show movement of lung and lung tumor over multiple breathing cycles.
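    The landmark-based assessment described above reduces, computationally, to a mean Euclidean distance over corresponding points. A sketch with invented coordinates (the study's actual 420 landmarks are not public):

```python
# Landmark-based registration error: mean Euclidean distance between
# corresponding points in two volumes, in millimetres. Coordinates are toy.
import numpy as np

rng = np.random.default_rng(0)
acquired = rng.uniform(0.0, 300.0, (420, 3))            # 420 landmarks (mm)
synthetic = acquired + rng.normal(0.0, 1.0, (420, 3))   # ~1 mm per-axis error

errors = np.linalg.norm(synthetic - acquired, axis=1)
mean_err = errors.mean()
```

    With ~1 mm per-axis error the mean 3D distance comes out around 1.6 mm, i.e. under the paper's <2 mm threshold; real evaluations would also report the error distribution, not just its mean.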

  11. Application of 3D Zernike descriptors to shape-based ligand similarity searching.

    PubMed

    Venkatraman, Vishwesh; Chakravarthy, Padmasini Ramji; Kihara, Daisuke

    2009-12-17

The identification of promising drug leads from a large database of compounds is an important step in the preliminary stages of drug design. Although shape is known to play a key role in the molecular recognition process, its application to virtual screening poses significant hurdles both in terms of the encoding scheme and speed. In this study, we have examined the efficacy of the alignment-independent three-dimensional Zernike descriptor (3DZD) for fast shape-based similarity searching. The performance of this approach was compared with several other methods, including the statistical-moments-based ultrafast shape recognition scheme (USR) and SIMCOMP, a graph matching algorithm that compares atom environments. Three benchmark datasets are used to thoroughly test the methods in terms of their ability for molecular classification, retrieval rate, and performance in a setting that simulates actual virtual screening tasks over a large pharmaceutical database. The 3DZD performed better than or comparably to the other methods examined, depending on the datasets and evaluation metrics used. Reasons for the success and the failure of the shape-based methods in specific cases are investigated. Based on the results for the three datasets, general conclusions are drawn with regard to their efficiency and applicability. The 3DZD has a unique ability for fast comparison of the three-dimensional shapes of compounds. The examples analyzed illustrate both the advantages of the 3DZD and the room for improvement.

  12. Application of 3D Zernike descriptors to shape-based ligand similarity searching

    PubMed Central

    2009-01-01

Background: The identification of promising drug leads from a large database of compounds is an important step in the preliminary stages of drug design. Although shape is known to play a key role in the molecular recognition process, its application to virtual screening poses significant hurdles both in terms of the encoding scheme and speed. Results: In this study, we have examined the efficacy of the alignment-independent three-dimensional Zernike descriptor (3DZD) for fast shape-based similarity searching. The performance of this approach was compared with several other methods, including the statistical-moments-based ultrafast shape recognition scheme (USR) and SIMCOMP, a graph matching algorithm that compares atom environments. Three benchmark datasets are used to thoroughly test the methods in terms of their ability for molecular classification, retrieval rate, and performance in a setting that simulates actual virtual screening tasks over a large pharmaceutical database. The 3DZD performed better than or comparably to the other methods examined, depending on the datasets and evaluation metrics used. Reasons for the success and the failure of the shape-based methods in specific cases are investigated. Based on the results for the three datasets, general conclusions are drawn with regard to their efficiency and applicability. Conclusion: The 3DZD has a unique ability for fast comparison of the three-dimensional shapes of compounds. The examples analyzed illustrate both the advantages of the 3DZD and the room for improvement. PMID:20150998
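    Of the methods compared, USR has a compact, easily sketched form: twelve statistical moments of the atomic distance distributions from four reference points. This follows the published USR recipe; the coordinates below are toy data, and the 3DZD itself is considerably more involved.

```python
# Ultrafast shape recognition (USR): 12 moments (mean, spread, cube-rooted
# skew) of atomic distances from four reference points, compared without
# any alignment. Molecular coordinates here are random toy conformers.
import numpy as np

def usr_descriptor(xyz):
    ctd = xyz.mean(axis=0)
    d_ctd = np.linalg.norm(xyz - ctd, axis=1)
    cst = xyz[d_ctd.argmin()]                              # closest atom to centroid
    fct = xyz[d_ctd.argmax()]                              # farthest atom from centroid
    ftf = xyz[np.linalg.norm(xyz - fct, axis=1).argmax()]  # farthest from fct
    desc = []
    for ref in (ctd, cst, fct, ftf):
        d = np.linalg.norm(xyz - ref, axis=1)
        mu = d.mean()
        desc += [mu, d.std(), np.cbrt(((d - mu) ** 3).mean())]
    return np.array(desc)

def usr_similarity(a, b):
    """1 for identical descriptors, decreasing toward 0 as shapes diverge."""
    return 1.0 / (1.0 + np.abs(usr_descriptor(a) - usr_descriptor(b)).mean())

rng = np.random.default_rng(0)
mol_a = rng.normal(0.0, 2.0, (30, 3))      # toy conformer coordinates (Å)
mol_b = rng.normal(0.0, 2.0, (30, 3))
```

    Because the descriptor is alignment-free, comparing two molecules costs only a 12-component vector difference, which is the source of USR's speed.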

  13. Novel Harmonic Regularization Approach for Variable Selection in Cox's Proportional Hazards Model

    PubMed Central

    Chu, Ge-Jin; Liang, Yong; Wang, Jia-Xuan

    2014-01-01

Variable selection is an important issue in regression, and a number of variable selection methods involving nonconvex penalty functions have been proposed. In this paper, we investigate a novel harmonic regularization method, which can approximate nonconvex Lq (1/2 < q < 1) regularizations, to select key risk factors in the Cox proportional hazards model using microarray gene expression data. The harmonic regularization method can be efficiently solved using our proposed direct path seeking approach, which can produce solutions that closely approximate those for the convex loss function and the nonconvex regularization. Simulation results based on artificial datasets and four real microarray gene expression datasets, including diffuse large B-cell lymphoma (DLBCL), lung cancer, and AML datasets, show that the harmonic regularization method can be more accurate for variable selection than existing Lasso-series methods. PMID:25506389
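    The Lq (1/2 < q < 1) family that harmonic regularization approximates sits between the lasso and the L1/2 penalty. The exact harmonic functional form is defined in the paper; the basic geometry of the family can, however, be seen numerically:

```python
# Compare |beta|^q penalties: the lasso (q=1) versus nonconvex Lq members.
# Smaller q penalizes small coefficients more and large coefficients less,
# which is what drives sparser selection than the lasso.
import numpy as np

beta = np.linspace(0.01, 3.0, 300)
l1 = np.abs(beta)              # lasso penalty (q = 1), convex
l_075 = np.abs(beta) ** 0.75   # an Lq penalty with 1/2 < q < 1
l_half = np.abs(beta) ** 0.5   # L1/2 penalty, strongly nonconvex
```

    For |beta| > 1 the Lq curves lie below the lasso (less shrinkage of large, genuine effects), while for |beta| < 1 they lie above it (harsher suppression of small, noisy effects).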

  14. Mapping monkeypox transmission risk through time and space in the Congo Basin

    USGS Publications Warehouse

    Nakazawa, Yoshinori J.; Lash, R. Ryan; Carroll, Darin S.; Damon, Inger K.; Karem, Kevin L.; Reynolds, Mary G.; Osorio, Jorge E.; Rocke, Tonie E.; Malekani, Jean; Muyembe, Jean-Jacques; Formenty, Pierre; Peterson, A. Townsend

    2013-01-01

    Monkeypox is a major public health concern in the Congo Basin area, with changing patterns of human case occurrences reported in recent years. Whether this trend results from better surveillance and detection methods, reduced proportions of vaccinated vs. non-vaccinated human populations, or changing environmental conditions remains unclear. Our objective is to examine potential correlations between environment and transmission of monkeypox events in the Congo Basin. We created ecological niche models based on human cases reported in the Congo Basin by the World Health Organization at the end of the smallpox eradication campaign, in relation to remotely-sensed Normalized Difference Vegetation Index datasets from the same time period. These models predicted independent spatial subsets of monkeypox occurrences with high confidence; models were then projected onto parallel environmental datasets for the 2000s to create present-day monkeypox suitability maps. Recent trends in human monkeypox infection are associated with broad environmental changes across the Congo Basin. Our results demonstrate that ecological niche models provide useful tools for identification of areas suitable for transmission, even for poorly-known diseases like monkeypox.

  15. Mapping monkeypox transmission risk through time and space in the Congo Basin.

    PubMed

    Nakazawa, Yoshinori; Lash, R Ryan; Carroll, Darin S; Damon, Inger K; Karem, Kevin L; Reynolds, Mary G; Osorio, Jorge E; Rocke, Tonie E; Malekani, Jean M; Muyembe, Jean-Jacques; Formenty, Pierre; Peterson, A Townsend

    2013-01-01

    Monkeypox is a major public health concern in the Congo Basin area, with changing patterns of human case occurrences reported in recent years. Whether this trend results from better surveillance and detection methods, reduced proportions of vaccinated vs. non-vaccinated human populations, or changing environmental conditions remains unclear. Our objective is to examine potential correlations between environment and transmission of monkeypox events in the Congo Basin. We created ecological niche models based on human cases reported in the Congo Basin by the World Health Organization at the end of the smallpox eradication campaign, in relation to remotely-sensed Normalized Difference Vegetation Index datasets from the same time period. These models predicted independent spatial subsets of monkeypox occurrences with high confidence; models were then projected onto parallel environmental datasets for the 2000s to create present-day monkeypox suitability maps. Recent trends in human monkeypox infection are associated with broad environmental changes across the Congo Basin. Our results demonstrate that ecological niche models provide useful tools for identification of areas suitable for transmission, even for poorly-known diseases like monkeypox.

  16. Mapping Monkeypox Transmission Risk through Time and Space in the Congo Basin

    PubMed Central

    Nakazawa, Yoshinori; Lash, R. Ryan; Carroll, Darin S.; Damon, Inger K.; Karem, Kevin L.; Reynolds, Mary G.; Osorio, Jorge E.; Rocke, Tonie E.; Malekani, Jean M.; Muyembe, Jean-Jacques; Formenty, Pierre; Peterson, A. Townsend

    2013-01-01

    Monkeypox is a major public health concern in the Congo Basin area, with changing patterns of human case occurrences reported in recent years. Whether this trend results from better surveillance and detection methods, reduced proportions of vaccinated vs. non-vaccinated human populations, or changing environmental conditions remains unclear. Our objective is to examine potential correlations between environment and transmission of monkeypox events in the Congo Basin. We created ecological niche models based on human cases reported in the Congo Basin by the World Health Organization at the end of the smallpox eradication campaign, in relation to remotely-sensed Normalized Difference Vegetation Index datasets from the same time period. These models predicted independent spatial subsets of monkeypox occurrences with high confidence; models were then projected onto parallel environmental datasets for the 2000s to create present-day monkeypox suitability maps. Recent trends in human monkeypox infection are associated with broad environmental changes across the Congo Basin. Our results demonstrate that ecological niche models provide useful tools for identification of areas suitable for transmission, even for poorly-known diseases like monkeypox. PMID:24040344

  17. An algorithm for direct causal learning of influences on patient outcomes.

    PubMed

    Rathnam, Chandramouli; Lee, Sanghoon; Jiang, Xia

    2017-01-01

    This study aims at developing and introducing a new algorithm, called direct causal learner (DCL), for learning the direct causal influences of a single target. We applied it to both simulated and real clinical and genome wide association study (GWAS) datasets and compared its performance to classic causal learning algorithms. The DCL algorithm learns the causes of a single target from passive data using Bayesian-scoring, instead of using independence checks, and a novel deletion algorithm. We generate 14,400 simulated datasets and measure the number of datasets for which DCL correctly and partially predicts the direct causes. We then compare its performance with the constraint-based path consistency (PC) and conservative PC (CPC) algorithms, the Bayesian-score based fast greedy search (FGS) algorithm, and the partial ancestral graphs algorithm fast causal inference (FCI). In addition, we extend our comparison of all five algorithms to both a real GWAS dataset and real breast cancer datasets over various time-points in order to observe how effective they are at predicting the causal influences of Alzheimer's disease and breast cancer survival. DCL consistently outperforms FGS, PC, CPC, and FCI in discovering the parents of the target for the datasets simulated using a simple network. Overall, DCL predicts significantly more datasets correctly (McNemar's test significance: p<0.0001) than any of the other algorithms for these network types. For example, when assessing overall performance (simple and complex network results combined), DCL correctly predicts approximately 1400 more datasets than the top FGS method, 1600 more datasets than the top CPC method, 4500 more datasets than the top PC method, and 5600 more datasets than the top FCI method. 
Although FGS correctly predicted more datasets than DCL for the complex networks, and DCL correctly predicted only a few more datasets than CPC for these networks, there is no significant difference in performance among these three algorithms for this network type. However, when we use a more continuous measure of accuracy, we find that all the DCL methods partially predict the direct causes better than FGS and CPC for the complex networks. In addition, DCL consistently had faster runtimes than the other algorithms. In the application to the real datasets, DCL identified rs6784615, located on the NISCH gene, and rs10824310, located on the PRKG1 gene, as direct causes of late-onset Alzheimer's disease (LOAD) development. In addition, DCL identified ER category as a direct predictor of breast cancer mortality within 5 years, and HER2 status as a direct predictor of 10-year breast cancer mortality. These predictors have been identified in previous studies as having direct causal relationships with their respective phenotypes, supporting the predictive power of DCL. When the other algorithms discovered predictors from the real datasets, these predictors were either also found by DCL or could not be supported by previous studies. Our results show that DCL outperforms FGS, PC, CPC, and FCI in almost every case, demonstrating its potential to advance causal learning. Furthermore, our DCL algorithm effectively identifies direct causes in the LOAD and METABRIC GWAS datasets, which indicates its potential for clinical applications. Copyright © 2016 Elsevier B.V. All rights reserved.
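    DCL's exact Bayesian score and deletion step are not given in the abstract, but the general idea of score-based parent selection for a single target can be sketched with a BIC score and greedy forward selection on synthetic binary data (a deliberately simplified stand-in, not the DCL algorithm):

```python
# Greedy score-based parent search for a single target variable: repeatedly
# add the candidate parent that most improves a BIC score. Data are synthetic
# binary variables; x3 is irrelevant and should be rejected by the penalty.
import numpy as np

def bic_score(y, parents, n_states=2):
    """Log-likelihood of y given discrete parents, minus a BIC penalty."""
    n = len(y)
    if parents:
        cfg = np.ravel_multi_index(tuple(parents), (n_states,) * len(parents))
    else:
        cfg = np.zeros(n, dtype=int)
    ll = 0.0
    for c in np.unique(cfg):
        sub = y[cfg == c]
        for v in range(n_states):
            k = (sub == v).sum()
            if k:
                ll += k * np.log(k / len(sub))
    n_params = (n_states ** len(parents)) * (n_states - 1)
    return ll - 0.5 * n_params * np.log(n)

def greedy_parent_search(y, candidates):
    chosen, names = [], []
    best = bic_score(y, chosen)
    improved = True
    while improved:
        improved = False
        for name, x in candidates.items():
            if name in names:
                continue
            s = bic_score(y, chosen + [x])
            if s > best:
                best, pick, improved = s, (name, x), True
        if improved:
            names.append(pick[0])
            chosen.append(pick[1])
    return set(names)

rng = np.random.default_rng(0)
n = 2000
cand = {f"x{i}": rng.integers(0, 2, n) for i in (1, 2, 3)}
y = ((cand["x1"] + cand["x2"]) >= 1).astype(int)   # x3 is irrelevant
flip = rng.random(n) < 0.1                         # 10% label noise
y = np.where(flip, 1 - y, y)
parents = greedy_parent_search(y, cand)
```

    The BIC penalty grows with each added parent, so an uninformative candidate's tiny likelihood gain cannot pay for its extra parameters; DCL additionally applies a deletion pass that this sketch omits.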

  18. A new time-series methodology for estimating relationships between elderly frailty, remaining life expectancy, and ambient air quality.

    PubMed

    Murray, Christian J; Lipfert, Frederick W

    2012-01-01

Many publications estimate short-term air pollution-mortality risks, but few estimate the associated changes in life expectancy. We present a new methodology for analyzing time series of health effects, in which prior frailty is assumed to precede short-term elderly nontraumatic mortality. The model is based on a subpopulation of frail individuals whose entries and exits (deaths) are functions of daily and lagged environmental conditions: ambient temperature/season, airborne particles, and ozone. This frail susceptible population is unknown; its fluctuations cannot be observed but are estimated using maximum-likelihood methods with the Kalman filter. We used an existing 14-y set of daily data to illustrate the model and then tested the assumption of prior frailty with a new generalized model that estimates the portion of the daily death count allocated to nonfrail individuals. In this demonstration dataset, new entries into the high-risk pool are associated with lower ambient temperatures and higher concentrations of particulate matter and ozone. Accounting for these effects on antecedent frailty reduces this at-risk population, yielding frail life expectancies of 5-7 days. Associations between environmental factors and entries to the at-risk pool are about twice as strong as those for mortality, and nonfrail elderly deaths make only small contributions. This new model predicts a small, short-lived frail population at risk that is stable over a wide range of environmental conditions. The predicted effects of pollution on new entries and deaths are robust and consistent with conventional morbidity/mortality time-series studies. We recommend model verification using other suitable datasets.
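    The latent frail pool above is estimated with the Kalman filter. A much-simplified linear Gaussian sketch follows: the exit rate, noise levels, covariate effect, and the assumption that the entry forcing is known are all invented for illustration, and the real model's maximum-likelihood parameter estimation is omitted.

```python
# Estimate a latent frail population from noisy daily death counts with a
# scalar Kalman filter. State: x_t = (1-k) x_{t-1} + entries_t + process
# noise; observation: deaths_t = k x_t + observation noise. All dynamics
# and parameters are invented for this sketch.
import numpy as np

rng = np.random.default_rng(0)
T = 365
temp = np.sin(np.arange(T) * 2 * np.pi / 365)   # seasonal temperature proxy
entries = 5.0 - 2.0 * temp                      # colder days -> more entries
exit_rate = 0.15                                # ~1/0.15 ≈ 7-day frail life expectancy
frail = np.zeros(T)
deaths = np.zeros(T)
level = 30.0
for t in range(T):                              # simulate the "true" pool
    level = (1 - exit_rate) * level + entries[t] + rng.normal(0, 1)
    frail[t] = level
    deaths[t] = exit_rate * level + rng.normal(0, 2)

x, P = 30.0, 10.0                               # filter state mean, variance
Q, R = 1.0, 4.0                                 # process, observation variances
a, h = 1 - exit_rate, exit_rate
est = np.zeros(T)
for t in range(T):
    x_pred = a * x + entries[t]                 # predict
    P_pred = a * a * P + Q
    K = P_pred * h / (h * h * P_pred + R)       # Kalman gain
    x = x_pred + K * (deaths[t] - h * x_pred)   # update with observed deaths
    P = (1 - K * h) * P_pred
    est[t] = x

rmse = np.sqrt(((est - frail) ** 2).mean())
```

    The filtered estimate tracks the unobservable pool far better than naively rescaling the death counts, which is the basic reason a state-space treatment is used.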

  19. Population Pharmacokinetics of Intravenous Paracetamol (Acetaminophen) in Preterm and Term Neonates: Model Development and External Evaluation

    PubMed Central

    Cook, Sarah F.; Roberts, Jessica K.; Samiee-Zafarghandy, Samira; Stockmann, Chris; King, Amber D.; Deutsch, Nina; Williams, Elaine F.; Allegaert, Karel; Sherwin, Catherine M. T.; van den Anker, John N.

    2017-01-01

    Objectives: The aims of this study were to develop a population pharmacokinetic model for intravenous paracetamol in preterm and term neonates and to assess the generalizability of the model by testing its predictive performance in an external dataset. Methods: Nonlinear mixed-effects models were constructed from paracetamol concentration–time data in NONMEM 7.2. Potential covariates included body weight, gestational age, postnatal age, postmenstrual age, sex, race, total bilirubin, and estimated glomerular filtration rate. An external dataset was used to test the predictive performance of the model through calculation of bias, precision, and normalized prediction distribution errors. Results: The model-building dataset included 260 observations from 35 neonates with a mean gestational age of 33.6 weeks [standard deviation (SD) 6.6]. Data were well-described by a one-compartment model with first-order elimination. Weight predicted paracetamol clearance and volume of distribution, which were estimated as 0.348 L/h (5.5 % relative standard error; 30.8 % coefficient of variation) and 2.46 L (3.5 % relative standard error; 14.3 % coefficient of variation), respectively, at the mean subject weight of 2.30 kg. An external evaluation was performed on an independent dataset that included 436 observations from 60 neonates with a mean gestational age of 35.6 weeks (SD 4.3). The median prediction error was 10.1 % [95 % confidence interval (CI) 6.1–14.3] and the median absolute prediction error was 25.3 % (95 % CI 23.1–28.1). Conclusions: Weight predicted intravenous paracetamol pharmacokinetics in neonates ranging from extreme preterm to full-term gestational status. External evaluation suggested that these findings should be generalizable to other similar patient populations. PMID:26201306

  20. Risk factors, prevalence trend, and clustering of hypospadias cases in Puerto Rico.

    PubMed

    Avilés, Luis A; Alvelo-Maldonado, Laureane; Padró-Mojica, Irmari; Seguinot, José; Jorge, Juan Carlos

    2014-12-01

    The aim was to determine the distribution pattern of hypospadias cases across a well-defined geographic space. The dataset for this study was produced by the Birth Defects Prevention and Surveillance System of the Department of Health of Puerto Rico (BDSS-PR), which linked information on male newborns from the Puerto Rico Birth Cohort dataset (PRBC; n=92,285) from 2007 to 2010. A population-based case-control study was conducted to determine the prevalence trend and to estimate the potential effects of maternal age, paternal age, birth-related variables, and health insurance status on hypospadias. Two geographic information systems (GIS) methods (Anselin Local Moran's I and Getis-Ord G) were used to determine the spatial distribution of hypospadias prevalence. Birthweight (<2500 g), maternal age (40+ years), and private health insurance were associated with hypospadias, as confirmed by univariate and multivariate analyses at the 95% confidence level. A cluster of hypospadias cases was detected in the north-central region of Puerto Rico with both GIS methods (p≤0.05). The clustering of hypospadias prevalence provides an opportunity to assess the underlying causes of the condition and their relationships with geographic space. Copyright © 2014 Journal of Pediatric Urology Company. Published by Elsevier Ltd. All rights reserved.

  1. A bias-corrected CMIP5 dataset for Africa using the CDF-t method - a contribution to agricultural impact studies

    NASA Astrophysics Data System (ADS)

    Moise Famien, Adjoua; Janicot, Serge; Delfin Ochou, Abe; Vrac, Mathieu; Defrance, Dimitri; Sultan, Benjamin; Noël, Thomas

    2018-03-01

    The objective of this paper is to present a new dataset of bias-corrected CMIP5 global climate model (GCM) daily data over Africa. This dataset was obtained using the cumulative distribution function transform (CDF-t) method, which has been applied to several regions and contexts but never to Africa. Here CDF-t has been applied over the period 1950-2099, combining historical runs and climate change scenarios for six variables that are critical for agricultural purposes: precipitation, mean near-surface air temperature, near-surface maximum air temperature, near-surface minimum air temperature, surface downwelling shortwave radiation, and wind speed. WFDEI has been used as the reference dataset to correct the GCMs. Evaluation of the results over West Africa has been carried out on a list of priority user-based metrics that were discussed and selected with stakeholders, including simulated yields from a crop model of maize growth. These bias-corrected GCM data have been compared with another available dataset of bias-corrected GCMs that uses the WATCH Forcing Data as the reference dataset. The impact of the WFD, WFDEI, and EWEMBI reference datasets has also been examined in detail. It is shown that CDF-t is very effective at removing biases and reducing the high inter-GCM scatter. Differences with other bias-corrected GCM data are mainly due to differences among the reference datasets. This is particularly true for surface downwelling shortwave radiation, which has a significant impact on simulated maize yields. Projections of future yields over West Africa differ considerably depending on the bias-correction method used; however, all these projections show a similar relative decreasing trend over the 21st century.
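
    The core idea of CDF-based bias correction can be illustrated with simple empirical quantile mapping: each model value is assigned its quantile in the model's historical distribution and replaced by the value at the same quantile of the reference distribution. Note that CDF-t proper goes further, fitting a transform between the CDFs so the correction can be extrapolated to future periods; the sketch below shows only the simpler quantile-mapping special case, with made-up toy data.

```python
import bisect

def quantile_map(value, model_hist, reference):
    """Map one model value onto the reference distribution by matching
    empirical quantiles (the basic idea behind CDF-based corrections)."""
    model_sorted = sorted(model_hist)
    ref_sorted = sorted(reference)
    # Empirical CDF position of the value within the model climatology.
    rank = bisect.bisect_left(model_sorted, value)
    q = min(max(rank / max(len(model_sorted) - 1, 1), 0.0), 1.0)
    # Read the same quantile off the reference CDF (linear interpolation).
    pos = q * (len(ref_sorted) - 1)
    lo, hi = int(pos), min(int(pos) + 1, len(ref_sorted) - 1)
    frac = pos - lo
    return ref_sorted[lo] * (1 - frac) + ref_sorted[hi] * frac

# Toy example: a model that runs too warm with inflated spread.
model = [20 + 2 * i for i in range(11)]      # 20..40 degrees
reference = [18 + i for i in range(11)]      # 18..28 degrees
corrected = [quantile_map(v, model, reference) for v in model]
```

    After correction the model values span the reference range, which mirrors how the bias correction removes systematic offsets and excess spread before impact modelling.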

  2. BACOM2.0 facilitates absolute normalization and quantification of somatic copy number alterations in heterogeneous tumor

    NASA Astrophysics Data System (ADS)

    Fu, Yi; Yu, Guoqiang; Levine, Douglas A.; Wang, Niya; Shih, Ie-Ming; Zhang, Zhen; Clarke, Robert; Wang, Yue

    2015-09-01

    Most published copy number datasets on solid tumors were obtained from specimens comprised of mixed cell populations, for which the varying tumor-stroma proportions are unknown or unreported. The inability to correct for signal mixing represents a major limitation on the use of these datasets for subsequent analyses, such as discerning deletion types or detecting driver aberrations. We describe the BACOM2.0 method with enhanced accuracy and functionality to normalize copy number signals, detect deletion types, estimate tumor purity, quantify true copy numbers, and calculate average-ploidy value. While BACOM has been validated and used with promising results, subsequent BACOM analysis of the TCGA ovarian cancer dataset found that the estimated average tumor purity was lower than expected. In this report, we first show that this lowered estimate of tumor purity is the combined result of imprecise signal normalization and parameter estimation. Then, we describe effective allele-specific absolute normalization and quantification methods that can enhance BACOM applications in many biological contexts while in the presence of various confounders. Finally, we discuss the advantages of BACOM in relation to alternative approaches. Here we detail this revised computational approach, BACOM2.0, and validate its performance in real and simulated datasets.

  3. Unsupervised learning on scientific ocean drilling datasets from the South China Sea

    NASA Astrophysics Data System (ADS)

    Tse, Kevin C.; Chiu, Hon-Chim; Tsang, Man-Yin; Li, Yiliang; Lam, Edmund Y.

    2018-06-01

    Unsupervised learning methods were applied to explore data patterns in multivariate geophysical datasets collected from ocean floor sediment core samples coming from scientific ocean drilling in the South China Sea. Compared to studies on similar datasets, but using supervised learning methods which are designed to make predictions based on sample training data, unsupervised learning methods require no a priori information and focus only on the input data. In this study, popular unsupervised learning methods including K-means, self-organizing maps, hierarchical clustering and random forest were coupled with different distance metrics to form exploratory data clusters. The resulting data clusters were externally validated with lithologic units and geologic time scales assigned to the datasets by conventional methods. Compact and connected data clusters displayed varying degrees of correspondence with existing classification by lithologic units and geologic time scales. K-means and self-organizing maps were observed to perform better with lithologic units while random forest corresponded best with geologic time scales. This study sets a pioneering example of how unsupervised machine learning methods can be used as an automatic processing tool for the increasingly high volume of scientific ocean drilling data.
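
    As a concrete illustration of the clustering step, here is a minimal K-means with a pluggable distance metric, in the spirit of coupling clustering methods with different metrics as described above. This is a generic sketch with toy data, not the study's code; self-organizing maps, hierarchical clustering, random forest, and the external validation against lithologic units are not shown.

```python
import math
import random

def kmeans(points, k, dist, iters=50, seed=0):
    """Minimal K-means: alternate nearest-center assignment (under the
    supplied metric) with coordinate-wise mean updates of the centers."""
    rng = random.Random(seed)
    centers = list(rng.sample(points, k))
    for _ in range(iters):
        # Assign each point to its nearest center under `dist`.
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        # Recompute centers as the mean of their assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return labels, centers

euclidean = lambda a, b: math.dist(a, b)
# Two well-separated toy "samples" in a 2-variable feature space.
pts = [(0.1, 0.0), (0.0, 0.2), (5.0, 5.1), (5.2, 4.9)]
labels, centers = kmeans(pts, 2, euclidean)
```

    Swapping `euclidean` for another metric (e.g. Manhattan distance) changes the cluster geometry, which is the kind of metric sensitivity the study explores.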

  4. A multiscale Bayesian data integration approach for mapping air dose rates around the Fukushima Daiichi Nuclear Power Plant.

    PubMed

    Wainwright, Haruko M; Seki, Akiyuki; Chen, Jinsong; Saito, Kimiaki

    2017-02-01

    This paper presents a multiscale data integration method to estimate the spatial distribution of air dose rates in the regional scale around the Fukushima Daiichi Nuclear Power Plant. We integrate various types of datasets, such as ground-based walk and car surveys, and airborne surveys, all of which have different scales, resolutions, spatial coverage, and accuracy. This method is based on geostatistics to represent spatial heterogeneous structures, and also on Bayesian hierarchical models to integrate multiscale, multi-type datasets in a consistent manner. The Bayesian method allows us to quantify the uncertainty in the estimates, and to provide the confidence intervals that are critical for robust decision-making. Although this approach is primarily data-driven, it has great flexibility to include mechanistic models for representing radiation transport or other complex correlations. We demonstrate our approach using three types of datasets collected at the same time over Fukushima City in Japan: (1) coarse-resolution airborne surveys covering the entire area, (2) car surveys along major roads, and (3) walk surveys in multiple neighborhoods. Results show that the method can successfully integrate three types of datasets and create an integrated map (including the confidence intervals) of air dose rates over the domain in high resolution. Moreover, this study provides us with various insights into the characteristics of each dataset, as well as radiocaesium distribution. In particular, the urban areas show high heterogeneity in the contaminant distribution due to human activities as well as large discrepancy among different surveys due to such heterogeneity. Copyright © 2016 Elsevier Ltd. All rights reserved.

  5. A new collaborative recommendation approach based on users clustering using artificial bee colony algorithm.

    PubMed

    Ju, Chunhua; Xu, Chonghuan

    2013-01-01

    Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods.
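
    The within-cluster similarity step can be sketched as follows. "Modified cosine" is taken here to mean cosine similarity over co-rated items after subtracting each user's mean rating, which is one common variant; the paper's exact formula and the ABC-optimized clustering are not reproduced.

```python
import math

def modified_cosine(ratings_u, ratings_v):
    """Cosine similarity over co-rated items with each user's mean rating
    subtracted first (an assumed variant of 'modified cosine')."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    mu = sum(ratings_u.values()) / len(ratings_u)
    mv = sum(ratings_v.values()) / len(ratings_v)
    num = sum((ratings_u[i] - mu) * (ratings_v[i] - mv) for i in common)
    du = math.sqrt(sum((ratings_u[i] - mu) ** 2 for i in common))
    dv = math.sqrt(sum((ratings_v[i] - mv) ** 2 for i in common))
    return num / (du * dv) if du and dv else 0.0

# Toy ratings: same taste profile, different rating scales.
alice = {"m1": 5, "m2": 3, "m3": 1}
bob   = {"m1": 4, "m2": 3, "m3": 2}
sim = modified_cosine(alice, bob)
```

    Mean-centering makes the two users maximally similar despite Bob's compressed rating scale, which plain cosine on raw ratings would partly miss.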

  6. A New Collaborative Recommendation Approach Based on Users Clustering Using Artificial Bee Colony Algorithm

    PubMed Central

    Ju, Chunhua

    2013-01-01

    Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods. PMID:24381525

  7. Multivendor Spectral-Domain Optical Coherence Tomography Dataset, Observer Annotation Performance Evaluation, and Standardized Evaluation Framework for Intraretinal Cystoid Fluid Segmentation.

    PubMed

    Wu, Jing; Philip, Ana-Maria; Podkowinski, Dominika; Gerendas, Bianca S; Langs, Georg; Simader, Christian; Waldstein, Sebastian M; Schmidt-Erfurth, Ursula M

    2016-01-01

    Development of image analysis and machine learning methods for segmentation of clinically significant pathology in retinal spectral-domain optical coherence tomography (SD-OCT), used in disease detection and prediction, is limited by the scarcity of expertly annotated reference data. Retinal segmentation methods use datasets that either are not publicly available, come from only one device, or use different evaluation methodologies, making them difficult to compare. Thus, we present and evaluate a multiple-expert annotated reference dataset for the problem of intraretinal cystoid fluid (IRF) segmentation, a key indicator in exudative macular disease. In addition, a standardized framework for segmentation accuracy evaluation, applicable to other pathological structures, is presented. Integral to this work is the dataset used, which must be fit for purpose for IRF segmentation algorithm training and testing. We describe here a multivendor dataset comprising 30 scans. Each OCT scan for system training has been annotated by multiple graders using a proprietary system. Evaluation of the intergrader annotations shows good correlation, making the reproducibly annotated scans suitable for the training and validation of image processing and machine learning based segmentation methods. The dataset will be made publicly available in the form of a segmentation Grand Challenge.

  8. Multivendor Spectral-Domain Optical Coherence Tomography Dataset, Observer Annotation Performance Evaluation, and Standardized Evaluation Framework for Intraretinal Cystoid Fluid Segmentation

    PubMed Central

    Wu, Jing; Philip, Ana-Maria; Podkowinski, Dominika; Gerendas, Bianca S.; Langs, Georg; Simader, Christian

    2016-01-01

    Development of image analysis and machine learning methods for segmentation of clinically significant pathology in retinal spectral-domain optical coherence tomography (SD-OCT), used in disease detection and prediction, is limited by the scarcity of expertly annotated reference data. Retinal segmentation methods use datasets that either are not publicly available, come from only one device, or use different evaluation methodologies, making them difficult to compare. Thus, we present and evaluate a multiple-expert annotated reference dataset for the problem of intraretinal cystoid fluid (IRF) segmentation, a key indicator in exudative macular disease. In addition, a standardized framework for segmentation accuracy evaluation, applicable to other pathological structures, is presented. Integral to this work is the dataset used, which must be fit for purpose for IRF segmentation algorithm training and testing. We describe here a multivendor dataset comprising 30 scans. Each OCT scan for system training has been annotated by multiple graders using a proprietary system. Evaluation of the intergrader annotations shows good correlation, making the reproducibly annotated scans suitable for the training and validation of image processing and machine learning based segmentation methods. The dataset will be made publicly available in the form of a segmentation Grand Challenge. PMID:27579177

  9. A trust-based recommendation method using network diffusion processes

    NASA Astrophysics Data System (ADS)

    Chen, Ling-Jiao; Gao, Jian

    2018-09-01

    A variety of rating-based recommendation methods have been extensively studied including the well-known collaborative filtering approaches and some network diffusion-based methods, however, social trust relations are not sufficiently considered when making recommendations. In this paper, we contribute to the literature by proposing a trust-based recommendation method, named CosRA+T, after integrating the information of trust relations into the resource-redistribution process. Specifically, a tunable parameter is used to scale the resources received by trusted users before the redistribution back to the objects. Interestingly, we find an optimal scaling parameter for the proposed CosRA+T method to achieve its best recommendation accuracy, and the optimal value seems to be universal under several evaluation metrics across different datasets. Moreover, results of extensive experiments on the two real-world rating datasets with trust relations, Epinions and FriendFeed, suggest that CosRA+T has a remarkable improvement in overall accuracy, diversity and novelty. Our work takes a step towards designing better recommendation algorithms by employing multiple resources of social network information.
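
    The trust-scaled redistribution idea can be sketched on a toy user-object bipartite graph. This uses plain two-step mass diffusion with a boost for trusted users, not the CosRA kernel itself; the scale value and data are illustrative, and the paper tunes its scaling parameter empirically.

```python
def trust_diffusion(adj, trust_of, target, scale=1.5):
    """Two-step resource redistribution on a user-object bipartite graph,
    boosting resources held by the target's trusted users by `scale`
    before the flow back to objects (a simplified sketch of the idea)."""
    # Step 1: each object the target collected spreads one unit of
    # resource equally among all users who collected it.
    user_resource = {u: 0.0 for u in adj}
    for obj in adj[target]:
        holders = [u for u in adj if obj in adj[u]]
        for u in holders:
            user_resource[u] += 1.0 / len(holders)
    # Trust scaling: boost resources sitting on trusted users.
    for u in trust_of.get(target, ()):
        user_resource[u] *= scale
    # Step 2: each user spreads their resource equally over their objects.
    obj_score = {}
    for u, r in user_resource.items():
        for obj in adj[u]:
            obj_score[obj] = obj_score.get(obj, 0.0) + r / len(adj[u])
    # Rank objects the target has not collected yet.
    return sorted((o for o in obj_score if o not in adj[target]),
                  key=lambda o: -obj_score[o])

# Toy data: users -> collected objects, plus one trust relation.
adj = {"alice": {"a", "b"}, "bob": {"a", "c"}, "carol": {"b", "d"}}
trust = {"alice": {"bob"}}
recs = trust_diffusion(adj, trust, "alice")
```

    Because Alice trusts Bob, Bob's object "c" is boosted ahead of Carol's "d" in the recommendation list, which is the qualitative effect the tunable trust parameter produces.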

  10. Deep learning and texture-based semantic label fusion for brain tumor segmentation

    NASA Astrophysics Data System (ADS)

    Vidyaratne, L.; Alam, M.; Shboul, Z.; Iftekharuddin, K. M.

    2018-02-01

    Brain tumor segmentation is a fundamental step in surgical treatment and therapy. Many hand-crafted and learning-based methods have been proposed for automatic brain tumor segmentation from MRI. Studies have shown that these approaches have their inherent advantages and limitations. This work proposes a semantic label fusion algorithm that combines two representative state-of-the-art segmentation algorithms, a texture-based hand-crafted method and a deep learning based method, to obtain robust tumor segmentation. We evaluate the proposed method using the publicly available BRATS 2017 brain tumor segmentation challenge dataset. The results show that the proposed method offers improved segmentation by alleviating each method's inherent weaknesses: the extensive false positives of the texture-based method and the false tumor tissue classification problem of the deep learning method. Furthermore, we investigate the effect of patient gender on segmentation performance using a subset of the validation dataset. The substantial improvement in brain tumor segmentation performance achieved in this work recently enabled our group to secure first place in the overall patient survival prediction task at the BRATS 2017 challenge.

  11. Deep Learning and Texture-Based Semantic Label Fusion for Brain Tumor Segmentation.

    PubMed

    Vidyaratne, L; Alam, M; Shboul, Z; Iftekharuddin, K M

    2018-01-01

    Brain tumor segmentation is a fundamental step in surgical treatment and therapy. Many hand-crafted and learning-based methods have been proposed for automatic brain tumor segmentation from MRI. Studies have shown that these approaches have their inherent advantages and limitations. This work proposes a semantic label fusion algorithm that combines two representative state-of-the-art segmentation algorithms, a texture-based hand-crafted method and a deep learning based method, to obtain robust tumor segmentation. We evaluate the proposed method using the publicly available BRATS 2017 brain tumor segmentation challenge dataset. The results show that the proposed method offers improved segmentation by alleviating each method's inherent weaknesses: the extensive false positives of the texture-based method and the false tumor tissue classification problem of the deep learning method. Furthermore, we investigate the effect of patient gender on segmentation performance using a subset of the validation dataset. The substantial improvement in brain tumor segmentation performance achieved in this work recently enabled our group to secure first place in the overall patient survival prediction task at the BRATS 2017 challenge.

  12. A comparison of the conditional inference survival forest model to random survival forests based on a simulation study as well as on two applications with time-to-event data.

    PubMed

    Nasejje, Justine B; Mwambi, Henry; Dheda, Keertan; Lesosky, Maia

    2017-07-28

    Random survival forest (RSF) models have been identified as alternative methods to the Cox proportional hazards model for analysing time-to-event data. These methods, however, have been criticised for the bias that results from favouring covariates with many split-points, and hence conditional inference forests for time-to-event data have been suggested. Conditional inference forests (CIF) are known to correct the bias in RSF models by separating the procedure for choosing the best covariate to split on from the search for the best split point of the selected covariate. In this study, we compare the random survival forest model to the conditional inference forest (CIF) model using twenty-two simulated time-to-event datasets. We also analysed two real time-to-event datasets. The first dataset is based on the survival of children under five years of age in Uganda and consists of categorical covariates, most of which have more than two levels (many split-points). The second dataset is based on the survival of patients with extensively drug-resistant tuberculosis (XDR TB) and consists mainly of categorical covariates with two levels (few split-points). The study findings indicate that the conditional inference forest model is superior to random survival forest models for analysing time-to-event data whose covariates have many split-points, based on the bootstrap cross-validated estimates of integrated Brier scores. However, conditional inference forests perform comparably to random survival forest models when the covariates have fewer split-points. Although survival forests are promising methods for analysing time-to-event data, it is important to identify the best forest model for analysis based on the nature of the covariates of the dataset in question.

  13. Object-based random forest classification of Landsat ETM+ and WorldView-2 satellite imagery for mapping lowland native grassland communities in Tasmania, Australia

    NASA Astrophysics Data System (ADS)

    Melville, Bethany; Lucieer, Arko; Aryal, Jagannath

    2018-04-01

    This paper presents a random forest classification approach for identifying and mapping three types of lowland native grassland communities found in the Tasmanian Midlands region. Due to the high conservation priority assigned to these communities, there has been an increasing need to identify appropriate datasets that can be used to derive accurate and frequently updateable maps of community extent. Therefore, this paper proposes a method employing repeat classification and statistical significance testing as a means of identifying the most appropriate dataset for mapping these communities. Two datasets were acquired and analysed: a Landsat ETM+ scene and a WorldView-2 scene, both from 2010. Training and validation data were randomly subset using a k-fold (k = 50) approach from a pre-existing field dataset. Poa labillardierei, Themeda triandra, and lowland native grassland complex communities were identified, in addition to dry woodland and agriculture. For each subset of randomly allocated points, a random forest model was trained on each dataset and then used to classify the corresponding imagery. Validation was performed using the reciprocal points from the independent subset that had not been used to train the model. Final training and classification accuracies were reported as per-class means for each satellite dataset. Analysis of variance (ANOVA) was undertaken to determine whether classification accuracy differed between the two datasets, as well as between classifications. Results showed mean class accuracies between 54% and 87%. Class accuracy differed significantly between datasets only for the dry woodland and Themeda grassland classes, with the WorldView-2 dataset showing higher mean classification accuracies. The results of this study indicate that remote sensing is a viable method for identifying lowland native grassland communities in the Tasmanian Midlands, and that repeat classification and statistical significance testing can be used to identify optimal datasets for vegetation community mapping.
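
    The dataset-comparison step can be illustrated by computing a one-way ANOVA F statistic over per-run class accuracies from the repeated classifications. The accuracy figures below are hypothetical, and a full test would compare F against an F-distribution critical value (or compute a p-value) at the chosen significance level.

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square over
    within-group mean square, for lists of per-run accuracies."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical per-run accuracies for one class under each dataset.
landsat   = [0.61, 0.58, 0.63, 0.60, 0.59]
worldview = [0.82, 0.85, 0.80, 0.84, 0.83]
F = anova_f([landsat, worldview])
```

    A large F relative to the critical value indicates the between-dataset difference in accuracy is unlikely to be due to run-to-run sampling variation alone.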

  14. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jurrus, Elizabeth R.; Hodas, Nathan O.; Baker, Nathan A.

    Forensic analysis of nanoparticles is often conducted through the collection and identification of electron microscopy images to determine the origin of suspected nuclear material. Each image is carefully studied by experts for classification of materials based on texture, shape, and size. Manually inspecting large image datasets takes enormous amounts of time. However, automatic classification of large image datasets is a challenging problem due to the complexity involved in choosing image features, the lack of training data available for effective machine learning methods, and the availability of user interfaces to parse through images. Therefore, a significant need exists for automated and semi-automated methods to help analysts perform accurate image classification in large image datasets. We present INStINCt, our Intelligent Signature Canvas, as a framework for quickly organizing image data in a web-based canvas framework. Images are partitioned using small sets of example images, chosen by users, and presented in an optimal layout based on features derived from convolutional neural networks.

  15. The Evolution of Campylobacter jejuni and Campylobacter coli

    PubMed Central

    Sheppard, Samuel K.; Maiden, Martin C.J.

    2015-01-01

    The global significance of Campylobacter jejuni and Campylobacter coli as gastrointestinal human pathogens has motivated numerous studies to characterize their population biology and evolution. These bacteria are a common component of the intestinal microbiota of numerous bird and mammal species and cause disease in humans, typically via consumption of contaminated meat products, especially poultry meat. Sequence-based molecular typing methods, such as multilocus sequence typing (MLST) and whole genome sequencing (WGS), have been instructive for understanding the epidemiology and evolution of these bacteria and how phenotypic variation relates to the high degree of genetic structuring in C. coli and C. jejuni populations. Here, we describe aspects of the relatively short history of coevolution between humans and pathogenic Campylobacter, by reviewing research investigating how mutation and lateral or horizontal gene transfer (LGT or HGT, respectively) interact to create the observed population structure. These genetic changes occur in a complex fitness landscape with divergent ecologies, including multiple host species, which can lead to rapid adaptation, for example, through frame-shift mutations that alter gene expression or the acquisition of novel genetic elements by HGT. Recombination is a particularly strong evolutionary force in Campylobacter, leading to the emergence of new lineages and even large-scale genome-wide interspecies introgression between C. jejuni and C. coli. The increasing availability of large genome datasets is enhancing understanding of Campylobacter evolution through the application of methods, such as genome-wide association studies, but MLST-derived clonal complex designations remain a useful method for describing population structure. PMID:26101080

  16. Mapping populations at risk: improving spatial demographic data for infectious disease modeling and metric derivation

    PubMed Central

    2012-01-01

    The use of Global Positioning Systems (GPS) and Geographical Information Systems (GIS) in disease surveys and reporting is becoming increasingly routine, enabling a better understanding of spatial epidemiology and the improvement of surveillance and control strategies. In turn, the greater availability of spatially referenced epidemiological data is driving the rapid expansion of disease mapping and spatial modeling methods, which are becoming increasingly detailed and sophisticated, with rigorous handling of uncertainties. This expansion has, however, not been matched by advancements in the development of spatial datasets of human population distribution that accompany disease maps or spatial models. Where risks are heterogeneous across population groups or space or dependent on transmission between individuals, spatial data on human population distributions and demographic structures are required to estimate infectious disease risks, burdens, and dynamics. The disease impact in terms of morbidity, mortality, and speed of spread varies substantially with demographic profiles, so that identifying the most exposed or affected populations becomes a key aspect of planning and targeting interventions. Subnational breakdowns of population counts by age and sex are routinely collected during national censuses and maintained in finer detail within microcensus data. Moreover, demographic and health surveys continue to collect representative and contemporary samples from clusters of communities in low-income countries where census data may be less detailed and not collected regularly. Together, these freely available datasets form a rich resource for quantifying and understanding the spatial variations in the sizes and distributions of those most at risk of disease in low income regions, yet at present, they remain unconnected data scattered across national statistical offices and websites. 
In this paper we discuss the deficiencies of existing spatial population datasets and their limitations on epidemiological analyses. We review sources of detailed, contemporary, freely available and relevant spatial demographic data focusing on low income regions where such data are often sparse and highlight the value of incorporating these through a set of examples of their application in disease studies. Moreover, the importance of acknowledging, measuring, and accounting for uncertainty in spatial demographic datasets is outlined. Finally, a strategy for building an open-access database of spatial demographic data that is tailored to epidemiological applications is put forward. PMID:22591595

  17. Improving Generalization Based on l1-Norm Regularization for EEG-Based Motor Imagery Classification

    PubMed Central

    Zhao, Yuwei; Han, Jiuqi; Chen, Yushu; Sun, Hongji; Chen, Jiayun; Ke, Ang; Han, Yao; Zhang, Peng; Zhang, Yi; Zhou, Jin; Wang, Changyong

    2018-01-01

    Multichannel electroencephalography (EEG) is widely used in typical brain-computer interface (BCI) systems. In general, a large number of parameters are essential for an EEG classification algorithm because of the redundant features involved in EEG signals. However, the generalization of an EEG method is often adversely affected by model complexity, which grows with the number of free parameters and can lead to heavy overfitting. To decrease complexity and improve the generalization of the EEG method, we present a novel l1-norm-based approach that combines the decision values obtained from each EEG channel directly. By extracting the information from different channels on independent frequency bands (FB) with l1-norm regularization, the proposed method fits the training data with far fewer parameters than common spatial pattern (CSP) methods, thereby reducing overfitting. Moreover, an effective and efficient solution to minimize the optimization objective is proposed. The experimental results on dataset IVa of BCI competition III and dataset I of BCI competition IV show that the proposed method achieves high classification accuracy and improves generalization performance for the classification of MI EEG. As the training set ratio decreases from 80% to 20%, the average classification accuracy on the two datasets changes from 85.86% and 86.13% to 84.81% and 76.59%, respectively. The classification performance and generalization of the proposed method contribute to the practical application of MI-based BCI systems. PMID:29867307
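
    The flavor of the approach, combining per-channel decision values under an l1 penalty so that uninformative channels receive near-zero weight, can be sketched with a generic subgradient solver on squared loss. This is not the paper's objective or solver; the data, hyperparameters, and loss are illustrative assumptions.

```python
def fit_l1_weights(X, y, lam=0.1, lr=0.05, epochs=500):
    """Learn a sparse linear combination of per-channel decision values
    by subgradient descent on mean squared error + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(d):
                grad[j] += 2 * err * xi[j] / n
        for j in range(d):
            # Subgradient of the l1 term; sign(0) taken as 0.
            g = grad[j] + lam * ((w[j] > 0) - (w[j] < 0))
            w[j] -= lr * g
    return w

# Toy data: the label depends on channel 0 only; channel 1 is noise.
X = [[1.0, 0.3], [-1.0, 0.5], [1.2, -0.2], [-0.8, 0.1]]
y = [1.0, -1.0, 1.0, -1.0]
w = fit_l1_weights(X, y)
```

    The l1 penalty drives the noise channel's weight toward zero while keeping the informative channel's weight large, which is the sparsity-for-generalization trade-off the paper exploits.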

  18. A knowledge-driven approach to biomedical document conceptualization.

    PubMed

    Zheng, Hai-Tao; Borchert, Charles; Jiang, Yong

    2010-06-01

    Biomedical document conceptualization is the process of clustering biomedical documents based on ontology-represented domain knowledge. The result of this process is the representation of the biomedical documents by a set of key concepts and their relationships. Most clustering methods, however, cluster documents based on a single, fixed body of domain knowledge. The objective of this work is to develop an effective method to cluster biomedical documents based on various user-specified ontologies, so that users can exploit the concept structures of documents more effectively. We develop a flexible framework that allows users to specify the knowledge bases in the form of ontologies. Based on the user-specified ontologies, we develop a key concept induction algorithm, which uses latent semantic analysis to identify key concepts and cluster documents. A corpus-related ontology generation algorithm is developed to generate the concept structures of documents. On two biomedical datasets, we evaluate the proposed method against five other clustering algorithms. The clustering results of the proposed method outperform the five other algorithms in terms of key concept identification. On the first biomedical dataset, our method achieves F-measure values of 0.7294 and 0.5294 based on the MeSH ontology and the Gene Ontology (GO), respectively. On the second biomedical dataset, it achieves F-measure values of 0.6751 and 0.6746 based on the MeSH ontology and GO, respectively. Both results outperform the five other algorithms in terms of F-measure. Based on the MeSH ontology and GO, the generated corpus-related ontologies show informative conceptual structures. The proposed method enables users to specify the domain knowledge needed to exploit the conceptual structures of biomedical document collections. In addition, the proposed method is able to extract the key concepts and cluster the documents with relatively high precision. Copyright 2010 Elsevier B.V. All rights reserved.
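
    The latent-semantic-analysis step at the heart of such key-concept induction can be sketched on a toy concept-by-document matrix; the matrix below is invented for illustration and this is not the paper's algorithm, only the standard SVD machinery it builds on:

    ```python
    import numpy as np

    # Toy concept-by-document frequency matrix: documents 0-1 share
    # concepts 0-1, documents 2-3 share concepts 2-3 (two latent topics).
    A = np.array([[5.0, 4.0, 0.0, 0.0],
                  [4.0, 5.0, 0.0, 0.0],
                  [0.0, 0.0, 2.0, 1.0],
                  [0.0, 0.0, 1.0, 2.0]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Concepts loading most strongly on the leading latent dimension are
    # candidate "key concepts"; documents cluster by their scores in Vt.
    key_concept = int(np.argmax(np.abs(U[:, 0])))
    doc_cluster = np.abs(Vt[0]) > np.abs(Vt[1])
    print("key concept:", key_concept, "document split:", doc_cluster)
    ```

    On this block-structured matrix the leading singular direction picks out the dominant topic, so the key concept falls in the first concept pair and the documents split cleanly into the two topic groups.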

  19. The European Glaucoma Society Glaucocard project: improved digital documentation of medical data for glaucoma patients based on standardized structured international datasets.

    PubMed

    Schargus, Marc; Grehn, Franz; Glaucocard Workgroup

    2008-12-01

    To evaluate existing international IT-based ophthalmological medical data projects, and to define a glaucoma data set based on existing international standards of medical and ophthalmological documentation. To develop the technical environment for easy data mining and data exchange between different countries in Europe. Existing clinical and IT-based projects for documentation of medical data in general medicine and ophthalmology were analyzed to create new data sets for medical documentation in glaucoma patients. Different types of data transfer methods were evaluated to find the best method of data exchange between ophthalmologists in different European countries. Data sets from existing IT projects showed a wide variability in specifications, use of codes, terms and graphical data (perimetry, optic nerve analysis etc.) in glaucoma patients. New standardized digital datasets for glaucoma patients were defined, based on existing standards, which can be used by general ophthalmologists for follow-up examinations and by glaucoma specialists to perform teleconsultation, also across country borders. Datasets are available in different languages. Different types of data exchange methods using secure medical data transfer by internet, USB stick and smartcard were tested for different countries with regard to legal acceptance, practicability and technical realization (e.g. compatibility with EMR systems). By creating new standardized glaucoma-specific cross-national datasets, it is now possible to develop an electronic glaucoma patient record system for data storage and transfer based on internet, smartcard or USB stick. The digital data can be used for referrals and for teleconsultation of glaucoma specialists in order to optimize glaucoma treatment. This should improve the quality of glaucoma care and reduce health care costs by preventing unnecessary repeated examinations.

  20. Genome-Wide association study identifies candidate genes for Parkinson's disease in an Ashkenazi Jewish population

    PubMed Central

    2011-01-01

    Background To date, nine Parkinson disease (PD) genome-wide association studies in North American, European and Asian populations have been published. The majority of studies have confirmed the association of the previously identified genetic risk factors, SNCA and MAPT, and two studies have identified three new PD susceptibility loci/genes (PARK16, BST1 and HLA-DRB5). In a recent meta-analysis of datasets from five of the published PD GWAS an additional 6 novel candidate genes (SYT11, ACMSD, STK39, MCCC1/LAMP3, GAK and CCDC62/HIP1R) were identified. Collectively the associations identified in these GWAS account for only a small proportion of the estimated total heritability of PD suggesting that an 'unknown' component of the genetic architecture of PD remains to be identified. Methods We applied a GWAS approach to a relatively homogeneous Ashkenazi Jewish (AJ) population from New York to search for both 'rare' and 'common' genetic variants that confer risk of PD by examining any SNPs with allele frequencies exceeding 2%. We have focused on a genetic isolate, the AJ population, as a discovery dataset since this cohort has a higher sharing of genetic background and historically experienced a significant bottleneck. We also conducted a replication study using two publicly available datasets from dbGaP. The joint analysis dataset had a combined sample size of 2,050 cases and 1,836 controls. Results We identified the top 57 SNPs showing the strongest evidence of association in the AJ dataset (p < 9.9 × 10^-5). Six SNPs located within gene regions had positive signals in at least one other independent dbGaP dataset: LOC100505836 (Chr3p24), LOC153328/SLC25A48 (Chr5q31.1), UNC13B (9p13.3), SLCO3A1 (15q26.1), WNT3 (17q21.3) and NSF (17q21.3). We also replicated published associations for the gene regions SNCA (Chr4q21; rs3775442, p = 0.037), PARK16 (Chr1q32.1; rs823114 (NUCKS1), p = 6.12 × 10^-4), BST1 (Chr4p15; rs12502586, p = 0.027), STK39 (Chr2q24.3; rs3754775, p = 0.005), and LAMP3 (Chr3; rs12493050, p = 0.005) in addition to the two most common PD susceptibility genes in the AJ population LRRK2 (Chr12q12; rs34637584, p = 1.56 × 10^-4) and GBA (Chr1q21; rs2990245, p = 0.015). Conclusions We have demonstrated the utility of the AJ dataset in PD candidate gene and SNP discovery both by replication in dbGaP datasets with a larger sample size and by replicating association of previously identified PD susceptibility genes. Our GWAS study has identified candidate gene regions for PD that are implicated in neuronal signalling and the dopamine pathway. PMID:21812969

  1. DMirNet: Inferring direct microRNA-mRNA association networks.

    PubMed

    Lee, Minsu; Lee, HyungJune

    2016-12-05

    MicroRNAs (miRNAs) play important regulatory roles in a wide range of biological processes by inducing target mRNA degradation or translational repression. Based on the correlation between the expression profiles of a miRNA and its target mRNA, various computational methods have previously been proposed to identify miRNA-mRNA association networks from matched miRNA and mRNA expression profiles. However, three major issues remain to be resolved in conventional computational approaches for inferring miRNA-mRNA association networks from expression profiles. 1) Correlations inferred from observed expression profiles with conventional correlation-based methods include numerous erroneous links or over-estimated edge weights, due to transitive information flow among direct associations. 2) Because of the high-dimension, low-sample-size nature of microarray datasets, it is difficult to obtain an accurate and reliable estimate of the empirical correlations between all pairs of expression profiles. 3) Because previously proposed computational methods usually show varying performance across different datasets, a more reliable model that guarantees optimal or near-optimal performance across datasets is highly needed. In this paper, we present DMirNet, a new framework for identifying direct miRNA-mRNA association networks. To tackle the aforementioned issues, DMirNet incorporates 1) three direct correlation estimation methods (Corpcor, SPACE, and network deconvolution) to infer direct miRNA-mRNA association networks, 2) bootstrapping to make fuller use of limited training expression profiles, and 3) rank-based ensemble aggregation to build a reliable and robust model across different datasets. Our empirical experiments on three datasets demonstrate the combinatorial effects of the necessary components of DMirNet. Additional performance comparison experiments show that DMirNet outperforms the state-of-the-art ensemble-based model [1], which had shown the best performance across the same three datasets, by a factor of up to 1.29. Further, we identify 43 putative novel multi-cancer-related miRNA-mRNA association relationships from an inferred top-1000 direct miRNA-mRNA association network. We believe that DMirNet is a promising method for identifying novel direct miRNA-mRNA relations and elucidating direct miRNA-mRNA association networks. Since DMirNet infers direct relationships from the observed data, it can contribute to reconstructing various direct regulatory pathways, including, but not limited to, direct miRNA-mRNA association networks.
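
    The rank-based ensemble aggregation step can be illustrated in isolation: each estimator's edge scores are converted to ranks, and the ranks are averaged so that no single method's score scale dominates. A hypothetical sketch with invented score matrices, not DMirNet's implementation:

    ```python
    import numpy as np

    def rank_aggregate(score_matrices):
        """Average the per-method ranks of edge scores (rank 1 = strongest edge).
        A lower mean rank marks a more consistently supported association."""
        shape = score_matrices[0].shape
        mean_rank = np.zeros(shape).ravel()
        for s in score_matrices:
            flat = np.abs(s).ravel()
            order = np.argsort(-flat)             # descending by |score|
            ranks = np.empty_like(order)
            ranks[order] = np.arange(1, flat.size + 1)
            mean_rank += ranks
        return (mean_rank / len(score_matrices)).reshape(shape)

    # Two hypothetical miRNA-by-mRNA score matrices from different estimators:
    s1 = np.array([[0.9, 0.1], [0.2, 0.5]])
    s2 = np.array([[0.8, 0.3], [0.1, 0.6]])
    agg = rank_aggregate([s1, s2])
    print(agg)   # edge (0, 0) is the strongest under both estimators
    ```

    Working in rank space makes the aggregation robust to estimators whose raw scores live on incomparable scales, which is one plausible reason a rank-based ensemble behaves consistently across datasets.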

  2. Pathological brain detection based on wavelet entropy and Hu moment invariants.

    PubMed

    Zhang, Yudong; Wang, Shuihua; Sun, Ping; Phillips, Preetha

    2015-01-01

    With the aim of developing an accurate pathological brain detection system, we proposed a novel automatic computer-aided diagnosis (CAD) system to detect pathological brains from normal brains in magnetic resonance imaging (MRI) scans. The problem remains a challenge for technicians and clinicians, since MR imaging generates an exceptionally large amount of information. A new two-step approach was proposed in this study. We used wavelet entropy (WE) and Hu moment invariants (HMI) for feature extraction, and the generalized eigenvalue proximal support vector machine (GEPSVM) for classification. To further enhance classification accuracy, the popular radial basis function (RBF) kernel was employed. Results from 10 runs of k-fold stratified cross-validation showed that the proposed "WE + HMI + GEPSVM + RBF" method was superior to existing methods w.r.t. classification accuracy, obtaining average classification accuracies of 100%, 100%, and 99.45% over Dataset-66, Dataset-160, and Dataset-255, respectively. The proposed method is effective and can be applied in realistic settings.
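
    The wavelet entropy feature can be sketched for a 1-D signal (the paper applies it to 2-D MR images): decompose with a Haar wavelet and take the Shannon entropy of the relative subband energies. A simplified illustration, not the authors' pipeline:

    ```python
    import numpy as np

    def haar_dwt(x):
        """One level of the Haar wavelet transform: (approximation, detail)."""
        a = (x[0::2] + x[1::2]) / np.sqrt(2)
        d = (x[0::2] - x[1::2]) / np.sqrt(2)
        return a, d

    def wavelet_entropy(x, levels=3):
        """Shannon entropy of the relative energies of the detail subbands
        plus the final approximation subband."""
        energies = []
        a = np.asarray(x, dtype=float)
        for _ in range(levels):
            a, d = haar_dwt(a)
            energies.append(np.sum(d ** 2))
        energies.append(np.sum(a ** 2))
        p = np.array(energies) / np.sum(energies)
        p = p[p > 0]                     # drop empty bands before taking logs
        return -np.sum(p * np.log(p))

    print(wavelet_entropy(np.ones(64)))                                  # flat signal
    print(wavelet_entropy(np.random.default_rng(0).normal(size=64)))     # noisy signal
    ```

    A flat signal concentrates all energy in the approximation band and yields zero entropy, while broadband noise spreads energy across bands and yields high entropy, which is what makes the measure useful as a compact texture feature.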

  3. Linear discriminant analysis based on L1-norm maximization.

    PubMed

    Zhong, Fujin; Zhang, Jiashu

    2013-08-01

    Linear discriminant analysis (LDA) is a well-known dimensionality reduction technique that is widely used for many purposes. However, conventional LDA is sensitive to outliers because its objective function is based on an L2-norm distance criterion. This paper proposes a simple but effective robust variant of LDA based on L1-norm maximization, which learns a set of locally optimal projection vectors by maximizing the ratio of the L1-norm-based between-class dispersion to the L1-norm-based within-class dispersion. The proposed method is theoretically shown to be feasible and robust to outliers, while also overcoming the singularity problem of the within-class scatter matrix in conventional LDA. Experiments on artificial datasets, standard classification datasets and three popular image databases demonstrate the efficacy of the proposed method.
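
    The objective being maximized can be written down directly: for a projection vector w, the ratio of L1 between-class dispersion to L1 within-class dispersion. A minimal sketch of evaluating that ratio on invented toy data (the paper's contribution is an iterative procedure for maximizing it, which is omitted here):

    ```python
    import numpy as np

    def l1_dispersion_ratio(w, X, y):
        """L1-norm LDA objective: L1 between-class dispersion divided by
        L1 within-class dispersion for a unit projection vector w."""
        w = w / np.linalg.norm(w)
        classes, counts = np.unique(y, return_counts=True)
        mean = X.mean(axis=0)
        between = within = 0.0
        for c, n_c in zip(classes, counts):
            Xc = X[y == c]
            mc = Xc.mean(axis=0)
            between += n_c * abs(w @ (mc - mean))
            within += np.abs((Xc - mc) @ w).sum()
        return between / within

    # Two classes separated along the first axis only:
    X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 0.0], [6.0, 1.0]])
    y = np.array([0, 0, 1, 1])
    print(l1_dispersion_ratio(np.array([1.0, 0.0]), X, y))  # discriminative axis
    print(l1_dispersion_ratio(np.array([0.0, 1.0]), X, y))  # uninformative axis
    ```

    Because every deviation enters through an absolute value rather than a square, a single far-away outlier contributes linearly instead of quadratically, which is the source of the claimed robustness.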

  4. Evaluation of the Soil Conservation Service curve number methodology using data from agricultural plots

    NASA Astrophysics Data System (ADS)

    Lal, Mohan; Mishra, S. K.; Pandey, Ashish; Pandey, R. P.; Meena, P. K.; Chaudhary, Anubhav; Jha, Ranjit Kumar; Shreevastava, Ajit Kumar; Kumar, Yogendra

    2017-01-01

    The Soil Conservation Service curve number (SCS-CN) method, also known as the Natural Resources Conservation Service curve number (NRCS-CN) method, is popular for computing the volume of direct surface runoff for a given rainfall event. The performance of the SCS-CN method, originally based on large rainfall (P) and runoff (Q) datasets from United States watersheds, is evaluated here using a large dataset of natural storm events from 27 agricultural plots in India. On the whole, the CN estimates from the National Engineering Handbook (chapter 4) tables do not match those derived from the observed P and Q datasets. As a result, runoff prediction using the former CNs was poor for the data of 22 (out of 24) plots. However, the match was a little better for higher CN values, consistent with the general notion that the existing SCS-CN method performs better for high rainfall-runoff (high CN) events. Infiltration capacity (fc) was the main explanatory variable for runoff (or CN) production in the study plots, which exhibited the expected inverse relationship between CN and fc. The plot-data optimization yielded initial abstraction coefficient (λ) values from 0 to 0.659 for the ordered dataset and 0 to 0.208 for the natural dataset (with 0 as the most frequent value). Mean and median λ values were, respectively, 0.030 and 0 for the natural rainfall-runoff dataset and 0.108 and 0 for the ordered rainfall-runoff dataset. Runoff estimation was very sensitive to λ, and it improved consistently as λ changed from 0.2 to 0.03.
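
    For reference, the SCS-CN relation that the λ values above calibrate can be sketched as follows (metric form, with depths in mm; the function name is illustrative):

    ```python
    def scs_cn_runoff(P, CN, lam=0.2):
        """Direct runoff Q (mm) from event rainfall P (mm) by the SCS-CN method.
        S is the potential maximum retention and Ia = lam * S the initial
        abstraction; the study above found lam near 0.03 fits its plot data
        better than the handbook value 0.2."""
        S = 25400.0 / CN - 254.0          # potential maximum retention, mm
        Ia = lam * S                      # initial abstraction, mm
        if P <= Ia:
            return 0.0                    # all rainfall abstracted, no runoff
        return (P - Ia) ** 2 / (P - Ia + S)

    print(scs_cn_runoff(100.0, 75))             # handbook lambda = 0.2
    print(scs_cn_runoff(100.0, 75, lam=0.03))   # lambda near the plot-data optimum
    ```

    Lowering λ shrinks the initial abstraction, so the same storm produces more predicted runoff; that is why the runoff estimates are so sensitive to the choice of λ.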

  5. Base-Calling Algorithm with Vocabulary (BCV) Method for Analyzing Population Sequencing Chromatograms

    PubMed Central

    Fantin, Yuri S.; Neverov, Alexey D.; Favorov, Alexander V.; Alvarez-Figueroa, Maria V.; Braslavskaya, Svetlana I.; Gordukova, Maria A.; Karandashova, Inga V.; Kuleshov, Konstantin V.; Myznikova, Anna I.; Polishchuk, Maya S.; Reshetov, Denis A.; Voiciehovskaya, Yana A.; Mironov, Andrei A.; Chulanov, Vladimir P.

    2013-01-01

    Sanger sequencing is a common method of reading DNA sequences. It is less expensive than high-throughput methods, and it is appropriate for numerous applications, including molecular diagnostics. However, sequencing mixtures of similar pathogen DNA with this method is challenging. This is important because most clinical samples contain such mixtures rather than pure single strains. The traditional solution is to sequence selected clones of PCR products, a complicated, time-consuming, and expensive procedure. Here, we propose the base-calling with vocabulary (BCV) method, which computationally deciphers Sanger chromatograms obtained from mixed DNA samples. The inputs to the BCV algorithm are a chromatogram and a dictionary of sequences similar to those we expect to obtain. We apply the base-calling function to a test dataset of chromatograms without ambiguous positions, as well as one with 3–14% sequence degeneracy. Furthermore, we use BCV to assemble a consensus sequence for an HIV genome fragment in a sample containing a mixture of viral DNA variants and to determine the positions of the indels. Finally, we detect drug-resistant Mycobacterium tuberculosis strains carrying frameshift mutations in the pncA gene mixed with wild-type bacteria, and roughly characterize bacterial communities in clinical samples by direct 16S rRNA sequencing. PMID:23382983

  6. Improving binding mode and binding affinity predictions of docking by ligand-based search of protein conformations: evaluation in D3R grand challenge 2015

    NASA Astrophysics Data System (ADS)

    Xu, Xianjin; Yan, Chengfei; Zou, Xiaoqin

    2017-08-01

    The growing number of protein-ligand complex structures in the Protein Data Bank, particularly structures of proteins co-bound with different ligands, helps us tackle two major challenges in molecular docking studies: protein flexibility and the scoring function. Here, we introduce a systematic strategy that uses the information embedded in known protein-ligand complex structures to improve both binding mode and binding affinity predictions. Specifically, a ligand similarity calculation method is employed to select, for docking, a receptor structure whose bound ligand shares high similarity with the query ligand. The strategy was applied to the two datasets (HSP90 and MAP4K4) in the recent D3R Grand Challenge 2015. In addition, for the HSP90 dataset, a system-specific scoring function (ITScore2_hsp90) was generated by recalibrating our statistical potential-based scoring function (ITScore2) using the known protein-ligand complex structures and the statistical mechanics-based iterative method. For the HSP90 dataset, better performance was achieved for both binding mode and binding affinity predictions compared with the original ITScore2 and with ensemble docking. For the MAP4K4 dataset, although there were only eight known protein-ligand complex structures, our docking strategy achieved performance comparable to ensemble docking. Our method for receptor conformational selection and our iterative method for developing system-specific statistical potential-based scoring functions can easily be applied to other protein targets with a number of available protein-ligand complex structures to improve binding predictions.

  7. EnviroAtlas - Big Game Hunting Recreation Demand by 12-Digit HUC in the Conterminous United States

    EPA Pesticide Factsheets

    This EnviroAtlas dataset includes the total number of recreational days per year demanded by people ages 18 and over for big game hunting by location in the contiguous United States. Big game includes deer, elk, bear, and wild turkey. These values are based on 2010 population distribution, 2011 U.S. Fish and Wildlife Service (FWS) Fish, Hunting, and Wildlife-Associated Recreation (FHWAR) survey data, and 2011 U.S. Department of Agriculture (USDA) Forest Service National Visitor Use Monitoring program data, and have been summarized by 12-digit hydrologic unit code (HUC). This dataset was produced by the US EPA to support research and online mapping activities related to the EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).

  8. Innovations in user-defined analysis: dynamic grouping and customized user datasets in VistaPHw.

    PubMed

    Solet, David; Glusker, Ann; Laurent, Amy; Yu, Tianji

    2006-01-01

    Flexible, ready access to community health assessment data is a feature of innovative Web-based data query systems. An example is VistaPHw, which provides access to Washington state data and statistics used in community health assessment. Because of its flexible analysis options, VistaPHw customizes local, population-based results to be relevant to public health decision-making. The advantages of two innovations, dynamic grouping and the Custom Data Module, are described. Dynamic grouping permits the creation of user-defined aggregations of geographic areas, age groups, race categories, and years. Standard VistaPHw measures such as rates, confidence intervals, and other statistics may then be calculated for the new groups. Dynamic grouping has provided data for major, successful grant proposals, building partnerships with local governments and organizations, and informing program planning for community organizations. The Custom Data Module allows users to prepare virtually any dataset so it may be analyzed in VistaPHw. Uses for this module may include datasets too sensitive to be placed on a Web server or datasets that are not standardized across the state. Limitations and other system needs are also discussed.

  9. Five types of home-visit nursing agencies in Japan based on characteristics of service delivery: cluster analysis of three nationwide surveys.

    PubMed

    Fukui, Sakiko; Yamamoto-Mitani, Noriko; Fujita, Junko

    2014-12-20

    The number of home-visit nursing agencies in Japan has greatly increased over the past 20 years since the Japanese government introduced the system in 1992 to meet the increased needs of home-bound elderly people. Since then, home-visit nursing has come to serve a variety of populations, such as those with terminal-stage cancer, neurological diseases or psychiatric conditions, and children with chronic conditions; the number of agencies has now reached 6,801 (as of April 2013). Yet little is known about the details of their characteristics in terms of patient types or differences and similarities across regions. In this study, we developed a method to categorize home-visit nursing agencies throughout Japan based on their actual service delivery, in order to help improve healthcare policies and allow those agencies to provide better services. We performed a cluster analysis on data from two national databases (the Survey of Institutions and Establishments for Long-term Care, administered annually by the Ministry of Health, Labour and Welfare [dataset 1; n = 5,161], and the Information Publication System for Long-term Care, reported annually by home-visit nursing agencies to their respective prefectural governments [dataset 2; n = 4,400; matching rate to dataset 1: 84.4%]), in addition to the results of our original nationwide fax survey of the service delivery systems of home-visit nursing agencies (dataset 3; n = 2,049; matching rate to dataset 1: 39.3%). The cluster analysis suggested five categories of home-visit nursing agencies based on the type of service delivery system. To settle on these categories, we held 13 panel discussions with specialists to confirm that the categorization of the home-visit nursing agencies appropriately reflected their actual delivery systems. The five categories were: nurse-centered (560, 10.9%), rehabilitation-centered (211, 4.1%), psychiatric-centered (360, 7.0%), urban-centered (1,784, 34.5%), and rural-centered (2,246, 43.5%). This five-category classification of home-visit nursing agencies could support healthcare policies that allow agencies to provide better home-visit nursing services based on their patient and staff characteristics and regional needs. The findings would be valuable both in Japan and in other countries with rapidly growing aging populations.

  10. Sleep spindle detection using deep learning: A validation study based on crowdsourcing.

    PubMed

    Dakun Tan; Rui Zhao; Jinbo Sun; Wei Qin

    2015-08-01

    Sleep spindles are significant transient oscillations observed on the electroencephalogram (EEG) in stage 2 of non-rapid eye movement sleep. The deep belief network (DBN), which has achieved great success in image and speech processing, is still a novel method for developing sleep spindle detection systems. In this paper, crowdsourcing, in place of a gold standard, was applied to generate three differently labeled samples, and three classes of datasets were constructed from combinations of these samples. An F1-score measure was estimated to compare the performance of the DBN with three other classifiers on these samples, with the DBN obtaining a result of 92.78%. A comparison of two feature extraction methods based on power spectral density was then made on the same dataset using the DBN. In addition, the trained DBN was applied to detect sleep spindles from raw EEG recordings and performed comparably to the expert group consensus.

  11. Protein interaction network topology uncovers melanogenesis regulatory network components within functional genomics datasets.

    PubMed

    Ho, Hsiang; Milenković, Tijana; Memisević, Vesna; Aruri, Jayavani; Przulj, Natasa; Ganesan, Anand K

    2010-06-15

    RNA-mediated interference (RNAi)-based functional genomics is a systems-level approach to identify novel genes that control biological phenotypes. Existing computational approaches can identify individual genes from RNAi datasets that regulate a given biological process. However, currently available methods cannot identify which RNAi screen "hits" are novel components of well-characterized biological pathways known to regulate the interrogated phenotype. In this study, we describe a method to identify genes from RNAi datasets that are novel components of known biological pathways. We experimentally validate our approach in the context of a recently completed RNAi screen to identify novel regulators of melanogenesis. In this study, we utilize a PPI network topology-based approach to identify targets within our RNAi dataset that may be components of known melanogenesis regulatory pathways. Our computational approach identifies a set of screen targets that cluster topologically in a human PPI network with the known pigment regulator Endothelin receptor type B (EDNRB). Validation studies reveal that these genes impact pigment production and EDNRB signaling in pigmented melanoma cells (MNT-1) and normal melanocytes. We present an approach that identifies novel components of well-characterized biological pathways from functional genomics datasets that could not have been identified by existing statistical and computational approaches.

  13. Including α s1 casein gene information in genomic evaluations of French dairy goats.

    PubMed

    Carillier-Jacquin, Céline; Larroque, Hélène; Robert-Granié, Christèle

    2016-08-04

    Genomic best linear unbiased prediction methods assume that all markers explain the same fraction of the genetic variance and do not account effectively for genes with major effects, such as the α s1 casein polymorphism in dairy goats. In this study, we investigated methods to include the available α s1 casein genotype information in genomic evaluations of French dairy goats. First, the α s1 casein genotype was included as a fixed effect in genomic evaluation models based only on bucks that were genotyped at the α s1 casein locus. Less than 1% of the females with phenotypes were genotyped at the α s1 casein gene. Thus, to incorporate these female phenotypes in the genomic evaluation, two methods that allowed for this large number of missing α s1 casein genotypes were investigated. Probabilities for each possible α s1 casein genotype were first estimated for each female of unknown genotype based on iterative peeling equations. The second method is based on a multiallelic gene content approach. For each model tested, we used three datasets, each divided into a training and a validation set: (1) two-breed population (Alpine + Saanen), (2) Alpine population, and (3) Saanen population. The α s1 casein genotype had a significant effect on milk yield, fat content and protein content. Including an α s1 casein effect in genetic and genomic evaluations based only on males with known α s1 casein genotypes improved accuracies (from 6 to 27%). In genomic evaluations based on all female phenotypes, the gene content approach performed better than the other tested methods, but the improvement in accuracy was only slightly better (from 1 to 14%) than that of a genomic model without the α s1 casein effect. Including the α s1 casein effect in a genomic evaluation model for French dairy goats is possible and useful to improve accuracy. Difficulties in predicting the genotypes of ungenotyped animals limited the improvement in accuracy of the obtained estimated breeding values.

  14. Prediction of brain tissue temperature using near-infrared spectroscopy.

    PubMed

    Holper, Lisa; Mitra, Subhabrata; Bale, Gemma; Robertson, Nicola; Tachtsidis, Ilias

    2017-04-01

    Broadband near-infrared spectroscopy (NIRS) can provide an endogenous indicator of tissue temperature based on the temperature dependence of the water absorption spectrum. We describe a first evaluation of the calibration and prediction of brain tissue temperature obtained during hypothermia in newborn piglets (animal dataset) and rewarming in newborn infants (human dataset), based on measured body (rectal) temperature. Calibration using partial least squares regression proved to be a reliable method to predict brain tissue temperature with respect to core body temperature in the wavelength interval of 720 to 880 nm, with a strong mean predictive power of [Formula: see text] (animal dataset) and [Formula: see text] (human dataset). In addition, we applied regression receiver operating characteristic curves for the first time to evaluate the temperature prediction, which provided an overall mean error bias between NIRS-predicted brain temperature and body temperature of [Formula: see text] (animal dataset) and [Formula: see text] (human dataset). We discuss the main methodological aspects, particularly the well-known issue of over- versus underestimation between brain and body temperature, which is relevant for potential clinical applications.
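
    The calibration step, partial least squares regression of temperature on spectra, can be sketched with a minimal NIPALS-style PLS1 on synthetic data; the wavelengths, loadings, and temperatures below are invented for illustration and this is not the authors' calibration pipeline:

    ```python
    import numpy as np

    def pls1_fit(X, y, n_comp=1):
        """Minimal PLS1 (NIPALS) for a single response: returns regression
        coefficients mapping centered X to centered y."""
        Xr, yr = X - X.mean(axis=0), y - y.mean()
        W, P, Q = [], [], []
        for _ in range(n_comp):
            w = Xr.T @ yr
            w = w / np.linalg.norm(w)
            t = Xr @ w                    # score vector for this component
            tt = t @ t
            p = Xr.T @ t / tt             # X loading
            q = (yr @ t) / tt             # y loading
            Xr = Xr - np.outer(t, p)      # deflate
            yr = yr - q * t
            W.append(w); P.append(p); Q.append(q)
        W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
        B = W @ np.linalg.solve(P.T @ W, Q)
        return B, X.mean(axis=0), y.mean()

    def pls1_predict(X, B, x_mean, y_mean):
        return (X - x_mean) @ B + y_mean

    # Synthetic "spectra": absorbance varies linearly with temperature.
    rng = np.random.default_rng(1)
    temps = np.linspace(33.0, 39.0, 20)            # hypothetical temperatures, degC
    loading = rng.normal(size=50)                  # temperature-sensitive spectrum
    baseline = rng.normal(size=50)
    spectra = np.outer(temps, loading) + baseline  # noiseless for the sketch

    B, xm, ym = pls1_fit(spectra, temps, n_comp=1)
    pred = pls1_predict(spectra, B, xm, ym)
    print("max abs calibration error (degC):", np.abs(pred - temps).max())
    ```

    On this noiseless rank-one example a single latent component recovers the temperature exactly; real spectra need more components, careful wavelength selection, and validation on held-out subjects.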

  15. Biases of STRUCTURE software when exploring introduction routes of invasive species.

    PubMed

    Lombaert, Eric; Guillemaud, Thomas; Deleury, Emeline

    2018-06-01

    Population genetic methods are widely used to retrace the introduction routes of invasive species. The unsupervised Bayesian clustering algorithm implemented in STRUCTURE is amongst the most frequently used of these methods, but its ability to provide reliable information about introduction routes has never been assessed. We simulated microsatellite datasets to evaluate the extent to which the results provided by STRUCTURE were misleading for the inference of introduction routes. We focused on an invasion scenario involving one native and two independently introduced populations, because it is the sole scenario that can be rejected when a particular clustering is obtained in a STRUCTURE analysis at K = 2 (two clusters). Results were classified as "misleading" or "non-misleading". We investigated the influence of effective size, bottleneck severity and number of loci on the type and frequency of misleading results. We showed that misleading STRUCTURE results were obtained for 10% of all simulated datasets. Our results highlighted two categories of misleading output. The first occurs when the native population has a low level of diversity. In this case, the two introduced populations may be very similar, despite their independent introduction histories. The second category results from convergence issues in STRUCTURE for K = 2, with strong bottleneck severity and/or large numbers of loci resulting in high levels of differentiation between the three populations. Overall, the risk of being misled by STRUCTURE in the context of introduction route inference is moderate, but it is important to remain cautious when low genetic diversity or genuine multimodality between runs is involved.

  16. Are We Predicting the Actual or Apparent Distribution of Temperate Marine Fishes?

    PubMed Central

    Monk, Jacquomo; Ierodiaconou, Daniel; Harvey, Euan; Rattray, Alex; Versace, Vincent L.

    2012-01-01

    Planning for resilience is the focus of many marine conservation programs and initiatives. These efforts aim to inform conservation strategies for marine regions to ensure they have inbuilt capacity to retain biological diversity and ecological function in the face of global environmental change – particularly changes in climate and resource exploitation. In the absence of direct biological and ecological information for many marine species, scientists are increasingly using spatially-explicit, predictive-modeling approaches. Through improved access to multibeam sonar and underwater video technology, these models provide spatial predictions of the most suitable regions for an organism at resolutions previously not possible. However, sensible-looking, well-performing models can provide very different predictions of distribution depending on which occurrence dataset is used. To examine this, we construct species distribution models for nine temperate marine sedentary fishes for a 25.7 km2 study region off the coast of southeastern Australia. We use generalized linear models (GLM), generalized additive models (GAM) and maximum entropy (MAXENT) to build models based on co-located occurrence datasets derived from two underwater video methods (i.e. baited and towed video) and fine-scale multibeam sonar based seafloor habitat variables. Overall, this study found that the choice of modeling approach did not considerably influence the prediction of distributions based on the same occurrence dataset. However, greater dissimilarity between model predictions was observed across the nine fish taxa when the two occurrence datasets were compared (relative to models based on the same dataset). Based on these results it is difficult to draw any general trends with regard to which video method provides more reliable occurrence datasets. Nonetheless, we suggest these predictions reflect the species' apparent distribution (i.e. a combination of the species' actual distribution and the probability of detecting it). Consequently, we also encourage researchers and marine managers to carefully interpret model predictions. PMID:22536325
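
    The GLM variant of such a species distribution model is just a logistic regression of occurrence on habitat covariates. The sketch below uses invented covariates (depth, a rugosity index) and invented "true" preferences standing in for the study's multibeam-derived variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
# Hypothetical seafloor covariates standing in for multibeam-derived habitat variables.
depth = rng.uniform(5.0, 60.0, n)      # metres
rugosity = rng.uniform(0.0, 1.0, n)    # unitless roughness index
# Assumed "true" preference: shallow, rugose sites are occupied more often.
p_true = 1 / (1 + np.exp(-(2.0 - 0.08 * depth + 3.0 * rugosity)))
occurrence = rng.binomial(1, p_true)

# Standardise covariates, then fit the logistic GLM by gradient ascent on the log-likelihood.
Z = np.column_stack([np.ones(n),
                     (depth - depth.mean()) / depth.std(),
                     (rugosity - rugosity.mean()) / rugosity.std()])
beta = np.zeros(3)
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-Z @ beta))
    beta += 0.5 * Z.T @ (occurrence - p_hat) / n

accuracy = ((1 / (1 + np.exp(-Z @ beta)) > 0.5) == occurrence).mean()
```

    The fitted coefficients recover the simulated preferences (negative depth effect, positive rugosity effect); the point of the abstract, however, is that swapping the occurrence dataset would change these predictions more than swapping GLM for GAM or MAXENT.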

  17. Statistical power calculations for mixed pharmacokinetic study designs using a population approach.

    PubMed

    Kloprogge, Frank; Simpson, Julie A; Day, Nicholas P J; White, Nicholas J; Tarning, Joel

    2014-09-01

    Simultaneous modelling of dense and sparse pharmacokinetic data is possible with a population approach. To determine the number of individuals required to detect the effect of a covariate, simulation-based power calculation methodologies can be employed. The Monte Carlo Mapped Power method (a simulation-based power calculation methodology using the likelihood ratio test) was extended in the current study to perform sample size calculations for mixed pharmacokinetic studies (i.e. both sparse and dense data collection). A workflow guiding an easy and straightforward pharmacokinetic study design, considering also the cost-effectiveness of alternative study designs, was used in this analysis. Initially, data were simulated for a hypothetical drug and then for the anti-malarial drug, dihydroartemisinin. Two datasets (sampling design A: dense; sampling design B: sparse) were simulated using a pharmacokinetic model that included a binary covariate effect and subsequently re-estimated using (1) the same model and (2) a model not including the covariate effect in NONMEM 7.2. Power calculations were performed for varying numbers of patients with sampling designs A and B. Study designs with statistical power >80% were selected and further evaluated for cost-effectiveness. The simulation studies of the hypothetical drug and the anti-malarial drug dihydroartemisinin demonstrated that the simulation-based power calculation methodology, based on the Monte Carlo Mapped Power method, can be utilised to evaluate and determine the sample size of mixed (part sparsely and part densely sampled) study designs. The developed method can contribute to the design of robust and efficient pharmacokinetic studies.
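
    The Monte Carlo power idea — simulate trials, fit the model with and without the covariate, apply the likelihood ratio test, and count rejections — can be sketched far more simply than the NONMEM workflow. The linear model, effect size, and group sizes below are arbitrary stand-ins for the pharmacokinetic models in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def lrt_rejects(n_per_group, effect, n_sim=500, sigma=1.0):
    """Fraction of simulated trials in which the LRT (chi-square, 1 df, 5% level) detects the covariate."""
    rejections = 0
    for _ in range(n_sim):
        group = np.repeat([0.0, 1.0], n_per_group)   # binary covariate
        y = 10.0 + effect * group + rng.normal(0.0, sigma, 2 * n_per_group)
        # Full model: separate group means; null model: one overall mean.
        fitted_full = np.where(group == 1, y[group == 1].mean(), y[group == 0].mean())
        rss_full = np.sum((y - fitted_full) ** 2)
        rss_null = np.sum((y - y.mean()) ** 2)
        lrt = 2 * n_per_group * np.log(rss_null / rss_full)   # n * log(RSS0 / RSS1)
        if lrt > 3.84:                  # chi-square(1) critical value at alpha = 0.05
            rejections += 1
    return rejections / n_sim

power = lrt_rejects(20, effect=1.0)     # power under a true covariate effect
alpha = lrt_rejects(20, effect=0.0)     # type-I error rate under no effect
```

    Repeating the `power` calculation over a grid of sample sizes, as in the paper's workflow, identifies the smallest design reaching the target (e.g. >80%) power, after which cost-effectiveness can be compared across candidate designs.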

  18. Utilizing novel diversity estimators to quantify multiple dimensions of microbial biodiversity across domains

    PubMed Central

    2013-01-01

    Background Microbial ecologists often employ methods from classical community ecology to analyze microbial community diversity. However, these methods have limitations because microbial communities differ from macro-organismal communities in key ways. This study sought to quantify microbial diversity using methods that are better suited for data spanning multiple domains of life and dimensions of diversity. Diversity profiles are one novel, promising way to analyze microbial datasets. Diversity profiles encompass many other indices, provide effective numbers of diversity (mathematical generalizations of previous indices that better convey the magnitude of differences in diversity), and can incorporate taxa similarity information. To explore whether these profiles change interpretations of microbial datasets, diversity profiles were calculated for four microbial datasets from different environments spanning all domains of life as well as viruses. Both similarity-based profiles that incorporated phylogenetic relatedness and naïve (not similarity-based) profiles were calculated. Simulated datasets were used to examine the robustness of diversity profiles to varying phylogenetic topology and community composition. Results Diversity profiles provided insights into microbial datasets that were not detectable with classical univariate diversity metrics. For all datasets analyzed, there were key distinctions between calculations that incorporated phylogenetic diversity as a measure of taxa similarity and naïve calculations. The profiles also provided information about the effects of rare species on diversity calculations. Additionally, diversity profiles were used to examine thousands of simulated microbial communities, showing that similarity-based and naïve diversity profiles only agreed approximately 50% of the time in their classification of which sample was most diverse. 
This is a strong argument for incorporating similarity information and calculating diversity with a range of emphases on rare and abundant species when quantifying microbial community diversity. Conclusions For many datasets, diversity profiles provided a different view of microbial community diversity compared to analyses that did not take into account taxa similarity information, effective diversity, or multiple diversity metrics. These findings are a valuable contribution to data analysis methodology in microbial ecology. PMID:24238386
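
    The "effective numbers of diversity" underlying a naive (non-similarity-based) diversity profile are Hill numbers, with the order q controlling the emphasis on rare versus abundant taxa. A minimal sketch, omitting the taxa-similarity extension:

```python
import math

def hill_number(abundances, q):
    """Effective number of species of order q: q=0 richness, q=1 exp(Shannon), q=2 inverse Simpson."""
    total = sum(abundances)
    p = [a / total for a in abundances if a > 0]
    if abs(q - 1.0) < 1e-9:             # q = 1 is defined as a limit
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

# A perfectly even community versus one dominated by a single taxon.
even_community = [25, 25, 25, 25]
skewed_community = [85, 5, 5, 5]
profile = [hill_number(skewed_community, q) for q in (0, 1, 2)]
```

    For the even community every order gives 4 effective species; for the skewed community the profile declines with q, which is exactly the behaviour the profiles exploit to show how rare taxa shape a diversity comparison.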

  19. Deep Convolutional Neural Network-Based Early Automated Detection of Diabetic Retinopathy Using Fundus Image.

    PubMed

    Xu, Kele; Feng, Dawei; Mi, Haibo

    2017-11-23

    The automatic detection of diabetic retinopathy is of vital importance, as it is the main cause of irreversible vision loss in the working-age population in the developed world. Early detection of diabetic retinopathy can be very helpful for clinical treatment; although several different feature extraction approaches have been proposed, the classification task for retinal images remains tedious even for trained clinicians. Recently, deep convolutional neural networks have manifested superior performance in image classification compared to previous handcrafted feature-based image classification methods. Thus, in this paper, we explored the use of deep convolutional neural network methodology for the automatic classification of diabetic retinopathy using color fundus images, and obtained an accuracy of 94.5% on our dataset, outperforming the results obtained by using classical approaches.
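
    The feature extraction a convolutional network performs reduces to stacked convolution, nonlinearity, and pooling layers. A minimal numpy forward pass of one such block — with a hand-picked edge-detector kernel rather than learned weights, purely for illustration:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:h * 2, :w * 2].reshape(h, 2, w, 2).max(axis=(1, 3))

image = np.zeros((8, 8)); image[:, 4:] = 1.0   # toy image with a vertical edge
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)      # responds to dark-to-bright transitions
features = max_pool_2x2(relu(conv2d_valid(image, kernel)))
```

    In a trained network dozens of such kernels per layer are learned from labelled fundus images, and the final feature maps feed a classifier that outputs the retinopathy grade.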

  20. Gene-Based Genome-Wide Association Analysis in European and Asian Populations Identified Novel Genes for Rheumatoid Arthritis.

    PubMed

    Zhu, Hong; Xia, Wei; Mo, Xing-Bo; Lin, Xiang; Qiu, Ying-Hua; Yi, Neng-Jun; Zhang, Yong-Hong; Deng, Fei-Yan; Lei, Shu-Feng

    2016-01-01

    Rheumatoid arthritis (RA) is a complex autoimmune disease. Using a gene-based association research strategy, the present study aims to detect unknown susceptibility to RA and to address the ethnic differences in genetic susceptibility to RA between European and Asian populations. Gene-based association analyses were performed with KGG 2.5 by using publicly available large RA datasets (14,361 RA cases and 43,923 controls of European subjects, 4,873 RA cases and 17,642 controls of Asian subjects). For the newly identified RA-associated genes, gene set enrichment analyses and protein-protein interaction analyses were carried out with DAVID and STRING version 10.0, respectively. Differential expression verification was conducted using 4 GEO datasets. The expression levels of three selected 'highly verified' genes were measured by ELISA among our in-house RA cases and controls. A total of 221 RA-associated genes were newly identified by the gene-based association study, including 71 'overlapped', 76 'European-specific' and 74 'Asian-specific' genes. Among them, 105 genes showed significant differential expression between RA patients and healthy controls in at least one dataset, especially 20 genes, including 11 'overlapped' (ABCF1, FLOT1, HLA-F, IER3, TUBB, ZKSCAN4, BTN3A3, HSP90AB1, CUTA, BRD2, HLA-DMA), 5 'European-specific' (PHTF1, RPS18, BAK1, TNFRSF14, SUOX) and 4 'Asian-specific' (RNASET2, HFE, BTN2A2, MAPK13) genes whose differential expression was significant in at least three datasets. The protein expression of two selected genes, FLOT1 (P value = 1.70E-02) and HLA-DMA (P value = 4.70E-02), in plasma was significantly different in our in-house samples. Our study identified 221 novel RA-associated genes and especially highlighted the importance of 20 candidate genes in RA. The results addressed ethnic genetic background differences for RA susceptibility between European and Asian populations and detected a long list of overlapped or ethnic-specific RA genes. The study not only greatly increases our understanding of genetic susceptibility to RA, but also provides important insights into the ethno-genetic homogeneity and heterogeneity of RA in both ethnicities.

  1. Ground-Cover Measurements: Assessing Correlation Among Aerial and Ground-Based Methods

    NASA Astrophysics Data System (ADS)

    Booth, D. Terrance; Cox, Samuel E.; Meikle, Tim; Zuuring, Hans R.

    2008-12-01

    Wyoming’s Green Mountain Common Allotment is public land providing livestock forage, wildlife habitat, and unfenced solitude, among other ecological services. It is also the center of ongoing debate over USDI Bureau of Land Management’s (BLM) adjudication of land uses. Monitoring resource use is a BLM responsibility, but conventional monitoring is inadequate for the vast areas encompassed in this and other public-land units. New monitoring methods are needed that will reduce monitoring costs. An understanding of data-set relationships among old and new methods is also needed. This study compared two conventional methods with two remote sensing methods using images captured from two meters and 100 meters above ground level from a camera stand (a ground, image-based method) and a light airplane (an aerial, image-based method). Image analysis used SamplePoint or VegMeasure software. Aerial methods allowed for increased sampling intensity at low cost relative to the time and travel required by ground methods. Costs to acquire the aerial imagery and measure ground cover on 162 aerial samples representing 9000 ha were less than $3000. The four highest correlations among data sets for bare ground—the ground-cover characteristic yielding the highest correlations (r)—ranged from 0.76 to 0.85 and included ground with ground, ground with aerial, and aerial with aerial data-set associations. We conclude that our aerial surveys are a cost-effective monitoring method, that ground with aerial data-set correlations can be equal to, or greater than, those among ground-based data sets, and that bare ground should continue to be investigated and tested for use as a key indicator of rangeland health.

  2. Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies.

    PubMed

    Boulesteix, Anne-Laure; Wilson, Rory; Hapfelmeier, Alexander

    2017-09-09

    The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly "evidence-based". Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of "evidence-based" statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. We suggest that benchmark studies-a method of assessment of statistical methods using real-world datasets-might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.

  3. Skeleton-based human action recognition using multiple sequence alignment

    NASA Astrophysics Data System (ADS)

    Ding, Wenwen; Liu, Kai; Cheng, Fei; Zhang, Jin; Li, YunSong

    2015-05-01

    Human action recognition and analysis has been an active research topic in computer vision for many years. This paper presents a method to represent human actions based on trajectories consisting of 3D joint positions. The method first decomposes an action into a sequence of meaningful atomic actions (actionlets), and then labels the actionlets with English letters according to the Davies-Bouldin index value. An action can therefore be represented as a sequence of actionlet symbols, which preserves the temporal order of occurrence of the actionlets. Finally, we employ sequence comparison to classify multiple actions using string matching algorithms (Needleman-Wunsch). The effectiveness of the proposed method is evaluated on datasets captured by commodity depth cameras. Experiments on three challenging 3D action datasets show promising results.
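
    Comparing actionlet strings with Needleman-Wunsch is the classic global alignment recurrence. A minimal scorer — the match/mismatch/gap values and the example actionlet strings are arbitrary choices for illustration, not the paper's:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two symbol sequences via dynamic programming."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align the two symbols
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    return F[n][m]

# Hypothetical actionlet strings for two action clips.
score_same = needleman_wunsch("ABDCB", "ABDCB")
score_diff = needleman_wunsch("ABDCB", "ABCB")
```

    Classification then amounts to assigning a query clip the label of the reference sequence with the highest alignment score; the gap terms make the comparison tolerant to actionlets being inserted or dropped.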

  4. A hybrid technique for speech segregation and classification using a sophisticated deep neural network

    PubMed Central

    Nawaz, Tabassam; Mehmood, Zahid; Rashid, Muhammad; Habib, Hafiz Adnan

    2018-01-01

    Recent research on speech segregation and music fingerprinting has led to improvements in speech segregation and music identification algorithms. Speech and music segregation generally involves the identification of music followed by speech segregation. However, music segregation becomes a challenging task in the presence of noise. This paper proposes a novel method of speech segregation for unlabelled stationary noisy audio signals using the deep belief network (DBN) model. The proposed method successfully segregates a music signal from noisy audio streams. A recurrent neural network (RNN)-based hidden layer segregation model is applied to remove stationary noise. Dictionary-based Fisher algorithms are employed for speech classification. The proposed method is tested on three datasets (TIMIT, MIR-1K, and MusicBrainz), and the results indicate the robustness of the proposed method for speech segregation. The qualitative and quantitative analyses carried out on the three datasets demonstrate the efficiency of the proposed method compared to state-of-the-art speech segregation and classification-based methods. PMID:29558485

  5. Estimation of time-varying growth, uptake and excretion rates from dynamic metabolomics data.

    PubMed

    Cinquemani, Eugenio; Laroute, Valérie; Cocaign-Bousquet, Muriel; de Jong, Hidde; Ropers, Delphine

    2017-07-15

    Technological advances in metabolomics have made it possible to monitor the concentration of extracellular metabolites over time. From these data, it is possible to compute the rates of uptake and excretion of the metabolites by a growing cell population, providing precious information on the functioning of intracellular metabolism. The computation of the rate of these exchange reactions, however, is difficult to achieve in practice for a number of reasons, notably noisy measurements, correlations between the concentration profiles of the different extracellular metabolites, and discontinuities in the profiles due to sudden changes in metabolic regime. We present a method for precisely estimating time-varying uptake and excretion rates from time-series measurements of extracellular metabolite concentrations, specifically addressing all of the above issues. The estimation problem is formulated in a regularized Bayesian framework and solved by a combination of extended Kalman filtering and smoothing. The method is shown to improve upon methods based on spline smoothing of the data. Moreover, when applied to two actual datasets, the method recovers known features of overflow metabolism in Escherichia coli and Lactococcus lactis, and provides evidence for acetate uptake by L. lactis after glucose exhaustion. The results raise interesting perspectives for further work on rate estimation from measurements of intracellular metabolites. The Matlab code for the estimation method is available for download at https://team.inria.fr/ibis/rate-estimation-software/, together with the datasets. eugenio.cinquemani@inria.fr. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
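
    The quantity being estimated is the specific exchange rate q(t) = -(dS/dt)/X(t) for a substrate S consumed by biomass X. A naive finite-difference baseline on clean synthetic data (all parameter values invented) shows the computation; the paper's Kalman filtering and smoothing is what makes the same estimate workable under noise and regime changes:

```python
import numpy as np

# Synthetic experiment: biomass grows exponentially while substrate is consumed
# at a constant specific uptake rate q_true (all values are invented).
mu, q_true, x0, s0 = 0.5, 2.0, 0.05, 10.0
t = np.arange(0.0, 4.0, 0.1)
biomass = x0 * np.exp(mu * t)
substrate = s0 - (q_true / mu) * x0 * (np.exp(mu * t) - 1.0)

# Specific uptake rate q(t) = -(dS/dt) / X(t), with dS/dt from central differences.
ds_dt = np.gradient(substrate, t)
q_est = -ds_dt / biomass
q_mid = np.median(q_est[1:-1])   # drop the one-sided endpoint estimates
```

    On noiseless data the finite-difference estimate recovers q_true almost exactly; with measurement noise, differencing amplifies the error, which is the motivation for the regularized Bayesian estimator in the paper.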

  6. On piecewise interpolation techniques for estimating solar radiation missing values in Kedah

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Saaban, Azizan; Zainudin, Lutfi; Bakar, Mohd Nazari Abu

    2014-12-04

    This paper discusses the use of a piecewise interpolation method based on cubic Ball and Bézier curve representations to estimate missing values of solar radiation in Kedah. An hourly solar radiation dataset was collected at the Alor Setar Meteorology Station and obtained from the Malaysian Meteorology Department. The piecewise cubic Ball and Bézier functions that interpolate the data points are defined on each hourly interval of solar radiation measurement and are obtained by prescribing first-order derivatives at the starts and ends of the intervals. We compare the performance of our proposed method with existing methods using the Root Mean Squared Error (RMSE) and Coefficient of Determination (CoD) on simulated missing-value datasets. The results show that our method outperforms the previous methods.
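
    Cubic Ball and Bézier representations are reparameterizations of the same cubic polynomial, so the interval-wise construction with prescribed endpoint derivatives can be sketched in the equivalent Hermite form (a stand-in for, not a reproduction of, the authors' exact basis; the radiation values and slopes below are invented):

```python
def hermite_cubic(t, t0, t1, y0, y1, d0, d1):
    """Cubic on [t0, t1] matching endpoint values y0, y1 and first derivatives d0, d1."""
    h = t1 - t0
    s = (t - t0) / h                      # normalized position in the interval
    h00 = (1 + 2 * s) * (1 - s) ** 2      # Hermite basis functions
    h10 = s * (1 - s) ** 2
    h01 = s ** 2 * (3 - 2 * s)
    h11 = s ** 2 * (s - 1)
    return h00 * y0 + h10 * h * d0 + h01 * y1 + h11 * h * d1

# Filling a missing half-hour solar-radiation value (invented numbers) between two
# hourly measurements, given estimated slopes at the interval ends.
estimate = hermite_cubic(1.5, 1.0, 2.0, 310.0, 420.0, 95.0, 120.0)
```

    Because the interpolant reproduces the endpoint values and slopes exactly, chaining one such cubic per hourly interval yields a C1-continuous curve from which any missing value can be read off, and RMSE/CoD can then be computed against the held-out truth.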

  7. A new approach to optic disc detection in human retinal images using the firefly algorithm.

    PubMed

    Rahebi, Javad; Hardalaç, Fırat

    2016-03-01

    There are various methods and algorithms to detect optic discs in retinal images. In recent years, much attention has been given to the utilization of intelligent algorithms. In this paper, we present a new automated method of optic disc detection in human retinal images using the firefly algorithm. The firefly algorithm is an emerging intelligent algorithm that was inspired by the social behavior of fireflies. The population in this algorithm comprises fireflies, each of which has a specific rate of lighting or fitness. In this method, the insects are compared two by two, and the less attractive insects move toward the more attractive insects. Finally, one insect is selected as the most attractive, and this insect presents the optimum response to the problem in question. Here, we used the light intensity of the retinal image pixels instead of firefly lightings. The movement of these insects due to local fluctuations produces different light intensity values in the images. Because the optic disc is the brightest area in a retinal image, all of the insects move toward the brightest area and thus specify the location of the optic disc in the image. The results of implementation show that the proposed algorithm achieved an accuracy rate of 100% on the DRIVE dataset, 95% on the STARE dataset, and 94.38% on the DiaRetDB1 dataset. These results reveal the high capability and accuracy of the proposed algorithm in detecting the optic disc in retinal images. The average time required to detect the optic disc is 2.13 s for the DRIVE dataset, 2.81 s for the STARE dataset, and 3.52 s for the DiaRetDB1 dataset.
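
    The firefly-style search for the brightest region can be sketched on a synthetic image, with a Gaussian blob standing in for the optic disc. The step size, jitter, image size, and blob position are all arbitrary illustrative choices, and the update rule is a simplified version of the full firefly attraction formula:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 64x64 "retina": a bright Gaussian blob plays the optic disc.
yy, xx = np.mgrid[0:64, 0:64]
disc_y, disc_x = 40, 22
image = np.exp(-((yy - disc_y) ** 2 + (xx - disc_x) ** 2) / 400.0)

# Fireflies start at random positions; brightness = pixel intensity under each firefly.
pos = rng.uniform(0.0, 63.0, size=(30, 2))
for _ in range(200):
    brightness = image[pos[:, 0].astype(int), pos[:, 1].astype(int)]
    brightest = pos[np.argmax(brightness)]
    # Less attractive fireflies step toward the brightest one, with random jitter.
    pos += 0.2 * (brightest - pos) + rng.normal(0.0, 0.3, pos.shape)
    pos = np.clip(pos, 0.0, 63.0)

brightness = image[pos[:, 0].astype(int), pos[:, 1].astype(int)]
found_y, found_x = pos[np.argmax(brightness)]
```

    The swarm contracts around the brightest pixel region, and the position of the final brightest firefly serves as the optic disc estimate.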

  8. Evolutionary history of Ichthyosaura alpestris (Caudata, Salamandridae) inferred from the combined analysis of nuclear and mitochondrial markers.

    PubMed

    Recuero, Ernesto; Buckley, David; García-París, Mario; Arntzen, Jan W; Cogălniceanu, Dan; Martínez-Solano, Iñigo

    2014-12-01

    Widespread species with morphologically and ecologically differentiated populations are key to understand speciation because they allow investigating the different stages of the continuous process of population divergence. The alpine newt, Ichthyosaura alpestris, with a range that covers a large part of Central Europe as well as isolated regions in all three European Mediterranean peninsulas, and with strong ecological and life-history differences among populations, is an excellent system for such studies. We sampled individuals across most of the range of the species, and analyzed mitochondrial (1442 bp) and nuclear (two nuclear genes -1554 bp- and 35 allozyme loci) markers to produce a time-calibrated phylogeny and reconstruct the historical biogeography of the species. Phylogenetic analyses of mtDNA data produced a fully resolved topology, with an endemic, Balkan clade (Vlasina) which is sister to a clade comprising an eastern and a western group. Within the former, one clade (subspecies I. a. veluchiensis) is sister to a clade containing subspecies I. a. montenegrina and I. a. serdara as well as samples from southern Romania, Bosnia-Herzegovina, Serbia and Bulgaria (subspecies I. a. reiseri and part of I. a. alpestris). Within the western group, populations from the Italian peninsula (subspecies I. a. apuana and I. a. inexpectata) are sister to a clade containing samples from the Iberian Peninsula (subspecies I. a. cyreni) and the remainder of the samples from subspecies I. a. alpestris (populations from Hungary, Austria, Poland, France, Germany and the larger part of Romania). Results of (∗)BEAST analyses on a combined mtDNA and nDNA dataset consistently recovered with high statistical support four lineages with unresolved inter-relationships: (1) subspecies I. a. veluchiensis; (2) subspecies I. a. apuana+I. a. inexpectata; (3) subspecies I. a. cyreni+part of subspecies I. a. 
alpestris (the westernmost populations, plus most Romanian populations); and (4) the remaining populations, including subspecies I. a. serdara, I. a. reiseri and I. a. montenegrina and part of subspecies I. a. alpestris, plus samples from Vlasina. Our time estimates are consistent with ages based on the fossil record and suggest a widespread distribution for the I. alpestris ancestor, with the split of the major eastern and western lineages during the Miocene, in the Tortonian. Our study provides a solid, comprehensive background on the evolutionary history of the species based on the most complete combined (mtDNA+nDNA+allozymes) dataset to date. The combination of the historical perspective provided by coalescent-based analyses of mitochondrial and nuclear DNA variation with individual-based multilocus assignment methods based on multiple nuclear markers (allozymes) also allowed identification of instances of discordance across markers that highlight the complexity and dynamism of past and ongoing evolutionary processes in the species. Copyright © 2014 Elsevier Inc. All rights reserved.

  9. A Statistical Method to Distinguish Functional Brain Networks

    PubMed Central

    Fujita, André; Vidal, Maciel C.; Takahashi, Daniel Y.

    2017-01-01

    One major problem in neuroscience is the comparison of functional brain networks of different populations, e.g., distinguishing the networks of controls and patients. Traditional algorithms are based on search for isomorphism between networks, assuming that they are deterministic. However, biological networks present randomness that cannot be well modeled by those algorithms. For instance, functional brain networks of distinct subjects of the same population can be different due to individual characteristics. Moreover, networks of subjects from different populations can be generated through the same stochastic process. Thus, a better hypothesis is that networks are generated by random processes. In this case, subjects from the same group are samples from the same random process, whereas subjects from different groups are generated by distinct processes. Using this idea, we developed a statistical test called ANOGVA to test whether two or more populations of graphs are generated by the same random graph model. Our simulations' results demonstrate that we can precisely control the rate of false positives and that the test is powerful to discriminate random graphs generated by different models and parameters. The method also showed to be robust for unbalanced data. As an example, we applied ANOGVA to an fMRI dataset composed of controls and patients diagnosed with autism or Asperger. ANOGVA identified the cerebellar functional sub-network as statistically different between controls and autism (p < 0.001). PMID:28261045
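
    ANOGVA itself tests whether groups of graphs come from the same random graph model; the general idea — a resampling test on a graph statistic across two groups of subjects — can be sketched with a deliberately simple edge-count statistic and a permutation scheme (this is an illustration in the same spirit, not the actual ANOGVA statistic):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_graph(n_nodes, p_edge):
    """Adjacency matrix of an undirected Erdos-Renyi graph G(n, p)."""
    upper = np.triu((rng.random((n_nodes, n_nodes)) < p_edge).astype(float), 1)
    return upper + upper.T

# Two "populations" of subjects generated by different random-graph models.
controls = [random_graph(30, 0.20) for _ in range(20)]
patients = [random_graph(30, 0.35) for _ in range(20)]

def group_gap(graphs_a, graphs_b):
    edge_count = lambda g: g.sum() / 2.0
    return abs(np.mean([edge_count(g) for g in graphs_a])
               - np.mean([edge_count(g) for g in graphs_b]))

observed = group_gap(controls, patients)
pooled = controls + patients
exceed, n_perm = 0, 500
for _ in range(n_perm):
    order = rng.permutation(len(pooled))
    shuffled = [pooled[i] for i in order]
    if group_gap(shuffled[:20], shuffled[20:]) >= observed:
        exceed += 1
p_value = (exceed + 1) / (n_perm + 1)   # permutation p-value with +1 correction
```

    Under the null hypothesis that both groups come from the same generative model, relabelling subjects should not change the statistic; a small p-value indicates the two populations of graphs were generated by different processes.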

  11. Insular Celtic population structure and genomic footprints of migration

    PubMed Central

    Hellenthal, Garrett

    2018-01-01

    Previous studies of the genetic landscape of Ireland have suggested homogeneity, with population substructure undetectable using single-marker methods. Here we have harnessed the haplotype-based method fineSTRUCTURE in an Irish genome-wide SNP dataset, identifying 23 discrete genetic clusters which segregate with geographical provenance. Cluster diversity is pronounced in the west of Ireland but reduced in the east where older structure has been eroded by historical migrations. Accordingly, when populations from the neighbouring island of Britain are included, a west-east cline of Celtic-British ancestry is revealed along with a particularly striking correlation between haplotypes and geography across both islands. A strong relationship is revealed between subsets of Northern Irish and Scottish populations, where discordant genetic and geographic affinities reflect major migrations in recent centuries. Additionally, Irish genetic proximity of all Scottish samples likely reflects older strata of communication across the narrowest inter-island crossing. Using GLOBETROTTER we detected Irish admixture signals from Britain and Europe and estimated dates for events consistent with the historical migrations of the Norse-Vikings, the Anglo-Normans and the British Plantations. The influence of the former is greater than previously estimated from Y chromosome haplotypes. In all, we paint a new picture of the genetic landscape of Ireland, revealing structure which should be considered in the design of studies examining rare genetic variation and its association with traits. PMID:29370172

  12. Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes

    PubMed Central

    Lou, Wangchao; Wang, Xiaoqing; Chen, Fan; Chen, Yixiao; Jiang, Bo; Zhang, Hua

    2014-01-01

    Developing an efficient method for the determination of DNA-binding proteins, given their vital roles in gene regulation, is highly desirable, since it would be invaluable for advancing our understanding of protein functions. In this study, we proposed a new method for the prediction of DNA-binding proteins by performing feature ranking using random forest and wrapper-based feature selection using a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, used Gaussian naïve Bayes as the underlying classifier since it outperformed five other classifiers, including decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. As a result, the proposed DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 in five-fold cross-validation with ten runs on the training benchmark dataset PDB594. Subsequently, blind tests on the independent dataset PDB186 were performed by the proposed model trained on the entire PDB594 dataset and by five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader), with the proposed DBPPred yielding the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. Independent tests performed by DBPPred on a large non-DNA-binding protein dataset and two RNA-binding protein datasets also showed improved or comparable quality compared with the relevant prediction methods. Moreover, we observed that the majority of the features selected by the proposed method differ significantly in mean value between the DNA-binding and non-DNA-binding proteins.
    All of the experimental results indicate that the proposed DBPPred can serve as an effective alternative predictor for the large-scale determination of DNA-binding proteins. PMID:24475169
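
    The pipeline described above (random-forest feature ranking followed by wrapper-based forward selection with Gaussian naïve Bayes) can be sketched with scikit-learn. This is an illustration on synthetic data, not the authors' DBPPred code, and the simple greedy loop below stands in for the full best-first search:

```python
# Sketch of a DBPPred-style pipeline: rank features with a random forest,
# then greedily add them (a simplified forward selection) with Gaussian
# naive Bayes as the underlying classifier. Synthetic data stands in for
# the PDB594 sequence-derived features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

seed = 0
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=seed)

# Step 1: feature ranking by random-forest importance (most important first).
rank = np.argsort(RandomForestClassifier(n_estimators=100, random_state=seed)
                  .fit(X, y).feature_importances_)[::-1]

# Step 2: wrapper-based forward selection, keeping a feature only if it
# improves five-fold cross-validated accuracy of GaussianNB.
selected, best = [], 0.0
for f in rank:
    trial = selected + [f]
    score = cross_val_score(GaussianNB(), X[:, trial], y, cv=5).mean()
    if score > best:
        selected, best = trial, score

print(len(selected), round(best, 3))
```

The wrapper step is what makes the selection classifier-specific: the same ranking with a different underlying classifier would generally keep a different feature subset.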

  13. Availability of state-based obesity surveillance data on high school students with disabilities in the United States.

    PubMed

    Yamaki, Kiyoshi; Lowry, Brienne Davis; Buscaj, Emilie; Zisko, Leigh; Rimmer, James H

    2015-05-01

    The aim of this study was to assess the availability of public health surveillance data on obesity among American children with disabilities in state-based surveillance programs. We reviewed annual cross-sectional datasets from state-level surveillance programs for high school students, implemented 2001-2011, for the inclusion of weight and height data and disability screening questions. When datasets included a disability screen, its content and consistency of use across years were examined. We identified 54 surveillance programs with 261 annual datasets containing obesity data. Twelve surveillance programs in 11 states included a disability screening question that could be used to extract obesity data for high school students with disabilities, leaving the other 39 states with no state-level obesity data for students with disabilities. A total of 43 annual datasets, 16.5% of those available, could be used to estimate the obesity status of students with disabilities. The frequency of use of disability questions varied across states, and the content of the questions often changed across years within a state. We concluded that state surveillance programs rarely contain questions that can be used to identify high school students with disabilities. This limits the availability of data for monitoring obesity and related health statuses among this population in the majority of states.

  14. Efficiency of ITS Sequences for DNA Barcoding in Passiflora (Passifloraceae)

    PubMed Central

    Giudicelli, Giovanna Câmara; Mäder, Geraldo; de Freitas, Loreta Brandão

    2015-01-01

    DNA barcoding is a technique for discriminating and identifying species using short, variable, and standardized DNA regions. Here, we tested for the first time the performance of plastid and nuclear regions as DNA barcodes in Passiflora. This genus is highly variable, with more than 900 species of high ecological, commercial, and ornamental importance. We analyzed 1034 accessions of 222 species representing the four subgenera of Passiflora and evaluated the effectiveness of five plastid regions and three nuclear datasets currently employed as DNA barcodes in plants using the barcoding gap and similarity- and tree-based methods. The plastid regions were able to identify fewer than 45% of species, whereas the nuclear datasets were efficient for more than 50% using the “best match” and “best close match” methods of the TaxonDNA software. In all subgenera, interspecific pairwise distances were higher than, and did not fully overlap with, the intraspecific distances, and similarity-based methods showed better results than tree-based methods. The nuclear ribosomal internal transcribed spacer 1 (ITS1) region presented a higher discrimination power than the other datasets and also showed other characteristics desirable in a DNA barcode for this genus. Therefore, we suggest that this region should be used as a starting point to identify Passiflora species. PMID:25837628
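
    The similarity-based “best match” criterion mentioned above can be reduced to a few lines: a query sequence receives the species label of its closest reference under uncorrected p-distance. The sequences and species names below are invented for illustration; TaxonDNA itself adds thresholds and tie handling on top of this idea:

```python
# Toy version of the "best match" criterion: assign the query the species
# of its nearest reference sequence by uncorrected p-distance.
refs = {
    "ACGTACGTAC": "P. edulis",
    "ACGTACGTTT": "P. edulis",
    "TTTTACGGAC": "P. caerulea",
}

def p_distance(a, b):
    """Proportion of mismatched sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def best_match(query, references):
    """Return the reference sequence closest to the query."""
    return min(references, key=lambda r: p_distance(query, r))

query = "ACGTACGTAA"
hit = best_match(query, refs)
print(refs[hit])  # → P. edulis
```

The “best close match” variant additionally rejects assignments whose best distance exceeds a pre-computed threshold, which is how singleton or divergent queries are flagged as unidentifiable.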

  15. Repeatability, Reproducibility, Separative Power and Subjectivity of Different Fish Morphometric Analysis Methods

    PubMed Central

    Takács, Péter

    2016-01-01

    We compared the repeatability, reproducibility (intra- and inter-measurer similarity), separative power and subjectivity (measurer effect on results) of four morphometric methods frequently used in ichthyological research: the “traditional” caliper-based (TRA) and truss-network (TRU) distance methods and two geometric methods that compare landmark coordinates on the body (GMB) and scales (GMS). In each case, measurements were performed three times by three measurers on the same specimens of three common cyprinid species (roach Rutilus rutilus (Linnaeus, 1758), bleak Alburnus alburnus (Linnaeus, 1758) and Prussian carp Carassius gibelio (Bloch, 1782)) collected from three closely-situated sites in the Lake Balaton catchment (Hungary) in 2014. TRA measurements were made on conserved specimens using a digital caliper, while TRU, GMB and GMS measurements were undertaken on digital images of the bodies and scales. In most cases, intra-measurer repeatability was similar. While all four methods were able to differentiate the source populations, significant differences were observed in their repeatability, reproducibility and subjectivity. GMB displayed the highest overall repeatability and reproducibility and was least burdened by measurer effect. While GMS showed similar repeatability to GMB when fish scales had a characteristic shape, it showed significantly lower reproducibility (compared with its repeatability) for each species than the other methods. TRU showed similar repeatability to GMS. TRA was the least applicable method, as measurements were obtained from the fish itself, resulting in poor repeatability and reproducibility. Although all four methods showed some degree of subjectivity, TRA was the only method where population-level separation was entirely overwritten by measurer effect.
    Based on these results, we recommend a) avoiding the aggregation of different measurers' datasets when using the TRA and GMS methods; and b) the use of image-based methods for morphometric surveys. Automation of the morphometric workflow would also reduce any measurer effect and eliminate measurement and data-input errors. PMID:27327896
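
    For the landmark-based (GMB-style) methods, repeatability can be summarised as the Procrustes disparity between repeated digitisations of the same specimen, since Procrustes superimposition removes the position, scale and rotation differences that are irrelevant to shape. A minimal sketch with synthetic landmarks (not the study's data or exact statistics):

```python
# Repeatability of landmark digitisation as mean pairwise Procrustes
# disparity among repeated measurements of one specimen (lower = better).
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(42)
true_shape = rng.normal(size=(12, 2))     # 12 landmarks on one fish

# Three repeated digitisations: the same shape plus small measurement noise.
repeats = [true_shape + rng.normal(scale=0.01, size=true_shape.shape)
           for _ in range(3)]

# procrustes() returns (aligned1, aligned2, disparity); keep the disparity.
disparities = [procrustes(a, b)[2]
               for i, a in enumerate(repeats)
               for b in repeats[i + 1:]]
repeatability = float(np.mean(disparities))
print(round(repeatability, 5))
```

Comparing the same statistic across measurers rather than within one measurer gives the reproducibility analogue, and a large gap between the two indicates a measurer effect.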

  16. Learning-based 3T brain MRI segmentation with guidance from 7T MRI labeling.

    PubMed

    Deng, Minghui; Yu, Renping; Wang, Li; Shi, Feng; Yap, Pew-Thian; Shen, Dinggang

    2016-12-01

    Segmentation of brain magnetic resonance (MR) images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) is crucial for brain structural measurement and disease diagnosis. Learning-based segmentation methods depend largely on the availability of good training ground truth. However, the commonly used 3T MR images are of insufficient image quality and often exhibit poor intensity contrast between WM, GM, and CSF. Therefore, they are not ideal for providing good ground truth label data for training learning-based methods. Recent advances in ultrahigh field 7T imaging make it possible to acquire images with excellent intensity contrast and signal-to-noise ratio. In this paper, the authors propose an algorithm based on random forest for segmenting 3T MR images by training a series of classifiers based on reliable labels obtained semiautomatically from 7T MR images. The proposed algorithm iteratively refines the probability maps of WM, GM, and CSF via a cascade of random forest classifiers for improved tissue segmentation. The proposed method was validated on two datasets, i.e., 10 subjects collected at their institution and 797 3T MR images from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. Specifically, for the mean Dice ratio of all 10 subjects, the proposed method achieved 94.52% ± 0.9%, 89.49% ± 1.83%, and 79.97% ± 4.32% for WM, GM, and CSF, respectively, which are significantly better than the state-of-the-art methods (p-values < 0.021). For the ADNI dataset, the group difference comparisons indicate that the proposed algorithm outperforms state-of-the-art segmentation methods. The authors have developed and validated a novel fully automated method for 3T brain MR image segmentation. © 2016 American Association of Physicists in Medicine.
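
    The cascade described above follows the auto-context pattern: each stage's random forest sees the original features plus the tissue-probability maps produced by the previous stage. A compact sketch with synthetic features in place of 3T image data and 7T-derived labels (not the authors' implementation):

```python
# Auto-context cascade of random forests: each stage appends the previous
# stage's class-probability estimates to the input features, iteratively
# refining the WM/GM/CSF probability maps.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 500 voxels, 10 features, 3 tissue classes.
X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)

probs = None
for stage in range(3):                      # cascade of 3 classifiers
    feats = X if probs is None else np.hstack([X, probs])
    rf = RandomForestClassifier(n_estimators=50,
                                random_state=stage).fit(feats, y)
    probs = rf.predict_proba(feats)         # refined probability maps

pred = probs.argmax(axis=1)
acc = (pred == y).mean()                    # training accuracy of the cascade
print(round(acc, 3))
```

In the imaging setting the appended probabilities are spatial maps, so later stages can also exploit neighbourhood context around each voxel, which is what drives the iterative refinement.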

  18. Fully-automated segmentation of fluid regions in exudative age-related macular degeneration subjects: Kernel graph cut in neutrosophic domain

    PubMed Central

    Rashno, Abdolreza; Nazari, Behzad; Koozekanani, Dara D.; Drayna, Paul M.; Sadri, Saeed; Rabbani, Hossein

    2017-01-01

    A fully-automated method based on graph shortest path, graph cut and neutrosophic (NS) sets is presented for fluid segmentation in OCT volumes for exudative age-related macular degeneration (EAMD) subjects. The proposed method includes three main steps: 1) The inner limiting membrane (ILM) and the retinal pigment epithelium (RPE) layers are segmented using proposed methods based on graph shortest path in the NS domain. A flattened RPE boundary is calculated such that all three types of fluid regions, intra-retinal, sub-retinal and sub-RPE, are located above it. 2) Seed points for fluid (object) and tissue (background) are initialized for graph cut by the proposed automated method. 3) A new cost function is proposed in kernel space, and is minimized with max-flow/min-cut algorithms, leading to a binary segmentation. Important properties of the proposed steps are proven and the quantitative performance of each step is analyzed separately. The proposed method is evaluated using a publicly available dataset referred to as Optima and a local dataset from the UMN clinic. For fluid segmentation in 2D individual slices, the proposed method outperforms the previously proposed methods by 18% and 21% with respect to the Dice coefficient and sensitivity, respectively, on the Optima dataset, and by 16%, 11% and 12% with respect to the Dice coefficient, sensitivity and precision, respectively, on the local UMN dataset. Finally, for 3D fluid volume segmentation, the proposed method achieves a true positive rate (TPR) and false positive rate (FPR) of 90% and 0.74%, respectively, with a correlation of 95% between automated and expert manual segmentations using linear regression analysis. PMID:29059257
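
    Step 1's layer segmentation treats each boundary (ILM, RPE) as a minimum-cost left-to-right path through a cost image derived from the scan. The dynamic-programming sketch below illustrates the shortest-path idea on a toy cost matrix; the NS-domain cost terms and the graph-cut steps of the paper are omitted:

```python
# Minimum-cost left-to-right path through a cost image, the core of
# graph-shortest-path retinal layer segmentation.
import numpy as np

def shortest_boundary(cost):
    """Min-cost path from the first to the last column, moving one column
    right and at most one row up or down per step; returns one row per column."""
    rows, cols = cost.shape
    acc = cost.astype(float).copy()            # accumulated cost
    back = np.zeros((rows, cols), dtype=int)   # backtracking pointers
    for c in range(1, cols):
        for r in range(rows):
            lo, hi = max(r - 1, 0), min(r + 2, rows)
            prev = lo + int(np.argmin(acc[lo:hi, c - 1]))
            acc[r, c] += acc[prev, c - 1]
            back[r, c] = prev
    path = [int(np.argmin(acc[:, -1]))]
    for c in range(cols - 1, 0, -1):
        path.append(int(back[path[-1], c]))
    return path[::-1]

# A low-cost horizontal band along row 2 should attract the path.
cost = np.ones((5, 6))
cost[2, :] = 0.1
print(shortest_boundary(cost))  # → [2, 2, 2, 2, 2, 2]
```

In practice the cost image is built from vertical intensity gradients (dark-to-bright for the ILM, bright-to-dark for the RPE), so the recovered path traces the layer boundary.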

  19. SPRINT: ultrafast protein-protein interaction prediction of the entire human interactome.

    PubMed

    Li, Yiwei; Ilie, Lucian

    2017-11-15

    Proteins usually perform their functions by interacting with other proteins. Predicting which proteins interact is a fundamental problem. Experimental methods are slow, expensive, and have a high rate of error. Many computational methods have been proposed, among which sequence-based ones are very promising. However, so far no such method has been able to predict the entire human interactome effectively: they require too much time or memory. We present SPRINT (Scoring PRotein INTeractions), a new sequence-based algorithm and tool for predicting protein-protein interactions. We comprehensively compare SPRINT with state-of-the-art programs on the seven most reliable human PPI datasets and show that it is more accurate while running orders of magnitude faster and using very little memory. SPRINT is the only sequence-based program that can effectively predict the entire human interactome: it requires between 15 and 100 min, depending on the dataset. Our goal is to transform the very challenging problem of predicting the entire human interactome into a routine task. The source code of SPRINT is freely available from https://github.com/lucian-ilie/SPRINT/ and the datasets and predicted PPIs from www.csd.uwo.ca/faculty/ilie/SPRINT/.

  20. LOGISMOS-B for primates: primate cortical surface reconstruction and thickness measurement

    NASA Astrophysics Data System (ADS)

    Oguz, Ipek; Styner, Martin; Sanchez, Mar; Shi, Yundi; Sonka, Milan

    2015-03-01

    Cortical thickness and surface area are important morphological measures with implications for many psychiatric and neurological conditions. Automated segmentation and reconstruction of the cortical surface from 3D MRI scans is challenging due to the variable anatomy of the cortex and its highly complex geometry. While many methods exist for this task in the context of the human brain, these methods are typically not readily applicable to the primate brain. We propose an innovative approach based on our recently proposed human cortical reconstruction algorithm, LOGISMOS-B, and the Laplace-based thickness measurement method. Quantitative evaluation of our approach was performed based on a dataset of T1- and T2-weighted MRI scans from 12-month-old macaques where labeling by our anatomical experts was used as independent standard. In this dataset, LOGISMOS-B has an average signed surface error of 0.01 ± 0.03 mm and an unsigned surface error of 0.42 ± 0.03 mm over the whole brain. Excluding the rather problematic temporal pole region further improves unsigned surface distance to 0.34 ± 0.03 mm. This high level of accuracy reached by our algorithm even in this challenging developmental dataset illustrates its robustness and its potential for primate brain studies.

  1. A methodological investigation of hominoid craniodental morphology and phylogenetics.

    PubMed

    Bjarnason, Alexander; Chamberlain, Andrew T; Lockwood, Charles A

    2011-01-01

    The evolutionary relationships of extant great apes and humans have been largely resolved by molecular studies, yet morphology-based phylogenetic analyses continue to provide conflicting results. In order to further investigate this discrepancy we present bootstrap clade support of morphological data based on two quantitative datasets, one dataset consisting of linear measurements of the whole skull from 5 hominoid genera and the second dataset consisting of 3D landmark data from the temporal bone of 5 hominoid genera, including 11 sub-species. Using similar protocols for both datasets, we were able to 1) compare distance-based phylogenetic methods to cladistic parsimony of quantitative data converted into discrete character states, 2) vary outgroup choice to observe its effect on phylogenetic inference, and 3) analyse male and female data separately to observe the effect of sexual dimorphism on phylogenies. Phylogenetic analysis was sensitive to methodological decisions, particularly outgroup selection, where designation of Pongo as an outgroup and removal of Hylobates resulted in greater congruence with the proposed molecular phylogeny. The performance of distance-based methods also justifies their use in phylogenetic analysis of morphological data. It is clear from our analyses that hominoid phylogenetics ought not to be used as an example of conflict between the morphological and molecular, but as an example of how outgroup and methodological choices can affect the outcome of phylogenetic analysis. Copyright © 2010 Elsevier Ltd. All rights reserved.
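
    The distance-based side of the comparison above can be illustrated with UPGMA (average-linkage) clustering of a pairwise morphometric distance matrix; SciPy's hierarchical clustering implements the linkage step. The taxa are real hominoid genera, but the distances below are invented for the sketch and do not reproduce the study's data:

```python
# UPGMA as a simple distance-based tree-building method: taxa are joined
# by average linkage on a pairwise distance matrix.
import numpy as np
from scipy.cluster.hierarchy import average, fcluster
from scipy.spatial.distance import squareform

taxa = ["Homo", "Pan", "Gorilla", "Pongo", "Hylobates"]
# Hypothetical pairwise morphometric distances (symmetric, zero diagonal).
D = np.array([
    [0.0, 1.0, 1.2, 2.0, 3.0],
    [1.0, 0.0, 1.1, 2.1, 3.1],
    [1.2, 1.1, 0.0, 2.2, 3.2],
    [2.0, 2.1, 2.2, 0.0, 3.3],
    [3.0, 3.1, 3.2, 3.3, 0.0],
])
tree = average(squareform(D))      # UPGMA linkage on the condensed matrix
groups = fcluster(tree, t=2, criterion="maxclust")
print(dict(zip(taxa, groups)))     # Hylobates splits off first
```

Cutting the tree at two groups separates the most distant taxon, which mirrors the outgroup sensitivity discussed in the abstract: which taxon falls outside the ingroup cluster depends entirely on the distance structure fed in.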

  2. An Efficient Optimization Method for Solving Unsupervised Data Classification Problems.

    PubMed

    Shabanzadeh, Parvaneh; Yusof, Rubiyah

    2015-01-01

    Unsupervised data classification (or clustering) analysis is one of the most useful tools and a descriptive task in data mining that seeks to classify homogeneous groups of objects based on similarity; it is used in many medical disciplines and various other applications. In general, there is no single algorithm that is suitable for all types of data, conditions, and applications. Each algorithm has its own advantages, limitations, and deficiencies. Hence, research into novel and effective approaches for unsupervised data classification is still active. In this paper the Biogeography-Based Optimization (BBO) algorithm, a heuristic inspired by the natural biogeographical distribution of different species, was adapted for data clustering problems by modifying its main operators. Like other population-based algorithms, the BBO algorithm starts with an initial population of candidate solutions to an optimization problem and an objective function that is calculated for them. To evaluate the performance of the proposed algorithm, an assessment was carried out on six medical and real-life datasets and compared with eight well-known and recent unsupervised data classification algorithms. Numerical results demonstrate that the proposed evolutionary optimization algorithm is efficient for unsupervised data classification.
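
    A minimal sketch of BBO adapted to clustering, under simplifying assumptions (linear migration rates, uniform mutation, fixed k): each "habitat" encodes a candidate set of centroids, migration copies coordinates from fitter habitats, and fitness is the within-cluster sum of squared errors. This is an illustration of the idea, not the paper's modified operators:

```python
# Biogeography-based optimisation (BBO) for clustering: habitats are
# candidate centroid sets; migration moves variables from fit habitats
# (high emigration) into unfit ones (high immigration).
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters.
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
k, pop, iters = 2, 20, 60

def sse(cent):
    """Within-cluster sum of squared errors for a (k, 2) centroid set."""
    d = ((X[:, None, :] - cent[None]) ** 2).sum(-1)
    return d.min(1).sum()

habitats = rng.uniform(X.min(), X.max(), (pop, k, 2))
for _ in range(iters):
    fit = np.array([sse(h) for h in habitats])
    habitats = habitats[np.argsort(fit)]      # best habitat first
    mu = np.linspace(1, 0, pop)               # emigration rate (best emigrates most)
    lam = 1 - mu                              # immigration rate
    new = habitats.copy()
    for i in range(pop):
        for j in range(k * 2):
            if rng.random() < lam[i]:         # immigrate this variable
                src = rng.choice(pop, p=mu / mu.sum())
                new[i].flat[j] = habitats[src].flat[j]
            if rng.random() < 0.02:           # mutation
                new[i].flat[j] = rng.uniform(X.min(), X.max())
    new[0] = habitats[0]                      # elitism: keep the best habitat
    habitats = new

best = habitats[np.argmin([sse(h) for h in habitats])]
print(round(sse(best), 2))
```

The clustering-specific part is only the fitness function; the migration/mutation machinery is generic, which is why BBO adapts to clustering with modest changes to its operators.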

  3. The allometric exponent for scaling clearance varies with age: a study on seven propofol datasets ranging from preterm neonates to adults.

    PubMed

    Wang, Chenguang; Allegaert, Karel; Peeters, Mariska Y M; Tibboel, Dick; Danhof, Meindert; Knibbe, Catherijne A J

    2014-01-01

    For scaling clearance between adults and children, allometric scaling with a fixed exponent of 0.75 is often applied. In this analysis, we performed a systematic study of the allometric exponent for scaling propofol clearance between two subpopulations selected from neonates, infants, toddlers, children, adolescents and adults. Seven propofol studies were included in the analysis (neonates, infants, toddlers, children, adolescents, adults1 and adults2). In a systematic manner, two of the six study populations were selected, resulting in 15 combined datasets. In addition, the data of the seven studies were regrouped into five age groups (FDA Guidance 1998), from which four combined datasets were prepared, each consisting of one paediatric age group and the adult group. In each of these 19 combined datasets, the allometric scaling exponent for clearance was estimated using population pharmacokinetic modelling (NONMEM 7.2). The allometric exponent for propofol clearance varied between 1.11 and 2.01 when the neonate dataset was included. When two paediatric datasets were analyzed, the exponent varied between 0.2 and 2.01, while it varied between 0.56 and 0.81 when the adult population and a paediatric dataset other than neonates were selected. Scaling from adults to adolescents, children, infants and neonates resulted in exponents of 0.74, 0.70, 0.60 and 1.11, respectively. For scaling clearance, ¾ allometric scaling may be of value for scaling between adults and adolescents or children, while it can be used neither for neonates nor for two paediatric populations. For scaling to neonates, an exponent between 1 and 2 was identified. © 2013 The British Pharmacological Society.
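
    The allometric model CL = CL_ref · (WT/70)^k linearises to log CL = log CL_ref + k · log(WT/70), so the exponent k is simply the slope of a log-log regression. The sketch below simulates clearance data with a known exponent of 0.75 and recovers it; it is a didactic stand-in for the nonlinear mixed-effects estimation (NONMEM) used in the study:

```python
# Estimating the allometric exponent k by log-log linear regression on
# simulated weight/clearance data generated with k = 0.75.
import numpy as np

rng = np.random.default_rng(1)
wt = rng.uniform(3, 80, 200)                     # body weights, kg
cl = 1.8 * (wt / 70) ** 0.75 * rng.lognormal(0, 0.05, 200)  # noisy clearance

# Slope of log(CL) vs log(WT/70) is the allometric exponent.
k, log_cl_ref = np.polyfit(np.log(wt / 70), np.log(cl), 1)
print(round(k, 2))  # close to the simulated 0.75
```

The study's finding that k depends on which subpopulations are pooled corresponds, in this picture, to the slope changing when the regression is fitted over different weight ranges of a relationship that is not truly a single power law.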

  4. Does standardised structured reporting contribute to quality in diagnostic pathology? The importance of evidence-based datasets.

    PubMed

    Ellis, D W; Srigley, J

    2016-01-01

    Key quality parameters in diagnostic pathology include timeliness, accuracy, completeness, conformance with current agreed standards, consistency and clarity in communication. In this review, we argue that with worldwide developments in eHealth and big data, generally, there are two further, often overlooked, parameters if our reports are to be fit for purpose. Firstly, population-level studies have clearly demonstrated the value of providing timely structured reporting data in standardised electronic format as part of system-wide quality improvement programmes. Moreover, when combined with multiple health data sources through eHealth and data linkage, structured pathology reports become central to population-level quality monitoring, benchmarking, interventions and benefit analyses in public health management. Secondly, population-level studies, particularly for benchmarking, require a single agreed international and evidence-based standard to ensure interoperability and comparability. This has been taken for granted in tumour classification and staging for many years, yet international standardisation of cancer datasets is only now underway through the International Collaboration on Cancer Reporting (ICCR). In this review, we present evidence supporting the role of structured pathology reporting in quality improvement for both clinical care and population-level health management. Although this review of available evidence largely relates to structured reporting of cancer, it is clear that the same principles can be applied throughout anatomical pathology generally, as they are elsewhere in the health system.

  5. Development of vulnerability curves to typhoon hazards based on insurance policy and claim dataset

    NASA Astrophysics Data System (ADS)

    Mo, Wanmei; Fang, Weihua; Li, Xinze; Wu, Peng; Tong, Xingwei

    2016-04-01

    Vulnerability refers to the characteristics and circumstances of an exposure that make it susceptible to the effects of certain hazards. It can be divided into physical, social, economic and environmental vulnerability. Physical vulnerability indicates the potential physical damage to exposure caused by natural hazards. Vulnerability curves, which quantify the loss ratio against hazard intensity with a horizontal axis for the intensity and a vertical axis for the Mean Damage Ratio (MDR), are essential to vulnerability assessment and the quantitative evaluation of disasters. Fragility refers to the probability of diverse damage states under different hazard intensities, revealing a characteristic of the exposure. Fragility curves are often used to quantify the probability of a given set of exposures reaching or exceeding a certain damage state. The development of quantitative fragility and vulnerability curves is the basis of catastrophe modeling. Generally, methods for quantitative fragility and vulnerability assessment can be categorized into empirical, analytical and expert opinion or judgment-based ones. The empirical method is one of the most popular and relies heavily on the availability and quality of historical hazard and loss datasets, which has always been a great challenge. The analytical method is usually based on engineering experiments; it is time-consuming, lacks built-in validation, and its credibility is therefore widely criticized. Expert opinion or judgment-based methods are quite effective in the absence of data, but the results can be too subjective, so the uncertainty is likely to be underestimated. In this study, we present the fragility and vulnerability curves developed with the empirical method based on simulated historical typhoon wind, rainfall and induced flood, and insurance policy and claim datasets of more than 100 historical typhoon events.
    Firstly, an insurance exposure classification system is built according to structure type, occupation type and insurance coverage. Then an MDR estimation method based on insurance policy structure and claim information is proposed and validated. Following that, fragility and vulnerability curves of the major exposure types for construction, homeowner and enterprise property insurance are fitted with empirical functions based on the historical dataset. The results of this study can not only help in understanding catastrophe risk and managing insured disaster risks, but can also be applied in other disaster risk reduction efforts.
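
    Fitting an empirical vulnerability curve amounts to regressing claim-derived MDR points against hazard intensity with a monotone function; a sigmoid is one common choice. The wind/MDR points below are synthetic, and the sigmoid parameterisation is an assumption of this sketch rather than the paper's chosen functional form:

```python
# Fitting a sigmoid vulnerability curve MDR(v) to claim-derived points:
# v50 is the wind speed at 50% mean damage ratio, s the steepness.
import numpy as np
from scipy.optimize import curve_fit

def mdr_curve(v, v50, s):
    """Sigmoid MDR model: 0 at low intensity, 1 at high intensity."""
    return 1.0 / (1.0 + np.exp(-(v - v50) / s))

wind = np.array([20, 30, 40, 50, 60, 70, 80.0])           # m/s
mdr = np.array([0.01, 0.05, 0.20, 0.50, 0.80, 0.95, 0.99])  # observed MDR

(v50, s), _ = curve_fit(mdr_curve, wind, mdr, p0=(50, 10))
print(round(v50, 1), round(s, 1))
```

A family of such curves, one per exposure class (structure type, occupation type, coverage), is what the classification system in the abstract makes possible.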

  6. Crowd density estimation based on convolutional neural networks with mixed pooling

    NASA Astrophysics Data System (ADS)

    Zhang, Li; Zheng, Hong; Zhang, Ying; Zhang, Dongming

    2017-09-01

    Crowd density estimation is an important topic in the fields of machine learning and video surveillance. Existing methods do not provide satisfactory classification accuracy; moreover, they have difficulty in adapting to complex scenes. Therefore, we propose a method based on convolutional neural networks (CNNs). The proposed method improves the performance of crowd density estimation in two key ways. First, we propose a feature pooling method named mixed pooling to regularize the CNNs; it replaces deterministic pooling operations with a learned parameter that combines conventional max pooling with average pooling. Second, we present a classification strategy in which an image is divided into two cells, each of which is categorized separately. The proposed approach was evaluated on three datasets: two ground-truth image sequences and the University of California, San Diego, anomaly detection dataset. The results demonstrate that the proposed approach performs more effectively and is easier to apply than other methods.
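
    The mixed-pooling operation can be written as a convex combination of max and average pooling over each window, out = α·max + (1−α)·avg. In the paper α is learned; the NumPy sketch below fixes it at 0.5 purely to show the mechanics:

```python
# Mixed pooling over 2x2 windows: a weighted combination of max pooling
# and average pooling, with mixing weight alpha (fixed here, learned in
# the paper).
import numpy as np

def mixed_pool(x, size=2, alpha=0.5):
    """Mixed pooling of a 2D feature map whose sides are divisible by size."""
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(h // size, w // size, size * size)
    return alpha * blocks.max(-1) + (1 - alpha) * blocks.mean(-1)

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 8.],
              [1., 1., 2., 2.],
              [1., 1., 2., 2.]])
print(mixed_pool(x))   # each output = 0.5*max + 0.5*mean of its 2x2 block
```

With α = 1 this reduces to max pooling and with α = 0 to average pooling, which is why learning α acts as a regularizer interpolating between the two.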

  7. An Active Patch Model for Real World Texture and Appearance Classification

    PubMed Central

    Mao, Junhua; Zhu, Jun; Yuille, Alan L.

    2014-01-01

    This paper addresses the task of natural texture and appearance classification. Our goal is to develop a simple and intuitive method that performs at the state of the art on datasets ranging from homogeneous texture (e.g., material texture), to less homogeneous texture (e.g., the fur of animals), and to inhomogeneous texture (the appearance patterns of vehicles). Our method uses a bag-of-words model where the features are based on a dictionary of active patches. Active patches are raw intensity patches which can undergo spatial transformations (e.g., rotation and scaling) and adjust themselves to best match the image regions. The dictionary of active patches is required to be compact and representative, in the sense that we can use it to approximately reconstruct the images that we want to classify. We propose a probabilistic model to quantify the quality of image reconstruction and design a greedy learning algorithm to obtain the dictionary. We classify images using the occurrence frequency of the active patches. Feature extraction is fast (about 100 ms per image) using the GPU. The experimental results show that our method improves the state of the art on a challenging material texture benchmark dataset (KTH-TIPS2). To test our method on less homogeneous or inhomogeneous images, we construct two new datasets consisting of appearance image patches of animals and vehicles cropped from the PASCAL VOC dataset. Our method outperforms competing methods on these datasets. PMID:25531013

  8. High resolution global gridded data for use in population studies

    PubMed Central

    Lloyd, Christopher T.; Sorichetta, Alessandro; Tatem, Andrew J.

    2017-01-01

    Recent years have seen substantial growth in openly available satellite and other geospatial data layers, which represent a range of metrics relevant to global human population mapping at fine spatial scales. The specifications of such data differ widely and therefore the harmonisation of data layers is a prerequisite to constructing detailed and contemporary spatial datasets which accurately describe population distributions. Such datasets are vital to measure impacts of population growth, monitor change, and plan interventions. To this end the WorldPop Project has produced an open access archive of 3 and 30 arc-second resolution gridded data. Four tiled raster datasets form the basis of the archive: (i) Viewfinder Panoramas topography clipped to Global ADMinistrative area (GADM) coastlines; (ii) a matching ISO 3166 country identification grid; (iii) country area; and (iv) a slope layer. Further layers include transport networks, landcover, nightlights, precipitation, travel time to major cities, and waterways. The datasets and production methodology are described here. The archive can be downloaded both from the WorldPop Dataverse Repository and the WorldPop Project website. PMID:28140386

  9. Multi-observation PET image analysis for patient follow-up quantitation and therapy assessment

    NASA Astrophysics Data System (ADS)

    David, S.; Visvikis, D.; Roux, C.; Hatt, M.

    2011-09-01

    In positron emission tomography (PET) imaging, an early therapeutic response is usually characterized by variations of semi-quantitative parameters restricted to the maximum SUV measured in PET scans during the treatment. Such measurements do not reflect overall tumor volume and radiotracer uptake variations. The proposed approach is based on multi-observation image analysis, merging several PET acquisitions to assess tumor metabolic volume and uptake variations. The fusion algorithm is based on iterative estimation using a stochastic expectation maximization (SEM) algorithm. The proposed method was applied to simulated and clinical follow-up PET images. We compared the multi-observation fusion performance to threshold-based methods proposed for the assessment of the therapeutic response based on functional volumes. On simulated datasets, the adaptive threshold applied independently on both images led to higher errors than the ASEM fusion, and on clinical datasets it failed to provide coherent measurements for four patients out of seven due to aberrant delineations. The ASEM method demonstrated improved and more robust estimation, leading to more pertinent measurements. Future work will consist of extending the methodology and applying it to clinical multi-tracer datasets in order to evaluate its potential impact on the biological tumor volume definition for radiotherapy applications.

  10. Classification of osteoporosis by artificial neural network based on monarch butterfly optimisation algorithm.

    PubMed

    Devikanniga, D; Joshua Samuel Raj, R

    2018-04-01

    Osteoporosis is a life-threatening disease which commonly affects women, mostly after menopause. It primarily causes mild bone fractures, which at an advanced stage can lead to the death of an individual. The diagnosis of osteoporosis is based on bone mineral density (BMD) values obtained through various clinical methods applied to various skeletal regions. The main objective of the authors' work is to develop a hybrid classifier model that discriminates osteoporotic patients from healthy persons based on BMD values. In this Letter, the authors propose the monarch butterfly optimisation-based artificial neural network classifier, which helps in the earlier diagnosis and prevention of osteoporosis. The experiments were conducted using the 10-fold cross-validation method for two datasets, lumbar spine and femoral neck. The results were compared with other similar hybrid approaches. The proposed method resulted in accuracy, specificity and sensitivity of 97.9% ± 0.14, 98.33% ± 0.03 and 95.24% ± 0.08, respectively, for the lumbar spine dataset and 99.3% ± 0.16, 99.2% ± 0.13 and 100%, respectively, for the femoral neck dataset. Further, its performance was compared using receiver operating characteristic analysis and the Wilcoxon signed-rank test. The results show that the proposed classifier is efficient and outperformed the other approaches in all cases.

  11. Estimation of trabecular bone parameters in children from multisequence MRI using texture-based regression

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lekadir, Karim, E-mail: karim.lekadir@upf.edu; Hoogendoorn, Corné; Armitage, Paul

    Purpose: This paper presents a statistical approach for the prediction of trabecular bone parameters from low-resolution multisequence magnetic resonance imaging (MRI) in children, thus addressing the limitations of high-resolution modalities such as HR-pQCT, including the significant exposure of young patients to radiation and the limited applicability of such modalities to peripheral bones in vivo. Methods: A statistical predictive model is constructed from a database of MRI and HR-pQCT datasets, to relate the low-resolution MRI appearance in the cancellous bone to the trabecular parameters extracted from the high-resolution images. The description of the MRI appearance is achieved between subjects by using a collection of feature descriptors, which describe the texture properties inside the cancellous bone and which are invariant to the geometry and size of the trabecular areas. The predictive model is built by fitting to the training data a nonlinear partial least squares regression between the input MRI features and the output trabecular parameters. Results: Detailed validation based on a sample of 96 datasets shows correlations >0.7 between the trabecular parameters predicted from low-resolution multisequence MRI based on the proposed statistical model and the values extracted from high-resolution HR-pQCT. Conclusions: The obtained results indicate the promise of the proposed predictive technique for the estimation of trabecular parameters in children from multisequence MRI, thus reducing the need for high-resolution radiation-based scans for a fragile population that is under development and growth.
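The regression step can be illustrated with a one-component, linear PLS fit (NIPALS-style), a deliberate simplification of the nonlinear partial least squares regression the paper uses; function names are illustrative:

```python
def pls1_fit(X, y):
    """One-component PLS regression: centre the data, take the weight
    vector proportional to X^T y, project onto it, and regress y on the
    resulting scores."""
    n, p = len(X), len(X[0])
    xm = [sum(row[j] for row in X) / n for j in range(p)]
    ym = sum(y) / n
    Xc = [[row[j] - xm[j] for j in range(p)] for row in X]
    yc = [v - ym for v in y]
    # Weight vector: covariance direction between features and response
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores and regression coefficient on the latent component
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    b = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return xm, ym, w, b

def pls1_predict(model, x):
    xm, ym, w, b = model
    t = sum((xj - xmj) * wj for xj, xmj, wj in zip(x, xm, w))
    return ym + b * t
```

When the (possibly collinear) texture features carry a linear signal, the single latent component recovers it exactly.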

  12. kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity.

    PubMed

    Murray, Kevin D; Webers, Christfried; Ong, Cheng Soon; Borevitz, Justin; Warthmann, Norman

    2017-09-01

    Modern genomics techniques generate overwhelming quantities of data. Extracting population genetic variation demands computationally efficient methods to determine genetic relatedness between individuals (or "samples") in an unbiased manner, preferably de novo. Rapid estimation of genetic relatedness directly from sequencing data has the potential to overcome reference genome bias, and to verify that individuals belong to the correct genetic lineage before conclusions are drawn using mislabelled or misidentified samples. We present the k-mer Weighted Inner Product (kWIP), an assembly- and alignment-free estimator of genetic similarity. kWIP combines a probabilistic data structure with a novel metric, the weighted inner product (WIP), to efficiently calculate pairwise similarity between sequencing runs from their k-mer counts. It produces a distance matrix, which can then be further analysed and visualised. Our method does not require prior knowledge of the underlying genomes; applications include establishing sample identity, detecting sample mix-ups, and revealing non-obvious genomic variation and population structure. We show that kWIP can reconstruct the true relatedness between samples from simulated populations. By re-analysing several published datasets we show that our results are consistent with marker-based analyses. kWIP is written in C++, licensed under the GNU GPL, and is available from https://github.com/kdmurray91/kwip.
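The WIP metric can be sketched with exact k-mer counts: weight each k-mer by the Shannon entropy of its occurrence frequency across the sample set, then take the weighted inner product of two samples' count vectors. The real kWIP tool uses probabilistic sketches rather than exact Counters, so this is only an illustration of the weighting idea:

```python
import math
from collections import Counter

def kmer_counts(seq, k=3):
    """Exact k-mer counts of one sequence (stand-in for kWIP's sketches)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def entropy_weights(counts_list):
    """Weight each k-mer by the binary Shannon entropy of its presence
    frequency across samples: k-mers present in all or none of the samples
    are uninformative and get weight 0."""
    n = len(counts_list)
    weights = {}
    for km in set().union(*counts_list):
        f = sum(1 for c in counts_list if km in c) / n
        weights[km] = (0.0 if f in (0.0, 1.0)
                       else -(f * math.log2(f) + (1 - f) * math.log2(1 - f)))
    return weights

def wip(a, b, weights):
    """Weighted inner product between two k-mer count vectors."""
    return sum(weights.get(km, 0.0) * a[km] * b[km] for km in a.keys() & b.keys())
```

Similar sequences share informative k-mers and score high; unrelated sequences share none and score zero.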

  13. Temporal variation and scale in movement-based resource selection functions

    USGS Publications Warehouse

    Hooten, M.B.; Hanks, E.M.; Johnson, D.S.; Alldredge, M.W.

    2013-01-01

    A common population characteristic of interest in animal ecology studies pertains to the selection of resources. That is, given the resources available to animals, what do they ultimately choose to use? A variety of statistical approaches have been employed to examine this question, and each has advantages and disadvantages with respect to the form of available data and the properties of estimators given model assumptions. A wealth of high-resolution telemetry data are now being collected to study animal population movement and space use, and these data present both challenges and opportunities for statistical inference. We summarize traditional methods for resource selection and then describe several extensions to deal with measurement uncertainty and the explicit movement process that exists in studies involving high-resolution telemetry data. Our approach uses a correlated random walk movement model to obtain temporally varying use and availability distributions that are employed in a weighted distribution context to estimate selection coefficients. The temporally varying coefficients are then weighted by their contribution to selection and combined to provide inference at the population level. The result is an intuitive and accessible statistical procedure that uses readily available software and is computationally feasible for large datasets. These methods are demonstrated using data collected as part of a large-scale mountain lion monitoring study in Colorado, USA.

  14. Comparing species tree estimation with large anchored phylogenomic and small Sanger-sequenced molecular datasets: an empirical study on Malagasy pseudoxyrhophiine snakes.

    PubMed

    Ruane, Sara; Raxworthy, Christopher J; Lemmon, Alan R; Lemmon, Emily Moriarty; Burbrink, Frank T

    2015-10-12

    Using molecular data generated by high-throughput next-generation sequencing (NGS) platforms to infer phylogeny is becoming common as costs go down and the ability to capture loci from across the genome goes up. While there is a general consensus that greater numbers of independent loci should result in more robust phylogenetic estimates, few studies have compared phylogenies resulting from smaller datasets for commonly used genetic markers with the large datasets captured using NGS. Here, we determine how a 5-locus Sanger dataset compares with a 377-locus anchored genomics dataset for understanding the evolutionary history of the pseudoxyrhophiine snake radiation centered in Madagascar. The Pseudoxyrhophiinae comprise ~86% of Madagascar's serpent diversity, yet they are poorly known with respect to ecology, behavior, and systematics. Using the 377-locus NGS dataset and the summary statistics species-tree methods STAR and MP-EST, we estimated a well-supported species tree that provides new insights concerning intergeneric relationships for the pseudoxyrhophiines. We also compared how these and other methods performed with respect to estimating tree topology using datasets with varying numbers of loci. Using Sanger sequencing and an anchored phylogenomics approach, we sequenced datasets comprised of 5 and 377 loci, respectively, for 23 pseudoxyrhophiine taxa. For each dataset, we estimated phylogenies using both gene-tree (concatenation) and species-tree (STAR, MP-EST) approaches. We determined the similarity of resulting tree topologies from the different datasets using Robinson-Foulds distances. 
In addition, we examined how subsets of these data performed compared to the complete Sanger and anchored datasets for phylogenetic accuracy using the same tree inference methodologies, as well as the program *BEAST to determine if a full coalescent model for species tree estimation could generate robust results with fewer loci compared to the summary statistics species tree approaches. We also examined the individual gene trees in comparison to the 377-locus species tree using the program MetaTree. Using the full anchored dataset under a variety of methods gave us the same, well-supported phylogeny for pseudoxyrhophiines. The African pseudoxyrhophiine Duberria is the sister taxon to the Malagasy pseudoxyrhophiine genera, providing evidence for a monophyletic radiation in Madagascar. In addition, within Madagascar, the two major clades inferred correspond largely to the aglyphous and opisthoglyphous genera, suggesting that feeding specializations associated with tooth venom delivery may have played a major role in the early diversification of this radiation. The comparison of tree topologies from the concatenated and species-tree methods using different datasets indicated the 5-locus dataset cannot be used to infer a correct phylogeny for the pseudoxyrhophiines under any method tested here and that summary statistics methods require 50 or more loci to consistently recover the species tree inferred using the complete anchored dataset. However, as few as 15 loci may infer the correct topology when using the full coalescent species tree method *BEAST. MetaTree analyses of each gene tree from the Sanger and anchored datasets found that none of the individual gene trees matched the 377-locus species tree, and that no gene trees were identical with respect to topology. 
Our results suggest that ≥50 loci may be necessary to confidently infer phylogenies when using summary species-tree methods, but that the coalescent-based method *BEAST consistently recovers the same topology using only 15 loci. These results reinforce that datasets with small numbers of markers may result in misleading topologies, and further, that the method of inference used to generate a phylogeny also has a major influence on the number of loci necessary to infer robust species trees.
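The Robinson-Foulds distances used above to compare topologies count the groupings found in one tree but not the other. A sketch on nested-tuple trees, using a rooted clade-set simplification of the usual unrooted bipartition version; helper names are illustrative:

```python
def clades(tree, out=None):
    """Collect the leaf set under every internal node of a nested-tuple
    tree; leaves are strings, internal nodes are tuples of children."""
    if out is None:
        out = []
    if isinstance(tree, str):
        return frozenset([tree]), out
    leaves = frozenset()
    for child in tree:
        sub, _ = clades(child, out)
        leaves |= sub
    out.append(leaves)  # record this internal node's clade
    return leaves, out

def rf_distance(t1, t2):
    """Symmetric-difference distance between the two trees' clade sets."""
    _, c1 = clades(t1)
    _, c2 = clades(t2)
    return len(set(c1) ^ set(c2))
```

Identical topologies give distance 0; trees that disagree on every internal grouping give the maximum value.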

  15. A transversal approach for patch-based label fusion via matrix completion

    PubMed Central

    Sanroma, Gerard; Wu, Guorong; Gao, Yaozong; Thung, Kim-Han; Guo, Yanrong; Shen, Dinggang

    2015-01-01

    Recently, multi-atlas patch-based label fusion has received increasing interest in the medical image segmentation field. After warping the anatomical labels from the atlas images to the target image by registration, label fusion is the key step to determine the latent label for each target image point. Two popular types of patch-based label fusion approaches are (1) reconstruction-based approaches that compute the target labels as a weighted average of atlas labels, where the weights are derived by reconstructing the target image patch using the atlas image patches; and (2) classification-based approaches that determine the target label as a mapping of the target image patch, where the mapping function is often learned using the atlas image patches and their corresponding labels. Both approaches have their advantages and limitations. In this paper, we propose a novel patch-based label fusion method that combines the above two types of approaches via matrix completion (and hence, we call it transversal). As we will show, our method overcomes the individual limitations of both reconstruction-based and classification-based approaches. Since the labeling confidences may vary across the target image points, we further propose a sequential labeling framework that first labels the highly confident points and then gradually labels more challenging points in an iterative manner, guided by the label information determined in the previous iterations. We demonstrate the performance of our novel label fusion method in segmenting the hippocampus in the ADNI dataset, subcortical and limbic structures in the LONI dataset, and mid-brain structures in the SATA dataset. We achieve more accurate segmentation results than both reconstruction-based and classification-based approaches. Our label fusion method was also ranked first in the online SATA Multi-Atlas Segmentation Challenge. PMID:26160394
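For context, the reconstruction-based baseline described above can be sketched as similarity-weighted voting over atlas patches. This is not the paper's matrix-completion method itself; the Gaussian weighting and helper names are illustrative:

```python
import math

def patch_label_fusion(target_patch, atlas_patches, atlas_labels, sigma=1.0):
    """Weighted voting: each atlas patch votes for its label with a weight
    that decays with its squared distance to the target patch, and the
    label with the largest total weight wins."""
    votes = {}
    for patch, label in zip(atlas_patches, atlas_labels):
        d2 = sum((a - b) ** 2 for a, b in zip(target_patch, patch))
        weight = math.exp(-d2 / (2 * sigma ** 2))
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)
```

A target patch resembling the foreground atlas patches receives the foreground label even when background patches outnumber them, because their votes carry almost no weight.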

  16. Historical gridded reconstruction of potential evapotranspiration for the UK

    NASA Astrophysics Data System (ADS)

    Tanguy, Maliko; Prudhomme, Christel; Smith, Katie; Hannaford, Jamie

    2018-06-01

    Potential evapotranspiration (PET) is a necessary input for most hydrological models and is often needed at a daily time step. An accurate estimation of PET requires many input climate variables which are, in most cases, not available prior to the 1960s for the UK, nor indeed for most parts of the world. Therefore, when applying hydrological models to earlier periods, modellers have to rely on PET estimations derived from simplified methods. Given that only monthly observed temperature data are readily available for the late 19th and early 20th century at a national scale for the UK, the objective of this work was to derive the best possible UK-wide gridded PET dataset from the limited data available. To that end, a combination of (i) seven temperature-based PET equations, (ii) four different calibration approaches and (iii) seven input temperature datasets was evaluated. For this evaluation, a gridded daily PET product based on the physically based Penman-Monteith equation (the CHESS PET dataset) was used, the rationale being that this provides a reliable ground-truth PET dataset for evaluation purposes, given that no directly observed, distributed PET datasets exist. The performance of the models was also compared to a naïve method, defined as the simplest possible estimation of PET in the absence of any available climate data. The naïve method used in this study is the CHESS PET daily long-term average (the period from 1961 to 1990 was chosen), or CHESS-PET daily climatology. The analysis revealed that the type of calibration and the input temperature dataset had only a minor effect on the accuracy of the PET estimations at catchment scale. Of the seven equations tested, only the calibrated version of the McGuinness-Bordne equation was able to outperform the naïve method and was therefore used to derive the gridded, reconstructed dataset. 
The equation was calibrated using 43 catchments across Great Britain. The dataset produced is a 5 km gridded PET dataset for the period 1891 to 2015, using the Met Office 5 km monthly gridded temperature data available for that time period as input to the PET equation. The dataset includes daily and monthly PET grids and is complemented with a suite of mapped performance metrics to help users assess the quality of the data spatially. This dataset is expected to be particularly valuable as input to hydrological models for any catchment in the UK. The data can be accessed at https://doi.org/10.5285/17b9c4f7-1c30-4b6f-b2fe-f7780159939c.
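The McGuinness-Bordne equation referred to above is a simple temperature-based formula. A sketch using the conventional (uncalibrated) parameter values, not the UK-calibrated ones derived in the study:

```python
def mcguinness_bordne_pet(ra_mj, temp_c, a=68.0, b=5.0, lam=2.45):
    """McGuinness-Bordne daily PET (mm/day) from extraterrestrial radiation
    Ra (MJ m-2 day-1) and mean air temperature (degrees C).

    lam is the latent heat of vaporisation (MJ kg-1); a and b are the two
    parameters the study calibrates against CHESS PET, given here with
    their conventional default values."""
    return (ra_mj / lam) * (temp_c + b) / a
```

Because only temperature varies day to day (Ra depends just on latitude and day of year), the formula suits historical periods where temperature is the only observed variable.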

  17. A Metadata-Based Approach for Analyzing UAV Datasets for Photogrammetric Applications

    NASA Astrophysics Data System (ADS)

    Dhanda, A.; Remondino, F.; Santana Quintero, M.

    2018-05-01

    This paper proposes a methodology for pre-processing and analysing Unmanned Aerial Vehicle (UAV) datasets before photogrammetric processing. In cases where images are gathered without a detailed flight plan and at regular acquisition intervals, the datasets can be quite large and time-consuming to process. Specifically, this paper proposes a method to calculate the image overlap and filter out images to reduce large block sizes and speed up photogrammetric processing. The Python-based algorithm that implements this methodology leverages the metadata in each image to determine the end and side overlap of grid-based UAV flights. Utilizing user input, the algorithm filters out images that are unneeded for photogrammetric processing. The result is an algorithm that can speed up photogrammetric processing and provide valuable information to the user about the flight path.
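The overlap computation can be sketched from first principles: a pinhole-camera footprint and the distance flown between exposures give the forward (end) overlap. The actual algorithm reads these quantities from image metadata (EXIF GPS and camera tags); the helpers below are illustrative:

```python
def ground_footprint(alt_m, focal_mm, sensor_mm):
    """Ground coverage (m) of one image dimension for a nadir photo,
    from the pinhole relation footprint = altitude * sensor / focal."""
    return alt_m * sensor_mm / focal_mm

def end_overlap(footprint_m, baseline_m):
    """Fractional forward overlap between consecutive exposures separated
    by baseline_m along the flight line (0 when images do not overlap)."""
    return max(0.0, 1.0 - baseline_m / footprint_m)
```

Images whose computed overlap with their kept neighbours exceeds the user's target can then be filtered out to shrink the photogrammetric block.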

  18. A Unified Probabilistic Framework for Dose–Response Assessment of Human Health Effects

    PubMed Central

    Slob, Wout

    2015-01-01

    Background When chemical health hazards have been identified, probabilistic dose–response assessment (“hazard characterization”) quantifies uncertainty and/or variability in toxicity as a function of human exposure. Existing probabilistic approaches differ for different types of endpoints or modes-of-action, lacking a unifying framework. Objectives We developed a unified framework for probabilistic dose–response assessment. Methods We established a framework based on four principles: a) individual and population dose responses are distinct; b) dose–response relationships for all (including quantal) endpoints can be recast as relating to an underlying continuous measure of response at the individual level; c) for effects relevant to humans, “effect metrics” can be specified to define “toxicologically equivalent” sizes for this underlying individual response; and d) dose–response assessment requires making adjustments and accounting for uncertainty and variability. We then derived a step-by-step probabilistic approach for dose–response assessment of animal toxicology data similar to how nonprobabilistic reference doses are derived, illustrating the approach with example non-cancer and cancer datasets. Results Probabilistically derived exposure limits are based on estimating a “target human dose” (HDMI), which requires risk management–informed choices for the magnitude (M) of individual effect being protected against, the remaining incidence (I) of individuals with effects ≥ M in the population, and the percent confidence. In the example datasets, probabilistically derived 90% confidence intervals for HDMI values span a 40- to 60-fold range, where I = 1% of the population experiences ≥ M = 1%–10% effect sizes. Conclusions Although some implementation challenges remain, this unified probabilistic framework can provide substantially more complete and transparent characterization of chemical hazards and support better-informed risk management decisions. 
Citation Chiu WA, Slob W. 2015. A unified probabilistic framework for dose–response assessment of human health effects. Environ Health Perspect 123:1241–1254; http://dx.doi.org/10.1289/ehp.1409385 PMID:26006063

  19. A Modified Active Appearance Model Based on an Adaptive Artificial Bee Colony

    PubMed Central

    Othman, Zulaiha Ali

    2014-01-01

    Active appearance model (AAM) is one of the most popular model-based approaches and has been extensively used to extract features by highly accurate modeling of human faces under various physical and environmental circumstances. However, in such an active appearance model, fitting the model to the original image is a challenging task. The state of the art shows that optimization methods are applicable to resolve this problem; however, applying optimization brings its own difficulties. Hence, in this paper we propose an AAM-based face recognition technique which is capable of resolving the fitting problem of AAM by introducing a new adaptive artificial bee colony (ABC) algorithm. The adaptation increases the efficiency of fitting compared with the conventional ABC algorithm. We have used three datasets in our experiments: the CASIA dataset, the property 2.5D face dataset, and the UBIRIS v1 image dataset. The results reveal that the proposed face recognition technique performs effectively in terms of recognition accuracy. PMID:25165748

  20. Analysing and correcting the differences between multi-source and multi-scale spatial remote sensing observations.

    PubMed

    Dong, Yingying; Luo, Ruisen; Feng, Haikuan; Wang, Jihua; Zhao, Jinling; Zhu, Yining; Yang, Guijun

    2014-01-01

    Differences exist among the analysis results of agricultural monitoring and crop production based on remote sensing observations that are obtained at different spatial scales from multiple remote sensors over the same time period and processed by the same algorithms, models or methods. These differences can be quantitatively described mainly from three aspects, i.e. the multiple remote sensing observations, the crop parameter estimation models, and the spatial scale effects of surface parameters. Our research proposes a new method to analyse and correct the differences between multi-source and multi-scale spatial remote sensing surface reflectance datasets, aiming to provide a reference for further studies in agricultural applications with multiple remotely sensed observations from different sources. The new method was constructed on the basis of the physical and mathematical properties of multi-source and multi-scale reflectance datasets. Statistical theory was used to extract the statistical characteristics of the multiple surface reflectance datasets and to quantitatively analyse the spatial variations of these characteristics at multiple spatial scales. Then, taking the surface reflectance at the small spatial scale as the baseline data, Gaussian distribution theory was applied to correct the multiple surface reflectance datasets based on the obtained physical characteristics, mathematical distribution properties and their spatial variations. The proposed method was verified using two sets of multiple satellite images, acquired over two experimental fields located in Inner Mongolia and Beijing, China, with different degrees of homogeneity of the underlying surfaces. 
Experimental results indicate that differences between surface reflectance datasets at multiple spatial scales can be effectively corrected over non-homogeneous underlying surfaces, providing a basis for further multi-source and multi-scale crop growth monitoring and yield prediction, and for the corresponding consistency evaluation.
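The Gaussian-based correction can be illustrated as moment matching: rescale one reflectance dataset to the mean and variance of the baseline (small-scale) dataset. This is a minimal sketch of the idea, not the paper's full multi-scale procedure:

```python
import statistics

def gaussian_match(values, baseline):
    """Adjust one reflectance dataset to the mean and standard deviation
    of a baseline dataset (moment matching under a Gaussian assumption)."""
    mu_v, sd_v = statistics.mean(values), statistics.pstdev(values)
    mu_b, sd_b = statistics.mean(baseline), statistics.pstdev(baseline)
    # Standardise against the source moments, then re-scale to the baseline
    return [(x - mu_v) / sd_v * sd_b + mu_b for x in values]
```

After correction, the dataset shares the baseline's first two moments, so scale-induced offsets in mean reflectance are removed while relative spatial variation is preserved.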

Top