Science.gov

Sample records for gene set statistics

  1. Self-Contained Statistical Analysis of Gene Sets

    PubMed Central

    Cannon, Judy L.; Ricoy, Ulises M.; Johnson, Christopher

    2016-01-01

    Microarrays are a powerful tool for studying differential gene expression. However, lists of many differentially expressed genes are often generated, and unraveling meaningful biological processes from the lists can be challenging. For this reason, investigators have sought to quantify the statistical probability of compiled gene sets rather than individual genes. The gene sets typically are organized around a biological theme or pathway. We compute correlations between different gene set tests and elect to use Fisher’s self-contained method for gene set analysis. We improve Fisher’s differential expression analysis of a gene set by limiting the p-value of an individual gene within the gene set to prevent a small percentage of genes from determining the statistical significance of the entire set. In addition, we also compute dependencies among genes within the set to determine which genes are statistically linked. The method is applied to T-ALL (T-lineage Acute Lymphoblastic Leukemia) to identify differentially expressed gene sets between T-ALL and normal patients and T-ALL and AML (Acute Myeloid Leukemia) patients. PMID:27711232
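
    A minimal sketch of the idea in the abstract above, not the authors' implementation: Fisher's method combines per-gene p-values into one set-level p-value, and a lower bound on each p-value keeps one extreme gene from deciding the result. The `floor` parameter and its value of 0.01 are assumptions made for illustration; the abstract does not state the limit used. Once p-values are clamped, the chi-square reference distribution is only an approximation (it errs on the conservative side).

```python
import math


def fisher_combined_p(pvalues, floor=None):
    """Combine per-gene p-values (each > 0) with Fisher's method.

    floor (hypothetical parameter): if given, every p-value below it is
    raised to it, so no single gene can dominate the set-level statistic.
    """
    if floor is not None:
        pvalues = [max(p, floor) for p in pvalues]
    k = len(pvalues)
    # Fisher's statistic is X = -2 * sum(ln p_i); we keep X / 2.
    half_stat = -sum(math.log(p) for p in pvalues)
    # The chi-square survival function with 2k degrees of freedom has the
    # closed form exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!.
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


pvals = [1e-12, 0.5, 0.6, 0.7]
print(fisher_combined_p(pvals))              # one extreme gene drives the result
print(fisher_combined_p(pvals, floor=0.01))  # same set, extreme gene capped at 0.01
```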

  2. GeneSetDB: A comprehensive meta-database, statistical and visualisation framework for gene set analysis

    PubMed Central

    Araki, Hiromitsu; Knapp, Christoph; Tsai, Peter; Print, Cristin

    2012-01-01

    Most “omics” experiments require comprehensive interpretation of the biological meaning of gene lists. To address this requirement, a number of gene set analysis (GSA) tools have been developed. Although the biological value of GSA is strictly limited by the breadth of the gene sets used, very few methods exist for simultaneously analysing multiple publicly available gene set databases. Therefore, we constructed GeneSetDB (http://genesetdb.auckland.ac.nz/haeremai.html), a comprehensive meta-database, which integrates 26 public databases containing diverse biological information with a particular focus on human disease and pharmacology. GeneSetDB enables users to search for gene sets containing a gene identifier or keyword, generate their own gene sets, or statistically test for enrichment of an uploaded gene list across all gene sets, and visualise gene set enrichment and overlap using a clustered heat map. PMID:23650583
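
    The abstract does not say which enrichment test GeneSetDB applies. As an illustration only, the sketch below uses the one-sided hypergeometric test, a common choice for asking whether an uploaded gene list overlaps a gene set more than chance would predict. The function name and all argument names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, list_size, overlap):
    """P(overlap or more shared genes) when list_size genes are drawn at
    random, without replacement, from a universe containing set_size
    genes that belong to the gene set."""
    max_overlap = min(set_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Example: 20000 genes in total, a gene set of 100, an uploaded list of 200,
# and 5 genes shared between the list and the set.
print(hypergeom_enrichment_p(20000, 100, 200, 5))
```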

  3. GeneSetDB: A comprehensive meta-database, statistical and visualisation framework for gene set analysis.

    PubMed

    Araki, Hiromitsu; Knapp, Christoph; Tsai, Peter; Print, Cristin

    2012-01-01

    Most "omics" experiments require comprehensive interpretation of the biological meaning of gene lists. To address this requirement, a number of gene set analysis (GSA) tools have been developed. Although the biological value of GSA is strictly limited by the breadth of the gene sets used, very few methods exist for simultaneously analysing multiple publicly available gene set databases. Therefore, we constructed GeneSetDB (http://genesetdb.auckland.ac.nz/haeremai.html), a comprehensive meta-database, which integrates 26 public databases containing diverse biological information with a particular focus on human disease and pharmacology. GeneSetDB enables users to search for gene sets containing a gene identifier or keyword, generate their own gene sets, or statistically test for enrichment of an uploaded gene list across all gene sets, and visualise gene set enrichment and overlap using a clustered heat map.

  4. Effect of the absolute statistic on gene-sampling gene-set analysis methods.

    PubMed

    Nam, Dougu

    2015-03-02

    Gene-set enrichment analysis and its modified versions have commonly been used for identifying altered functions or pathways in disease from microarray data. In particular, the simple gene-sampling gene-set analysis methods have been heavily used for datasets with only a few sample replicates. The biggest problem with this approach is the highly inflated false-positive rate. In this paper, the effect of the absolute gene statistic on gene-sampling gene-set analysis methods is systematically investigated. Thus far, the absolute gene statistic has merely been regarded as a supplementary method for capturing the bidirectional changes in each gene set. Here, it is shown that incorporating the absolute gene statistic in gene-sampling gene-set analysis substantially reduces the false-positive rate and improves the overall discriminatory ability. Its effect was investigated by power, false-positive rate, and receiver operating characteristic curve for a number of simulated and real datasets. The performances of gene-set analysis methods in one-tailed (genome-wide association study) and two-tailed (gene expression data) tests were also compared and discussed.
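
    To make the terms above concrete, here is a minimal sketch of a gene-sampling test with a signed or an absolute gene statistic. It assumes the set score is the mean of the per-gene statistics and that the null distribution comes from random gene sets of the same size; the paper's exact statistics are not given in the abstract, and every name below is hypothetical.

```python
import random


def set_score(stats, genes, absolute=False):
    """Mean gene statistic over a set, optionally using absolute values."""
    values = [abs(stats[g]) if absolute else stats[g] for g in genes]
    return sum(values) / len(values)


def gene_sampling_p(stats, genes, n_samples=1000, absolute=True, seed=0):
    """Upper-tail p-value from random gene sets of the same size."""
    rng = random.Random(seed)
    all_genes = sorted(stats)
    observed = set_score(stats, genes, absolute)
    hits = sum(
        set_score(stats, rng.sample(all_genes, len(genes)), absolute) >= observed
        for _ in range(n_samples)
    )
    return (hits + 1) / (n_samples + 1)


# Toy data: one up-regulated and one down-regulated gene in the same set.
stats = {f"g{i}": 0.1 for i in range(20)}
stats["g0"], stats["g1"] = 3.0, -3.0
gene_set = ["g0", "g1"]
print(set_score(stats, gene_set))                 # signed changes cancel out
print(set_score(stats, gene_set, absolute=True))  # absolute changes add up
print(gene_sampling_p(stats, gene_set, absolute=True))
```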

  5. XGSA: A statistical method for cross-species gene set analysis.

    PubMed

    Djordjevic, Djordje; Kusumi, Kenro; Ho, Joshua W K

    2016-09-01

    Gene set analysis is a powerful tool for determining whether an experimentally derived set of genes is statistically significantly enriched for genes in other pre-defined gene sets, such as known pathways, gene ontology terms, or other experimentally derived gene sets. Current gene set analysis methods do not facilitate comparing gene sets across different organisms as they do not explicitly deal with homology mapping between species. A systematic investigation of the effect of complex gene homology on cross-species gene set analysis is lacking. In this study, we show that not accounting for the complex homology structure when comparing gene sets in two species can lead to false positive discoveries, especially when comparing gene sets that have complex gene homology relationships. To overcome this bias, we propose a straightforward statistical approach, called XGSA, that explicitly takes the cross-species homology mapping into consideration when doing gene set analysis. Simulation experiments confirm that XGSA can avoid false positive discoveries, while maintaining good statistical power compared to other ad hoc approaches for cross-species gene set analysis. We further demonstrate the effectiveness of XGSA with two real-life case studies that aim to discover conserved or species-specific molecular pathways involved in social challenge and vertebrate appendage regeneration. The R source code for XGSA is available under a GNU General Public License at http://github.com/VCCRI/XGSA (contact: jho@victorchang.edu.au). © The Author 2016. Published by Oxford University Press.

  6. FLAGS: A Flexible and Adaptive Association Test for Gene Sets Using Summary Statistics

    PubMed Central

    Huang, Jianfei; Wang, Kai; Wei, Peng; Liu, Xiangtao; Liu, Xiaoming; Tan, Kai; Boerwinkle, Eric; Potash, James B.; Han, Shizhong

    2016-01-01

    Genome-wide association studies (GWAS) have been widely used for identifying common variants associated with complex diseases. Despite remarkable success in uncovering many risk variants and providing novel insights into disease biology, genetic variants identified to date fail to explain the vast majority of the heritability for most complex diseases. One explanation is that there are still a large number of common variants that remain to be discovered, but their effect sizes are generally too small to be detected individually. Accordingly, gene set analysis of GWAS, which examines a group of functionally related genes, has been proposed as a complementary approach to single-marker analysis. Here, we propose a flexible and adaptive test for gene sets (FLAGS), using summary statistics. Extensive simulations showed that this method has an appropriate type I error rate and outperforms existing methods with increased power. As a proof of principle, through real data analyses of Crohn’s disease GWAS data and bipolar disorder GWAS meta-analysis results, we demonstrated the superior performance of FLAGS over several state-of-the-art association tests for gene sets. Our method allows for the more powerful application of gene set analysis to complex diseases, which will have broad use given that GWAS summary results are increasingly publicly available. PMID:26773050

  7. Spectral gene set enrichment (SGSE).

    PubMed

    Frost, H Robert; Li, Zhigang; Moore, Jason H

    2015-03-03

    Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters. We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and samples PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
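
    The combination step named above, the weighted Z-method, can be sketched as follows. The weights are taken as given here; computing them from PC variances and Tracy-Widom p-values, as the abstract describes, is not attempted. The function name is hypothetical and this is not the authors' code.

```python
from statistics import NormalDist


def weighted_z_combine(pvalues, weights):
    """Combine one-sided p-values (each strictly between 0 and 1) with the
    weighted Z-method: Z = sum(w_i * z_i) / sqrt(sum(w_i ** 2))."""
    std_normal = NormalDist()
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in pvalues]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sum(w * w for w in weights) ** 0.5
    return 1.0 - std_normal.cdf(numerator / denominator)


# Three PC-level p-values; the first PC carries the largest (assumed) weight.
print(weighted_z_combine([0.01, 0.20, 0.70], [5.0, 2.0, 0.5]))
```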

  8. A statistical approach towards the derivation of predictive gene sets for potency ranking of chemicals in the mouse embryonic stem cell test.

    PubMed

    Schulpen, Sjors H W; Pennings, Jeroen L A; Tonk, Elisa C M; Piersma, Aldert H

    2014-03-21

    The embryonic stem cell test (EST) is applied as a model system for detection of embryotoxicants. The application of transcriptomics allows a more detailed effect assessment compared to the morphological endpoint. Genes involved in cell differentiation, modulated by chemical exposures, may be useful as biomarkers of developmental toxicity. We describe a statistical approach to obtain a predictive gene set for toxicity potency ranking of compounds within one class. This resulted in a gene set based on differential gene expression across concentration-response series of phthalatic monoesters. We determined the concentration at which gene expression was changed at least 1.5-fold. Genes responding with the same potency ranking in vitro and in vivo embryotoxicity were selected. A leave-one-out cross-validation showed that the relative potency of each phthalate was always predicted correctly. The classical morphological 50% effect level (ID50) in EST was similar to the predicted concentration using gene set expression responses. A general down-regulation of development-related genes and up-regulation of cell-cycle related genes was observed, reminiscent of the differentiation inhibition in EST. This study illustrates the feasibility of applying dedicated gene set selections as biomarkers for developmental toxicity potency ranking on the basis of in vitro testing in the EST.
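
    As a rough illustration of the 1.5-fold criterion mentioned above, the sketch below returns the lowest tested concentration at which a gene's expression ratio (treated over control) changes at least 1.5-fold in either direction. Whether the authors interpolated between tested concentrations is not stated, so this version only scans the tested points; all names are hypothetical.

```python
def lowest_effect_concentration(concentrations, fold_changes, threshold=1.5):
    """Lowest tested concentration whose expression ratio is at least
    `threshold`-fold up or down; None if no tested point qualifies."""
    for conc, ratio in sorted(zip(concentrations, fold_changes)):
        if ratio >= threshold or ratio <= 1.0 / threshold:
            return conc
    return None


# Ratios are treated / control, so 0.6 is a 1.67-fold down-regulation.
print(lowest_effect_concentration([1, 10, 100], [1.1, 1.6, 2.0]))
print(lowest_effect_concentration([100, 1, 10], [0.6, 1.0, 0.9]))
```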

  9. Moment based gene set tests.

    PubMed

    Larson, Jessica L; Owen, Art B

    2015-04-28

    Permutation-based gene set tests are standard approaches for testing relationships between collections of related genes and an outcome of interest in high throughput expression analyses. Using M random permutations, one can attain p-values as small as 1/(M+1). When many gene sets are tested, we need smaller p-values, hence larger M, to achieve significance while accounting for the number of simultaneous tests being made. As a result, the number of permutations to be done rises along with the cost per permutation. To reduce this cost, we seek parametric approximations to the permutation distributions for gene set tests. We study two gene set methods based on sums and sums of squared correlations. The statistics we study are among the best performers in the extensive simulation of 261 gene set methods by Ackermann and Strimmer in 2009. Our approach calculates exact relevant moments of these statistics and uses them to fit parametric distributions. The computational cost of our algorithm for the linear case is on the order of doing |G| permutations, where |G| is the number of genes in set G. For the quadratic statistics, the cost is on the order of |G|^2 permutations which can still be orders of magnitude faster than plain permutation sampling. We applied the permutation approximation method to three public Parkinson's Disease expression datasets and discovered enriched gene sets not previously discussed. We found that the moment-based gene set enrichment p-values closely approximate the permutation method p-values at a tiny fraction of their cost. They also gave nearly identical rankings to the gene sets being compared. We have developed a moment based approximation to linear and quadratic gene set test statistics' permutation distribution. This allows approximate testing to be done orders of magnitude faster than one could do by sampling permutations. We have implemented our method as a publicly available Bioconductor package, npGSEA (www.bioconductor.org).
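
    The 1/(M+1) bound quoted above follows from the usual permutation p-value, sketched below together with the number of permutations a Bonferroni-corrected threshold would require. This only illustrates the cost argument; it is not the moment-based method of the paper, and the function names are hypothetical.

```python
import math


def permutation_p(observed, null_stats):
    """Upper-tail permutation p-value; it can never fall below 1 / (M + 1)."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)


def min_permutations(alpha, n_tests):
    """Smallest M with 1 / (M + 1) <= alpha / n_tests (Bonferroni),
    up to floating-point rounding of n_tests / alpha."""
    return math.ceil(n_tests / alpha) - 1


# An observed statistic larger than all three permuted values.
print(permutation_p(10.0, [1.0, 2.0, 3.0]))
# Permutations needed per gene set when 1000 sets are tested at alpha = 0.05.
print(min_permutations(0.05, 1000))
```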

  10. Principal component gene set enrichment (PCGSE).

    PubMed

    Frost, H Robert; Li, Zhigang; Moore, Jason H

    2015-01-01

    Although principal component analysis (PCA) is widely used for the dimensional reduction of biomedical data, interpretation of PCA results remains daunting. Most existing interpretation methods attempt to explain each principal component (PC) in terms of a small number of variables by generating approximate PCs with mainly zero loadings. Although useful when just a few variables dominate the population PCs, these methods can perform poorly on genomic data, where interesting biological features are frequently represented by the combined signal of functionally related sets of genes. While gene set testing methods have been widely used in supervised settings to quantify the association of groups of genes with clinical outcomes, these methods have seen only limited application for testing the enrichment of gene sets relative to sample PCs. We describe a novel approach, principal component gene set enrichment (PCGSE), for unsupervised gene set testing relative to the sample PCs of genomic data. The PCGSE method computes the statistical association between gene sets and individual PCs using a two-stage competitive gene set test. To demonstrate the efficacy of the PCGSE method, we use simulated and real gene expression data to evaluate the performance of various gene set test statistics and significance tests. Gene set testing is an effective approach for interpreting the PCs of high-dimensional genomic data. As shown using both simulated and real datasets, the PCGSE method can generate biologically meaningful and computationally efficient results via a two-stage, competitive parametric test that correctly accounts for inter-gene correlation.

  11. Seeing sets: representation by statistical properties.

    PubMed

    Ariely, D

    2001-03-01

    Sets of similar objects are common occurrences--a crowd of people, a bunch of bananas, a copse of trees, a shelf of books, a line of cars. Each item in the set may be distinct, highly visible, and discriminable. But when we look away from the set, what information do we have? The current article starts to address this question by introducing the idea of a set representation. This idea was tested using two new paradigms: mean discrimination and member identification. Three experiments using sets of different-sized spots showed that observers know a set's mean quite accurately but know little about the individual items, except their range. Taken together, these results suggest that the visual system represents the overall statistical, and not individual, properties of sets.

  12. Probabilities for separating sets of order statistics.

    PubMed

    Glueck, D H; Karimpour-Fard, A; Mandel, J; Muller, K E

    2010-04-01

    Consider a set of order statistics that arise from sorting samples from two different populations, each with its own, possibly different, distribution function. The probability that these order statistics fall in disjoint, ordered intervals and that, of the smallest statistics, a certain number come from the first population is given in terms of the two distribution functions. The result is applied to computing the joint probability of the number of rejections and the number of false rejections for the Benjamini-Hochberg false discovery rate procedure.
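
    For readers unfamiliar with the procedure named above, here is a minimal sketch of the Benjamini-Hochberg step-up rule. It shows only which hypotheses are rejected; the joint probabilities derived in the paper are not computed. The function name is hypothetical.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    by the Benjamini-Hochberg procedure at false discovery rate q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= k * q / m.
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below the cutoff.
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```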

  13. Statistical mechanics of maximal independent sets

    NASA Astrophysics Data System (ADS)

    Dall'Asta, Luca; Pin, Paolo; Ramezanpour, Abolfazl

    2009-12-01

    The graph theoretic concept of maximal independent set arises in several practical problems in computer science as well as in game theory. A maximal independent set is defined by the set of occupied nodes that satisfy some packing and covering constraints. It is known that finding minimum and maximum-density maximal independent sets are hard optimization problems. In this paper, we use cavity method of statistical physics and Monte Carlo simulations to study the corresponding constraint satisfaction problem on random graphs. We obtain the entropy of maximal independent sets within the replica symmetric and one-step replica symmetry breaking frameworks, shedding light on the metric structure of the landscape of solutions and suggesting a class of possible algorithms. This is of particular relevance for the application to the study of strategic interactions in social and economic networks, where maximal independent sets correspond to pure Nash equilibria of a graphical game of public goods allocation.
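
    The packing and covering constraints mentioned above can be written down directly. The sketch below only checks whether a given set of occupied nodes is a maximal independent set; it does not implement the cavity method or the Monte Carlo simulations. The adjacency-dictionary representation is an assumption.

```python
def is_maximal_independent_set(adjacency, occupied):
    """adjacency maps each node to an iterable of its neighbours."""
    occupied = set(occupied)
    # Packing: no two occupied nodes are adjacent.
    packing = all(
        neighbour not in occupied
        for node in occupied
        for neighbour in adjacency[node]
    )
    # Covering: every empty node has at least one occupied neighbour.
    covering = all(
        any(neighbour in occupied for neighbour in adjacency[node])
        for node in adjacency
        if node not in occupied
    )
    return packing and covering


# Path graph 0 - 1 - 2.
path = {0: [1], 1: [0, 2], 2: [1]}
print(is_maximal_independent_set(path, {1}))
print(is_maximal_independent_set(path, {0}))
```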

  14. Gene set analysis using variance component tests.

    PubMed

    Huang, Yen-Tsung; Lin, Xihong

    2013-06-28

    Gene set analyses have become increasingly important in genomic research, as many complex diseases arise jointly from alterations of numerous genes. Genes often coordinate together as a functional repertoire, e.g., a biological pathway/network, and are highly correlated. However, most of the existing gene set analysis methods do not fully account for the correlation among the genes. Here we propose to tackle this important feature of a gene set to improve statistical power in gene set analyses. We propose to model the effects of an independent variable, e.g., exposure/biological status (yes/no), on multiple gene expression values in a gene set using a multivariate linear regression model, where the correlation among the genes is explicitly modeled using a working covariance matrix. We develop TEGS (Test for the Effect of a Gene Set), a variance component test for the gene set effects by assuming a common distribution for regression coefficients in multivariate linear regression models, and calculate the p-values using permutation and a scaled chi-square approximation. We show using simulations that type I error is protected under different choices of working covariance matrices and power is improved as the working covariance approaches the true covariance. The global test is a special case of TEGS when correlation among genes in a gene set is ignored. Using both simulation data and a published diabetes dataset, we show that our test outperforms the commonly used approaches, the global test and gene set enrichment analysis (GSEA). We develop a gene set analysis method (TEGS) under the multivariate regression framework, which directly models the interdependence of the expression values in a gene set using a working covariance. TEGS outperforms two widely used methods, GSEA and the global test, in both simulations and a diabetes microarray dataset.

  15. Gene set analysis using variance component tests

    PubMed Central

    2013-01-01

    Background: Gene set analyses have become increasingly important in genomic research, as many complex diseases arise jointly from alterations of numerous genes. Genes often coordinate together as a functional repertoire, e.g., a biological pathway/network, and are highly correlated. However, most of the existing gene set analysis methods do not fully account for the correlation among the genes. Here we propose to tackle this important feature of a gene set to improve statistical power in gene set analyses. Results: We propose to model the effects of an independent variable, e.g., exposure/biological status (yes/no), on multiple gene expression values in a gene set using a multivariate linear regression model, where the correlation among the genes is explicitly modeled using a working covariance matrix. We develop TEGS (Test for the Effect of a Gene Set), a variance component test for the gene set effects by assuming a common distribution for regression coefficients in multivariate linear regression models, and calculate the p-values using permutation and a scaled chi-square approximation. We show using simulations that type I error is protected under different choices of working covariance matrices and power is improved as the working covariance approaches the true covariance. The global test is a special case of TEGS when correlation among genes in a gene set is ignored. Using both simulation data and a published diabetes dataset, we show that our test outperforms the commonly used approaches, the global test and gene set enrichment analysis (GSEA). Conclusion: We develop a gene set analysis method (TEGS) under the multivariate regression framework, which directly models the interdependence of the expression values in a gene set using a working covariance. TEGS outperforms two widely used methods, GSEA and the global test, in both simulations and a diabetes microarray dataset. PMID:23806107

  16. Gene Cluster Statistics with Gene Families

    PubMed Central

    Durand, Dannie

    2009-01-01

    Identifying genomic regions that descended from a common ancestor is important for understanding the function and evolution of genomes. In distantly related genomes, clusters of homologous gene pairs are evidence of candidate homologous regions. Demonstrating the statistical significance of such “gene clusters” is an essential component of comparative genomic analyses. However, currently there are no practical statistical tests for gene clusters that model the influence of the number of homologs in each gene family on cluster significance. In this work, we demonstrate empirically that failure to incorporate gene family size in gene cluster statistics results in overestimation of significance, leading to incorrect conclusions. We further present novel analytical methods for estimating gene cluster significance that take gene family size into account. Our methods do not require complete genome data and are suitable for testing individual clusters found in local regions, such as contigs in an unfinished assembly. We consider pairs of regions drawn from the same genome (paralogous clusters), as well as regions drawn from two different genomes (orthologous clusters). Determining cluster significance under general models of gene family size is computationally intractable. By assuming that all gene families are of equal size, we obtain analytical expressions that allow fast approximation of cluster probabilities. We evaluate the accuracy of this approximation by comparing the resulting gene cluster probabilities with cluster probabilities obtained by simulating a realistic, power-law distributed model of gene family size, with parameters inferred from genomic data. Surprisingly, despite the simplicity of the underlying assumption, our method accurately approximates the true cluster probabilities. It slightly overestimates these probabilities, yielding a conservative test. We present additional simulation results indicating the best choice of parameter values for data

  17. Gene set analysis for longitudinal gene expression data

    PubMed Central

    2011-01-01

    Background: Gene set analysis (GSA) has become a successful tool to interpret gene expression profiles in terms of biological functions, molecular pathways, or genomic locations. GSA performs statistical tests for independent microarray samples at the level of gene sets rather than individual genes. Nowadays, an increasing number of microarray studies are conducted to explore the dynamic changes of gene expression in a variety of species and biological scenarios. In these longitudinal studies, gene expression is repeatedly measured over time such that a GSA needs to take into account the within-gene correlations in addition to possible between-gene correlations. Results: We provide a robust nonparametric approach to compare the expressions of longitudinally measured sets of genes under multiple treatments or experimental conditions. The limiting distributions of our statistics are derived when the number of genes goes to infinity while the number of replications can be small. When the number of genes in a gene set is small, we recommend permutation tests based on our nonparametric test statistics to achieve reliable type I error and better power while incorporating unknown correlations between and within-genes. Simulation results demonstrate that the proposed method has a greater power than other methods for various data distributions and heteroscedastic correlation structures. This method was used for an IL-2 stimulation study and significantly altered gene sets were identified. Conclusions: The simulation study and the real data application showed that the proposed gene set analysis provides a promising tool for longitudinal microarray analysis. R scripts for simulating longitudinal data and calculating the nonparametric statistics are posted on the North Dakota INBRE website http://ndinbre.org/programs/bioinformatics.php. Raw microarray data is available in Gene Expression Omnibus (National Center for Biotechnology Information) with accession number GSE6085.

  18. On asymptotically generalized statistical equivalent set sequences

    NASA Astrophysics Data System (ADS)

    Savas, Ekrem

    2013-10-01

    In this paper we shall study the asymptotically λ-statistical equivalent (Wijsman sense) of multiple L. In addition to these definitions, natural inclusion theorems shall also be presented. This approach has not been considered in any context before.

  19. Quantitative set analysis for gene expression: a method to quantify gene set differential expression including gene-gene correlations.

    PubMed

    Yaari, Gur; Bolen, Christopher R; Thakar, Juilee; Kleinstein, Steven H

    2013-10-01

    Enrichment analysis of gene sets is a popular approach that provides a functional interpretation of genome-wide expression data. Existing tests are affected by inter-gene correlations, resulting in a high Type I error. The most widely used test, Gene Set Enrichment Analysis, relies on computationally intensive permutations of sample labels to generate a null distribution that preserves gene-gene correlations. A more recent approach, CAMERA, attempts to correct for these correlations by estimating a variance inflation factor directly from the data. Although these methods generate P-values for detecting gene set activity, they are unable to produce confidence intervals or allow for post hoc comparisons. We have developed a new computational framework for Quantitative Set Analysis of Gene Expression (QuSAGE). QuSAGE accounts for inter-gene correlations, improves the estimation of the variance inflation factor and, rather than evaluating the deviation from a null hypothesis with a P-value, it quantifies gene-set activity with a complete probability density function. From this probability density function, P-values and confidence intervals can be extracted and post hoc analysis can be carried out while maintaining statistical traceability. Compared with Gene Set Enrichment Analysis and CAMERA, QuSAGE exhibits better sensitivity and specificity on real data profiling the response to interferon therapy (in chronic Hepatitis C virus patients) and Influenza A virus infection. QuSAGE is available as an R package, which includes the core functions for the method as well as functions to plot and visualize the results.
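
    The variance inflation factor mentioned above is commonly written, for a set of m genes with mean pairwise correlation rho, as VIF = 1 + (m - 1) * rho. The sketch below computes that quantity from raw expression rows. This is an assumption-laden illustration: CAMERA works with correlations estimated after model fitting, and QuSAGE's improved estimator is not reproduced here. All names are hypothetical.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mean_pairwise_correlation(rows):
    """Average Pearson correlation over all pairs of genes (rows)."""
    pairs = list(combinations(rows, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


def camera_style_vif(set_size, mean_correlation):
    """Variance inflation factor 1 + (m - 1) * rho for a set of m genes."""
    return 1.0 + (set_size - 1) * mean_correlation


# Three genes measured in four samples (toy values).
expression = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 5.9, 8.2], [4.0, 2.5, 3.5, 1.0]]
rho = mean_pairwise_correlation(expression)
print(camera_style_vif(len(expression), rho))
```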

  20. On asymptotically lacunary invariant statistical equivalent set sequences

    NASA Astrophysics Data System (ADS)

    Pancaroglu, Nimet; Nuray, Fatih; Savas, Ekrem

    2013-10-01

    In this paper, we define asymptotically invariant equivalence, strongly asymptotically invariant equivalence, asymptotically invariant statistical equivalence, asymptotically lacunary invariant statistical equivalence, strongly asymptotically lacunary invariant equivalence, asymptotically lacunary invariant equivalence (Wijsman sense) for sequences of sets. Also we investigate some relations between asymptotically lacunary invariant statistical equivalence and asymptotically invariant statistical equivalence for sequences of sets. We introduce some notions and theorems as follows, asymptotically lacunary invariant statistical equivalence, strongly asymptotically lacunary invariant equivalence, asymptotically lacunary invariant equivalence (Wijsman sense) for sequences of sets.

  1. GSAR: Bioconductor package for Gene Set analysis in R.

    PubMed

    Rahmatallah, Yasir; Zybailov, Boris; Emmert-Streib, Frank; Glazko, Galina

    2017-01-24

    Gene set analysis (in the form of functionally related genes or pathways) has become the method of choice for analyzing omics data in general and gene expression data in particular. There are many statistical methods that either summarize gene-level statistics for a gene set or apply a multivariate statistic that accounts for intergene correlations. Most available methods detect complex departures from the null hypothesis but lack the ability to identify the specific alternative hypothesis that rejects the null. GSAR (Gene Set Analysis in R) is an open-source R/Bioconductor software package for gene set analysis (GSA). It implements self-contained multivariate non-parametric statistical methods testing a complex null hypothesis against specific alternatives, such as differences in mean (shift), variance (scale), or net correlation structure. The package also provides a graphical visualization tool, based on the union of two minimum spanning trees, for correlation networks to examine the change in the correlation structures of a gene set between two conditions and highlight influential genes (hubs). Package GSAR provides a set of multivariate non-parametric statistical methods that test a complex null hypothesis against specific alternatives. The methods in package GSAR are applicable to any type of omics data that can be represented in a matrix format. The package, with detailed instructions and examples, is freely available under the GPL (>= 2) license from the Bioconductor web site.

  2. MAGMA: Generalized Gene-Set Analysis of GWAS Data

    PubMed Central

    de Leeuw, Christiaan A.; Mooij, Joris M.; Heskes, Tom; Posthuma, Danielle

    2015-01-01

    By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn’s Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn’s Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn’s Disease data was found to be considerably faster as well. PMID:25885710

  3. MAGMA: generalized gene-set analysis of GWAS data.

    PubMed

    de Leeuw, Christiaan A; Mooij, Joris M; Heskes, Tom; Posthuma, Danielle

    2015-04-01

    By aggregating data for complex traits in a biologically meaningful way, gene and gene-set analysis constitute a valuable addition to single-marker analysis. However, although various methods for gene and gene-set analysis currently exist, they generally suffer from a number of issues. Statistical power for most methods is strongly affected by linkage disequilibrium between markers, multi-marker associations are often hard to detect, and the reliance on permutation to compute p-values tends to make the analysis computationally very expensive. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis. The gene analysis is based on a multiple regression model, to provide better statistical performance. The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. Simulations and an analysis of Crohn's Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. The results show that MAGMA has significantly more power than other tools for both the gene and the gene-set analysis, identifying more genes and gene sets associated with Crohn's Disease while maintaining a correct type 1 error rate. Moreover, the MAGMA analysis of the Crohn's Disease data was found to be considerably faster as well.

  4. Unsupervised gene set testing based on random matrix theory.

    PubMed

    Frost, H Robert; Amos, Christopher I

    2016-11-04

    Gene set testing, or pathway analysis, is a bioinformatics technique that performs statistical testing on biologically meaningful sets of genomic variables. Although originally developed for supervised analyses, i.e., to test the association between gene sets and an outcome variable, gene set testing also has important unsupervised applications, e.g., p-value weighting. For unsupervised testing, however, few effective gene set testing methods are available, with support especially poor for several biologically relevant use cases. In this paper, we describe two new unsupervised gene set testing methods based on random matrix theory, the Marčenko-Pastur Distribution Test (MPDT) and the Tracy-Widom Test (TWT), that support both self-contained and competitive null hypotheses. For the self-contained case, we contrast our proposed tests with the classic multivariate test based on a modified likelihood ratio criterion. For the competitive case, we compare the new tests against a competitive version of the classic test and our recently developed Spectral Gene Set Enrichment (SGSE) method. Evaluation of the TWT and MPDT methods is based on both simulation studies and a weighted p-value analysis of two real gene expression data sets using gene sets drawn from MSigDB collections. The MPDT and TWT methods are novel and effective tools for unsupervised gene set analysis with superior statistical performance relative to existing techniques and the ability to generate biologically important results on real genomic data sets.
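
    For orientation, the Marčenko-Pastur law referred to in the record above describes where the eigenvalues of a sample covariance matrix are expected to fall when the data are pure noise. The following is a minimal sketch of the support edges only, not the authors' MPDT or TWT procedure; the function name and the unit-variance default are illustrative assumptions.

```python
from math import sqrt


def marchenko_pastur_bounds(n_samples, n_genes, variance=1.0):
    """Edges of the Marchenko-Pastur bulk for a noise-only covariance matrix.

    gamma is the ratio of variables (genes in the set) to samples.
    Eigenvalues well above the upper edge hint at structure beyond noise.
    """
    gamma = n_genes / n_samples
    lower = variance * (1.0 - sqrt(gamma)) ** 2
    upper = variance * (1.0 + sqrt(gamma)) ** 2
    return lower, upper


# Hypothetical gene set of 100 genes measured on 400 samples.
lower_edge, upper_edge = marchenko_pastur_bounds(n_samples=400, n_genes=100)
print(lower_edge, upper_edge)
```

    In this hypothetical setting, a largest sample eigenvalue far above the upper edge would be the kind of departure that an eigenvalue-based test looks for.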

  5. Phantom: investigating heterogeneous gene sets in time-course data.

    PubMed

    Gu, Jinghua; Wang, Xuan; Chan, Jinyan; Baldwin, Nicole E; Turner, Jacob A

    2017-09-15

    Gene set analysis is a powerful tool to study the coordinated change of time-course data. However, most existing methods only model the overall change of a gene set and completely overlook heterogeneous time-dependent changes within subsets of genes. We have developed a novel statistical method, Phantom, to investigate gene set heterogeneity. Phantom employs the principle of multi-objective optimization to assess the heterogeneity inside a gene set, which also accounts for the temporal dependency in time-course data. Phantom improves the performance of gene set based methods to detect biological changes across time. The Phantom webpage can be accessed at http://www.baylorhealth.edu/Phantom. An R package of Phantom is available at https://cran.r-project.org/web/packages/phantom/index.html. Contact: jinghua.gu@bswhealth.org. Supplementary data are available at Bioinformatics online.

  6. Down-weighting overlapping genes improves gene set analysis.

    PubMed

    Tarca, Adi Laurentiu; Draghici, Sorin; Bhatti, Gaurav; Romero, Roberto

    2012-06-19

    The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless of how specific they are to a given gene set. In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method Pathway Analysis with Down-weighting of Overlapping Genes (PADOG). Unlike most gene set analysis methods, which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results. PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at: http://bioinformaticsprb.med.wayne.edu/PADOG/ or http://www.bioconductor.org.

  7. Down-weighting overlapping genes improves gene set analysis

    PubMed Central

    2012-01-01

    Background: The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless of how specific they are to a given gene set. Results: In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method Pathway Analysis with Down-weighting of Overlapping Genes (PADOG). Unlike most gene set analysis methods, which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results. Conclusions: PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at: http://bioinformaticsprb.med.wayne.edu/PADOG/ or http://www.bioconductor.org. PMID:22713124
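
    The score described in this record, a mean of absolute weighted t-scores with larger weights for genes that occur in few sets, can be sketched as follows. This is an illustration under stated assumptions and not the PADOG package itself: the moderated t-scores are taken as given inputs, the function names are hypothetical, and the weight formula (ranging from 1 for the most frequent gene to 2 for the least frequent) is an assumed form, since the abstract does not state it.

```python
from collections import Counter
from math import sqrt


def gene_weights(gene_sets):
    """Weight each gene by how rarely it appears across all gene sets.

    Assumed form: 1 + sqrt((max_f - f) / (max_f - min_f)), so genes
    unique to one set get weight 2 and the most shared gene gets 1.
    """
    freq = Counter(g for genes in gene_sets.values() for g in set(genes))
    lo, hi = min(freq.values()), max(freq.values())
    if hi == lo:  # every gene is equally frequent: no down-weighting
        return {g: 1.0 for g in freq}
    return {g: 1.0 + sqrt((hi - f) / (hi - lo)) for g, f in freq.items()}


def set_score(genes, t_scores, weights):
    """Mean of |t| * weight over the genes of one set that were measured."""
    vals = [abs(t_scores[g]) * weights[g] for g in genes if g in t_scores]
    return sum(vals) / len(vals)


# Hypothetical toy data: g2 is shared by all three sets.
gene_sets = {"A": ["g1", "g2"], "B": ["g2", "g3"], "C": ["g2", "g4"]}
t_scores = {"g1": 2.0, "g2": -3.0, "g3": 0.5, "g4": 1.0}
weights = gene_weights(gene_sets)
print(set_score(gene_sets["A"], t_scores, weights))
```

    In the toy data the shared gene g2 keeps weight 1 while the set-specific gene g1 is doubled, so a set's score is driven more by the genes that distinguish it from other sets.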

  8. Third party annotation gene data set of eutherian lysozyme genes.

    PubMed

    Premzl, Marko

    2014-12-01

    The eutherian comparative genomic analysis protocol annotated the most comprehensive eutherian lysozyme gene data set. Among 209 potential coding sequences, the third party annotation gene data set of eutherian lysozyme genes included 116 complete coding sequences that first described seven major gene clusters. As a new framework for future experiments, the present integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis proposed a new classification and nomenclature of eutherian lysozyme genes.

  9. An Independent Filter for Gene Set Testing Based on Spectral Enrichment.

    PubMed

    Frost, H Robert; Li, Zhigang; Asselbergs, Folkert W; Moore, Jason H

    2015-01-01

    Gene set testing has become an indispensable tool for the analysis of high-dimensional genomic data. An important motivation for testing gene sets, rather than individual genomic variables, is to improve statistical power by reducing the number of tested hypotheses. Given the dramatic growth in common gene set collections, however, testing is often performed with nearly as many gene sets as underlying genomic variables. To address the challenge to statistical power posed by large gene set collections, we have developed spectral gene set filtering (SGSF), a novel technique for independent filtering of gene set collections prior to gene set testing. The SGSF method uses as a filter statistic the p-value measuring the statistical significance of the association between each gene set and the sample principal components (PCs), taking into account the significance of the associated eigenvalues. Because this filter statistic is independent of standard gene set test statistics under the null hypothesis but dependent under the alternative, the proportion of enriched gene sets is increased without impacting the type I error rate. As shown using simulated and real gene expression data, the SGSF algorithm accurately filters gene sets unrelated to the experimental outcome resulting in significantly increased gene set testing power.

  10. Gene set analyses for interpreting microarray experiments on prokaryotic organisms.

    SciTech Connect

    Tintle, Nathan; Best, Aaron; Dejongh, Matthew; VanBruggen, Dirk; Heffron, Fred; Porwollik, Steffen; Taylor, Ronald C.

    2008-11-05

    Background: Recent advances in microarray technology have brought with them the need for enhanced methods of biologically interpreting gene expression data. Recently, methods like Gene Set Enrichment Analysis (GSEA) and variants of Fisher’s exact test have been proposed which utilize a priori biological information. Typically, these methods are demonstrated with a priori biological information from the Gene Ontology. Results: Alternative gene set definitions are presented based on gene sets inferred from the SEED: open-source software environment for comparative genome annotation and analysis of microbial organisms. Many of these gene sets are then shown to provide consistent expression across a series of experiments involving Salmonella Typhimurium. Implementation of the gene sets in an analysis of microarray data is then presented for the Salmonella Typhimurium data. Conclusions: SEED-inferred gene sets can be naturally defined based on subsystems in the SEED. The consistent expression values of these SEED-inferred gene sets suggest their utility for statistical analyses of gene expression data based on a priori biological information.

  11. Statistical Redundancy Testing for Improved Gene Selection in Cancer Classification Using Microarray Data

    PubMed Central

    Hu, Simin; Rao, J. Sunil

    2007-01-01

    In gene selection for cancer classification using microarray data, we define an eigenvalue-ratio statistic to measure a gene’s contribution to the joint discriminability when this gene is included into a set of genes. Based on this eigenvalue-ratio statistic, we define a novel hypothesis testing for gene statistical redundancy and propose two gene selection methods. Simulation studies illustrate the agreement between statistical redundancy testing and gene selection methods. Real data examples show the proposed gene selection methods can select a compact gene subset which can not only be used to build high quality cancer classifiers but also show biological relevance. PMID:19455233

  12. Random forests-based differential analysis of gene sets for gene expression data.

    PubMed

    Hsueh, Huey-Miin; Zhou, Da-Wei; Tsai, Chen-An

    2013-04-10

    In DNA microarray studies, gene-set analysis (GSA) has become the focus of gene expression data analysis. GSA utilizes the gene expression profiles of functionally related gene sets in Gene Ontology (GO) categories or a priori defined biological classes to assess the significance of gene sets associated with clinical outcomes or phenotypes. Many statistical approaches have been proposed to determine whether such functionally related gene sets express differentially (enrichment and/or deletion) in variations of phenotypes. However, little attention has been given to the discriminatory power of gene sets and classification of patients. In this study, we propose a method of gene set analysis, in which gene sets are used to develop classifications of patients based on the Random Forest (RF) algorithm. The corresponding empirical p-value of an observed out-of-bag (OOB) error rate of the classifier is introduced to identify differentially expressed gene sets using an adequate resampling method. In addition, we discuss the impacts and correlations of genes within each gene set based on the measures of variable importance in the RF algorithm. Significant classifications are reported and visualized together with the underlying gene sets and their contribution to the phenotypes of interest. Numerical studies using both synthesized data and a series of publicly available gene expression data sets are conducted to evaluate the performance of the proposed methods. Compared with other hypothesis testing approaches, our proposed methods are reliable and successful in identifying enriched gene sets and in discovering the contributions of genes within a gene set. The classification results of identified gene sets can provide a valuable alternative to gene set testing to reveal the unknown, biologically relevant classes of samples or patients. In summary, our proposed method allows one to simultaneously assess the discriminatory ability of gene sets and the importance of genes for

  13. MAVTgsa: An R Package for Gene Set (Enrichment) Analysis

    DOE PAGES

    Chien, Chih-Yi; Chang, Ching-Wei; Tsai, Chen-An; ...

    2014-01-01

    Gene set analysis methods aim to determine whether an a priori defined set of genes shows a statistically significant difference in expression on either categorical or continuous outcomes. Although many methods for gene set analysis have been proposed, a systematic analysis tool for identification of different types of gene set significance modules has not been developed previously. This work presents an R package, called MAVTgsa, which includes three different methods for integrated gene set enrichment analysis. (1) The one-sided OLS (ordinary least squares) test detects coordinated changes of genes in a gene set in one direction, either up- or downregulation. (2) The two-sided MANOVA (multivariate analysis of variance) detects changes in both up- and downregulation for studying two or more experimental conditions. (3) A random forests-based procedure identifies gene sets that can accurately predict samples from different experimental conditions or are associated with the continuous phenotypes. MAVTgsa computes the P values and FDR (false discovery rate) q-value for all gene sets in the study. Furthermore, MAVTgsa provides several visualization outputs to support and interpret the enrichment results. This package is available online.
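
    The record above mentions FDR-adjusted values for all gene sets. One widely used way to obtain them is the Benjamini-Hochberg step-up adjustment, sketched below. Whether MAVTgsa uses exactly this procedure is not stated in the abstract, so this is an assumption for illustration; the function name and the p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four gene sets.
print(bh_adjust([0.01, 0.04, 0.03, 0.2]))
```

    A gene set would then be reported as significant when its adjusted value falls below the chosen FDR level.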

  14. Novel gene sets improve set-level classification of prokaryotic gene expression data.

    PubMed

    Holec, Matěj; Kuželka, Ondřej; Železný, Filip

    2015-10-28

    Set-level classification of gene expression data has received significant attention recently. In this setting, high-dimensional vectors of features corresponding to genes are converted into lower-dimensional vectors of features corresponding to biologically interpretable gene sets. The dimensionality reduction brings the promise of a decreased risk of overfitting, potentially resulting in improved accuracy of the learned classifiers. However, recent empirical research has not confirmed this expectation. Here we hypothesize that the reported unfavorable classification results in the set-level framework were due to the adoption of unsuitable gene sets defined typically on the basis of the Gene Ontology and the KEGG database of metabolic networks. We explore an alternative approach to defining gene sets, based on regulatory interactions, which we expect to collect genes with more correlated expression. We hypothesize that such more correlated gene sets will enable the learning of more accurate classifiers. We define two families of gene sets using information on regulatory interactions, and evaluate them on phenotype-classification tasks using public prokaryotic gene expression data sets. From each of the two gene-set families, we first select the best-performing subtype. The two selected subtypes are then evaluated on independent (testing) data sets against state-of-the-art gene sets and against the conventional gene-level approach. The novel gene sets are indeed more correlated than the conventional ones, and lead to significantly more accurate classifiers. Novel gene sets defined on the basis of regulatory interactions improve set-level classification of gene expression data. The experimental scripts and other material needed to reproduce the experiments are available at http://ida.felk.cvut.cz/novelgenesets.tar.gz.

  15. Assessment of gene set analysis methods based on microarray data.

    PubMed

    Alavi-Majd, Hamid; Khodakarim, Soheila; Zayeri, Farid; Rezaei-Tavirani, Mostafa; Tabatabaei, Seyyed Mohammad; Heydarpour-Meymeh, Maryam

    2014-01-25

    Gene set analysis (GSA) incorporates biological information into statistical knowledge to identify gene sets differently expressed between two or more phenotypes. It allows us to gain an insight into the functional working mechanism of cells beyond the detection of differently expressed gene sets. In order to evaluate the competence of GSA approaches, three self-contained GSA approaches with different statistical methods were chosen: Category, Globaltest and Hotelling's T². Their power to identify differences in expression was assessed using simulated and real microarray data. Category does not take the correlation structure into account, while the other two deal with correlations. R and Bioconductor were used to perform these methods. Furthermore, venous thromboembolism and acute lymphoblastic leukemia microarray data were applied. The results of the three GSAs showed that the competence of these methods depends on the distribution of gene expression in a dataset. It is very important to assay the distribution of gene expression data before choosing the GSA method to identify gene sets differently expressed between phenotypes. On the other hand, assessment of common genes among significant gene sets indicated that there was a significant agreement between the result of GSA and the findings of biologists.

  16. Sets, Probability and Statistics: The Mathematics of Life Insurance.

    ERIC Educational Resources Information Center

    Clifford, Paul C.; And Others

    The practical use of such concepts as sets, probability and statistics are considered by many to be vital and necessary to our everyday life. This student manual is intended to familiarize students with these concepts and to provide practice using real life examples. It also attempts to illustrate how the insurance industry uses such mathematic…

  17. Online Updating of Statistical Inference in the Big Data Setting.

    PubMed

    Schifano, Elizabeth D; Wu, Jing; Wang, Chun; Yan, Jun; Chen, Ming-Hui

    2016-01-01

    We present statistical methods for big data arising from online analytical processing, where large amounts of data arrive in streams and require fast analysis without storage/access to the historical data. In particular, we develop iterative estimating algorithms and statistical inferences for linear models and estimating equations that update as new data arrive. These algorithms are computationally efficient, minimally storage-intensive, and allow for possible rank deficiencies in the subset design matrices due to rare-event covariates. Within the linear model setting, the proposed online-updating framework leads to predictive residual tests that can be used to assess the goodness-of-fit of the hypothesized model. We also propose a new online-updating estimator under the estimating equation setting. Theoretical properties of the goodness-of-fit tests and proposed estimators are examined in detail. In simulation studies and real data applications, our estimator compares favorably with competing approaches under the estimating equation setting.
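
    The core idea of updating a linear-model fit as data arrive, without keeping the historical data, can be illustrated with running sufficient statistics. The sketch below handles only simple least squares with one predictor and is not the estimator proposed in the record above (it does not address rank deficiencies or estimating equations); the class and method names are hypothetical.

```python
class OnlineSimpleRegression:
    """Least-squares line y = intercept + slope * x, updated block by block.

    Only five running sums are stored, so earlier blocks can be discarded.
    """

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, xs, ys):
        """Fold a new block of observations into the running sums."""
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        """Return (slope, intercept) from the accumulated sums."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept


# Two hypothetical data blocks arriving in sequence.
model = OnlineSimpleRegression()
model.update([0, 1, 2], [1, 3, 5])
model.update([3, 4, 5], [7, 9, 11])
print(model.coefficients())
```

    The fit after the second block matches what a single fit on all six points would give, which is the property that makes block-wise updating attractive for streaming data.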

  18. Multivariate gene-set testing based on graphical models.

    PubMed

    Städler, Nicolas; Mukherjee, Sach

    2015-01-01

    The identification of predefined groups of genes ("gene-sets") which are differentially expressed between two conditions ("gene-set analysis", or GSA) is a very popular analysis in bioinformatics. GSA incorporates biological knowledge by aggregating over genes that are believed to be functionally related. This can enhance statistical power over analyses that consider only one gene at a time. However, currently available GSA approaches are based on univariate two-sample comparison of single genes. This means that they cannot test for multivariate hypotheses such as differences in covariance structure between the two conditions. Yet interplay between genes is a central aspect of biological investigation and it is likely that such interplay may differ between conditions. This paper proposes a novel approach for gene-set analysis that allows for truly multivariate hypotheses, in particular differences in gene-gene networks between conditions. Testing hypotheses concerning networks is challenging due to the nature of the underlying estimation problem. Our starting point is a recent, general approach for high-dimensional two-sample testing. We refine the approach and show how it can be used to perform multivariate, network-based gene-set testing. We validate the approach in simulated examples and show results using high-throughput data from several studies in cancer biology.

  19. Comparing Data Sets: Implicit Summaries of the Statistical Properties of Number Sets

    ERIC Educational Resources Information Center

    Morris, Bradley J.; Masnick, Amy M.

    2015-01-01

    Comparing datasets, that is, sets of numbers in context, is a critical skill in higher order cognition. Although much is known about how people compare single numbers, little is known about how number sets are represented and compared. We investigated how subjects compared datasets that varied in their statistical properties, including ratio of…

  1. Robust multi-group gene set analysis with few replicates.

    PubMed

    Mishra, Pashupati P; Medlar, Alan; Holm, Liisa; Törönen, Petri

    2016-12-09

    Competitive gene set analysis is a standard exploratory tool for gene expression data. Permutation-based competitive gene set analysis methods are preferable to parametric ones because the latter make strong statistical assumptions which are not always met. For permutation-based methods, we permute samples, as opposed to genes, as doing so preserves the inter-gene correlation structure. Unfortunately, up until now, sample permutation-based methods have required a minimum of six replicates per sample group. We propose a new permutation-based competitive gene set analysis method for multi-group gene expression data with as few as three replicates per group. The method is based on an advanced sample permutation technique that utilizes all groups within a data set for pairwise comparisons. We present a comprehensive evaluation of different permutation techniques, using multiple data sets, and contrast the performance of our method, mGSZm, with other state-of-the-art methods. We show that mGSZm is robust, and that, despite using fewer than six replicates, we are able to consistently identify a high proportion of the top-ranked gene sets from the analysis of a substantially larger data set. Further, we highlight other methods where performance is highly variable and appears dependent on the underlying data set being analyzed. Our results demonstrate that robust gene set analysis of multi-group gene expression data is permissible with as few as three replicates. In doing so, we have extended the applicability of such approaches to resource-constrained experiments where additional data generation is prohibitively difficult or expensive. An R package implementing the proposed method and supplementary materials are available from the website http://ekhidna.biocenter.helsinki.fi/downloads/pashupati/mGSZm.html.
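
    To see why few replicates limit sample permutation, consider a generic exact test that reassigns whole samples to groups, which keeps each sample's gene vector (and so the inter-gene correlation) intact. The sketch below is not mGSZm and uses a simple illustrative statistic (mean absolute difference in group means); the function names and the toy data are hypothetical.

```python
from itertools import combinations


def set_statistic(expr, genes, group_a, group_b):
    """Mean over genes of |mean in group A - mean in group B|."""
    total = 0.0
    for g in genes:
        values = expr[g]
        mean_a = sum(values[i] for i in group_a) / len(group_a)
        mean_b = sum(values[i] for i in group_b) / len(group_b)
        total += abs(mean_a - mean_b)
    return total / len(genes)


def exact_sample_permutation_p(expr, genes, group_a, n_samples):
    """Enumerate every relabelling of samples; return (p-value, count)."""
    samples = range(n_samples)
    group_a = tuple(sorted(group_a))
    group_b = [i for i in samples if i not in group_a]
    observed = set_statistic(expr, genes, group_a, group_b)
    hits = count = 0
    for perm_a in combinations(samples, len(group_a)):
        perm_b = [i for i in samples if i not in perm_a]
        # Small tolerance guards against floating-point ties.
        if set_statistic(expr, genes, perm_a, perm_b) >= observed - 1e-12:
            hits += 1
        count += 1
    return hits / count, count


# Hypothetical set of two genes, three replicates per group.
expr = {"g1": [5, 5, 5, 0, 0, 0], "g2": [4, 4, 4, 1, 1, 1]}
p_value, n_relabellings = exact_sample_permutation_p(
    expr, ["g1", "g2"], group_a=(0, 1, 2), n_samples=6
)
print(p_value, n_relabellings)
```

    With three replicates per group there are only 20 distinct relabellings, so even perfectly separated data cannot reach a small p-value with this naive scheme. That coarseness is the limitation that permutation schemes for few replicates have to work around.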

  2. Statistical tests on clustered global earthquake synthetic data sets

    NASA Astrophysics Data System (ADS)

    Daub, Eric G.; Trugman, Daniel T.; Johnson, Paul A.

    2015-08-01

    We study the ability of statistical tests to identify nonrandom features of earthquake catalogs, with a focus on the global earthquake record since 1900. We construct four types of synthetic data sets containing varying strengths of clustering, with each data set containing on average 10,000 events over 100 years with magnitudes above M = 6. We apply a suite of statistical tests to each synthetic realization in order to evaluate the ability of each test to identify the sequences of events as nonrandom. Our results show that detection ability is dependent on the quantity of data, the nature of the type of clustering, and the specific signal used in the statistical test. Data sets that exhibit a stronger variation in the seismicity rate are generally easier to identify as nonrandom for a given background rate. We also show that we can address this problem in a Bayesian framework, with the clustered data sets as prior distributions. Using this new Bayesian approach, we can place quantitative bounds on the range of possible clustering strengths that are consistent with the global earthquake data. At M = 7, we can estimate 99th percentile confidence bounds on the number of triggered events, with an upper bound of 20% of the catalog for global aftershock sequences, with a stronger upper bound on the fraction of triggered events of 10% for long-term event clusters. At M = 8, the bounds are less strict due to the reduced number of events. However, our analysis shows that other types of clustering could be present in the data that we are unable to detect. Our results aid in the interpretation of the results of statistical tests on earthquake catalogs, both worldwide and regionally.

  3. Gene set analyses for interpreting microarray experiments on prokaryotic organisms.

    PubMed

    Tintle, Nathan L; Best, Aaron A; DeJongh, Matthew; Van Bruggen, Dirk; Heffron, Fred; Porwollik, Steffen; Taylor, Ronald C

    2008-11-05

    Despite the widespread usage of DNA microarrays, questions remain about how best to interpret the wealth of gene-by-gene transcriptional levels that they measure. Recently, methods have been proposed which use biologically defined sets of genes in interpretation, instead of examining results gene-by-gene. Despite a serious limitation, a method based on Fisher's exact test remains one of the few plausible options for gene set analysis when an experiment has few replicates, as is typically the case for prokaryotes. We extend five methods of gene set analysis from use on experiments with multiple replicates, for use on experiments with few replicates. We then use simulated and real data to compare these methods with each other and with the Fisher's exact test (FET) method. As a result of the simulation we find that a method named MAXMEAN-NR maintains the nominal rate of false positive findings (type I error rate) while offering good statistical power and robustness to a variety of gene set distributions for set sizes of at least 10. Other methods (ABSSUM-NR or SUM-NR) are shown to be powerful for set sizes less than 10. Analysis of three sets of experimental data shows similar results. Furthermore, the MAXMEAN-NR method is shown to be able to detect biologically relevant sets as significant, when other methods (including FET) cannot. We also find that the popular GSEA-NR method performs poorly when compared to MAXMEAN-NR. MAXMEAN-NR is a method of gene set analysis for experiments with few replicates, as is common for prokaryotes. Results of simulation and real data analysis suggest that the MAXMEAN-NR method offers increased robustness and biological relevance of findings as compared to FET and other methods, while maintaining the nominal type I error rate.

  4. Gene set analyses for interpreting microarray experiments on prokaryotic organisms

    PubMed Central

    Tintle, Nathan L; Best, Aaron A; DeJongh, Matthew; Van Bruggen, Dirk; Heffron, Fred; Porwollik, Steffen; Taylor, Ronald C

    2008-01-01

    Background Despite the widespread usage of DNA microarrays, questions remain about how best to interpret the wealth of gene-by-gene transcriptional levels that they measure. Recently, methods have been proposed which use biologically defined sets of genes in interpretation, instead of examining results gene-by-gene. Despite a serious limitation, a method based on Fisher's exact test remains one of the few plausible options for gene set analysis when an experiment has few replicates, as is typically the case for prokaryotes. Results We extend five methods of gene set analysis from use on experiments with multiple replicates, for use on experiments with few replicates. We then use simulated and real data to compare these methods with each other and with the Fisher's exact test (FET) method. As a result of the simulation we find that a method named MAXMEAN-NR maintains the nominal rate of false positive findings (type I error rate) while offering good statistical power and robustness to a variety of gene set distributions for set sizes of at least 10. Other methods (ABSSUM-NR or SUM-NR) are shown to be powerful for set sizes less than 10. Analysis of three sets of experimental data shows similar results. Furthermore, the MAXMEAN-NR method is shown to be able to detect biologically relevant sets as significant, when other methods (including FET) cannot. We also find that the popular GSEA-NR method performs poorly when compared to MAXMEAN-NR. Conclusion MAXMEAN-NR is a method of gene set analysis for experiments with few replicates, as is common for prokaryotes. Results of simulation and real data analysis suggest that the MAXMEAN-NR method offers increased robustness and biological relevance of findings as compared to FET and other methods, while maintaining the nominal type I error rate. PMID:18986519

  5. WebGestalt: an integrated system for exploring gene sets in various biological contexts.

    PubMed

    Zhang, Bing; Kirov, Stefan; Snoddy, Jay

    2005-07-01

    High-throughput technologies have led to the rapid generation of large-scale datasets about genes and gene products. These technologies have also shifted our research focus from 'single genes' to 'gene sets'. We have developed a web-based integrated data mining system, WebGestalt (http://genereg.ornl.gov/webgestalt/), to help biologists in exploring large sets of genes. WebGestalt is composed of four modules: gene set management, information retrieval, organization/visualization, and statistics. The management module uploads, saves, retrieves and deletes gene sets, as well as performs Boolean operations to generate the unions, intersections or differences between different gene sets. The information retrieval module currently retrieves information for up to 20 attributes for all genes in a gene set. The organization/visualization module organizes and visualizes gene sets in various biological contexts, including Gene Ontology, tissue expression pattern, chromosome distribution, metabolic and signaling pathways, protein domain information and publications. The statistics module recommends and performs statistical tests to suggest biological areas that are important to a gene set and warrant further investigation. In order to demonstrate the use of WebGestalt, we have generated 48 gene sets with genes over-represented in various human tissue types. Exploration of all the 48 gene sets using WebGestalt is available for the public at http://genereg.ornl.gov/webgestalt/wg_enrich.php.
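
    The Boolean operations described for the management module map directly onto set algebra. The sketch below is only an illustration using Python's built-in sets; it is not WebGestalt's code, and the function name and gene symbols are hypothetical.

```python
def combine_gene_sets(set_a, set_b):
    """Return the union, intersection and difference (A minus B) of two gene sets."""
    return {
        "union": set_a | set_b,
        "intersection": set_a & set_b,
        "difference": set_a - set_b,
    }

# Hypothetical gene lists chosen only for the example.
liver = {"ALB", "APOA1", "TF"}
plasma = {"ALB", "TF", "HBB"}
result = combine_gene_sets(liver, plasma)
print(sorted(result["intersection"]))
```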

  6. Analysis of gene set using shrinkage covariance matrix approach

    NASA Astrophysics Data System (ADS)

    Karjanto, Suryaefiza; Aripin, Rasimah

    2013-09-01

    Microarray methodology has been exploited for different applications such as gene discovery and disease diagnosis. This technology is also used for quantitative and highly parallel measurements of gene expression. Recently, microarrays have been one of the main interests of statisticians because they provide a perfect example of the paradigms of modern statistics. In this study, an alternative approach to estimating the covariance matrix is proposed to solve the high dimensionality problem in microarrays. An extension of the traditional Hotelling's T² statistic is constructed for determining the significant gene sets across experimental conditions using a shrinkage approach. Real data sets were used as illustrations to compare the performance of the proposed methods with other methods. The results across the methods are consistent, implying that this approach provides an alternative to existing techniques.
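
    The abstract gives no formulas, so the following is only a sketch of how a shrinkage covariance estimate can enter the standard two-sample Hotelling's T² statistic. The shrinkage target D (for example, the diagonal of S) and the intensity λ are assumptions, not the authors' stated choices.

```latex
\begin{align}
  S^{*} &= \lambda\, D + (1 - \lambda)\, S, \qquad 0 \le \lambda \le 1, \\
  T^{2} &= \frac{n_1 n_2}{n_1 + n_2}\,
           (\bar{x}_1 - \bar{x}_2)^{\top} (S^{*})^{-1} (\bar{x}_1 - \bar{x}_2).
\end{align}
```

    Here S is the pooled sample covariance of the genes in the set, and x̄₁, x̄₂ and n₁, n₂ are the group mean vectors and sample sizes. When the set contains more genes than there are samples, S is singular, whereas S* is invertible for any λ > 0 as long as D has a positive diagonal.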

  7. Caipirini: using gene sets to rank literature

    PubMed Central

    2012-01-01

    Background Keeping up-to-date with bioscience literature is becoming increasingly challenging. Several recent methods help meet this challenge by allowing literature search to be launched based on lists of abstracts that the user judges to be 'interesting'. Some methods go further by allowing the user to provide a second input set of 'uninteresting' abstracts; these two input sets are then used to search and rank literature by relevance. In this work we present the service 'Caipirini' (http://caipirini.org) that also allows two input sets, but takes the novel approach of allowing ranking of literature based on one or more sets of genes. Results To evaluate the usefulness of Caipirini, we used two test cases, one related to the human cell cycle, and a second related to disease defense mechanisms in Arabidopsis thaliana. In both cases, the new method achieved high precision in finding literature related to the biological mechanisms underlying the input data sets. Conclusions To our knowledge Caipirini is the first service enabling literature search directly based on biological relevance to gene sets; thus, Caipirini gives the research community a new way to unlock hidden knowledge from gene sets derived via high-throughput experiments. PMID:22297131

  8. The limitations of simple gene set enrichment analysis assuming gene independence.

    PubMed

    Tamayo, Pablo; Steinhardt, George; Liberzon, Arthur; Mesirov, Jill P

    2016-02-01

    Since its first publication in 2003, the Gene Set Enrichment Analysis method, based on the Kolmogorov-Smirnov statistic, has been heavily used, modified, and also questioned. Recently a simplified approach using a one-sample t-test score to assess enrichment and ignoring gene-gene correlations was proposed by Irizarry et al. 2009 as a serious contender. The argument criticizes Gene Set Enrichment Analysis's nonparametric nature and its use of an empirical null distribution as unnecessary and hard to compute. We refute these claims by careful consideration of the assumptions of the simplified method and its results, including a comparison with those of Gene Set Enrichment Analysis on a large benchmark set of 50 datasets. Our results provide strong empirical evidence that gene-gene correlations cannot be ignored, because of the significant variance inflation they produce on the enrichment scores, and should be taken into account when estimating gene set enrichment significance. In addition, we discuss the challenges that the complex correlation structure and multi-modality of gene sets pose more generally for gene set enrichment methods.

  9. Camera: a competitive gene set test accounting for inter-gene correlation.

    PubMed

    Wu, Di; Smyth, Gordon K

    2012-09-01

    Competitive gene set tests are commonly used in molecular pathway analysis to test for enrichment of a particular gene annotation category amongst the differential expression results from a microarray experiment. Existing gene set tests that rely on gene permutation are shown here to be extremely sensitive to inter-gene correlation. Several data sets are analyzed to show that inter-gene correlation is non-ignorable even for experiments on homogeneous cell populations using genetically identical model organisms. A new gene set test procedure (CAMERA) is proposed based on the idea of estimating the inter-gene correlation from the data, and using it to adjust the gene set test statistic. An efficient procedure is developed for estimating the inter-gene correlation and characterizing its precision. CAMERA is shown to control the type I error rate correctly regardless of inter-gene correlations, yet retains excellent power for detecting genuine differential expression. Analysis of breast cancer data shows that CAMERA recovers known relationships between tumor subtypes in very convincing terms. CAMERA can be used to analyze specified sets or as a pathway analysis tool using a database of molecular signatures.
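
    The effect of inter-gene correlation on a set-level mean can be sketched with the standard variance inflation factor: for m equally variable gene statistics with average pairwise correlation ρ̄, the variance of their mean is inflated by 1 + (m - 1)ρ̄ relative to independence. The code below only illustrates that identity with a hypothetical correlation matrix; it does not reproduce CAMERA's estimation procedure, and the function names are illustrative.

```python
def mean_pairwise_correlation(corr):
    """Average of the off-diagonal entries of a symmetric correlation matrix."""
    m = len(corr)
    total = sum(corr[i][j] for i in range(m) for j in range(i + 1, m))
    return total / (m * (m - 1) / 2)

def variance_inflation_factor(m, rho_bar):
    """Inflation of the variance of a mean of m equally correlated statistics."""
    return 1.0 + (m - 1) * rho_bar

# Hypothetical 3-gene correlation matrix.
corr = [
    [1.0, 0.1, 0.2],
    [0.1, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
rho_bar = mean_pairwise_correlation(corr)
print(variance_inflation_factor(len(corr), rho_bar))
```

    Even a small average correlation matters for large sets: with 100 genes and ρ̄ = 0.05, the factor is already close to 6.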

  10. Statistical ensemble of gene regulatory networks of macrophage differentiation.

    PubMed

    Castiglione, Filippo; Tieri, Paolo; Palma, Alessandro; Jarrah, Abdul Salam

    2016-12-22

    Macrophages play a major role in the immune system; they are the most plastic immune cells and perform several key immune functions. Here we derived a minimalistic gene regulatory network model for the differentiation of macrophages into the two phenotypes M1 (pro-) and M2 (anti-inflammatory). To test the model, we simulated a large number of such networks as in a statistical ensemble. In other words, to enable the inter-cellular crosstalk required to obtain an immune activation in which the macrophage plays its role, the simulated networks are not taken in isolation but combined with other cellular agents, thus setting up a discrete minimalistic model of the immune system at the microscopic/intracellular (i.e., genetic regulation) and mesoscopic/intercellular scale. We show that within the mesoscopic level description of cellular interaction and cooperation, the gene regulatory logic is coherent and contributes to the overall dynamics of the ensembles, which show, statistically, the expected behaviour.

  11. What's statistical about learning? Insights from modelling statistical learning as a set of memory processes.

    PubMed

    Thiessen, Erik D

    2017-01-05

    Statistical learning has been studied in a variety of different tasks, including word segmentation, object identification, category learning, artificial grammar learning and serial reaction time tasks (e.g. Saffran et al. 1996 Science 274: 1926-1928; Orban et al. 2008 Proceedings of the National Academy of Sciences 105: 2745-2750; Thiessen & Yee 2010 Child Development 81: 1287-1303; Saffran 2002 Journal of Memory and Language 47: 172-196; Misyak & Christiansen 2012 Language Learning 62: 302-331). The difference among these tasks raises questions about whether they all depend on the same kinds of underlying processes and computations, or whether they are tapping into different underlying mechanisms. Prior theoretical approaches to statistical learning have often tried to explain or model learning in a single task. However, in many cases these approaches appear inadequate to explain performance in multiple tasks. For example, explaining word segmentation via the computation of sequential statistics (such as transitional probability) provides little insight into the nature of sensitivity to regularities among simultaneously presented features. In this article, we will present a formal computational approach that we believe is a good candidate to provide a unifying framework to explore and explain learning in a wide variety of statistical learning tasks. This framework suggests that statistical learning arises from a set of processes that are inherent in memory systems, including activation, interference, integration of information and forgetting (e.g. Perruchet & Vinter 1998 Journal of Memory and Language 39: 246-263; Thiessen et al. 2013 Psychological Bulletin 139: 792-814). From this perspective, statistical learning does not involve explicit computation of statistics, but rather the extraction of elements of the input into memory traces, and subsequent integration across those memory traces that emphasize consistent information (Thiessen and Pavlik

  12. Ranking metrics in gene set enrichment analysis: do they matter?

    PubMed

    Zyla, Joanna; Marczyk, Michal; Weiner, January; Polanska, Joanna

    2017-05-12

    There exist many methods for describing the complex relation between changes of gene expression in molecular pathways or gene ontologies under different experimental conditions. Among them, Gene Set Enrichment Analysis seems to be one of the most commonly used (over 10,000 citations). An important parameter, which could affect the final result, is the choice of a metric for the ranking of genes. Applying a default ranking metric may lead to poor results. In this work 28 benchmark data sets were used to evaluate the sensitivity and false positive rate of gene set analysis for 16 different ranking metrics including new proposals. Furthermore, the robustness of the chosen methods to sample size was tested. Using k-means clustering algorithm a group of four metrics with the highest performance in terms of overall sensitivity, overall false positive rate and computational load was established i.e. absolute value of Moderated Welch Test statistic, Minimum Significant Difference, absolute value of Signal-To-Noise ratio and Baumgartner-Weiss-Schindler test statistic. In case of false positive rate estimation, all selected ranking metrics were robust with respect to sample size. In case of sensitivity, the absolute value of Moderated Welch Test statistic and absolute value of Signal-To-Noise ratio gave stable results, while Baumgartner-Weiss-Schindler and Minimum Significant Difference showed better results for larger sample size. Finally, the Gene Set Enrichment Analysis method with all tested ranking metrics was parallelised and implemented in MATLAB, and is available at https://github.com/ZAEDPolSl/MrGSEA . Choosing a ranking metric in Gene Set Enrichment Analysis has critical impact on results of pathway enrichment analysis. The absolute value of Moderated Welch Test has the best overall sensitivity and Minimum Significant Difference has the best overall specificity of gene set analysis. When the number of non-normally distributed genes is high, using Baumgartner

  13. WebGestalt: an integrated system for exploring gene sets in various biological contexts

    PubMed Central

    Zhang, Bing; Kirov, Stefan; Snoddy, Jay

    2005-01-01

    High-throughput technologies have led to the rapid generation of large-scale datasets about genes and gene products. These technologies have also shifted our research focus from ‘single genes’ to ‘gene sets’. We have developed a web-based integrated data mining system, WebGestalt (http://genereg.ornl.gov/webgestalt/), to help biologists in exploring large sets of genes. WebGestalt is composed of four modules: gene set management, information retrieval, organization/visualization, and statistics. The management module uploads, saves, retrieves and deletes gene sets, as well as performs Boolean operations to generate the unions, intersections or differences between different gene sets. The information retrieval module currently retrieves information for up to 20 attributes for all genes in a gene set. The organization/visualization module organizes and visualizes gene sets in various biological contexts, including Gene Ontology, tissue expression pattern, chromosome distribution, metabolic and signaling pathways, protein domain information and publications. The statistics module recommends and performs statistical tests to suggest biological areas that are important to a gene set and warrant further investigation. In order to demonstrate the use of WebGestalt, we have generated 48 gene sets with genes over-represented in various human tissue types. Exploration of all the 48 gene sets using WebGestalt is available for the public at http://genereg.ornl.gov/webgestalt/wg_enrich.php. PMID:15980575

  14. The Gene Set Builder: collation, curation, and distribution of sets of genes

    PubMed Central

    Yusuf, Dimas; Lim, Jonathan S; Wasserman, Wyeth W

    2005-01-01

    Background In bioinformatics and genomics, there are many applications designed to investigate the common properties for a set of genes. Often, these multi-gene analysis tools attempt to reveal sequential, functional, and expressional ties. However, while tremendous effort has been invested in developing tools that can analyze a set of genes, minimal effort has been invested in developing tools that can help researchers compile, store, and annotate gene sets in the first place. As a result, the process of making or accessing a set often involves tedious and time consuming steps such as finding identifiers for each individual gene. These steps are often repeated extensively to shift from one identifier type to another; or to recreate a published set. In this paper, we present a simple online tool which – with the help of the gene catalogs Ensembl and GeneLynx – can help researchers build and annotate sets of genes quickly and easily. Description The Gene Set Builder is a database-driven, web-based tool designed to help researchers compile, store, export, and share sets of genes. This application supports the 17 eukaryotic genomes found in version 32 of the Ensembl database, which includes species from yeast to human. User-created information such as sets and customized annotations are stored to facilitate easy access. Gene sets stored in the system can be "exported" in a variety of output formats – as lists of identifiers, in tables, or as sequences. In addition, gene sets can be "shared" with specific users to facilitate collaborations or fully released to provide access to published results. The application also features a Perl API (Application Programming Interface) for direct connectivity to custom analysis tools. A downloadable Quick Reference guide and an online tutorial are available to help new users learn its functionalities. Conclusion The Gene Set Builder is an Ensembl-facilitated online tool designed to help researchers compile and manage sets of

  15. The Gene Set Builder: collation, curation, and distribution of sets of genes.

    PubMed

    Yusuf, Dimas; Lim, Jonathan S; Wasserman, Wyeth W

    2005-12-21

    In bioinformatics and genomics, there are many applications designed to investigate the common properties for a set of genes. Often, these multi-gene analysis tools attempt to reveal sequential, functional, and expressional ties. However, while tremendous effort has been invested in developing tools that can analyze a set of genes, minimal effort has been invested in developing tools that can help researchers compile, store, and annotate gene sets in the first place. As a result, the process of making or accessing a set often involves tedious and time consuming steps such as finding identifiers for each individual gene. These steps are often repeated extensively to shift from one identifier type to another; or to recreate a published set. In this paper, we present a simple online tool which - with the help of the gene catalogs Ensembl and GeneLynx - can help researchers build and annotate sets of genes quickly and easily. The Gene Set Builder is a database-driven, web-based tool designed to help researchers compile, store, export, and share sets of genes. This application supports the 17 eukaryotic genomes found in version 32 of the Ensembl database, which includes species from yeast to human. User-created information such as sets and customized annotations are stored to facilitate easy access. Gene sets stored in the system can be "exported" in a variety of output formats - as lists of identifiers, in tables, or as sequences. In addition, gene sets can be "shared" with specific users to facilitate collaborations or fully released to provide access to published results. The application also features a Perl API (Application Programming Interface) for direct connectivity to custom analysis tools. A downloadable Quick Reference guide and an online tutorial are available to help new users learn its functionalities. The Gene Set Builder is an Ensembl-facilitated online tool designed to help researchers compile and manage sets of genes in a user-friendly environment. 

  16. STATISTICS OF DARK MATTER HALOS FROM THE EXCURSION SET APPROACH

    SciTech Connect

    Lapi, A.; Salucci, P.; Danese, L.

    2013-08-01

    We exploit the excursion set approach in integral formulation to derive novel, accurate analytic approximations of the unconditional and conditional first crossing distributions for random walks with uncorrelated steps and general shapes of the moving barrier; we find the corresponding approximations of the unconditional and conditional halo mass functions for cold dark matter (DM) power spectra to represent very well the outcomes of state-of-the-art cosmological N-body simulations. In addition, we apply these results to derive, and confront with simulations, other quantities of interest in halo statistics, including the rates of halo formation and creation, the average halo growth history, and the halo bias. Finally, we discuss how our approach and main results change when considering random walks with correlated instead of uncorrelated steps, and warm instead of cold DM power spectra.

  17. A test statistic for the affected-sib-set method.

    PubMed

    Lange, K

    1986-07-01

    This paper discusses generalizations of the affected-sib-pair method. First, the requirement that sib identity-by-descent relations be known unambiguously is relaxed by substituting sib identity-by-state relations. This permits affected sibs to be used even when their parents are unavailable for typing. In the limit of an infinite number of marker alleles each of infinitesimal population frequency, the identity-by-state relations coincide with the usual identity-by-descent relations. Second, a weighted pairs test statistic is proposed that covers affected sib sets of size greater than two. These generalizations make the affected-sib-pair method a more powerful technique for detecting departures from independent segregation of disease and marker phenotypes. A sample calculation suggests such a departure for tuberculoid leprosy and the HLA D locus.

  18. Using the Gene Ontology to Scan Multi-Level Gene Sets for Associations in Genome Wide Association Studies

    PubMed Central

    Schaid, Daniel J.; Sinnwell, Jason P.; Jenkins, Gregory D.; McDonnell, Shannon K.; Ingle, James N.; Kubo, Michiaki; Goss, Paul E.; Costantino, Joseph P.; Wickerham, D. Lawrence; Weinshilboum, Richard M.

    2011-01-01

    Gene-set analyses have been widely used in gene expression studies, and some of the developed methods have been extended to genome wide association studies (GWAS). Yet, complications due to linkage disequilibrium (LD) among single nucleotide polymorphisms (SNPs), and variable numbers of SNPs per gene and genes per gene-set, have plagued current approaches, often leading to ad hoc “fixes”. To overcome some of the current limitations, we developed a general approach to scan GWAS SNP data for both gene-level and gene-set analyses, building on score statistics for generalized linear models, and taking advantage of the directed acyclic graph structure of the gene ontology when creating gene-sets. However, other types of gene-set structures can be used, such as the popular Kyoto Encyclopedia of Genes and Genomes (KEGG). Our approach combines SNPs into genes, and genes into gene-sets, but assures that positive and negative effects of genes on a trait do not cancel. To control for multiple testing of many gene-sets, we use an efficient computational strategy that accounts for LD and provides accurate step-down adjusted p-values for each gene-set. Application of our methods to two different GWAS provides guidance on the potential strengths and weaknesses of our proposed gene-set analyses. PMID:22161999

  19. A statistical mechanics analysis of the set covering problem

    NASA Astrophysics Data System (ADS)

    Fontanari, J. F.

    1996-02-01

    The dependence of the optimal solution average cost of the set covering problem on the density of 1's of the incidence matrix and on the number of constraints (P) is investigated in the limit where the number of items (N) goes to infinity. The annealed approximation is employed to study two stochastic models: the constant density model, where the elements of the incidence matrix are statistically independent random variables, and the Karp model, where the rows of the incidence matrix possess the same number of 1's. Lower bounds for the average cost are presented in the case that P scales with ln N and the density is of order 1, as well as in the case that P scales linearly with N and the density is of order 1/N. It is shown that in the case that P scales with exp N and the density is of order 1 the annealed approximation yields exact results for both models.

  20. DBGSA: a novel method of distance-based gene set analysis.

    PubMed

    Li, Jin; Wang, Limei; Xu, Liangde; Zhang, Ruijie; Huang, Meilin; Wang, Ke; Xu, Jiankai; Lv, Hongchao; Shang, Zhenwei; Zhang, Mingming; Jiang, Yongshuai; Guo, Maozu; Li, Xia

    2012-10-01

    When compared with single gene functional analysis, gene set analysis (GSA) can extract more information from gene expression profiles. Currently, several gene set methods have been proposed, but most of the methods cannot detect gene sets with a large number of minor-effect genes. Here, we propose a novel distance-based gene set analysis method. The distance between two groups of genes with different phenotypes based on gene expression should be larger if a certain gene set is significantly associated with the given phenotype. We calculated the distance between two groups with different phenotypes, estimated the P-values using two permutation methods and performed multiple hypothesis testing adjustments. This method was performed on one simulated data set and three real data sets. After a comparison and literature verification, we determined that the gene resampling-based permutation method is more suitable for GSA, and the centroid statistical and average linkage statistical distance methods are efficient, especially in detecting gene sets containing more minor-effect genes. We believe that this distance-based method will assist us in finding functional gene sets that are significantly related to a complex trait. Additionally, we have prepared a simple and publicly available Perl and R package (http://bioinfo.hrbmu.edu.cn/dbgsa or http://cran.r-project.org/web/packages/DBGSA/).
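
    The distance-plus-resampling idea can be sketched as follows. This is an illustration under stated assumptions, not the DBGSA package: the distance is taken to be the Euclidean distance between the two phenotype-group centroids, the null distribution comes from random gene sets of the same size, and the data layout and all names are hypothetical.

```python
import math
import random
from statistics import mean

def centroid_distance(expr, genes, group1, group2):
    """Euclidean distance between the two group centroids over `genes`.

    expr maps a gene name to its expression values, one per sample;
    group1 and group2 are lists of sample indices.
    """
    return math.sqrt(sum(
        (mean(expr[g][i] for i in group1) - mean(expr[g][i] for i in group2)) ** 2
        for g in genes
    ))

def gene_resampling_pvalue(expr, gene_set, group1, group2, n_perm=1000, seed=0):
    """Compare the observed distance with distances of random same-size gene sets."""
    rng = random.Random(seed)
    all_genes = sorted(expr)
    observed = centroid_distance(expr, gene_set, group1, group2)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, len(gene_set))
        if centroid_distance(expr, random_set, group1, group2) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: 20 genes, 6 samples; only g0..g2 differ between the groups.
expr = {f"g{k}": [1.0] * 6 for k in range(20)}
for k in range(3):
    expr[f"g{k}"] = [1.0, 1.0, 1.0, 6.0, 6.0, 6.0]
dist, p = gene_resampling_pvalue(expr, ["g0", "g1", "g2"], [0, 1, 2], [3, 4, 5], n_perm=200)
print(dist, p)
```

    Multiple-testing adjustment across gene sets, which the abstract also mentions, is left out of this sketch.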

  1. Gene Sets Net Correlations Analysis (GSNCA): a multivariate differential coexpression test for gene sets

    PubMed Central

    Rahmatallah, Yasir; Emmert-Streib, Frank; Glazko, Galina

    2014-01-01

    Motivation: To date, gene set analysis approaches primarily focus on identifying differentially expressed gene sets (pathways). Methods for identifying differentially coexpressed pathways also exist but are mostly based on aggregated pairwise correlations or other pairwise measures of coexpression. Instead, we propose Gene Sets Net Correlations Analysis (GSNCA), a multivariate differential coexpression test that accounts for the complete correlation structure between genes. Results: In GSNCA, weight factors are assigned to genes in proportion to the genes’ cross-correlations (intergene correlations). The problem of finding the weight vectors is formulated as an eigenvector problem with a unique solution. GSNCA tests the null hypothesis that for a gene set there is no difference in the weight vectors of the genes between two conditions. In simulation studies and the analyses of experimental data, we demonstrate that GSNCA captures changes in the structure of genes’ cross-correlations rather than differences in the averaged pairwise correlations. Thus, GSNCA infers differences in coexpression networks while bypassing the method-dependent steps of network inference. As an additional result from GSNCA, we define hub genes as genes with the largest weights and show that these genes correspond frequently to major and specific pathway regulators, as well as to genes that are most affected by the biological difference between two conditions. In summary, GSNCA is a new approach for the analysis of differentially coexpressed pathways that also evaluates the importance of the genes in the pathways, thus providing unique information that may result in the generation of novel biological hypotheses. Availability and implementation: Implementation of the GSNCA test in R is available upon request from the authors. Contact: YRahmatallah@uams.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24292935

  2. The Limitations of Simple Gene Set Enrichment Analysis Assuming Gene Independence

    PubMed Central

    Tamayo, Pablo; Steinhardt, George; Liberzon, Arthur; Mesirov, Jill P.

    2013-01-01

    Since its first publication in 2003, the Gene Set Enrichment Analysis (GSEA) method, based on the Kolmogorov-Smirnov statistic, has been heavily used, modified, and also questioned. Recently a simplified approach, using a one-sample t-test score to assess enrichment and ignoring gene-gene correlations, was proposed by Irizarry et al. 2009 as a serious contender. The argument criticizes GSEA’s nonparametric nature and its use of an empirical null distribution as unnecessary and hard to compute. We refute these claims by careful consideration of the assumptions of the simplified method and its results, including a comparison with those of GSEA on a large benchmark set of 50 datasets. Our results provide strong empirical evidence that gene-gene correlations cannot be ignored, because of the significant variance inflation they produce on the enrichment scores, and should be taken into account when estimating gene set enrichment significance. In addition, we discuss the challenges that the complex correlation structure and multi-modality of gene sets pose more generally for gene set enrichment methods. PMID:23070592

  3. Grouping Gene Ontology terms to improve the assessment of gene set enrichment in microarray data.

    PubMed

    Lewin, Alex; Grieve, Ian C

    2006-10-03

    Gene Ontology (GO) terms are often used to assess the results of microarray experiments. The most common way to do this is to perform Fisher's exact tests to find GO terms which are over-represented amongst the genes declared to be differentially expressed in the analysis of the microarray experiment. However, due to the high degree of dependence between GO terms, statistical testing is conservative, and interpretation is difficult. We propose testing groups of GO terms rather than individual terms, to increase statistical power, reduce dependence between tests and improve the interpretation of results. We use the publicly available package POSOC to group the terms. Our method finds groups of GO terms significantly over-represented amongst differentially expressed genes which are not found by Fisher's tests on individual GO terms. Grouping Gene Ontology terms improves the interpretation of gene set enrichment for microarray data.
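
    The per-term test described here, a one-sided Fisher's exact test for over-representation, is equivalent to an upper-tail hypergeometric probability. A minimal sketch using only the standard library follows; the function name and the counts in the example are hypothetical, and the grouping of GO terms with POSOC is not shown.

```python
from math import comb

def overrepresentation_pvalue(n_total, n_in_term, n_de, n_overlap):
    """One-sided Fisher's exact test p-value for over-representation.

    n_total:   genes on the array
    n_in_term: genes annotated to the GO term
    n_de:      genes declared differentially expressed
    n_overlap: differentially expressed genes annotated to the term
    Returns P(X >= n_overlap) for X hypergeometric.
    """
    upper = min(n_in_term, n_de)
    tail = sum(
        comb(n_in_term, k) * comb(n_total - n_in_term, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_de)

# Toy example: 10 genes, 4 in the term, 3 declared DE, 2 of them in the term.
print(overrepresentation_pvalue(10, 4, 3, 2))
```

    In practice one p-value is computed per GO term (or per group of terms) and then adjusted for multiple testing.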

  4. Applying Statistical Process Quality Control Methodology to Educational Settings.

    ERIC Educational Resources Information Center

    Blumberg, Carol Joyce

    A subset of Statistical Process Control (SPC) methodology known as Control Charting is introduced. SPC methodology is a collection of graphical and inferential statistics techniques used to study the progress of phenomena over time. The types of control charts covered are the X-bar (mean), R (Range), X (individual observations), MR (moving…

  5. GAGE: generally applicable gene set enrichment for pathway analysis

    PubMed Central

    Luo, Weijun; Friedman, Michael S; Shedden, Kerby; Hankenson, Kurt D; Woolf, Peter J

    2009-01-01

    Background Gene set analysis (GSA) is a widely used strategy for gene expression data analysis based on pathway knowledge. GSA focuses on sets of related genes and has established major advantages over individual gene analyses, including greater robustness, sensitivity and biological relevance. However, previous GSA methods have limited usage as they cannot handle datasets of different sample sizes or experimental designs. Results To address these limitations, we present a new GSA method called Generally Applicable Gene-set Enrichment (GAGE). We successfully apply GAGE to multiple microarray datasets with different sample sizes, experimental designs and profiling techniques. GAGE shows significantly better results when compared to two other commonly used GSA methods of GSEA and PAGE. We demonstrate this improvement in the following three aspects: (1) consistency across repeated studies/experiments; (2) sensitivity and specificity; (3) biological relevance of the regulatory mechanisms inferred. GAGE reveals novel and relevant regulatory mechanisms from both published and previously unpublished microarray studies. From two published lung cancer data sets, GAGE derived a more cohesive and predictive mechanistic scheme underlying lung cancer progress and metastasis. For a previously unpublished BMP6 study, GAGE predicted novel regulatory mechanisms for BMP6 induced osteoblast differentiation, including the canonical BMP-TGF beta signaling, JAK-STAT signaling, Wnt signaling, and estrogen signaling pathways–all of which are supported by the experimental literature. Conclusion GAGE is generally applicable to gene expression datasets with different sample sizes and experimental designs. GAGE consistently outperformed two most frequently used GSA methods and inferred statistically and biologically more relevant regulatory pathways. The GAGE method is implemented in R in the "gage" package, available under the GNU GPL from . PMID:19473525

  6. Functional cohesion of gene sets determined by latent semantic indexing of PubMed abstracts.

    PubMed

    Xu, Lijing; Furlotte, Nicholas; Lin, Yunyue; Heinrich, Kevin; Berry, Michael W; George, Ebenezer O; Homayouni, Ramin

    2011-04-14

    High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature. GCAT is freely available at http://binf1.memphis.edu/gcat.

  7. Functional Cohesion of Gene Sets Determined by Latent Semantic Indexing of PubMed Abstracts

    PubMed Central

    Xu, Lijing; Furlotte, Nicholas; Lin, Yunyue; Heinrich, Kevin; Berry, Michael W.; George, Ebenezer O.; Homayouni, Ramin

    2011-01-01

    High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature. Availability GCAT is freely available at http://binf1.memphis.edu/gcat PMID:21533142

  8. Integrating genetic and gene expression evidence into genome-wide association analysis of gene sets

    PubMed Central

    Xiong, Qing; Ancona, Nicola; Hauser, Elizabeth R.; Mukherjee, Sayan; Furey, Terrence S.

    2012-01-01

    Single variant or single gene analyses generally account for only a small proportion of the phenotypic variation in complex traits. Alternatively, gene set or pathway association analyses are playing an increasingly important role in uncovering genetic architectures of complex traits through the identification of systematic genetic interactions. Two dominant paradigms for gene set analyses are association analyses based on SNP genotypes and those based on gene expression profiles. However, gene–disease association can manifest in many ways, such as alterations of gene expression, genotype, and copy number; thus, an integrative approach combining multiple forms of evidence can more accurately and comprehensively capture pathway associations. We have developed a single statistical framework, Gene Set Association Analysis (GSAA), that simultaneously measures genome-wide patterns of genetic variation and gene expression variation to identify sets of genes enriched for differential expression and/or trait-associated genetic markers. Simulation studies illustrate that joint analyses of genomic data increase the power to detect real associations when compared with gene set methods that use only one genomic data type. The analysis of two human diseases, glioblastoma and Crohn's disease, detected abnormalities in previously identified disease-associated pathways, such as pathways related to PI3K signaling, DNA damage response, and the activation of NFKB. In addition, GSAA predicted novel pathway associations, for example, differential genetic and expression characteristics in genes from the ABC transporter family in glioblastoma and from the HLA system in Crohn's disease. These demonstrate that GSAA can help uncover biological pathways underlying human diseases and complex traits. PMID:21940837

  9. Simple Data Sets for Distinct Basic Summary Statistics

    ERIC Educational Resources Information Center

    Lesser, Lawrence M.

    2011-01-01

    It is important to avoid ambiguity with numbers because unfortunate choices of numbers can inadvertently make it possible for students to form misconceptions or make it difficult for teachers to tell if students obtained the right answer for the right reason. Therefore, it is important to make sure when introducing basic summary statistics that…

  10. Gene set selection via LASSO penalized regression (SLPR).

    PubMed

    Frost, H Robert; Amos, Christopher I

    2017-07-07

    Gene set testing is an important bioinformatics technique that addresses the challenges of power, interpretation and replication. To better support the analysis of large and highly overlapping gene set collections, researchers have recently developed a number of multiset methods that jointly evaluate all gene sets in a collection to identify a parsimonious group of functionally independent sets. Unfortunately, current multiset methods all use binary indicators for gene and gene set activity and assume that a gene is active if any containing gene set is active. This simplistic model limits performance on many types of genomic data. To address this limitation, we developed gene set Selection via LASSO Penalized Regression (SLPR), a novel mapping of multiset gene set testing to penalized multiple linear regression. The SLPR method assumes a linear relationship between continuous measures of gene activity and the activity of all gene sets in the collection. As we demonstrate via simulation studies and the analysis of TCGA data using MSigDB gene sets, the SLPR method outperforms existing multiset methods when the true biological process is well approximated by continuous activity measures and a linear association between genes and gene sets. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  11. Statistical Mechanics of the Minimum Dominating Set Problem

    NASA Astrophysics Data System (ADS)

    Zhao, Jin-Hua; Habibulla, Yusupjan; Zhou, Hai-Jun

    2015-06-01

    The minimum dominating set (MDS) problem has wide applications in network science and related fields. It aims at constructing a node set of smallest size such that any node of the network is either in this set or is adjacent to at least one node of this set. Although this optimization problem is generally very difficult, we show it can be exactly solved by a generalized leaf-removal (GLR) process if the network contains no core. We present a percolation theory to describe the GLR process on random networks, and solve a spin glass model by mean field method to estimate the MDS size. We also implement a message-passing algorithm and a local heuristic algorithm that combines GLR with greedy node-removal to obtain near-optimal solutions for single random networks. Our algorithms also perform well on real-world network instances.
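
    To make the problem definition concrete, here is a minimal sketch of a plain greedy heuristic for a dominating set. It is not the generalized leaf-removal process or the message-passing algorithm described in the abstract; the function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
def greedy_dominating_set(adj):
    """Greedy heuristic: repeatedly pick the node that covers the most
    still-uncovered nodes (itself plus its neighbours).

    adj maps each node to the set of its neighbours (undirected graph).
    The result is a dominating set, not necessarily a minimum one.
    """
    uncovered = set(adj)
    chosen = set()
    while uncovered:
        best = max(adj, key=lambda v: len(({v} | adj[v]) & uncovered))
        chosen.add(best)
        uncovered -= {best} | adj[best]
    return chosen


def is_dominating(adj, nodes):
    """Check that every node is in `nodes` or adjacent to a member."""
    return all(v in nodes or adj[v] & nodes for v in adj)


# Hypothetical example: a star graph whose centre dominates every leaf.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
solution = greedy_dominating_set(star)
```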

  12. Improving gene set analysis of microarray data by SAM-GS

    PubMed Central

    Dinu, Irina; Potter, John D; Mueller, Thomas; Liu, Qi; Adewale, Adeniyi J; Jhangri, Gian S; Einecke, Gunilla; Famulski, Konrad S; Halloran, Philip; Yasui, Yutaka

    2007-01-01

    Background Gene-set analysis evaluates the expression of biological pathways, or a priori defined gene sets, rather than that of individual genes, in association with a binary phenotype, and is of great biologic interest in many DNA microarray studies. Gene Set Enrichment Analysis (GSEA) has been applied widely as a tool for gene-set analyses. We describe here some critical problems with GSEA and propose an alternative method by extending the individual-gene analysis method, Significance Analysis of Microarray (SAM), to gene-set analyses (SAM-GS). Results Using a mouse microarray dataset with simulated gene sets, we illustrate that GSEA gives statistical significance to gene sets that have no gene associated with the phenotype (null gene sets), and has very low power to detect gene sets in which half the genes are moderately or strongly associated with the phenotype (truly-associated gene sets). SAM-GS, on the other hand, performs very well. The two methods are also compared in the analyses of three real microarray datasets and relevant pathways, the diverging results of which clearly show advantages of SAM-GS over GSEA, both statistically and biologically. In a microarray study for identifying biological pathways whose gene expressions are associated with p53 mutation in cancer cell lines, we found biologically relevant performance differences between the two methods. Specifically, there are 31 additional pathways identified as significant by SAM-GS over GSEA, that are associated with the presence vs. absence of p53. Of the 31 gene sets, 11 actually involve p53 directly as a member. A further 6 gene sets directly involve the extrinsic and intrinsic apoptosis pathways, 3 involve the cell-cycle machinery, and 3 involve cytokines and/or JAK/STAT signaling. Each of these 12 gene sets, then, is in a direct, well-established relationship with aspects of p53 signaling. Of the remaining 8 gene sets, 6 have plausible, if less well established, links with p53. 
Conclusion We

  13. Gene set analysis of purine and pyrimidine antimetabolites cancer therapies.

    PubMed

    Fridley, Brooke L; Batzler, Anthony; Li, Liang; Li, Fang; Matimba, Alice; Jenkins, Gregory D; Ji, Yuan; Wang, Liewei; Weinshilboum, Richard M

    2011-11-01

    Responses to therapies, either with regard to toxicities or efficacy, are expected to involve complex relationships of gene products within the same molecular pathway or functional gene set. Therefore, pathways or gene sets, as opposed to single genes, may better reflect the true underlying biology and may be more appropriate units for analysis of pharmacogenomic studies. Application of such methods to pharmacogenomic studies may enable the detection of more subtle effects of multiple genes in the same pathway that may be missed by assessing each gene individually. A gene set analysis of 3821 gene sets is presented assessing the association between basal messenger RNA expression and drug cytotoxicity using ethnically defined human lymphoblastoid cell lines for two classes of drugs: pyrimidines [gemcitabine (dFdC) and arabinoside] and purines [6-thioguanine and 6-mercaptopurine]. The gene set nucleoside-diphosphatase activity was found to be significantly associated with both dFdC and arabinoside, whereas gene set γ-aminobutyric acid catabolic process was associated with dFdC and 6-thioguanine. These gene sets were significantly associated with the phenotype even after adjusting for multiple testing. In addition, five associated gene sets were found in common between the pyrimidines and two gene sets for the purines (3',5'-cyclic-AMP phosphodiesterase activity and γ-aminobutyric acid catabolic process) with a P value of less than 0.0001. Functional validation was attempted with four genes each in gene sets for thiopurine and pyrimidine antimetabolites. All four genes selected from the pyrimidine gene sets (PSME3, CANT1, ENTPD6, ADRM1) were validated, but only one (PDE4D) was validated for the thiopurine gene sets. In summary, results from the gene set analysis of pyrimidine and purine therapies, used often in the treatment of various cancers, provide novel insight into the relationship between genomic variation and drug response.

  14. Gene set analysis of purine and pyrimidine antimetabolites cancer therapies

    PubMed Central

    Fridley, Brooke L.; Batzler, Anthony; Li, Liang; Li, Fang; Matimba, Alice; Jenkins, Gregory D.; Ji, Yuan; Wang, Liewei; Weinshilboum, Richard M.

    2011-01-01

    Objective Responses to therapies, either with regards to toxicities or efficacy, are expected to involve complex relationships of gene products within the same molecular pathway or functional gene set. Therefore, pathways or gene sets, as opposed to single genes, may better reflect the true underlying biology and may be more appropriate units for analysis of pharmacogenomic studies. Application of such methods to pharmacogenomic studies may enable the detection of more subtle effects of multiple genes in the same pathway that may be missed by assessing each gene individually. Methods A gene set analysis of 3,821 gene sets is presented assessing the association between basal mRNA expression and drug cytotoxicity using ethnically defined human lymphoblastoid cell lines for two classes of drugs: pyrimidines (dFdC and AraC) and purines (6-TG and 6-MP). Results The gene set nucleoside-diphosphatase activity was found to be significantly associated with both dFdC and AraC, while gene set gamma-aminobutyric acid catabolic process was associated with dFdC and 6-TG. These gene sets were significantly associated with the phenotype even after adjusting for multiple testing. In addition, five associated gene sets were found in common between the pyrimidines and two gene sets for the purines (3′,5′-cyclic-AMP phosphodiesterase activity and gamma-aminobutyric acid catabolic process) with p < 0.0001. Functional validation was attempted with 4 genes each in gene sets for thiopurine and pyrimidine anti-metabolites. All four genes selected from the pyrimidine gene sets (PSME3, CANT1, ENTPD6, ADRM1) were validated, but only one (PDE4D) was validated for the thiopurine gene sets. Conclusions In summary, results from the gene set analysis of pyrimidine and purine therapies, used often in the treatment of various cancers, provide novel insight into the relationship between genomic variation and drug response. PMID:21869733

  15. GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies

    PubMed Central

    Zhang, Bing; Schmoyer, Denise; Kirov, Stefan; Snoddy, Jay

    2004-01-01

    Background Microarray and other high-throughput technologies are producing large sets of interesting genes that are difficult to analyze directly. Bioinformatics tools are needed to interpret the functional information in the gene sets. Results We have created a web-based tool for data analysis and data visualization for sets of genes called GOTree Machine (GOTM). This tool was originally intended to analyze sets of co-regulated genes identified from microarray analysis but is adaptable for use with other gene sets from other high-throughput analyses. GOTree Machine generates a GOTree, a tree-like structure to navigate the Gene Ontology Directed Acyclic Graph for input gene sets. This system provides user friendly data navigation and visualization. Statistical analysis helps users to identify the most important Gene Ontology categories for the input gene sets and suggests biological areas that warrant further study. GOTree Machine is available online at . Conclusion GOTree Machine has a broad application in functional genomic, proteomic and other high-throughput methods that generate large sets of interesting genes; its primary purpose is to help users sort for interesting patterns in gene sets. PMID:14975175

  16. Agreeing Response Set: Statistical Nuisance or Meaningful Personality Concept?

    ERIC Educational Resources Information Center

    Blau, Gary; Katerberg, Ralph

    1982-01-01

    Two predictions derived from the view that agreeing response set is a personality manifestation were tested: (a) nay-sayers maintain more belief consistency than do yea-sayers, and (b) nay-sayers are more intellectually oriented, while yea-sayers are more emotionally oriented. Both predictions were supported using Air Force trainee subjects.…

  17. Studying the complex expression dependences between sets of coexpressed genes.

    PubMed

    Huerta, Mario; Casanova, Oriol; Barchino, Roberto; Flores, Jose; Querol, Enrique; Cedano, Juan

    2014-01-01

    Organisms simplify the orchestration of gene expression by coregulating genes whose products function together in the cell. The use of clustering methods to obtain sets of coexpressed genes from expression arrays is very common; nevertheless there are no appropriate tools to study the expression networks among these sets of coexpressed genes. The aim of the developed tools is to allow studying the complex expression dependences that exist between sets of coexpressed genes. For this purpose, we start detecting the nonlinear expression relationships between pairs of genes, plus the coexpressed genes. Next, we form networks among sets of coexpressed genes that maintain nonlinear expression dependences between all of them. The expression relationship between the sets of coexpressed genes is defined by the expression relationship between the skeletons of these sets, where this skeleton represents the coexpressed genes with a well-defined nonlinear expression relationship with the skeleton of the other sets. As a result, we can study the nonlinear expression relationships between a target gene and other sets of coexpressed genes, or start the study from the skeleton of the sets, to study the complex relationships of activation and deactivation between the sets of coexpressed genes that carry out the different cellular processes present in the expression experiments.

  18. Correcting Transcription Factor Gene Sets for Copy Number and Promoter Methylation Variations

    PubMed Central

    Rathi, Komal S.; Gaykalova, Daria A.; Hennesey, Patrick; Califano, Joseph A.; Ochs, Michael F.

    2014-01-01

    Gene set analysis provides a method to generate statistical inferences across sets of linked genes, primarily using high-throughput expression data. Common gene sets include biological pathways, operons, and targets of transcriptional regulators. In higher eukaryotes, especially when dealing with diseases with strong genetic and epigenetic components such as cancer, copy number loss and gene silencing through promoter methylation can eliminate the possibility that a gene is transcribed. This, in turn, can adversely affect the estimation of transcription factor or pathway activity from a set of target genes, since some of the targets may not be responsive to transcriptional regulation. Here we introduce a simple filtering approach that removes genes from consideration if they show copy number loss or promoter methylation and demonstrate the improvement in inference of transcription factor activity in a simulated data set based on the background expression observed in normal head and neck tissue. PMID:25195578
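
    The filtering step described here can be sketched as a simple set operation, assuming per-sample calls for copy number loss and promoter methylation are already available as sets of gene symbols. The function name and the gene symbols are hypothetical, and how the calls are made is outside the scope of this sketch.

```python
def filter_targets(targets, copy_number_loss, promoter_methylated):
    """Drop target genes that cannot respond to transcriptional regulation
    because they show copy number loss or promoter methylation."""
    silenced = set(copy_number_loss) | set(promoter_methylated)
    return [gene for gene in targets if gene not in silenced]


# Hypothetical transcription factor target set and per-sample calls.
targets = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
kept = filter_targets(
    targets,
    copy_number_loss={"GENE_B"},
    promoter_methylated={"GENE_D"},
)
```

    Only the retained genes would then be passed to whichever gene set statistic is used to estimate transcription factor activity.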

  19. IMPROVED PERFORMANCE OF GENE SET ANALYSIS ON GENOME-WIDE TRANSCRIPTOMICS DATA WHEN USING GENE ACTIVITY STATE ESTIMATES.

    PubMed

    Kamp, Thomas; Adams, Micah; Disselkoen, Craig; Tintle, Nathan

    2016-01-01

    Gene set analysis methods continue to be a popular and powerful method of evaluating genome-wide transcriptomics data. These approaches require a priori grouping of genes into biologically meaningful sets, and then conducting downstream analyses at the set (instead of gene) level of analysis. Gene set analysis methods have been shown to yield more powerful statistical conclusions than single-gene analyses due to both reduced multiple testing penalties and potentially larger observed effects due to the aggregation of effects across multiple genes in the set. Traditionally, gene set analysis methods have been applied directly to normalized, log-transformed, transcriptomics data. Recently, efforts have been made to transform transcriptomics data to scales yielding more biologically interpretable results. For example, recently proposed models transform log-transformed transcriptomics data to a confidence metric (ranging between 0 and 100%) that a gene is active (roughly speaking, that the gene product is part of an active cellular mechanism). In this manuscript, we demonstrate, on both real and simulated transcriptomics data, that tests for differential expression between sets of genes are typically more powerful when using gene activity state estimates as opposed to log-transformed gene expression data. Our analysis suggests further exploration of techniques to transform transcriptomics data to meaningful quantities for improved downstream inference.

  20. Human interactome resource and gene set linkage analysis for the functional interpretation of biologically meaningful gene sets.

    PubMed

    Zhou, Xi; Chen, Pengcheng; Wei, Qiang; Shen, Xueling; Chen, Xin

    2013-08-15

    A molecular interaction network can be viewed as a network in which genes with related functions are connected. Therefore, at a systems level, connections between individual genes in a molecular interaction network can be used to infer the collective functional linkages between biologically meaningful gene sets. We present the human interactome resource and the gene set linkage analysis (GSLA) tool for the functional interpretation of biologically meaningful gene sets observed in experiments. GSLA determines whether an observed gene set has significant functional linkages to established biological processes. When an observed gene set is not enriched by known biological processes, traditional enrichment-based interpretation methods cannot produce functional insights, but GSLA can still evaluate whether those genes work in concert to regulate specific biological processes, thereby suggesting the functional implications of the observed gene set. The quality of the human interactome resource and the utility of GSLA are illustrated with multiple assessments. http://www.cls.zju.edu.cn/hir/

  1. Excursion sets and non-Gaussian void statistics

    SciTech Connect

    D'Amico, Guido; Musso, Marcello; Paranjape, Aseem; Norena, Jorge

    2011-01-15

    Primordial non-Gaussianity (NG) affects the large scale structure (LSS) of the Universe by leaving an imprint on the distribution of matter at late times. Much attention has been focused on using the distribution of collapsed objects (i.e. dark matter halos and the galaxies and galaxy clusters that reside in them) to probe primordial NG. An equally interesting and complementary probe however is the abundance of extended underdense regions or voids in the LSS. The calculation of the abundance of voids using the excursion set formalism in the presence of primordial NG is subject to the same technical issues as the one for halos, which were discussed e.g. in Ref. [51][G. D'Amico, M. Musso, J. Norena, and A. Paranjape, arXiv:1005.1203.]. However, unlike the excursion set problem for halos which involved random walks in the presence of one barrier δ_c, the void excursion set problem involves two barriers δ_v and δ_c. This leads to a new complication introduced by what is called the 'void-in-cloud' effect discussed in the literature, which is unique to the case of voids. We explore a path integral approach which allows us to carefully account for all these issues, leading to a rigorous derivation of the effects of primordial NG on void abundances. The void-in-cloud issue, in particular, makes the calculation conceptually rather different from the one for halos. However, we show that its final effect can be described by a simple yet accurate approximation. Our final void abundance function is valid on larger scales than the expressions of other authors, while being broadly in agreement with those expressions on smaller scales.

  2. The Effect of Distributed Practice in Undergraduate Statistics Homework Sets: A Randomized Trial

    ERIC Educational Resources Information Center

    Crissinger, Bryan R.

    2015-01-01

    Most homework sets in statistics courses are constructed so that students concentrate or "mass" their practice on a certain topic in one problem set. Distributed practice homework sets include review problems in each set so that practice on a topic is distributed across problem sets. There is a body of research that points to the…

  4. Statistical principles underlying analytic goal-setting in clinical chemistry.

    PubMed

    Harris, E K

    1979-08-01

    The Survey programs of the College of American Pathologists (CAP) have assessed current levels of analytic variance in many biochemical measurements, and a number of clinical chemists have proposed analytic goals. The practical importance of further reductions in analytic variance depends on the specific use of the laboratory test. Three general areas of application are described: 1) surveying a population to detect disease, 2) determining whether a particular individual's level of a given analyte is above or below a predefined alarm point, 3) monitoring an individual over a period of time to detect trends. Within each of these different contexts, statistical methods are proposed for judging the practical effect of improvements in current levels of analytic precision, taking into account recent estimates of biological variation within the average individual and between individuals. As might be expected, reductions in analytic variance have greatest impact in those applications where biological variance is minimal. Such reductions will generally have little effect on the efficiency of a population survey but may be extremely valuable in decision-making concerning a particular hospital patient.

  5. Grouped False-Discovery Rate for Removing the Gene-Set-Level Bias of RNA-seq.

    PubMed

    Yang, Tae Young; Jeong, Seongmun

    2013-01-01

    In recent years, RNA-seq has become a very competitive alternative to microarrays. In RNA-seq experiments, the expected read count for a gene is proportional to its expression level multiplied by its transcript length. Even when two genes are expressed at the same level, differences in length will yield differing numbers of total reads. The characteristics of these RNA-seq experiments create a gene-level bias such that the proportion of significantly differentially expressed genes increases with the transcript length, whereas such bias is not present in microarray data. Gene-set analysis seeks to identify the gene sets that are enriched in the list of the identified significant genes. In the gene-set analysis of RNA-seq, the gene-level bias subsequently yields the gene-set-level bias that a gene set with genes of long length will be more likely to show up as enriched than will a gene set with genes of shorter length. Because gene expression is not related to its transcript length, any gene set containing long genes is not of biologically greater interest than gene sets with shorter genes. Accordingly the gene-set-level bias should be removed to accurately calculate the statistical significance of each gene-set enrichment in the RNA-seq. We present a new gene set analysis method of RNA-seq, called FDRseq, which can accurately calculate the statistical significance of a gene-set enrichment score by the grouped false-discovery rate. Numerical examples indicated that FDRseq is appropriate for controlling the transcript length bias in the gene-set analysis of RNA-seq data. To implement FDRseq, we developed the R program, which can be downloaded at no cost from http://home.mju.ac.kr/home/index.action?siteId=tyang.
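
    The abstract does not spell out how the grouped false-discovery rate is computed, so the following is only a generic sketch of the idea and not the FDRseq procedure: Benjamini-Hochberg adjustment applied separately within groups, where the groups are assumed to be transcript-length bins. All function names, group labels, and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def grouped_bh(pvalues, groups):
    """Apply Benjamini-Hochberg separately within each group label
    (for example, a transcript-length bin assigned to each test)."""
    by_group = {}
    for idx, label in enumerate(groups):
        by_group.setdefault(label, []).append(idx)
    adjusted = [None] * len(pvalues)
    for indices in by_group.values():
        for i, q in zip(indices, benjamini_hochberg([pvalues[i] for i in indices])):
            adjusted[i] = q
    return adjusted


# Hypothetical p-values with length-bin labels.
q = grouped_bh([0.01, 0.04, 0.03, 0.20], ["long", "long", "long", "short"])
```

    Adjusting within bins means that tests from long-transcript groups are compared only with each other, which is one simple way to keep a length-driven excess of small p-values from dominating the overall ranking.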

  6. Chronic periodontitis genome-wide association studies: gene-centric and gene set enrichment analyses.

    PubMed

    Rhodin, K; Divaris, K; North, K E; Barros, S P; Moss, K; Beck, J D; Offenbacher, S

    2014-09-01

    Recent genome-wide association studies (GWAS) of chronic periodontitis (CP) offer rich data sources for the investigation of candidate genes, functional elements, and pathways. We used GWAS data of CP (n = 4,504) and periodontal pathogen colonization (n = 1,020) from a cohort of adult Americans of European descent participating in the Atherosclerosis Risk in Communities study and employed a MAGENTA approach (i.e., meta-analysis gene set enrichment of variant associations) to obtain gene-centric and gene set association results corrected for gene size, number of single-nucleotide polymorphisms, and local linkage disequilibrium characteristics based on the human genome build 18 (National Center for Biotechnology Information build 36). We used the Gene Ontology, Ingenuity, KEGG, Panther, Reactome, and Biocarta databases for gene set enrichment analyses. Six genes showed evidence of statistically significant association: 4 with severe CP (NIN, p = 1.6 × 10(-7); ABHD12B, p = 3.6 × 10(-7); WHAMM, p = 1.7 × 10(-6); AP3B2, p = 2.2 × 10(-6)) and 2 with high periodontal pathogen colonization (red complex-KCNK1, p = 3.4 × 10(-7); Porphyromonas gingivalis-DAB2IP, p = 1.0 × 10(-6)). Top-ranked genes for moderate CP were HGD (p = 1.4 × 10(-5)), ZNF675 (p = 1.5 × 10(-5)), TNFRSF10C (p = 2.0 × 10(-5)), and EMR1 (p = 2.0 × 10(-5)). Loci containing NIN, EMR1, KCNK1, and DAB2IP had shown suggestive evidence of association in the earlier single-nucleotide polymorphism-based analyses, whereas WHAMM and AP3B2 emerged as novel candidates. The top gene sets included severe CP ("endoplasmic reticulum membrane," "cytochrome P450," "microsome," and "oxidation reduction") and moderate CP ("regulation of gene expression," "zinc ion binding," "BMP signaling pathway," and "ruffle"). Gene-centric analyses offer a promising avenue for efficient interrogation of large-scale GWAS data. These results highlight genes in previously identified loci and new candidate genes and pathways

  7. Correcting transcription factor gene sets for copy number and promoter methylation variations.

    PubMed

    Rathi, Komal S; Gaykalova, Daria A; Hennessey, Patrick; Califano, Joseph A; Ochs, Michael F

    2014-09-01

    Gene set analysis provides a method to generate statistical inferences across sets of linked genes, primarily using high-throughput expression data. Common gene sets include biological pathways, operons, and targets of transcriptional regulators. In higher eukaryotes, especially when dealing with diseases with strong genetic and epigenetic components such as cancer, copy number loss and gene silencing through promoter methylation can eliminate the possibility that a gene is transcribed. This, in turn, can adversely affect the estimation of transcription factor or pathway activity from a set of target genes, as some of the targets may not be responsive to transcriptional regulation. Here we introduce a simple filtering approach that removes genes from consideration if they show copy number loss or promoter methylation, and demonstrate the improvement in inference of transcription factor activity in a simulated dataset based on the background expression observed in normal head and neck tissue.
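
    The filtering step can be pictured as a simple set difference. The sketch below is an assumption about how such a filter might look, not the authors' implementation; the function name and the gene identifiers are hypothetical placeholders.

```python
def filter_targets(targets, copy_number_loss, promoter_methylated):
    """Drop target genes flagged with copy number loss or promoter methylation.

    All three arguments are iterables of gene identifiers; the order of
    `targets` is preserved in the result.
    """
    excluded = set(copy_number_loss) | set(promoter_methylated)
    return [gene for gene in targets if gene not in excluded]

# Hypothetical transcription factor target set and per-sample flags.
tf_targets = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
usable_targets = filter_targets(
    tf_targets,
    copy_number_loss=["GENE_B"],
    promoter_methylated=["GENE_D", "GENE_X"],
)
print(usable_targets)
```

    Only the remaining genes would then enter the gene set statistic, so that targets unable to respond to transcriptional regulation do not dilute the activity estimate.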

  8. Gene coexpression measures in large heterogeneous samples using count statistics.

    PubMed

    Wang, Y X Rachel; Waterman, Michael S; Huang, Haiyan

    2014-11-18

    With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the "big data" challenge. Gene coexpression is one of the earliest techniques developed that is still widely in use for functional annotation, pathway analysis, and, most importantly, the reconstruction of gene regulatory networks, based on gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, it is very likely that the patterns of gene association may change or only exist in a subset of the samples, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable and often better general performance.

  9. On statistical inference for the random set generated Cox process with set-marking.

    PubMed

    Penttinen, Antti; Niemi, Aki

    2007-04-01

    The Cox point process is a process class for hierarchical modelling of systems of non-interacting points in R^d under environmental heterogeneity which is modelled through a random intensity function. In this work a class of Cox processes is suggested where the random intensity is generated by a random closed set. Such heterogeneity appears for example in forestry where silvicultural treatments like harvesting and site-preparation create geometrical patterns for tree density variation in two different phases. In this paper the second order property, important both in data analysis and in the context of spatial sampling, is derived. The usefulness of the random set generated Cox process is highly increased, if for each point it is observed whether it is included in the random set or not. This additional information is easy and economical to obtain in many cases and is hence of practical value; it leads to marks for the points. The resulting random set marked Cox process is a marked point process where the marks are intensity-dependent. The problem with set-marking is that the marks are not a representative sample from the random set. This paper derives the second order property of the random set marked Cox process and suggests a practical estimation method for area fraction and covariance of the random set and for the point densities within and outside the random set. A simulated example and a forestry example are given.

  10. LEGO: a novel method for gene set over-representation analysis by incorporating network-based gene weights

    PubMed Central

    Dong, Xinran; Hao, Yun; Wang, Xiao; Tian, Weidong

    2016-01-01

    Pathway or gene set over-representation analysis (ORA) has become a routine task in functional genomics studies. However, currently widely used ORA tools employ statistical methods such as Fisher’s exact test that reduce a pathway into a list of genes, ignoring the constitutive functional non-equivalent roles of genes and the complex gene-gene interactions. Here, we develop a novel method named LEGO (functional Link Enrichment of Gene Ontology or gene sets) that takes into consideration these two types of information by incorporating network-based gene weights in ORA analysis. In three benchmarks, LEGO achieves better performance than Fisher and three other network-based methods. To further evaluate LEGO’s usefulness, we compare LEGO with five gene expression-based and three pathway topology-based methods using a benchmark of 34 disease gene expression datasets compiled by a recent publication, and show that LEGO is among the top-ranked methods in terms of both sensitivity and prioritization for detecting target KEGG pathways. In addition, we develop a cluster-and-filter approach to reduce the redundancy among the enriched gene sets, making the results more interpretable to biologists. Finally, we apply LEGO to two lists of autism genes, and identify relevant gene sets to autism that could not be found by Fisher. PMID:26750448

  11. Sets, Probability and Statistics: The Mathematics of Life Insurance. [Computer Program.] Second Edition.

    ERIC Educational Resources Information Center

    King, James M.; And Others

    The materials described here represent the conversion of a highly popular student workbook "Sets, Probability and Statistics: The Mathematics of Life Insurance" into a computer program. The program is designed to familiarize students with the concepts of sets, probability, and statistics, and to provide practice using real life examples. It also…

  12. Training set selection for the prediction of essential genes.

    PubMed

    Cheng, Jian; Xu, Zhao; Wu, Wenwu; Zhao, Li; Li, Xiangchen; Liu, Yanlin; Tao, Shiheng

    2014-01-01

    Various computational models have been developed to transfer annotations of gene essentiality between organisms. However, despite the increasing number of microorganisms with well-characterized sets of essential genes, selection of appropriate training sets for predicting the essential genes of poorly-studied or newly sequenced organisms remains challenging. In this study, a machine learning approach was applied reciprocally to predict the essential genes in 21 microorganisms. Results showed that training set selection greatly influenced predictive accuracy. We determined four criteria for training set selection: (1) essential genes in the selected training set should be reliable; (2) the growth conditions in which essential genes are defined should be consistent in training and prediction sets; (3) species used as training set should be closely related to the target organism; and (4) organisms used as training and prediction sets should exhibit similar phenotypes or lifestyles. We then analyzed the performance of an incomplete training set and an integrated training set with multiple organisms. We found that the size of the training set should be at least 10% of the total genes to yield accurate predictions. Additionally, the integrated training sets exhibited remarkable increase in stability and accuracy compared with single sets. Finally, we compared the performance of the integrated training sets with the four criteria and with random selection. The results revealed that a rational selection of training sets based on our criteria yields better performance than random selection. Thus, our results provide empirical guidance on training set selection for the identification of essential genes on a genome-wide scale.

  13. Statistical aspect of trait mapping using a dense set of markers: A partial review

    SciTech Connect

    Dupuis, J.

    1996-12-31

    This paper presents a review of statistical methods used to locate trait loci using maps of markers spanning the whole genome. Such maps are becoming readily available and can be especially useful in mapping traits that are non Mendelian. Genome-wide search for a trait locus is often called a "global search". Global search methods include, but are not restricted to, identifying disease susceptibility genes using affected relative pairs, finding quantitative trait loci in experimental organisms and locating quantitative trait loci in humans. For human linkage, we concentrate on methods using pairs of affected relatives rather than pedigree analysis. We begin in the next section with a review of work on the use of affected pairs of relatives to identify gene loci that increase susceptibility to a particular disease. We first review Risch's 1990 series of papers. Risch's method can be used to search the entire genome for such susceptibility genes. Using Risch's idea Elston explored the issue of how many pairs and markers are necessary to reach a certain probability of detecting a locus if there exists one. He proposed a more economical two stage design that uses few markers at the first stage but adds markers around the "promising" area of the genome at the second stage. However, Risch and Elston do not use multipoint linkage analysis, which takes into account all markers at once (rather than one at a time) in the calculation of the test statistic. Such multipoint methods for affected relatives have been developed by Feingold and Feingold et al. The last authors' multipoint method is based on a continuous specification of identity by descent between the affected relatives but can also be used for a set of linked markers spanning the genome. A brief description of their method and treatment of more complex issues such as combining relative pairs is included. 29 refs., 4 tabs.

  14. Principles for the organization of gene-sets.

    PubMed

    Li, Wentian; Freudenberg, Jan; Oswald, Michaela

    2015-12-01

    A gene-set, an important concept in microarray expression analysis and systems biology, is a collection of genes and/or their products (i.e. proteins) that have some features in common. There are many different ways to construct gene-sets, but a systematic organization of these ways is lacking. Gene-sets are mainly organized ad hoc in current public-domain databases, with group header names often determined by practical reasons (such as the types of technology in obtaining the gene-sets or a balanced number of gene-sets under a header). Here we aim at providing a gene-set organization principle according to the level at which genes are connected: homology, physical map proximity, chemical interaction, biological, and phenotypic-medical levels. We also distinguish two types of connections between genes: actual connection versus sharing of a label. Actual connections denote direct biological interactions, whereas shared label connection denotes shared membership in a group. Some extensions of the framework are also addressed such as overlapping of gene-sets, modules, and the incorporation of other non-protein-coding entities such as microRNAs.

  15. SiBIC: a web server for generating gene set networks based on biclusters obtained by maximal frequent itemset mining.

    PubMed

    Takahashi, Kei-ichiro; Takigawa, Ichigaku; Mamitsuka, Hiroshi

    2013-01-01

    Detecting biclusters from expression data is useful, since biclusters are coexpressed genes under only part of all given experimental conditions. We present a software called SiBIC, which from a given expression dataset, first exhaustively enumerates biclusters, which are then merged into rather independent biclusters, which finally are used to generate gene set networks, in which a gene set assigned to one node has coexpressed genes. We evaluated each step of this procedure: 1) significance of the generated biclusters biologically and statistically, 2) biological quality of merged biclusters, and 3) biological significance of gene set networks. We emphasize that gene set networks, in which nodes are not genes but gene sets, can be more compact than usual gene networks, meaning that gene set networks are more comprehensible. SiBIC is available at http://utrecht.kuicr.kyoto-u.ac.jp:8080/miami/faces/index.jsp.

  16. Gene regulatory network inference using out of equilibrium statistical mechanics

    PubMed Central

    Benecke, Arndt

    2008-01-01

    Spatiotemporal control of gene expression is fundamental to multicellular life. Despite prodigious efforts, the encoding of gene expression regulation in eukaryotes is not understood. Gene expression analyses nourish the hope to reverse engineer effector-target gene networks using inference techniques. Inference from noisy and circumstantial data relies on using robust models with few parameters for the underlying mechanisms. However, a systematic path to gene regulatory network reverse engineering from functional genomics data is still impeded by fundamental problems. Recently, Johannes Berg from the Theoretical Physics Institute of Cologne University has made two remarkable contributions that significantly advance the gene regulatory network inference problem. Berg, who uses gene expression data from yeast, has demonstrated a nonequilibrium regime for mRNA concentration dynamics and was able to map the gene regulatory process upon simple stochastic systems driven out of equilibrium. The impact of his demonstration is twofold, affecting both the understanding of the operational constraints under which transcription occurs and the capacity to extract relevant information from highly time-resolved expression data. Berg has used his observation to predict target genes of selected transcription factors, and thereby, in principle, demonstrated applicability of his out of equilibrium statistical mechanics approach to the gene network inference problem. PMID:19404429

  17. Gene regulatory network inference using out of equilibrium statistical mechanics.

    PubMed

    Benecke, Arndt

    2008-08-01

    Spatiotemporal control of gene expression is fundamental to multicellular life. Despite prodigious efforts, the encoding of gene expression regulation in eukaryotes is not understood. Gene expression analyses nourish the hope to reverse engineer effector-target gene networks using inference techniques. Inference from noisy and circumstantial data relies on using robust models with few parameters for the underlying mechanisms. However, a systematic path to gene regulatory network reverse engineering from functional genomics data is still impeded by fundamental problems. Recently, Johannes Berg from the Theoretical Physics Institute of Cologne University has made two remarkable contributions that significantly advance the gene regulatory network inference problem. Berg, who uses gene expression data from yeast, has demonstrated a nonequilibrium regime for mRNA concentration dynamics and was able to map the gene regulatory process upon simple stochastic systems driven out of equilibrium. The impact of his demonstration is twofold, affecting both the understanding of the operational constraints under which transcription occurs and the capacity to extract relevant information from highly time-resolved expression data. Berg has used his observation to predict target genes of selected transcription factors, and thereby, in principle, demonstrated applicability of his out of equilibrium statistical mechanics approach to the gene network inference problem.

  18. Statistical Approaches for Gene Selection, Hub Gene Identification and Module Interaction in Gene Co-Expression Network Analysis: An Application to Aluminum Stress in Soybean (Glycine max L.)

    PubMed Central

    Das, Samarendra; Meher, Prabina Kumar; Bhar, Lal Mohan; Mandal, Baidya Nath

    2017-01-01

    Selection of informative genes is an important problem in gene expression studies. The small sample size and the large number of genes in gene expression data make the selection process complex. Further, the selected informative genes may act as a vital input for gene co-expression network analysis. Moreover, the identification of hub genes and module interactions in gene co-expression networks is yet to be fully explored. This paper presents a statistically sound gene selection technique based on the support vector machine algorithm for selecting informative genes from high dimensional gene expression data. Also, an attempt has been made to develop a statistical approach for identification of hub genes in the gene co-expression network. Besides, a differential hub gene analysis approach has also been developed to group the identified hub genes into various groups based on their gene connectivity in a case vs. control study. Based on this proposed approach, an R package, i.e., dhga (https://cran.r-project.org/web/packages/dhga) has been developed. The comparative performance of the proposed gene selection technique as well as hub gene identification approach was evaluated on three different crop microarray datasets. The proposed gene selection technique outperformed most of the existing techniques for selecting a robust set of informative genes. Based on the proposed hub gene identification approach, fewer hub genes were identified than with the existing approach, which is in accordance with the principle of scale free property of real networks. In this study, some key genes along with their Arabidopsis orthologs have been reported, which can be used for Aluminum toxic stress response engineering in soybean. The functional analysis of various selected key genes revealed the underlying molecular mechanisms of Aluminum toxic stress response in soybean. PMID:28056073

  19. Statistical Approaches for Gene Selection, Hub Gene Identification and Module Interaction in Gene Co-Expression Network Analysis: An Application to Aluminum Stress in Soybean (Glycine max L.).

    PubMed

    Das, Samarendra; Meher, Prabina Kumar; Rai, Anil; Bhar, Lal Mohan; Mandal, Baidya Nath

    2017-01-01

    Selection of informative genes is an important problem in gene expression studies. The small sample size and the large number of genes in gene expression data make the selection process complex. Further, the selected informative genes may act as a vital input for gene co-expression network analysis. Moreover, the identification of hub genes and module interactions in gene co-expression networks is yet to be fully explored. This paper presents a statistically sound gene selection technique based on the support vector machine algorithm for selecting informative genes from high dimensional gene expression data. Also, an attempt has been made to develop a statistical approach for identification of hub genes in the gene co-expression network. Besides, a differential hub gene analysis approach has also been developed to group the identified hub genes into various groups based on their gene connectivity in a case vs. control study. Based on this proposed approach, an R package, i.e., dhga (https://cran.r-project.org/web/packages/dhga) has been developed. The comparative performance of the proposed gene selection technique as well as hub gene identification approach was evaluated on three different crop microarray datasets. The proposed gene selection technique outperformed most of the existing techniques for selecting a robust set of informative genes. Based on the proposed hub gene identification approach, fewer hub genes were identified than with the existing approach, which is in accordance with the principle of scale free property of real networks. In this study, some key genes along with their Arabidopsis orthologs have been reported, which can be used for Aluminum toxic stress response engineering in soybean. The functional analysis of various selected key genes revealed the underlying molecular mechanisms of Aluminum toxic stress response in soybean.

  20. Identifying the optimal gene and gene set in hepatocellular carcinoma based on differential expression and differential co-expression algorithm.

    PubMed

    Dong, Li-Yang; Zhou, Wei-Zhong; Ni, Jun-Wei; Xiang, Wei; Hu, Wen-Hao; Yu, Chang; Li, Hai-Yan

    2017-02-01

    The objective of this study was to identify the optimal gene and gene set for hepatocellular carcinoma (HCC) utilizing differential expression and differential co-expression (DEDC) algorithm. The DEDC algorithm consisted of four parts: calculating differential expression (DE) by absolute t-value in t-statistics; computing differential co-expression (DC) based on Z-test; determining optimal thresholds on the basis of Chi-squared (χ²) maximization and the corresponding gene was the optimal gene; and evaluating functional relevance of genes categorized into different partitions to determine the optimal gene set with highest mean minimum functional information (FI) gain (Δ*G). The optimal thresholds divided genes into four partitions, high DE and high DC (HDE-HDC), high DE and low DC (HDE-LDC), low DE and high DC (LDE-HDC), and low DE and low DC (LDE-LDC). In addition, the optimal gene was validated by conducting reverse transcription-polymerase chain reaction (RT-PCR) assay. The optimal thresholds for DC and DE were 1.032 and 1.911, respectively. Using the optimal gene, the genes were divided into four partitions including: HDE-HDC (2,053 genes), HDE-LDC (2,822 genes), LDE-HDC (2,622 genes), and LDE-LDC (6,169 genes). The optimal gene was microtubule-associated protein RP/EB family member 1 (MAPRE1), and RT-PCR assay validated the significant difference between the HCC and normal state. The optimal gene set was nucleoside metabolic process (GO:0009116) with Δ*G = 18.681 and 24 HDE-HDC partitions in total. In conclusion, we successfully investigated the optimal gene, MAPRE1, and gene set, nucleoside metabolic process, which may be potential biomarkers for targeted therapy and provide significant insight for revealing the pathological mechanism underlying HCC.
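
    The four-way partition can be sketched as a pair of threshold comparisons. The default thresholds below are the values reported in the abstract (1.911 for DE, 1.032 for DC); the function name, the example scores, and the choice to treat a score equal to the threshold as "high" are assumptions made for illustration.

```python
def dedc_partition(de_score, dc_score, de_threshold=1.911, dc_threshold=1.032):
    """Label a gene by its DE and DC scores relative to the two thresholds.

    Scores at or above a threshold are treated as "high" here; the
    abstract does not say how ties are handled.
    """
    de_label = "HDE" if de_score >= de_threshold else "LDE"
    dc_label = "HDC" if dc_score >= dc_threshold else "LDC"
    return f"{de_label}-{dc_label}"

# Hypothetical (DE, DC) scores for four genes.
labels = [
    dedc_partition(2.5, 1.5),
    dedc_partition(2.5, 0.4),
    dedc_partition(0.7, 1.5),
    dedc_partition(0.7, 0.4),
]
print(labels)
```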

  1. Prior knowledge transfer across transcriptional data sets and technologies using compositional statistics yields new mislabelled ovarian cell line

    PubMed Central

    Blayney, Jaine K.; Davison, Timothy; McCabe, Nuala; Walker, Steven; Keating, Karen; Delaney, Thomas; Greenan, Caroline; Williams, Alistair R.; McCluggage, W. Glenn; Capes-Davis, Amanda; Harkin, D. Paul; Gourley, Charlie; Kennedy, Richard D.

    2016-01-01

    Here, we describe gene expression compositional assignment (GECA), a powerful, yet simple method based on compositional statistics that can validate the transfer of prior knowledge, such as gene lists, into independent data sets, platforms and technologies. Transcriptional profiling has been used to derive gene lists that stratify patients into prognostic molecular subgroups and assess biomarker performance in the pre-clinical setting. Archived public data sets are an invaluable resource for subsequent in silico validation, though their use can lead to data integration issues. We show that GECA can be used without the need for normalising expression levels between data sets and can outperform rank-based correlation methods. To validate GECA, we demonstrate its success in the cross-platform transfer of gene lists in different domains including: bladder cancer staging, tumour site of origin and mislabelled cell lines. We also show its effectiveness in transferring an epithelial ovarian cancer prognostic gene signature across technologies, from a microarray to a next-generation sequencing setting. In a final case study, we predict the tumour site of origin and histopathology of epithelial ovarian cancer cell lines. In particular, we identify and validate the commonly-used cell line OVCAR-5 as non-ovarian, being gastrointestinal in origin. GECA is available as an open-source R package. PMID:27353327

  2. Curated eutherian third party data gene data sets

    PubMed Central

    Premzl, Marko

    2015-01-01

    The freely available eutherian genomic sequence data sets advanced the scientific field of genomics. Of note, future revisions of gene data sets were expected, due to incompleteness of public eutherian genomic sequence assemblies and potential genomic sequence errors. The eutherian comparative genomic analysis protocol was proposed as guidance in protection against potential genomic sequence errors in public eutherian genomic sequences. The protocol was applicable in updates of 7 major eutherian gene data sets, including 812 complete coding sequences deposited in European Nucleotide Archive as curated third party data gene data sets. PMID:26862561

  3. Principal Angle Enrichment Analysis (PAEA): Dimensionally Reduced Multivariate Gene Set Enrichment Analysis Tool.

    PubMed

    Clark, Neil R; Szymkiewicz, Maciej; Wang, Zichen; Monteiro, Caroline D; Jones, Matthew R; Ma'ayan, Avi

    2015-11-01

    Gene set analysis of differential expression, which identifies collectively differentially expressed gene sets, has become an important tool for biology. The power of this approach lies in its reduction of the dimensionality of the statistical problem and its incorporation of biological interpretation by construction. Many approaches to gene set analysis have been proposed, but benchmarking their performance in the setting of real biological data is difficult due to the lack of a gold standard. In a previously published work we proposed a geometrical approach to differential expression which performed highly in benchmarking tests and compared well to the most popular methods of differential gene expression. As reported, this approach has a natural extension to gene set analysis which we call Principal Angle Enrichment Analysis (PAEA). PAEA employs dimensionality reduction and a multivariate approach for gene set enrichment analysis. However, the performance of this method has not been assessed nor its implementation as a web-based tool. Here we describe new benchmarking protocols for gene set analysis methods and find that PAEA performs highly. The PAEA method is implemented as a user-friendly web-based tool, which contains 70 gene set libraries and is freely available to the community.

  4. FDR-FET: an optimizing gene set enrichment analysis method.

    PubMed

    Ji, Rui-Ru; Ott, Karl-Heinz; Yordanova, Roumyana; Bruccoleri, Robert E

    2011-01-01

    Gene set enrichment analysis for analyzing large profiling and screening experiments can reveal unifying biological schemes based on previously accumulated knowledge represented as "gene sets". Most of the existing implementations use a fixed fold-change or P value cutoff to generate regulated gene lists. However, the threshold selection in most cases is arbitrary, and has a significant effect on the test outcome and interpretation of the experiment. We developed a new gene set enrichment analysis method, ie, FDR-FET, which dynamically optimizes the threshold choice and improves the sensitivity and selectivity of gene set enrichment analysis. The procedure translates experimental results into a series of regulated gene lists at multiple false discovery rate (FDR) cutoffs, and computes the P value of the overrepresentation of a gene set using a Fisher's exact test (FET) in each of these gene lists. The lowest P value is retained to represent the significance of the gene set. We also implemented improved methods to define a more relevant global reference set for the FET. We demonstrate the validity of the method using a published microarray study of three protease inhibitors of the human immunodeficiency virus and compare the results with those from other popular gene set enrichment analysis algorithms. Our results show that combining FDR with multiple cutoffs allows us to control the error while retaining genes that increase information content. We conclude that FDR-FET can selectively identify significant affected biological processes. Our method can be used for any user-generated gene list in the area of transcriptome, proteome, and other biological and scientific applications.
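
    The core loop described above (several FDR cutoffs, one Fisher's exact test per cutoff, lowest P value retained) can be sketched as follows. This is a minimal illustration under stated assumptions, not the published FDR-FET implementation: the function names, the toy q-values, and the cutoffs are hypothetical, the reference set is simply every gene with a q-value, and the retained minimum is not adjusted for having tried several cutoffs.

```python
from math import comb

def fisher_right_tail(overlap, set_size, list_size, universe_size):
    """One-sided Fisher's exact test: P(X >= overlap) under the hypergeometric null."""
    total = comb(universe_size, list_size)
    upper = min(set_size, list_size)
    tail = sum(
        comb(set_size, i) * comb(universe_size - set_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total

def lowest_p_over_cutoffs(q_values, gene_set, fdr_cutoffs):
    """Return (lowest P value, cutoff that produced it) across the FDR cutoffs."""
    universe = set(q_values)
    set_in_universe = gene_set & universe
    best_p, best_cutoff = 1.0, None
    for cutoff in fdr_cutoffs:
        regulated = {gene for gene, q in q_values.items() if q <= cutoff}
        if not regulated:
            continue  # nothing passes this cutoff, so there is nothing to test
        p = fisher_right_tail(
            len(regulated & set_in_universe),
            len(set_in_universe),
            len(regulated),
            len(universe),
        )
        if p < best_p:
            best_p, best_cutoff = p, cutoff
    return best_p, best_cutoff

# Hypothetical per-gene q-values (FDR-adjusted P values) for ten genes.
toy_q = {"g1": 0.01, "g2": 0.02, "g3": 0.20, "g4": 0.04, "g5": 0.50,
         "g6": 0.60, "g7": 0.70, "g8": 0.80, "g9": 0.90, "g10": 0.95}
best_p, best_cutoff = lowest_p_over_cutoffs(toy_q, {"g1", "g2", "g3"}, [0.01, 0.05, 0.25])
print(best_p, best_cutoff)
```

    In this toy example the loosest cutoff gives the smallest P value because it is the only one that captures all three members of the gene set, which shows why a single fixed cutoff can miss an enriched set.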

  5. Beyond main effects of gene-sets: harsh parenting moderates the association between a dopamine gene-set and child externalizing behavior.

    PubMed

    Windhorst, Dafna A; Mileva-Seitz, Viara R; Rippe, Ralph C A; Tiemeier, Henning; Jaddoe, Vincent W V; Verhulst, Frank C; van IJzendoorn, Marinus H; Bakermans-Kranenburg, Marian J

    2016-08-01

    In a longitudinal cohort study, we investigated the interplay of harsh parenting and genetic variation across a set of functionally related dopamine genes, in association with children's externalizing behavior. This is one of the first studies to employ gene-based and gene-set approaches in tests of Gene by Environment (G × E) effects on complex behavior. This approach can offer an important alternative or complement to candidate gene and genome-wide environmental interaction (GWEI) studies in the search for genetic variation underlying individual differences in behavior. Genetic variants in 12 autosomal dopaminergic genes were available in an ethnically homogenous part of a population-based cohort. Harsh parenting was assessed with maternal (n = 1881) and paternal (n = 1710) reports at age 3. Externalizing behavior was assessed with the Child Behavior Checklist (CBCL) at age 5 (71 ± 3.7 months). We conducted gene-set analyses of the association between variation in dopaminergic genes and externalizing behavior, stratified for harsh parenting. The association was statistically significant or approached significance for children without harsh parenting experiences, but was absent in the group with harsh parenting. Similarly, significant associations between single genes and externalizing behavior were only found in the group without harsh parenting. Effect sizes in the groups with and without harsh parenting did not differ significantly. Gene-environment interaction tests were conducted for individual genetic variants, resulting in two significant interaction effects (rs1497023 and rs4922132) after correction for multiple testing. Our findings are suggestive of G × E interplay, with associations between dopamine genes and externalizing behavior present in children without harsh parenting, but not in children with harsh parenting experiences. Harsh parenting may overrule the role of genetic factors in externalizing behavior. Gene-based and gene-set

  6. Evaluation of statistical treatments of left-censored environmental data using coincident uncensored data sets: I. Summary statistics

    USGS Publications Warehouse

    Antweiler, R.C.; Taylor, H.E.

    2008-01-01

    The main classes of statistical treatment of below-detection limit (left-censored) environmental data for the determination of basic statistics that have been used in the literature are substitution methods, maximum likelihood, regression on order statistics (ROS), and nonparametric techniques. These treatments, along with using all instrument-generated data (even those below detection), were evaluated by examining data sets in which the true values of the censored data were known. It was found that for data sets with less than 70% censored data, the best technique overall for determination of summary statistics was the nonparametric Kaplan-Meier technique. ROS and the two substitution methods of assigning one-half the detection limit value to censored data or assigning a random number between zero and the detection limit to censored data were adequate alternatives. The use of these two substitution methods, however, requires a thorough understanding of how the laboratory censored the data. The technique of employing all instrument-generated data - including numbers below the detection limit - was found to be less adequate than the above techniques. At high degrees of censoring (greater than 70% censored data), no technique provided good estimates of summary statistics. Maximum likelihood techniques were found to be far inferior to all other treatments except substituting zero or the detection limit value to censored data.
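
    The two substitution treatments named in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes censored measurements are recorded as None, and the function names and sample values are hypothetical.

```python
import random
import statistics

def substitute_half_dl(values, detection_limit):
    """Replace censored entries (None) with one-half the detection limit."""
    return [detection_limit / 2 if v is None else v for v in values]

def substitute_uniform(values, detection_limit, rng):
    """Replace censored entries (None) with a random number in [0, detection limit]."""
    return [rng.uniform(0, detection_limit) if v is None else v for v in values]

def summarize(values):
    """Basic summary statistics of a fully numeric sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

# Hypothetical sample: two results below a detection limit of 1.0.
sample = [None, None, 2.0, 4.0]
half = substitute_half_dl(sample, 1.0)
rand = substitute_uniform(sample, 1.0, random.Random(0))
```

    Both functions need to know which values were censored and at what limit, which is one way to read the abstract's caution that these methods require understanding how the laboratory censored the data.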

  7. IGSA: Individual Gene Sets Analysis, including Enrichment and Clustering

    PubMed Central

    Liu, Lei; Ma, Hongzhe; Yang, Jingbo; Xie, Hongbo; Liu, Bo; Jin, Qing

    2016-01-01

    Analysis of gene sets has been widely applied in various high-throughput biological studies. One weakness of the traditional methods is that they neglect the heterogeneity of gene expression across samples, which may lead to the omission of some specific and important gene sets. It is also difficult for them to reflect the severity of disease and to provide expression profiles of gene sets for individuals. We developed a software application called IGSA that offers powerful analytical capacity in gene set enrichment and sample clustering. IGSA calculates gene set expression scores for each sample and uses an accumulating clustering strategy that lets the samples gather into the set according to the progress of disease from mild to severe. We evaluate the performance of IGSA on gastric, pancreatic and ovarian cancer data sets. We also compared the results of IGSA in KEGG pathway enrichment with DAVID, GSEA, SPIA and ssGSEA, and analyzed the results of IGSA clustering and different similarity measurement methods. Notably, IGSA proved to be more sensitive and specific in finding significant pathways, and can indicate related changes in pathways with the severity of disease. In addition, IGSA provides a profile of significant gene sets for each sample. PMID:27764138

  8. IGSA: Individual Gene Sets Analysis, including Enrichment and Clustering.

    PubMed

    Wu, Lingxiang; Chen, Xiujie; Zhang, Denan; Zhang, Wubing; Liu, Lei; Ma, Hongzhe; Yang, Jingbo; Xie, Hongbo; Liu, Bo; Jin, Qing

    2016-01-01

    Analysis of gene sets has been widely applied in various high-throughput biological studies. One weakness of the traditional methods is that they neglect the heterogeneity of gene expression across samples, which may lead to the omission of some specific and important gene sets. It is also difficult for them to reflect the severity of disease and to provide expression profiles of gene sets for individuals. We developed a software application called IGSA that offers powerful analytical capacity in gene set enrichment and sample clustering. IGSA calculates gene set expression scores for each sample and uses an accumulating clustering strategy that lets the samples gather into the set according to the progress of disease from mild to severe. We evaluate the performance of IGSA on gastric, pancreatic and ovarian cancer data sets. We also compared the results of IGSA in KEGG pathway enrichment with DAVID, GSEA, SPIA and ssGSEA, and analyzed the results of IGSA clustering and different similarity measurement methods. Notably, IGSA proved to be more sensitive and specific in finding significant pathways, and can indicate related changes in pathways with the severity of disease. In addition, IGSA provides a profile of significant gene sets for each sample.

  9. Characterizing gene-gene interactions in a statistical epistasis network of twelve candidate genes for obesity.

    PubMed

    De, Rishika; Hu, Ting; Moore, Jason H; Gilbert-Diamond, Diane

    2015-01-01

    Recent findings have reemphasized the importance of epistasis, or gene-gene interactions, as a contributing factor to the unexplained heritability of obesity. Network-based methods such as statistical epistasis networks (SEN) present an intuitive framework to address the computational challenge of studying pairwise interactions between thousands of genetic variants. In this study, we aimed to analyze pairwise interactions that are associated with Body Mass Index (BMI) between SNPs from twelve genes robustly associated with obesity (BDNF, ETV5, FAIM2, FTO, GNPDA2, KCTD15, MC4R, MTCH2, NEGR1, SEC16B, SH2B1, and TMEM18). We used information gain measures to identify all SNP-SNP interactions among and between these genes that were related to obesity (BMI > 30 kg/m²) within the Framingham Heart Study Cohort; interactions exceeding a certain threshold were used to build an SEN. We also quantified whether interactions tend to occur more between SNPs from the same gene (dyadicity) or between SNPs from different genes (heterophilicity). We identified a highly connected SEN of 709 SNPs and 1241 SNP-SNP interactions. Combining the SEN framework with dyadicity and heterophilicity analyses, we found 1 dyadic gene (TMEM18, P-value = 0.047) and 3 heterophilic genes (KCTD15, P-value = 0.045; SH2B1, P-value = 0.003; and TMEM18, P-value = 0.001). We also identified a lncRNA SNP (rs4358154) as a key node within the SEN using multiple network measures. This study presents an analytical framework to characterize the global landscape of genetic interactions from genome-wide arrays and also to discover nodes of potential biological significance within the identified network.

  10. FDR-FET: an optimizing gene set enrichment analysis method

    PubMed Central

    Ji, Rui-Ru; Ott, Karl-Heinz; Yordanova, Roumyana; Bruccoleri, Robert E

    2011-01-01

    Gene set enrichment analysis for analyzing large profiling and screening experiments can reveal unifying biological schemes based on previously accumulated knowledge represented as “gene sets”. Most of the existing implementations use a fixed fold-change or P value cutoff to generate regulated gene lists. However, the threshold selection in most cases is arbitrary, and has a significant effect on the test outcome and interpretation of the experiment. We developed a new gene set enrichment analysis method, ie, FDR-FET, which dynamically optimizes the threshold choice and improves the sensitivity and selectivity of gene set enrichment analysis. The procedure translates experimental results into a series of regulated gene lists at multiple false discovery rate (FDR) cutoffs, and computes the P value of the overrepresentation of a gene set using a Fisher’s exact test (FET) in each of these gene lists. The lowest P value is retained to represent the significance of the gene set. We also implemented improved methods to define a more relevant global reference set for the FET. We demonstrate the validity of the method using a published microarray study of three protease inhibitors of the human immunodeficiency virus and compare the results with those from other popular gene set enrichment analysis algorithms. Our results show that combining FDR with multiple cutoffs allows us to control the error while retaining genes that increase information content. We conclude that FDR-FET can selectively identify significant affected biological processes. Our method can be used for any user-generated gene list in the area of transcriptome, proteome, and other biological and scientific applications. PMID:21918636
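
    The procedure described in this abstract (regulated gene lists at several FDR cutoffs, a one-sided Fisher's exact test on each list, lowest P value retained) can be sketched roughly as follows. This is not the published FDR-FET implementation. It assumes Benjamini-Hochberg adjustment for the FDR step and uses all measured genes as the reference set, whereas the authors describe their own refined reference set; the function names and default cutoffs are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """One-sided Fisher's exact test: P(X >= k) for K set genes among N, n drawn."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)) / total

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def fdr_fet(gene_pvalues, gene_set, cutoffs=(0.01, 0.05, 0.1, 0.2)):
    """Return (lowest enrichment p-value, cutoff that produced it)."""
    genes = list(gene_pvalues)
    q = dict(zip(genes, bh_adjust([gene_pvalues[g] for g in genes])))
    in_set = set(gene_set) & set(genes)  # assumption: reference = all measured genes
    best = (1.0, None)
    for cutoff in cutoffs:
        regulated = {g for g in genes if q[g] <= cutoff}
        if not regulated:
            continue
        p = hypergeom_upper_tail(len(regulated & in_set), len(in_set),
                                 len(regulated), len(genes))
        if p < best[0]:
            best = (p, cutoff)
    return best
```

    Because the reported value is a minimum over several tests, it is best read as a ranking score for gene sets rather than a calibrated P value, unless the selection over cutoffs is accounted for separately.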

  11. goSTAG: gene ontology subtrees to tag and annotate genes within a set.

    PubMed

    Bennett, Brian D; Bushel, Pierre R

    2017-01-01

    Over-representation analysis (ORA) detects enrichment of genes within biological categories. Gene Ontology (GO) domains are commonly used for gene/gene-product annotation. When ORA is employed, often times there are hundreds of statistically significant GO terms per gene set. Comparing enriched categories between a large number of analyses and identifying the term within the GO hierarchy with the most connections is challenging. Furthermore, ascertaining biological themes representative of the samples can be highly subjective from the interpretation of the enriched categories. We developed goSTAG for utilizing GO Subtrees to Tag and Annotate Genes that are part of a set. Given gene lists from microarray, RNA sequencing (RNA-Seq) or other genomic high-throughput technologies, goSTAG performs GO enrichment analysis and clusters the GO terms based on the p-values from the significance tests. GO subtrees are constructed for each cluster, and the term that has the most paths to the root within the subtree is used to tag and annotate the cluster as the biological theme. We tested goSTAG on a microarray gene expression data set of samples acquired from the bone marrow of rats exposed to cancer therapeutic drugs to determine whether the combination or the order of administration influenced bone marrow toxicity at the level of gene expression. Several clusters were labeled with GO biological processes (BPs) from the subtrees that are indicative of some of the prominent pathways modulated in bone marrow from animals treated with an oxaliplatin/topotecan combination. In particular, negative regulation of MAP kinase activity was the biological theme exclusively in the cluster associated with enrichment at 6 h after treatment with oxaliplatin followed by control. However, nucleoside triphosphate catabolic process was the GO BP labeled exclusively at 6 h after treatment with topotecan followed by control. goSTAG converts gene lists from genomic analyses into biological themes

  12. Semantic particularity measure for functional characterization of gene sets using gene ontology.

    PubMed

    Bettembourg, Charles; Diot, Christian; Dameron, Olivier

    2014-01-01

    Genetic and genomic data analyses are outputting large sets of genes. Functional comparison of these gene sets is a key part of the analysis, as it identifies their shared functions, and the functions that distinguish each set. The Gene Ontology (GO) initiative provides a unified reference for analyzing gene molecular functions, biological processes and cellular components. Numerous semantic similarity measures have been developed to systematically quantify the weight of the GO terms shared by two genes. We studied how gene set comparisons can be improved by considering gene set particularity in addition to gene set similarity. We propose a new approach to compute gene set particularities based on the information conveyed by GO terms. A GO term's informativeness can be computed using either its information content based on the term frequency in a corpus, or a function of the term's distance to the root. We defined the semantic particularity of a set of GO terms Sg1 compared to another set of GO terms Sg2. We combined our particularity measure with a similarity measure to compare gene sets. We demonstrated that the combination of semantic similarity and semantic particularity measures was able to identify genes with particular functions from among similar genes. This differentiation was not recognized using only a semantic similarity measure. Semantic particularity should be used in conjunction with semantic similarity to perform functional analysis of GO-annotated gene sets. The principle is generalizable to other ontologies.

  13. Semantic Particularity Measure for Functional Characterization of Gene Sets Using Gene Ontology

    PubMed Central

    Bettembourg, Charles; Diot, Christian; Dameron, Olivier

    2014-01-01

    Background: Genetic and genomic data analyses are outputting large sets of genes. Functional comparison of these gene sets is a key part of the analysis, as it identifies their shared functions, and the functions that distinguish each set. The Gene Ontology (GO) initiative provides a unified reference for analyzing gene molecular functions, biological processes and cellular components. Numerous semantic similarity measures have been developed to systematically quantify the weight of the GO terms shared by two genes. We studied how gene set comparisons can be improved by considering gene set particularity in addition to gene set similarity. Results: We propose a new approach to compute gene set particularities based on the information conveyed by GO terms. A GO term's informativeness can be computed using either its information content based on the term frequency in a corpus, or a function of the term's distance to the root. We defined the semantic particularity of a set of GO terms Sg1 compared to another set of GO terms Sg2. We combined our particularity measure with a similarity measure to compare gene sets. We demonstrated that the combination of semantic similarity and semantic particularity measures was able to identify genes with particular functions from among similar genes. This differentiation was not recognized using only a semantic similarity measure. Conclusion: Semantic particularity should be used in conjunction with semantic similarity to perform functional analysis of GO-annotated gene sets. The principle is generalizable to other ontologies. PMID:24489737

  14. Testing gene set enrichment for subset of genes: Sub-GSE.

    PubMed

    Yan, Xiting; Sun, Fengzhu

    2008-09-02

    Many methods have been developed to test the enrichment of genes related to certain phenotypes or cell states in gene sets. These approaches usually combine gene expression data with functionally related gene sets as defined in databases such as GeneOntology (GO), KEGG, or BioCarta. The results based on gene set analysis are generally more biologically interpretable, accurate and robust than the results based on individual gene analysis. However, while most available methods for gene set enrichment analysis test the enrichment of the entire gene set, it is more likely that only a subset of the genes in the gene set may be related to the phenotypes of interest. In this paper, we develop a novel method, termed Sub-GSE, which measures the enrichment of a predefined gene set, or pathway, by testing its subsets. The application of Sub-GSE to two simulated and two real datasets shows Sub-GSE to be more sensitive than previous methods, such as GSEA, GSA, and SigPath, in detecting gene sets associated with a phenotype of interest. This is particularly true for cases in which only a fraction of the genes in the gene set are associated with the phenotypes. Furthermore, the application of Sub-GSE to two real data sets demonstrates that it can detect more biologically meaningful gene sets than GSEA. We developed a new method to measure the gene set enrichment. Applications to two simulated datasets and two real datasets show that this method is sensitive to the associations between gene sets and phenotype. The program Sub-GSE can be downloaded from http://www-rcf.usc.edu/~fsun.

  15. Gene Identification Algorithms Using Exploratory Statistical Analysis of Periodicity

    NASA Astrophysics Data System (ADS)

    Mukherjee, Shashi Bajaj; Sen, Pradip Kumar

    2010-10-01

    Studying periodic patterns would be expected to be a standard line of attack for recognizing DNA sequences in gene identification and similar problems, yet surprisingly little significant work has been done in this direction. This paper studies the statistical properties of DNA sequences of complete genomes using a new technique. A DNA sequence is converted to a numeric sequence using various types of mappings, and a standard Fourier technique is applied to study the periodicity. Distinct statistical behaviour of the periodicity parameters is found in coding and non-coding sequences, which can be used to distinguish between these parts. Here DNA sequences of Drosophila melanogaster were analyzed with significant accuracy.
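
    As an illustration of the general idea (numeric mapping followed by a Fourier transform), the sketch below uses one common mapping, a binary indicator sequence per base, and measures spectral power at period 3. The abstract does not say which mappings the authors used, so the mapping, the function names and the choice of period 3 are assumptions made for this example.

```python
import cmath

def indicator(seq, base):
    """Binary indicator sequence: 1.0 where seq has the given base, else 0.0."""
    return [1.0 if s == base else 0.0 for s in seq]

def power_at(signal, k):
    """Squared magnitude of the k-th discrete Fourier coefficient."""
    n = len(signal)
    coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
    return abs(coeff) ** 2

def period3_power(seq):
    """Total spectral power at period 3, summed over the four base indicators."""
    n = len(seq) - len(seq) % 3      # trim so that n/3 is an exact frequency bin
    seq = seq[:n].upper()
    k = n // 3
    return sum(power_at(indicator(seq, base), k) for base in "ACGT")
```

    A strictly repeating codon such as "ATG" * 30 concentrates power at this frequency, while a homopolymer such as "A" * 90 gives essentially none; real coding and non-coding regions fall between these extremes.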

  16. Regulation of SET Gene Expression by NFkB.

    PubMed

    Feng, Yi; Li, Xiaoyong; Zhou, Weitao; Lou, Dandan; Huang, Daochao; Li, Yanhua; Kang, Yu; Xiang, Yan; Li, Tingyu; Zhou, Weihui; Song, Weihong

    2017-08-01

    SET is elevated and mislocalized in the neuronal cytoplasm in brains of Alzheimer's disease (AD) and Down syndrome (DS) patients. Cytoplasm SET leads to inhibition of protein phosphatase 2A and is involved in the tau pathology. However, the regulation of SET gene expression remains elusive. In the present study, we cloned a 1399-bp segment of the 5' flanking region of the human SET gene and identified that the transcription start site (TSS) of SET transcript 1 is located at 123 bp upstream of the translation start site ATG in exon 1. Sequence analysis reveals several putative regulatory elements including NFkB, Sp1, and HSE. Luciferase assay and electrophoretic mobility shift assay (EMSA) identified a functional cis-acting NFkB-responsive element in the SET gene promoter. Overexpression and activation of NFkB upregulate transcription of SET isoform 1 but not isoform 2, indicating that the expression of these two isoforms is differentially regulated. The results demonstrate that NFkB plays an important role in regulation of the human SET gene expression. Our findings suggest that oxidative stress and inflammatory responses could result in abnormal SET gene expression, contributing to the tauopathy in AD pathogenesis.

  17. On sufficient statistics of least-squares superposition of vector sets.

    PubMed

    Konagurthu, Arun S; Kasarapu, Parthan; Allison, Lloyd; Collier, James H; Lesk, Arthur M

    2015-06-01

    The problem of superposition of two corresponding vector sets by minimizing their sum-of-squares error under orthogonal transformation is a fundamental task in many areas of science, notably structural molecular biology. This problem can be solved exactly using an algorithm whose time complexity grows linearly with the number of correspondences. This efficient solution has facilitated the widespread use of the superposition task, particularly in studies involving macromolecular structures. This article formally derives a set of sufficient statistics for the least-squares superposition problem. These statistics are additive. This permits a highly efficient (constant time) computation of superpositions (and sufficient statistics) of vector sets that are composed from its constituent vector sets under addition or deletion operation, where the sufficient statistics of the constituent sets are already known (that is, the constituent vector sets have been previously superposed). This results in a drastic improvement in the run time of the methods that commonly superpose vector sets under addition or deletion operations, where previously these operations were carried out ab initio (ignoring the sufficient statistics). We experimentally demonstrate the improvement our work offers in the context of protein structural alignment programs that assemble a reliable structural alignment from well-fitting (substructural) fragment pairs. A C++ library for this task is available online under an open-source license.
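
    To make the additivity claim concrete, here is a small sketch using one plausible choice of additive quantities: the point count, the coordinate sums, and the sum of outer products of corresponding points. The abstract does not list the paper's exact statistics, so this choice and all names are assumptions. The centered cross-covariance matrix recovered at the end is the usual input to the rotation step of a least-squares superposition.

```python
def stats(xs, ys):
    """Return (n, sum_x, sum_y, sum_xyT) for paired 3-D points."""
    n = len(xs)
    sx = [sum(p[i] for p in xs) for i in range(3)]
    sy = [sum(q[i] for q in ys) for i in range(3)]
    sxy = [[sum(p[i] * q[j] for p, q in zip(xs, ys)) for j in range(3)]
           for i in range(3)]
    return n, sx, sy, sxy

def merge(a, b):
    """Combine the statistics of two disjoint sets in constant time."""
    n = a[0] + b[0]
    sx = [u + v for u, v in zip(a[1], b[1])]
    sy = [u + v for u, v in zip(a[2], b[2])]
    sxy = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[3], b[3])]
    return n, sx, sy, sxy

def centered_cross_covariance(s):
    """sum (x - mean_x)(y - mean_y)^T, recovered from the additive sums."""
    n, sx, sy, sxy = s
    return [[sxy[i][j] - sx[i] * sy[j] / n for j in range(3)] for i in range(3)]
```

    Merging costs the same regardless of how many points each set holds, which is the kind of saving the abstract attributes to reusing previously computed statistics; deletion works the same way with subtraction.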

  18. Integrative gene set analysis of multi-platform data with sample heterogeneity.

    PubMed

    Hu, Jun; Tzeng, Jung-Ying

    2014-06-01

    Gene set analysis is a popular method for large-scale genomic studies. Because genes that have common biological features are analyzed jointly, gene set analysis often achieves better power and generates more biologically informative results. With the advancement of technologies, genomic studies with multi-platform data have become increasingly common. Several strategies have been proposed that integrate genomic data from multiple platforms to perform gene set analysis. To evaluate the performances of existing integrative gene set methods under various scenarios, we conduct a comparative simulation analysis based on The Cancer Genome Atlas breast cancer dataset. We find that existing methods for gene set analysis are less effective when sample heterogeneity exists. To address this issue, we develop three methods for multi-platform genomic data with heterogeneity: two non-parametric methods, multi-platform Mann-Whitney statistics and multi-platform outlier robust T-statistics, and a parametric method, multi-platform likelihood ratio statistics. Using simulations, we show that the proposed multi-platform Mann-Whitney statistics method has higher power for heterogeneous samples and comparable performance for homogeneous samples when compared with the existing methods. Our real data applications to two datasets of The Cancer Genome Atlas also suggest that the proposed methods are able to identify novel pathways that are missed by other strategies. http://www4.stat.ncsu.edu/~jytzeng/Software/Multiplatform_gene_set_analysis/ © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  19. Statistically invalid classification of high throughput gene expression data.

    PubMed

    Barbash, Shahar; Soreq, Hermona

    2013-01-01

    Classification analysis based on high throughput data is a common feature in neuroscience and other fields of science, with a rapidly increasing impact on both basic biology and disease-related studies. The outcome of such classifications often serves to delineate novel biochemical mechanisms in health and disease states, identify new targets for therapeutic interference, and develop innovative diagnostic approaches. Given the importance of this type of study, we screened 111 recently-published high-impact manuscripts involving classification analysis of gene expression, and found that 58 of them (53%) based their conclusions on a statistically invalid method which can lead to bias in a statistical sense (lower true classification accuracy than the reported classification accuracy). In this report we characterize the potential methodological error and its scope, investigate how it is influenced by different experimental parameters, and describe statistically valid methods for avoiding such classification mistakes.

  20. Differential Effects of Goal Setting and Value Reappraisal on College Women's Motivation and Achievement in Statistics

    ERIC Educational Resources Information Center

    Acee, Taylor Wayne

    2009-01-01

    The purpose of this dissertation was to investigate the differential effects of goal setting and value reappraisal on female students' self-efficacy beliefs, value perceptions, exam performance and continued interest in statistics. It was hypothesized that the Enhanced Goal Setting Intervention (GS-E) would positively impact students'…

  2. Zebrafish Expression Ontology of Gene Sets (ZEOGS): a tool to analyze enrichment of zebrafish anatomical terms in large gene sets.

    PubMed

    Prykhozhij, Sergey V; Marsico, Annalisa; Meijsing, Sebastiaan H

    2013-09-01

    The zebrafish (Danio rerio) is an established model organism for developmental and biomedical research. It is frequently used for high-throughput functional genomics experiments, such as genome-wide gene expression measurements, to systematically analyze molecular mechanisms. However, the use of whole embryos or larvae in such experiments leads to a loss of the spatial information. To address this problem, we have developed a tool called Zebrafish Expression Ontology of Gene Sets (ZEOGS) to assess the enrichment of anatomical terms in large gene sets. ZEOGS uses gene expression pattern data from several sources: first, in situ hybridization experiments from the Zebrafish Model Organism Database (ZFIN); second, it uses the Zebrafish Anatomical Ontology, a controlled vocabulary that describes connected anatomical structures; and third, the available connections between expression patterns and anatomical terms contained in ZFIN. Upon input of a gene set, ZEOGS determines which anatomical structures are overrepresented in the input gene set. ZEOGS allows one for the first time to look at groups of genes and to describe them in terms of shared anatomical structures. To establish ZEOGS, we first tested it on random gene selections and on two public microarray datasets with known tissue-specific gene expression changes. These tests showed that ZEOGS could reliably identify the tissues affected, whereas only very few enriched terms to none were found in the random gene sets. Next we applied ZEOGS to microarray datasets of 24 and 72 h postfertilization zebrafish embryos treated with beclomethasone, a potent glucocorticoid. This analysis resulted in the identification of several anatomical terms related to glucocorticoid-responsive tissues, some of which were stage-specific. Our studies highlight the ability of ZEOGS to extract spatial information from datasets derived from whole embryos, indicating that ZEOGS could be a useful tool to automatically analyze gene expression

  3. GO2MSIG, an automated GO based multi-species gene set generator for gene set enrichment analysis.

    PubMed

    Powell, Justin Andrew Christiaan

    2014-05-17

    Despite the widespread use of high throughput expression platforms and the availability of a desktop implementation of Gene Set Enrichment Analysis (GSEA) that enables non-experts to perform gene set based analyses, the availability of the necessary precompiled gene sets is rare for species other than human. A software tool (GO2MSIG) was implemented that combines data from various publicly available sources and uses the Gene Ontology (GO) project term relationships to produce GSEA compatible hierarchical GO based gene sets for all species for which association data is available. Annotation sources include the GO association database (which contains data for over 200000 species), the Entrez gene2go table, and various manufacturers' array annotation files. This enables the creation of gene sets from the most up-to-date annotation data available. Additional features include the ability to restrict by evidence code, to remap gene descriptors, to filter by set size and to speed up repeat queries by caching the GO term hierarchy. Synonymous GO terms are remapped to the version preferred by the GO ontology supplied. The tool can be used in standalone form, or via a web interface. Prebuilt gene set collections constructed from the September 2013 GO release are also available for common species including human. In contrast human GO based sets available from the Broad Institute itself date from 2008. GO2MSIG enables the bioinformatician and non-bioinformatician alike to generate gene sets required for GSEA analysis for almost any organism for which GO term association data exists. The output gene sets may be used directly within GSEA and do not require knowledge of programming languages such as Perl, R or Python. The output sets can also be used with other analysis software such as ErmineJ that accept gene sets in the same format. Source code can be downloaded and installed locally from http://www.bioinformatics.org/go2msig/releases/ or used via the web interface at http
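
    The core idea in this abstract, using GO term relationships to build hierarchical gene sets and filtering them by size, can be sketched as below. This is not GO2MSIG's code: the data structures, function names and size limits are hypothetical, and the tab-separated output (set name, description, then genes) follows the GMT layout that GSEA reads, which is assumed here to be the intended "GSEA compatible" format.

```python
def ancestors(term, parents, cache):
    """All ancestor terms of a GO term, cached so each term is expanded only once."""
    if term not in cache:
        found = set()
        for parent in parents.get(term, ()):
            found.add(parent)
            found |= ancestors(parent, parents, cache)
        cache[term] = found
    return cache[term]

def build_gene_sets(annotations, parents, min_size=1, max_size=500):
    """Propagate each gene's annotations up the hierarchy, then filter by set size."""
    cache, sets = {}, {}
    for gene, terms in annotations.items():
        for term in terms:
            for t in {term} | ancestors(term, parents, cache):
                sets.setdefault(t, set()).add(gene)
    return {t: sorted(genes) for t, genes in sets.items()
            if min_size <= len(genes) <= max_size}

def to_gmt(gene_sets, descriptions=None):
    """One line per set: name, description, then member genes, tab-separated."""
    descriptions = descriptions or {}
    return "\n".join("\t".join([term, descriptions.get(term, "na"), *genes])
                     for term, genes in sorted(gene_sets.items()))
```

    With parents {"GO:3": ["GO:2"], "GO:2": ["GO:1"]} (placeholder identifiers), a gene annotated only to GO:3 also appears in the sets for GO:2 and GO:1, which is what makes the resulting collection hierarchical.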

  4. Third party data gene data set of eutherian growth hormone genes.

    PubMed

    Premzl, Marko

    2015-12-01

    Among 146 potential coding sequences, the most comprehensive eutherian growth hormone gene data set annotated 100 complete coding sequences. The eutherian comparative genomic analysis protocol first described 5 major gene clusters of eutherian growth hormone genes. The present updated gene classification and nomenclature of eutherian growth hormone genes integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis into a new framework of future experiments. The curated third party data gene data set of eutherian growth hormone genes was deposited in European Nucleotide Archive under accession numbers LM644135-LM644234.

  5. Discovery of cancer common and specific driver gene sets.

    PubMed

    Zhang, Junhua; Zhang, Shihua

    2017-06-02

    Cancer is known as a disease mainly caused by gene alterations. Discovery of mutated driver pathways or gene sets is becoming an important step to understand molecular mechanisms of carcinogenesis. However, systematically investigating commonalities and specificities of driver gene sets among multiple cancer types is still a great challenge, but this investigation will undoubtedly benefit deciphering cancers and will be helpful for personalized therapy and precision medicine in cancer treatment. In this study, we propose two optimization models to de novo discover common driver gene sets among multiple cancer types (ComMDP) and specific driver gene sets of one certain or multiple cancer types to other cancers (SpeMDP), respectively. We first apply ComMDP and SpeMDP to simulated data to validate their efficiency. Then, we further apply these methods to 12 cancer types from The Cancer Genome Atlas (TCGA) and obtain several biologically meaningful driver pathways. As examples, we construct a common cancer pathway model for BRCA and OV, infer a complex driver pathway model for BRCA carcinogenesis based on common driver gene sets of BRCA with eight cancer types, and investigate specific driver pathways of the liquid cancer lymphoblastic acute myeloid leukemia (LAML) versus other solid cancer types. In these processes more candidate cancer genes are also found. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  6. GeneBase 1.1: a tool to summarize data from NCBI gene datasets and its application to an update of human gene statistics

    PubMed Central

    Piovesan, Allison; Caracausi, Maria; Antonaros, Francesca; Pelleri, Maria Chiara; Vitale, Lorenza

    2016-01-01

    We release GeneBase 1.1, a local tool with a graphical interface useful for parsing, structuring and indexing data from the National Center for Biotechnology Information (NCBI) Gene data bank. Compared to its predecessor GeneBase (1.0), GeneBase 1.1 now allows dynamic calculation and summarization in terms of median, mean, standard deviation and total for many quantitative parameters associated with genes, gene transcripts and gene features (exons, introns, coding sequences, untranslated regions). GeneBase 1.1 thus offers the opportunity to perform analyses of the main gene structure parameters also following the search for any set of genes with the desired characteristics, allowing unique functionalities not provided by the NCBI Gene itself. In order to show the potential of our tool for local parsing, structuring and dynamic summarizing of publicly available databases for data retrieval, analysis and testing of biological hypotheses, we provide as a sample application a revised set of statistics for human nuclear genes, gene transcripts and gene features. In contrast with previous estimations strongly underestimating the length of human genes, a ‘mean’ human protein-coding gene is 67 kbp long, has eleven 309 bp long exons and ten 6355 bp long introns. Median, mean and extreme values are provided for many other features offering an updated reference source for human genome studies, data useful to set parameters for bioinformatic tools and interesting clues to the biomedical meaning of the gene features themselves. Database URL: http://apollo11.isto.unibo.it/software/ PMID:28025344
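
    The summary measures listed in this abstract (median, mean, standard deviation and total) and the quoted gene dimensions can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not GeneBase code; the helper name is hypothetical, and the "stylized gene" simply adds up the quoted exon and intron figures.

```python
import statistics

def summarize(values):
    """Median, mean, sample standard deviation and total of a numeric list."""
    return {
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "total": sum(values),
    }

# Figures quoted in the abstract: eleven 309 bp exons and ten 6355 bp introns.
exon_count, exon_len = 11, 309
intron_count, intron_len = 10, 6355
stylized_gene_len = exon_count * exon_len + intron_count * intron_len  # in bp
```

    The sum comes to 66,949 bp, which agrees with the quoted 67 kbp. This is only a plausibility check, since a mean gene length need not equal the product of mean counts and mean lengths in general.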

  7. GeneBase 1.1: a tool to summarize data from NCBI gene datasets and its application to an update of human gene statistics.

    PubMed

    Piovesan, Allison; Caracausi, Maria; Antonaros, Francesca; Pelleri, Maria Chiara; Vitale, Lorenza

    2016-01-01

    We release GeneBase 1.1, a local tool with a graphical interface useful for parsing, structuring and indexing data from the National Center for Biotechnology Information (NCBI) Gene data bank. Compared to its predecessor GeneBase (1.0), GeneBase 1.1 now allows dynamic calculation and summarization in terms of median, mean, standard deviation and total for many quantitative parameters associated with genes, gene transcripts and gene features (exons, introns, coding sequences, untranslated regions). GeneBase 1.1 thus offers the opportunity to perform analyses of the main gene structure parameters also following the search for any set of genes with the desired characteristics, allowing unique functionalities not provided by the NCBI Gene itself. In order to show the potential of our tool for local parsing, structuring and dynamic summarizing of publicly available databases for data retrieval, analysis and testing of biological hypotheses, we provide as a sample application a revised set of statistics for human nuclear genes, gene transcripts and gene features. In contrast with previous estimations strongly underestimating the length of human genes, a 'mean' human protein-coding gene is 67 kbp long, has eleven 309 bp long exons and ten 6355 bp long introns. Median, mean and extreme values are provided for many other features offering an updated reference source for human genome studies, data useful to set parameters for bioinformatic tools and interesting clues to the biomedical meaning of the gene features themselves. Database URL: http://apollo11.isto.unibo.it/software/

  8. Analysis of Gene Sets Based on the Underlying Regulatory Network

    PubMed Central

    Michailidis, George

    2009-01-01

    Networks are often used to represent the interactions among genes and proteins. These interactions are known to play an important role in vital cell functions and should be included in the analysis of genes that are differentially expressed. Methods of gene set analysis take advantage of external biological information and analyze a priori defined sets of genes. These methods can potentially preserve the correlation among genes; however, they do not directly incorporate the information about the gene network. In this paper, we propose a latent variable model that directly incorporates the network information. We then use the theory of mixed linear models to present a general inference framework for the problem of testing the significance of subnetworks. Several possible test procedures are introduced and a network based method for testing the changes in expression levels of genes as well as the structure of the network is presented. The performance of the proposed method is compared with methods of gene set analysis using both simulation studies and real data on genes related to the galactose utilization pathway in yeast. PMID:19254181

  9. Integrated gene set analysis for microRNA studies

    PubMed Central

    Garcia-Garcia, Francisco; Panadero, Joaquin; Dopazo, Joaquin; Montaner, David

    2016-01-01

    Motivation: Functional interpretation of miRNA expression data is currently done in a three step procedure: select differentially expressed miRNAs, find their target genes, and carry out gene set overrepresentation analysis. Nevertheless, major limitations of this approach have already been described at the gene level, while some newer arise in the miRNA scenario. Here, we propose an enhanced methodology that builds on the well-established gene set analysis paradigm. Evidence for differential expression at the miRNA level is transferred to a gene differential inhibition score which is easily interpretable in terms of gene sets or pathways. Such transferred indexes account for the additive effect of several miRNAs targeting the same gene, and also incorporate cancellation effects between cases and controls. Together, these two desirable characteristics allow for more accurate modeling of regulatory processes. Results: We analyze high-throughput sequencing data from 20 different cancer types and provide exhaustive reports of gene and Gene Ontology-term deregulation by miRNA action. Availability and Implementation: The proposed methodology was implemented in the Bioconductor library mdgsa. http://bioconductor.org/packages/mdgsa. For the purpose of reproducibility all of the scripts are available at https://github.com/dmontaner-papers/gsa4mirna Contact: david.montaner@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27324197

  10. GeneTools--application for functional annotation and statistical hypothesis testing.

    PubMed

    Beisvag, Vidar; Jünge, Frode K R; Bergum, Hallgeir; Jølsum, Lars; Lydersen, Stian; Günther, Clara-Cecilie; Ramampiaro, Heri; Langaas, Mette; Sandvik, Arne K; Laegreid, Astrid

    2006-10-24

    Modern biology has shifted from "one gene" approaches to methods for genomic-scale analysis like microarray technology, which allow simultaneous measurement of thousands of genes. This has created a need for tools facilitating interpretation of biological data in "batch" mode. However, such tools often leave the investigator with large volumes of apparently unorganized information. To meet this interpretation challenge, gene-set, or cluster testing has become a popular analytical tool. Many gene-set testing methods and software packages are now available, most of which use a variety of statistical tests to assess the genes in a set for biological information. However, the field is still evolving, and there is a great need for "integrated" solutions. GeneTools is a web-service providing access to a database that brings together information from a broad range of resources. The annotation data are updated weekly, guaranteeing that users get data most recently available. Data submitted by the user are stored in the database, where it can easily be updated, shared between users and exported in various formats. GeneTools provides three different tools: i) NMC Annotation Tool, which offers annotations from several databases like UniGene, Entrez Gene, SwissProt and GeneOntology, in both single- and batch search mode. ii) GO Annotator Tool, where users can add new gene ontology (GO) annotations to genes of interest. These user defined GO annotations can be used in further analysis or exported for public distribution. iii) eGOn, a tool for visualization and statistical hypothesis testing of GO category representation. As the first GO tool, eGOn supports hypothesis testing for three different situations (master-target situation, mutually exclusive target-target situation and intersecting target-target situation). An important additional function is an evidence-code filter that allows users to select the GO annotations for the analysis. GeneTools is the first "all in one

  11. GeneTools – application for functional annotation and statistical hypothesis testing

    PubMed Central

    Beisvag, Vidar; Jünge, Frode KR; Bergum, Hallgeir; Jølsum, Lars; Lydersen, Stian; Günther, Clara-Cecilie; Ramampiaro, Heri; Langaas, Mette; Sandvik, Arne K; Lægreid, Astrid

    2006-01-01

    Background Modern biology has shifted from "one gene" approaches to methods for genomic-scale analysis like microarray technology, which allow simultaneous measurement of thousands of genes. This has created a need for tools facilitating interpretation of biological data in "batch" mode. However, such tools often leave the investigator with large volumes of apparently unorganized information. To meet this interpretation challenge, gene-set, or cluster testing has become a popular analytical tool. Many gene-set testing methods and software packages are now available, most of which use a variety of statistical tests to assess the genes in a set for biological information. However, the field is still evolving, and there is a great need for "integrated" solutions. Results GeneTools is a web-service providing access to a database that brings together information from a broad range of resources. The annotation data are updated weekly, guaranteeing that users get data most recently available. Data submitted by the user are stored in the database, where it can easily be updated, shared between users and exported in various formats. GeneTools provides three different tools: i) NMC Annotation Tool, which offers annotations from several databases like UniGene, Entrez Gene, SwissProt and GeneOntology, in both single- and batch search mode. ii) GO Annotator Tool, where users can add new gene ontology (GO) annotations to genes of interest. These user defined GO annotations can be used in further analysis or exported for public distribution. iii) eGOn, a tool for visualization and statistical hypothesis testing of GO category representation. As the first GO tool, eGOn supports hypothesis testing for three different situations (master-target situation, mutually exclusive target-target situation and intersecting target-target situation). An important additional function is an evidence-code filter that allows users to select the GO annotations for the analysis. Conclusion Gene

  12. Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models.

    PubMed

    Hirose, Osamu; Yoshida, Ryo; Imoto, Seiya; Yamaguchi, Rui; Higuchi, Tomoyuki; Charnock-Jones, D Stephen; Print, Cristin; Miyano, Satoru

    2008-04-01

    Statistical inference of gene networks by using time-course microarray gene expression profiles is an essential step towards understanding the temporal structure of gene regulatory mechanisms. Unfortunately, most of the current studies have been limited to analysing a small number of genes because the length of time-course gene expression profiles is fairly short. One promising approach to overcome such a limitation is to infer gene networks by exploring the potential transcriptional modules which are sets of genes sharing a common function or involved in the same pathway. In this article, we present a novel approach based on the state space model to identify the transcriptional modules and module-based gene networks simultaneously. The state space model has the potential to infer large-scale gene networks, e.g. of order 10^3, from time-course gene expression profiles. Particularly, we succeeded in the identification of a cell cycle system by using the gene expression profiles of Saccharomyces cerevisiae in which the length of the time-course and number of genes were 24 and 4382, respectively. However, when analysing shorter time-course data, e.g. of length 10 or less, the parameter estimations of the state space model often fail due to overfitting. To extend the applicability of the state space model, we provide an approach to use the technical replicates of gene expression profiles, which are often measured in duplicate or triplicate. The use of technical replicates is important for achieving highly-efficient inferences of gene networks with short time-course data. The potential of the proposed method has been demonstrated through the time-course analysis of the gene expression profiles of human umbilical vein endothelial cells (HUVECs) undergoing growth factor deprivation-induced apoptosis. Supplementary Information and the software (TRANS-MNET) are available at http://daweb.ism.ac.jp/~yoshidar/software/ssm/.

  13. Multivariate Gene Selection and Testing in Studying the Exposure Effects on a Gene Set.

    PubMed

    Sofer, Tamar; Maity, Arnab; Coull, Brent; Baccarelli, Andrea; Schwartz, Joel; Lin, Xihong

    2012-11-01

    Studying the association between a gene set (e.g., pathway) and exposures using multivariate regression methods is of increasing importance in genomic studies. Such an analysis is often more powerful and interpretable than individual gene analysis. Since many genes in a gene set are likely not affected by exposures, one is often interested in identifying a subset of genes in the gene set that are affected by exposures. This allows for better understanding of the underlying biological mechanism and for pursuing further biological investigation of these genes. The selected subset of "signal" genes also provides an attractive vehicle for a more powerful test for the association between the gene set and exposures. We propose two computationally simple Canonical Correlation Analysis (CCA) based variable selection methods: Sparse Outcome Selection (SOS) CCA and step CCA, to jointly select a subset of genes in a gene set that are associated with exposures. Several model selection criteria, such as BIC and the new Correlation Information Criterion (CIC), are proposed and compared. We also develop a global test procedure for testing the exposure effects on the whole gene set, accounting for gene selection. Through simulation studies, we show that the proposed methods improve upon an existing method when the genes are correlated and are more computationally efficient. We apply the proposed methods to the analysis of the Normative Aging DNA methylation Study to examine the effects of airborne particulate matter exposures on DNA methylations in a genetic pathway.

  14. A hybrid approach of gene sets and single genes for the prediction of survival risks with gene expression data.

    PubMed

    Seok, Junhee; Davis, Ronald W; Xiao, Wenzhong

    2015-01-01

    Accumulated biological knowledge is often encoded as gene sets, collections of genes associated with similar biological functions or pathways. The use of gene sets in the analyses of high-throughput gene expression data has been intensively studied and applied in clinical research. However, the main interest remains in finding modules of biological knowledge, or corresponding gene sets, significantly associated with disease conditions. Risk prediction from censored survival times using gene sets hasn't been well studied. In this work, we propose a hybrid method that uses both single gene and gene set information together to predict patient survival risks from gene expression profiles. In the proposed method, gene sets provide context-level information that is poorly reflected by single genes. Complementarily, single genes help to supplement incomplete information of gene sets due to our imperfect biomedical knowledge. Through the tests over multiple data sets of cancer and trauma injury, the proposed method showed robust and improved performance compared with the conventional approaches with only single genes or gene sets solely. Additionally, we examined the prediction result in the trauma injury data, and showed that the modules of biological knowledge used in the prediction by the proposed method were highly interpretable in biology. A wide range of survival prediction problems in clinical genomics is expected to benefit from the use of biological knowledge.

  15. A Hybrid Approach of Gene Sets and Single Genes for the Prediction of Survival Risks with Gene Expression Data

    PubMed Central

    Seok, Junhee; Davis, Ronald W.; Xiao, Wenzhong

    2015-01-01

    Accumulated biological knowledge is often encoded as gene sets, collections of genes associated with similar biological functions or pathways. The use of gene sets in the analyses of high-throughput gene expression data has been intensively studied and applied in clinical research. However, the main interest remains in finding modules of biological knowledge, or corresponding gene sets, significantly associated with disease conditions. Risk prediction from censored survival times using gene sets hasn’t been well studied. In this work, we propose a hybrid method that uses both single gene and gene set information together to predict patient survival risks from gene expression profiles. In the proposed method, gene sets provide context-level information that is poorly reflected by single genes. Complementarily, single genes help to supplement incomplete information of gene sets due to our imperfect biomedical knowledge. Through the tests over multiple data sets of cancer and trauma injury, the proposed method showed robust and improved performance compared with the conventional approaches with only single genes or gene sets solely. Additionally, we examined the prediction result in the trauma injury data, and showed that the modules of biological knowledge used in the prediction by the proposed method were highly interpretable in biology. A wide range of survival prediction problems in clinical genomics is expected to benefit from the use of biological knowledge. PMID:25933378

  16. Hepatic vessel segmentation using variational level set combined with non-local robust statistics.

    PubMed

    Lu, Siyu; Huang, Hui; Liang, Ping; Chen, Gang; Xiao, Liang

    2017-02-01

    Hepatic vessel segmentation is a challenging step in therapy guided by magnetic resonance imaging (MRI). This paper presents an improved variational level set method, which uses non-local robust statistics to suppress the influence of noise in MR images. The non-local robust statistics, which represent vascular features, are learned adaptively from seeds provided by users. K-means clustering in neighborhoods of seeds is utilized to exclude inappropriate seeds, which are obviously corrupted by noise. The neighborhoods of appropriate seeds are placed in an array to calculate the non-local robust statistics, and the variational level set formulation can be constructed. Bias correction is utilized in the level set formulation to reduce the influence of intensity inhomogeneity of MRI. Experiments were conducted over real MR images, and showed that the proposed method performed better on small hepatic vessel segmentation compared with other segmentation methods.

  17. Gene set analysis: limitations in popular existing methods and proposed improvements.

    PubMed

    Mishra, Pashupati; Törönen, Petri; Leino, Yrjö; Holm, Liisa

    2014-10-01

    Gene set analysis is the analysis of a set of genes that collectively contribute to a biological process. Most popular gene set analysis methods are based on empirical P-value that requires large number of permutations. Despite numerous gene set analysis methods developed in the past decade, the most popular methods still suffer from serious limitations. We present a gene set analysis method (mGSZ) based on Gene Set Z-scoring function (GSZ) and asymptotic P-values. Asymptotic P-value calculation requires fewer permutations, and thus speeds up the gene set analysis process. We compare the GSZ-scoring function with seven popular gene set scoring functions and show that GSZ stands out as the best scoring function. In addition, we show improved performance of the GSA method when the max-mean statistics is replaced by the GSZ scoring function. We demonstrate the importance of both gene and sample permutations by showing the consequences in the absence of one or the other. A comparison of asymptotic and empirical methods of P-value estimation demonstrates a clear advantage of asymptotic P-value over empirical P-value. We show that mGSZ outperforms the state-of-the-art methods based on two different evaluations. We compared mGSZ results with permutation and rotation tests and show that rotation does not improve our asymptotic P-values. We also propose well-known asymptotic distribution models for three of the compared methods. mGSZ is available as R package from cran.r-project.org. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  18. An unusually simple HP1 gene set in Hymenopteran insects.

    PubMed

    Fang, C; Schmitz, L; Ferree, P M

    2015-12-01

    The heterochromatin protein 1 (HP1) gene family includes a set of paralogs in higher eukaryotes that serve fundamental roles in heterochromatin structure and maintenance, and other chromatin-related functions. At least 10 full and 16 partial HP1 genes exist among Drosophila species, with multiple gene gains, losses, and sub-functionalizations within this insect group. An important question is whether this diverse set of HP1 genes and their dynamic evolution represent the standard rule in eukaryotic groups. Here we have begun to address this question by bio-informatically identifying the HP1 family genes in representative species of the insect order Hymenoptera, which includes all ants, bees, wasps, and sawflies. Compared to Drosophila species, Hymenopterans have a much simpler set of HP1 genes, including one full and two partial HP1s. All 3 genes appear to have been present in the common ancestor of the Hymenopterans and they derive from a Drosophila HP1B-like gene. In ants, a partial HP1 gene containing only a chromoshadow domain harbors amino acid changes at highly conserved sites within the PxVxL recognition region, suggesting that this gene has undergone sub-functionalization. In the jewel wasp Nasonia vitripennis, the full HP1 and partial chromoshadow-only HP1 are expressed in both germ line and somatic tissues. However, the partial chromodomain-only HP1 is expressed exclusively in the ovary and testis, suggesting that it may have a specialized chromatin role during gametogenesis. Our findings demonstrate that the HP1 gene family is much simpler and evolutionarily less dynamic within the Hymenopterans compared to the much younger Drosophila group, a pattern that may reflect major differences in the range of chromatin-related functions present in these and perhaps other insect groups.

  19. Gene set enrichment ensemble using fold change data only.

    PubMed

    Huang, Hai; Zhang, Shaohong; Shen, Wen-Jun; Wong, Hau-San; Xie, Dongqing

    2015-10-01

    In a number of biological studies, the raw gene expression data are not usually published due to different causes, such as data privacy and patent rights. Instead, significant gene lists with fold change values are usually provided in most studies. However, due to variations in data sources and profiling conditions, only a small number of common significant genes could be found among similar studies. Moreover, traditional gene set based analyses that consider these genes have not taken into account the fold change values, which may be important to distinguish between the different levels of significance of the genes. Human embryonic stem cell derived cardiomyocytes (hESC-CM) is a good representative of this category. hESC-CMs, with its role as a potentially unlimited source of human heart cells for regenerative medicine, have attracted the attentions of biological and medical researchers. Because of the difficulty of acquiring data and the resulting expenses, there are only a few related hESC-CM studies and few hESC-CM gene expression data are provided. In view of these challenges, we propose a new Gene Set Enrichment Ensemble (GSEE) approach to perform gene set based analysis on individual studies based on significant up-regulated gene lists with fold change data only. Our approach provides both explicit and implicit ways to utilize the fold change data, in order to make full use of scarce data. We validate our approach with hESC-CM data and fetal heart data, respectively. Experimental results on significant gene lists from different studies illustrate the effectiveness of our proposed approach. Copyright © 2015 Elsevier Inc. All rights reserved.

  20. Genes2GO: A web application for querying gene sets for specific GO terms.

    PubMed

    Chawla, Konika; Kuiper, Martin

    2016-01-01

    Gene ontology annotations have become an essential resource for biological interpretations of experimental findings. The process of gathering basic annotation information in tables that link gene sets with specific gene ontology terms can be cumbersome, in particular if it requires above average computer skills or bioinformatics expertise. We have therefore developed Genes2GO, an intuitive R-based web application. Genes2GO uses the biomaRt package of Bioconductor in order to retrieve custom sets of gene ontology annotations for any list of genes from organisms covered by the Ensembl database. Genes2GO produces a binary matrix file, indicating for each gene the presence or absence of specific annotations for a gene. It should be noted that other GO tools do not offer this user-friendly access to annotations. Genes2GO is freely available and listed under http://www.semantic-systems-biology.org/tools/externaltools/.
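
    The binary presence/absence matrix described above can be illustrated with a small sketch. This is not the Genes2GO implementation (which is R-based and uses biomaRt); the gene names, term labels and helper functions below are hypothetical placeholders chosen only to show the shape of the output.

```python
def build_binary_matrix(annotations, terms):
    """Map each gene to a list of 0/1 flags, one per term of interest."""
    return {
        gene: [1 if term in annotations[gene] else 0 for term in terms]
        for gene in sorted(annotations)
    }


def to_tsv(matrix, terms):
    """Render the matrix as tab-separated text with a header row."""
    header = "\t".join(["gene"] + list(terms))
    rows = ["\t".join([gene] + [str(flag) for flag in flags])
            for gene, flags in matrix.items()]
    return "\n".join([header] + rows)


# Toy input (placeholder identifiers, not real annotations).
toy_annotations = {
    "geneA": {"GO:term1", "GO:term2"},
    "geneB": {"GO:term2"},
    "geneC": set(),
}
terms_of_interest = ["GO:term1", "GO:term2"]

matrix = build_binary_matrix(toy_annotations, terms_of_interest)
print(to_tsv(matrix, terms_of_interest))
```

    Each row corresponds to one gene and each column to one GO term, so the table can be loaded directly into a spreadsheet or statistics package.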

  1. Methods for Determining the Statistical Significance of Enrichment or Depletion of Gene Ontology Classifications under Weighted Membership.

    PubMed

    Iacucci, Ernesto; Zingg, Hans H; Perkins, Theodore J

    2012-01-01

    High-throughput molecular biology studies, such as microarray assays of gene expression, two-hybrid experiments for detecting protein interactions, or ChIP-Seq experiments for transcription factor binding, often result in an "interesting" set of genes - say, genes that are co-expressed or bound by the same factor. One way of understanding the biological meaning of such a set is to consider what processes or functions, as defined in an ontology, are over-represented (enriched) or under-represented (depleted) among genes in the set. Usually, the significance of enrichment or depletion scores is based on simple statistical models and on the membership of genes in different classifications. We consider the more general problem of computing p-values for arbitrary integer additive statistics, or weighted membership functions. Such membership functions can be used to represent, for example, prior knowledge on the role of certain genes or classifications, differential importance of different classifications or genes to the experimenter, hierarchical relationships between classifications, or different degrees of interestingness or evidence for specific genes. We describe a generic dynamic programming algorithm that can compute exact p-values for arbitrary integer additive statistics. We also describe several optimizations for important special cases, which can provide orders-of-magnitude speed up in the computations. We apply our methods to datasets describing oxidative phosphorylation and parturition and compare p-values based on computations of several different statistics for measuring enrichment. We find major differences between p-values resulting from these statistics, and that some statistics recover "gold standard" annotations of the data better than others. Our work establishes a theoretical and algorithmic basis for far richer notions of enrichment or depletion of gene sets with respect to gene ontologies than has previously been available.
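
    The idea of an exact p-value for an integer additive statistic can be sketched with a small dynamic program. The code below is a minimal illustration under stated assumptions, not the authors' algorithm or its optimizations: it assumes the null model is a uniformly random gene subset of fixed size, that the statistic is the sum of integer gene weights in the subset, and that larger sums indicate enrichment. The function name and inputs are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb


def exact_enrichment_p_value(weights, set_size, observed):
    """P(sum of weights in a random size-`set_size` subset >= observed)."""
    # counts[j][s] = number of subsets of size j whose weights sum to s.
    counts = [defaultdict(int) for _ in range(set_size + 1)]
    counts[0][0] = 1
    for w in weights:
        # Go from larger to smaller subset sizes so each gene is used at most once.
        for j in range(set_size - 1, -1, -1):
            for s, c in list(counts[j].items()):
                counts[j + 1][s + w] += c
    total = comb(len(weights), set_size)
    extreme = sum(c for s, c in counts[set_size].items() if s >= observed)
    return Fraction(extreme, total)


# With 0/1 weights the statistic is a plain membership count, and the result
# is the usual one-sided hypergeometric tail probability.
membership = [1, 1, 0, 0, 0]
p_both = exact_enrichment_p_value(membership, set_size=2, observed=2)
p_any = exact_enrichment_p_value(membership, set_size=2, observed=1)
print(p_both, p_any)
```

    Replacing the 0/1 weights with other integers gives a weighted membership statistic while the counting logic stays the same. Exact fractions are used here to avoid rounding, at the cost of speed on large inputs.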

  2. GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space.

    PubMed

    Zhong, Sheng; Storch, Kai-Florian; Lipan, Ovidiu; Kao, Ming-Chih J; Weitz, Charles J; Wong, Wing H

    2004-01-01

    The analysis of complex patterns of gene regulation is central to understanding the biology of cells, tissues and organisms. Patterns of gene regulation pertaining to specific biological processes can be revealed by a variety of experimental strategies, particularly microarrays and other highly parallel methods, which generate large datasets linking many genes. Although methods for detecting gene expression have improved substantially in recent years, understanding the physiological implications of complex patterns in gene expression data is a major challenge. This article presents GoSurfer, an easy-to-use graphical exploration tool with built-in statistical features that allow a rapid assessment of the biological functions represented in large gene sets. GoSurfer takes one or two list(s) of gene identifiers (Affymetrix probe set ID) as input and retrieves all the Gene Ontology (GO) terms associated with the input genes. GoSurfer visualises these GO terms in a hierarchical tree format. With GoSurfer, users can perform statistical tests to search for the GO terms that are enriched in the annotations of the input genes. These GO terms can be highlighted on the GO tree. Users can manipulate the GO tree in various ways and interactively query the genes associated with any GO term. The user-generated graphics can be saved as graphics files, and all the GO information related to the input genes can be exported as text files. GoSurfer is a Windows-based program freely available for noncommercial use and can be downloaded at http://www.gosurfer.org. Datasets used to construct the trees shown in the figures in this article are available at http://www.gosurfer.org/download/GoSurfer.zip.

  3. A review of statistical methods for data sets with multiple censoring points

    SciTech Connect

    Gilbert, R.O.

    1995-07-06

    This report reviews and summarizes recent literature on statistical methods for analyzing data sets that are censored by multiple censoring points. This report is organized as follows. Following the introductory comments in Section 2, a brief discussion of detection limits is given in Section 3. Sections 4 and 5 focus on data analysis methods for estimating parameters and testing hypotheses, respectively, when data sets are left censored with multiple censoring points. A list of publications that deal with a variety of other applications for censored data sets is provided in Section 6. Recommendations on future research for developing new or improved tools for statistically analyzing multiple left-censored data sets are provided in Section 7. The list of references is in Section 8.

  4. Optimization of gene set annotations via entropy minimization over variable clusters (EMVC).

    PubMed

    Frost, H Robert; Moore, Jason H

    2014-06-15

    Gene set enrichment has become a critical tool for interpreting the results of high-throughput genomic experiments. Inconsistent annotation quality and lack of annotation specificity, however, limit the statistical power of enrichment methods and make it difficult to replicate enrichment results across biologically similar datasets. We propose a novel algorithm for optimizing gene set annotations to best match the structure of specific empirical data sources. Our proposed method, entropy minimization over variable clusters (EMVC), filters the annotations for each gene set to minimize a measure of entropy across disjoint gene clusters computed for a range of cluster sizes over multiple bootstrap resampled datasets. As shown using simulated gene sets with simulated data and Molecular Signatures Database collections with microarray gene expression data, the EMVC algorithm accurately filters annotations unrelated to the experimental outcome resulting in increased gene set enrichment power and better replication of enrichment results. http://cran.r-project.org/web/packages/EMVC/index.html. © The Author 2014. Published by Oxford University Press.

  5. A statistical approach to set classification by feature selection with applications to classification of histopathology images.

    PubMed

    Jung, Sungkyu; Qiao, Xingye

    2014-09-01

    Set classification problems arise when classification tasks are based on sets of observations as opposed to individual observations. In set classification, a classification rule is trained with N sets of observations, where each set is labeled with class information, and the prediction of a class label is performed also with a set of observations. Data sets for set classification appear, for example, in diagnostics of disease based on multiple cell nucleus images from a single tissue. Relevant statistical models for set classification are introduced, which motivate a set classification framework based on context-free feature extraction. By understanding a set of observations as an empirical distribution, we employ a data-driven method to choose those features which contain information on location and major variation. In particular, the method of principal component analysis is used to extract the features of major variation. Multidimensional scaling is used to represent features as vector-valued points on which conventional classifiers can be applied. The proposed set classification approaches achieve better classification results than competing methods in a number of simulated data examples. The benefits of our method are demonstrated in an analysis of histopathology images of cell nuclei related to liver cancer.

  6. Turning publicly available gene expression data into discoveries using gene set context analysis.

    PubMed

    Ji, Zhicheng; Vokes, Steven A; Dang, Chi V; Ji, Hongkai

    2016-01-08

    Gene Set Context Analysis (GSCA) is an open source software package to help researchers use massive amounts of publicly available gene expression data (PED) to make discoveries. Users can interactively visualize and explore gene and gene set activities in 25,000+ consistently normalized human and mouse gene expression samples representing diverse biological contexts (e.g. different cells, tissues and disease types, etc.). By providing one or multiple genes or gene sets as input and specifying a gene set activity pattern of interest, users can query the expression compendium to systematically identify biological contexts associated with the specified gene set activity pattern. In this way, researchers with new gene sets from their own experiments may discover previously unknown contexts of gene set functions and hence increase the value of their experiments. GSCA has a graphical user interface (GUI). The GUI makes the analysis convenient and customizable. Analysis results can be conveniently exported as publication quality figures and tables. GSCA is available at https://github.com/zji90/GSCA. This software significantly lowers the bar for biomedical investigators to use PED in their daily research for generating and screening hypotheses, which was previously difficult because of the complexity, heterogeneity and size of the data.

  7. Reproducibility-optimized test statistic for ranking genes in microarray studies.

    PubMed

    Elo, Laura L; Filén, Sanna; Lahesmaa, Riitta; Aittokallio, Tero

    2008-01-01

    A principal goal of microarray studies is to identify the genes showing differential expression under distinct conditions. In such studies, the selection of an optimal test statistic is a crucial challenge, which depends on the type and amount of data under analysis. While previous studies on simulated or spike-in datasets do not provide practical guidance on how to choose the best method for a given real dataset, we introduce an enhanced reproducibility-optimization procedure, which enables the selection of a suitable gene-ranking statistic directly from the data. In comparison with existing ranking methods, the reproducibility-optimized statistic shows good performance consistently under various simulated conditions and on an Affymetrix spike-in dataset. Further, the feasibility of the novel statistic is confirmed in a practical research setting using data from an in-house cDNA microarray study of asthma-related gene expression changes. These results suggest that the procedure facilitates the selection of an appropriate test statistic for a given dataset without relying on a priori assumptions, which may bias the findings and their interpretation. Moreover, the general reproducibility-optimization procedure is not limited to detecting differential expression only but could be extended to a wide range of other applications as well.

  8. Joint Clustering and Component Analysis of Correspondenceless Point Sets: Application to Cardiac Statistical Modeling.

    PubMed

    Gooya, Ali; Lekadir, Karim; Alba, Xenia; Swift, Andrew J; Wild, Jim M; Frangi, Alejandro F

    2015-01-01

    Construction of Statistical Shape Models (SSMs) from arbitrary point sets is a challenging problem due to significant shape variation and lack of explicit point correspondence across the training data set. In medical imaging, point sets can generally represent different shape classes that span healthy and pathological exemplars. In such cases, the constructed SSM may not generalize well, largely because the probability density function (pdf) of the point sets deviates from the underlying assumption of Gaussian statistics. To this end, we propose a generative model for unsupervised learning of the pdf of point sets as a mixture of distinctive classes. A Variational Bayesian (VB) method is proposed for making joint inferences on the labels of point sets, and the principal modes of variations in each cluster. The method provides a flexible framework to handle point sets with no explicit point-to-point correspondences. We also show that by maximizing the marginalized likelihood of the model, the optimal number of clusters of point sets can be determined. We illustrate this work in the context of understanding the anatomical phenotype of the left and right ventricles in heart. To this end, we use a database containing hearts of healthy subjects, patients with Pulmonary Hypertension (PH), and patients with Hypertrophic Cardiomyopathy (HCM). We demonstrate that our method can outperform traditional PCA in both generalization and specificity measures.

  9. WhichGenes: a web-based tool for gathering, building, storing and exporting gene sets with application in gene set enrichment analysis.

    PubMed

    Glez-Peña, Daniel; Gómez-López, Gonzalo; Pisano, David G; Fdez-Riverola, Florentino

    2009-07-01

    WhichGenes is a web-based interactive gene set building tool offering a very simple interface to extract always-updated gene lists from multiple databases and unstructured biological data sources. While the user can specify new gene sets of interest by following a simple four-step wizard, the tool is able to run several queries in parallel. Every time a new set is generated, it is automatically added to the private gene-set cart and the user is notified by an e-mail containing a direct link to the new set stored in the server. WhichGenes provides functionalities to edit, delete and rename existing sets as well as the capability of generating new ones by combining previous existing sets (intersection, union and difference operators). The user can export his sets configuring the output format and selecting among multiple gene identifiers. In addition to the user-friendly environment, WhichGenes allows programmers to access its functionalities in a programmatic way through a Representational State Transfer web service. WhichGenes front-end is freely available at http://www.whichgenes.org/, WhichGenes API is accessible at http://www.whichgenes.org/api/.

  10. The essential gene set of a photosynthetic organism.

    PubMed

    Rubin, Benjamin E; Wetmore, Kelly M; Price, Morgan N; Diamond, Spencer; Shultzaberger, Ryan K; Lowe, Laura C; Curtin, Genevieve; Arkin, Adam P; Deutschbauer, Adam; Golden, Susan S

    2015-12-01

    Synechococcus elongatus PCC 7942 is a model organism used for studying photosynthesis and the circadian clock, and it is being developed for the production of fuel, industrial chemicals, and pharmaceuticals. To identify a comprehensive set of genes and intergenic regions that impacts fitness in S. elongatus, we created a pooled library of ∼ 250,000 transposon mutants and used sequencing to identify the insertion locations. By analyzing the distribution and survival of these mutants, we identified 718 of the organism's 2,723 genes as essential for survival under laboratory conditions. The validity of the essential gene set is supported by its tight overlap with well-conserved genes and its enrichment for core biological processes. The differences noted between our dataset and these predictors of essentiality, however, have led to surprising biological insights. One such finding is that genes in a large portion of the TCA cycle are dispensable, suggesting that S. elongatus does not require a cyclic TCA process. Furthermore, the density of the transposon mutant library enabled individual and global statements about the essentiality of noncoding RNAs, regulatory elements, and other intergenic regions. In this way, a group I intron located in tRNA(Leu), which has been used extensively for phylogenetic studies, was shown here to be essential for the survival of S. elongatus. Our survey of essentiality for every locus in the S. elongatus genome serves as a powerful resource for understanding the organism's physiology and defines the essential gene set required for the growth of a photosynthetic organism.

  11. A statistical framework for improving genomic annotations of prokaryotic essential genes.

    PubMed

    Deng, Jingyuan; Su, Shengchang; Lin, Xiaodong; Hassett, Daniel J; Lu, Long Jason

    2013-01-01

    Large-scale systematic analysis of gene essentiality is an important step closer toward unraveling the complex relationship between genotypes and phenotypes. Such analysis cannot be accomplished without unbiased and accurate annotations of essential genes. In current genomic databases, most of the essential gene annotations are derived from whole-genome transposon mutagenesis (TM), the most frequently used experimental approach for determining essential genes in microorganisms under defined conditions. However, there are substantial systematic biases associated with TM experiments. In this study, we developed a novel Poisson model-based statistical framework to simulate the TM insertion process and subsequently correct the experimental biases. We first quantitatively assessed the effects of major factors that potentially influence the accuracy of TM and subsequently incorporated relevant factors into the framework. Through iteratively optimizing parameters, we inferred the actual insertion events occurred and described each gene's essentiality on probability measure. Evaluated by the definite mapping of essential gene profile in Escherichia coli, our model significantly improved the accuracy of original TM datasets, resulting in more accurate annotations of essential genes. Our method also showed encouraging results in improving subsaturation level TM datasets. To test our model's broad applicability to other bacteria, we applied it to Pseudomonas aeruginosa PAO1 and Francisella tularensis novicida TM datasets. We validated our predictions by literature as well as allelic exchange experiments in PAO1. Our model was correct on six of the seven tested genes. Remarkably, among all three cases that our predictions contradicted the TM assignments, experimental validations supported our predictions. In summary, our method will be a promising tool in improving genomic annotations of essential genes and enabling large-scale explorations of gene essentiality. Our

  12. TransFind—predicting transcriptional regulators for gene sets

    PubMed Central

    Kiełbasa, Szymon M.; Klein, Holger; Roider, Helge G.; Vingron, Martin; Blüthgen, Nils

    2010-01-01

    The analysis of putative transcription factor binding sites in promoter regions of coregulated genes makes it possible to infer the transcription factors that underlie observed changes in gene expression. While such analyses constitute a central component of the in-silico characterization of transcriptional regulatory networks, there is still a lack of simple-to-use web servers that combine state-of-the-art prediction methods with phylogenetic analysis and appropriate multiple-testing-corrected statistics and that return results within a short time. With these aims in mind, we developed TransFind, which is freely available at http://transfind.sys-bio.net/. PMID:20511592

  13. Improving Gene-Set Enrichment Analysis of RNA-Seq Data with Small Replicates.

    PubMed

    Yoon, Sora; Kim, Seon-Young; Nam, Dougu

    2016-01-01

    Deregulated pathways identified from transcriptome data of two sample groups have played a key role in many genomic studies. Gene-set enrichment analysis (GSEA) has been commonly used for pathway or functional analysis of microarray data, and it is also being applied to RNA-seq data. However, most RNA-seq data so far have only a small number of replicates. This forces the use of the gene-permuting GSEA method (or preranked GSEA), which results in a great number of false positives due to the inter-gene correlation in each gene-set. We demonstrate that incorporating the absolute gene statistic in one-tailed GSEA considerably improves the false-positive control and the overall discriminatory ability of the gene-permuting GSEA methods for RNA-seq data. To test the performance, a simulation method to generate correlated read counts within a gene-set was newly developed, and a dozen currently available RNA-seq enrichment analysis methods were compared, where the proposed methods outperformed others that do not account for the inter-gene correlation. Analysis of real RNA-seq data also supported the proposed methods in terms of false positive control, ranks of true positives and biological relevance. An efficient R package (AbsFilterGSEA) coded with C++ (Rcpp) is available from CRAN.

  14. Methods of artificial enlargement of the training set for statistical shape models.

    PubMed

    Koikkalainen, Juha; Tölli, Tuomas; Lauerma, Kirsi; Antila, Kari; Mattila, Elina; Lilja, Mikko; Lötjönen, Jyrki

    2008-11-01

    Due to the small size of training sets, statistical shape models often over-constrain the deformation in medical image segmentation. Hence, artificial enlargement of the training set has been proposed as a solution for the problem to increase the flexibility of the models. In this paper, different methods were evaluated to artificially enlarge a training set. Furthermore, the objectives were to study the effects of the size of the training set, to estimate the optimal number of deformation modes, to study the effects of different error sources, and to compare different deformation methods. The study was performed for a cardiac shape model consisting of ventricles, atria, and epicardium, and built from magnetic resonance (MR) volume images of 25 subjects. Both shape modeling and image segmentation accuracies were studied. The objectives were reached by utilizing different training sets and datasets, and two deformation methods. The evaluation proved that artificial enlargement of the training set improves both the modeling and segmentation accuracy. All but one enlargement techniques gave statistically significantly (p < 0.05) better segmentation results than the standard method without enlargement. The two best enlargement techniques were the nonrigid movement technique and the technique that combines principal component analysis (PCA) and finite element model (FEM). The optimal number of deformation modes was found to be near 100 modes in our application. The active shape model segmentation gave better segmentation accuracy than the one based on the simulated annealing optimization of the model weights.

  15. Statistical Mechanics of Horizontal Gene Transfer in Evolutionary Ecology

    NASA Astrophysics Data System (ADS)

    Chia, Nicholas; Goldenfeld, Nigel

    2011-04-01

    The biological world, especially its majority microbial component, is strongly interacting and may be dominated by collective effects. In this review, we provide a brief introduction for statistical physicists of the way in which living cells communicate genetically through transferred genes, as well as the ways in which they can reorganize their genomes in response to environmental pressure. We discuss how genome evolution can be thought of as related to the physical phenomenon of annealing, and describe the sense in which genomes can be said to exhibit an analogue of information entropy. As a direct application of these ideas, we analyze the variation with ocean depth of transposons in marine microbial genomes, predicting trends that are consistent with recent observations using metagenomic surveys.

  16. Gene expression data analysis using closed item set mining for labeled data.

    PubMed

    Rotter, Ana; Novak, Petra Kralj; Baebler, Spela; Toplak, Natasa; Blejec, Andrej; Lavrac, Nada; Gruden, Kristina

    2010-04-01

    This article presents an approach to microarray data analysis using discretised expression values in combination with a methodology of closed item set mining for class labeled data (RelSets). A statistical 2 x 2 factorial design analysis was run in parallel. The approach was validated on two independent sets of two-color microarray experiments using potato plants. Our results demonstrate that the two different analytical procedures, applied on the same data, are adequate for solving two different biological questions being asked. Statistical analysis is appropriate if an overview of the consequences of treatments and their interaction terms on the studied system is needed. If, on the other hand, a list of genes whose expression (upregulation or downregulation) differentiates between classes of data is required, the use of the RelSets algorithm is preferred. The used algorithms are freely available upon request to the authors.

  17. Gene set analysis of survival following ovarian cancer implicates macrolide binding and intracellular signaling genes.

    PubMed

    Fridley, Brooke L; Jenkins, Gregory D; Tsai, Ya-Yu; Song, Honglin; Bolton, Kelly L; Fenstermacher, David; Tyrer, Jonathan; Ramus, Susan J; Cunningham, Julie M; Vierkant, Robert A; Chen, Zhihua; Chen, Y Ann; Iversen, Ed; Menon, Usha; Gentry-Maharaj, Aleksandra; Schildkraut, Joellen; Sutphen, Rebecca; Gayther, Simon A; Hartmann, Lynn C; Pharoah, Paul D P; Sellers, Thomas A; Goode, Ellen L

    2012-03-01

    Genome-wide association studies (GWAS) for epithelial ovarian cancer (EOC), the most lethal gynecologic malignancy, have identified novel susceptibility loci. GWAS for survival after EOC have had more limited success. The association of each single-nucleotide polymorphism (SNP) individually may not be well suited to detect small effects of multiple SNPs, such as those operating within the same biologic pathway. Gene set analysis (GSA) overcomes this limitation by assessing overall evidence for association of a phenotype with all measured variation in a set of genes. To determine gene sets associated with EOC overall survival, we conducted GSA using data from two large GWAS (N cases = 2,813, N deaths = 1,116), with a novel Principal Component-Gamma GSA method. Analysis was completed for all cases and then separately for high-grade serous histologic subtype. Analysis of the high-grade serous subjects resulted in 43 gene sets with P < 0.005 (1.7%); of these, 21 gene sets had P < 0.10 in both GWAS, including intracellular signaling pathway (P = 7.3 × 10(-5)) and macrolide binding (P = 6.2 × 10(-4)) gene sets. The top gene sets in analysis of all cases were meiotic mismatch repair (P = 6.3 × 10(-4)) and macrolide binding (P = 1.0 × 10(-3)). Of 18 gene sets with P < 0.005 (0.7%), eight had P < 0.10 in both GWAS. This research detected novel gene sets associated with EOC survival. Novel gene sets associated with EOC survival might lead to new insights and avenues for development of novel therapies for EOC and pharmacogenomic studies. ©2012 AACR.

  18. A Complete Set of Nascent Transcription Rates for Yeast Genes

    PubMed Central

    Pelechano, Vicent; Chávez, Sebastián; Pérez-Ortín, José E.

    2010-01-01

    The amount of mRNA in a cell is the result of two opposite reactions: transcription and mRNA degradation. These reactions are governed by kinetics laws, and the most regulated step for many genes is the transcription rate. The transcription rate, which is assumed to be exercised mainly at the RNA polymerase recruitment level, can be calculated using the RNA polymerase densities determined either by run-on or immunoprecipitation using specific antibodies. The yeast Saccharomyces cerevisiae is the ideal model organism to generate a complete set of nascent transcription rates that will prove useful for many gene regulation studies. By combining genomic data from both the GRO (Genomic Run-on) and the RNA pol ChIP-on-chip methods we generated a new, more accurate nascent transcription rate dataset. By comparing this dataset with the indirect ones obtained from the mRNA stabilities and mRNA amount datasets, we are able to obtain biological information about posttranscriptional regulation processes and a genomic snapshot of the location of the active transcriptional machinery. We have obtained nascent transcription rates for 4,670 yeast genes. The median RNA polymerase II density in the genes is 0.078 molecules/kb, which corresponds to an average of 0.096 molecules/gene. Most genes have transcription rates of between 2 and 30 mRNAs/hour and less than 1% of yeast genes have >1 RNA polymerase molecule/gene. Histone and ribosomal protein genes are the highest transcribed groups of genes and other than these exceptions the transcription of genes is an infrequent phenomenon in a yeast cell. PMID:21103382

  19. Exact statistical tests for the intersection of independent lists of genes

    PubMed Central

    NATARAJAN, LOKI; PU, MINYA; MESSER, KAREN

    2012-01-01

    Public data repositories have enabled researchers to compare results across multiple genomic studies in order to replicate findings. A common approach is to first rank genes according to an hypothesis of interest within each study. Then, lists of the top-ranked genes within each study are compared across studies. Genes recaptured as highly ranked (usually above some threshold) in multiple studies are considered to be significant. However, this comparison strategy often remains informal, in that Type I error and false discovery rate are usually uncontrolled. In this paper, we formalize an inferential strategy for this kind of list-intersection discovery test. We show how to compute a p-value associated with a `recaptured' set of genes, using a closed-form Poisson approximation to the distribution of the size of the recaptured set. The distribution of the test statistic depends on the rank threshold and the number of studies within which a gene must be recaptured. We use a Poisson approximation to investigate operating characteristics of the test. We give practical guidance on how to design a bioinformatic list-intersection study with prespecified control of Type I error (at the set level) and false discovery rate (at the gene level). We show how choice of test parameters will affect the expected proportion of significant genes identified. We present a strategy for identifying optimal choice of parameters, depending on the particular alternative hypothesis which might hold. We illustrate our methods using prostate cancer gene-expression datasets from the curated Oncomine database. PMID:23335952
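
    The list-intersection test described above can be illustrated with a small Python sketch. It is not the authors' implementation: it assumes a simple null model in which each study ranks the same genes independently and uniformly at random, so that the number of recaptured genes is approximately Poisson. The function names and the example numbers are hypothetical.

```python
import math


def binomial_tail(k, p, m):
    """P(Bin(k, p) >= m): chance a gene is top-ranked in at least m of k studies."""
    return sum(math.comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(m, k + 1))


def poisson_tail(lam, x):
    """P(Poisson(lam) >= x) for a non-negative integer x."""
    if x <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(x))
    return max(0.0, 1.0 - cdf)


def list_intersection_pvalue(n_genes, rank_threshold, n_studies, min_studies, observed):
    """Poisson-approximation p-value for the size of a recaptured gene set.

    Assumed null model (an illustration, not taken from the paper): rankings
    are independent and uniformly random across studies.
    """
    p_top = rank_threshold / n_genes                      # gene in one top list
    p_recapture = binomial_tail(n_studies, p_top, min_studies)
    lam = n_genes * p_recapture                           # expected recaptured count
    return lam, poisson_tail(lam, observed)


# Hypothetical design: 10,000 genes, top-100 lists, 3 studies,
# a gene counts as recaptured if it is in at least 2 lists; 10 genes observed.
lam, p_value = list_intersection_pvalue(10_000, 100, 3, 2, 10)
print(f"expected recaptured genes under the null: {lam:.2f}")
print(f"p-value: {p_value:.4g}")
```

    Varying the rank threshold and the required number of studies in this sketch shows how the expected size of the recaptured set under the null, and therefore the p-value, depends on the two design parameters named in the abstract.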

  20. Key genes and pathways in thyroid cancer based on gene set enrichment analysis.

    PubMed

    He, Wenwu; Qi, Bin; Zhou, Qiuxi; Lu, Chuansen; Huang, Qi; Xian, Lei; Chen, Mingwu

    2013-09-01

    The incidence of thyroid cancer and its associated morbidity has shown the most rapid increase among all cancers since 1982, but the mechanisms involved in thyroid cancer, particularly significant key genes induced in thyroid cancer, remain undefined. In many studies, gene probes have been used to search for key genes involved in causing and facilitating thyroid cancer. As a result, many possible virulence genes and pathways have been identified. However, these studies lack a case contrast for selecting the most possible virulence genes and pathways, as well as conclusive results with which to clarify the mechanisms of cancer development. In the present study, we used gene set enrichment and meta-analysis to select key genes and pathways. Based on gene set enrichment, we identified 5 downregulated and 4 upregulated mixed pathways in 6 tissue datasets. Based on the meta-analysis, there were 17 common pathways in the tissue datasets. One pathway, the p53 signaling pathway, which includes 13 genes, was identified by both the gene set enrichment analysis and meta-analysis. Genes are important elements that form key pathways. These pathways can induce the development of thyroid cancer later in life. The key pathways and genes identified in the present study can be used in the next stage of research, which will involve gene elimination and other methods of experimentation.

  1. Comparisons of power of statistical methods for gene-environment interaction analyses.

    PubMed

    Ege, Markus J; Strachan, David P

    2013-10-01

    Any genome-wide analysis is hampered by reduced statistical power due to multiple comparisons. This is particularly true for interaction analyses, which have lower statistical power than analyses of associations. To assess gene-environment interactions in population settings we have recently proposed a statistical method based on a modified two-step approach, where first genetic loci are selected by their associations with disease and environment, respectively, and subsequently tested for interactions. We have simulated various data sets resembling real world scenarios and compared single-step and two-step approaches with respect to true positive rate (TPR) in 486 scenarios and (study-wide) false positive rate (FPR) in 252 scenarios. Our simulations confirmed that in all two-step methods the two steps are not correlated. In terms of TPR, two-step approaches combining information on gene-disease association and gene-environment association in the first step were superior to all other methods, while preserving a low FPR in over 250 million simulations under the null hypothesis. Our weighted modification yielded the highest power across various degrees of gene-environment association in the controls. An optimal threshold for step 1 depended on the interacting allele frequency and the disease prevalence. In all scenarios, the least powerful method was to proceed directly to an unbiased full interaction model, applying conventional genome-wide significance thresholds. This simulation study confirms the practical advantage of two-step approaches to interaction testing over more conventional one-step designs, at least in the context of dichotomous disease outcomes and other parameters that might apply in real-world settings.

  2. CoGA: An R Package to Identify Differentially Co-Expressed Gene Sets by Analyzing the Graph Spectra.

    PubMed

    Santos, Suzana de Siqueira; Galatro, Thais Fernanda de Almeida; Watanabe, Rodrigo Akira; Oba-Shinjo, Sueli Mieko; Nagahashi Marie, Suely Kazue; Fujita, André

    2015-01-01

    Gene set analysis aims to identify predefined sets of functionally related genes that are differentially expressed between two conditions. Although gene set analysis has been very successful, by incorporating biological knowledge about the gene sets and enhancing statistical power over gene-by-gene analyses, it does not take into account the correlation (association) structure among the genes. In this work, we present CoGA (Co-expression Graph Analyzer), an R package for the identification of groups of differentially associated genes between two phenotypes. The analysis is based on concepts of Information Theory applied to the spectral distributions of the gene co-expression graphs, such as the spectral entropy to measure the randomness of a graph structure and the Jensen-Shannon divergence to discriminate classes of graphs. The package also includes common measures to compare gene co-expression networks in terms of their structural properties, such as centrality, degree distribution, shortest path length, and clustering coefficient. Besides the structural analyses, CoGA also includes graphical interfaces for visual inspection of the networks, ranking of genes according to their "importance" in the network, and the standard differential expression analysis. We show by both simulation experiments and analyses of real data that the statistical tests performed by CoGA indeed control the rate of false positives and are able to identify differentially co-expressed genes that other methods failed to detect.
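
    The two information-theoretic quantities named above can be sketched in a few lines of Python. This is a simplified illustration and not the CoGA code: it assumes the graph eigenvalues have already been computed, and it approximates each spectral distribution with a fixed-bin histogram (the binning, the function names, and the toy eigenvalues are all assumptions made for the example).

```python
import math


def histogram(values, lo, hi, bins):
    """Normalized histogram of values on [lo, hi] (a crude spectral density)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp edge values into range
        counts[idx] += 1
    return [c / len(values) for c in counts]


def shannon_entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q); q must be > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy eigenvalue lists standing in for the spectra of two co-expression graphs.
spectrum_a = [-2.0, -1.0, -1.0, 0.0, 1.0, 3.0]
spectrum_b = [-1.5, -0.5, 0.0, 0.0, 0.5, 1.5]

dens_a = histogram(spectrum_a, -3.0, 3.0, 6)
dens_b = histogram(spectrum_b, -3.0, 3.0, 6)

print("spectral entropy A:", shannon_entropy(dens_a))
print("spectral entropy B:", shannon_entropy(dens_b))
print("Jensen-Shannon divergence:", jensen_shannon(dens_a, dens_b))
```

    A divergence near zero indicates that the two spectra, and hence the two graph structures, are similar; larger values indicate that the graphs differ.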

  3. Multi-edge gene set networks reveal novel insights into global relationships between biological themes.

    PubMed

    Parikh, Jignesh R; Xia, Yu; Marto, Jarrod A

    2012-01-01

    Curated gene sets from databases such as KEGG Pathway and Gene Ontology are often used to systematically organize lists of genes or proteins derived from high-throughput data. However, the information content inherent to some relationships between the interrogated gene sets, such as pathway crosstalk, is often underutilized. A gene set network, where nodes representing individual gene sets such as KEGG pathways are connected to indicate a functional dependency, is well suited to visualize and analyze global gene set relationships. Here we introduce a novel gene set network construction algorithm that integrates gene lists derived from high-throughput experiments with curated gene sets to construct co-enrichment gene set networks. Along with previously described co-membership and linkage algorithms, we apply the co-enrichment algorithm to eight gene set collections to construct integrated multi-evidence gene set networks with multiple edge types connecting gene sets. We demonstrate the utility of the approach through examples of novel gene set networks such as the chromosome map co-differential expression gene set network. A total of twenty-four gene set networks are exposed via a web tool called MetaNet, where context-specific multi-edge gene set networks are constructed from enriched gene sets within user-defined gene lists. MetaNet is freely available at http://blaispathways.dfci.harvard.edu/metanet/.

  4. Statistical plant set estimation using Schroeder-phased multisinusoidal input design

    NASA Technical Reports Server (NTRS)

    Bayard, D. S.

    1992-01-01

    A frequency domain method is developed for plant set estimation. The estimation of a plant 'set' rather than a point estimate is required to support many methods of modern robust control design. The approach here is based on using a Schroeder-phased multisinusoid input design which has the special property of placing input energy only at the discrete frequency points used in the computation. A detailed analysis of the statistical properties of the frequency domain estimator is given, leading to exact expressions for the probability distribution of the estimation error, and many important properties. It is shown that, for any nominal parametric plant estimate, one can use these results to construct an overbound on the additive uncertainty to any prescribed statistical confidence. The 'soft' bound thus obtained can be used to replace 'hard' bounds presently used in many robust control analysis and synthesis methods.
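
    The input-design property mentioned above, energy placed only at the chosen discrete frequencies, can be checked numerically. The Python sketch below is an illustration rather than the report's procedure: it assumes unit-amplitude harmonics and the commonly quoted Schroeder phase rule phi_k = -pi k (k - 1) / N, which may differ in sign or offset from the variant used in the report. All function names are hypothetical.

```python
import cmath
import math


def schroeder_phases(n_harmonics):
    """Schroeder phases for a flat-amplitude multisine (assumed variant)."""
    n = n_harmonics
    return [-math.pi * k * (k - 1) / n for k in range(1, n + 1)]


def multisine(n_harmonics, n_samples, phases):
    """One period of sum_k cos(2*pi*k*t/n_samples + phase_k), k = 1..n_harmonics."""
    return [
        sum(
            math.cos(2 * math.pi * k * t / n_samples + phases[k - 1])
            for k in range(1, n_harmonics + 1)
        )
        for t in range(n_samples)
    ]


def crest_factor(x):
    """Peak absolute value divided by RMS value."""
    peak = max(abs(v) for v in x)
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return peak / rms


def dft_bin(x, b):
    """Single DFT coefficient of x at integer bin b."""
    m = len(x)
    return sum(v * cmath.exp(-2j * math.pi * b * t / m) for t, v in enumerate(x))


N_HARMONICS, N_SAMPLES = 32, 1024
x_zero = multisine(N_HARMONICS, N_SAMPLES, [0.0] * N_HARMONICS)
x_schroeder = multisine(N_HARMONICS, N_SAMPLES, schroeder_phases(N_HARMONICS))

print("crest factor, zero phases:     ", crest_factor(x_zero))
print("crest factor, Schroeder phases:", crest_factor(x_schroeder))
print("|DFT| at excited bin 5:  ", abs(dft_bin(x_schroeder, 5)))
print("|DFT| at unexcited bin 40:", abs(dft_bin(x_schroeder, 40)))
```

    Both signals have the same amplitude spectrum, so the phase choice changes only the time-domain peak, while the DFT is nonzero only at the excited bins 1 to 32.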

  5. Parallel evolution of nacre building gene sets in molluscs.

    PubMed

    Jackson, Daniel J; McDougall, Carmel; Woodcroft, Ben; Moase, Patrick; Rose, Robert A; Kube, Michael; Reinhardt, Richard; Rokhsar, Daniel S; Montagnani, Caroline; Joubert, Caroline; Piquemal, David; Degnan, Bernard M

    2010-03-01

    The capacity to biomineralize is closely linked to the rapid expansion of animal life during the early Cambrian, with many skeletonized phyla first appearing in the fossil record at this time. The appearance of disparate molluscan forms during this period leaves open the possibility that shells evolved independently and in parallel in at least some groups. To test this proposition and gain insight into the evolution of structural genes that contribute to shell fabrication, we compared genes expressed in nacre (mother-of-pearl) forming cells in the mantle of the bivalve Pinctada maxima and the gastropod Haliotis asinina. Despite both species having highly lustrous nacre, we find extensive differences in these expressed gene sets. Following the removal of housekeeping genes, less than 10% of all gene clusters are shared between these molluscs, with some being conserved biomineralization genes that are also found in deuterostomes. These differences extend to secreted proteins that may localize to the organic shell matrix, with less than 15% of this secretome being shared. Despite these differences, H. asinina and P. maxima both secrete proteins with repetitive low-complexity domains (RLCDs). Pinctada maxima RLCD proteins-for example, the shematrins-are predominated by silk/fibroin-like domains, which are absent from the H. asinina data set. Comparisons of shematrin genes across three species of Pinctada indicate that this gene family has undergone extensive divergent evolution within pearl oysters. We also detect fundamental bivalve-gastropod differences in extracellular matrix proteins involved in mollusc-shell formation. Pinctada maxima expresses a chitin synthase at high levels and several chitin deacetylation genes, whereas only one protein involved in chitin interactions is present in the H. asinina data set, suggesting that the organic matrix on which calcification proceeds differs fundamentally between these species. 
Large-scale differences in genes expressed

  6. Quantum Statistical Mechanical Derivation of the Second Law of Thermodynamics: A Hybrid Setting Approach.

    PubMed

    Tasaki, Hal

    2016-04-29

    Based on quantum statistical mechanics and microscopic quantum dynamics, we prove Planck's and Kelvin's principles for macroscopic systems in a general and realistic setting. We consider a hybrid quantum system that consists of the thermodynamic system, which is initially in thermal equilibrium, and the "apparatus" which operates on the former, and assume that the whole system evolves autonomously. This provides a satisfactory derivation of the second law for macroscopic systems.

  7. The essential gene set of a photosynthetic organism

    DOE PAGES

    Rubin, Benjamin E.; Wetmore, Kelly M.; Price, Morgan N.; ...

    2015-10-27

    Synechococcus elongatus PCC 7942 is a model organism used for studying photosynthesis and the circadian clock, and it is being developed for the production of fuel, industrial chemicals, and pharmaceuticals. To identify a comprehensive set of genes and intergenic regions that impacts fitness in S. elongatus, we created a pooled library of ~250,000 transposon mutants and used sequencing to identify the insertion locations. By analyzing the distribution and survival of these mutants, we identified 718 of the organism's 2,723 genes as essential for survival under laboratory conditions. The validity of the essential gene set is supported by its tight overlap with well-conserved genes and its enrichment for core biological processes. The differences noted between our dataset and these predictors of essentiality, however, have led to surprising biological insights. One such finding is that genes in a large portion of the TCA cycle are dispensable, suggesting that S. elongatus does not require a cyclic TCA process. Furthermore, the density of the transposon mutant library enabled individual and global statements about the essentiality of noncoding RNAs, regulatory elements, and other intergenic regions. In this way, a group I intron located in tRNALeu, which has been used extensively for phylogenetic studies, was shown here to be essential for the survival of S. elongatus. Our survey of essentiality for every locus in the S. elongatus genome serves as a powerful resource for understanding the organism's physiology and defines the essential gene set required for the growth of a photosynthetic organism.

  8. The essential gene set of a photosynthetic organism

    PubMed Central

    Rubin, Benjamin E.; Wetmore, Kelly M.; Price, Morgan N.; Diamond, Spencer; Shultzaberger, Ryan K.; Lowe, Laura C.; Curtin, Genevieve; Arkin, Adam P.; Deutschbauer, Adam; Golden, Susan S.

    2015-01-01

    Synechococcus elongatus PCC 7942 is a model organism used for studying photosynthesis and the circadian clock, and it is being developed for the production of fuel, industrial chemicals, and pharmaceuticals. To identify a comprehensive set of genes and intergenic regions that impacts fitness in S. elongatus, we created a pooled library of ∼250,000 transposon mutants and used sequencing to identify the insertion locations. By analyzing the distribution and survival of these mutants, we identified 718 of the organism’s 2,723 genes as essential for survival under laboratory conditions. The validity of the essential gene set is supported by its tight overlap with well-conserved genes and its enrichment for core biological processes. The differences noted between our dataset and these predictors of essentiality, however, have led to surprising biological insights. One such finding is that genes in a large portion of the TCA cycle are dispensable, suggesting that S. elongatus does not require a cyclic TCA process. Furthermore, the density of the transposon mutant library enabled individual and global statements about the essentiality of noncoding RNAs, regulatory elements, and other intergenic regions. In this way, a group I intron located in tRNALeu, which has been used extensively for phylogenetic studies, was shown here to be essential for the survival of S. elongatus. Our survey of essentiality for every locus in the S. elongatus genome serves as a powerful resource for understanding the organism’s physiology and defines the essential gene set required for the growth of a photosynthetic organism. PMID:26508635

  9. GeneTopics - interpretation of gene sets via literature-driven topic models

    PubMed Central

    2013-01-01

    Background Annotation of a set of genes is often accomplished through comparison to a library of labelled gene sets such as biological processes or canonical pathways. However, this approach might fail if the employed libraries are not up to date with the latest research, don't capture relevant biological themes or are curated at a different level of granularity than is required to appropriately analyze the input gene set. At the same time, the vast biomedical literature offers an unstructured repository of the latest research findings that can be tapped to provide thematic sub-groupings for any input gene set. Methods Our proposed method relies on a gene-specific text corpus and extracts commonalities between documents in an unsupervised manner using a topic model approach. We automatically determine the number of topics summarizing the corpus and calculate a gene relevancy score for each topic allowing us to eliminate non-specific topics. As a result we obtain a set of literature topics in which each topic is associated with a subset of the input genes providing directly interpretable keywords and corresponding documents for literature research. Results We validate our method based on labelled gene sets from the KEGG metabolic pathway collection and the genetic association database (GAD) and show that the approach is able to detect topics consistent with the labelled annotation. Furthermore, we discuss the results on three different types of experimentally derived gene sets, (1) differentially expressed genes from a cardiac hypertrophy experiment in mice, (2) altered transcript abundance in human pancreatic beta cells, and (3) genes implicated by GWA studies to be associated with metabolite levels in a healthy population. In all three cases, we are able to replicate findings from the original papers in a quick and semi-automated manner. 
Conclusions Our approach provides a novel way of automatically generating meaningful annotations for gene sets that are directly

  10. Generalized shrinkage F-like statistics for testing an interaction term in gene expression analysis in the presence of heteroscedasticity

    PubMed Central

    2011-01-01

    Background Many analyses of gene expression data involve hypothesis tests of an interaction term between two fixed effects, typically tested using a residual variance. In expression studies, the issue of variance heteroscedasticity has received much attention, and previous work has focused on either between-gene or within-gene heteroscedasticity. However, in a single experiment, heteroscedasticity may exist both within and between genes. Here we develop flexible shrinkage error estimators considering both between-gene and within-gene heteroscedasticity and use them to construct F-like test statistics for testing interactions, with cutoff values obtained by permutation. These permutation tests are complicated, and several permutation tests are investigated here. Results Our proposed test statistics are compared with other existing shrinkage-type test statistics through extensive simulation studies and a real data example. The results show that the choice of permutation procedures has dramatically more influence on detection power than the choice of F or F-like test statistics. When both types of gene heteroscedasticity exist, our proposed test statistics can control preselected type-I errors and are more powerful. Raw data permutation is not valid in this setting. Whether unrestricted or restricted residual permutation should be used depends on the specific type of test statistic. Conclusions The F-like test statistic that uses the proposed flexible shrinkage error estimator considering both types of gene heteroscedasticity and unrestricted residual permutation can provide a statistically valid and powerful test. Therefore, we recommend that it always be applied in the analysis of real gene expression data to test an interaction term. PMID:22044602

  11. Generalized shrinkage F-like statistics for testing an interaction term in gene expression analysis in the presence of heteroscedasticity.

    PubMed

    Yang, Jie; Casella, George; McIntyre, Lauren M

    2011-11-01

    Many analyses of gene expression data involve hypothesis tests of an interaction term between two fixed effects, typically tested using a residual variance. In expression studies, the issue of variance heteroscedasticity has received much attention, and previous work has focused on either between-gene or within-gene heteroscedasticity. However, in a single experiment, heteroscedasticity may exist both within and between genes. Here we develop flexible shrinkage error estimators considering both between-gene and within-gene heteroscedasticity and use them to construct F-like test statistics for testing interactions, with cutoff values obtained by permutation. These permutation tests are complicated, and several permutation tests are investigated here. Our proposed test statistics are compared with other existing shrinkage-type test statistics through extensive simulation studies and a real data example. The results show that the choice of permutation procedures has dramatically more influence on detection power than the choice of F or F-like test statistics. When both types of gene heteroscedasticity exist, our proposed test statistics can control preselected type-I errors and are more powerful. Raw data permutation is not valid in this setting. Whether unrestricted or restricted residual permutation should be used depends on the specific type of test statistic. The F-like test statistic that uses the proposed flexible shrinkage error estimator considering both types of gene heteroscedasticity and unrestricted residual permutation can provide a statistically valid and powerful test. Therefore, we recommend that it always be applied in the analysis of real gene expression data to test an interaction term.

  12. Fully moderated T-statistic for small sample size gene expression arrays.

    PubMed

    Yu, Lianbo; Gulati, Parul; Fernandez, Soledad; Pennell, Michael; Kirschner, Lawrence; Jarjoura, David

    2011-09-15

    Gene expression microarray experiments with few replications lead to great variability in estimates of gene variances. Several Bayesian methods have been developed to reduce this variability and to increase power. Thus far, moderated t methods assumed a constant coefficient of variation (CV) for the gene variances. We provide evidence against this assumption, and extend the method by allowing the CV to vary with gene expression. Our CV varying method, which we refer to as the fully moderated t-statistic, was compared to three other methods (ordinary t, and two moderated t predecessors). A simulation study and a familiar spike-in data set were used to assess the performance of the testing methods. The results showed that our CV varying method had higher power than the other three methods, identified a greater number of true positives in spike-in data, fit simulated data under varying assumptions very well, and in a real data set better identified higher expressing genes that were consistent with functional pathways associated with the experiments.
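
    The variance-moderation idea behind moderated t methods can be sketched as follows. This is a minimal illustration of the generic shrinkage step only, not the authors' fully moderated statistic: the prior variance and prior degrees of freedom (prior_var, prior_df) are hypothetical inputs that would normally be estimated from all genes, and the sketch does not let them vary with expression level.

```python
import math
from statistics import mean, variance


def moderated_t(group_a, group_b, prior_var, prior_df):
    """Two-sample t-statistic whose pooled variance is shrunk toward a prior value.

    prior_var and prior_df are assumed to be supplied by the caller; in practice
    they would be estimated from the variances of all genes on the array.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    # Weighted average of prior and observed variance, weighted by degrees of freedom.
    shrunk = (prior_df * prior_var + df * pooled) / (prior_df + df)
    se = math.sqrt(shrunk * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / se, prior_df + df


# With prior_df = 0 the statistic reduces to the ordinary pooled-variance t.
ordinary_t, ordinary_df = moderated_t([8.1, 8.3, 8.2], [7.6, 7.9, 7.7], 0.05, 0)
shrunk_t, shrunk_df = moderated_t([8.1, 8.3, 8.2], [7.6, 7.9, 7.7], 0.05, 4)
```

    With only three replicates per group the observed variance is very noisy, so borrowing a prior variance stabilises the denominator and adds degrees of freedom.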

  13. Experiences Running a Parallel Answer Set Solver on Blue Gene

    NASA Astrophysics Data System (ADS)

    Schneidenbach, Lars; Schnor, Bettina; Gebser, Martin; Kaminski, Roland; Kaufmann, Benjamin; Schaub, Torsten

    This paper presents the concept of parallelisation of a solver for Answer Set Programming (ASP). While there already exist some approaches to parallel ASP solving, there was a lack of a parallel version of the powerful clasp solver. We implemented a parallel version of clasp based on message-passing. Experimental results on Blue Gene P/L indicate the potential of such an approach.

  14. Gene set analysis approaches for RNA-seq data: performance evaluation and application guideline

    PubMed Central

    Rahmatallah, Yasir; Emmert-Streib, Frank

    2016-01-01

    Transcriptome sequencing (RNA-seq) is gradually replacing microarrays for high-throughput studies of gene expression. The main challenge of analyzing microarray data is not in finding differentially expressed genes, but in gaining insights into the biological processes underlying phenotypic differences. To interpret experimental results from microarrays, gene set analysis (GSA) has become the method of choice, in particular because it incorporates pre-existing biological knowledge (in a form of functionally related gene sets) into the analysis. Here we provide a brief review of several statistically different GSA approaches (competitive and self-contained) that can be adapted from microarrays practice as well as those specifically designed for RNA-seq. We evaluate their performance (in terms of Type I error rate, power, robustness to the sample size and heterogeneity, as well as the sensitivity to different types of selection biases) on simulated and real RNA-seq data. Not surprisingly, the performance of various GSA approaches depends only on the statistical hypothesis they test and does not depend on whether the test was developed for microarrays or RNA-seq data. Interestingly, we found that competitive methods have lower power as well as robustness to the samples heterogeneity than self-contained methods, leading to poor results reproducibility. We also found that the power of unsupervised competitive methods depends on the balance between up- and down-regulated genes in tested gene sets. These properties of competitive methods have been overlooked before. Our evaluation provides a concise guideline for selecting GSA approaches, best performing under particular experimental settings in the context of RNA-seq. PMID:26342128

  15. Shrinkage covariance matrix approach based on robust trimmed mean in gene sets detection

    NASA Astrophysics Data System (ADS)

    Karjanto, Suryaefiza; Ramli, Norazan Mohamed; Ghani, Nor Azura Md; Aripin, Rasimah; Yusop, Noorezatty Mohd

    2015-02-01

    Microarray technology involves placing an orderly arrangement of thousands of gene sequences in a grid on a suitable surface. The technology has enabled novel discoveries since its development and has attracted increasing attention among researchers. The widespread use of microarray technology is largely due to its ability to analyse thousands of genes simultaneously, in a massively parallel manner, in one experiment. Hence, it provides valuable knowledge on gene interaction and function. A microarray data set typically consists of tens of thousands of genes (variables) from just dozens of samples due to various constraints. Therefore, the sample covariance matrix in Hotelling's T2 statistic is not positive definite and becomes singular, so it cannot be inverted. In this research, Hotelling's T2 statistic is combined with a shrinkage approach as an alternative estimator of the covariance matrix to detect significant gene sets. The shrinkage covariance matrix overcomes the singularity problem by converting an unbiased estimator of the covariance matrix into an improved biased one. A robust trimmed mean is integrated into the shrinkage matrix to reduce the influence of outliers and consequently increase its efficiency. The performance of the proposed method is measured using several simulation designs. The results are expected to outperform existing techniques in many tested conditions.
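
    A rough sketch of how shrinkage makes a Hotelling-type statistic computable when the sample covariance is singular is given below. It is an illustration under stated assumptions rather than the authors' estimator: it uses a one-sample statistic, shrinks toward the diagonal of the sample covariance with a fixed, hypothetical intensity lam (a real method would estimate it), and centres the data on a simple trimmed mean.

```python
from statistics import mean


def trimmed_mean(values, trim=0.2):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    return mean(ordered[k:len(ordered) - k] if k else ordered)


def covariance(rows, centre):
    """Covariance matrix of rows (samples x genes) about the given centre."""
    n, p = len(rows), len(centre)
    return [[sum((r[i] - centre[i]) * (r[j] - centre[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def shrink(cov, lam):
    """Blend cov with its own diagonal: lam * diag(cov) + (1 - lam) * cov."""
    p = len(cov)
    return [[cov[i][j] if i == j else (1 - lam) * cov[i][j] for j in range(p)]
            for i in range(p)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def hotelling_t2(rows, mu0, lam=0.3, trim=0.2):
    """One-sample T2 = n * d' S*^-1 d with a trimmed-mean centre and shrunk covariance."""
    n, p = len(rows), len(mu0)
    centre = [trimmed_mean([r[i] for r in rows], trim) for i in range(p)]
    s_star = shrink(covariance(rows, centre), lam)
    d = [centre[i] - mu0[i] for i in range(p)]
    return n * sum(di * wi for di, wi in zip(d, solve(s_star, d)))


# Three samples of three genes: the plain sample covariance has rank 2 at most,
# but the shrunk matrix is positive definite, so the statistic is defined.
t2 = hotelling_t2([[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [3.0, 5.0, 2.0]], [0.0, 0.0, 0.0])
```

    Because the diagonal target has strictly positive entries whenever every gene varies, any lam greater than zero yields an invertible matrix, at the cost of some bias in the off-diagonal terms.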

  16. Statistical analysis of differential gene expression relative to a fold change threshold on NanoString data of mouse odorant receptor genes.

    PubMed

    Vaes, Evelien; Khan, Mona; Mombaerts, Peter

    2014-02-04

    A challenge in gene expression studies is the reliable identification of differentially expressed genes. In many high-throughput studies, genes are accepted as differentially expressed only if they satisfy simultaneously a p value criterion and a fold change criterion. A statistical method, TREAT, has been developed for microarray data to assess formally if fold changes are significantly higher than a predefined threshold. We have recently applied the NanoString digital platform to study expression of mouse odorant receptor genes, which form with 1,200 members the largest gene family in the mouse genome. Our objectives are, on these data, to decrease false discoveries when formally assessing the genes relative to a fold change threshold, and to provide a guided selection in the choice of this threshold. Statistical tests have been developed for microarray data to identify genes that are differentially expressed relative to a fold change threshold. Here we report that another approach, which we refer to as tTREAT, is more appropriate for our NanoString data, where false discoveries lead to costly and time-consuming follow-up experiments. Methods that we refer to as tTREAT2 and the running fold change model improve the performance of the statistical tests by protecting or selecting the fold change threshold more objectively. We show the benefits on simulated and real data. Gene-wise statistical analyses of gene expression data, for which the significance relative to a fold change threshold is important, give reproducible and reliable results on NanoString data of mouse odorant receptor genes. Because it can be difficult to set in advance a fold change threshold that is meaningful for the available data, we developed methods that enable a better choice (thus reducing false discoveries and/or missed genes) or avoid this choice altogether. This set of tools may be useful for the analysis of other types of gene expression data.
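
    Testing against a fold change threshold, rather than against zero, can be sketched as below. The sketch follows the general form of a TREAT-style test (subtract the threshold from the absolute log fold change before dividing by the standard error, and sum two tail probabilities for the p-value). It is not the tTREAT or tTREAT2 procedure of the abstract, and it substitutes a standard normal distribution for the t distribution, which is an assumption that is only reasonable with many degrees of freedom.

```python
import math
from statistics import NormalDist


def treat_statistic(log_fc, se, threshold):
    """t-like statistic for H0: |true log fold change| <= threshold."""
    excess = max(abs(log_fc) - threshold, 0.0)
    return math.copysign(excess, log_fc) / se


def treat_p_value(log_fc, se, threshold):
    """Sum of two upper-tail probabilities, using a normal approximation."""
    cdf = NormalDist().cdf  # standard normal CDF stands in for the t CDF
    t_near = (abs(log_fc) - threshold) / se
    t_far = (abs(log_fc) + threshold) / se
    return (1 - cdf(t_near)) + (1 - cdf(t_far))


# Hypothetical gene: log2 fold change 1.0 with standard error 0.25.
p_vs_zero = treat_p_value(1.0, 0.25, 0.0)
p_vs_half = treat_p_value(1.0, 0.25, 0.5)
```

    Setting the threshold to zero recovers the ordinary two-sided test, and raising it makes the same observed fold change less significant, which is how the threshold reduces false discoveries.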

  17. Statistical methods for mapping quantitative trait loci from a dense set of markers.

    PubMed Central

    Dupuis, J; Siegmund, D

    1999-01-01

    Lander and Botstein introduced statistical methods for searching an entire genome for quantitative trait loci (QTL) in experimental organisms, with emphasis on a backcross design and QTL having only additive effects. We extend their results to intercross and other designs, and we compare the power of the resulting test as a function of the magnitude of the additive and dominance effects, the sample size and intermarker distances. We also compare three methods for constructing confidence regions for a QTL: likelihood regions, Bayesian credible sets, and support regions. We show that with an appropriate evaluation of the coverage probability a support region is approximately a confidence region, and we provide a theoretical explanation of the empirical observation that the size of the support region is proportional to the sample size, not the square root of the sample size, as one might expect from standard statistical theory. PMID:9872974
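
    A support region can be illustrated with a short sketch: it collects the map positions whose LOD score lies within a fixed drop of the peak. The drop of 1.0 LOD unit, the marker positions and the scores below are illustrative assumptions, not values from the paper, and the function simply returns the outermost qualifying positions even if the qualifying set is not contiguous.

```python
def support_region(positions, lod_scores, drop=1.0):
    """Return (left, right) bounds of positions whose LOD is within `drop` of the peak."""
    peak = max(lod_scores)
    inside = [pos for pos, lod in zip(positions, lod_scores) if lod >= peak - drop]
    return min(inside), max(inside)


# Hypothetical scan: positions in cM along one chromosome with their LOD scores.
positions = [0, 10, 20, 30, 40, 50]
lod_scores = [0.5, 1.8, 3.2, 3.9, 3.1, 1.0]
region = support_region(positions, lod_scores)
```

    A larger drop widens the region and raises its coverage probability, which is the quantity the abstract says must be evaluated before treating the region as a confidence region.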

  18. GSMA: Gene Set Matrix Analysis, An Automated Method for Rapid Hypothesis Testing of Gene Expression Data

    PubMed Central

    Cheadle, Chris; Watkins, Tonya; Fan, Jinshui; Williams, Marc A.; Georas, Steven; Hall, John; Rosen, Antony; Barnes, Kathleen C.

    2007-01-01

    Background: Microarray technology has become highly valuable for identifying complex global changes in gene expression patterns. The assignment of functional information to these complex patterns remains a challenging task in effectively interpreting data and correlating results from across experiments, projects and laboratories. Methods which allow the rapid and robust evaluation of multiple functional hypotheses increase the power of individual researchers to data mine gene expression data more efficiently. Results: We have developed gene set matrix analysis (GSMA) as a useful method for the rapid testing of group-wise up- or down-regulation of gene expression simultaneously for multiple lists of genes (gene sets) against entire distributions of gene expression changes (datasets) for single or multiple experiments. The utility of GSMA lies in its flexibility to rapidly poll gene sets related by known biological function or as designated solely by the end-user against large numbers of datasets simultaneously. Conclusions: GSMA provides a simple and straightforward method for hypothesis testing in which genes are tested by groups across multiple datasets for patterns of expression enrichment. PMID:20066124
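
    The idea of testing many gene sets against many datasets at once can be sketched as a matrix of set-level scores. The abstract does not state which statistic GSMA uses, so the z-score below (set mean compared with the whole distribution of expression changes) is an assumption chosen for illustration, and the gene names and values are made up.

```python
import math
from statistics import mean, pstdev


def set_z_score(changes, gene_set):
    """Z-score of a gene set's mean change against the dataset's whole distribution."""
    members = [changes[g] for g in gene_set if g in changes]
    if not members:
        return float("nan")
    values = list(changes.values())
    return (mean(members) - mean(values)) * math.sqrt(len(members)) / pstdev(values)


def gene_set_matrix(datasets, gene_sets):
    """Rows are gene sets, columns are datasets; each cell is a set-level z-score."""
    return {set_name: {data_name: set_z_score(changes, genes)
                       for data_name, changes in datasets.items()}
            for set_name, genes in gene_sets.items()}


# Hypothetical log-ratio changes for one experiment and two user-defined gene sets.
datasets = {"exp1": {"a": 2.0, "b": 1.0, "c": -1.0, "d": -2.0}}
gene_sets = {"up": ["a", "b"], "down": ["c", "d"]}
matrix = gene_set_matrix(datasets, gene_sets)
```

    Positive cells suggest group-wise up-regulation and negative cells down-regulation, and adding another experiment only adds a column to the matrix.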

  19. Space Object Detection and Tracking Within a Finite Set Statistics Framework

    DTIC Science & Technology

    2017-04-13

    AFRL-AFOSR-CL-TR-2017-0005 Space Object Detection & Tracking Within a Finite Set Statistics Framework Martin Adams Department of Electrical...MM-YYYY)      21-04-2017 2. REPORT TYPE Final 3. DATES COVERED (From - To) 01 Feb 2015 to 31 Jan 2017 4. TITLE AND SUBTITLE Space Object Detection...Grant No. FA9550-15-1-0069, devoted to the investigation and improvement of the detection and tracking methods of inactive Resident Space Objects (RSOs

  20. Solar cycle signal in stratospheric ozone: Statistical analysis of satellite data sets and comparisons with models

    NASA Astrophysics Data System (ADS)

    Hood, Lon

    Three independent satellite ozone profile data sets with lengths extending up to 25 years are analyzed using a multiple regression statistical model. Column ozone measurements are also compared with ozone profile data during the 1992 - 2003 period when no major volcanic eruptions occurred. In addition to the standard linear trend, QBO, and solar cycle explanatory variables, we also consider the effect of including an ENSO term and an equivalent effective stratospheric chlorine (EESC) term in the statistical model. Results show that the vertical structure of the tropical ozone solar cycle response has been consistently characterized by statistically significant positive responses in the upper and lower stratosphere and by statistically insignificant responses in the middle stratosphere (about 28 - 38 km altitude). The similar vertical structure in the tropics obtained for separate time intervals (with minimum response invariably near 10 hPa) is difficult to explain by random interference from the QBO and volcanic eruptions in the statistical analysis. The observed increase in tropical total column ozone approaching the cycle 23 maximum during the late 1990s occurred primarily in the lower stratosphere below the 30 hPa level. This lower stratospheric solar cycle variation may be caused mainly by decadal changes in the upwelling branch of the meridional (Brewer-Dobson) circulation resulting from direct effects of solar UV variability on the upper and middle stratosphere. The observed vertical structure of the tropical response differs from that simulated by most models. However, several recent models have begun to yield a double-peaked structure in the tropics that is similar to that derived from observations (e.g., Austin et al., Atmos. Chem. Phys., 2007).

  1. Regionalisation of statistical model outputs creating gridded data sets for Germany

    NASA Astrophysics Data System (ADS)

    Höpp, Simona Andrea; Rauthe, Monika; Deutschländer, Thomas

    2016-04-01

    The goal of the German research program ReKliEs-De (regional climate projection ensembles for Germany, http://reklies.hlug.de) is to distribute robust information about the range and the extremes of future climate for Germany and its neighbouring river catchment areas. This joint research project is supported by the German Federal Ministry of Education and Research (BMBF) and was initiated by the German Federal States. The project results are meant to support the development of adaptation strategies to mitigate the impacts of future climate change. The aim of our part of the project is to adapt and transfer the regionalisation methods of the gridded hydrological data set (HYRAS) from daily station data to the station based statistical regional climate model output of WETTREG (regionalisation method based on weather patterns). The WETTREG model output covers the period of 1951 to 2100 with a daily temporal resolution. For this, we generate a gridded data set of the WETTREG output for precipitation, air temperature and relative humidity with a spatial resolution of 12.5 km x 12.5 km, which is common for regional climate models. Thus, this regionalisation allows comparing statistical to dynamical climate model outputs. The HYRAS data set was developed by the German Meteorological Service within the German research program KLIWAS (www.kliwas.de) and consists of daily gridded data for Germany and its neighbouring river catchment areas. It has a spatial resolution of 5 km x 5 km for the entire domain for the hydro-meteorological elements precipitation, air temperature and relative humidity and covers the period of 1951 to 2006. After conservative remapping the HYRAS data set is also convenient for the validation of climate models.
The presentation will consist of two parts to present the actual state of the adaptation of the HYRAS regionalisation methods to the statistical regional climate model WETTREG: First, an overview of the HYRAS data set and the regionalisation

  2. Gene set analysis for self-contained tests: complex null and specific alternative hypotheses

    PubMed Central

    Rahmatallah, Y.; Glazko, G.

    2012-01-01

    Motivation: The analysis of differentially expressed gene sets became a routine in the analyses of gene expression data. There is a multitude of tests available, ranging from aggregation tests that summarize gene-level statistics for a gene set to true multivariate tests, accounting for intergene correlations. Most of them detect complex departures from the null hypothesis but when the null hypothesis is rejected, the specific alternative leading to the rejection is not easily identifiable. Results: In this article we compare the power and Type I error rates of minimum-spanning tree (MST)-based non-parametric multivariate tests with several multivariate and aggregation tests, which are frequently used for pathway analyses. In our simulation study, we demonstrate that MST-based tests have power that is for many settings comparable with the power of conventional approaches, but outperform them in specific regions of the parameter space corresponding to biologically relevant configurations. Further, we find for simulated and for gene expression data that MST-based tests discriminate well against shift and scale alternatives. As a general result, we suggest a two-step practical analysis strategy that may increase the interpretability of experimental data: first, apply the most powerful multivariate test to find the subset of pathways for which the null hypothesis is rejected and second, apply MST-based tests to these pathways to select those that support specific alternative hypotheses. Contact: gvglazko@uams.edu or yrahmatallah@uams.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23044539

  3. Statistical analysis of EQ-5D profiles: does the use of value sets bias inference?

    PubMed

    Parkin, David; Rice, Nigel; Devlin, Nancy

    2010-01-01

    Health state profile data, such as those provided by the EQ-5D, are widely collected in clinical trials, population surveys, and a growing range of other important health sector applications. However, these profile data are difficult to summarize to give an overall view of the health of a given population that can be analyzed for differences between groups or within groups over time. A common way of short cutting this problem is to transform profiles into a single number, or index, using sets of weights, often elicited from the general public in the form of values. Are there any problems with this procedure? In this article, the authors demonstrate the underlying effects of the use of value sets as a means of weighting profile data. They show that any set of weights introduces an exogenous source of variance to health profile data. These can distort findings about the significance of changes in health between groups or over time. No set of weights is neutral in its effect. If a summary of patient-reported outcomes is required, it may be better to use an instrument that yields this directly (such as the EQ VAS) along with the descriptive instrument. If this is not possible, researchers should have a clear rationale for their choice of weights and be aware that those weights may exert a nontrivial effect on their analysis. This article focuses on the EQ-5D, but the arguments and their implications for statistical analysis are relevant to all health state descriptive systems.

  4. Discriminatory power of game-related statistics in 14-15 year age group male volleyball, according to set.

    PubMed

    García-Hermoso, Antonio; Dávila-Romero, Carlos; Saavedra, Jose M

    2013-02-01

    This study compared volleyball game-related statistics by outcome (winners and losers of sets) and set number (total, initial, and last) to identify characteristics that discriminated game performance. Game-related statistics from 314 sets (44 matches) played by teams of male 14- to 15-year-olds in a regional volleyball championship were analysed (2011). Differences between contexts (winning or losing teams) and "set number" (total, initial, and last) were assessed. A discriminant analysis was then performed according to outcome (winners and losers of sets) and "set number" (total, initial, and last). The results showed differences (winning or losing sets) in several variables of Complexes I (attack point and error reception) and II (serve and aces). Game-related statistics which discriminate performance in the sets index the serve, positive reception, and attack point. The predictors of performance at these ages when players are still learning could help coaches plan their training.

  5. MeDiA: Mean Distance Association and Its Applications in Nonlinear Gene Set Analysis.

    PubMed

    Peng, Hesen; Ma, Junjie; Bai, Yun; Lu, Jianwei; Yu, Tianwei

    2015-01-01

    Probabilistic association discovery aims at identifying the association between random vectors, regardless of number of variables involved or linear/nonlinear functional forms. Recently, applications in high-dimensional data have generated rising interest in probabilistic association discovery. We developed a framework based on functions on the observation graph, named MeDiA (Mean Distance Association). We generalize its property to a group of functions on the observation graph. The group of functions encapsulates major existing methods in association discovery, e.g. mutual information and Brownian Covariance, and can be expanded to more complicated forms. We conducted numerical comparison of the statistical power of related methods under multiple scenarios. We further demonstrated the application of MeDiA as a method of gene set analysis that captures a broader range of responses than traditional gene set analysis methods.

  6. geneCommittee: a web-based tool for extensively testing the discriminatory power of biologically relevant gene sets in microarray data classification.

    PubMed

    Reboiro-Jato, Miguel; Arrais, Joel P; Oliveira, José Luis; Fdez-Riverola, Florentino

    2014-01-30

    The diagnosis and prognosis of several diseases can be shortened through the use of different large-scale genome experiments. In this context, microarrays can generate expression data for a huge set of genes. However, to obtain solid statistical evidence from the resulting data, it is necessary to train and to validate many classification techniques in order to find the best discriminative method. This is a time-consuming process that normally depends on intricate statistical tools. geneCommittee is a web-based interactive tool for routinely evaluating the discriminative classification power of custom hypotheses in the form of biologically relevant gene sets. While the user can work with different gene set collections and several microarray data files to configure specific classification experiments, the tool is able to run several tests in parallel. Provided with a straightforward and intuitive interface, geneCommittee is able to render valuable information for diagnostic analyses and clinical management decisions based on systematically evaluating custom hypotheses over different data sets using complementary classifiers, a key aspect in clinical research. geneCommittee allows the enrichment of raw microarray data with gene functional annotations, producing integrated datasets that simplify the construction of better discriminative hypotheses, and allows the creation of a set of complementary classifiers. The trained committees can then be used for clinical research and diagnosis. Full documentation including common use cases and guided analysis workflows is freely available at http://sing.ei.uvigo.es/GC/.

  7. A gene pattern mining algorithm using interchangeable gene sets for prokaryotes.

    PubMed

    Hu, Meng; Choi, Kwangmin; Su, Wei; Kim, Sun; Yang, Jiong

    2008-02-26

    Mining gene patterns that are common to multiple genomes is an important biological problem, which can lead us to novel biological insights. When family classification of genes is available, this problem is similar to the pattern mining problem in the data mining community. However, when family classification information is not available, mining gene patterns is a challenging problem. There are several well developed algorithms for predicting gene patterns in a pair of genomes, such as FISH and DAGchainer. These algorithms use the optimization problem formulation which is solved using the dynamic programming technique. Unfortunately, extending these algorithms to multiple genome cases is not trivial due to the rapid increase in time and space complexity. In this paper, we propose a novel algorithm for mining gene patterns in more than two prokaryote genomes using interchangeable sets. The basic idea is to extend the pattern mining technique from the data mining community to handle the situation where family classification information is not available using interchangeable sets. In an experiment with four newly sequenced genomes (where the gene annotation is unavailable), we show that the gene pattern can capture important biological information. To examine the effectiveness of gene patterns further, we propose an ortholog prediction method based on our gene pattern mining algorithm and compare our method to the bi-directional best hit (BBH) technique in terms of COG orthologous gene classification information. The experiments show that our algorithm achieves a 3% increase in recall compared to BBH without sacrificing the precision of ortholog detection. The discovered gene patterns can be used for detecting orthologs and genes that collaborate for a common biological function.

  8. Breast cancer diagnosis using level-set statistics and support vector machines.

    PubMed

    Liu, Jianguo; Yuan, Xiaohui; Buckles, Bill P

    2008-01-01

    Breast cancer diagnosis based on microscopic biopsy images and machine learning has demonstrated great promise in the past two decades. Various feature selection (or extraction) and classification algorithms have been attempted with success. However, some feature selection processes are complex and the number of features used can be quite large. We propose a new feature selection method based on level-set statistics. This procedure is simple and, when used with support vector machines (SVM), only a small number of features is needed to achieve satisfactory accuracy that is comparable to those using more sophisticated features. Therefore, the classification can be completed in much shorter time. We use multi-class support vector machines as the classification tool. Numerical results are reported to support the viability of this new procedure.

  9. Statistics of power injection in a plate set into chaotic vibration

    NASA Astrophysics Data System (ADS)

    Cadot, O.; Boudaoud, A.; Touzé, C.

    2008-12-01

    A vibrating plate is set into a chaotic state of wave turbulence by either a periodic or a random local forcing. Correlations between the forcing and the local velocity response of the plate at the forcing point are studied. Statistical models with fairly good agreement with the experiments are proposed for each forcing. Both distributions of injected power have a logarithmic cusp for zero power, while the tails are Gaussian for the periodic driving and exponential for the random one. The distributions of injected work over long time intervals are investigated in the framework of the fluctuation theorem, also known as the Gallavotti-Cohen theorem. It appears that the conclusions of the theorem are verified only for the periodic, deterministic forcing. Using independent estimates of the phase space contraction, this result is discussed in the light of available theoretical framework.

  10. Statistical criteria to set alarm levels for continuous measurements of ground contamination.

    PubMed

    Brandl, A; Jimenez, A D Herrera

    2008-08-01

    In the course of the decommissioning of the ASTRA research reactor at the site of the Austrian Research Centers at Seibersdorf, the operator and licensee, Nuclear Engineering Seibersdorf, conducted an extensive site survey and characterization to demonstrate compliance with regulatory site release criteria. This survey included radiological characterization of approximately 400,000 m² of open land on the Austrian Research Centers premises. Part of this survey was conducted using a mobile large-area gas proportional counter, continuously recording measurements while it was moved at a speed of 0.5 m s⁻¹. In order to set reasonable investigation levels, two alarm levels based on statistical considerations were developed. This paper describes the derivation of these alarm levels and the operational experience gained by detector deployment in the field.

  11. Gene set enrichment analyses revealed differences in gene expression patterns between males and females.

    PubMed

    Zhang, Wei; Huang, R Stephanie; Duan, Shiwei; Dolan, M Eileen

    2009-01-01

    Men and women differ not only in their physical attributes and reproductive functions but also in many other characteristics, including the risks for some diseases as well as response to certain therapeutic treatments. Though genetically-identical for autosomal chromosomes, males and females could have gender-specific transcriptional or translational regulation, leading to differential mRNAs or protein products for some genes. To illustrate the gender-specific differences in mRNA-level expression, we compared gene expression patterns between males and females using a whole-genome microarray dataset on the unrelated HapMap lymphoblastoid cell lines derived from individuals of European (58 individuals) and African (59 individuals) ancestry. We applied the Gene Set Enrichment Analysis to identify any overrepresented predefined gene sets in either men or women. Distinct patterns of upregulation and downregulation of certain chromosomal regions and other gene sets such as targets for certain microRNAs and transcription factors were identified in males or females, suggesting their potential roles in defining the gender-specific phenotypes. Gender-specific patterns of gene expression also appeared to be different between these two populations.

  12. A clone-based statistical test for localizing disease genes using genomic mismatch scanning

    SciTech Connect

    Palmer, C.G.S.; Woodward, A.; Smalley, S.L.

    1994-09-01

    Genomic mismatch scanning (GMS) is a technique for isolating regions of DNA that are identical-by-descent (IBD) within pairs of relatives. GMS selected data are hybridized to an ordered array of DNA, e.g., metaphase chromosomes, YACs, to identify and localize enhanced region(s) of IBD across pairs of relatives affected with a trait of interest. If the trait has a genetic basis, it is reasonable to assume that the trait gene(s) will be located in these enhanced regions. Our approach to localize these enhanced regions is based on the availability of an ordered array of clones, e.g., YACs, which span the entire human genome. We use an exact binomial order statistic to develop a test for enhanced regions of IBD in sets of clones 1 cM in size selected for being biologically independent (i.e., separated by 50 cM). The test statistic is the maximum proportion of IBD pairs selected from the independent YACs within a set. Thus far, we have defined the power of the test under the alternative hypothesis of a single gene conditional on the maximum proportion IBD being located at the disease locus. As an example, for 60 grandparent-grandchild pairs, the exact power of the test with alpha=0.001 is 0.83 when the relative risk of the disease is 4.0 and the maximum proportion is at the disease locus. This method can be used in small samples and is not dependent on any specific mapping function.
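
    The test described above rests on the null distribution of the maximum of several independent binomial counts. The following is a minimal sketch of that calculation, not the authors' implementation. The null IBD probability of 0.5 for grandparent-grandchild pairs, the number of independent clones (`m_clones = 70`), and all function names are assumptions chosen for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(min(k, n) + 1))

def max_binom_pvalue(k_max, n_pairs, m_clones, p0):
    """P(max of m independent Binomial(n, p0) counts >= k_max) under the null."""
    return 1.0 - binom_cdf(k_max - 1, n_pairs, p0) ** m_clones

def critical_count(n_pairs, m_clones, p0, alpha):
    """Smallest count c such that P(max >= c) <= alpha under the null."""
    for c in range(n_pairs + 2):
        if max_binom_pvalue(c, n_pairs, m_clones, p0) <= alpha:
            return c
    return n_pairs + 1

if __name__ == "__main__":
    # Hypothetical setting: 60 relative pairs, 70 independent clones, alpha = 0.001.
    c = critical_count(n_pairs=60, m_clones=70, p0=0.5, alpha=0.001)
    print("reject when the maximum IBD count is at least", c)
```

    Because the clones are treated as independent, the distribution of the maximum is simply the single-clone distribution function raised to the power `m_clones`, which keeps the test exact for small samples.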

  13. Three gene expression vector sets for concurrently expressing multiple genes in Saccharomyces cerevisiae.

    PubMed

    Ishii, Jun; Kondo, Takashi; Makino, Harumi; Ogura, Akira; Matsuda, Fumio; Kondo, Akihiko

    2014-05-01

    Yeast has the potential to be used in bulk-scale fermentative production of fuels and chemicals due to its tolerance for low pH and robustness for autolysis. However, expression of multiple external genes in one host yeast strain is considerably labor-intensive due to the lack of polycistronic transcription. To promote the metabolic engineering of yeast, we generated systematic and convenient genetic engineering tools to express multiple genes in Saccharomyces cerevisiae. We constructed a series of multi-copy and integration vector sets for concurrently expressing two or three genes in S. cerevisiae by embedding three classical promoters. The comparative expression capabilities of the constructed vectors were monitored with green fluorescent protein, and the concurrent expression of genes was monitored with three different fluorescent proteins. Our multiple gene expression tool will be helpful to the advanced construction of genetically engineered yeast strains in a variety of research fields other than metabolic engineering.

  14. PECA: a novel statistical tool for deconvoluting time-dependent gene expression regulation.

    PubMed

    Teo, Guoshou; Vogel, Christine; Ghosh, Debashis; Kim, Sinae; Choi, Hyungwon

    2014-01-03

    Protein expression varies as a result of intricate regulation of synthesis and degradation of messenger RNAs (mRNA) and proteins. Studies of dynamic regulation typically rely on time-course data sets of mRNA and protein expression, yet there are no statistical methods that integrate these multiomics data and deconvolute individual regulatory processes of gene expression control underlying the observed concentration changes. To address this challenge, we developed Protein Expression Control Analysis (PECA), a method to quantitatively dissect protein expression variation into the contributions of mRNA synthesis/degradation and protein synthesis/degradation, termed RNA-level and protein-level regulation respectively. PECA computes the rate ratios of synthesis versus degradation as the statistical summary of expression control during a given time interval at each molecular level and computes the probability that the rate ratio changed between adjacent time intervals, indicating regulation change at the time point. Along with the associated false-discovery rates, PECA gives the complete description of dynamic expression control, that is, which proteins were up- or down-regulated at each molecular level and each time point. Using PECA, we analyzed two yeast data sets monitoring the cellular response to hyperosmotic and oxidative stress. The rate ratio profiles reported by PECA highlighted a large magnitude of RNA-level up-regulation of stress response genes in the early response and concordant protein-level regulation with time delay. However, the contributions of RNA- and protein-level regulation and their temporal patterns were different between the two data sets. We also observed several cases where protein-level regulation counterbalanced transcriptomic changes in the early stress response to maintain the stability of protein concentrations, suggesting that proteostasis is a proteome-wide phenomenon mediated by post-transcriptional regulation.

  15. Efficient computation of minimal perturbation sets in gene regulatory networks

    PubMed Central

    Garg, Abhishek; Mohanram, Kartik; Di Cara, Alessandro; Degueurce, Gwendoline; Ibberson, Mark; Dorier, Julien; Xenarios, Ioannis

    2013-01-01

    In the last few decades, technological and experimental advancements have enabled a more precise understanding of the mode of action of drugs with respect to human cell signaling pathways and have positively influenced the design of new drug compounds. However, as the design of compounds has become increasingly target-specific, the overall effects of a drug on adjacent cellular signaling pathways remain difficult to predict because of the complexity of the interactions involved. Off-target effects of drugs are known to influence their efficacy and safety. Similarly, drugs which are more target-specific also suffer from lack of efficacy because their scope might be too limited in the context of cellular signaling. Even in situations where the signaling pathways targeted by a drug are known, the presence of point mutations in some of the components of the pathways can render a therapy ineffective in a considerable target subpopulation. Some of these issues can be addressed by predicting Minimal Intervention Sets (MIS) of elements of the signaling pathways that when perturbed give rise to a pre-defined cellular phenotype. These minimal gene perturbation sets can then be further used to screen a library of drug compounds in order to discover effective drug therapies. This manuscript describes algorithms that can be used to discover MIS in a gene regulatory network that can lead to a defined cellular phenotype. Algorithms are implemented in our Boolean modeling toolbox, GenYsis. The software binaries of GenYsis are available for download from http://www.vital-it.ch/software/genYsis/. PMID:24391592
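
    The idea of a minimal intervention set can be illustrated with a brute-force search on a toy Boolean network. This is only a sketch under stated assumptions: it is not the GenYsis algorithm, the four-node network and its update rules are hypothetical, and exhaustive enumeration is feasible only for very small networks.

```python
from itertools import combinations, product

def run_to_fixed_point(rules, state, clamped, max_steps=50):
    """Update all nodes synchronously until the state stops changing.

    Clamped nodes keep their forced value. Returns None if no fixed
    point is reached within max_steps (for example, on a cyclic attractor).
    """
    state = {**state, **clamped}
    for _ in range(max_steps):
        nxt = {n: (clamped[n] if n in clamped else rules[n](state)) for n in state}
        if nxt == state:
            return state
        state = nxt
    return None

def minimal_intervention_sets(rules, start, target):
    """Return all smallest sets of clamped nodes that drive the network to target.

    Nodes named in the target phenotype are excluded from the candidates,
    so the trivial solution of clamping the target itself is not reported.
    """
    nodes = sorted(n for n in rules if n not in target)
    for size in range(len(nodes) + 1):
        hits = []
        for subset in combinations(nodes, size):
            for values in product([False, True], repeat=size):
                clamped = dict(zip(subset, values))
                final = run_to_fixed_point(rules, start, clamped)
                if final is not None and all(final[n] == v for n, v in target.items()):
                    hits.append(clamped)
        if hits:
            return hits
    return []

# Hypothetical toy pathway: A activates B, B activates C unless D inhibits it.
rules = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"] and not s["D"],
    "D": lambda s: s["D"],
}
start = dict.fromkeys(rules, False)
print(minimal_intervention_sets(rules, start, {"C": True}))
```

    Searching by increasing set size guarantees that the first non-empty result contains only minimal sets, at the cost of exponential growth in the number of candidates.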

  16. Statistics

    Cancer.gov

    Links to sources of cancer-related statistics, including the Surveillance, Epidemiology and End Results (SEER) Program, SEER-Medicare datasets, cancer survivor prevalence data, and the Cancer Trends Progress Report.

  17. The histone methyltransferases Set5 and Set1 have overlapping functions in gene silencing and telomere maintenance.

    PubMed

    Jezek, Meagan; Gast, Alison; Choi, Grace; Kulkarni, Rushmie; Quijote, Jeremiah; Graham-Yooll, Andrew; Park, DoHwan; Green, Erin M

    2017-02-01

    Genes adjacent to telomeres are subject to transcriptional repression mediated by an integrated set of chromatin modifying and remodeling factors. The telomeres of Saccharomyces cerevisiae have served as a model for dissecting the function of diverse chromatin proteins in gene silencing, and their study has revealed overlapping roles for many chromatin proteins in either promoting or antagonizing gene repression. The H3K4 methyltransferase Set1, which is commonly linked to transcriptional activation, has been implicated in telomere silencing. Set5 is an H4 K5, K8, and K12 methyltransferase that functions with Set1 to promote repression at telomeres. Here, we analyzed the combined role for Set1 and Set5 in gene expression control at native yeast telomeres. Our data reveal that Set1 and Set5 promote a Sir protein-independent mechanism of repression that may primarily rely on regulation of H4K5ac and H4K8ac at telomeric regions. Furthermore, cells lacking both Set1 and Set5 have highly correlated transcriptomes to mutants in telomere maintenance pathways and display defects in telomere stability, linking their roles in silencing to protection of telomeres. Our data therefore provide insight into and clarify potential mechanisms by which Set1 contributes to telomere silencing and shed light on the function of Set5 at telomeres.

  18. COMBAT: A Combined Association Test for Genes Using Summary Statistics.

    PubMed

    Wang, Minghui; Huang, Jianfei; Liu, Yiyuan; Ma, Li; Potash, James B; Han, Shizhong

    2017-09-06

    Genome-wide association studies (GWAS) have been widely used for identifying common variants associated with complex diseases. Traditional analysis of GWAS typically examines one marker at a time, usually single nucleotide polymorphisms (SNPs), to identify individual variants associated with a disease. However, due to the small effect sizes of common variants, the power to detect individual risk variants is generally low. As a complementary approach to SNP-level analysis, a variety of gene-based association tests have been proposed. However, the power of existing gene-based tests is often dependent on the underlying genetic models, and it is not known a priori which test is optimal. Here we propose a combined association test (COMBAT) for genes, which incorporates strengths from existing gene-based tests and shows higher overall performance than any individual test. Our method does not require raw genotype or phenotype data, but needs only SNP-level p-values and correlations between SNPs from ancestry-matched samples. Extensive simulations showed that COMBAT has an appropriate type I error rate, maintains higher power across a wide range of genetic models, and is more robust than any individual gene-based test. We further demonstrated the superior performance of COMBAT over several other gene-based tests through reanalysis of the meta-analytic results of GWAS for bipolar disorder. Our method allows for the more powerful application of gene-based analysis to complex diseases, which will have broad use given that GWAS summary results are increasingly publicly available. Copyright © 2017, Genetics.

  19. Analyzing Planck and low redshift data sets with advanced statistical methods

    NASA Astrophysics Data System (ADS)

    Eifler, Tim

    The recent ESA/NASA Planck mission has provided a key data set to constrain cosmology that is most sensitive to physics of the early Universe, such as inflation and primordial non-Gaussianity (Planck 2015 results XIII). In combination with cosmological probes of the Large-Scale Structure (LSS), the Planck data set is a powerful source of information to investigate late time phenomena (Planck 2015 results XIV), e.g. the accelerated expansion of the Universe, the impact of baryonic physics on the growth of structure, and the alignment of galaxies in their dark matter halos. It is the main objective of this proposal to re-analyze the archival Planck data, 1) with different, more recently developed statistical methods for cosmological parameter inference, and 2) to combine Planck and ground-based observations in an innovative way. We will make the corresponding analysis framework publicly available and believe that it will set a new standard for future CMB-LSS analyses. Advanced statistical methods, such as the Gibbs sampler (Jewell et al 2004, Wandelt et al 2004) have been critical in the analysis of Planck data. More recently, Approximate Bayesian Computation (ABC, see Weyant et al 2012, Akeret et al 2015, Ishida et al 2015, for cosmological applications) has matured into an interesting tool in cosmological likelihood analyses. It circumvents several assumptions that enter the standard Planck (and most LSS) likelihood analyses, most importantly, the assumption that the functional form of the likelihood of the CMB observables is a multivariate Gaussian. Beyond applying new statistical methods to Planck data in order to cross-check and validate existing constraints, we plan to combine Planck and DES data in a new and innovative way and run multi-probe likelihood analyses of CMB and LSS observables. The complexity of multi-probe likelihood analyses scales (non-linearly) with the level of correlations amongst the individual probes that are included. For the multi

  20. Partitioning large data sets: Use of statistical methods applied to a set of Russian igneous-rock chemical analyses

    NASA Astrophysics Data System (ADS)

    Hernández Encinas, L.

    1994-12-01

    A method of cluster analysis has been developed based on the following ideas: (a) stabilizing the variances of the variables; (b) reducing the number of variables by principal component analysis; and (c) generating a moderate number of groups containing large numbers of samples and applying a method of cluster analysis to them. This method was applied to a large data set to divide it into groups of samples without taking into consideration their origin. The sample set available consists of 1271 rock chemical analyses from 37 massifs in the Ural Mountains (Russia), which were then divided into 6 differentiated groups. Later, discriminant functions were calculated to assign new samples to the groups determined.
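
    The three ideas listed in this abstract can be sketched as a small pipeline. The abstract does not name the transform or the clustering method, so everything specific here is an assumption: a log transform stands in for variance stabilisation, only the first principal component is kept (found by power iteration), and a one-dimensional k-means stands in for the cluster analysis. Inputs are assumed to be strictly positive with no constant column.

```python
import math

def standardise(rows):
    """Log-transform, then centre each column and scale it to unit variance."""
    logged = [[math.log(v) for v in row] for row in rows]
    cols = list(zip(*logged))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in logged]

def project(data, w):
    """Scores of each row along direction w."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in data]

def first_component(data, iters=200):
    """Leading principal axis of centred data, by power iteration on X^T X."""
    p = len(data[0])
    w = [float(j + 1) for j in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        scores = project(data, w)
        w = [sum(s * row[j] for s, row in zip(scores, data)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w

def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalar scores; centres start at spaced quantiles."""
    ordered = sorted(values)
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels
```

    A typical call chain is `z = standardise(rows)`, `w = first_component(z)`, then `kmeans_1d(project(z, w), k)`. Keeping more components or using a hierarchical method would follow the same structure.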

  1. A Parallel Finite Set Statistical Simulator for Multi-Target Detection and Tracking

    NASA Astrophysics Data System (ADS)

    Hussein, I.; MacMillan, R.

    2014-09-01

    Finite Set Statistics (FISST) is a powerful Bayesian inference tool for the joint detection, classification and tracking of multi-target environments. FISST is capable of handling phenomena such as clutter, misdetections, and target birth and decay. Implicit within the approach are solutions to the data association and target label-tracking problems. Finally, FISST provides generalized information measures that can be used for sensor allocation across different types of tasks such as: searching for new targets, and classification and tracking of known targets. These FISST capabilities have been demonstrated on several small-scale illustrative examples. However, for implementation in a large-scale system as in the Space Situational Awareness problem, these capabilities require a lot of computational power. In this paper, we implement FISST in a parallel environment for the joint detection and tracking of multi-target systems. In this implementation, false alarms and misdetections will be modeled. Target birth and decay will not be modeled in the present paper. We will demonstrate the success of the method for as many targets as we possibly can in a desktop parallel environment. Performance measures will include: number of targets in the simulation, certainty of detected target tracks, computational time as a function of clutter returns and number of targets, among other factors.

  2. From biophysics to evolutionary genetics: statistical aspects of gene regulation

    PubMed Central

    Lässig, Michael

    2007-01-01

    This is an introductory review on how genes interact to produce biological functions. Transcriptional interactions involve the binding of proteins to regulatory DNA. Specific binding sites can be identified by genomic analysis, and these undergo a stochastic evolution process governed by selection, mutations, and genetic drift. We focus on the links between the biophysical function and the evolution of regulatory elements. In particular, we infer fitness landscapes of binding sites from genomic data, leading to a quantitative evolutionary picture of regulation. PMID:17903288

  3. Evidence for a Global Sampling Process in Extraction of Summary Statistics of Item Sizes in a Set.

    PubMed

    Tokita, Midori; Ueda, Sachiyo; Ishiguchi, Akira

    2016-01-01

    Several studies have shown that our visual system may construct a "summary statistical representation" over groups of visual objects. Although there is a general understanding that human observers can accurately represent sets of a variety of features, many questions on how summary statistics, such as an average, are computed remain unanswered. This study investigated sampling properties of visual information used by human observers to extract two types of summary statistics of item sets, average and variance. We presented three models of ideal observers to extract the summary statistics: a global sampling model without sampling noise, global sampling model with sampling noise, and limited sampling model. We compared the performance of an ideal observer of each model with that of human observers using statistical efficiency analysis. Results suggest that summary statistics of items in a set may be computed without representing individual items, which makes it possible to discard the limited sampling account. Moreover, the extraction of summary statistics may not necessarily require the representation of individual objects with focused attention when the sets of items are larger than 4.
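
    The three ideal-observer models named in this abstract differ in how many items are sampled and whether each sample is noisy. The simulation below is a minimal sketch of that contrast for the average, not the authors' analysis. The item sizes, noise level, trial count and function names are all hypothetical.

```python
import random

def observer_estimate(items, n_sampled, noise_sd, rng):
    """Mean of n_sampled items drawn without replacement, each with Gaussian noise."""
    sample = rng.sample(items, n_sampled)
    return sum(v + rng.gauss(0.0, noise_sd) for v in sample) / n_sampled

def mean_squared_error(items, n_sampled, noise_sd, trials=2000, seed=0):
    """Monte Carlo error of the observer's estimate of the true set mean."""
    rng = random.Random(seed)
    truth = sum(items) / len(items)
    errors = ((observer_estimate(items, n_sampled, noise_sd, rng) - truth) ** 2
              for _ in range(trials))
    return sum(errors) / trials

if __name__ == "__main__":
    sizes = [float(v) for v in range(1, 9)]  # hypothetical item sizes
    print("global, no noise  :", mean_squared_error(sizes, len(sizes), 0.0))
    print("global, with noise:", mean_squared_error(sizes, len(sizes), 1.0))
    print("limited (2 items) :", mean_squared_error(sizes, 2, 0.0))
```

    Statistical efficiency can then be expressed as the ratio of an ideal observer's error to the human observer's error on the same task.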

  4. Evidence for a Global Sampling Process in Extraction of Summary Statistics of Item Sizes in a Set

    PubMed Central

    Tokita, Midori; Ueda, Sachiyo; Ishiguchi, Akira

    2016-01-01

    Several studies have shown that our visual system may construct a “summary statistical representation” over groups of visual objects. Although there is a general understanding that human observers can accurately represent sets of a variety of features, many questions on how summary statistics, such as an average, are computed remain unanswered. This study investigated sampling properties of visual information used by human observers to extract two types of summary statistics of item sets, average and variance. We presented three models of ideal observers to extract the summary statistics: a global sampling model without sampling noise, global sampling model with sampling noise, and limited sampling model. We compared the performance of an ideal observer of each model with that of human observers using statistical efficiency analysis. Results suggest that summary statistics of items in a set may be computed without representing individual items, which makes it possible to discard the limited sampling account. Moreover, the extraction of summary statistics may not necessarily require the representation of individual objects with focused attention when the sets of items are larger than 4. PMID:27242622

  5. Can You Explain that in Plain English? Making Statistics Group Projects Work in a Multicultural Setting

    ERIC Educational Resources Information Center

    Sisto, Michelle

    2009-01-01

    Students increasingly need to learn to communicate statistical results clearly and effectively, as well as to become competent consumers of statistical information. These two learning goals are particularly important for business students. In line with reform movements in Statistics Education and the GAISE guidelines, we are working to implement…

  7. Variance component score test for time-course gene set analysis of longitudinal RNA-seq data.

    PubMed

    Agniel, Denis; Hejblum, Boris P

    2017-03-10

    As gene expression measurement technology is shifting from microarrays to sequencing, the statistical tools available for their analysis must be adapted since RNA-seq data are measured as counts. It has been proposed to model RNA-seq counts as continuous variables using nonparametric regression to account for their inherent heteroscedasticity. In this vein, we propose tcgsaseq, a principled, model-free, and efficient method for detecting longitudinal changes in RNA-seq gene sets defined a priori. The method identifies those gene sets whose expression varies over time, based on an original variance component score test accounting for both covariates and heteroscedasticity without assuming any specific parametric distribution for the (transformed) counts. We demonstrate that despite the presence of a nonparametric component, our test statistic has a simple form and limiting distribution, and both may be computed quickly. A permutation version of the test is additionally proposed for very small sample sizes. Applied to both simulated data and two real datasets, tcgsaseq is shown to exhibit very good statistical properties, with an increase in stability and power when compared to state-of-the-art methods ROAST (rotation gene set testing), edgeR, and DESeq2, which can fail to control the type I error under certain realistic settings. We have made the method available for the community in the R package tcgsaseq.

  8. Reduced Set of Virulence Genes Allows High Accuracy Prediction of Bacterial Pathogenicity in Humans

    PubMed Central

    Iraola, Gregorio; Vazquez, Gustavo; Spangenberg, Lucía; Naya, Hugo

    2012-01-01

    Although there have been great advances in understanding bacterial pathogenesis, there is still a lack of integrative information about what makes a bacterium a human pathogen. The advent of high-throughput sequencing technologies has dramatically increased the number of completed bacterial genomes, for both known human pathogenic and non-pathogenic strains; this information is now available to investigate genetic features that determine pathogenic phenotypes in bacteria. In this work we determined presence/absence patterns of different virulence-related genes among more than finished bacterial genomes from both human pathogenic and non-pathogenic strains, belonging to different taxonomic groups (e.g., Actinobacteria, Gammaproteobacteria, Firmicutes). An accuracy of 95% using a cross-fold validation scheme with in-fold feature selection is obtained when classifying human pathogens and non-pathogens. A reduced subset of highly informative genes () is presented and applied to an external validation set. The statistical model was implemented in the BacFier v1.0 software (freely available at ), which displays not only the prediction (pathogen/non-pathogen) and an associated probability for pathogenicity, but also the presence/absence vector for the analyzed genes, so it is possible to decipher the subset of virulence genes responsible for the classification of the analyzed genome. Furthermore, we discuss the biological relevance for bacterial pathogenesis of the core set of genes, corresponding to eight functional categories, all with evident and documented association with the phenotypes of interest. Also, we analyze which functional categories of virulence genes were more distinctive for pathogenicity in each taxonomic group, which seems to be a completely new kind of information and could lead to important evolutionary conclusions. PMID:22916122

  9. Impact of benchmark data set topology on the validation of virtual screening methods: exploration and quantification by spatial statistics.

    PubMed

    Rohrer, Sebastian G; Baumann, Knut

    2008-04-01

    A common finding of many reports evaluating ligand-based virtual screening methods is that validation results vary considerably with changing benchmark data sets. It is widely assumed that these data set specific effects are caused by the redundancy, self-similarity, and cluster structure inherent to those data sets. These phenomena manifest themselves in the data sets' representation in descriptor space, which is termed the data set topology. A methodology for the characterization of data set topology based on spatial statistics is introduced. The method is nonparametric and can deal with arbitrary distributions of descriptor values. With this methodology it is possible to associate differences in virtual screening performance on different data sets with differences in data set topology. Moreover, the better virtual screening performance of certain descriptors can be explained by their ability of representing the benchmark data sets by a more favorable topology. Finally it is shown, that the composition of some benchmark data sets causes topologies that lead to overoptimistic validation results even in very "simple" descriptor spaces. Spatial statistics analysis as proposed here facilitates the detection of such biased data sets and may provide a tool for the future design of unbiased benchmark data sets.

  10. Along signal paths: an empirical gene set approach exploiting pathway topology

    PubMed Central

    Martini, Paolo; Sales, Gabriele; Massa, M. Sofia; Chiogna, Monica; Romualdi, Chiara

    2013-01-01

    Gene set analysis using biological pathways has become a widely used statistical approach for gene expression analysis. A biological pathway can be represented through a graph where genes and their interactions are, respectively, nodes and edges of the graph. From a biological point of view only some portions of a pathway are expected to be altered; however, few methods using pathway topology have been proposed and none of them tries to identify the signal paths, within a pathway, most involved in the biological problem. Here, we present a novel algorithm for pathway analysis, clipper, that tries to fill this gap. clipper implements a two-step empirical approach based on the exploitation of graph decomposition into a junction tree to reconstruct the most relevant signal path. In the first step clipper selects significant pathways according to statistical tests on the means and the concentration matrices of the graphs derived from pathway topologies. Then, it identifies within these pathways the signal paths having the greatest association with a specific phenotype. We test our approach on simulated and two real expression datasets. Our results demonstrate the efficacy of clipper in the identification of signal transduction paths totally coherent with the biological problem. PMID:23002139

  11. Statistical mechanics of scale-free gene expression networks

    NASA Astrophysics Data System (ADS)

    Gross, Eitan

    2012-12-01

    The gene co-expression networks of many organisms including bacteria, mice and man exhibit scale-free distribution. This heterogeneous distribution of connections decreases the vulnerability of the network to random attacks and thus may confer the genetic replication machinery an intrinsic resilience to such attacks, triggered by changing environmental conditions that the organism may be subject to during evolution. This resilience to random attacks comes at an energetic cost, however, reflected by the lower entropy of the scale-free distribution compared to the more homogeneous, random network. In this study we found that the cell cycle-regulated gene expression pattern of the yeast Saccharomyces cerevisiae obeys a power-law distribution with an exponent α = 2.1 and an entropy of 1.58. The latter is very close to the maximal value of 1.65 obtained from linear optimization of the entropy function under the constraint of a constant cost function, determined by the average degree connectivity. We further show that the yeast's gene expression network can achieve scale-free distribution in a process that does not involve growth but rather via re-wiring of the connections between nodes of an ordered network. Our results support the idea of an evolutionary selection, which acts at the level of the protein sequence, and is compatible with the notion of greater biological importance of highly connected nodes in the protein interaction network. Our constrained re-wiring model provides a theoretical framework for a putative thermodynamically driven evolutionary selection process.
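
    As a rough illustration of the entropy of a power-law degree distribution mentioned above, the sketch below computes the Shannon entropy of a truncated discrete power law P(k) ∝ k^(-α). The function name, the cutoff k_max, and the use of natural logarithms are assumptions made for this example; the abstract does not state the logarithm base or the degree range, so the sketch is not expected to reproduce the reported value of 1.58.

```python
import math

def power_law_entropy(alpha, k_max=1000):
    """Shannon entropy (in nats) of P(k) proportional to k**(-alpha), k = 1..k_max.

    k_max is an assumed truncation point, not a value taken from the study.
    """
    weights = [k ** (-alpha) for k in range(1, k_max + 1)]
    norm = sum(weights)  # normalization constant so the probabilities sum to 1
    return -sum((w / norm) * math.log(w / norm) for w in weights)

# Example: a steeper exponent concentrates probability on low degrees,
# which lowers the entropy.
h_steep = power_law_entropy(2.1)
h_shallow = power_law_entropy(1.5)
```

    With alpha = 0 the distribution is uniform over the k_max degrees and the entropy reduces to ln(k_max), which gives a quick sanity check.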

  12. PEDSTATS: descriptive statistics, graphics and quality assessment for gene mapping data.

    PubMed

    Wigginton, Janis E; Abecasis, Gonçalo R

    2005-08-15

    We describe a tool that produces summary statistics and basic quality assessments for gene-mapping data, accommodating either pedigree or case-control datasets. Our tool can also produce graphic output in the PDF format.

  13. Human Effector / Initiator Gene Sets That Regulate Myometrial Contractility During Term and Preterm Labor

    PubMed Central

    WEINER, Carl P.; MASON, Clifford W.; DONG, Yafeng; BUHIMSCHI, Irina A.; SWAAN, Peter W.; BUHIMSCHI, Catalin S.

    2010-01-01

    Objective: Distinct processes govern transition from quiescence to activation during term (TL) and preterm labor (PTL). We sought gene sets responsible for TL and PTL, along with the effector genes necessary for labor independent of gestation and underlying trigger. Methods: Expression was analyzed in term and preterm +/− labor (n = 6 subjects/group). Gene sets were generated using logic operations. Results: 34 genes were similarly expressed in PTL/TL but absent from nonlabor samples (Effector Set). 49 genes were specific to PTL (Preterm Initiator Set) and 174 to TL (Term Initiator Set). The gene ontogeny processes comprising Term Initiator and Effector Sets were diverse, though inflammation was represented in 4 of the top 10; inflammation dominated the Preterm Initiator Set. Comments: TL and PTL differ dramatically in initiator profiles. Though inflammation is part of the Term Initiator and the Effector Sets, it is an overwhelming part of PTL associated with intraamniotic inflammation. PMID:20452493
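
    The "logic operations" used to derive the three gene sets can be pictured with ordinary set algebra. The sketch below is an illustration only: the gene labels are placeholders, and the exact rules (genes in a labor group but in neither non-labor group, then intersected or differenced) are an assumed reading of the abstract, which does not spell out the operations.

```python
# Hypothetical expressed-gene lists per group; labels are placeholders, not study data.
expressed = {
    "term_labor": {"G1", "G2", "G3", "G5"},
    "term_no_labor": {"G4"},
    "preterm_labor": {"G1", "G2", "G6"},
    "preterm_no_labor": {"G4", "G7"},
}

def labor_gene_sets(expr):
    """Split labor-associated genes into effector and initiator sets (assumed rules)."""
    non_labor = expr["term_no_labor"] | expr["preterm_no_labor"]
    tl = expr["term_labor"] - non_labor      # labor-only genes at term
    ptl = expr["preterm_labor"] - non_labor  # labor-only genes preterm
    return {
        "effector": tl & ptl,            # shared by TL and PTL
        "term_initiator": tl - ptl,      # specific to TL
        "preterm_initiator": ptl - tl,   # specific to PTL
    }

sets = labor_gene_sets(expressed)
```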

  14. Detection of viruses via statistical gene expression analysis.

    PubMed

    Chen, Minhua; Carlson, David; Zaas, Aimee; Woods, Christopher W; Ginsburg, Geoffrey S; Hero, Alfred; Lucas, Joseph; Carin, Lawrence

    2011-03-01

    We develop a new Bayesian construction of the elastic net (ENet), with variational Bayesian analysis. This modeling framework is motivated by analysis of gene expression data for viruses, with a focus on H3N2 and H1N1 influenza, as well as rhinovirus and RSV (respiratory syncytial virus). Our objective is to understand the biological pathways responsible for the host response to such viruses, with the ultimate objective of developing a clinical test to distinguish subjects infected by such viruses from subjects with other symptom causes (e.g., bacteria). In addition to analyzing these new datasets, we provide a detailed analysis of the Bayesian ENet and compare it to related models.

  15. JAG: A Computational Tool to Evaluate the Role of Gene-Sets in Complex Traits.

    PubMed

    Lips, Esther S; Kooyman, Maarten; de Leeuw, Christiaan; Posthuma, Danielle

    2015-05-14

    Gene-set analysis has been proposed as a powerful tool to deal with the highly polygenic architecture of complex traits, as well as with the small effect sizes typically found in GWAS studies for complex traits. We developed a tool, Joint Association of Genetic variants (JAG), which can be applied to Genome Wide Association (GWA) data and tests for the joint effect of all single nucleotide polymorphisms (SNPs) located in a user-specified set of genes or biological pathway. JAG assigns SNPs to genes and incorporates self-contained and/or competitive tests for gene-set analysis. JAG uses permutation to evaluate gene-set significance, which implicitly controls for linkage disequilibrium, sample size, gene size, the number of SNPs per gene and the number of genes in the gene-set. We conducted a power analysis using the Wellcome Trust Case Control Consortium (WTCCC) Crohn's disease data set and show that JAG correctly identifies validated gene-sets for Crohn's disease and has more power than currently available tools for gene-set analysis. JAG is a powerful, novel tool for gene-set analysis, and can be freely downloaded from the CTG Lab website.
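
    The idea of a permutation-based self-contained gene-set test can be sketched as follows. This is not JAG's implementation: the set statistic used here (a sum of squared case-control differences in mean allele count) is a simple stand-in chosen for the example, and all function and parameter names are hypothetical. Shuffling only the phenotype labels leaves each individual's genotype row intact, which is one way a permutation scheme can preserve the correlation between SNPs.

```python
import random

def set_statistic(genotypes, labels, snp_idx):
    """Sum over the set's SNPs of the squared case-control difference in mean allele count."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    total = 0.0
    for j in snp_idx:
        mean_case = sum(g[j] for g in cases) / len(cases)
        mean_ctrl = sum(g[j] for g in controls) / len(controls)
        total += (mean_case - mean_ctrl) ** 2
    return total

def self_contained_p(genotypes, labels, snp_idx, n_perm=1000, seed=0):
    """Empirical p-value from permuting phenotype labels (genotype rows stay intact)."""
    rng = random.Random(seed)
    observed = set_statistic(genotypes, labels, snp_idx)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_statistic(genotypes, shuffled, snp_idx) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Toy data: 10 cases carry two copies at both SNPs, 10 controls carry none.
toy_genotypes = [[2, 2]] * 10 + [[0, 0]] * 10
toy_labels = [1] * 10 + [0] * 10
p_toy = self_contained_p(toy_genotypes, toy_labels, snp_idx=[0, 1], n_perm=200)
```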

  16. Inferring biological functions and associated transcriptional regulators using gene set expression coherence analysis

    PubMed Central

    Kim, Tae-Min; Chung, Yeun-Jun; Rhyu, Mun-Gan; Ho Jung, Myeong

    2007-01-01

    Background: Gene clustering has been widely used to group genes with similar expression patterns in microarray data analysis. Subsequent enrichment analysis using predefined gene sets can provide clues on which functional themes or regulatory sequence motifs are associated with individual gene clusters. In spite of their potential utility, gene clustering and enrichment analysis have been used in separate platforms; thus, the development of an integrative algorithm linking both methods is highly challenging. Results: In this study, we propose an algorithm for discovery of molecular functions and elucidation of transcriptional logics using two kinds of gene information, functional and regulatory motif gene sets. The algorithm, termed gene set expression coherence analysis (GSECA), first selects functional gene sets with significantly high expression coherences. Those candidate gene sets are further processed into a number of functionally related themes or functional clusters according to the expression similarities. Each functional cluster is then investigated for the enrichment of transcriptional regulatory motifs using modified gene set enrichment analysis and regulatory motif gene sets. The method was tested on two publicly available expression profiles representing murine myogenesis and erythropoiesis. For the respective profiles, our algorithm identified myocyte- and erythrocyte-related molecular functions, along with the putative transcriptional regulators for the corresponding molecular functions. Conclusion: As an integrative and comprehensive method for the analysis of large-scale gene expression profiles, our method is able to generate a set of testable hypotheses: the transcriptional regulator X regulates function Y under cellular condition Z. The GSECA algorithm is implemented in a freely available software package. PMID:18021416

  17. GeneSet2miRNA: finding the signature of cooperative miRNA activities in the gene lists

    PubMed Central

    Antonov, Alexey V.; Dietmann, Sabine; Wong, Philip; Lutter, Dominik; Mewes, Hans W.

    2009-01-01

    GeneSet2miRNA is the first web-based tool which is able to identify whether or not a gene list has a signature of miRNA-regulatory activity. As input, GeneSet2miRNA accepts a list of genes. As output, a list of miRNA-regulatory models is provided. A miRNA-regulatory model is a group of miRNAs (single, pair, triplet or quadruplet) that is predicted to regulate a significant subset of genes from the submitted list. GeneSet2miRNA provides a user friendly dialog-driven web page submission available for several model organisms. GeneSet2miRNA is freely available at http://mips.helmholtz-muenchen.de/proj/gene2mir/. PMID:19420064

  18. snpGeneSets: An R Package for Genome-Wide Study Annotation.

    PubMed

    Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian

    2016-12-07

    Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/.

  19. snpGeneSets: An R Package for Genome-Wide Study Annotation

    PubMed Central

    Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian

    2016-01-01

    Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/. PMID:27807048

  20. New cyt b gene universal primer set for forensic analysis.

    PubMed

    Lopez-Oceja, A; Gamarra, D; Borragan, S; Jiménez-Moreno, S; de Pancorbo, M M

    2016-07-01

    Analysis of mitochondrial DNA, and in particular the cytochrome b gene (cyt b), has become an essential tool for species identification in routine forensic practice. In cases of degraded samples, where the DNA is fractionated, universal primers that are highly efficient for the amplification of the target region are necessary. Therefore, in the present study a new universal cyt b primer set with high species identification capabilities, even in samples with highly degraded DNA, has been developed. In order to achieve this objective, the primers were designed following the alignment of complete sequences of the cyt b from 751 species from the Class of Mammalia listed in GenBank. A highly variable region of 148bp flanked by highly conserved sequences was chosen for placing the primers. The effectiveness of the new pair of primers was examined in 63 animal species belonging to 38 Families from 14 Orders and 5 Classes (Mammalia, Aves, Reptilia, Actinopterygii, and Malacostraca). Species determination was possible in all cases, which shows that the fragment analyzed provided a high capability for species identification. Furthermore, to ensure the efficiency of the 148bp fragment, the intraspecific variability was analyzed by calculating the concordance between individuals with the BLAST tool from the NCBI (National Center for Biotechnology Information). The intraspecific concordance levels were above 97% in all species. Likewise, the phylogenetic information from the selected fragment was confirmed by obtaining the phylogenetic tree from the sequences of the species analyzed. Evidence of the high power of phylogenetic discrimination of the analyzed fragment of the cyt b was obtained, as 93.75% of the species were grouped within their corresponding Orders. Finally, the analysis of 40 degraded samples with small-size DNA fragments showed that the new pair of primers permits identifying the species, even when the DNA is highly degraded as it is very common in

  1. Gene-set activity toolbox (GAT): A platform for microarray-based cancer diagnosis using an integrative gene-set analysis approach.

    PubMed

    Engchuan, Worrawat; Meechai, Asawin; Tongsima, Sissades; Doungpan, Narumol; Chan, Jonathan H

    2016-08-01

    Cancer is a complex disease that cannot be diagnosed reliably using only single gene expression analysis. Using gene-set analysis on high throughput gene expression profiling controlled by various environmental factors is a commonly adopted technique used by the cancer research community. This work develops a comprehensive gene expression analysis tool, the gene-set activity toolbox (GAT), that is implemented with a data retriever, traditional data pre-processing, several gene-set analysis methods, network visualization and data mining tools. The gene-set analysis methods are used to identify subsets of phenotype-relevant genes that will be used to build a classification model. To evaluate GAT performance, we performed a cross-dataset validation study on three common cancers, namely colorectal, breast and lung cancers. The results show that GAT can be used to build a reasonable disease diagnostic model and the predicted markers have biological relevance. GAT can be accessed from http://gat.sit.kmutt.ac.th where GAT's Java library for gene-set analysis, simple classification and a database with three cancer benchmark datasets can be downloaded.

  2. Selection of the Maximum Spatial Cluster Size of the Spatial Scan Statistic by Using the Maximum Clustering Set-Proportion Statistic.

    PubMed

    Ma, Yue; Yin, Fei; Zhang, Tao; Zhou, Xiaohua Andrew; Li, Xiaosong

    2016-01-01

    Spatial scan statistics are widely used in various fields. The performance of these statistics is influenced by parameters, such as maximum spatial cluster size, and can be improved by parameter selection using performance measures. Current performance measures are based on the presence of clusters and are thus inapplicable to data sets without known clusters. In this work, we propose a novel overall performance measure called maximum clustering set-proportion (MCS-P), which is based on the likelihood of the union of detected clusters and the applied dataset. MCS-P was compared with existing performance measures in a simulation study to select the maximum spatial cluster size. Results of other performance measures, such as sensitivity and misclassification, suggest that the spatial scan statistic achieves accurate results in most scenarios with the maximum spatial cluster sizes selected using MCS-P. Given that previously known clusters are not required in the proposed strategy, selection of the optimal maximum cluster size with MCS-P can improve the performance of the scan statistic in applications without identified clusters.

  3. A method for developing regulatory gene set networks to characterize complex biological systems.

    PubMed

    Suphavilai, Chayaporn; Zhu, Liugen; Chen, Jake Y

    2015-01-01

    Traditional approaches to studying molecular networks are based on linking genes or proteins. Higher-level networks linking gene sets or pathways have been proposed recently. Several types of gene set networks have been used to study complex molecular networks such as co-membership gene set networks (M-GSNs) and co-enrichment gene set networks (E-GSNs). Gene set networks are useful for studying biological mechanism of diseases and drug perturbations. In this study, we proposed a new approach for constructing directed, regulatory gene set networks (R-GSNs) to reveal novel relationships among gene sets or pathways. We collected several gene set collections and high-quality gene regulation data in order to construct R-GSNs in a comparative study with co-membership gene set networks (M-GSNs). We described a method for constructing both global and disease-specific R-GSNs and determining their significance. To demonstrate the potential applications to disease biology studies, we constructed and analysed an R-GSN specifically built for Alzheimer's disease. R-GSNs can provide new biological insights complementary to those derived at the protein regulatory network level or M-GSNs. When integrated properly to functional genomics data, R-GSNs can help enable future research on systems biology and translational bioinformatics.

  4. Estimation of geosynchronous space objects using finite set statistics filtering methods

    NASA Astrophysics Data System (ADS)

    Gehly, Steve

    The use of near Earth space has increased dramatically in the past few decades, and operational satellites are an integral part of modern society. The increased presence in space has led to an increase in the amount of orbital debris, which poses a growing threat to current and future space missions. Characterization of the debris environment is crucial to our continued use of high value orbit regimes such as the geosynchronous (GEO) belt. Objects in GEO pose unique challenges, by virtue of being densely spaced and tracked by a limited number of sensors in short observation windows. This research examines the use of a new class of multitarget filters to approach the problem of orbit determination for the large number of objects present. The filters make use of a recently developed mathematical toolbox derived from point process theory known as Finite Set Statistics (FISST). Details of implementing FISST-derived filters are discussed, and a qualitative and quantitative comparison between FISST and traditional multitarget estimators demonstrates the suitability of the new methods for space object estimation. Specific challenges in the areas of sensor allocation and initial orbit determination are addressed in the framework. The sensor allocation scheme makes use of information gain functionals as formulated for FISST to efficiently collect measurements on the full multitarget system. Results from a simulated network of three ground stations tracking a large catalog of geosynchronous objects demonstrate improved performance as compared to simpler, non-information theoretic tasking schemes. Further studies incorporate an initial orbit determination technique to initiate new tracks in the multitarget filter. Together with a sensor allocation scheme designed to search for new targets and maintain knowledge of the existing catalog, the method comprises a solution to the search-detect-track problem. Simulation results for a single sensor case show that the problem can be

  5. Accurate Gene Expression-Based Biodosimetry Using a Minimal Set of Human Gene Transcripts

    SciTech Connect

    Tucker, James D.; Joiner, Michael C.; Thomas, Robert A.; Grever, William E.; Bakhmutsky, Marina V.; Chinkhota, Chantelle N.; Smolinski, Joseph M.; Divine, George W.; Auner, Gregory W.

    2014-03-15

    Purpose: Rapid and reliable methods for conducting biological dosimetry are a necessity in the event of a large-scale nuclear event. Conventional biodosimetry methods lack the speed, portability, ease of use, and low cost required for triaging numerous victims. Here we address this need by showing that polymerase chain reaction (PCR) on a small number of gene transcripts can provide accurate and rapid dosimetry. The low cost and relative ease of PCR compared with existing dosimetry methods suggest that this approach may be useful in mass-casualty triage situations. Methods and Materials: Human peripheral blood from 60 adult donors was acutely exposed to cobalt-60 gamma rays at doses of 0 (control) to 10 Gy. mRNA expression levels of 121 selected genes were obtained 0.5, 1, and 2 days after exposure by reverse-transcriptase real-time PCR. Optimal dosimetry at each time point was obtained by stepwise regression of dose received against individual gene transcript expression levels. Results: Only 3 to 4 different gene transcripts, ASTN2, CDKN1A, GDF15, and ATM, are needed to explain ≥0.87 of the variance (R²). Receiver-operator characteristics, a measure of sensitivity and specificity, of 0.98 for these statistical models were achieved at each time point. Conclusions: The actual and predicted radiation doses agree very closely up to 6 Gy. Dosimetry at 8 and 10 Gy shows some effect of saturation, thereby slightly diminishing the ability to quantify higher exposures. Analyses of these gene transcripts may be advantageous for use in a field-portable device designed to assess exposures in mass casualty situations or in clinical radiation emergencies.

  6. Set statistics in conductive bridge random access memory device with Cu/HfO₂/Pt structure

    SciTech Connect

    Zhang, Meiyun; Long, Shibing; Wang, Guoming; Xu, Xiaoxin; Li, Yang; Liu, Qi; Lv, Hangbing; Liu, Ming; Lian, Xiaojuan; Miranda, Enrique; Suñé, Jordi

    2014-11-10

    The switching parameter variation of resistive switching memory is one of the most important challenges in its application. In this letter, we have studied the set statistics of conductive bridge random access memory with a Cu/HfO₂/Pt structure. The experimental distributions of the set parameters in several off resistance ranges are shown to nicely fit a Weibull model. The Weibull slopes of the set voltage and current increase and decrease logarithmically with off resistance, respectively. This experimental behavior is perfectly captured by a Monte Carlo simulator based on the cell-based set voltage statistics model and the Quantum Point Contact electron transport model. Our work provides indications for the improvement of the switching uniformity.
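
    A Weibull slope of the kind discussed above is commonly read off a Weibull plot: for the cumulative distribution F(x) = 1 - exp(-(x/x63)^β), the quantity ln(-ln(1 - F)) is linear in ln(x) with slope β. The sketch below fits that line by least squares. It is a generic illustration rather than the authors' procedure; the function name, the median-rank estimate of F (Bernard's approximation), and the synthetic sample values are all assumptions.

```python
import math

def weibull_fit(samples):
    """Estimate the Weibull slope (beta) and scale (x63) from a Weibull plot."""
    xs = sorted(samples)
    n = len(xs)
    # Horizontal axis of the Weibull plot: ln(x).
    u = [math.log(x) for x in xs]
    # Vertical axis: ln(-ln(1 - F)), with F from Bernard's median-rank approximation.
    w = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    u_mean = sum(u) / n
    w_mean = sum(w) / n
    # Ordinary least-squares slope of w against u.
    beta = (sum((a - u_mean) * (b - w_mean) for a, b in zip(u, w))
            / sum((a - u_mean) ** 2 for a in u))
    # The fitted intercept equals -beta * ln(x63), so solve for x63.
    x63 = math.exp(u_mean - w_mean / beta)
    return beta, x63

# Synthetic "set voltage" values placed exactly on Weibull quantiles (not measured data).
true_beta, true_x63, n_pts = 3.0, 0.5, 50
synthetic = [
    true_x63 * (-math.log(1.0 - (i - 0.3) / (n_pts + 0.4))) ** (1.0 / true_beta)
    for i in range(1, n_pts + 1)
]
fit_beta, fit_x63 = weibull_fit(synthetic)
```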

  7. Toward a comprehensive set of asthma susceptibility genes.

    PubMed

    Bossé, Yohan; Hudson, Thomas J

    2007-01-01

    Epidemiological and twin studies have demonstrated that asthma is under genetic and environmental influences. Numerous candidate gene association studies as well as genome-wide linkage scans have followed, aiming to elucidate the genetic architecture underlying this complex disease. Several promising asthma susceptibility genes were identified, and a comprehensive catalogue of these genes seems a realistic goal within 5 to 10 years. However, a key challenge is to understand the combination of genes and environmental factors that gives rise to the disease in a specific individual. Currently, most of the reports of asthma susceptibility genes are either preliminary or controversial, with little knowledge about the genetic mechanisms leading to abnormal function of the gene that promotes the development of asthma. Replications of published associations are relatively few. Many factors, including the inherent complexity of asthma as well as methodological issues, can explain these inconsistencies. Promising genetic tools are emerging with the completion of the International HapMap Project that will increase the scope of gene-discovery investigations. It is hoped that these tools, combined with validation studies in additional populations, will enable the creation of a comprehensive catalogue of susceptibility genes for asthma. Notwithstanding the difficulties in making sense of the vast amount of new genetic data, we already see the emergence of new biological pathways of atopy, airway remodeling, and asthma that may lead to novel therapeutic approaches.

  8. Comparison of three summary statistics for ranking genes in genome-wide association studies.

    PubMed

    Freytag, Saskia; Bickeböller, Heike

    2014-05-20

    Problems associated with insufficient power have haunted the analysis of genome-wide association studies and are likely to be the main challenge for the analysis of next-generation sequencing data. Ranking genes according to their strength of association with the investigated phenotype is one solution. To obtain rankings for genes, researchers can draw from a wide range of statistics summarizing the relationships between variants mapped to a gene and the phenotype. Hence, it is of interest to explore the performance of these statistics in the context of rankings. To this end, we conducted a simulation study (limited to genes of equal sizes) of three different summary statistics examining the ability to rank genes in a meaningful order. The weighted sum of squared marginal score test (Pan, 2009), RareCover algorithm (Bhatia et al., 2010) and the elastic net regularization (Zou and Hastie, 2005) were chosen, because they can handle common as well as rare variants. The test based on the score statistic outperformed both other methods in almost all investigated scenarios. It was the only measure to consistently detect genes with interacting causal variants. However, the RareCover algorithm proved better at identifying genes including causal variants with small effect sizes and low minor allele frequency than the weighted sum of squared marginal score test. The performance of the elastic net regularization was unimpressive for all but the simplest scenarios. Copyright © 2013 John Wiley & Sons, Ltd.

  9. An Efficient and Robust Statistical Modeling Approach to Discover Differentially Expressed Genes Using Genomic Expression Profiles

    PubMed Central

    Thomas, Jeffrey G.; Olson, James M.; Tapscott, Stephen J.; Zhao, Lue Ping

    2001-01-01

    We have developed a statistical regression modeling approach to discover genes that are differentially expressed between two predefined sample groups in DNA microarray experiments. Our model is based on well-defined assumptions, uses rigorous and well-characterized statistical measures, and accounts for the heterogeneity and genomic complexity of the data. In contrast to cluster analysis, which attempts to define groups of genes and/or samples that share common overall expression profiles, our modeling approach uses known sample group membership to focus on expression profiles of individual genes in a sensitive and robust manner. Further, this approach can be used to test statistical hypotheses about gene expression. To demonstrate this methodology, we compared the expression profiles of 11 acute myeloid leukemia (AML) and 27 acute lymphoblastic leukemia (ALL) samples from a previous study (Golub et al. 1999) and found 141 genes differentially expressed between AML and ALL with a 1% significance at the genomic level. Using this modeling approach to compare different sample groups within the AML samples, we identified a group of genes whose expression profiles correlated with that of thrombopoietin and found that genes whose expression associated with AML treatment outcome lie in recurrent chromosomal locations. Our results are compared with those obtained using t-tests or Wilcoxon rank sum statistics. PMID:11435405

  10. Fundamental limitations of high contrast imaging set by small sample statistics

    SciTech Connect

    Mawet, D.; Milli, J.; Wahhaj, Z.; Pelat, D.; Absil, O.; Delacroix, C.; Boccaletti, A.; Kasper, M.; Kenworthy, M.; Marois, C.; Mennesson, B.; Pueyo, L.

    2014-09-10

    In this paper, we review the impact of small sample statistics on detection thresholds and corresponding confidence levels (CLs) in high-contrast imaging at small angles. When looking close to the star, the number of resolution elements decreases rapidly toward small angles. This reduction of the number of degrees of freedom dramatically affects CLs and false alarm probabilities. Naively using the same ideal hypothesis and methods as for larger separations, which are well understood and commonly assume Gaussian noise, can yield up to one order of magnitude error in contrast estimations at fixed CL. The statistical penalty exponentially increases toward very small inner working angles. Even at 5-10 resolution elements from the star, false alarm probabilities can be significantly higher than expected. Here we present a rigorous statistical analysis that ensures robustness of the CL, but also imposes a substantial limitation on corresponding achievable detection limits (thus contrast) at small angles. This unavoidable fundamental statistical effect has a significant impact on current coronagraphic and future high-contrast imagers. Finally, the paper concludes with practical recommendations to account for small number statistics when computing the sensitivity to companions at small angles and when exploiting the results of direct imaging planet surveys.

  11. Fundamental Limitations of High Contrast Imaging Set by Small Sample Statistics

    NASA Astrophysics Data System (ADS)

    Mawet, D.; Milli, J.; Wahhaj, Z.; Pelat, D.; Absil, O.; Delacroix, C.; Boccaletti, A.; Kasper, M.; Kenworthy, M.; Marois, C.; Mennesson, B.; Pueyo, L.

    2014-09-01

    In this paper, we review the impact of small sample statistics on detection thresholds and corresponding confidence levels (CLs) in high-contrast imaging at small angles. When looking close to the star, the number of resolution elements decreases rapidly toward small angles. This reduction of the number of degrees of freedom dramatically affects CLs and false alarm probabilities. Naively using the same ideal hypothesis and methods as for larger separations, which are well understood and commonly assume Gaussian noise, can yield up to one order of magnitude error in contrast estimations at fixed CL. The statistical penalty exponentially increases toward very small inner working angles. Even at 5-10 resolution elements from the star, false alarm probabilities can be significantly higher than expected. Here we present a rigorous statistical analysis that ensures robustness of the CL, but also imposes a substantial limitation on corresponding achievable detection limits (thus contrast) at small angles. This unavoidable fundamental statistical effect has a significant impact on current coronagraphic and future high-contrast imagers. Finally, the paper concludes with practical recommendations to account for small number statistics when computing the sensitivity to companions at small angles and when exploiting the results of direct imaging planet surveys.

  12. Functional gene-set analysis does not support a major role for synaptic function in attention deficit/hyperactivity disorder (ADHD).

    PubMed

    Hammerschlag, Anke R; Polderman, Tinca J C; de Leeuw, Christiaan; Tiemeier, Henning; White, Tonya; Smit, August B; Verhage, Matthijs; Posthuma, Danielle

    2014-07-22

    Attention Deficit/Hyperactivity Disorder (ADHD) is one of the most common childhood-onset neuropsychiatric disorders. Despite high heritability estimates, genome-wide association studies (GWAS) have failed to find significant genetic associations, likely due to the polygenic character of ADHD. Nevertheless, genetic studies suggested the involvement of several processes important for synaptic function. Therefore, we applied a functional gene-set analysis to formally test whether synaptic functions are associated with ADHD. Gene-set analysis tests the joint effect of multiple genetic variants in groups of functionally related genes. This method provides increased statistical power compared to conventional GWAS. We used data from the Psychiatric Genomics Consortium including 896 ADHD cases and 2455 controls, and 2064 parent-affected offspring trios, providing sufficient statistical power to detect gene sets representing a genotype relative risk of at least 1.17. Although all synaptic genes together showed a significant association with ADHD, this association was not stronger than that of randomly generated gene sets matched for the same number of genes. Further analyses showed no association of specific synaptic function categories with ADHD after correction for multiple testing. Given the current sample size and gene sets based on current knowledge of genes related to synaptic function, our results do not support a major role for common genetic variants in synaptic genes in the etiology of ADHD.
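
    The comparison against randomly generated gene sets of the same size, mentioned in the abstract above, can be sketched as an empirical p-value. This is a minimal illustration, not the authors' pipeline: it assumes per-gene association scores are already available, uses the mean score as the set statistic, and all names and toy data are hypothetical.

```python
import random

def competitive_p_value(gene_scores, gene_set, n_random=2000, seed=0):
    """Empirical p-value: how often a random gene set of the same size
    scores at least as high as the gene set of interest."""
    rng = random.Random(seed)
    genes = sorted(gene_scores)
    members = [g for g in gene_set if g in gene_scores]
    observed = sum(gene_scores[g] for g in members) / len(members)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(genes, len(members))  # matched for number of genes
        score = sum(gene_scores[g] for g in sample) / len(sample)
        if score >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_random + 1)

# Toy data: 500 genes with random scores, and an arbitrary 20-gene set.
toy_rng = random.Random(1)
scores = {f"gene{i}": toy_rng.random() for i in range(500)}
test_set = [f"gene{i}" for i in range(20)]
print(competitive_p_value(scores, test_set))
```

    A gene set can be significant on its own yet unremarkable in this comparison, which is the distinction the abstract draws for the synaptic genes.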

  13. Constellation Map: Downstream visualization and interpretation of gene set enrichment results.

    PubMed

    Tan, Yan; Wu, Felix; Tamayo, Pablo; Haining, W Nicholas; Mesirov, Jill P

    2015-01-01

    Gene set enrichment analysis (GSEA) approaches are widely used to identify coordinately regulated genes associated with phenotypes of interest. Here, we present Constellation Map, a tool to visualize and interpret the results when enrichment analyses yield a long list of significantly enriched gene sets. Constellation Map identifies commonalities that explain the enrichment of multiple top-scoring gene sets and maps the relationships between them. Constellation Map can help investigators take full advantage of GSEA and facilitates the biological interpretation of enrichment results. Constellation Map is freely available as a GenePattern module at http://www.genepattern.org.

  14. Constellation Map: Downstream visualization and interpretation of gene set enrichment results

    PubMed Central

    Tamayo, Pablo; Haining, W. Nicholas; Mesirov, Jill P.

    2015-01-01

    Summary: Gene set enrichment analysis (GSEA) approaches are widely used to identify coordinately regulated genes associated with phenotypes of interest. Here, we present Constellation Map, a tool to visualize and interpret the results when enrichment analyses yield a long list of significantly enriched gene sets. Constellation Map identifies commonalities that explain the enrichment of multiple top-scoring gene sets and maps the relationships between them. Constellation Map can help investigators take full advantage of GSEA and facilitates the biological interpretation of enrichment results. Availability: Constellation Map is freely available as a GenePattern module at http://www.genepattern.org. PMID:26594333

  15. Degrees of separation as a statistical tool for evaluating candidate genes.

    PubMed

    Nelson, Ronald M; Pettersson, Mats E

    2014-12-01

    Selection of candidate genes is an important step in the exploration of complex genetic architecture. The number of gene networks available is increasing and these can provide information to help with candidate gene selection. It is currently common to use the degree of connectedness in gene networks as validation in Genome Wide Association (GWA) and Quantitative Trait Locus (QTL) mapping studies. However, it can cause misleading results if not validated properly. Here we present a method and tool for validating the gene pairs from GWA studies given the context of the network they co-occur in. It ensures that proposed interactions and gene associations are not statistical artefacts inherent to the specific gene network architecture. The CandidateBacon package provides an easy and efficient method to calculate the average degree of separation (DoS) between pairs of genes to currently available gene networks. We show how these empirical estimates of average connectedness are used to validate candidate gene pairs. Validation of interacting genes by comparing their connectedness with the average connectedness in the gene network will provide support for said interactions by utilising the growing amount of gene network information available. Copyright © 2014 Elsevier Ltd. All rights reserved.
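
    The average degree of separation described in the abstract above can be sketched as the mean shortest-path length in a gene network, computed by breadth-first search. This is an illustrative sketch only: it does not use the CandidateBacon package, and the toy network and function names are hypothetical.

```python
from collections import deque

def degrees_of_separation(network, source):
    """Breadth-first search: number of edges from source to each reachable gene."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        gene = queue.popleft()
        for neighbor in network.get(gene, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[gene] + 1
                queue.append(neighbor)
    return dist

def average_dos(network):
    """Mean shortest-path length over all connected ordered gene pairs."""
    total, pairs = 0, 0
    for gene in network:
        for other, d in degrees_of_separation(network, gene).items():
            if other != gene:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")

# Toy undirected network stored as adjacency lists (a simple chain A-B-C-D).
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

candidate_dos = degrees_of_separation(network, "A")["D"]
print(candidate_dos, average_dos(network))
```

    Comparing a candidate pair's separation with the network-wide average gives a rough sense of whether the pair is unusually close or merely typical for that network architecture.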

  16. Selection of the Maximum Spatial Cluster Size of the Spatial Scan Statistic by Using the Maximum Clustering Set-Proportion Statistic

    PubMed Central

    Ma, Yue; Yin, Fei; Zhang, Tao; Zhou, Xiaohua Andrew; Li, Xiaosong

    2016-01-01

    Spatial scan statistics are widely used in various fields. The performance of these statistics is influenced by parameters, such as maximum spatial cluster size, and can be improved by parameter selection using performance measures. Current performance measures are based on the presence of clusters and are thus inapplicable to data sets without known clusters. In this work, we propose a novel overall performance measure called maximum clustering set–proportion (MCS-P), which is based on the likelihood of the union of detected clusters and the applied dataset. MCS-P was compared with existing performance measures in a simulation study to select the maximum spatial cluster size. Results of other performance measures, such as sensitivity and misclassification, suggest that the spatial scan statistic achieves accurate results in most scenarios with the maximum spatial cluster sizes selected using MCS-P. Given that previously known clusters are not required in the proposed strategy, selection of the optimal maximum cluster size with MCS-P can improve the performance of the scan statistic in applications without identified clusters. PMID:26820646

  17. A reference gene set for chemosensory receptor genes of Manduca sexta.

    PubMed

    Koenig, Christopher; Hirsh, Ariana; Bucks, Sascha; Klinner, Christian; Vogel, Heiko; Shukla, Aditi; Mansfield, Jennifer H; Morton, Brian; Hansson, Bill S; Grosse-Wilde, Ewald

    2015-11-01

    The order of Lepidoptera has historically been crucial for chemosensory research, with many important advances coming from the analysis of species like Bombyx mori or the tobacco hornworm, Manduca sexta. Specifically M. sexta has long been a major model species in the field, especially regarding the importance of olfaction in an ecological context, mainly the interaction with its host plants. In recent years transcriptomic data has led to the discovery of members of all major chemosensory receptor families in the species, but the data was fragmentary and incomplete. Here we present the analysis of the newly available high-quality genome data for the species, supplemented by additional transcriptome data to generate a high quality reference gene set for the three major chemosensory receptor gene families, the gustatory (GR), olfactory (OR) and antennal ionotropic receptors (IR). Coupled with gene expression analysis our approach allows association of specific receptor types and behaviors, like pheromone and host detection. The dataset will provide valuable support for future analysis of these essential chemosensory modalities in this species and in Lepidoptera in general.

  18. Gene selection for tumor classification using neighborhood rough sets and entropy measures.

    PubMed

    Chen, Yumin; Zhang, Zunjun; Zheng, Jianzhong; Ma, Ying; Xue, Yu

    2017-03-01

    With the development of bioinformatics, tumor classification from gene expression data has become an important technology for cancer diagnosis. Since gene expression data often contain thousands of genes and only a small number of samples, gene selection becomes a key step for tumor classification. Attribute reduction of rough sets has been successfully applied to gene selection, as it is data-driven and requires no additional information. However, the traditional rough set method deals with discrete data only. Gene expression data containing real-valued or noisy measurements are therefore usually discretized in a preprocessing step, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can deal with real-valued data whilst maintaining the original gene classification information. Moreover, this paper presents an entropy measure within the framework of neighborhood rough sets for tackling the uncertainty and noise of gene expression data. Using this measure can lead to the discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression data sets show that the proposed gene selection is an effective method for improving the accuracy of tumor classification. Copyright © 2017 Elsevier Inc. All rights reserved.

  19. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update.

    PubMed

    Kuleshov, Maxim V; Jones, Matthew R; Rouillard, Andrew D; Fernandez, Nicolas F; Duan, Qiaonan; Wang, Zichen; Koplev, Simon; Jenkins, Sherry L; Jagodnik, Kathleen M; Lachmann, Alexander; McDermott, Michael G; Monteiro, Caroline D; Gundersen, Gregory W; Ma'ayan, Avi

    2016-07-08

    Enrichment analysis is a popular method for analyzing gene sets generated by genome-wide experiments. Here we present a significant update to one of the tools in this domain called Enrichr. Enrichr currently contains a large collection of diverse gene set libraries available for analysis and download. In total, Enrichr currently contains 180,184 annotated gene sets from 102 gene set libraries. New features have been added to Enrichr, including the ability to submit fuzzy sets and upload BED files, an improved application programming interface, and visualization of the results as clustergrams. Overall, Enrichr is a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries. Enrichr is freely available at: http://amp.pharm.mssm.edu/Enrichr.
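
    Enrichment of a gene list in an annotated gene set is commonly assessed with a one-sided hypergeometric test (equivalent to a one-sided Fisher exact test). The sketch below illustrates that general technique only; it is not Enrichr's API or its full scoring scheme, and the function name and example counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_set, n_list, n_overlap):
    """P(X >= n_overlap), where X is the overlap between a random list of
    n_list genes and a gene set of n_set genes, drawn from n_universe genes."""
    upper = min(n_set, n_list)
    total = comb(n_universe, n_list)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Made-up example: 20,000 genes, a 150-gene set, a 300-gene input list, 12 shared.
print(hypergeom_enrichment_p(20000, 150, 300, 12))
```

    When many gene sets are tested at once, as with a large collection of libraries, the resulting p-values would normally need a multiple-testing correction.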

  20. Assessment and improvement of statistical tools for comparative proteomics analysis of sparse data sets with few experimental replicates.

    PubMed

    Schwämmle, Veit; León, Ileana Rodríguez; Jensen, Ole Nørregaard

    2013-09-06

    Large-scale quantitative analyses of biological systems are often performed with few replicate experiments, leading to multiple nonidentical data sets due to missing values. For example, mass spectrometry driven proteomics experiments are frequently performed with few biological or technical replicates due to sample-scarcity or due to duty-cycle or sensitivity constraints, or limited capacity of the available instrumentation, leading to incomplete results where detection of significant feature changes becomes a challenge. This problem is further exacerbated for the detection of significant changes on the peptide level, for example, in phospho-proteomics experiments. In order to assess the extent of this problem and the implications for large-scale proteome analysis, we investigated and optimized the performance of three statistical approaches by using simulated and experimental data sets with varying numbers of missing values. We applied three tools, including standard t test, moderated t test, also known as limma, and rank products for the detection of significantly changing features in simulated and experimental proteomics data sets with missing values. The rank product method was improved to work with data sets containing missing values. Extensive analysis of simulated and experimental data sets revealed that the performance of the statistical analysis tools depended on simple properties of the data sets. High-confidence results were obtained by using the limma and rank products methods for analyses of triplicate data sets that exhibited more than 1000 features and more than 50% missing values. The maximum number of differentially represented features was identified by using limma and rank products methods in a complementary manner. We therefore recommend combined usage of these methods as a novel and optimal way to detect significantly changing features in these data sets. 
This approach is suitable for large quantitative data sets from stable isotope labeling
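
    The rank product statistic named in the abstract above can be sketched as the geometric mean of a feature's ranks across replicates. How the authors adapted it to missing values is not stated here, so the handling below (averaging only over the replicates in which a feature was observed) is an assumption for illustration, and all names and values are hypothetical.

```python
import math

def rank_products(replicates):
    """replicates: list of dicts mapping feature -> log fold change.
    Features missing from a replicate are simply absent from that dict.
    Returns feature -> geometric mean of its ranks (rank 1 = largest change)."""
    ranks = {}
    for rep in replicates:
        ordered = sorted(rep, key=rep.get, reverse=True)
        for rank, feature in enumerate(ordered, start=1):
            ranks.setdefault(feature, []).append(rank)
    return {
        feature: math.exp(sum(math.log(r) for r in rs) / len(rs))
        for feature, rs in ranks.items()
    }

# Toy triplicate data; p3 is missing in the second replicate.
reps = [
    {"p1": 2.0, "p2": 1.0, "p3": 0.5},
    {"p1": 1.5, "p2": 0.2},
    {"p1": 3.0, "p3": 1.0, "p2": 0.1},
]
rp = rank_products(reps)
print(sorted(rp.items(), key=lambda item: item[1]))
```

    A small rank product marks a feature that is consistently near the top; significance would normally be assessed by permuting the data, which this sketch omits.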

  1. Size distribution of function-based human gene sets and the split-merge model.

    PubMed

    Li, Wentian; Fontanelli, Oscar; Miramontes, Pedro

    2016-08-01

    The sizes of paralogues (gene families produced by ancestral duplication) are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families where genes are grouped by a similar function or share a common property. The size distribution of Human Gene Nomenclature Committee (HGNC) gene sets deviates from the power law and can be fitted much better by a beta rank function. We propose a simple mechanism to break a power-law size distribution by a combination of splitting and merging operations. The largest gene sets are split into two to account for the subfunctional categories, and a small proportion of other gene sets are merged into larger sets as new common themes might be realized. These operations are not uncommon for a curator of gene sets. A simulation shows that iteration of these operations changes the size distribution of Ensembl paralogues and could lead to a distribution fitted by a beta rank function. We further illustrate the application of the beta rank function with the example of the distribution of transcription factors and drug target genes among HGNC gene families.
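
    The two ingredients of the abstract above can be sketched in a few lines. The beta rank function is commonly written as size(r) = C (N + 1 - r)^b / r^a for rank r out of N sets, which reduces to a pure power law when b = 0. The split-merge rule below is a simplified assumption (split the largest set in half, then merge two randomly chosen sets), not the authors' exact procedure, and all names and starting sizes are hypothetical.

```python
import random

def beta_rank(rank, n, a, b, c=1.0):
    """Beta rank function: c * (n + 1 - rank)**b / rank**a."""
    return c * (n + 1 - rank) ** b / rank ** a

def split_merge_step(sizes, rng):
    """One illustrative iteration: split the largest set, then merge two random sets."""
    sizes = sorted(sizes, reverse=True)
    largest = sizes.pop(0)
    if largest >= 2:
        half = largest // 2
        sizes += [half, largest - half]
    else:
        sizes.append(largest)
    if len(sizes) >= 2:
        i, j = rng.sample(range(len(sizes)), 2)
        merged = sizes[i] + sizes[j]
        sizes = [s for k, s in enumerate(sizes) if k not in (i, j)] + [merged]
    return sorted(sizes, reverse=True)

# Start from power-law-like sizes (size roughly proportional to 1 / rank).
start = [max(1, round(1000 / r)) for r in range(1, 51)]
rng = random.Random(0)
sizes = start
for _ in range(100):
    sizes = split_merge_step(sizes, rng)
print(start[:5], sizes[:5])
```

    Each step conserves the total number of genes, so only the shape of the rank-size curve changes; fitting a and b to the result is left out of this sketch.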

  2. Size distribution of function-based human gene sets and the split–merge model

    PubMed Central

    Fontanelli, Oscar; Miramontes, Pedro

    2016-01-01

    The sizes of paralogues (gene families produced by ancestral duplication) are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families where genes are grouped by a similar function or share a common property. The size distribution of Human Gene Nomenclature Committee (HGNC) gene sets deviates from the power law and can be fitted much better by a beta rank function. We propose a simple mechanism to break a power-law size distribution by a combination of splitting and merging operations. The largest gene sets are split into two to account for the subfunctional categories, and a small proportion of other gene sets are merged into larger sets as new common themes might be realized. These operations are not uncommon for a curator of gene sets. A simulation shows that iteration of these operations changes the size distribution of Ensembl paralogues and could lead to a distribution fitted by a beta rank function. We further illustrate the application of the beta rank function with the example of the distribution of transcription factors and drug target genes among HGNC gene families. PMID:27853602

  3. Gene set of nuclear-encoded mitochondrial regulators is enriched for common inherited variation in obesity.

    PubMed

    Knoll, Nadja; Jarick, Ivonne; Volckmar, Anna-Lena; Klingenspor, Martin; Illig, Thomas; Grallert, Harald; Gieger, Christian; Wichmann, Heinz-Erich; Peters, Annette; Hebebrand, Johannes; Scherag, André; Hinney, Anke

    2013-01-01

    There are hints of an altered mitochondrial function in obesity. Nuclear-encoded genes are relevant for mitochondrial function (3 gene sets of known relevant pathways: (1) 16 nuclear regulators of mitochondrial genes, (2) 91 genes for oxidative phosphorylation and (3) 966 nuclear-encoded mitochondrial genes). Gene set enrichment analysis (GSEA) showed no association with type 2 diabetes mellitus in these gene sets. Here we performed a GSEA for the same gene sets for obesity. Genome wide association study (GWAS) data from a case-control approach on 453 extremely obese children and adolescents and 435 lean adult controls were used for GSEA. For independent confirmation, we analyzed 705 obesity GWAS trios (extremely obese child and both biological parents) and a population-based GWAS sample (KORA F4, n = 1,743). A meta-analysis was performed on all three samples. In each sample, the distribution of significance levels between the respective gene set and those of all genes was compared using the leading-edge-fraction-comparison test (cut-offs between the 50th and 95th percentile of the set of all gene-wise corrected p-values) as implemented in the MAGENTA software. In the case-control sample, significant enrichment of associations with obesity was observed above the 50th percentile for the set of the 16 nuclear regulators of mitochondrial genes (p(GSEA,50) = 0.0103). This finding was not confirmed in the trios (p(GSEA,50) = 0.5991), but was confirmed in KORA (p(GSEA,50) = 0.0398). The meta-analysis again indicated a trend for enrichment (p(MAGENTA,50) = 0.1052, p(MAGENTA,75) = 0.0251). The GSEA revealed that weak association signals for obesity might be enriched in the gene set of 16 nuclear regulators of mitochondrial genes.

  4. Gene-Set Local Hierarchical Clustering (GSLHC)--A Gene Set-Based Approach for Characterizing Bioactive Compounds in Terms of Biological Functional Groups.

    PubMed

    Chung, Feng-Hsiang; Jin, Zhen-Hua; Hsu, Tzu-Ting; Hsu, Chueh-Lin; Liu, Hsueh-Chuan; Lee, Hoong-Chien

    2015-01-01

    Gene-set-based analysis (GSA), which uses the relative importance of functional gene-sets, or molecular signatures, as units for analysis of genome-wide gene expression data, has exhibited major advantages with respect to greater accuracy, robustness, and biological relevance, over individual gene analysis (IGA), which uses log-ratios of individual genes for analysis. Yet IGA remains the dominant mode of analysis of gene expression data. The Connectivity Map (CMap), an extensive database on genomic profiles of effects of drugs and small molecules and widely used for studies related to repurposed drug discovery, has been mostly employed in IGA mode. Here, we constructed a GSA-based version of CMap, Gene-Set Connectivity Map (GSCMap), in which all the genomic profiles in CMap are converted, using gene-sets from the Molecular Signatures Database, to functional profiles. We showed that GSCMap essentially eliminated cell-type dependence, a weakness of CMap in IGA mode, and yielded significantly better performance on sample clustering and drug-target association. As a first application of GSCMap we constructed the platform Gene-Set Local Hierarchical Clustering (GSLHC) for discovering insights on coordinated actions of biological functions and facilitating classification of heterogeneous subtypes on drug-driven responses. GSLHC was shown to tightly cluster drugs of known similar properties. We used GSLHC to identify the therapeutic properties and putative targets of 18 compounds of previously unknown characteristics listed in CMap, eight of which suggest anti-cancer activities. The GSLHC website http://cloudr.ncu.edu.tw/gslhc/ contains 1,857 local hierarchical clusters accessible by querying 555 of the 1,309 drugs and small molecules listed in CMap. We expect GSCMap and GSLHC to be widely useful in providing new insights in the biological effect of bioactive compounds, in drug repurposing, and in function-based classification of complex diseases.

  5. Statistical energy analysis modelling of complex structures as coupled sets of oscillators: Ensemble mean and variance of energy

    NASA Astrophysics Data System (ADS)

    Ji, L.; Mace, B. R.

    2008-11-01

    Expressions are derived for the ensemble means and variances of the subsystem energies of built-up systems comprising two subsystems. The approach is based on the Statistical Energy Analysis of two spring-coupled oscillators and sets of oscillators, or coupled continuous subsystems, described by Mace and Ji [The statistical energy analysis of coupled sets of oscillators, Proceedings of the Royal Society A 1824 (2007)]. The paper focuses on spring coupling, although similar results hold for more general forms of conservative coupling. Randomness is introduced into the system by assuming that the natural frequency spacings in each subsystem conform to certain statistical distributions. A "coupling coefficient parameter" is introduced which, together with the "coupling strength parameter" defined by Mace and Ji (2007), accounts for the statistics of the coupling stiffness. Various approximations and assumptions are made. It is seen that the variance of the excited subsystem depends primarily on the variance of the input power, which in turn depends on the variance of the number of modes of the excited subsystem in the frequency band of excitation and their mode shapes. The variance of the undriven subsystem, on the other hand, depends primarily on the variance of the intermodal coupling coefficients, which in turn depend on the variances of the number of in-band modes of both subsystems and their mode shapes. The cases of Poisson and Gaussian Orthogonal Ensemble natural frequency spacing statistics are considered. Numerical examples of two plates coupled by one or a number of springs are presented.

  6. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics.

    PubMed

    Lamparter, David; Marbach, Daniel; Rueedi, Rico; Kutalik, Zoltán; Bergmann, Sven

    2016-01-01

    Integrating single nucleotide polymorphism (SNP) p-values from genome-wide association studies (GWAS) across genes and pathways is a strategy to improve statistical power and gain biological insight. Here, we present Pascal (Pathway scoring algorithm), a powerful tool for computing gene and pathway scores from SNP-phenotype association summary statistics. For gene score computation, we implemented analytic and efficient numerical solutions to calculate test statistics. We examined in particular the sum and the maximum of chi-squared statistics, which measure the average and the strongest association signals per gene, respectively. For pathway scoring, we use a modified Fisher method, which offers not only significant power improvement over more traditional enrichment strategies, but also eliminates the problem of arbitrary threshold selection inherent in any binary membership based pathway enrichment approach. We demonstrate the marked increase in power by analyzing summary statistics from dozens of large meta-studies for various traits. Our extensive testing indicates that our method not only excels in rigorous type I error control, but also results in more biologically meaningful discoveries.
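
    As background for the pathway scoring mentioned in the abstract above, the classical (unmodified) Fisher method combines k independent p-values through the statistic -2 Σ ln p, which follows a chi-squared distribution with 2k degrees of freedom under the null hypothesis. The sketch below shows only that textbook version, not Pascal's modified method; it assumes independent gene p-values, and the function names are hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function of a chi-squared variable with even df (closed form):
    exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)**i / i!"""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fisher_combined_p(p_values):
    """Classical Fisher combination of independent p-values (each in (0, 1])."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(stat, 2 * len(p_values))

# Several modest gene-level p-values combine into a stronger pathway-level signal.
print(fisher_combined_p([0.04, 0.10, 0.03, 0.20]))
```

    Because every gene contributes its full p-value, no significance cut-off has to be chosen for deciding which genes count as hits, which is the threshold-free property the abstract highlights.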

  7. Performance of Single and Concatenated Sets of Mitochondrial Genes at Inferring Metazoan Relationships Relative to Full Mitogenome Data

    PubMed Central

    Havird, Justin C.; Santos, Scott R.

    2014-01-01

    Mitochondrial (mt) genes are some of the most popular and widely-utilized genetic loci in phylogenetic studies of metazoan taxa. However, their linked nature has raised questions on whether using the entire mitogenome for phylogenetics is overkill (at best) or pseudoreplication (at worst). Moreover, no studies have addressed the comparative phylogenetic utility of mitochondrial genes across individual lineages within the entire Metazoa. To comment on the phylogenetic utility of individual mt genes as well as concatenated subsets of genes, we analyzed mitogenomic data from 1865 metazoan taxa in 372 separate lineages spanning genera to subphyla. Specifically, phylogenies inferred from these datasets were statistically compared to ones generated from all 13 mt protein-coding (PC) genes (i.e., the “supergene” set) to determine which single genes performed “best” at, and the minimum number of genes required to, recover the “supergene” topology. Surprisingly, the popular marker COX1 performed poorest, while ND5, ND4, and ND2 were most likely to reproduce the “supergene” topology. Averaged across all lineages, the longest ∼2 mt PC genes were sufficient to recreate the “supergene” topology, although this average increased to ∼5 genes for datasets with 40 or more taxa. Furthermore, concatenation of the three “best” performing mt PC genes outperformed that of the three longest mt PC genes (i.e, ND5, COX1, and ND4). Taken together, while not all mt PC genes are equally interchangeable in phylogenetic studies of the metazoans, some subset can serve as a proxy for the 13 mt PC genes. However, the exact number and identity of these genes is specific to the lineage in question and cannot be applied indiscriminately across the Metazoa. PMID:24454717

  8. Comparative study on gene set and pathway topology-based enrichment methods.

    PubMed

    Bayerlová, Michaela; Jung, Klaus; Kramer, Frank; Klemm, Florian; Bleckmann, Annalen; Beißbarth, Tim

    2015-10-22

    Enrichment analysis is a popular approach to identify pathways or sets of genes which are significantly enriched in the context of differentially expressed genes. The traditional gene set enrichment approach considers a pathway as a simple gene list disregarding any knowledge of gene or protein interactions. In contrast, the new group of so called pathway topology-based methods integrates the topological structure of a pathway into the analysis. We comparatively investigated gene set and pathway topology-based enrichment approaches, considering three gene set and four topological methods. These methods were compared in two extensive simulation studies and on a benchmark of 36 real datasets, providing the same pathway input data for all methods. In the benchmark data analysis both types of methods showed a comparable ability to detect enriched pathways. The first simulation study was conducted with KEGG pathways, which showed considerable gene overlaps between each other. In this study with original KEGG pathways, none of the topology-based methods outperformed the gene set approach. Therefore, a second simulation study was performed on non-overlapping pathways created by unique gene IDs. Here, methods accounting for pathway topology reached higher accuracy than the gene set methods, however their sensitivity was lower. We conducted one of the first comprehensive comparative works on evaluating gene set against pathway topology-based enrichment methods. The topological methods showed better performance in the simulation scenarios with non-overlapping pathways, however, they were not conclusively better in the other scenarios. This suggests that simple gene set approach might be sufficient to detect an enriched pathway under realistic circumstances. Nevertheless, more extensive studies and further benchmark data are needed to systematically evaluate these methods and to assess what gain and cost pathway topology information introduces into enrichment analysis. 

  9. Re-Conceptualization of Modified Angoff Standard Setting: Unified Statistical, Measurement, Cognitive, and Social Psychological Theories

    ERIC Educational Resources Information Center

    Iyioke, Ifeoma Chika

    2013-01-01

    This dissertation describes a design for training, in accordance with probability judgment heuristics principles, for the Angoff standard setting method. The new training with instruction, practice, and feedback tailored to the probability judgment heuristics principles was called the Heuristic training and the prevailing Angoff method training…

  10. Re-Conceptualization of Modified Angoff Standard Setting: Unified Statistical, Measurement, Cognitive, and Social Psychological Theories

    ERIC Educational Resources Information Center

    Iyioke, Ifeoma Chika

    2013-01-01

    This dissertation describes a design for training, in accordance with probability judgment heuristics principles, for the Angoff standard setting method. The new training with instruction, practice, and feedback tailored to the probability judgment heuristics principles was called the Heuristic training and the prevailing Angoff method training…

  11. Phylogenetics and evolution of Trx SET genes in fully sequenced land plants.

    PubMed

    Zhu, Xinyu; Chen, Caoyi; Wang, Baohua

    2012-04-01

    Plant Trx SET proteins are involved in H3K4 methylation and play a key role in plant floral development. Genes encoding Trx SET proteins constitute a multigene family in which the copy number varies among plant species and functional divergence appears to have occurred repeatedly. To investigate the evolutionary history of the Trx SET gene family, we performed a comprehensive evolutionary analysis of this gene family in 13 major representatives of green plants. A novel cluster (here named the cpTrx clade), which includes the previously resolved III-1, III-2, and III-4 orthologous groups, was identified. Our analysis showed that plant Trx proteins possess a variety of domain organizations and gene structures among paralogs. Additional domains such as PHD, PWWP, and FYR were integrated early into the primordial SET-PostSET domain organization of the cpTrx clade. We suggest that the PostSET domain was lost in some members of the III-4 orthologous group during the evolution of land plants. At least four classes of gene structures had formed at the early evolutionary stage of land plants. Three intronless orphan Trx SET genes from the moss Physcomitrella patens were identified, and their parental genes have supposedly been eliminated from the genome. The structural differences among evolutionary groups of plant Trx SET genes with different functions are described, contributing to the design of further experimental studies.

  12. A statistical model for bacterial speciation triggered by lateral gene transfer

    NASA Astrophysics Data System (ADS)

    Sidhu, Sunjeet; Peng, Wequin

    2006-03-01

    The process of bacterial speciation has been a major unresolved issue in the study of bacterial evolution. It has been proposed that lateral gene transfer and homologous recombination play critical and complementary roles in speciation. We introduce a statistical model, of a population, for the evolution under lateral gene transfer and local homologous recombination. We examine the evolutionary dynamics and its dependence on various evolutionary operators. J. G. Lawrence, Theor. Popul. Biol. 61, 449(2002).

  13. Weighted-SAMGSR: combining significance analysis of microarray-gene set reduction algorithm with pathway topology-based weights to select relevant genes.

    PubMed

    Tian, Suyan; Chang, Howard H; Wang, Chi

    2016-09-29

    It has been demonstrated that a pathway-based feature selection method that incorporates biological information within pathways during the process of feature selection usually outperforms a gene-based feature selection algorithm in terms of predictive accuracy and stability. Significance analysis of microarray-gene set reduction algorithm (SAMGSR), an extension to a gene set analysis method with further reduction of the selected pathways to their respective core subsets, can be regarded as a pathway-based feature selection method. In SAMGSR, whether a gene is selected is mainly determined by its expression difference between the phenotypes, and partially by the number of pathways to which this gene belongs. It ignores the topology information among pathways. In this study, we propose a weighted version of the SAMGSR algorithm by constructing weights based on the connectivity among genes and then combining these weights with the test statistics. Using both simulated and real-world data, we evaluate the performance of the proposed SAMGSR extension and demonstrate that the weighted version outperforms its original version. CONCLUSIONS: To conclude, the additional gene connectivity information does facilitate feature selection. This article was reviewed by Drs. Limsoon Wong, Lev Klebanov, and I. King Jordan.

  14. Micro-foundations for macroeconomics: New set-up based on statistical physics

    NASA Astrophysics Data System (ADS)

    Yoshikawa, Hiroshi

    2016-12-01

    Modern macroeconomics is built on "micro foundations." Namely, the optimization of micro agents such as consumers and firms is explicitly analyzed in the model. Toward this goal, the standard model presumes "the representative" consumer/firm, and analyzes its behavior in detail. However, the macroeconomy consists of 10^7 consumers and 10^6 firms. For the purpose of analyzing such a macro system, it is meaningless to pursue the micro behavior in detail. In this respect, there is no essential difference between economics and physics. The method of statistical physics can be usefully applied to the macroeconomy, and provides Keynesian economics with correct micro-foundations.

  15. Boosted leave-many-out cross-validation: the effect of training and test set diversity on PLS statistics.

    PubMed

    Clark, Robert D

    2003-01-01

    It is becoming increasingly common in quantitative structure/activity relationship (QSAR) analyses to use external test sets to evaluate the likely stability and predictivity of the models obtained. In some cases, such as those involving variable selection, an internal test set--i.e., a cross-validation set--is also used. Care is sometimes taken to ensure that the subsets used exhibit response and/or property distributions similar to those of the data set as a whole, but more often the individual observations are simply assigned 'at random.' In the special case of MLR without variable selection, it can be analytically demonstrated that this strategy is inferior to others. Most particularly, D-optimal design performs better if the form of the regression equation is known and the variables involved are well behaved. This report introduces an alternative, non-parametric approach termed 'boosted leave-many-out' (boosted LMO) cross-validation. In this method, relatively small training sets are chosen by applying optimizable k-dissimilarity selection (OptiSim) using a small subsample size (k = 4, in this case), with the unselected observations being reserved as a test set for the corresponding reduced model. Predictive errors for the full model are then estimated by aggregating results over several such analyses. The countervailing effects of training and test set size, diversity, and representativeness on PLS model statistics are described for CoMFA analysis of a large data set of COX2 inhibitors.

  16. Mechanical Unloading of Mouse Bone in Microgravity Significantly Alters Cell Cycle Gene Set Expression

    NASA Astrophysics Data System (ADS)

    Blaber, Elizabeth; Dvorochkin, Natalya; Almeida, Eduardo; Kaplan, Warren; Burns, Brendan

    2012-07-01

    unloading in spaceflight, we conducted genome wide microarray analysis of total RNA isolated from the mouse pelvis. Specifically, 16 week old mice were subjected to 15 days spaceflight onboard NASA's STS-131 space shuttle mission. The pelvis of the mice was dissected, the bone marrow was flushed and the bones were briefly stored in RNAlater. The pelvii were then homogenized, and RNA was isolated using TRIzol. RNA concentration and quality were measured using a Nanodrop spectrometer and 0.8% agarose gel electrophoresis. Samples of cDNA were analyzed using an Affymetrix GeneChip® Gene 1.0 ST (Sense Target) Array System for Mouse and GenePattern Software. We normalized the ST gene arrays using Robust Multichip Average (RMA) normalization, which summarizes perfectly matched spots on the array through the median polish algorithm, rather than normalizing according to mismatched spots. We also used Limma for statistical analysis, using the BioConductor Limma Library by Gordon Smyth, and differential expression analysis to identify genes with significant changes in expression between the two experimental conditions. Finally we used GSEApreRanked for Gene Set Enrichment Analysis (GSEA), with Kolmogorov-Smirnov style statistics to identify groups of genes that are regulated together using the t-statistics derived from Limma. Preliminary results show that 6,603 genes expressed in pelvic bone had statistically significant alterations in spaceflight compared to ground controls. These prominently included cell cycle arrest molecules p21 and p18, cell survival molecule Crbp1, and cell cycle molecules cyclin D1 and Cdk1. Additionally, GSEA results indicated alterations in molecular targets of cyclin D1 and Cdk4, senescence pathways resulting from abnormal laminin maturation, cell-cell contacts via E-cadherin, and several pathways relating to protein translation and metabolism. In total 111 gene sets out of 2,488, about 4%, showed statistically significant set alterations. These…
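
    The Kolmogorov-Smirnov style statistic mentioned above can be illustrated with a minimal sketch. It assumes the simplest unweighted running sum over a list of genes already ranked by t-statistic; the GSEApreRanked tool itself applies weighting and permutation-based significance testing that are not reproduced here. The gene names and the function name enrichment_score are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted KS-style running-sum statistic for a ranked gene list.

    Walk down the ranking; step up at each gene in the set and down at
    each gene outside it. Return the largest deviation from zero.
    """
    members = set(gene_set)
    n_total = len(ranked_genes)
    n_hit = sum(1 for g in ranked_genes if g in members)
    if n_hit == 0 or n_hit == n_total:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    hit_step = 1.0 / n_hit                # up-steps sum to +1
    miss_step = 1.0 / (n_total - n_hit)   # down-steps sum to -1
    running, best = 0.0, 0.0
    for g in ranked_genes:
        running += hit_step if g in members else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking, sorted by decreasing t-statistic.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})     # set concentrated at the top
es_bottom = enrichment_score(ranked, {"g5", "g6"})  # set concentrated at the bottom
```

    A set clustered at the top of the ranking yields a positive score and a set clustered at the bottom a negative one; in practice the score is compared against a null distribution obtained by permutation.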

  17. GSA-PCA: gene set generation by principal component analysis of the Laplacian matrix of a metabolic network

    PubMed Central

    2012-01-01

    Background: Gene Set Analysis (GSA) has proven to be a useful approach to microarray analysis. However, most of the method development for GSA has focused on the statistical tests to be used rather than on the generation of sets that will be tested. Existing methods of set generation are often overly simplistic. The creation of sets from individual pathways (in isolation) is a poor reflection of the complexity of the underlying metabolic network. We have developed a novel approach to set generation via the use of Principal Component Analysis of the Laplacian matrix of a metabolic network. We have analysed a relatively simple data set to show the difference in results between our method and the current state-of-the-art pathway-based sets. Results: The sets generated with this method are semi-exhaustive and capture much of the topological complexity of the metabolic network. The semi-exhaustive nature of this method has also allowed us to design a hypergeometric enrichment test to determine which genes are likely responsible for set significance. We show that our method finds significant aspects of biology that would be missed (i.e. false negatives) and addresses the false positive rates found with the use of simple pathway-based sets. Conclusions: The set generation step for GSA is often neglected but is a crucial part of the analysis as it defines the full context for the analysis. As such, set generation methods should be robust and yield as complete a representation of the extant biological knowledge as possible. The method reported here achieves this goal and is demonstrably superior to previous set analysis methods. PMID:22876834
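
    For orientation, a generic one-sided hypergeometric enrichment test can be sketched as follows. This is the textbook over-representation test, not necessarily the specific test the authors designed; the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, list_size, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    X is the number of gene-set members found when list_size genes are
    drawn without replacement from a universe of universe_size genes,
    set_size of which belong to the gene set.
    """
    upper = min(set_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes in the universe, a set of 5, a list of 5.
p_all_five = hypergeom_enrichment_p(20, 5, 5, 5)
p_three_or_more = hypergeom_enrichment_p(20, 5, 5, 3)
```

    Smaller p-values indicate that the observed overlap is unlikely under random sampling; when many sets are tested, the p-values would normally be corrected for multiple testing.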

  18. Integrated Data Collection Analysis (IDCA) Program - Statistical Analysis of RDX Standard Data Sets

    SciTech Connect

    Sandstrom, Mary M.; Brown, Geoffrey W.; Preston, Daniel N.; Pollard, Colin J.; Warner, Kirstin F.; Sorensen, Daniel N.; Remmers, Daniel L.; Phillips, Jason J.; Shelley, Timothy J.; Reyes, Jose A.; Hsu, Peter C.; Reynolds, John G.

    2015-10-30

    The Integrated Data Collection Analysis (IDCA) program is conducting a Proficiency Test for Small- Scale Safety and Thermal (SSST) testing of homemade explosives (HMEs). Described here are statistical analyses of the results for impact, friction, electrostatic discharge, and differential scanning calorimetry analysis of the RDX Type II Class 5 standard. The material was tested as a well-characterized standard several times during the proficiency study to assess differences among participants and the range of results that may arise for well-behaved explosive materials. The analyses show that there are detectable differences among the results from IDCA participants. While these differences are statistically significant, most of them can be disregarded for comparison purposes to assess potential variability when laboratories attempt to measure identical samples using methods assumed to be nominally the same. The results presented in this report include the average sensitivity results for the IDCA participants and the ranges of values obtained. The ranges represent variation about the mean values of the tests of between 26% and 42%. The magnitude of this variation is attributed to differences in operator, method, and environment as well as the use of different instruments that are also of varying age. The results appear to be a good representation of the broader safety testing community based on the range of methods, instruments, and environments included in the IDCA Proficiency Test.

  19. Statistics of dark matter halos in the excursion set peak framework

    SciTech Connect

    Lapi, A.; Danese, L.

    2014-07-01

    We derive approximated, yet very accurate analytical expressions for the abundance and clustering properties of dark matter halos in the excursion set peak framework; the latter relies on the standard excursion set approach, but also includes the effects of a realistic filtering of the density field, a mass-dependent threshold for collapse, and the prescription from peak theory that halos tend to form around density maxima. We find that our approximations work excellently for diverse power spectra, collapse thresholds and density filters. Moreover, when adopting a cold dark matter power spectrum, a tophat filtering and a mass-dependent collapse threshold (supplemented with conceivable scatter), our approximated halo mass function and halo bias represent very well the outcomes of cosmological N-body simulations.

  20. Combining multiple tools outperforms individual methods in gene set enrichment analyses

    PubMed Central

    Ng, Milica; Wilson, Nicholas J.; Sheridan, Julie M.; Huynh, Huy; Wilson, Michael J.

    2017-01-01

    Abstract Motivation: Gene set enrichment (GSE) analysis allows researchers to efficiently extract biological insight from long lists of differentially expressed genes by interrogating them at a systems level. In recent years, there has been a proliferation of GSE analysis methods and hence it has become increasingly difficult for researchers to select an optimal GSE tool based on their particular dataset. Moreover, the majority of GSE analysis methods do not allow researchers to simultaneously compare gene set level results between multiple experimental conditions. Results: The ensemble of genes set enrichment analyses (EGSEA) is a method developed for RNA-sequencing data that combines results from twelve algorithms and calculates collective gene set scores to improve the biological relevance of the highest ranked gene sets. EGSEA’s gene set database contains around 25 000 gene sets from sixteen collections. It has multiple visualization capabilities that allow researchers to view gene sets at various levels of granularity. EGSEA has been tested on simulated data and on a number of human and mouse datasets and, based on biologists’ feedback, consistently outperforms the individual tools that have been combined. Our evaluation demonstrates the superiority of the ensemble approach for GSE analysis, and its utility to effectively and efficiently extrapolate biological functions and potential involvement in disease processes from lists of differentially regulated genes. Availability and Implementation: EGSEA is available as an R package at http://www.bioconductor.org/packages/EGSEA/. The gene sets collections are available in the R package EGSEAdata from http://www.bioconductor.org/packages/EGSEAdata/. Contacts: monther.alhamdoosh@csl.com.au or mritchie@wehi.edu.au Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27694195

  1. Combining multiple tools outperforms individual methods in gene set enrichment analyses.

    PubMed

    Alhamdoosh, Monther; Ng, Milica; Wilson, Nicholas J; Sheridan, Julie M; Huynh, Huy; Wilson, Michael J; Ritchie, Matthew E

    2017-02-01

    Gene set enrichment (GSE) analysis allows researchers to efficiently extract biological insight from long lists of differentially expressed genes by interrogating them at a systems level. In recent years, there has been a proliferation of GSE analysis methods and hence it has become increasingly difficult for researchers to select an optimal GSE tool based on their particular dataset. Moreover, the majority of GSE analysis methods do not allow researchers to simultaneously compare gene set level results between multiple experimental conditions. The ensemble of genes set enrichment analyses (EGSEA) is a method developed for RNA-sequencing data that combines results from twelve algorithms and calculates collective gene set scores to improve the biological relevance of the highest ranked gene sets. EGSEA's gene set database contains around 25 000 gene sets from sixteen collections. It has multiple visualization capabilities that allow researchers to view gene sets at various levels of granularity. EGSEA has been tested on simulated data and on a number of human and mouse datasets and, based on biologists' feedback, consistently outperforms the individual tools that have been combined. Our evaluation demonstrates the superiority of the ensemble approach for GSE analysis, and its utility to effectively and efficiently extrapolate biological functions and potential involvement in disease processes from lists of differentially regulated genes. EGSEA is available as an R package at http://www.bioconductor.org/packages/EGSEA/. The gene sets collections are available in the R package EGSEAdata from http://www.bioconductor.org/packages/EGSEAdata/. Contact: monther.alhamdoosh@csl.com.au, mritchie@wehi.edu.au. Supplementary data are available at Bioinformatics online.

  2. Human effector/initiator gene sets that regulate myometrial contractility during term and preterm labor.

    PubMed

    Weiner, Carl P; Mason, Clifford W; Dong, Yafeng; Buhimschi, Irina A; Swaan, Peter W; Buhimschi, Catalin S

    2010-05-01

    Distinct processes govern transition from quiescence to activation during term (TL) and preterm labor (PTL). We sought gene sets that are responsible for TL and PTL, along with the effector genes that are necessary for labor independent of gestation and underlying trigger. Expression was analyzed in term and preterm with or without labor (n=6 subjects/group). Gene sets were generated with logic operations. Thirty-four genes were expressed similarly in PTL/TL but were absent from nonlabor samples (effector set); 49 genes were specific to PTL (preterm initiator set), and 174 genes were specific to TL (term initiator set). The gene ontogeny processes that comprise term initiator and effector sets were diverse, although inflammation was represented in 4 of the top 10; inflammation dominated the preterm initiator set. TL and PTL differ dramatically in initiator profiles. Although inflammation is part of the term initiator and the effector sets, it is an overwhelming part of PTL that is associated with intraamniotic inflammation. Copyright (c) 2010 Mosby, Inc. All rights reserved.

  3. A power comparison of generalized additive models and the spatial scan statistic in a case-control setting

    PubMed Central

    2010-01-01

    Background: A common, important problem in spatial epidemiology is measuring and identifying variation in disease risk across a study region. In application of statistical methods, the problem has two parts. First, spatial variation in risk must be detected across the study region and, second, areas of increased or decreased risk must be correctly identified. The location of such areas may give clues to environmental sources of exposure and disease etiology. One statistical method applicable in spatial epidemiologic settings is a generalized additive model (GAM) which can be applied with a bivariate LOESS smoother to account for geographic location as a possible predictor of disease status. A natural hypothesis when applying this method is whether residential location of subjects is associated with the outcome, i.e. is the smoothing term necessary? Permutation tests are a reasonable hypothesis testing method and provide adequate power under a simple alternative hypothesis. These tests have yet to be compared to other spatial statistics. Results: This research uses simulated point data generated under three alternative hypotheses to evaluate the properties of the permutation methods and compare them to the popular spatial scan statistic in a case-control setting. Case 1 was a single circular cluster centered in a circular study region. The spatial scan statistic had the highest power though the GAM method estimates did not fall far behind. Case 2 was a single point source located at the center of a circular cluster and Case 3 was a line source at the center of the horizontal axis of a square study region. Each had linearly decreasing log odds with distance from the point. The GAM methods outperformed the scan statistic in Cases 2 and 3. Comparing sensitivity, measured as the proportion of the exposure source correctly identified as high or low risk, the GAM methods outperformed the scan statistic in all three Cases. Conclusions: The GAM permutation testing methods

  4. Statistical inference of selection and divergence of rice blast resistance gene Pi-ta

    USDA-ARS's Scientific Manuscript database

    The resistance gene Pi-ta has been effectively used to control rice blast disease worldwide. A few recent studies have described the possible evolution of Pi-ta in cultivated and weedy rice. However, evolutionary statistics used for the studies are too limited to precisely understand selection and d...

  5. Cross-cultural adaptation of research instruments: language, setting, time and statistical considerations

    PubMed Central

    2010-01-01

    Background: Research questionnaires are not always translated appropriately before they are used in new temporal, cultural or linguistic settings. The results based on such instruments may therefore not accurately reflect what they are supposed to measure. This paper aims to illustrate the process and required steps involved in the cross-cultural adaptation of a research instrument using the adaptation process of an attitudinal instrument as an example. Methods: A questionnaire was needed for the implementation of a study in Norway in 2007. There were no appropriate instruments available in Norwegian, thus an Australian-English instrument was cross-culturally adapted. Results: The adaptation process included investigation of conceptual and item equivalence. Two forward and two back-translations were synthesized and compared by an expert committee. Thereafter the instrument was pretested and adjusted accordingly. The final questionnaire was administered to opioid maintenance treatment staff (n=140) and harm reduction staff (n=180). The overall response rate was 84%. The original instrument failed confirmatory analysis. Instead a new two-factor scale was identified and found valid in the new setting. Conclusions: The failure of the original scale highlights the importance of adapting instruments to current research settings. It also emphasizes the importance of ensuring that concepts within an instrument are equal between the original and target language, time and context. If the described stages in the cross-cultural adaptation process had been omitted, the findings would have been misleading, even if presented with apparent precision. Thus, it is important to consider possible barriers when making a direct comparison between different nations, cultures and times. PMID:20144247

  6. Statistical Analysis of Hurst Exponents of Essential/Nonessential Genes in 33 Bacterial Genomes

    PubMed Central

    Liu, Xiao; Wang, Baojin; Xu, Luo

    2015-01-01

    Methods for identifying essential genes currently depend predominantly on biochemical experiments. However, there is demand for improved computational methods for determining gene essentiality. In this study, we used the Hurst exponent, a characteristic parameter to describe long-range correlation in DNA, and analyzed its distribution in 33 bacterial genomes. In most genomes (31 out of 33) the significance levels of the Hurst exponents of the essential genes were significantly higher than for the corresponding full-gene-set, whereas the significance levels of the Hurst exponents of the nonessential genes remained unchanged or increased only slightly. All of the Hurst exponents of essential genes followed a normal distribution, with one exception. We therefore propose that the distribution feature of Hurst exponents of essential genes can be used as a classification index for essential gene prediction in bacteria. For computer-aided design in the field of synthetic biology, this feature can build a restraint for pre- or post-design checking of bacterial essential genes. Moreover, considering the relationship between gene essentiality and evolution, the Hurst exponents could be used as a descriptive parameter related to evolutionary level, or be added to the annotation of each gene. PMID:26067107
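
    The abstract does not state how the Hurst exponent was estimated. One common estimator is rescaled-range (R/S) analysis, sketched below under that assumption; the window sizes, the synthetic test series, and the function names are all hypothetical, and mapping a DNA sequence to a numeric series is left out.

```python
import math
import random


def rescaled_range(window):
    """R/S of one window: range of cumulative mean-adjusted sums over the std."""
    n = len(window)
    mean = sum(window) / n
    cum, lo, hi, sq = 0.0, 0.0, 0.0, 0.0
    for x in window:
        d = x - mean
        cum += d
        lo, hi = min(lo, cum), max(hi, cum)
        sq += d * d
    std = math.sqrt(sq / n)
    return (hi - lo) / std if std > 0 else None


def hurst_exponent(series, window_sizes=(8, 16, 32, 64, 128)):
    """Slope of log(mean R/S) against log(window size), by least squares."""
    xs, ys = [], []
    for n in window_sizes:
        values = []
        for start in range(0, len(series) - n + 1, n):
            rs = rescaled_range(series[start:start + n])
            if rs is not None and rs > 0:
                values.append(rs)
        if values:
            xs.append(math.log(n))
            ys.append(math.log(sum(values) / len(values)))
    if len(xs) < 2:
        raise ValueError("series too short for the chosen window sizes")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope_num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope_den = sum((x - mx) ** 2 for x in xs)
    return slope_num / slope_den


# Synthetic series: uncorrelated noise and its cumulative sum (a random walk).
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1024)]
walk, total = [], 0.0
for step in noise:
    total += step
    walk.append(total)

h_noise = hurst_exponent(noise)  # expected to lie near 0.5
h_walk = hurst_exponent(walk)    # strongly persistent, expected to be larger
```

    Values near 0.5 suggest no long-range correlation, while larger values suggest persistence; the R/S estimator is known to be biased upward for short windows, so small differences should be read with care.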

  7. Screening key genes and pathways in glioma based on gene set enrichment analysis and meta-analysis.

    PubMed

    Tang, Yanyan; He, Wenwu; Wei, Yunfei; Qu, Zhanli; Zeng, Jinming; Qin, Chao

    2013-06-01

    Glioma is a highly invasive, rapidly spreading form of brain cancer, while its etiology is largely unknown. A few recently reported studies have been developed using gene expression microarrays of glioma to identify differentially expressed genes from several to hundreds. This study was designed to analyze vast amounts of glioma-related microarray data and screen the key genes and pathways related to the development and progression of glioma. We used gene set enrichment analysis (GSEA) and meta-analysis of seven included studies after standardized microarray preprocessing, which increased concordance between these gene datasets. After GSEA, there were 14 mixing pathways including 13 up- and 1 down-regulated pathways. Based on the meta-analysis, 268 significant genes were screened out (P < 0.05); there were 249 genes identified by Kyoto Encyclopedia of Genes and Genomes (KEGG), and 27 KEGG pathways closely related to the set of the imported genes were identified. At last, six consistent pathways and key genes in these pathways related to glioma were obtained with combined GSEA and meta-analysis. The gene pathways that we identified could provide insight concerning the development of glioma. Further studies are needed to determine the biological function for the positive genes.

  8. Inferring disease and gene set associations with rank coherence in networks.

    PubMed

    Hwang, TaeHyun; Zhang, Wei; Xie, Maoqiang; Liu, Jinfeng; Kuang, Rui

    2011-10-01

    To validate the candidate disease genes identified from high-throughput genomic studies, a necessary step is to elucidate the associations between the set of candidate genes and disease phenotypes. The conventional gene set enrichment analysis often fails to reveal associations between disease phenotypes and the gene sets with a short list of poorly annotated genes, because the existing annotations of disease-causative genes are incomplete. This article introduces a network-based computational approach called rcNet to discover the associations between gene sets and disease phenotypes. A learning framework is proposed to maximize the coherence between the predicted phenotype-gene set relations and the known disease phenotype-gene associations. An efficient algorithm coupling ridge regression with label propagation and two variants are designed to find the optimal solution to the objective functions of the learning framework. We evaluated the rcNet algorithms with leave-one-out cross-validation on Online Mendelian Inheritance in Man (OMIM) data and an independent test set of recently discovered disease-gene associations. In the experiments, the rcNet algorithms achieved best overall rankings compared with the baselines. To further validate the reproducibility of the performance, we applied the algorithms to identify the target diseases of novel candidate disease genes obtained from recent studies of Genome-Wide Association Study (GWAS), DNA copy number variation analysis and gene expression profiling. The algorithms ranked the target disease of the candidate genes at the top of the rank list in many cases across all the three case studies. Availability: http://compbio.cs.umn.edu/dgsa_rcNet. Contact: kuang@cs.umn.edu.
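
    The label propagation component mentioned above can be illustrated with a generic sketch. It shows only the standard iteration f <- alpha * S * f + (1 - alpha) * y on a symmetrically normalized network, not the rcNet objective or its coupling with ridge regression; the toy network, alpha value, and function name are hypothetical.

```python
import math


def propagate_labels(adjacency, seeds, alpha=0.5, iterations=50):
    """Spread seed scores over an undirected network.

    adjacency: dict mapping node -> set of neighbouring nodes (symmetric).
    seeds: dict mapping node -> initial score (missing nodes start at 0).
    Each edge is weighted by 1 / sqrt(deg(u) * deg(v)).
    """
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    y = {node: float(seeds.get(node, 0.0)) for node in adjacency}
    f = dict(y)
    for _ in range(iterations):
        updated = {}
        for node, nbrs in adjacency.items():
            spread = sum(
                f[nbr] / math.sqrt(degree[node] * degree[nbr]) for nbr in nbrs
            )
            updated[node] = alpha * spread + (1.0 - alpha) * y[node]
        f = updated
    return f


# Hypothetical path network a - b - c - d with a single seed gene "a".
network = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}
scores = propagate_labels(network, {"a": 1.0})
```

    Scores decay with network distance from the seed, which is the guilt-by-association behaviour such methods rely on when ranking genes or phenotypes.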

  9. RIDDLE: reflective diffusion and local extension reveal functional associations for unannotated gene sets via proximity in a gene network

    PubMed Central

    2012-01-01

    The growing availability of large-scale functional networks has promoted the development of many successful techniques for predicting functions of genes. Here we extend these network-based principles and techniques to functionally characterize whole sets of genes. We present RIDDLE (Reflective Diffusion and Local Extension), which uses well developed guilt-by-association principles upon a human gene network to identify associations of gene sets. RIDDLE is particularly adept at characterizing sets with no annotations, a major challenge where most traditional set analyses fail. Notably, RIDDLE found microRNA-450a to be strongly implicated in ocular diseases and development. A web application is available at http://www.functionalnet.org/RIDDLE. PMID:23268829

  10. A statistical investigation into the stability of iris recognition in diverse population sets

    NASA Astrophysics Data System (ADS)

    Howard, John J.; Etter, Delores M.

    2014-05-01

    Iris recognition is increasingly being deployed on population wide scales for important applications such as border security, social service administration, criminal identification and general population management. The error rates for this incredibly accurate form of biometric identification are established using well known, laboratory quality datasets. However, it has long been acknowledged in biometric theory that not all individuals have the same likelihood of being correctly serviced by a biometric system. Typically, techniques for identifying clients that are likely to experience a false non-match or a false match error are carried out on a per-subject basis. This research makes the novel hypothesis that certain ethnical denominations are more or less likely to experience a biometric error. Through established statistical techniques, we demonstrate this hypothesis to be true and document the notable effect that the ethnicity of the client has on iris similarity scores. Understanding the expected impact of ethnical diversity on iris recognition accuracy is crucial to the future success of this technology as it is deployed in areas where the target population consists of clientele from a range of geographic backgrounds, such as border crossings and immigration check points.

  11. An abdominal aortic aneurysm segmentation method: Level set with region and statistical information

    SciTech Connect

    Zhuge Feng; Rubin, Geoffrey D.; Sun Shaohua; Napel, Sandy

    2006-05-15

    We present a system for segmenting the human aortic aneurysm in CT angiograms (CTA), which, in turn, allows measurements of volume and morphological aspects useful for treatment planning. The system estimates a rough 'initial surface', and then refines it using a level set segmentation scheme augmented with two external analyzers: The global region analyzer, which incorporates a priori knowledge of the intensity, volume, and shape of the aorta and other structures, and the local feature analyzer, which uses voxel location, intensity, and texture features to train and drive a support vector machine classifier. Each analyzer outputs a value that corresponds to the likelihood that a given voxel is part of the aneurysm, which is used during level set iteration to control the evolution of the surface. We tested our system using a database of 20 CTA scans of patients with aortic aneurysms. The mean and worst case values of volume overlap, volume error, mean distance error, and maximum distance error relative to human tracing were 95.3% ± 1.4% (s.d.); worst case = 92.9%, 3.5% ± 2.5% (s.d.); worst case = 7.0%, 0.6 ± 0.2 mm (s.d.); worst case = 1.0 mm, and 5.2 ± 2.3 mm (s.d.); worst case = 9.6 mm, respectively. When implemented on a 2.8 GHz Pentium IV personal computer, the mean time required for segmentation was 7.4 ± 3.6 min (s.d.). We also performed experiments that suggest that our method is insensitive to parameter changes within 10% of their experimentally determined values. This preliminary study proves feasibility for an accurate, precise, and robust system for segmentation of the abdominal aneurysm from CTA data, and may be of benefit to patients with aortic aneurysms.

  12. An abdominal aortic aneurysm segmentation method: level set with region and statistical information.

    PubMed

    Zhuge, Feng; Rubin, Geoffrey D; Sun, Shaohua; Napel, Sandy

    2006-05-01

    We present a system for segmenting the human aortic aneurysm in CT angiograms (CTA), which, in turn, allows measurements of volume and morphological aspects useful for treatment planning. The system estimates a rough "initial surface," and then refines it using a level set segmentation scheme augmented with two external analyzers: The global region analyzer, which incorporates a priori knowledge of the intensity, volume, and shape of the aorta and other structures, and the local feature analyzer, which uses voxel location, intensity, and texture features to train and drive a support vector machine classifier. Each analyzer outputs a value that corresponds to the likelihood that a given voxel is part of the aneurysm, which is used during level set iteration to control the evolution of the surface. We tested our system using a database of 20 CTA scans of patients with aortic aneurysms. The mean and worst case values of volume overlap, volume error, mean distance error, and maximum distance error relative to human tracing were 95.3% +/- 1.4% (s.d.); worst case = 92.9%, 3.5% +/- 2.5% (s.d.); worst case = 7.0%, 0.6 +/- 0.2 mm (s.d.); worst case = 1.0 mm, and 5.2 +/- 2.3 mm (s.d.); worst case = 9.6 mm, respectively. When implemented on a 2.8 GHz Pentium IV personal computer, the mean time required for segmentation was 7.4 +/- 3.6 min (s.d.). We also performed experiments that suggest that our method is insensitive to parameter changes within 10% of their experimentally determined values. This preliminary study proves feasibility for an accurate, precise, and robust system for segmentation of the abdominal aneurysm from CTA data, and may be of benefit to patients with aortic aneurysms.

  13. Identification and characterization of the SET domain gene family in maize.

    PubMed

    Qian, Yexiong; Xi, Yilong; Cheng, Beijiu; Zhu, Suwen; Kan, Xianzhao

    2014-03-01

    Histone lysine methylation plays a pivotal role in a variety of developmental and physiological processes through modifying chromatin structure and thereby regulating eukaryotic gene transcription. The SET domain proteins represent putative candidates for lysine methyltransferases containing the evolutionarily-conserved SET domain, and important epigenetic regulators present in eukaryotes. In recent years, increasing evidence reveals that SET domain proteins are encoded by a large multigene family in plants and investigation of the SET domain gene family will serve to elucidate the epigenetic mechanism diversity in plants. Although the SET domain gene family has been thoroughly characterized in multiple plant species including two model plant systems, Arabidopsis and rice, through their sequenced genomes, analysis of the entire SET domain gene family in maize was not completed following maize (B73) genome sequencing project. Here, we performed a genome-wide structural and evolutionary analysis of maize SET domain genes from the latest version of the maize (B73) genome. A complete set of 43 SET domain genes (Zmset1-43) were identified in the maize genome using Blast search tools and categorized into seven classes (Class I-VII) based on phylogeny. Chromosomal location of these genes revealed that they are unevenly distributed on all ten chromosomes with seven segmental duplication events, suggesting that segmental duplication played a key role in expansion of the maize SET domain gene family. EST expression data mining revealed that these newly identified genes had temporal and spatial expression pattern and suggested that many maize SET domain genes play functional developmental roles in multiple tissues. Furthermore, the transcripts of the 18 genes (the Class V subfamily) were detected in the leaves by two different abiotic stress treatments using semi-quantitative RT-PCR. The data demonstrated that these genes exhibited different expression levels in stress

  14. Comprehensive analysis of SET domain gene family in foxtail millet identifies the putative role of SiSET14 in abiotic stress tolerance

    PubMed Central

    Yadav, Chandra Bhan; Muthamilarasan, Mehanathan; Dangi, Anand; Shweta, Shweta; Prasad, Manoj

    2016-01-01

    SET domain-containing genes catalyse histone lysine methylation, which alters chromatin structure and regulates the transcription of genes that are involved in various developmental and physiological processes. The present study identified 53 SET domain-containing genes in C4 panicoid model, foxtail millet (Setaria italica) and the genes were physically mapped onto nine chromosomes. Phylogenetic and structural analyses classified SiSET proteins into five classes (I–V). RNA-seq derived expression profiling showed that SiSET genes were differentially expressed in four tissues namely, leaf, root, stem and spica. Expression analyses using qRT-PCR was performed for 21 SiSET genes under different abiotic stress and hormonal treatments, which showed differential expression of these genes during late phase of stress and hormonal treatments. Significant upregulation of SiSET gene was observed during cold stress, which has been confirmed by over-expressing a candidate gene, SiSET14 in yeast. Interestingly, hypermethylation was observed in gene body of highly differentially expressed genes, whereas methylation event was completely absent in their transcription start sites. This suggested the occurrence of demethylation events during various abiotic stresses, which enhance the gene expression. Altogether, the present study would serve as a base for further functional characterization of SiSET genes towards understanding their molecular roles in conferring stress tolerance. PMID:27585852

  15. Comprehensive analysis of SET domain gene family in foxtail millet identifies the putative role of SiSET14 in abiotic stress tolerance.

    PubMed

    Yadav, Chandra Bhan; Muthamilarasan, Mehanathan; Dangi, Anand; Shweta, Shweta; Prasad, Manoj

    2016-09-02

    SET domain-containing genes catalyse histone lysine methylation, which alters chromatin structure and regulates the transcription of genes that are involved in various developmental and physiological processes. The present study identified 53 SET domain-containing genes in C4 panicoid model, foxtail millet (Setaria italica) and the genes were physically mapped onto nine chromosomes. Phylogenetic and structural analyses classified SiSET proteins into five classes (I-V). RNA-seq derived expression profiling showed that SiSET genes were differentially expressed in four tissues namely, leaf, root, stem and spica. Expression analyses using qRT-PCR was performed for 21 SiSET genes under different abiotic stress and hormonal treatments, which showed differential expression of these genes during late phase of stress and hormonal treatments. Significant upregulation of SiSET gene was observed during cold stress, which has been confirmed by over-expressing a candidate gene, SiSET14 in yeast. Interestingly, hypermethylation was observed in gene body of highly differentially expressed genes, whereas methylation event was completely absent in their transcription start sites. This suggested the occurrence of demethylation events during various abiotic stresses, which enhance the gene expression. Altogether, the present study would serve as a base for further functional characterization of SiSET genes towards understanding their molecular roles in conferring stress tolerance.

  16. Integrating gene set analysis and nonlinear predictive modeling of disease phenotypes using a Bayesian multitask formulation.

    PubMed

    Gönen, Mehmet

    2016-12-13

    Identifying molecular signatures of disease phenotypes is studied using two mainstream approaches: (i) Predictive modeling methods such as linear classification and regression algorithms are used to find signatures predictive of phenotypes from genomic data, which may not be robust due to limited sample size or highly correlated nature of genomic data. (ii) Gene set analysis methods are used to find gene sets on which phenotypes are linearly dependent by bringing prior biological knowledge into the analysis, which may not capture more complex nonlinear dependencies. Thus, formulating an integrated model of gene set analysis and nonlinear predictive modeling is of great practical importance. In this study, we propose a Bayesian binary classification framework to integrate gene set analysis and nonlinear predictive modeling. We then generalize this formulation to multitask learning setting to model multiple related datasets conjointly. Our main novelty is the probabilistic nonlinear formulation that enables us to robustly capture nonlinear dependencies between genomic data and phenotype even with small sample sizes. We demonstrate the performance of our algorithms using repeated random subsampling validation experiments on two cancer and two tuberculosis datasets by predicting important disease phenotypes from genome-wide gene expression data. We are able to obtain comparable or even better predictive performance than a baseline Bayesian nonlinear algorithm and to identify sparse sets of relevant genes and gene sets on all datasets. We also show that our multitask learning formulation enables us to further improve the generalization performance and to better understand biological processes behind disease phenotypes.

  17. ChIP-Enrich: gene set enrichment testing for ChIP-seq data

    PubMed Central

    Welch, Ryan P.; Lee, Chee; Imbriano, Paul M.; Patil, Snehal; Weymouth, Terry E.; Smith, R. Alex; Scott, Laura J.; Sartor, Maureen A.

    2014-01-01

    Gene set enrichment testing can enhance the biological interpretation of ChIP-seq data. Here, we develop a method, ChIP-Enrich, for this analysis which empirically adjusts for gene locus length (the length of the gene body and its surrounding non-coding sequence). Adjustment for gene locus length is necessary because it is often positively associated with the presence of one or more peaks and because many biologically defined gene sets have an excess of genes with longer or shorter gene locus lengths. Unlike alternative methods, ChIP-Enrich can account for the wide range of gene locus length-to-peak presence relationships (observed in ENCODE ChIP-seq data sets). We show that ChIP-Enrich has a well-calibrated type I error rate using permuted ENCODE ChIP-seq data sets; in contrast, two commonly used gene set enrichment methods, Fisher's exact test and the binomial test implemented in Genomic Regions Enrichment of Annotations Tool (GREAT), can have highly inflated type I error rates and biases in ranking. We identify DNA-binding proteins, including CTCF, JunD and glucocorticoid receptor α (GRα), that show different enrichment patterns for peaks closer to versus further from transcription start sites. We also identify known and potential new biological functions of GRα. ChIP-Enrich is available as a web interface (http://chip-enrich.med.umich.edu) and Bioconductor package. PMID:24878920
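    For orientation, the Fisher's exact test that the abstract names as a commonly used baseline can be sketched in a few lines. This is a minimal illustration, not the ChIP-Enrich method: it applies no adjustment for gene locus length, which is the very shortcoming the abstract describes. The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(n_universe, n_set, n_hits, n_overlap):
    """One-sided Fisher's exact (hypergeometric) p-value.

    Probability of seeing at least `n_overlap` gene-set members among
    `n_hits` genes with a peak, when the gene set holds `n_set` of the
    `n_universe` genes. No locus-length adjustment is applied.
    """
    max_overlap = min(n_set, n_hits)
    total = comb(n_universe, n_hits)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_hits - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 genes, a 150-gene set, 1,000 genes with peaks,
# 20 of which fall in the set.
p_value = fisher_enrichment_p(20000, 150, 1000, 20)
```

    Because every gene is treated as equally likely to carry a peak, a set rich in long loci can look enriched under this test for reasons unrelated to biology, which is consistent with the inflated type I error rates reported above.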

  18. Modelling gene and protein regulatory networks with answer set programming.

    PubMed

    Fayruzov, Timur; Janssen, Jeroen; Vermeir, Dirk; Cornelis, Chris; De Cock, Martine

    2011-01-01

    Recently, many approaches to model regulatory networks have been proposed in the systems biology domain. However, the task is far from being solved. In this paper, we propose an Answer Set Programming (ASP)-based approach to model interaction networks. We build a general ASP framework that describes the network semantics and allows modelling specific networks with little effort. ASP provides a rich and flexible toolbox that allows expanding the framework with desired features. In this paper, we tune our framework to mimic Boolean network behaviour and apply it to model the Budding Yeast and Fission Yeast cell cycle networks. The obtained steady states of these networks correspond to those of the Boolean networks.
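    The notion of a steady state in a Boolean network can be illustrated with a brute-force search. The sketch below is plain Python, not Answer Set Programming, and the three-node network is a hypothetical toy, not the Budding Yeast or Fission Yeast cell cycle network from the paper. A state is steady when a synchronous update of every node leaves it unchanged.

```python
from itertools import product

# Hypothetical update rules: each node's next value as a function of the
# current state. These are illustrative and not taken from the paper.
RULES = {
    "A": lambda s: s["A"] and not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"] and not s["A"],
}


def steady_states(rules):
    """Return every state that a synchronous update maps to itself."""
    names = sorted(rules)
    fixed = []
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        next_state = {name: bool(rules[name](state)) for name in names}
        if next_state == state:
            fixed.append(state)
    return fixed


states = steady_states(RULES)
```

    Exhaustive enumeration visits 2^n states, so it is only practical for small networks; a declarative solver, such as the ASP framework the abstract describes, is one way to avoid writing this search by hand.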

  19. Gene-Based Analysis of Regionally Enriched Cortical Genes in GWAS Data Sets of Cognitive Traits and Psychiatric Disorders

    PubMed Central

    Ersland, Kari M.; Christoforou, Andrea; Stansberg, Christine; Espeseth, Thomas; Mattheisen, Manuel; Mattingsdal, Morten; Hardarson, Gudmundur A.; Hansen, Thomas; Fernandes, Carla P. D.; Giddaluru, Sudheer; Breuer, René; Strohmaier, Jana; Djurovic, Srdjan; Nöthen, Markus M.; Rietschel, Marcella; Lundervold, Astri J.; Werge, Thomas; Cichon, Sven; Andreassen, Ole A.; Reinvang, Ivar; Steen, Vidar M.; Le Hellard, Stephanie

    2012-01-01

    Background Despite its estimated high heritability, the genetic architecture leading to differences in cognitive performance remains poorly understood. Different cortical regions play important roles in normal cognitive functioning and impairment. Recently, we reported on sets of regionally enriched genes in three different cortical areas (frontomedial, temporal and occipital cortices) of the adult rat brain. It has been suggested that genes preferentially, or specifically, expressed in one region or organ reflect functional specialisation. Employing a gene-based approach to the analysis, we used the regionally enriched cortical genes to mine a genome-wide association study (GWAS) of the Norwegian Cognitive NeuroGenetics (NCNG) sample of healthy adults for association to nine psychometric tests measures. In addition, we explored GWAS data sets for the serious psychiatric disorders schizophrenia (SCZ) (n = 3 samples) and bipolar affective disorder (BP) (n = 3 samples), to which cognitive impairment is linked. Principal Findings At the single gene level, the temporal cortex enriched gene RAR-related orphan receptor B (RORB) showed the strongest overall association, namely to a test of verbal intelligence (Vocabulary, P = 7.7E-04). We also applied gene set enrichment analysis (GSEA) to test the candidate genes, as gene sets, for enrichment of association signal in the NCNG GWAS and in GWASs of BP and of SCZ. We found that genes differentially expressed in the temporal cortex showed a significant enrichment of association signal in a test measure of non-verbal intelligence (Reasoning) in the NCNG sample. Conclusion Our gene-based approach suggests that RORB could be involved in verbal intelligence differences, while the genes enriched in the temporal cortex might be important to intellectual functions as measured by a test of reasoning in the healthy population. These findings warrant further replication in independent samples on cognitive traits. PMID

  20. An attempt for combining microarray data sets by adjusting gene expressions.

    PubMed

    Kim, Ki-Yeol; Kim, Se Hyun; Ki, Dong Hyuk; Jeong, Jaeheon; Jeong, Ha Jin; Jeung, Hei-Cheul; Chung, Hyun Cheol; Rha, Sun Young

    2007-06-01

    The diverse experimental environments in microarray technology, such as the different platforms or different RNA sources, can cause biases in the analysis of multiple microarrays. These systematic effects present a substantial obstacle for the analysis of microarray data, and the resulting information may be inconsistent and unreliable. Therefore, we introduced a simple integration method for combining microarray data sets that are derived from different experimental conditions, and we expected that more reliable information can be detected from the combined data set rather than from the separated data sets. This method is based on the distributions of the gene expression ratios among the different microarray data sets and it transforms, gene by gene, the gene expression ratios into the form of the reference data set. The efficiency of the proposed integration method was evaluated using two microarray data sets, which were derived from different RNA sources, and a newly defined measure, the mixture score. The proposed integration method intermixed the two data sets that were obtained from different RNA sources, which in turn reduced the experimental bias between the two data sets, and the mixture score increased by 24.2%. A data set combined by the proposed method preserved the inter-group relationship of the separated data sets. The proposed method worked well in adjusting systematic biases, including the source effect. The ability to use an effectively integrated microarray data set yields more reliable results due to the larger sample size and this also decreases the chance of false negatives.

  1. Matrix formalism of excursion set theory: A new approach to statistics of dark matter halo counting

    NASA Astrophysics Data System (ADS)

    Nikakhtar, Farnik; Baghram, Shant

    2017-08-01

    Excursion set theory (EST) is an analytical framework to study the large-scale structure of the Universe. EST introduces a procedure to calculate the number density of structures by relating the cosmological linear perturbation theory to the nonlinear structures at late times. In this work, we introduce a novel approach to reformulate the EST in matrix formalism. We propose that the matrix representation of EST will facilitate the calculations in this framework. The method is to discretize the two-dimensional plane of variance and density contrast of EST, in which the trajectories for each point in the Universe live. The probability of having a density contrast at a chosen variance is represented by a probability ket. Naturally, the concept of the transition matrix arises to define the trajectories. We also define the probability transition rate, which is used to obtain the first up-crossing of trajectories and the number count of the structures. In this work we show that the discretization lets us study non-Markov processes by forcing them to look like a Wiener process. We also discuss how zero-drift processes with Gaussian and non-Gaussian initial conditions can be studied with this formalism. The continuous limit of the formalism is discussed, and the known Fokker-Planck dispersion equation is recovered. Finally we show that the probability of the most massive progenitors can be extracted in this framework.
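    The idea of a probability vector evolved by a transition matrix, with first up-crossings read off at a barrier, can be sketched on a toy grid. Everything below is an assumption made for illustration: a symmetric nearest-neighbour Markov walk in density contrast, one matrix application per variance step, an absorbing top bin standing in for the collapse barrier, and a reflecting bottom bin. It is not the matrix construction used in the paper.

```python
def first_crossing_distribution(n_states, start, n_steps):
    """Probability of first reaching the barrier at each step.

    States 0 .. n_states-1 are density-contrast bins; the top bin is an
    absorbing barrier and the bottom bin reflects. Entry k of the result
    is the probability that the first up-crossing happens at step k + 1.
    """
    barrier = n_states - 1
    # Column-stochastic matrix: T[i][j] is the probability of moving j -> i.
    T = [[0.0] * n_states for _ in range(n_states)]
    for j in range(n_states):
        if j == barrier:
            T[j][j] = 1.0          # absorbed trajectories stay absorbed
        elif j == 0:
            T[0][0] = 0.5          # reflecting edge: stay put or step up
            T[1][0] = 0.5
        else:
            T[j - 1][j] = 0.5      # step down
            T[j + 1][j] = 0.5      # step up
    p = [0.0] * n_states
    p[start] = 1.0
    crossings = []
    absorbed = 0.0
    for _ in range(n_steps):
        p = [sum(T[i][j] * p[j] for j in range(n_states)) for i in range(n_states)]
        crossings.append(p[barrier] - absorbed)   # newly absorbed mass
        absorbed = p[barrier]
    return crossings


crossings = first_crossing_distribution(n_states=3, start=1, n_steps=3)
```

    The newly absorbed mass at each step plays the role of the first up-crossing rate; in the abstract's setting, that rate is what connects the trajectories to the number count of structures.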

  2. Comparison of statistics for candidate-gene association studies using cases and parents

    SciTech Connect

    Schaid, D.J.; Sommer, S.S. )

    1994-08-01

    Studies of association between candidate genes and disease can be designed to use cases with disease, and in place of nonrelated controls, their parents. The advantage of this design is the elimination of spurious differences due to ethnic differences between cases and nonrelated controls. However, several statistical methods of analysis have been proposed in the literature, and the choice of analysis is not always clear. The authors review some of the statistical methods currently developed and present two new statistical methods aimed at specific genetic hypotheses of dominance and recessivity of the candidate gene. These new methods can be more powerful than other current methods, as demonstrated by simulations. The basis of these new statistical methods is a likelihood approach. The advantage of the likelihood framework is that regression models can be developed to assess genotype-environment interactions, as well as the relative contribution that alleles at the candidate-gene locus make to the relative risk (RR) of disease. This latter development allows testing of (1) whether interactions between alleles exist, on the scale of log RR, and (2) whether alleles originating from the mother or father of a case impart different risks, i.e., genomic imprinting. 13 refs., 2 figs., 2 tabs.

  3. An 80-gene set to predict response to preoperative chemoradiotherapy for rectal cancer by principle component analysis.

    PubMed

    Empuku, Shinichiro; Nakajima, Kentaro; Akagi, Tomonori; Kaneko, Kunihiko; Hijiya, Naoki; Etoh, Tsuyoshi; Shiraishi, Norio; Moriyama, Masatsugu; Inomata, Masafumi

    2016-05-01

    Preoperative chemoradiotherapy (CRT) for locally advanced rectal cancer not only improves the postoperative local control rate, but also induces downstaging. However, it has not been established how to individually select patients who receive effective preoperative CRT. The aim of this study was to identify a predictor of response to preoperative CRT for locally advanced rectal cancer. This study is additional to our multicenter phase II study evaluating the safety and efficacy of preoperative CRT using oral fluorouracil (UMIN ID: 03396). From April, 2009 to August, 2011, 26 biopsy specimens obtained prior to CRT were analyzed by cyclopedic microarray analysis. Response to CRT was evaluated according to a histological grading system using surgically resected specimens. To decide on the number of genes for dividing into responder and non-responder groups, we statistically analyzed the data using a dimension reduction method, a principle component analysis. Of the 26 cases, 11 were responders and 15 non-responders. No significant difference was found in clinical background data between the two groups. We determined that the optimal number of genes for the prediction of response was 80 of 40,000 and the functions of these genes were analyzed. When comparing non-responders with responders, genes expressed at a high level functioned in alternative splicing, whereas those expressed at a low level functioned in the septin complex. Thus, an 80-gene expression set that predicts response to preoperative CRT for locally advanced rectal cancer was identified using a novel statistical method.

  4. EVE (external variance estimation) increases statistical power for detecting differentially expressed genes.

    PubMed

    Wille, Anja; Gruissem, Wilhelm; Bühlmann, Peter; Hennig, Lars

    2007-11-01

    Accurately identifying differentially expressed genes from microarray data is not a trivial task, partly because of poor variance estimates of gene expression signals. Here, after analyzing 380 replicated microarray experiments, we found that probesets have typical, distinct variances that can be estimated based on a large number of microarray experiments. These probeset-specific variances depend at least in part on the function of the probed gene: genes for ribosomal or structural proteins often have a small variance, while genes implicated in stress responses often have large variances. We used these variance estimates to develop a statistical test for differentially expressed genes called EVE (external variance estimation). The EVE algorithm performs better than the t-test and LIMMA on some real-world data, where external information from appropriate databases is available. Thus, EVE helps to maximize the information gained from a typical microarray experiment. Nonetheless, only a large number of replicates will guarantee to identify nearly all truly differentially expressed genes. However, our simulation studies suggest that even limited numbers of replicates will usually result in good coverage of strongly differentially expressed genes.

  5. Chiropteran types I and II interferon genes inferred from genome sequencing traces by a statistical gene-family assembler

    PubMed Central

    2010-01-01

    Background The rate of emergence of human pathogens is steadily increasing; most of these novel agents originate in wildlife. Bats, remarkably, are the natural reservoirs of many of the most pathogenic viruses in humans. There are two bat genome projects currently underway, a circumstance that promises to speed the discovery of host factors important in the coevolution of bats with their viruses. These genomes, however, are not yet assembled and one of them will provide only low coverage, making the inference of most genes of immunological interest error-prone. Many more wildlife genome projects are underway and intend to provide only shallow coverage. Results We have developed a statistical method for the assembly of gene families from partial genomes. The method takes full advantage of the quality scores generated by base-calling software, incorporating them into a complete probabilistic error model, to overcome the limitation inherent in the inference of gene family members from partial sequence information. We validated the method by inferring the human IFNA genes from the genome trace archives, and used it to infer 61 type-I interferon genes, and single type-II interferon genes in the bats Pteropus vampyrus and Myotis lucifugus. We confirmed our inferences by direct cloning and sequencing of IFNA, IFNB, IFND, and IFNK in P. vampyrus, and by demonstrating transcription of some of the inferred genes by known interferon-inducing stimuli. Conclusion The statistical trace assembler described here provides a reliable method for extracting information from the many available and forthcoming partial or shallow genome sequencing projects, thereby facilitating the study of a wider variety of organisms with ecological and biomedical significance to humans than would otherwise be possible. PMID:20663124

  6. Chiropteran types I and II interferon genes inferred from genome sequencing traces by a statistical gene-family assembler.

    PubMed

    Kepler, Thomas B; Sample, Christopher; Hudak, Kathryn; Roach, Jeffrey; Haines, Albert; Walsh, Allyson; Ramsburg, Elizabeth A

    2010-07-21

    The rate of emergence of human pathogens is steadily increasing; most of these novel agents originate in wildlife. Bats, remarkably, are the natural reservoirs of many of the most pathogenic viruses in humans. There are two bat genome projects currently underway, a circumstance that promises to speed the discovery of host factors important in the coevolution of bats with their viruses. These genomes, however, are not yet assembled and one of them will provide only low coverage, making the inference of most genes of immunological interest error-prone. Many more wildlife genome projects are underway and intend to provide only shallow coverage. We have developed a statistical method for the assembly of gene families from partial genomes. The method takes full advantage of the quality scores generated by base-calling software, incorporating them into a complete probabilistic error model, to overcome the limitation inherent in the inference of gene family members from partial sequence information. We validated the method by inferring the human IFNA genes from the genome trace archives, and used it to infer 61 type-I interferon genes, and single type-II interferon genes in the bats Pteropus vampyrus and Myotis lucifugus. We confirmed our inferences by direct cloning and sequencing of IFNA, IFNB, IFND, and IFNK in P. vampyrus, and by demonstrating transcription of some of the inferred genes by known interferon-inducing stimuli. The statistical trace assembler described here provides a reliable method for extracting information from the many available and forthcoming partial or shallow genome sequencing projects, thereby facilitating the study of a wider variety of organisms with ecological and biomedical significance to humans than would otherwise be possible.

  7. Fold change rank ordering statistics: a new method for detecting differentially expressed genes

    PubMed Central

    2014-01-01

    Background Different methods have been proposed for analyzing differentially expressed (DE) genes in microarray data. Methods based on statistical tests that incorporate expression level variability are used more commonly than those based on fold change (FC). However, FC based results are more reproducible and biologically relevant. Results We propose a new method based on fold change rank ordering statistics (FCROS). We exploit the variation in calculated FC levels using combinatorial pairs of biological conditions in the datasets. A statistic is associated with the ranks of the FC values for each gene, and the resulting probability is used to identify the DE genes within an error level. The FCROS method is deterministic, requires a low computational runtime and also solves the problem of multiple tests which usually arises with microarray datasets. Conclusion We compared the performance of FCROS with those of other methods using synthetic and real microarray datasets. We found that FCROS is well suited for DE gene identification from noisy datasets when compared with existing FC based methods. PMID:24423217
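    The core idea of ranking fold changes over all pairings of samples can be sketched as follows. This is a simplified illustration under stated assumptions, not the published FCROS procedure: it pairs every control sample with every treated sample, ranks genes by fold change within each pairing, and averages the normalised ranks, but it omits the statistic and probability that the abstract says are derived from those ranks. The function name and the toy data are hypothetical.

```python
from itertools import product


def mean_fold_change_ranks(control, treated):
    """Average normalised fold-change rank per gene over all sample pairings.

    `control` and `treated` map gene -> list of positive expression values.
    Values near 1 mark genes that are consistently up in treated samples,
    values near 0 mark genes that are consistently down.
    """
    genes = sorted(control)
    n_control = len(control[genes[0]])
    n_treated = len(treated[genes[0]])
    pairings = list(product(range(n_control), range(n_treated)))
    totals = {gene: 0.0 for gene in genes}
    for i, j in pairings:
        fold_change = {g: treated[g][j] / control[g][i] for g in genes}
        ordered = sorted(genes, key=lambda g: fold_change[g])
        for rank, gene in enumerate(ordered, start=1):
            totals[gene] += rank / len(genes)
    return {gene: totals[gene] / len(pairings) for gene in genes}


# Toy data, invented for illustration only.
control = {"g1": [1.0, 1.0], "g2": [1.0, 1.0], "g3": [1.0, 1.0]}
treated = {"g1": [4.0, 5.0], "g2": [1.0, 1.1], "g3": [0.2, 0.3]}
scores = mean_fold_change_ranks(control, treated)
```

    Working with ranks rather than raw fold changes makes the score insensitive to the scale of any single pairing, and the procedure contains no random step, which fits the abstract's description of the method as deterministic.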

  8. Application of biclustering of gene expression data and gene set enrichment analysis methods to identify potentially disease causing nanomaterials.

    PubMed

    Williams, Andrew; Halappanavar, Sabina

    2015-01-01

    The presence of diverse types of nanomaterials (NMs) in commerce is growing at an exponential pace. As a result, human exposure to these materials in the environment is inevitable, necessitating the need for rapid and reliable toxicity testing methods to accurately assess the potential hazards associated with NMs. In this study, we applied biclustering and gene set enrichment analysis methods to derive essential features of altered lung transcriptome following exposure to NMs that are associated with lung-specific diseases. Several datasets from public microarray repositories describing pulmonary diseases in mouse models following exposure to a variety of substances were examined and functionally related biclusters of genes showing similar expression profiles were identified. The identified biclusters were then used to conduct a gene set enrichment analysis on pulmonary gene expression profiles derived from mice exposed to nano-titanium dioxide (nano-TiO2), carbon black (CB) or carbon nanotubes (CNTs) to determine the disease significance of these data-driven gene sets. Biclusters representing inflammation (chemokine activity), DNA binding, cell cycle, apoptosis, reactive oxygen species (ROS) and fibrosis processes were identified. All of the NM studies were significant with respect to the bicluster related to chemokine activity (DAVID; FDR p-value = 0.032). The bicluster related to pulmonary fibrosis was enriched in studies where toxicity induced by CNT and CB studies was investigated, suggesting the potential for these materials to induce lung fibrosis. The pro-fibrogenic potential of CNTs is well established. Although CB has not been shown to induce fibrosis, it induces stronger inflammatory, oxidative stress and DNA damage responses than nano-TiO2 particles. The results of the analysis correctly identified all NMs to be inflammogenic and only CB and CNTs as potentially fibrogenic. In addition to identifying several previously defined, functionally relevant gene

  9. CAsubtype: An R Package to Identify Gene Sets Predictive of Cancer Subtypes and Clinical Outcomes.

    PubMed

    Kong, Hualei; Tong, Pan; Zhao, Xiaodong; Sun, Jielin; Li, Hua

    2017-01-21

    In the past decade, molecular classification of cancer has gained high popularity owing to its high predictive power on clinical outcomes as compared with traditional methods commonly used in clinical practice. In particular, using gene expression profiles, recent studies have successfully identified a number of gene sets for the delineation of cancer subtypes that are associated with distinct prognosis. However, identification of such gene sets remains a laborious task due to the lack of tools with flexibility, integration and ease of use. To reduce the burden, we have developed an R package, CAsubtype, to efficiently identify gene sets predictive of cancer subtypes and clinical outcomes. By integrating more than 13,000 annotated gene sets, CAsubtype provides a comprehensive repertoire of candidates for new cancer subtype identification. For easy data access, CAsubtype further includes the gene expression and clinical data of more than 2000 cancer patients from TCGA. CAsubtype first employs principal component analysis to identify gene sets (from user-provided or package-integrated ones) with robust principal components representing significantly large variation between cancer samples. Based on these principal components, CAsubtype visualizes the sample distribution in low-dimensional space for better understanding of the distinction between samples and classifies samples into subgroups with prevalent clustering algorithms. Finally, CAsubtype performs survival analysis to compare the clinical outcomes between the identified subgroups, assessing their clinical value as potentially novel cancer subtypes. In conclusion, CAsubtype is a flexible and well-integrated tool in the R environment to identify gene sets for cancer subtype identification and clinical outcome prediction. Its simple R commands and comprehensive data sets enable efficient examination of the clinical value of any given gene set, thus facilitating hypothesis generating and testing in biological and

  10. Review on statistical methods for gene network reconstruction using expression data.

    PubMed

    Wang, Y X Rachel; Huang, Haiyan

    2014-12-07

    Network modeling has proven to be a fundamental tool in analyzing the inner workings of a cell. It has revolutionized our understanding of biological processes and made significant contributions to the discovery of disease biomarkers. Much effort has been devoted to reconstruct various types of biochemical networks using functional genomic datasets generated by high-throughput technologies. This paper discusses statistical methods used to reconstruct gene regulatory networks using gene expression data. In particular, we highlight progress made and challenges yet to be met in the problems involved in estimating gene interactions, inferring causality and modeling temporal changes of regulation behaviors. As rapid advances in technologies have made available diverse, large-scale genomic data, we also survey methods of incorporating all these additional data to achieve better, more accurate inference of gene networks. Copyright © 2014 Elsevier Ltd. All rights reserved.

  11. Statistical analysis of nucleotide sequences of the hemagglutinin gene of human influenza A viruses.

    PubMed Central

    Ina, Y; Gojobori, T

    1994-01-01

    To examine whether positive selection operates on the hemagglutinin 1 (HA1) gene of human influenza A viruses (H1 subtype), 21 nucleotide sequences of the HA1 gene were statistically analyzed. The nucleotide sequences were divided into antigenic and nonantigenic sites. The nucleotide diversities for antigenic and nonantigenic sites of the HA1 gene were computed at synonymous and nonsynonymous sites separately. For nonantigenic sites, the nucleotide diversities were larger at synonymous sites than at nonsynonymous sites. This is consistent with the neutral theory of molecular evolution. For antigenic sites, however, the nucleotide diversities at nonsynonymous sites were larger than those at synonymous sites. These results suggest that positive selection operates on antigenic sites of the HA1 gene of human influenza A viruses (H1 subtype). PMID:8078892
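
    The comparison above rests on nucleotide diversity, the average number of differences per site between pairs of sequences. The sketch below computes that quantity over a chosen subset of alignment columns. It is a simplification under stated assumptions: the function name and the toy sequences are hypothetical, and classifying sites as synonymous or nonsynonymous (which the study does before comparing diversities) is left to the caller.

```python
from itertools import combinations

def nucleotide_diversity(sequences, sites=None):
    """Average pairwise differences per site over the given columns.

    sequences: aligned sequences of equal length (at least two).
    sites: column indices to include; all columns when None.
    """
    cols = list(range(len(sequences[0]))) if sites is None else list(sites)
    pairs = list(combinations(sequences, 2))
    differences = sum(
        sum(1 for j in cols if s[j] != t[j]) for s, t in pairs
    )
    return differences / (len(pairs) * len(cols))

# Hypothetical aligned fragments, not real HA1 data.
alignment = ["ACGT", "ACGA", "TCGT"]
print(nucleotide_diversity(alignment))
print(nucleotide_diversity(alignment, sites=[1, 2]))
```

    Calling the function once with antigenic columns and once with nonantigenic columns gives the two diversities that the abstract compares.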

  12. Characterizing gene sets using discriminative random walks with restart on heterogeneous biological networks

    PubMed Central

    Blatti, Charles; Sinha, Saurabh

    2016-01-01

    Motivation: Analysis of co-expressed gene sets typically involves testing for enrichment of different annotations or ‘properties’ such as biological processes, pathways, transcription factor binding sites, etc., one property at a time. This common approach ignores any known relationships among the properties or the genes themselves. It is believed that known biological relationships among genes and their many properties may be exploited to more accurately reveal commonalities of a gene set. Previous work has sought to achieve this by building biological networks that combine multiple types of gene–gene or gene–property relationships, and performing network analysis to identify other genes and properties most relevant to a given gene set. Most existing network-based approaches for recognizing genes or annotations relevant to a given gene set collapse information about different properties to simplify (homogenize) the networks. Results: We present a network-based method for ranking genes or properties related to a given gene set. Such related genes or properties are identified from among the nodes of a large, heterogeneous network of biological information. Our method involves a random walk with restarts, performed on an initial network with multiple node and edge types that preserve more of the original, specific property information than current methods that operate on homogeneous networks. In this first stage of our algorithm, we find the properties that are the most relevant to the given gene set and extract a subnetwork of the original network, comprising only these relevant properties. We then re-rank genes by their similarity to the given gene set, based on a second random walk with restarts, performed on the above subnetwork. We demonstrate the effectiveness of this algorithm for ranking genes related to Drosophila embryonic development and aggressive responses in the brains of social animals. Availability and Implementation: DRaWR was implemented as
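
    The random walk with restarts that the method above builds on can be sketched in a few lines of Python. This is only the basic single-stage walk on a small homogeneous toy graph, not the two-stage DRaWR procedure on a heterogeneous network; the function name, the restart probability of 0.3, and the graph itself are illustrative assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk that restarts at the seeds.

    adjacency: dict mapping node -> dict of neighbor -> edge weight;
               every neighbor must also appear as a key.
    seeds: set of nodes forming the query gene set.
    """
    nodes = sorted(adjacency)
    # Restart distribution: uniform over the query set.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            out = sum(adjacency[u].values())
            if out == 0:
                # Dangling node: send its mass back to the seeds.
                for n in nodes:
                    nxt[n] += (1 - restart) * p[u] * p0[n]
                continue
            for v, w in adjacency[u].items():
                nxt[v] += (1 - restart) * p[u] * w / out
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy path graph a - b - c - d with the walk restarting at "a".
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
ranking = random_walk_with_restart(graph, {"a"})
print(sorted(ranking, key=ranking.get, reverse=True))
```

    Nodes are ranked by their stationary probability, so nodes close to the query set in the network score highest.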

  13. Prioritization of candidate disease genes by enlarging the seed set and fusing information of the network topology and gene expression.

    PubMed

    Zhang, Shao-Wu; Shao, Dong-Dong; Zhang, Song-Yao; Wang, Yi-Bin

    2014-06-01

    The identification of disease genes is very important not only to provide greater understanding of gene function and cellular mechanisms which drive human disease, but also to enhance human disease diagnosis and treatment. Recently, high-throughput techniques have been applied to detect dozens or even hundreds of candidate genes. However, experimental approaches to validate the many candidates are usually time-consuming, tedious and expensive, and sometimes lack reproducibility. Therefore, numerous theoretical and computational methods (e.g. network-based approaches) have been developed to prioritize candidate disease genes. Many network-based approaches implicitly utilize the observation that genes causing the same or similar diseases tend to correlate with each other in gene-protein relationship networks. Of these network approaches, the random walk with restart algorithm (RWR) is considered to be a state-of-the-art approach. To further improve the performance of RWR, we propose a novel method named ESFSC to identify disease-related genes, by enlarging the seed set according to the centrality of disease genes in a network and fusing information of the protein-protein interaction (PPI) network topological similarity and the gene expression correlation. The ESFSC algorithm restarts at all of the nodes in the seed set consisting of the known disease genes and their k-nearest neighbor nodes, then walks in the global network separately guided by the similarity transition matrix constructed with PPI network topological similarity properties and the correlational transition matrix constructed with the gene expression profiles. As a result, all the genes in the network are ranked by weighted fusing the above results of the RWR guided by two types of transition matrices. Comprehensive simulation results of the 10 diseases with 97 known disease genes collected from the Online Mendelian Inheritance in Man (OMIM) database show that ESFSC outperforms existing methods for

  14. GeneAnalytics: An Integrative Gene Set Analysis Tool for Next Generation Sequencing, RNAseq and Microarray Data.

    PubMed

    Ben-Ari Fuchs, Shani; Lieder, Iris; Stelzer, Gil; Mazor, Yaron; Buzhor, Ella; Kaplan, Sergey; Bogoch, Yoel; Plaschkes, Inbar; Shitrit, Alina; Rappaport, Noa; Kohn, Asher; Edgar, Ron; Shenhav, Liraz; Safran, Marilyn; Lancet, Doron; Guan-Golan, Yaron; Warshawsky, David; Shtrichman, Ronit

    2016-03-01

    Postgenomics data are produced in large volumes by life sciences and clinical applications of novel omics diagnostics and therapeutics for precision medicine. To move from "data-to-knowledge-to-innovation," a crucial missing step in the current era is, however, our limited understanding of biological and clinical contexts associated with data. Prominent among the emerging remedies to this challenge are the gene set enrichment tools. This study reports on GeneAnalytics™ (geneanalytics.genecards.org), a comprehensive and easy-to-apply gene set analysis tool for rapid contextualization of expression patterns and functional signatures embedded in the postgenomics Big Data domains, such as Next Generation Sequencing (NGS), RNAseq, and microarray experiments. GeneAnalytics' differentiating features include in-depth evidence-based scoring algorithms, an intuitive user interface and proprietary unified data. GeneAnalytics employs the LifeMap Science's GeneCards suite, including the GeneCards®--the human gene database; the MalaCards--the human diseases database; and the PathCards--the biological pathways database. Expression-based analysis in GeneAnalytics relies on the LifeMap Discovery®--the embryonic development and stem cells database, which includes manually curated expression data for normal and diseased tissues, enabling advanced matching algorithm for gene-tissue association. This assists in evaluating differentiation protocols and discovering biomarkers for tissues and cells. Results are directly linked to gene, disease, or cell "cards" in the GeneCards suite. Future developments aim to enhance the GeneAnalytics algorithm as well as visualizations, employing varied graphical display items. Such attributes make GeneAnalytics a broadly applicable postgenomics data analyses and interpretation tool for translation of data to knowledge-based innovation in various Big Data fields such as precision medicine, ecogenomics, nutrigenomics, pharmacogenomics, vaccinomics

  15. Importance of collection in gene set enrichment analysis of drug response in cancer cell lines

    PubMed Central

    Bateman, Alain R.; El-Hachem, Nehme; Beck, Andrew H.; Aerts, Hugo J. W. L.; Haibe-Kains, Benjamin

    2014-01-01

    Gene set enrichment analysis (GSEA) associates gene sets and phenotypes; its use is predicated on the choice of a pre-defined collection of sets. The de facto standard implementation of GSEA provides seven collections, yet there are no guidelines for the choice of collections, and the impact of such choice, if any, is unknown. Here we compare each of the standard gene set collections in the context of a large dataset of drug response in human cancer cell lines. We define and test a new collection based on gene co-expression in cancer cell lines to compare the performance of the standard collections to an externally derived cell line based collection. The results show that GSEA findings vary significantly depending on the collection chosen for analysis. Henceforth, collections should be carefully selected and reported in studies that leverage GSEA. PMID:24522610

  16. Detecting Gene-Gene Interactions Associated with Multiple Complex Traits with U-Statistics.

    PubMed

    Li, Ming; Wei, Changshuai; Wen, Yalu; Wang, Tong; Lu, Qing

    2016-10-01

    Many complex diseases, such as psychiatric and behavioral disorders, are commonly characterized through various measurements that reflect physical, behavioral and psychological aspects of diseases. While it remains a great challenge to find a unified measurement to characterize a disease, the available multiple phenotypes can be analyzed jointly in the genetic association study. Simultaneously testing these phenotypes has many advantages, including considering different aspects of the disease in the analysis, and utilizing correlated phenotypes to improve the power of detecting disease-associated variants. Furthermore, complex diseases are likely caused by the interplay of multiple genetic variants through complicated mechanisms. Considering gene-gene interactions in the joint association analysis of complex diseases could further increase our ability to discover genetic variants involving complex disease pathways. In this article, we propose a stepwise U-test for joint association analysis of multiple loci and multiple phenotypes. Through simulations, we demonstrated that testing multiple phenotypes simultaneously could attain higher power than testing one single phenotype at a time, especially when there are shared genes contributing to multiple phenotypes. We also illustrated the proposed method with an application to Nicotine Dependence (ND), using datasets from the Study of Addiction: Genetics and Environment (SAGE). The joint analysis of three ND phenotypes identified two SNPs, rs10508649 and rs2491397, and reached a nominal P-value of 3.79e-13. The association was further replicated in two independent datasets with P-values of 2.37e-05 and 7.46e-05.

  17. A Survey of Statistical Models for Reverse Engineering Gene Regulatory Networks

    PubMed Central

    Huang, Yufei; Tienda-Luna, Isabel M.; Wang, Yufeng

    2009-01-01

    Statistical models for reverse engineering gene regulatory networks are surveyed in this article. To provide readers with a system-level view of the modeling issues in this research, a graphical modeling framework is proposed. This framework serves as the scaffolding on which the review of different models can be systematically assembled. Based on the framework, we review many existing models for many aspects of gene regulation; the pros and cons of each model are discussed. In addition, network inference algorithms are also surveyed under the graphical modeling framework by the categories of point solutions and probabilistic solutions and the connections and differences among the algorithms are provided. This survey has the potential to elucidate the development and future of reverse engineering GRNs and bring statistical signal processing closer to the core of this research. PMID:20046885

  18. Inferring pathway crosstalk networks using gene set co-expression signatures.

    PubMed

    Wang, Ting; Gu, Jin; Yuan, Jun; Tao, Ran; Li, Yanda; Li, Shao

    2013-07-01

    Constructing molecular interaction networks in cells is important for understanding the underlying mechanisms of biological processes. Except for single gene analysis, several gene set-based methods have been proposed to infer pathway crosstalk by analyzing large-scale gene expression data. But most of them take all pathway genes as a whole to infer the crosstalk. Biological evidence suggests that the pathway crosstalk usually occurs between some subsets rather than the whole sets of pathway genes. In this study, we propose a novel method, sGSCA (signature-based gene set co-expression analysis) which can use the co-expression correlations between subsets of pathway genes to infer the pathway crosstalk networks. The method applies sparse canonical correlation analysis (sCCA) to measure the pathway level co-expression and simultaneously obtain the subsets or signature genes that contribute to the co-expression of pathways. On simulated datasets, sGSCA can efficiently detect pathway crosstalk and the corresponding highly correlated signature genes. We applied sGSCA to two cancer gene expression datasets (one for hepatocellular cancer and the other for lung cancer). In the inferred networks, we found several important pathway crosstalks related to the cancers. The identified signature genes also show high enrichment for the cancer related genes. sGSCA can infer pathway crosstalk networks using large-scale gene expression data, and should be a useful tool for systematically studying the molecular mechanisms of complex diseases on both pathway and gene levels at the same time.

  19. PlantGSEA: a gene set enrichment analysis toolkit for plant community.

    PubMed

    Yi, Xin; Du, Zhou; Su, Zhen

    2013-07-01

    Gene Set Enrichment Analysis (GSEA) is a powerful method for interpreting the biological meaning of a list of genes by computing the overlaps with various previously defined gene sets. As one of the most widely used annotations for defining gene sets, the Gene Ontology (GO) system has been used in many enrichment analysis tools. EasyGO and agriGO, two GO enrichment analysis toolkits developed by our laboratory, have gained extensive usage and citations since their releases because of their effective performance and consistent maintenance. Responding to the increasing demands of more comprehensive analysis from the users, we developed a web server as an important component of our bioinformatics analysis toolkit, named PlantGSEA, which is based on the GSEA method and mainly focuses on plant organisms. In PlantGSEA, 20 290 defined gene sets deriving from different resources were collected and used for GSEA analysis. PlantGSEA currently supports gene locus IDs and Affymetrix microarray probe set IDs from four plant model species (Arabidopsis thaliana, Oryza sativa, Zea mays and Gossypium raimondii). PlantGSEA is an efficient and user-friendly web server, and it is now publicly accessible at http://structuralbiology.cau.edu.cn/PlantGSEA.
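
    One common way to score the overlap between a gene list and a predefined gene set is a one-sided hypergeometric test. The sketch below shows that calculation in Python. It is an assumption that this test is a reasonable illustration here; the abstract does not state which statistic PlantGSEA uses, and the function name and numbers are hypothetical.

```python
from math import comb

def overlap_p_value(universe_size, set_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from a universe containing set_size genes of the predefined set."""
    upper = min(set_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 10 genes in total, a 3-gene set, a 3-gene query list.
print(overlap_p_value(10, 3, 3, 3))
```

    A small p-value means the observed overlap is larger than expected by chance. When many gene sets are tested, the p-values also need a multiple-testing correction.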

  20. Gene integrated set profile analysis: a context-based approach for inferring biological endpoints.

    PubMed

    Kowalski, Jeanne; Dwivedi, Bhakti; Newman, Scott; Switchenko, Jeffery M; Pauly, Rini; Gutman, David A; Arora, Jyoti; Gandhi, Khanjan; Ainslie, Kylie; Doho, Gregory; Qin, Zhaohui; Moreno, Carlos S; Rossi, Michael R; Vertino, Paula M; Lonial, Sagar; Bernal-Mizrachi, Leon; Boise, Lawrence H

    2016-04-20

    The identification of genes with specific patterns of change (e.g. down-regulated and methylated) as phenotype drivers or samples with similar profiles for a given gene set as drivers of clinical outcome, requires the integration of several genomic data types for which an 'integrate by intersection' (IBI) approach is often applied. In this approach, results from separate analyses of each data type are intersected, which has the limitation of a smaller intersection with more data types. We introduce a new method, GISPA (Gene Integrated Set Profile Analysis) for integrated genomic analysis and its variation, SISPA (Sample Integrated Set Profile Analysis) for defining respective genes and samples with the context of similar, a priori specified molecular profiles. With GISPA, the user defines a molecular profile that is compared among several classes and obtains ranked gene sets that satisfy the profile as drivers of each class. With SISPA, the user defines a gene set that satisfies a profile and obtains sample groups of profile activity. Our results from applying GISPA to human multiple myeloma (MM) cell lines contained genes of known profiles and importance, along with several novel targets, and their further SISPA application to MM coMMpass trial data showed clinical relevance. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  1. Gene expression profiling of peripheral blood mononuclear cells in the setting of peripheral arterial disease

    PubMed Central

    2012-01-01

    Background: Peripheral arterial disease (PAD) is a relatively common manifestation of systemic atherosclerosis that leads to progressive narrowing of the lumen of leg arteries. Circulating monocytes are in contact with the arterial wall and can serve as reporters of vascular pathology in the setting of PAD. We performed gene expression analysis of peripheral blood mononuclear cells (PBMC) in patients with PAD and controls without PAD to identify differentially regulated genes. Methods: PAD was defined as an ankle brachial index (ABI) ≤0.9 (n = 19) while age and gender matched controls had an ABI > 1.0 (n = 18). Microarray analysis was performed using Affymetrix HG-U133 plus 2.0 gene chips and analyzed using GeneSpring GX 11.0. Gene expression data were normalized using the Robust Multichip Analysis (RMA) method; differential expression was defined as a fold change ≥1.5, followed by an unpaired Mann-Whitney test (P < 0.05) and correction for multiple testing by the Benjamini and Hochberg false discovery rate. Meta-analysis of differentially expressed genes was performed using an integrated bioinformatics pipeline with tools for enrichment analysis using Gene Ontology (GO) terms, pathway analysis using Kyoto Encyclopedia of Genes and Genomes (KEGG), molecular event enrichment using Reactome annotations and network analysis using Ingenuity Pathway Analysis suite. Extensive biocuration was also performed to understand the functional context of genes. Results: We identified 87 genes differentially expressed in the setting of PAD; 40 genes were upregulated and 47 genes were downregulated. We employed an integrated bioinformatics pipeline coupled with literature curation to characterize the functional coherence of differentially regulated genes. Conclusion: Notably, upregulated genes mediate immune response, inflammation, apoptosis, stress response, phosphorylation, hemostasis, platelet activation and platelet aggregation. Downregulated genes included several genes from
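
    The Benjamini and Hochberg correction named in the Methods can be written out in a few lines. The sketch below is a generic Python version of the step-up adjustment, not the GeneSpring implementation used in the study; the function name and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene p-values from a two-group test.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

    Genes whose adjusted value falls below the chosen false discovery rate, and which also pass the fold-change filter, would be reported as differentially expressed.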

  2. GeneAnalytics: An Integrative Gene Set Analysis Tool for Next Generation Sequencing, RNAseq and Microarray Data

    PubMed Central

    Ben-Ari Fuchs, Shani; Lieder, Iris; Mazor, Yaron; Buzhor, Ella; Kaplan, Sergey; Bogoch, Yoel; Plaschkes, Inbar; Shitrit, Alina; Rappaport, Noa; Kohn, Asher; Edgar, Ron; Shenhav, Liraz; Safran, Marilyn; Lancet, Doron; Guan-Golan, Yaron; Warshawsky, David; Shtrichman, Ronit

    2016-01-01

    Abstract Postgenomics data are produced in large volumes by life sciences and clinical applications of novel omics diagnostics and therapeutics for precision medicine. To move from “data-to-knowledge-to-innovation,” a crucial missing step in the current era is, however, our limited understanding of biological and clinical contexts associated with data. Prominent among the emerging remedies to this challenge are the gene set enrichment tools. This study reports on GeneAnalytics™ (geneanalytics.genecards.org), a comprehensive and easy-to-apply gene set analysis tool for rapid contextualization of expression patterns and functional signatures embedded in the postgenomics Big Data domains, such as Next Generation Sequencing (NGS), RNAseq, and microarray experiments. GeneAnalytics' differentiating features include in-depth evidence-based scoring algorithms, an intuitive user interface and proprietary unified data. GeneAnalytics employs the LifeMap Science's GeneCards suite, including the GeneCards®—the human gene database; the MalaCards—the human diseases database; and the PathCards—the biological pathways database. Expression-based analysis in GeneAnalytics relies on the LifeMap Discovery®—the embryonic development and stem cells database, which includes manually curated expression data for normal and diseased tissues, enabling advanced matching algorithm for gene–tissue association. This assists in evaluating differentiation protocols and discovering biomarkers for tissues and cells. Results are directly linked to gene, disease, or cell “cards” in the GeneCards suite. Future developments aim to enhance the GeneAnalytics algorithm as well as visualizations, employing varied graphical display items. Such attributes make GeneAnalytics a broadly applicable postgenomics data analyses and interpretation tool for translation of data to knowledge-based innovation in various Big Data fields such as precision medicine, ecogenomics, nutrigenomics

  3. Identifying stably expressed genes from multiple RNA-Seq data sets

    PubMed Central

    Emerson, Sarah; Chang, Jeff H.; Di, Yanming

    2016-01-01

    We examined RNA-Seq data on 211 biological samples from 24 different Arabidopsis experiments carried out by different labs. We grouped the samples according to tissue types, and in each of the groups, we identified genes that are stably expressed across biological samples, treatment conditions, and experiments. We fit a Poisson log-linear mixed-effect model to the read counts for each gene and decomposed the total variance into between-sample, between-treatment and between-experiment variance components. Identifying stably expressed genes is useful for count normalization and differential expression analysis. The variance component analysis that we explore here is a first step towards understanding the sources and nature of the RNA-Seq count variation. When using a numerical measure to identify stably expressed genes, the outcome depends on multiple factors: the background sample set and the reference gene set used for count normalization, the technology used for measuring gene expression, and the specific numerical stability measure used. Since differential expression (DE) is measured by relative frequencies, we argue that DE is a relative concept. We advocate using an explicit reference gene set for count normalization to improve interpretability of DE results, and recommend using a common reference gene set when analyzing multiple RNA-Seq experiments to avoid potential inconsistent conclusions. PMID:28028467

  4. Expansion and diversification of the SET domain gene family following whole-genome duplications in Populus trichocarpa.

    PubMed

    Lei, Li; Zhou, Shi-Liang; Ma, Hong; Zhang, Liang-Sheng

    2012-04-12

    Histone lysine methylation modifies chromatin structure and regulates eukaryotic gene transcription and a variety of developmental and physiological processes. SET domain proteins are lysine methyltransferases containing the evolutionarily-conserved SET domain, which is known to be the catalytic domain. We identified 59 SET genes in the Populus genome. Phylogenetic analyses of 106 SET genes from Populus and Arabidopsis supported the clustering of SET genes into six distinct subfamilies and identified 19 duplicated gene pairs in Populus. The chromosome locations of these gene pairs and the distribution of synonymous substitution rates showed that the expansion of the SET gene family might be caused by large-scale duplications in Populus. Comparison of gene structures and domain architectures of each duplicate pair indicated that divergence took place at the 3'- and 5'-terminal transcribed regions and at the N- and C-termini of the predicted proteins, respectively. Expression profile analysis of Populus SET genes suggested that most Populus SET genes were expressed widely, many with the highest expression in young leaves. In particular, the expression profiles of 12 of the 19 duplicated gene pairs fell into two types of expression patterns. The 19 duplicated SET genes could have originated from whole genome duplication events. The differences in SET gene structure, domain architecture, and expression profiles in various tissues of Populus suggest that members of the SET gene family have a variety of developmental and physiological functions. Our study provides clues about the evolution of epigenetic regulation of chromatin structure and gene expression.

  5. Consistently altered expression of gene sets in postmortem brains of individuals with major psychiatric disorders

    PubMed Central

    Darby, M M; Yolken, R H; Sabunciyan, S

    2016-01-01

    The measurement of gene expression in postmortem brain is an important tool for understanding the pathogenesis of serious psychiatric disorders. We hypothesized that major molecular deficits associated with psychiatric disease would affect the entire brain, and such deficits may be shared across disorders. We performed RNA sequencing and quantified gene expression in the hippocampus of 100 brains in the Stanley Array Collection followed by replication in the orbitofrontal cortex of 57 brains in the Stanley Neuropathology Consortium. We then identified genes and canonical pathway gene sets with significantly altered expression in schizophrenia and bipolar disorder in the hippocampus and in schizophrenia, bipolar disorder and major depression in the orbitofrontal cortex. Although expression of individual genes varied, gene sets were significantly enriched in both of the brain regions, and many of these were consistent across diagnostic groups. Further examination of core gene sets with consistently increased or decreased expression in both of the brain regions and across target disorders revealed that ribosomal genes are overexpressed while genes involved in neuronal processes, GABAergic signaling, endocytosis and antigen processing have predominantly decreased expression in affected individuals compared to controls without a psychiatric disorder. Our results highlight pathways of central importance to psychiatric health and emphasize messenger RNA processing and protein synthesis as potential therapeutic targets for all three of the disorders. PMID:27622934

  6. Confero: an integrated contrast data and gene set platform for computational analysis and biological interpretation of omics data

    PubMed Central

    2013-01-01

    Background: High-throughput omics technologies such as microarrays and next-generation sequencing (NGS) have become indispensable tools in biological research. Computational analysis and biological interpretation of omics data can pose significant challenges due to a number of factors, in particular the systems integration required to fully exploit and compare data from different studies and/or technology platforms. In transcriptomics, the identification of differentially expressed genes when studying effect(s) or contrast(s) of interest constitutes the starting point for further downstream computational analysis (e.g. gene over-representation/enrichment analysis, reverse engineering) leading to mechanistic insights. Therefore, it is important to systematically store the full list of genes with their associated statistical analysis results (differential expression, t-statistics, p-value) corresponding to one or more effect(s) or contrast(s) of interest (shortly termed “contrast data”) in a comparable manner and extract gene sets in order to efficiently support downstream analyses and further leverage data on a long-term basis. Filling this gap would open new research perspectives for biologists to discover disease-related biomarkers and to support the understanding of molecular mechanisms underlying specific biological perturbation effects (e.g. disease, genetic, environmental, etc.). Results: To address these challenges, we developed Confero, a contrast data and gene set platform for downstream analysis and biological interpretation of omics data. The Confero software platform provides storage of contrast data in a simple and standard format, data transformation to enable cross-study and platform data comparison, and automatic extraction and storage of gene sets to build new a priori knowledge which is leveraged by integrated and extensible downstream computational analysis tools. Gene Set Enrichment Analysis (GSEA) and Over-Representation Analysis (ORA) are

  7. Confero: an integrated contrast data and gene set platform for computational analysis and biological interpretation of omics data.

    PubMed

    Hermida, Leandro; Poussin, Carine; Stadler, Michael B; Gubian, Sylvain; Sewer, Alain; Gaidatzis, Dimos; Hotz, Hans-Rudolf; Martin, Florian; Belcastro, Vincenzo; Cano, Stéphane; Peitsch, Manuel C; Hoeng, Julia

    2013-07-29

    High-throughput omics technologies such as microarrays and next-generation sequencing (NGS) have become indispensable tools in biological research. Computational analysis and biological interpretation of omics data can pose significant challenges due to a number of factors, in particular the systems integration required to fully exploit and compare data from different studies and/or technology platforms. In transcriptomics, the identification of differentially expressed genes when studying effect(s) or contrast(s) of interest constitutes the starting point for further downstream computational analysis (e.g. gene over-representation/enrichment analysis, reverse engineering) leading to mechanistic insights. Therefore, it is important to systematically store the full list of genes with their associated statistical analysis results (differential expression, t-statistics, p-value) corresponding to one or more effect(s) or contrast(s) of interest (shortly termed "contrast data") in a comparable manner and extract gene sets in order to efficiently support downstream analyses and further leverage data on a long-term basis. Filling this gap would open new research perspectives for biologists to discover disease-related biomarkers and to support the understanding of molecular mechanisms underlying specific biological perturbation effects (e.g. disease, genetic, environmental, etc.). To address these challenges, we developed Confero, a contrast data and gene set platform for downstream analysis and biological interpretation of omics data. The Confero software platform provides storage of contrast data in a simple and standard format, data transformation to enable cross-study and platform data comparison, and automatic extraction and storage of gene sets to build new a priori knowledge which is leveraged by integrated and extensible downstream computational analysis tools. Gene Set Enrichment Analysis (GSEA) and Over-Representation Analysis (ORA) are currently integrated

  8. Comprehensive Analysis of MILE Gene Expression Data Set Advances Discovery of Leukaemia Type and Subtype Biomarkers.

    PubMed

    Labaj, Wojciech; Papiez, Anna; Polanski, Andrzej; Polanska, Joanna

    2017-03-01

    Large collections of data in cancer studies, such as those on leukaemia, call for tailored analysis algorithms to extract as much information as possible. In this work, a custom-fit pipeline is demonstrated for thorough investigation of the voluminous MILE gene expression data set. Three analyses are carried out, each aimed at a deeper understanding of the processes underlying leukaemia types and subtypes. First, the main disease groups are tested for differential expression against the healthy control as in a standard case-control study. Here, the basic knowledge on molecular mechanisms is confirmed quantitatively and by literature references. Second, pairwise comparison testing is performed to compare the main leukaemia types with each other. Here, the general relations are identified by means of the Dice coefficient similarity measure. Moreover, lists of candidate main leukaemia group biomarkers are proposed. Finally, with this approach being successful, the third analysis provides insight into all of the studied subtypes, followed by the emergence of four leukaemia subtype biomarkers. In addition, the class enhanced DEG signature obtained on the basis of novel pipeline processing leads to significantly better classification power of multi-class data classifiers. The developed methodology consisting of batch effect adjustment, adaptive noise and feature filtration coupled with adequate statistical testing and biomarker definition proves to be an effective approach towards knowledge discovery in high-throughput molecular biology experiments.
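
    The Dice coefficient mentioned above measures how similar two sets are, for example two lists of differentially expressed genes. A minimal Python sketch follows; the function name and gene labels are placeholders, and returning 1.0 for two empty sets is a convention chosen here rather than something the abstract specifies.

```python
def dice_coefficient(a, b):
    """Dice similarity: twice the shared items over the total set sizes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))

# Placeholder gene labels for two hypothetical DEG lists.
type_one = {"GENE_A", "GENE_B", "GENE_C"}
type_two = {"GENE_B", "GENE_C", "GENE_D"}
print(dice_coefficient(type_one, type_two))
```

    The value ranges from 0 (no shared genes) to 1 (identical lists), so a pairwise table of coefficients summarizes how closely the disease groups resemble each other.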

  9. Statistical inference of selection and divergence of the rice blast resistance gene Pi-ta.

    PubMed

    Amei, Amei; Lee, Seonghee; Mysore, Kirankumar S; Jia, Yulin

    2014-10-21

    The resistance gene Pi-ta has been effectively used to control rice blast disease, but some populations of cultivated and wild rice have evolved resistance. Insights into the evolutionary processes that led to this resistance during crop domestication may be inferred from the population history of domesticated and wild rice strains. In this study, we applied a recently developed statistical method, time-dependent Poisson random field model, to examine the evolution of the Pi-ta gene in cultivated and weedy rice. Our study suggests that the Pi-ta gene may have more recently introgressed into cultivated rice, indica and japonica, and U.S. weedy rice from the wild species, O. rufipogon. In addition, the Pi-ta gene is under positive selection in japonica, tropical japonica, U.S. cultivars and U.S. weedy rice. We also found that sequences of two domains of the Pi-ta gene, the nucleotide binding site and leucine-rich repeat domain, are highly conserved among all rice accessions examined. Our results provide a valuable analytical tool for understanding the evolution of disease resistance genes in crop plants.

  10. A brain region-specific predictive gene map for autism derived by profiling a reference gene set.

    PubMed

    Kumar, Ajay; Swanwick, Catherine Croft; Johnson, Nicole; Menashe, Idan; Basu, Saumyendra N; Bales, Michael E; Banerjee-Basu, Sharmila

    2011-01-01

    Molecular underpinnings of complex psychiatric disorders such as autism spectrum disorders (ASD) remain largely unresolved. Increasingly, structural variations in discrete chromosomal loci are implicated in ASD, expanding the search space for its disease etiology. We exploited the high genetic heterogeneity of ASD to derive a predictive map of candidate genes by an integrated bioinformatics approach. Using a reference set of 84 Rare and Syndromic candidate ASD genes (AutRef84), we built a composite reference profile based on both functional and expression analyses. First, we created a functional profile of AutRef84 by performing Gene Ontology (GO) enrichment analysis which encompassed three main areas: 1) neurogenesis/projection, 2) cell adhesion, and 3) ion channel activity. Second, we constructed an expression profile of AutRef84 by conducting DAVID analysis which found enrichment in brain regions critical for sensory information processing (olfactory bulb, occipital lobe), executive function (prefrontal cortex), and hormone secretion (pituitary). Disease specificity of this dual AutRef84 profile was demonstrated by comparative analysis with control, diabetes, and non-specific gene sets. We then screened the human genome with the dual AutRef84 profile to derive a set of 460 potential ASD candidate genes. Importantly, the power of our predictive gene map was demonstrated by capturing 18 existing ASD-associated genes which were not part of the AutRef84 input dataset. The remaining 442 genes are entirely novel putative ASD risk genes. Together, we used a composite ASD reference profile to generate a predictive map of novel ASD candidate genes which should be prioritized for future research.

  11. Protein and gene model inference based on statistical modeling in k-partite graphs.

    PubMed

    Gerster, Sarah; Qeli, Ermir; Ahrens, Christian H; Bühlmann, Peter

    2010-07-06

    One of the major goals of proteomics is the comprehensive and accurate description of a proteome. Shotgun proteomics, the method of choice for the analysis of complex protein mixtures, requires that experimentally observed peptides are mapped back to the proteins they were derived from. This process is also known as protein inference. We present Markovian Inference of Proteins and Gene Models (MIPGEM), a statistical model based on clearly stated assumptions to address the problem of protein and gene model inference for shotgun proteomics data. In particular, we are dealing with dependencies among peptides and proteins using a Markovian assumption on k-partite graphs. We are also addressing the problems of shared peptides and ambiguous proteins by scoring the encoding gene models. Empirical results on two control datasets with synthetic mixtures of proteins and on complex protein samples of Saccharomyces cerevisiae, Drosophila melanogaster, and Arabidopsis thaliana suggest that the results with MIPGEM are competitive with existing tools for protein inference.

  12. Hox gene Ultrabithorax regulates distinct sets of target genes at successive stages of Drosophila haltere morphogenesis.

    PubMed

    Pavlopoulos, Anastasios; Akam, Michael

    2011-02-15

    Hox genes encode highly conserved transcription factors that regionalize the animal body axis by controlling complex developmental processes. Although they are known to operate in multiple cell types and at different stages, we are still missing the batteries of genes targeted by any one Hox gene over the course of a single developmental process to achieve a particular cell and organ morphology. The transformation of wings into halteres by the Hox gene Ultrabithorax (Ubx) in Drosophila melanogaster presents an excellent model system to study the Hox control of transcriptional networks during successive stages of appendage morphogenesis and cell differentiation. We have used an inducible misexpression system to switch on Ubx in the wing epithelium at successive stages during metamorphosis--in the larva, prepupa, and pupa. We have then used extensive microarray expression profiling and quantitative RT-PCR to identify the primary transcriptional responses to Ubx. We find that Ubx targets range from regulatory genes like transcription factors and signaling components to terminal differentiation genes affecting a broad repertoire of cell behaviors and metabolic reactions. Ubx up- and down-regulates hundreds of downstream genes at each stage, mostly in a subtle manner. Strikingly, our analysis reveals that Ubx target genes are largely distinct at different stages of appendage morphogenesis, suggesting extensive interactions between Hox genes and hormone-controlled regulatory networks to orchestrate complex genetic programs during metamorphosis.

  13. SET1A/COMPASS and shadow enhancers in the regulation of homeotic gene expression.

    PubMed

    Cao, Kaixiang; Collings, Clayton K; Marshall, Stacy A; Morgan, Marc A; Rendleman, Emily J; Wang, Lu; Sze, Christie C; Sun, Tianjiao; Bartom, Elizabeth T; Shilatifard, Ali

    2017-04-15

    The homeotic (Hox) genes are highly conserved in metazoans, where they are required for various processes in development, and misregulation of their expression is associated with human cancer. In the developing embryo, Hox genes are activated sequentially in time and space according to their genomic position within Hox gene clusters. Accumulating evidence implicates both enhancer elements and noncoding RNAs in controlling this spatiotemporal expression of Hox genes, but disentangling their relative contributions is challenging. Here, we identify two cis-regulatory elements (E1 and E2) functioning as shadow enhancers to regulate the early expression of the HoxA genes. Simultaneous deletion of these shadow enhancers in embryonic stem cells leads to impaired activation of HoxA genes upon differentiation, while knockdown of a long noncoding RNA overlapping E1 has no detectable effect on their expression. Although MLL/COMPASS (complex of proteins associated with Set1) family of histone methyltransferases is known to activate transcription of Hox genes in other contexts, we found that individual inactivation of the MLL1-4/COMPASS family members has little effect on early Hox gene activation. Instead, we demonstrate that SET1A/COMPASS is required for full transcriptional activation of multiple Hox genes but functions independently of the E1 and E2 cis-regulatory elements. Our results reveal multiple regulatory layers for Hox genes to fine-tune transcriptional programs essential for development. © 2017 Cao et al.; Published by Cold Spring Harbor Laboratory Press.

  14. Gene Set Signature of Reversal Reaction Type I in Leprosy Patients

    PubMed Central

    Orlova, Marianna; Cobat, Aurélie; Huong, Nguyen Thu; Ba, Nguyen Ngoc; Van Thuc, Nguyen; Spencer, John; Nédélec, Yohann; Barreiro, Luis; Thai, Vu Hong; Abel, Laurent; Alcaïs, Alexandre; Schurr, Erwin

    2013-01-01

    Leprosy reversal reactions type 1 (T1R) are acute immune episodes that affect a subset of leprosy patients and remain a major cause of nerve damage. Little is known about the relative importance of innate versus environmental factors in the pathogenesis of T1R. In a retrospective design, we evaluated innate differences in response to Mycobacterium leprae between healthy individuals and former leprosy patients affected or free of T1R by analyzing the transcriptome response of whole blood to M. leprae sonicate. Validation of results was conducted in a subsequent prospective study. We observed the differential expression of 581 genes upon exposure of whole blood to M. leprae sonicate in the retrospective study. We defined a 44 T1R gene set signature of differentially regulated genes. The majority of the T1R set genes were represented by three functional groups: i) pro-inflammatory regulators; ii) arachidonic acid metabolism mediators; and iii) regulators of anti-inflammation. The validity of the T1R gene set signature was replicated in the prospective arm of the study. The T1R genetic signature encompasses genes encoding pro- and anti-inflammatory mediators of innate immunity. This suggests an innate defect in the regulation of the inflammatory response to M. leprae antigens. The identified T1R gene set represents a critical first step towards a genetic profile of leprosy patients who are at increased risk of T1R and concomitant nerve damage. PMID:23874223

  15. Statistical methods in detecting differential expressed genes, analyzing insertion tolerance for genes and group selection for survival data

    NASA Astrophysics Data System (ADS)

    Liu, Fangfang

    The thesis is composed of three independent projects: (i) analyzing transposon-sequencing data to infer functions of genes on bacteria growth (chapter 2), (ii) developing a semi-parametric Bayesian method for differential gene expression analysis with RNA-sequencing data (chapter 3), (iii) solving the group selection problem for survival data (chapter 4). All projects are motivated by statistical challenges raised in biological research. The first project is motivated by the need to develop statistical models to accommodate transposon insertion sequencing (Tn-Seq) data. Tn-Seq data consist of sequence reads around each transposon insertion site. The detection of a transposon insertion at a given site indicates that the disruption of genomic sequence at this site does not cause essential function loss and the bacteria can still grow. Hence, such measurements have been used to infer the functions of each gene on bacteria growth. We propose a zero-inflated Poisson regression method for analyzing the Tn-Seq count data, and derive an Expectation-Maximization (EM) algorithm to obtain parameter estimates. We also propose a multiple testing procedure that categorizes genes into each of the three states, hypo-tolerant, tolerant, and hyper-tolerant, while controlling false discovery rate. Simulation studies show our method provides good estimation of model parameters and inference on gene functions. In the second project, we model the count data from an RNA-sequencing experiment for each gene using a Poisson-Gamma hierarchical model, or equivalently, a negative binomial (NB) model. We derive a full semi-parametric Bayesian approach with a Dirichlet process as the prior for the fold changes between two treatment means. An inference strategy using a Gibbs algorithm is developed for differential expression analysis.
We evaluate our method with several simulation studies, and the results demonstrate that our method outperforms other methods including the popularly applied ones such as edge

  16. Resolving ancient radiations: can complete plastid gene sets elucidate deep relationships among the tropical gingers (Zingiberales)?

    PubMed Central

    Barrett, Craig F.; Specht, Chelsea D.; Leebens-Mack, Jim; Stevenson, Dennis Wm.; Zomlefer, Wendy B.; Davis, Jerrold I.

    2014-01-01

    Background and Aims: Zingiberales comprise a clade of eight tropical monocot families including approx. 2500 species and are hypothesized to have undergone an ancient, rapid radiation during the Cretaceous. Zingiberales display substantial variation in floral morphology, and several members are ecologically and economically important. Deep phylogenetic relationships among primary lineages of Zingiberales have proved difficult to resolve in previous studies, representing a key region of uncertainty in the monocot tree of life. Methods: Next-generation sequencing was used to construct complete plastid gene sets for nine taxa of Zingiberales, which were added to five previously sequenced sets in an attempt to resolve deep relationships among families in the order. Variation in taxon sampling, process partition inclusion and partition model parameters were examined to assess their effects on topology and support. Key Results: Codon-based likelihood analysis identified a strongly supported clade of ((Cannaceae, Marantaceae), (Costaceae, Zingiberaceae)), sister to (Musaceae, (Lowiaceae, Strelitziaceae)), collectively sister to Heliconiaceae. However, the deepest divergences in this phylogenetic analysis comprised short branches with weak support. Additionally, manipulation of matrices resulted in differing deep topologies in an unpredictable fashion. Alternative topology testing allowed statistical rejection of some of the topologies. Saturation fails to explain observed topological uncertainty and low support at the base of Zingiberales. Evidence for conflict among the plastid data was based on a support metric that accounts for conflicting resampled topologies. Conclusions: Many relationships were resolved with robust support, but the paucity of character information supporting the deepest nodes and the existence of conflict suggest that plastid coding regions are insufficient to resolve and support the earliest divergences among families of Zingiberales.
Whole plastomes

  17. Resolving ancient radiations: can complete plastid gene sets elucidate deep relationships among the tropical gingers (Zingiberales)?

    PubMed

    Barrett, Craig F; Specht, Chelsea D; Leebens-Mack, Jim; Stevenson, Dennis Wm; Zomlefer, Wendy B; Davis, Jerrold I

    2014-01-01

    Zingiberales comprise a clade of eight tropical monocot families including approx. 2500 species and are hypothesized to have undergone an ancient, rapid radiation during the Cretaceous. Zingiberales display substantial variation in floral morphology, and several members are ecologically and economically important. Deep phylogenetic relationships among primary lineages of Zingiberales have proved difficult to resolve in previous studies, representing a key region of uncertainty in the monocot tree of life. Next-generation sequencing was used to construct complete plastid gene sets for nine taxa of Zingiberales, which were added to five previously sequenced sets in an attempt to resolve deep relationships among families in the order. Variation in taxon sampling, process partition inclusion and partition model parameters were examined to assess their effects on topology and support. Codon-based likelihood analysis identified a strongly supported clade of ((Cannaceae, Marantaceae), (Costaceae, Zingiberaceae)), sister to (Musaceae, (Lowiaceae, Strelitziaceae)), collectively sister to Heliconiaceae. However, the deepest divergences in this phylogenetic analysis comprised short branches with weak support. Additionally, manipulation of matrices resulted in differing deep topologies in an unpredictable fashion. Alternative topology testing allowed statistical rejection of some of the topologies. Saturation fails to explain observed topological uncertainty and low support at the base of Zingiberales. Evidence for conflict among the plastid data was based on a support metric that accounts for conflicting resampled topologies. Many relationships were resolved with robust support, but the paucity of character information supporting the deepest nodes and the existence of conflict suggest that plastid coding regions are insufficient to resolve and support the earliest divergences among families of Zingiberales. Whole plastomes will continue to be highly useful in plant

  18. Allele diversity for abiotic stress responsive candidate genes in chickpea reference set using gene based SNP markers

    PubMed Central

    Roorkiwal, Manish; Nayak, Spurthi N.; Thudi, Mahendar; Upadhyaya, Hari D.; Brunel, Dominique; Mournet, Pierre; This, Dominique; Sharma, Prakash C.; Varshney, Rajeev K.

    2014-01-01

    Chickpea is an important food legume crop for the semi-arid regions, however, its productivity is adversely affected by various biotic and abiotic stresses. Identification of candidate genes associated with abiotic stress response will help breeding efforts aiming to enhance its productivity. With this objective, 10 abiotic stress responsive candidate genes were selected on the basis of prior knowledge of this complex trait. These 10 genes were subjected to allele specific sequencing across a chickpea reference set comprising 300 genotypes including 211 genotypes of chickpea mini core collection. A total of 1.3 Mbp sequence data were generated. Multiple sequence alignment (MSA) revealed 79 SNPs and 41 indels in nine genes while the CAP2 gene was found to be conserved across all the genotypes. Among 10 candidate genes, the maximum number of SNPs (34) was observed in abscisic acid stress and ripening (ASR) gene including 22 transitions, 11 transversions and one tri-allelic SNP. Nucleotide diversity varied from 0.0004 to 0.0029 while polymorphism information content (PIC) values ranged from 0.01 (AKIN gene) to 0.43 (CAP2 promoter). Haplotype analysis revealed that alleles were represented by more than two haplotype blocks, except alleles of the CAP2 and sucrose synthase (SuSy) gene, where only one haplotype was identified. These genes can be used for association analysis and if validated, may be useful for enhancing abiotic stress, including drought tolerance, through molecular breeding. PMID:24926299
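
    The abstract above reports polymorphism information content (PIC) per gene. The abstract does not say which PIC formula was used; the sketch below assumes the commonly cited Botstein-style definition, PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j², where p_i are allele (or haplotype) frequencies at a locus. The function name and the example frequencies are illustrative, not values from the study.

```python
from itertools import combinations


def pic(freqs):
    """Polymorphism information content from allele frequencies.

    Assumes the Botstein-style definition:
    PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2
    """
    if any(p < 0 for p in freqs) or abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must be non-negative and sum to 1")
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


# Illustrative loci (not data from the chickpea reference set).
balanced_biallelic = pic([0.5, 0.5])    # most informative two-allele case
rare_variant = pic([0.99, 0.01])        # nearly monomorphic locus
monomorphic = pic([1.0])                # a fully conserved locus
print(balanced_biallelic, rare_variant, monomorphic)
```

    Under this definition a fully conserved locus scores 0 and a balanced two-allele locus scores 0.375; loci with more alleles or haplotypes can score higher.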

  19. Allele diversity for abiotic stress responsive candidate genes in chickpea reference set using gene based SNP markers.

    PubMed

    Roorkiwal, Manish; Nayak, Spurthi N; Thudi, Mahendar; Upadhyaya, Hari D; Brunel, Dominique; Mournet, Pierre; This, Dominique; Sharma, Prakash C; Varshney, Rajeev K

    2014-01-01

    Chickpea is an important food legume crop for the semi-arid regions, however, its productivity is adversely affected by various biotic and abiotic stresses. Identification of candidate genes associated with abiotic stress response will help breeding efforts aiming to enhance its productivity. With this objective, 10 abiotic stress responsive candidate genes were selected on the basis of prior knowledge of this complex trait. These 10 genes were subjected to allele specific sequencing across a chickpea reference set comprising 300 genotypes including 211 genotypes of chickpea mini core collection. A total of 1.3 Mbp sequence data were generated. Multiple sequence alignment (MSA) revealed 79 SNPs and 41 indels in nine genes while the CAP2 gene was found to be conserved across all the genotypes. Among 10 candidate genes, the maximum number of SNPs (34) was observed in abscisic acid stress and ripening (ASR) gene including 22 transitions, 11 transversions and one tri-allelic SNP. Nucleotide diversity varied from 0.0004 to 0.0029 while polymorphism information content (PIC) values ranged from 0.01 (AKIN gene) to 0.43 (CAP2 promoter). Haplotype analysis revealed that alleles were represented by more than two haplotype blocks, except alleles of the CAP2 and sucrose synthase (SuSy) gene, where only one haplotype was identified. These genes can be used for association analysis and if validated, may be useful for enhancing abiotic stress, including drought tolerance, through molecular breeding.

  20. Identification of a conserved set of upregulated genes in mouse skeletal muscle hypertrophy and regrowth.

    PubMed

    Chaillou, Thomas; Jackson, Janna R; England, Jonathan H; Kirby, Tyler J; Richards-White, Jena; Esser, Karyn A; Dupont-Versteegden, Esther E; McCarthy, John J

    2015-01-01

    The purpose of this study was to compare the gene expression profile of mouse skeletal muscle undergoing two forms of growth (hypertrophy and regrowth) with the goal of identifying a conserved set of differentially expressed genes. Expression profiling by microarray was performed on the plantaris muscle subjected to 1, 3, 5, 7, 10, and 14 days of hypertrophy or regrowth following 2 wk of hind-limb suspension. We identified 97 differentially expressed genes (≥2-fold increase or ≥50% decrease compared with control muscle) that were conserved during the two forms of muscle growth. The vast majority (∼90%) of the differentially expressed genes was upregulated and occurred at a single time point (64 out of 86 genes), which most often was on the first day of the time course. Microarray analysis from the conserved upregulated genes showed a set of genes related to contractile apparatus and stress response at day 1, including three genes involved in mechanotransduction and four genes encoding heat shock proteins. Our analysis further identified three cell cycle-related genes at day and several genes associated with extracellular matrix (ECM) at both days 3 and 10. In conclusion, we have identified a core set of genes commonly upregulated in two forms of muscle growth that could play a role in the maintenance of sarcomere stability, ECM remodeling, cell proliferation, fast-to-slow fiber type transition, and the regulation of skeletal muscle growth. These findings suggest conserved regulatory mechanisms involved in the adaptation of skeletal muscle to increased mechanical loading. Copyright © 2015 the American Physiological Society.
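
    The selection criterion quoted above (≥2-fold increase or ≥50% decrease relative to control) is symmetric on a ratio scale: a gene qualifies when treated/control is at least 2 or at most 0.5. A minimal sketch of that rule follows; the function name and expression values are hypothetical, and the study's actual normalisation and statistics are not described in the abstract.

```python
def classify_fold_change(treated, control):
    """Label a gene by the ratio treated/control.

    'up'   : ratio >= 2.0  (at least a 2-fold increase)
    'down' : ratio <= 0.5  (at least a 50% decrease)
    """
    if control <= 0 or treated < 0:
        raise ValueError("expects non-negative treated and positive control values")
    ratio = treated / control
    if ratio >= 2.0:
        return "up"
    if ratio <= 0.5:
        return "down"
    return "unchanged"


# Hypothetical linear-scale expression values: (treated, control).
examples = {"gene_a": (400.0, 150.0), "gene_b": (60.0, 130.0), "gene_c": (90.0, 100.0)}
labels = {name: classify_fold_change(t, c) for name, (t, c) in examples.items()}
print(labels)
```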

  1. Identification of a conserved set of upregulated genes in mouse skeletal muscle hypertrophy and regrowth

    PubMed Central

    Chaillou, Thomas; Jackson, Janna R.; England, Jonathan H.; Kirby, Tyler J.; Richards-White, Jena; Esser, Karyn A.; Dupont-Versteegden, Esther E.

    2014-01-01

    The purpose of this study was to compare the gene expression profile of mouse skeletal muscle undergoing two forms of growth (hypertrophy and regrowth) with the goal of identifying a conserved set of differentially expressed genes. Expression profiling by microarray was performed on the plantaris muscle subjected to 1, 3, 5, 7, 10, and 14 days of hypertrophy or regrowth following 2 wk of hind-limb suspension. We identified 97 differentially expressed genes (≥2-fold increase or ≥50% decrease compared with control muscle) that were conserved during the two forms of muscle growth. The vast majority (∼90%) of the differentially expressed genes was upregulated and occurred at a single time point (64 out of 86 genes), which most often was on the first day of the time course. Microarray analysis from the conserved upregulated genes showed a set of genes related to contractile apparatus and stress response at day 1, including three genes involved in mechanotransduction and four genes encoding heat shock proteins. Our analysis further identified three cell cycle-related genes at day and several genes associated with extracellular matrix (ECM) at both days 3 and 10. In conclusion, we have identified a core set of genes commonly upregulated in two forms of muscle growth that could play a role in the maintenance of sarcomere stability, ECM remodeling, cell proliferation, fast-to-slow fiber type transition, and the regulation of skeletal muscle growth. These findings suggest conserved regulatory mechanisms involved in the adaptation of skeletal muscle to increased mechanical loading. PMID:25554798

  2. ChIP-Enrich: gene set enrichment testing for ChIP-seq data.

    PubMed

    Welch, Ryan P; Lee, Chee; Imbriano, Paul M; Patil, Snehal; Weymouth, Terry E; Smith, R Alex; Scott, Laura J; Sartor, Maureen A

    2014-07-01

    Gene set enrichment testing can enhance the biological interpretation of ChIP-seq data. Here, we develop a method, ChIP-Enrich, for this analysis which empirically adjusts for gene locus length (the length of the gene body and its surrounding non-coding sequence). Adjustment for gene locus length is necessary because it is often positively associated with the presence of one or more peaks and because many biologically defined gene sets have an excess of genes with longer or shorter gene locus lengths. Unlike alternative methods, ChIP-Enrich can account for the wide range of gene locus length-to-peak presence relationships (observed in ENCODE ChIP-seq data sets). We show that ChIP-Enrich has a well-calibrated type I error rate using permuted ENCODE ChIP-seq data sets; in contrast, two commonly used gene set enrichment methods, Fisher's exact test and the binomial test implemented in Genomic Regions Enrichment of Annotations Tool (GREAT), can have highly inflated type I error rates and biases in ranking. We identify DNA-binding proteins, including CTCF, JunD and glucocorticoid receptor α (GRα), that show different enrichment patterns for peaks closer to versus further from transcription start sites. We also identify known and potential new biological functions of GRα. ChIP-Enrich is available as a web interface (http://chip-enrich.med.umich.edu) and Bioconductor package. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
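
    For reference, the one-sided Fisher's exact test that the abstract names as a baseline can be sketched as a hypergeometric tail probability. This is not the ChIP-Enrich method: it treats every gene as equally likely to carry a peak and makes no adjustment for gene locus length, which is the limitation the abstract describes. The function name, parameter names and counts are hypothetical.

```python
from math import comb


def fisher_enrichment_pvalue(n_universe, n_in_set, n_with_peak, n_overlap):
    """One-sided Fisher's exact p-value for gene set enrichment.

    Returns P(X >= n_overlap), where X is the number of gene-set members
    among n_with_peak genes drawn without replacement from n_universe
    genes, of which n_in_set belong to the gene set.
    """
    max_overlap = min(n_in_set, n_with_peak)
    total = comb(n_universe, n_with_peak)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_with_peak - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Illustrative counts only: 10 genes, 5 in the set, 5 with a peak, all 5 overlapping.
p_small = fisher_enrichment_pvalue(10, 5, 5, 5)
p_trivial = fisher_enrichment_pvalue(10, 5, 5, 0)
print(p_small, p_trivial)
```

    With these toy counts only one of the 252 possible draws gives a full overlap, so the first p-value is 1/252; requiring an overlap of at least 0 is always satisfied, so the second is 1.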

  3. StemChecker: a web-based tool to discover and explore stemness signatures in gene sets

    PubMed Central

    Pinto, José P.; Kalathur, Ravi K.; Oliveira, Daniel V.; Barata, Tânia; Machado, Rui S.R.; Machado, Susana; Pacheco-Leyva, Ivette; Duarte, Isabel; Futschik, Matthias E.

    2015-01-01

    Stem cells present unique regenerative abilities, offering great potential for treatment of prevalent pathologies such as diabetes, neurodegenerative and heart diseases. Various research groups dedicated significant effort to identify sets of genes—so-called stemness signatures—considered essential to define stem cells. However, their usage has been hindered by the lack of comprehensive resources and easy-to-use tools. For this we developed StemChecker, a novel stemness analysis tool, based on the curation of nearly fifty published stemness signatures defined by gene expression, RNAi screens, Transcription Factor (TF) binding sites, literature reviews and computational approaches. StemChecker allows researchers to explore the presence of stemness signatures in user-defined gene sets, without carrying-out lengthy literature curation or data processing. To assist in exploring underlying regulatory mechanisms, we collected over 80 target gene sets of TFs associated with pluri- or multipotency. StemChecker presents an intuitive graphical display, as well as detailed statistical results in table format, which helps revealing transcriptionally regulatory programs, indicating the putative involvement of stemness-associated processes in diseases like cancer. Overall, StemChecker substantially expands the available repertoire of online tools, designed to assist the stem cell biology, developmental biology, regenerative medicine and human disease research community. StemChecker is freely accessible at http://stemchecker.sysbiolab.eu. PMID:26007653

  4. Automated Detection of Cancer Associated Genes Using a Combined Fuzzy-Rough-Set-Based F-Information and Water Swirl Algorithm of Human Gene Expression Data

    PubMed Central

    Ahn, Byeong-Cheol

    2016-01-01

    This study describes a novel approach to reducing the challenges of highly nonlinear multiclass gene expression values for cancer diagnosis. To build a fruitful system for cancer diagnosis, in this study, we introduced two levels of gene selection such as filtering and embedding for selection of potential genes and the most relevant genes associated with cancer, respectively. The filter procedure was implemented by developing a fuzzy rough set (FR)-based method for redefining the criterion function of f-information (FI) to identify the potential genes without discretizing the continuous gene expression values. The embedded procedure is implemented by means of a water swirl algorithm (WSA), which attempts to optimize the rule set and membership function required to classify samples using a fuzzy-rule-based multiclassification system (FRBMS). Two novel update equations are proposed in WSA, which have better exploration and exploitation abilities while designing a self-learning FRBMS. The efficiency of our new approach was evaluated on 13 multicategory and 9 binary datasets of cancer gene expression. Additionally, the performance of the proposed FRFI-WSA method in designing an FRBMS was compared with existing methods for gene selection and optimization such as genetic algorithm (GA), particle swarm optimization (PSO), and artificial bee colony algorithm (ABC) on all the datasets. In the global cancer map with repeated measurements (GCM_RM) dataset, the FRFI-WSA showed the smallest number of 16 most relevant genes associated with cancer using a minimal number of 26 compact rules with the highest classification accuracy (96.45%). In addition, the statistical validation used in this study revealed that the biological relevance of the most relevant genes associated with cancer and their linguistics detected by the proposed FRFI-WSA approach are better than those in the other methods. The simple interpretable rules with most relevant genes and effectively classified

  5. Automated Detection of Cancer Associated Genes Using a Combined Fuzzy-Rough-Set-Based F-Information and Water Swirl Algorithm of Human Gene Expression Data.

    PubMed

    Ganesh Kumar, Pugalendhi; Kavitha, Muthu Subash; Ahn, Byeong-Cheol

    2016-01-01

    This study describes a novel approach to reducing the challenges of highly nonlinear multiclass gene expression values for cancer diagnosis. To build a fruitful system for cancer diagnosis, in this study, we introduced two levels of gene selection such as filtering and embedding for selection of potential genes and the most relevant genes associated with cancer, respectively. The filter procedure was implemented by developing a fuzzy rough set (FR)-based method for redefining the criterion function of f-information (FI) to identify the potential genes without discretizing the continuous gene expression values. The embedded procedure is implemented by means of a water swirl algorithm (WSA), which attempts to optimize the rule set and membership function required to classify samples using a fuzzy-rule-based multiclassification system (FRBMS). Two novel update equations are proposed in WSA, which have better exploration and exploitation abilities while designing a self-learning FRBMS. The efficiency of our new approach was evaluated on 13 multicategory and 9 binary datasets of cancer gene expression. Additionally, the performance of the proposed FRFI-WSA method in designing an FRBMS was compared with existing methods for gene selection and optimization such as genetic algorithm (GA), particle swarm optimization (PSO), and artificial bee colony algorithm (ABC) on all the datasets. In the global cancer map with repeated measurements (GCM_RM) dataset, the FRFI-WSA showed the smallest number of 16 most relevant genes associated with cancer using a minimal number of 26 compact rules with the highest classification accuracy (96.45%). In addition, the statistical validation used in this study revealed that the biological relevance of the most relevant genes associated with cancer and their linguistics detected by the proposed FRFI-WSA approach are better than those in the other methods. The simple interpretable rules with most relevant genes and effectively classified

  6. Mechanism-based biomarker gene sets for glutathione depletion-related hepatotoxicity in rats

    SciTech Connect

    Gao Weihua; Mizukawa, Yumiko; Nakatsu, Noriyuki; Minowa, Yosuke; Yamada, Hiroshi; Ohno, Yasuo; Urushidani, Tetsuro

    2010-09-15

    Chemical-induced glutathione depletion is thought to be caused by two types of toxicological mechanisms: PHO-type glutathione depletion [glutathione conjugated with chemicals such as phorone (PHO) or diethyl maleate (DEM)], and BSO-type glutathione depletion [i.e., glutathione synthesis inhibited by chemicals such as L-buthionine-sulfoximine (BSO)]. In order to identify mechanism-based biomarker gene sets for glutathione depletion in rat liver, male SD rats were treated with various chemicals including PHO (40, 120 and 400 mg/kg), DEM (80, 240 and 800 mg/kg), BSO (150, 450 and 1500 mg/kg), and bromobenzene (BBZ, 10, 100 and 300 mg/kg). Liver samples were taken 3, 6, 9 and 24 h after administration and examined for hepatic glutathione content, physiological and pathological changes, and gene expression changes using Affymetrix GeneChip Arrays. To identify differentially expressed probe sets in response to glutathione depletion, we focused on the following two courses of events for the two types of mechanisms of glutathione depletion: a) gene expression changes occurring simultaneously in response to glutathione depletion, and b) gene expression changes after glutathione was depleted. The gene expression profiles of the identified probe sets for the two types of glutathione depletion differed markedly at times during and after glutathione depletion, whereas Srxn1 was markedly increased for both types as glutathione was depleted, suggesting that Srxn1 is a key molecule in oxidative stress related to glutathione. The extracted probe sets were refined and verified using various compounds including 13 additional positive or negative compounds, and they established two useful marker sets. One contained three probe sets (Akr7a3, Trib3 and Gstp1) that could detect conjugation-type glutathione depletors any time within 24 h after dosing, and the other contained 14 probe sets that could detect glutathione depletors by any mechanism. These two sets, with appropriate scoring

  7. Single-channel EEG sleep stage classification based on a streamlined set of statistical features in wavelet domain.

    PubMed

    da Silveira, Thiago L T; Kozakevicius, Alice J; Rodrigues, Cesar R

    2017-02-01

    The main objective of this study was to enhance the performance of sleep stage classification using single-channel electroencephalograms (EEGs), which are highly desirable for many emerging technologies, such as telemedicine and home care. The proposed method consists of decomposing EEGs by a discrete wavelet transform and computing the kurtosis, skewness and variance of its coefficients at selected levels. A random forest predictor is trained to classify each epoch into one of the Rechtschaffen and Kales' stages. By performing a comprehensive set of tests on 106,376 epochs available from the Physionet public database, it is demonstrated that the use of these three statistical moments enhances performance when compared to their application in the time domain. Furthermore, the chosen set of features has the advantage of exhibiting a stable classification performance for all scoring systems, i.e., from 2- to 6-state sleep stages. The stability of the feature set is confirmed with ReliefF tests, which show a performance reduction when any individual feature is removed, suggesting that this group of features cannot be further reduced. The accuracies and kappa coefficients are higher than 90% and 0.8, respectively, for all of the 2- to 6-state sleep stage classification cases.
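
    The feature-extraction step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the wavelet family, so a Haar wavelet is assumed here, and the function names (haar_dwt_step, moments, wavelet_features) and the synthetic epoch are hypothetical. The random forest classifier that would consume these features is not shown.

```python
import math


def haar_dwt_step(signal):
    """One level of a Haar discrete wavelet transform (assumed wavelet).

    Returns (approximation, detail) coefficients. A trailing sample of an
    odd-length input is dropped.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail


def moments(coeffs):
    """Population variance, skewness and kurtosis of a coefficient list."""
    n = len(coeffs)
    mean = sum(coeffs) / n
    m2 = sum((c - mean) ** 2 for c in coeffs) / n
    m3 = sum((c - mean) ** 3 for c in coeffs) / n
    m4 = sum((c - mean) ** 4 for c in coeffs) / n
    if m2 < 1e-12:
        # Near-constant coefficients: standardized moments are undefined,
        # so report zeros (an arbitrary choice made for this sketch).
        return {"variance": m2, "skewness": 0.0, "kurtosis": 0.0}
    return {
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }


def wavelet_features(epoch, levels):
    """Variance, skewness and kurtosis of the detail coefficients per level."""
    features = []
    approx = list(epoch)
    for _ in range(levels):
        approx, detail = haar_dwt_step(approx)
        m = moments(detail)
        features.extend([m["variance"], m["skewness"], m["kurtosis"]])
    return features


# Synthetic stand-in for one EEG epoch (hypothetical data).
epoch = [math.sin(0.3 * i) for i in range(64)]
features = wavelet_features(epoch, levels=3)
print(len(features))  # 9 values: 3 moments at each of 3 levels
```

    In practice one such feature vector would be computed per epoch and passed, together with the stage labels, to a classifier.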

  8. A prognosis classifier for breast cancer based on conserved gene regulation between mammary gland development and tumorigenesis: a multiscale statistical model.

    PubMed

    Tian, Yingpu; Chen, Baozhen; Guan, Pengfei; Kang, Yujia; Lu, Zhongxian

    2013-01-01

    Identification of novel cancer genes for molecular therapy and diagnosis is a current focus of breast cancer research. Although a few small gene sets have been identified as prognosis classifiers, more powerful models are still needed to define effective gene sets for diagnosis and treatment guidance in breast cancer. In the present study, we have developed a novel statistical approach for systematic analysis of intrinsic correlations of gene expression between development and tumorigenesis in mammary gland. Based on this analysis, we constructed a predictive model for prognosis in breast cancer that may be useful for therapy decisions. We first defined developmentally associated genes from a mouse mammary gland epithelial gene expression database. Then, we found that the cancer-modulated genes were enriched in this list of developmentally associated genes. Furthermore, the developmentally associated genes had a specific expression profile, which was associated with the molecular characteristics and histological grade of the tumor. These results suggested that the processes of mammary gland development and tumorigenesis share gene regulatory mechanisms. Then, the list of genes with regulatory roles in both the developmental and tumorigenesis processes was defined as an 835-member prognosis classifier, which showed an exciting ability to predict clinical outcome of three groups of breast cancer patients (predictive accuracy 64∼72%) with a robust prognosis prediction (hazard ratio 3.3∼3.8, higher than that of other clinical risk factors (around 2.0-2.8)). In conclusion, our results identified the conserved molecular mechanisms between mammary gland development and neoplasia, and provided a unique potential model for mining unknown cancer genes and predicting the clinical status of breast tumors. These findings also suggested that developmental roles of genes may be important criteria for selecting genes for prognosis prediction in breast cancer.

  9. A set-based association test identifies sex-specific gene sets associated with type 2 diabetes

    PubMed Central

    He, Tao; Zhong, Ping-Shou; Cui, Yuehua

    2014-01-01

    Single variant analysis in genome-wide association studies (GWAS) has been proven to be successful in identifying thousands of genetic variants associated with hundreds of complex diseases. However, these identified variants only explain a small fraction of inheritable variability in many diseases, suggesting that other resources, such as multilevel genetic variations, may contribute to disease susceptibility. In this work, we proposed to combine genetic variants that belong to a gene set, such as at gene- and pathway-level to form an integrated signal aimed to identify major players that function in a coordinated manner conferring disease risk. The integrated analysis provides novel insight into disease etiology while individual signals could be easily missed by single variant analysis. We applied our approach to a genome-wide association study of type 2 diabetes (T2D) with male and female data analyzed separately. Novel sex-specific genes and pathways were identified to increase the risk of T2D. We also demonstrated the performance of signal integration through simulation studies. PMID:25429300

  10. A 80-gene set potentially predicts the relapse in laryngeal carcinoma optimized by support vector machine.

    PubMed

    Yang, Bo; Guo, Qing; Wang, Fei; Cai, Kemin; Bao, Xueli; Chu, Jiusheng

    2017-01-01

    The present study was performed to identify a gene set for predicting relapse in laryngeal carcinoma using large data analysis methods. Two gene expression profile datasets of laryngeal carcinoma (GSE27020 and GSE25727) were downloaded from a public database. Genes associated with tumor relapse, namely informative genes, were identified by Cox regression analysis. Then the protein-protein interaction (PPI) network consisting of informative genes was constructed. Afterwards, the optimized support vector machine (SVM) classifier was constructed to classify the relapsed laryngeal carcinoma samples based on genes in the specific PPI network. Furthermore, the efficiency of the SVM classifier was verified by two other independent datasets. A total of 331 informative genes were obtained from the GSE27020 and GSE25727 datasets. A PPI network specific to laryngeal carcinoma relapse was constructed, which contained informative genes and critical non-informative genes. The top 10 genes in the specific PPI network, ranked by BC (betweenness centrality) value, were APP, NTRK1, TP53, PTEN, FN1, ELAVL1, HSP90AA1, XPO1, LDHA and CDK2. The optimized SVM classifier including the top 80 genes showed an accuracy of 100% in classifying the relapsed cases among laryngeal carcinoma samples. Next, the efficiency of the SVM classifier to predict relapse samples was verified in other independent datasets, which showed an accuracy of 97.47%. The informative genes in the optimized SVM classifier were enriched in several pathways associated with tumor progression. An 80-gene set was identified as a biomarker to predict the relapse of laryngeal carcinoma, which could potentially be applied in deciding among treatments for patients with different relapse risks.

  11. Identification of a set of genes showing regionally enriched expression in the mouse brain

    PubMed Central

    D'Souza, Cletus A; Chopra, Vikramjit; Varhol, Richard; Xie, Yuan-Yun; Bohacec, Slavita; Zhao, Yongjun; Lee, Lisa LC; Bilenky, Mikhail; Portales-Casamar, Elodie; He, An; Wasserman, Wyeth W; Goldowitz, Daniel; Marra, Marco A; Holt, Robert A; Simpson, Elizabeth M; Jones, Steven JM

    2008-01-01

    Background The Pleiades Promoter Project aims to improve gene therapy by designing human mini-promoters (< 4 kb) that drive gene expression in specific brain regions or cell-types of therapeutic interest. Our goal was to first identify genes displaying regionally enriched expression in the mouse brain so that promoters designed from orthologous human genes can then be tested to drive reporter expression in a similar pattern in the mouse brain. Results We have utilized LongSAGE to identify regionally enriched transcripts in the adult mouse brain. As supplemental strategies, we also performed a meta-analysis of published literature and inspected the Allen Brain Atlas in situ hybridization data. From a set of approximately 30,000 mouse genes, 237 were identified as showing specific or enriched expression in 30 target regions of the mouse brain. GO term over-representation among these genes revealed co-involvement in various aspects of central nervous system development and physiology. Conclusion Using a multi-faceted expression validation approach, we have identified mouse genes whose human orthologs are good candidates for design of mini-promoters. These mouse genes represent molecular markers in several discrete brain regions/cell-types, which could potentially provide a mechanistic explanation of unique functions performed by each region. This set of markers may also serve as a resource for further studies of gene regulatory elements influencing brain expression. PMID:18625066
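
    The abstract does not state which test was used for GO term over-representation. One common choice is a hypergeometric tail probability, sketched below as an illustration only. The function name and the annotated/overlap counts are hypothetical; only the population size (approximately 30,000 genes) and the selected set size (237 genes) come from the abstract.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    population: total number of genes considered
    annotated:  genes in the population carrying the GO term
    selected:   size of the gene list being tested
    overlap:    genes in the list that carry the GO term
    """
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical term: 600 annotated genes, 20 of them among the 237 selected.
p = hypergeom_enrichment_p(population=30000, annotated=600, selected=237, overlap=20)
print(p)
```

    When many GO terms are tested, the resulting p-values would normally be corrected for multiple testing before interpretation.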

  12. Associations between DNA methylation and schizophrenia-related intermediate phenotypes - a gene set enrichment analysis.

    PubMed

    Hass, Johanna; Walton, Esther; Wright, Carrie; Beyer, Andreas; Scholz, Markus; Turner, Jessica; Liu, Jingyu; Smolka, Michael N; Roessner, Veit; Sponheim, Scott R; Gollub, Randy L; Calhoun, Vince D; Ehrlich, Stefan

    2015-06-03

    Multiple genetic approaches have identified microRNAs as key effectors in psychiatric disorders as they post-transcriptionally regulate expression of thousands of target genes. However, their role in specific psychiatric diseases remains poorly understood. In addition, epigenetic mechanisms such as DNA methylation, which affect the expression of both microRNAs and coding genes, are critical for our understanding of molecular mechanisms in schizophrenia. Using clinical, imaging, genetic, and epigenetic data of 103 patients with schizophrenia and 111 healthy controls of the Mind Clinical Imaging Consortium (MCIC) study of schizophrenia, we conducted gene set enrichment analysis to identify markers for schizophrenia-associated intermediate phenotypes. Genes were ranked based on the correlation between DNA methylation patterns and each phenotype, and then searched for enrichment in 221 predicted microRNA target gene sets. We found the predicted hsa-miR-219a-5p target gene set to be significantly enriched for genes (EPHA4, PKNOX1, ESR1, among others) whose methylation status is correlated with hippocampal volume independent of disease status. Our results were strengthened by significant associations between hsa-miR-219a-5p target gene methylation patterns and hippocampus-related neuropsychological variables. IPA pathway analysis of the respective predicted hsa-miR-219a-5p target genes revealed associated network functions in behavior and developmental disorders. Altered methylation patterns of predicted hsa-miR-219a-5p target genes are associated with a structural aberration of the brain that has been proposed as a possible biomarker for schizophrenia. The (dys)regulation of microRNA target genes by epigenetic mechanisms may confer additional risk for developing psychiatric symptoms. Further study is needed to understand possible interactions between microRNAs and epigenetic changes and their impact on risk for brain-based disorders such as schizophrenia.

  13. Associations between DNA methylation and schizophrenia-related intermediate phenotypes - a gene set enrichment analysis

    PubMed Central

    Hass, Johanna; Walton, Esther; Wright, Carrie; Beyer, Andreas; Scholz, Markus; Turner, Jessica; Liu, Jingyu; Smolka, Michael N.; Roessner, Veit; Sponheim, Scott R.; Gollub, Randy L.; Calhoun, Vince D.; Ehrlich, Stefan

    2015-01-01

    Multiple genetic approaches have identified microRNAs as key effectors in psychiatric disorders as they post-transcriptionally regulate expression of thousands of target genes. However, their role in specific psychiatric diseases remains poorly understood. In addition, epigenetic mechanisms such as DNA methylation, which affect the expression of both microRNAs and coding genes, are critical for our understanding of molecular mechanisms in schizophrenia. Using clinical, imaging, genetic, and epigenetic data of 103 patients with schizophrenia and 111 healthy controls of the Mind Clinical Imaging Consortium (MCIC) study of schizophrenia, we conducted gene set enrichment analysis to identify markers for schizophrenia-associated intermediate phenotypes. Genes were ranked based on the correlation between DNA methylation patterns and each phenotype, and then searched for enrichment in 221 predicted microRNA target gene sets. We found the predicted hsa-miR-219a-5p target gene set to be significantly enriched for genes (EPHA4, PKNOX1, ESR1, amongst others) whose methylation status is correlated with hippocampal volume independent of disease status. Our results were strengthened by significant associations between hsa-miR-219a-5p target gene methylation patterns and hippocampus-related neuropsychological variables. IPA pathway analysis of the respective predicted hsa-miR-219a-5p target genes revealed associated network functions in behaviour and developmental disorders. Altered methylation patterns of predicted hsa-miR-219a-5p target genes are associated with a structural aberration of the brain that has been proposed as a possible biomarker for schizophrenia. The (dys)regulation of microRNA target genes by epigenetic mechanisms may confer additional risk for developing psychiatric symptoms. Further study is needed to understand possible interactions between microRNAs and epigenetic changes and their impact on risk for brain-based disorders such as schizophrenia. PMID

  14. Regulatory Genes Controlling Anthocyanin Pigmentation Are Functionally Conserved among Plant Species and Have Distinct Sets of Target Genes.

    PubMed Central

    Quattrocchio, F; Wing, JF; Leppen, H; Mol, J; Koes, RE

    1993-01-01

    In this study, we demonstrate that in petunia at least four regulatory genes (anthocyanin-1 [an1], an2, an4, and an11) control transcription of a subset of structural genes from the anthocyanin pathway by using a combination of RNA gel blot analysis, transcription run-on assays, and transient expression assays. an2- and an11- mutants could be transiently complemented by the maize regulatory genes Leaf color (Lc) or Colorless-1 (C1), respectively, whereas an1- mutants only by Lc and C1 together. In addition, the combination of Lc and C1 induces pigment accumulation in young leaves. This indicates that Lc and C1 are both necessary and sufficient to produce pigmentation in leaf cells. Regulatory pigmentation genes in maize and petunia control different sets of structural genes. The maize Lc and C1 genes expressed in petunia differentially activate the promoters of the chalcone synthase genes chsA and chsJ in the same way that the homologous petunia genes do. This suggests that the regulatory proteins in both species are functionally similar and that the choice of target genes is determined by their promoter sequences. We present an evolutionary model that explains the differences in regulation of pigmentation pathways of maize, petunia, and snapdragon. PMID:12271045

  15. Gene- and pathway-based association tests for multiple traits with GWAS summary statistics.

    PubMed

    Kwak, Il-Youp; Pan, Wei

    2017-01-01

    To identify novel genetic variants associated with complex traits and to shed new insights on underlying biology, in addition to the most popular single SNP-single trait association analysis, it would be useful to explore multiple correlated (intermediate) traits at the gene- or pathway-level by mining existing single GWAS or meta-analyzed GWAS data. For this purpose, we present an adaptive gene-based test and a pathway-based test for association analysis of multiple traits with GWAS summary statistics. The proposed tests are adaptive at both the SNP- and trait-levels; that is, they account for possibly varying association patterns (e.g. signal sparsity levels) across SNPs and traits, thus maintaining high power across a wide range of situations. Furthermore, the proposed methods are general: they can be applied to mixed types of traits, and to Z-statistics or P-values as summary statistics obtained from either a single GWAS or a meta-analysis of multiple GWAS. Our numerical studies with simulated and real data demonstrated the promising performance of the proposed methods.

  16. Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data

    PubMed Central

    2013-01-01

    Background Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. Methods We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate -corrected p-values < 0.05) of the next-gen TM-derived gene sets and the CTD-derived gene sets in gene expression (GE) data sets of five chemicals (from experimental models). We tested for SDE of gene sets for six fibrates in a peroxisome proliferator-activated receptor alpha (PPARA) knock-out GE dataset and compared to results from the Connectivity Map. We tested for SDE of 319 next-gen TM-derived gene sets for environmental toxicants in three GE data sets of triazoles, and tested for SDE of 442 gene sets associated with embryonic structures. We compared the gene sets to triazole effects seen in the Whole Embryo Culture (WEC), and used principal component analysis (PCA) to discriminate triazoles from other chemicals. Results Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 
21 of these toxicants

  17. Different gene sets contribute to different symptom dimensions of depression and anxiety.

    PubMed

    van Veen, Tineke; Goeman, Jelle J; Monajemi, Ramin; Wardenaar, Klaas J; Hartman, Catharina A; Snieder, Harold; Nolte, Ilja M; Penninx, Brenda W J H; Zitman, Frans G

    2012-07-01

    Although many genetic association studies have been carried out, it remains unclear which genes contribute to depression. This may be due to heterogeneity of the DSM-IV category of depression. Specific symptom dimensions provide a more homogeneous phenotype. Furthermore, as effects of individual genes are small, analysis of genetic data at the pathway level provides more power to detect associations and yield valuable biological insight. In 1,398 individuals with a Major Depressive Disorder, the symptom dimensions of the tripartite model of anxiety and depression, General Distress, Anhedonic Depression, and Anxious Arousal, were measured with the Mood and Anxiety Symptoms Questionnaire (30-item Dutch adaptation; MASQ-D30). Association of these symptom dimensions with candidate gene sets and gene sets from two public pathway databases was tested using the Global test. One pathway was associated with General Distress, and concerned molecules expressed in the endoplasmic reticulum lumen. Seven pathways were associated with Anhedonic Depression. Important themes were neurodevelopment, neurodegeneration, and cytoskeleton. Furthermore, three gene sets associated with Anxious Arousal concerned development, morphology, and genetic recombination. The individual pathways explained up to 1.7% of the variance. These data demonstrate mechanisms that influence the specific dimensions. Moreover, they show the value of using dimensional phenotypes on the one hand and gene sets on the other.

  18. A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments.

    PubMed

    Broët, Philippe; Lewin, Alex; Richardson, Sylvia; Dalmasso, Cyril; Magdelenat, Henri

    2004-11-01

    Multiclass response (MCR) experiments are those in which there are more than two classes to be compared. In these experiments, though the null hypothesis is simple, there are typically many patterns of gene expression changes across the different classes that lead to complex alternatives. In this paper, we propose a new strategy for selecting genes in MCR that is based on a flexible mixture model for the marginal distribution of a modified F-statistic. Using this model, false positive and negative discovery rates can be estimated and combined to produce a rule for selecting a subset of genes. Moreover, the method proposed allows calculation of these rates for any predefined subset of genes. We illustrate the performance of our approach using simulated datasets and a real breast cancer microarray dataset. In this latter study, we investigate predefined subsets of genes and point out interesting differences between three distinct biological pathways. http://www.bgx.org.uk/software.html

  19. Protein interaction networks reveal novel autism risk genes within GWAS statistical noise.

    PubMed

    Correia, Catarina; Oliveira, Guiomar; Vicente, Astrid M

    2014-01-01

    Genome-wide association studies (GWAS) for Autism Spectrum Disorder (ASD) thus far met limited success in the identification of common risk variants, consistent with the notion that variants with small individual effects cannot be detected individually in single SNP analysis. To further capture disease risk gene information from ASD association studies, we applied a network-based strategy to the Autism Genome Project (AGP) and the Autism Genetics Resource Exchange GWAS datasets, combining family-based association data with Human Protein-Protein interaction (PPI) data. Our analysis showed that autism-associated proteins at higher than conventional levels of significance (P<0.1) directly interact more than random expectation and are involved in a limited number of interconnected biological processes, indicating that they are functionally related. The functionally coherent networks generated by this approach contain ASD-relevant disease biology, as demonstrated by an improved positive predictive value and sensitivity in retrieving known ASD candidate genes relative to the top associated genes from either GWAS, as well as a higher gene overlap between the two ASD datasets. Analysis of the intersection between the networks obtained from the two ASD GWAS and six unrelated disease datasets identified fourteen genes exclusively present in the ASD networks. These are mostly novel genes involved in abnormal nervous system phenotypes in animal models, and in fundamental biological processes previously implicated in ASD, such as axon guidance, cell adhesion or cytoskeleton organization. Overall, our results highlighted novel susceptibility genes previously hidden within GWAS statistical "noise" that warrant further analysis for causal variants.

  20. Protein Interaction Networks Reveal Novel Autism Risk Genes within GWAS Statistical Noise

    PubMed Central

    Correia, Catarina; Oliveira, Guiomar; Vicente, Astrid M.

    2014-01-01

    Genome-wide association studies (GWAS) for Autism Spectrum Disorder (ASD) thus far met limited success in the identification of common risk variants, consistent with the notion that variants with small individual effects cannot be detected individually in single SNP analysis. To further capture disease risk gene information from ASD association studies, we applied a network-based strategy to the Autism Genome Project (AGP) and the Autism Genetics Resource Exchange GWAS datasets, combining family-based association data with Human Protein-Protein interaction (PPI) data. Our analysis showed that autism-associated proteins at higher than conventional levels of significance (P<0.1) directly interact more than random expectation and are involved in a limited number of interconnected biological processes, indicating that they are functionally related. The functionally coherent networks generated by this approach contain ASD-relevant disease biology, as demonstrated by an improved positive predictive value and sensitivity in retrieving known ASD candidate genes relative to the top associated genes from either GWAS, as well as a higher gene overlap between the two ASD datasets. Analysis of the intersection between the networks obtained from the two ASD GWAS and six unrelated disease datasets identified fourteen genes exclusively present in the ASD networks. These are mostly novel genes involved in abnormal nervous system phenotypes in animal models, and in fundamental biological processes previously implicated in ASD, such as axon guidance, cell adhesion or cytoskeleton organization. Overall, our results highlighted novel susceptibility genes previously hidden within GWAS statistical “noise” that warrant further analysis for causal variants. PMID:25409314

  1. Deciphering causal and statistical relations of molecular aberrations and gene expressions in NCI-60 cell lines

    PubMed Central

    2011-01-01

    Background Cancer cells harbor a large number of molecular alterations such as mutations, amplifications and deletions on DNA sequences and epigenetic changes on DNA methylations. These aberrations may dysregulate gene expressions, which in turn drive the malignancy of tumors. Deciphering the causal and statistical relations of molecular aberrations and gene expressions is critical for understanding the molecular mechanisms of clinical phenotypes. Results In this work, we proposed a computational method to reconstruct association modules containing driver aberrations, passenger mRNA or microRNA expressions, and putative regulators that mediate the effects from drivers to passengers. By applying the module-finding algorithm to the integrated datasets of NCI-60 cancer cell lines, we found that gene expressions were driven by diverse molecular aberrations including chromosomal segments' copy number variations, gene mutations and DNA methylations, microRNA expressions, and the expressions of transcription factors. In-silico validation indicated that passenger genes were enriched with the regulator binding motifs, functional categories or pathways where the drivers were involved, and co-citations with the driver/regulator genes. Moreover, 6 of 11 predicted MYB targets were down-regulated in an MYB-siRNA treated leukemia cell line. In addition, microRNA expressions were driven by distinct mechanisms from mRNA expressions. Conclusions The results provide rich mechanistic information regarding molecular aberrations and gene expressions in cancer genomes. This kind of integrative analysis will become an important tool for the diagnosis and treatment of cancer in the era of personalized medicine. PMID:22051105

  2. Analysis of a genome-wide set of gene deletions in the fission yeast Schizosaccharomyces pombe.

    PubMed

    Kim, Dong-Uk; Hayles, Jacqueline; Kim, Dongsup; Wood, Valerie; Park, Han-Oh; Won, Misun; Yoo, Hyang-Sook; Duhig, Trevor; Nam, Miyoung; Palmer, Georgia; Han, Sangjo; Jeffery, Linda; Baek, Seung-Tae; Lee, Hyemi; Shim, Young Sam; Lee, Minho; Kim, Lila; Heo, Kyung-Sun; Noh, Eun Joo; Lee, Ah-Reum; Jang, Young-Joo; Chung, Kyung-Sook; Choi, Shin-Jung; Park, Jo-Young; Park, Youngwoo; Kim, Hwan Mook; Park, Song-Kyu; Park, Hae-Joon; Kang, Eun-Jung; Kim, Hyong Bai; Kang, Hyun-Sam; Park, Hee-Moon; Kim, Kyunghoon; Song, Kiwon; Song, Kyung Bin; Nurse, Paul; Hoe, Kwang-Lae

    2010-06-01

    We report the construction and analysis of 4,836 heterozygous diploid deletion mutants covering 98.4% of the fission yeast genome providing a tool for studying eukaryotic biology. Comprehensive gene dispensability comparisons with budding yeast--the only other eukaryote for which a comprehensive knockout library exists--revealed that 83% of single-copy orthologs in the two yeasts had conserved dispensability. Gene dispensability differed for certain pathways between the two yeasts, including mitochondrial translation and cell cycle checkpoint control. We show that fission yeast has more essential genes than budding yeast and that essential genes are more likely than nonessential genes to be present in a single copy, to be broadly conserved and to contain introns. Growth fitness analyses determined sets of haploinsufficient and haploproficient genes for fission yeast, and comparisons with budding yeast identified specific ribosomal proteins and RNA polymerase subunits, which may act more generally to regulate eukaryotic cell growth.

  3. A PLSPM-based test statistic for detecting gene-gene co-association in genome-wide association study with case-control design.

    PubMed

    Zhang, Xiaoshuai; Yang, Xiaowei; Yuan, Zhongshang; Liu, Yanxun; Li, Fangyu; Peng, Bin; Zhu, Dianwen; Zhao, Jinghua; Xue, Fuzhong

    2013-01-01

    For genome-wide association data analysis, two genes in any pathway, two SNPs in the two linked gene regions respectively or in the two linked exons respectively within one gene are often correlated with each other. We therefore proposed the concept of gene-gene co-association, which refers to the effects not only due to the traditional interaction under nearly independent condition but the correlation between two genes. Furthermore, we constructed a novel statistic for detecting gene-gene co-association based on Partial Least Squares Path Modeling (PLSPM). Through simulation, the relationship between traditional interaction and co-association was highlighted under three different types of co-association. Both simulation and real data analysis demonstrated that the proposed PLSPM-based statistic has better performance than single SNP-based logistic model, PCA-based logistic model, and other gene-based methods.

  4. Global adaptive rank truncated product method for gene-set analysis in association studies.

    PubMed

    Vilor-Tejedor, Natalia; Calle, M Luz

    2014-09-01

    Gene set analysis (GSA) aims to assess the overall association of a set of genetic variants with a phenotype and has the potential to detect subtle effects of variants in a gene or a pathway that might be missed when assessed individually. We present a new implementation of the Adaptive Rank Truncated Product method (ARTP) for analyzing the association of a set of Single Nucleotide Polymorphisms (SNPs) in a gene or pathway. The new implementation, referred to as globalARTP, improves the original one by allowing the different SNPs in the set to have different modes of inheritance. We perform a simulation study for exploring the power of the proposed methodology in a set of scenarios with different numbers of causal SNPs with different effect sizes. Moreover, we show the advantage of using the gene set approach in the context of an Alzheimer's disease case-control study where we explore the endocytosis pathway. The new method is implemented in the R function globalARTP of the globalGSA package available at http://cran.r-project.org. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
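
    As a rough illustration of the rank truncated product idea that ARTP builds on, the sketch below combines the K smallest SNP p-values for several candidate truncation points. The function name, the example p-values and the truncation points are hypothetical. The permutation step that ARTP uses to choose K adaptively and to calibrate the final p-value is omitted, and this is not the globalARTP implementation.

```python
import math

def rank_truncated_products(p_values, truncation_points):
    """Return {K: -sum(log p) over the K smallest p-values} for each K.

    Working on the -log scale avoids underflow when many small
    p-values are multiplied together.
    """
    ordered = sorted(p_values)
    stats = {}
    for k in truncation_points:
        k_eff = min(k, len(ordered))  # cannot use more p-values than exist
        stats[k] = -sum(math.log(p) for p in ordered[:k_eff])
    return stats

# Hypothetical per-SNP p-values for one gene or pathway.
snp_p_values = [0.20, 0.001, 0.65, 0.03, 0.48]
rtp = rank_truncated_products(snp_p_values, truncation_points=[1, 2, 5])
```

    A larger statistic at a given K means the K strongest signals are jointly more extreme. In an adaptive scheme each statistic would be converted to a p-value by permutation before the smallest is selected.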

  5. Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes.

    PubMed

    Datta, Susmita; Datta, Somnath

    2006-08-31

    A cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see if the genes in the same clusters can be functionally correlated. While past successes of such analyses have often been reported in a number of microarray studies (most of which used the standard hierarchical clustering, UPGMA, with one minus the Pearson's correlation coefficient as a measure of dissimilarity), often times such groupings could be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing databases of gene ontologies (GO) for the annotated genes of the relevant species. In this paper, we introduce two performance measures for evaluating the results of a clustering algorithm in its ability to produce biologically meaningful clusters. The first measure is a biological homogeneity index (BHI). As the name suggests, it is a measure of how biologically homogeneous the clusters are. This can be used to quantify the performance of a given clustering algorithm such as UPGMA in grouping genes for a particular data set and also for comparing the performance of a number of competing clustering algorithms applied to the same data set. The second performance measure is called a biological stability index (BSI). For a given clustering algorithm and an expression data set, it measures the consistency of the clustering algorithm's ability to produce biologically

  6. The Core Mouse Response to Infection by Neospora Caninum Defined by Gene Set Enrichment Analyses

    PubMed Central

    Ellis, John; Goodswen, Stephen; Kennedy, Paul J; Bush, Stephen

    2012-01-01

    In this study, the BALB/c and Qs mouse responses to infection by the parasite Neospora caninum were investigated in order to identify host response mechanisms. Investigation was done using gene set (enrichment) analyses of microarray data. GSEA, MANOVA, Romer, subGSE and SAM-GS were used to study the contrasts Neospora strain type, Mouse type (BALB/c and Qs) and time post infection (6 hours post infection and 10 days post infection). The analyses show that the major signal in the core mouse response to infection is from time post infection and can be defined by gene ontology terms Protein Kinase Activity, Cell Proliferation and Transcription Initiation. Several terms linked to signaling, morphogenesis, response and fat metabolism were also identified. At 10 days post infection, genes associated with fatty acid metabolism were identified as up regulated in expression. The value of gene set (enrichment) analyses in the analysis of microarray data is discussed. PMID:23012496

  7. Intervene: a tool for intersection and visualization of multiple gene or genomic region sets.

    PubMed

    Khan, Aziz; Mathelier, Anthony

    2017-05-31

    A common task for scientists relies on comparing lists of genes or genomic regions derived from high-throughput sequencing experiments. While several tools exist to intersect and visualize sets of genes, similar tools dedicated to the visualization of genomic region sets are currently limited. To address this gap, we have developed the Intervene tool, which provides an easy and automated interface for the effective intersection and visualization of genomic region or list sets, thus facilitating their analysis and interpretation. Intervene contains three modules: venn to generate Venn diagrams of up to six sets, upset to generate UpSet plots of multiple sets, and pairwise to compute and visualize intersections of multiple sets as clustered heat maps. Intervene, and its interactive web ShinyApp companion, generate publication-quality figures for the interpretation of genomic region and list sets. Intervene and its web application companion provide an easy command line and an interactive web interface to compute intersections of multiple genomic and list sets. They have the capacity to plot intersections using easy-to-interpret visual approaches. Intervene is developed and designed to meet the needs of both computer scientists and biologists. The source code is freely available at https://bitbucket.org/CBGR/intervene , with the web application available at https://asntech.shinyapps.io/intervene .
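
    The computation behind a pairwise comparison of this kind can be pictured as a table of intersection sizes between named sets. The following is a minimal Python sketch with made-up gene lists and a hypothetical helper name. It is not Intervene's code or its command-line interface.

```python
from itertools import combinations

def pairwise_intersections(named_sets):
    """Map each unordered pair of set names to the size of their overlap."""
    return {
        (a, b): len(named_sets[a] & named_sets[b])
        for a, b in combinations(sorted(named_sets), 2)
    }

# Hypothetical gene lists from three experiments.
gene_lists = {
    "chip_a": {"TP53", "MYC", "GATA1"},
    "chip_b": {"MYC", "GATA1", "SOX2"},
    "rnaseq": {"SOX2", "TP53"},
}
overlaps = pairwise_intersections(gene_lists)
```

    The resulting dictionary could be arranged as a symmetric matrix and passed to a heat-map routine. Genomic regions would need an interval-overlap test in place of the plain set intersection used here.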

  8. Investigating the effect of paralogs on microarray gene-set analysis

    PubMed Central

    2011-01-01

    Background In order to interpret the results obtained from a microarray experiment, researchers often shift focus from analysis of individual differentially expressed genes to analyses of sets of genes. These gene-set analysis (GSA) methods use previously accumulated biological knowledge to group genes into sets and then aim to rank these gene sets in a way that reflects their relative importance in the experimental situation in question. We suspect that the presence of paralogs affects the ability of GSA methods to accurately identify the most important sets of genes for subsequent research. Results We show that paralogs, which typically have high sequence identity and similar molecular functions, also exhibit high correlation in their expression patterns. We investigate this correlation as a potential confounding factor common to current GSA methods using Indygene http://www.cbio.uct.ac.za/indygene, a web tool that reduces a supplied list of genes so that it includes no pairwise paralogy relationships above a specified sequence similarity threshold. We use the tool to reanalyse previously published microarray datasets and determine the potential utility of accounting for the presence of paralogs. Conclusions The Indygene tool efficiently removes paralogy relationships from a given dataset and we found that such a reduction, performed prior to GSA, has the ability to generate significantly different results that often represent novel and plausible biological hypotheses. This was demonstrated for three different GSA approaches when applied to the reanalysis of previously published microarray datasets and suggests that the redundancy and non-independence of paralogs is an important consideration when dealing with GSA methodologies. PMID:21261946

  9. rapidGSEA: Speeding up gene set enrichment analysis on multi-core CPUs and CUDA-enabled GPUs.

    PubMed

    Hundt, Christian; Hildebrandt, Andreas; Schmidt, Bertil

    2016-09-23

    Gene Set Enrichment Analysis (GSEA) is a popular method to reveal significant dependencies between predefined sets of gene symbols and observed phenotypes by evaluating the deviation of gene expression values between cases and controls. An established measure of inter-class deviation, the enrichment score, is usually computed using a weighted running sum statistic over the whole set of gene symbols. Due to the lack of analytic expressions the significance of enrichment scores is determined using a non-parametric estimation of their null distribution by permuting the phenotype labels of the probed patients. Accordingly, GSEA is a time-consuming task due to the large number of required permutations to accurately estimate the nominal p-value - a circumstance that is even more pronounced during multiple hypothesis testing since its estimate is lower-bounded by the inverse number of samples in permutation space. We present rapidGSEA - a software suite consisting of two tools for facilitating permutation-based GSEA: cudaGSEA and ompGSEA. cudaGSEA is a CUDA-accelerated tool using fine-grained parallelization schemes on massively parallel architectures while ompGSEA is a coarse-grained multi-threaded tool for multi-core CPUs. Nominal p-value estimation of 4,725 gene sets on a data set consisting of 20,639 unique gene symbols and 200 patients (183 cases + 17 controls) each probing one million permutations takes 19 hours on a Xeon CPU and less than one hour on a GeForce Titan X GPU while the established GSEA tool from the Broad Institute (broadGSEA) takes roughly 13 days. cudaGSEA outperforms broadGSEA by around two orders-of-magnitude on a single Tesla K40c or GeForce Titan X GPU. ompGSEA provides around one order-of-magnitude speedup to broadGSEA on a standard Xeon CPU. The rapidGSEA suite is open-source software and can be downloaded at https://github.com/gravitino/cudaGSEA as standalone application or package for the R framework.
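
    The running-sum score and its label-permutation p-value can be sketched in a few lines of plain Python. Everything named below (the functions, the toy expression table, the difference-of-means ranking metric, the two-sided comparison) is an assumption made for illustration. It is neither the rapidGSEA nor the Broad Institute implementation, and it makes no attempt at parallelism.

```python
import random

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Weighted running sum over the ranked list; returns the largest deviation from zero."""
    hit_w = {g: abs(scores[g]) ** weight for g in ranked_genes if g in gene_set}
    total = sum(hit_w.values())
    if total == 0:  # degenerate case: fall back to unweighted steps
        hit_w = {g: 1.0 for g in hit_w}
        total = float(len(hit_w))
    n_miss = len(ranked_genes) - len(hit_w)
    running, best = 0.0, 0.0
    for g in ranked_genes:
        if g in hit_w:
            running += hit_w[g] / total   # step up on a gene-set member
        else:
            running -= 1.0 / n_miss       # step down on a non-member
        if abs(running) > abs(best):
            best = running
    return best

def mean_difference(expr, labels):
    """Per-gene mean expression in cases (label 1) minus controls (label 0)."""
    out = {}
    for gene, values in expr.items():
        case = [v for v, lab in zip(values, labels) if lab == 1]
        ctrl = [v for v, lab in zip(values, labels) if lab == 0]
        out[gene] = sum(case) / len(case) - sum(ctrl) / len(ctrl)
    return out

def score_set(expr, labels, gene_set):
    scores = mean_difference(expr, labels)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return enrichment_score(ranked, scores, gene_set)

def nominal_p_value(expr, labels, gene_set, n_perm=1000, seed=0):
    """Estimate a two-sided nominal p-value by shuffling phenotype labels."""
    rng = random.Random(seed)
    observed = score_set(expr, labels, gene_set)
    shuffled = list(labels)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(score_set(expr, shuffled, gene_set)) >= abs(observed):
            extreme += 1
    # The +1 terms keep the estimate at or above 1 / (n_perm + 1).
    return observed, (extreme + 1) / (n_perm + 1)

# Toy data: 6 genes measured in 3 cases followed by 3 controls.
expr = {
    "g1": [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],
    "g2": [4.0, 4.1, 3.8, 1.2, 0.8, 1.0],
    "g3": [1.0, 1.3, 0.7, 1.1, 0.9, 1.2],
    "g4": [2.0, 1.8, 2.2, 2.1, 1.9, 2.0],
    "g5": [0.5, 0.7, 0.6, 0.9, 0.4, 0.8],
    "g6": [3.0, 2.9, 3.1, 3.2, 3.0, 2.8],
}
labels = [1, 1, 1, 0, 0, 0]
observed, p_value = nominal_p_value(expr, labels, {"g1", "g2"}, n_perm=200)
```

    Because the estimate cannot fall below 1 / (n_perm + 1), resolving very small p-values requires a correspondingly large number of permutations, each of which repeats the full ranking and running sum. That cost is the motivation the abstract gives for GPU and multi-core acceleration.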

  10. Microarray data and gene expression statistics for Saccharomyces cerevisiae exposed to simulated asbestos mine drainage.

    PubMed

    Driscoll, Heather E; Murray, Janet M; English, Erika L; Hunter, Timothy C; Pivarski, Kara; Dolci, Elizabeth D

    2017-08-01

    Here we describe microarray expression data (raw and normalized), experimental metadata, and gene-level data with expression statistics from Saccharomyces cerevisiae exposed to simulated asbestos mine drainage from the Vermont Asbestos Group (VAG) Mine on Belvidere Mountain in northern Vermont, USA. For nearly 100 years (between the late 1890s and 1993), chrysotile asbestos fibers were extracted from serpentinized ultramafic rock at the VAG Mine for use in construction and manufacturing industries. Studies have shown that water courses and streambeds nearby have become contaminated with asbestos mine tailings runoff, including elevated levels of magnesium, nickel, chromium, and arsenic, elevated pH, and chrysotile asbestos-laden mine tailings, due to leaching and gradual erosion of massive piles of mine waste covering approximately 9 km². We exposed yeast to simulated VAG Mine tailings leachate to help gain insight on how eukaryotic cells exposed to VAG Mine drainage may respond in the mine environment. Affymetrix GeneChip® Yeast Genome 2.0 Arrays were utilized to assess gene expression after 24-h exposure to simulated VAG Mine tailings runoff. The chemistry of mine-tailings leachate, mine-tailings leachate plus yeast extract peptone dextrose media, and control yeast extract peptone dextrose media is also reported. To our knowledge this is the first dataset to assess global gene expression patterns in a eukaryotic model system simulating asbestos mine tailings runoff exposure. Raw and normalized gene expression data are accessible through the National Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO) Database Series GSE89875 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89875).

  11. A predictive signature gene set for discriminating active from latent tuberculosis in Warao Amerindian children

    PubMed Central

    2013-01-01

    Background Tuberculosis (TB) continues to cause a high toll of disease and death among children worldwide. The diagnosis of childhood TB is challenged by the paucibacillary nature of the disease and the difficulties in obtaining specimens. Whereas scientific and clinical research efforts to develop novel diagnostic tools have focused on TB in adults, childhood TB has been relatively neglected. Blood transcriptional profiling has improved our understanding of disease pathogenesis of adult TB and may offer future leads for diagnosis and treatment. No studies applying gene expression profiling of children with TB have been published so far. Results We identified a 116-gene signature set that showed an average prediction error of 11% for TB vs. latent TB infection (LTBI) and for TB vs. LTBI vs. healthy controls (HC) in our dataset. A minimal gene set of only 9 genes showed the same prediction error of 11% for TB vs. LTBI in our dataset. Furthermore, this minimal set showed a significant discriminatory value for TB vs. LTBI for all previously published adult studies using whole blood gene expression, with average prediction errors between 17% and 23%. In order to identify a robust representative gene set that would perform well in populations of different genetic backgrounds, we selected ten genes that were highly discriminative between TB, LTBI and HC in all literature datasets as well as in our dataset. Functional annotation of these genes highlights a possible role for genes involved in calcium signaling and calcium metabolism as biomarkers for active TB. These ten genes were validated by quantitative real-time polymerase chain reaction in an additional cohort of 54 Warao Amerindian children with LTBI, HC and non-TB pneumonia. Decision tree analysis indicated that five of the ten genes were sufficient to classify 78% of the TB cases correctly with no LTBI subjects wrongly classified as TB (100% specificity). Conclusions Our data justify the further exploration of our

  12. Soft truncation thresholding for gene set analysis of RNA-seq data: Application to a vaccine study

    PubMed Central

    Fridley, Brooke L.; Jenkins, Gregory D.; Grill, Diane E.; Kennedy, Richard B.; Poland, Gregory A.; Oberg, Ann L.

    2013-01-01

    Gene set analysis (GSA) has been used for analysis of microarray data to aid the interpretation and to increase statistical power. With the advent of next-generation sequencing, the use of GSA is even more relevant, as studies are often conducted on a small number of samples. We propose the use of soft truncation thresholding and the Gamma Method (GM) to determine significant gene sets (GSs), where a generalized linear model is used to assess per-gene significance. The approach was compared to other methods using an extensive simulation study and RNA-seq data from a smallpox vaccine study. The GM was found to outperform other proposed methods. Application of the GM to the smallpox vaccine study found the GSs to be moderately associated with response, including focal adhesion (p = 0.04) and extracellular matrix receptor interaction (p = 0.05). The application of GSA to RNA-seq data will provide new insights into the genomic basis of complex traits. PMID:24104466

  13. Statistical identification of gene association by CID in application of constructing ER regulatory network.

    PubMed

    Liu, Li-Yu D; Chen, Chien-Yu; Chen, Mei-Ju M; Tsai, Ming-Shian; Lee, Cho-Han S; Phang, Tzu L; Chang, Li-Yun; Kuo, Wen-Hung; Hwa, Hsiao-Lin; Lien, Huang-Chun; Jung, Shih-Ming; Lin, Yi-Shing; Chang, King-Jen; Hsieh, Fon-Jou

    2009-03-17

    A variety of high-throughput techniques are now available for constructing comprehensive gene regulatory networks in systems biology. In this study, we report a new statistical approach for facilitating in silico inference of regulatory network structure. The new measure of association, coefficient of intrinsic dependence (CID), is model-free and can be applied to both continuous and categorical distributions. When given two variables X and Y, CID answers whether Y is dependent on X by examining the conditional distribution of Y given X. In this paper, we apply CID to analyze the regulatory relationships between transcription factors (TFs) (X) and their downstream genes (Y) based on clinical data. More specifically, we use estrogen receptor alpha (ERalpha) as the variable X, and the analyses are based on 48 clinical breast cancer gene expression arrays (48A). The analytical utility of CID was evaluated in comparison with four commonly used statistical methods, Galton-Pearson's correlation coefficient (GPCC), Student's t-test (STT), coefficient of determination (CoD), and mutual information (MI). When being compared to GPCC, CoD, and MI, CID reveals its preferential ability to discover the regulatory association where distribution of the mRNA expression levels on X and Y does not fit linear models. On the other hand, when CID is used to measure the association of a continuous variable (Y) against a discrete variable (X), it shows similar performance as compared to STT, and appears to outperform CoD and MI. In addition, this study established a two-layer transcriptional regulatory network to exemplify the usage of CID, in combination with GPCC, in deciphering gene networks based on gene expression profiles from patient arrays. CID is shown to provide useful information for identifying associations between genes and transcription factors of interest in patient arrays. When coupled with the relationships detected by GPCC, the association predicted by CID are applicable

  14. Statistical identification of gene association by CID in application of constructing ER regulatory network

    PubMed Central

    Liu, Li-Yu D; Chen, Chien-Yu; Chen, Mei-Ju M; Tsai, Ming-Shian; Lee, Cho-Han S; Phang, Tzu L; Chang, Li-Yun; Kuo, Wen-Hung; Hwa, Hsiao-Lin; Lien, Huang-Chun; Jung, Shih-Ming; Lin, Yi-Shing; Chang, King-Jen; Hsieh, Fon-Jou

    2009-01-01

    Background A variety of high-throughput techniques are now available for constructing comprehensive gene regulatory networks in systems biology. In this study, we report a new statistical approach for facilitating in silico inference of regulatory network structure. The new measure of association, coefficient of intrinsic dependence (CID), is model-free and can be applied to both continuous and categorical distributions. When given two variables X and Y, CID answers whether Y is dependent on X by examining the conditional distribution of Y given X. In this paper, we apply CID to analyze the regulatory relationships between transcription factors (TFs) (X) and their downstream genes (Y) based on clinical data. More specifically, we use estrogen receptor α (ERα) as the variable X, and the analyses are based on 48 clinical breast cancer gene expression arrays (48A). Results The analytical utility of CID was evaluated in comparison with four commonly used statistical methods, Galton-Pearson's correlation coefficient (GPCC), Student's t-test (STT), coefficient of determination (CoD), and mutual information (MI). When being compared to GPCC, CoD, and MI, CID reveals its preferential ability to discover the regulatory association where distribution of the mRNA expression levels on X and Y does not fit linear models. On the other hand, when CID is used to measure the association of a continuous variable (Y) against a discrete variable (X), it shows similar performance as compared to STT, and appears to outperform CoD and MI. In addition, this study established a two-layer transcriptional regulatory network to exemplify the usage of CID, in combination with GPCC, in deciphering gene networks based on gene expression profiles from patient arrays. Conclusion CID is shown to provide useful information for identifying associations between genes and transcription factors of interest in patient arrays. When coupled with the relationships detected by GPCC, the association

  15. Yeast genome-wide screen reveals dissimilar sets of host genes affecting replication of RNA viruses

    PubMed Central

    Panavas, Tadas; Serviene, Elena; Brasher, Jeremy; Nagy, Peter D.

    2005-01-01

    Viruses are devastating pathogens of humans, animals, and plants. To further our understanding of how viruses use the resources of infected cells, we systematically tested the yeast single-gene-knockout library for the effect of each host gene on the replication of tomato bushy stunt virus (TBSV), a positive-strand RNA virus of plants. The genome-wide screen identified 96 host genes whose absence either reduced or increased the accumulation of the TBSV replicon. The identified genes are involved in the metabolism of nucleic acids, lipids, proteins, and other compounds and in protein targeting/transport. Comparison with published genome-wide screens reveals that the replication of TBSV and brome mosaic virus (BMV), which belongs to a different supergroup among plus-strand RNA viruses, is affected by vastly different yeast genes. Moreover, a set of yeast genes involved in vacuolar targeting of proteins and vesicle-mediated transport both affected replication of the TBSV replicon and enhanced the cytotoxicity of the Parkinson's disease-related α-synuclein when this protein was expressed in yeast. In addition, a set of host genes involved in ubiquitin-dependent protein catabolism affected both TBSV replication and the cytotoxicity of a mutant huntingtin protein, a candidate agent in Huntington's disease. This finding suggests that virus infection and disease-causing proteins might use or alter similar host pathways and may suggest connections between chronic diseases and prior virus infection. PMID:15883361

  16. Gene regulatory network inference using fused LASSO on multiple data sets.

    PubMed

    Omranian, Nooshin; Eloundou-Mbebi, Jeanne M O; Mueller-Roeber, Bernd; Nikoloski, Zoran

    2016-02-11

    Devising computational methods to accurately reconstruct gene regulatory networks given gene expression data is key to systems biology applications. Here we propose a method for reconstructing gene regulatory networks by simultaneous consideration of data sets from different perturbation experiments and corresponding controls. The method imposes three biologically meaningful constraints: (1) expression levels of each gene should be explained by the expression levels of a small number of transcription factor coding genes, (2) networks inferred from different data sets should be similar with respect to the type and number of regulatory interactions, and (3) relationships between genes which exhibit similar differential behavior over the considered perturbations should be favored. We demonstrate that these constraints can be transformed in a fused LASSO formulation for the proposed method. The comparative analysis on transcriptomics time-series data from prokaryotic species, Escherichia coli and Mycobacterium tuberculosis, as well as a eukaryotic species, mouse, demonstrated that the proposed method has the advantages of the most recent approaches for regulatory network inference, while obtaining better performance and assigning higher scores to the true regulatory links. The study indicates that the combination of sparse regression techniques with other biologically meaningful constraints is a promising framework for gene regulatory network reconstructions.

  17. Primer Sets Developed for Functional Genes Reveal Shifts in Functionality of Fungal Community in Soils.

    PubMed

    Hannula, S Emilia; van Veen, Johannes A

    2016-01-01

    Phylogenetic diversity of soil microbes is a hot topic at the moment. However, the molecular tools for the assessment of functional diversity in the fungal community are less developed than tools based on genes encoding the ribosomal operon. Here 20 sets of primers targeting genes involved mainly in carbon cycling were designed and/or validated and the functioning of soil fungal communities along a chronosequence of land abandonment from agriculture was evaluated using them. We hypothesized that changes in fungal community structure during secondary succession would lead to difference in the types of genes present in soils and that these changes would be directional. We expected an increase in genes involved in degradation of recalcitrant organic matter in time since agriculture. Out of the investigated genes, the richness of the genes related to carbon cycling was significantly higher in fields abandoned for longer time. The composition of six of the genes analyzed revealed significant differences between fields abandoned for shorter and longer time. However, all genes revealed significant variance over the fields studied, and this could be related to other parameters than the time since agriculture such as pH, organic matter, and the amount of available nitrogen. Contrary to our initial hypothesis, the genes significantly different between fields were not related to the decomposition of more recalcitrant matter but rather involved in degradation of cellulose and hemicellulose.

  18. Primer Sets Developed for Functional Genes Reveal Shifts in Functionality of Fungal Community in Soils

    PubMed Central

    Hannula, S. Emilia; van Veen, Johannes A.

    2016-01-01

    Phylogenetic diversity of soil microbes is a hot topic at the moment. However, the molecular tools for the assessment of functional diversity in the fungal community are less developed than tools based on genes encoding the ribosomal operon. Here 20 sets of primers targeting genes involved mainly in carbon cycling were designed and/or validated and the functioning of soil fungal communities along a chronosequence of land abandonment from agriculture was evaluated using them. We hypothesized that changes in fungal community structure during secondary succession would lead to difference in the types of genes present in soils and that these changes would be directional. We expected an increase in genes involved in degradation of recalcitrant organic matter in time since agriculture. Out of the investigated genes, the richness of the genes related to carbon cycling was significantly higher in fields abandoned for longer time. The composition of six of the genes analyzed revealed significant differences between fields abandoned for shorter and longer time. However, all genes revealed significant variance over the fields studied, and this could be related to other parameters than the time since agriculture such as pH, organic matter, and the amount of available nitrogen. Contrary to our initial hypothesis, the genes significantly different between fields were not related to the decomposition of more recalcitrant matter but rather involved in degradation of cellulose and hemicellulose. PMID:27965632

  19. Meta gene set enrichment analyses link miR-137-regulated pathways with schizophrenia risk

    PubMed Central

    Wright, Carrie; Calhoun, Vince D.; Ehrlich, Stefan; Wang, Lei; Turner, Jessica A.; Perrone-Bizzozero, Nora I.

    2015-01-01

    Background: A single nucleotide polymorphism (SNP) within MIR137, the host gene for miR-137, has been identified repeatedly as a risk factor for schizophrenia. Previous genetic pathway analyses suggest that potential targets of this microRNA (miRNA) are also highly enriched in schizophrenia-relevant biological pathways, including those involved in nervous system development and function. Methods: In this study, we evaluated the schizophrenia risk of miR-137 target genes within these pathways. Gene set enrichment analysis of pathway-specific miR-137 targets was performed using the stage 1 (21,856 subjects) schizophrenia genome wide association study data from the Psychiatric Genomics Consortium and a small independent replication cohort (244 subjects) from the Mind Clinical Imaging Consortium and Northwestern University. Results: Gene sets of potential miR-137 targets were enriched with variants associated with schizophrenia risk, including target sets involved in axonal guidance signaling, Ephrin receptor signaling, long-term potentiation, PKA signaling, and Sertoli cell junction signaling. The schizophrenia-risk association of SNPs in PKA signaling targets was replicated in the second independent cohort. Conclusions: These results suggest that these biological pathways may be involved in the mechanisms by which this MIR137 variant enhances schizophrenia risk. SNPs in targets and the miRNA host gene may collectively lead to dysregulation of target expression and aberrant functioning of such implicated pathways. Pathway-guided gene set enrichment analyses should be useful in evaluating the impact of other miRNAs and target genes in different diseases. PMID:25941532

  20. The Use of Multi-Component Statistical Techniques in Understanding Subduction Zone Arc Granitic Geochemical Data Sets

    NASA Astrophysics Data System (ADS)

    Pompe, L.; Clausen, B. L.; Morton, D. M.

    2015-12-01

    Multi-component statistical techniques and GIS visualization are emerging trends in understanding large data sets. Our research applies these techniques to a large igneous geochemical data set from southern California to better understand magmatic and plate tectonic processes. A set of 480 granitic samples collected by Baird from this area were analyzed for 39 geochemical elements. Of these samples, 287 are from the Peninsular Ranges Batholith (PRB) and 164 from part of the Transverse Ranges (TR). Principal component analysis (PCA) summarized the 39 variables into 3 principal components (PC) by matrix multiplication and for the PRB are interpreted as follows: PC1 with about 30% of the variation included mainly compatible elements and SiO2 and indicates extent of differentiation; PC2 with about 20% of the variation included HFS elements and may indicate crustal contamination as usually identified by Sri; PC3 with about 20% of the variation included mainly HRE elements and may indicate magma source depth as often displayed using REE spider diagrams and possibly Sr/Y. Several elements did not fit well in any of the three components: Cr, Ni, U, and Na2O. For the PRB, the PC1 correlation with SiO2 was r=-0.85, the PC2 correlation with Sri was r=0.80, and the PC3 correlation with Gd/Yb was r=-0.76 and with Sr/Y was r=-0.66. Extending this method to the TR, correlations were r=-0.85, -0.21, -0.06, and -0.64, respectively. A similar extent of correlation for both areas was visually evident using GIS interpolation. PC1 seems to do well at indicating differentiation index for both the PRB and TR and correlates very well with SiO2, Al2O3, MgO, FeO*, CaO, K2O, Sc, V, and Co, but poorly with Na2O and Cr. If the crustal component is represented by Sri, PC2 correlates well and less expensively with this indicator in the PRB, but not in the TR. Source depth has been related to the slope on REE spidergrams, and PC3 based on only the HREE and using the Sr/Y ratios gives a reasonable

  1. Application of a statistical software package for analysis of large patient dose data sets obtained from RIS.

    PubMed

    Fazakerley, J; Charnock, P; Wilde, R; Jones, R; Ward, M

    2010-01-01

    For the purpose of patient dose audit, clinical audit and radiology workload analysis, data from Radiology Information Systems (RIS) at many hospitals are collected using a database and the analysis was automated using a statistical package and Visual Basic coding. The database is a Structured Query Language database, which can be queried using an off-the-shelf statistical package, Statistica. Macros were created to automatically format the data to a consistent format between different hospitals ready for analysis. These macros can also be used to automate further analysis such as detailing mean kV, mAs and entrance surface dose per room and per gender. Standard deviation and standard error of the mean are also generated. Graphs can also be generated to illustrate the trends in doses between different variables such as room and gender. Collectively, this information can be used to generate a report. A process that once could take up to 1 d to complete now takes around 1 h. A major benefit in providing the service to hospital trusts is that less resource is now required to report on RIS data, making the possibility of continuous dose audit more likely. Time that was spent on sorting through data can now be spent on improving the analysis to provide benefit to the customer. Using data sets from RIS is a good way to perform dose audits as the huge numbers of data available provide the bases for very accurate analysis. Using macros written in Statistica Visual Basic has helped sort and consistently analyse these data. Being able to analyse by exposure factors has provided a more detailed report to the customer.
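
    The per-room and per-gender summaries described above (mean, standard deviation and standard error of the mean) amount to a grouped aggregation. The sketch below shows that step in Python as a stand-in for the Statistica Visual Basic macros, which are not reproduced in the abstract. The record layout, field names and kV values are invented for illustration.

```python
import math
import statistics
from collections import defaultdict

def summarise(records, value_key, group_keys):
    """Group records and return n, mean, sample SD and SEM per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in group_keys)].append(rec[value_key])
    summary = {}
    for key, values in groups.items():
        # A sample SD needs at least two values; report 0.0 for singletons.
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[key] = {
            "n": len(values),
            "mean": statistics.fmean(values),
            "sd": sd,
            "sem": sd / math.sqrt(len(values)),
        }
    return summary

# Hypothetical examination records exported from a RIS database.
exams = [
    {"room": "R1", "gender": "F", "kv": 70.0},
    {"room": "R1", "gender": "F", "kv": 74.0},
    {"room": "R1", "gender": "M", "kv": 80.0},
    {"room": "R2", "gender": "M", "kv": 90.0},
    {"room": "R2", "gender": "M", "kv": 94.0},
]
kv_summary = summarise(exams, "kv", ("room", "gender"))
```

    The same function could be called with a different `value_key`, such as mAs or entrance surface dose, provided the records carry that field.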

  2. Infrequently transcribed long genes depend on the Set2/Rpd3S pathway for accurate transcription

    PubMed Central

    Li, Bing; Gogol, Madelaine; Carey, Mike; Pattenden, Samantha G.; Seidel, Chris; Workman, Jerry L.

    2007-01-01

    The presence of Set2-mediated methylation of H3K36 (K36me) correlates with transcription frequency throughout the yeast genome. K36me targets the Rpd3S complex to deacetylate transcribed regions and suppress cryptic transcription initiation at certain genes. Here, using a genome-wide approach, we report that the Set2–Rpd3S pathway is generally required for controlling acetylation at coding regions. When using acetylation as a functional readout for this pathway, we discovered that longer genes and, surprisingly, genes transcribed at lower frequency exhibit a stronger dependency. Moreover, a systematic screen using high-resolution tiling microarrays allowed us to identify a group of genes that rely on Set2–Rpd3S to suppress spurious transcripts. Interestingly, most of these genes are within the group that depend on the same pathway to maintain a hypoacetylated state at coding regions. These data highlight the importance of using the functional readout of histone codes to define the roles of specific pathways. PMID:17545470

  3. A small set of extra-embryonic genes defines a new landmark for bovine embryo staging.

    PubMed

    Degrelle, Séverine A; Lê Cao, Kim-Anh; Heyman, Yvan; Everts, Robin E; Campion, Evelyne; Richard, Christophe; Ducroix-Crépy, Céline; Tian, X Cindy; Lewin, Harris A; Renard, Jean-Paul; Robert-Granié, Christèle; Hue, Isabelle

    2011-01-01

    Axis specification in mouse is determined by a sequence of reciprocal interactions between embryonic and extra-embryonic tissues so that a few extra-embryonic genes appear as 'patterning' the embryo. Considering these interactions as essential, but lacking in most mammals the genetically driven approaches used in mouse and the corresponding patterning mutants, we examined whether a molecular signature originating from extra-embryonic tissues could relate to the developmental stage of the embryo proper and predict it. To this end, we have profiled bovine extra-embryonic tissues at peri-implantation stages, when gastrulation and early neurulation occur, and analysed the subsequent expression profiles through the use of predictive methods as previously reported for tumour classification. A set of six genes (CALM1, CPA3, CITED1, DLD, HNRNPDL, and TGFB3), half of which had not been previously associated with any extra-embryonic feature, appeared significantly discriminative and mainly dependent on embryonic tissues for its faithful expression. The predictive value of this set of genes for gastrulation and early neurulation stages, as assessed on naive samples, was remarkably high (93%). In silico connected to the bovine orthologues of the mouse patterning genes, this gene set is proposed as a new trait for embryo staging. As such, this will allow saving the bovine embryo proper for molecular or cellular studies. To us, it offers as well new perspectives for developmental phenotyping and modelling of embryonic/extra-embryonic co-differentiation.

  4. Transcriptomic sequencing reveals a set of unique genes activated by butyrate-induced histone modification

    USDA-ARS's Scientific Manuscript database

    Butyrate is a nutritional element with strong epigenetic regulatory activity as an inhibitor of histone deacetylases (HDACs). Based on the analysis of differentially expressed genes induced by butyrate in the bovine epithelial cell using deep RNA-sequencing technology (RNA-seq), a set of unique gen...

  5. Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms.

    PubMed

    Guo, Yu; Graber, Armin; McBurney, Robert N; Balasubramanian, Raji

    2010-09-03

    Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques. The analysis of data from seven 'omics' studies revealed that the average magnitude of effect size observed in human studies was markedly lower when compared to that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper. No single classifier had optimal performance under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.

  6. Expansion and diversification of the SET domain gene family following whole-genome duplications in Populus trichocarpa

    PubMed Central

    2012-01-01

    Background: Histone lysine methylation modifies chromatin structure and regulates eukaryotic gene transcription and a variety of developmental and physiological processes. SET domain proteins are lysine methyltransferases containing the evolutionarily-conserved SET domain, which is known to be the catalytic domain. Results: We identified 59 SET genes in the Populus genome. Phylogenetic analyses of 106 SET genes from Populus and Arabidopsis supported the clustering of SET genes into six distinct subfamilies and identified 19 duplicated gene pairs in Populus. The chromosome locations of these gene pairs and the distribution of synonymous substitution rates showed that the expansion of the SET gene family might be caused by large-scale duplications in Populus. Comparison of gene structures and domain architectures of each duplicate pair indicated that divergence took place at the 3'- and 5'-terminal transcribed regions and at the N- and C-termini of the predicted proteins, respectively. Expression profile analysis of Populus SET genes suggested that most Populus SET genes were expressed widely, many with the highest expression in young leaves. In particular, the expression profiles of 12 of the 19 duplicated gene pairs fell into two types of expression patterns. Conclusions: The 19 duplicated SET genes could have originated from whole genome duplication events. The differences in SET gene structure, domain architecture, and expression profiles in various tissues of Populus suggest that members of the SET gene family have a variety of developmental and physiological functions. Our study provides clues about the evolution of epigenetic regulation of chromatin structure and gene expression. PMID:22497662

  7. Learning contextual gene set interaction networks of cancer with condition specificity

    PubMed Central

    2013-01-01

    Background: Identifying similarities and differences in the molecular constitutions of various types of cancer is one of the key challenges in cancer research. The appearances of a cancer depend on complex molecular interactions, including gene regulatory networks and gene-environment interactions. This complexity makes it challenging to decipher the molecular origin of the cancer. In recent years, many studies reported methods to uncover heterogeneous depictions of complex cancers, which are often categorized into different subtypes. The challenge is to identify diverse molecular contexts within a cancer, to relate them to different subtypes, and to learn underlying molecular interactions specific to molecular contexts so that we can recommend context-specific treatment to patients. Results: In this study, we describe a novel method to discern molecular interactions specific to certain molecular contexts. Unlike conventional approaches to build modular networks of individual genes, our focus is to identify cancer-generic and subtype-specific interactions between contextual gene sets, each of which shares coherent transcriptional patterns across a subset of samples. We then apply a novel formulation for quantitating the effect of the samples from each subtype on the calculated strength of interactions observed. Two cancer data sets were analyzed to support the validity of condition-specificity of identified interactions. When compared to an existing approach, the proposed method was much more sensitive in identifying condition-specific interactions even in heterogeneous data sets. The results also revealed that network components specific to different types of cancer are related to different biological functions than cancer-generic network components. We found not only results that are consistent with previous studies, but also new hypotheses on the biological mechanisms specific to certain cancer types that warrant further

  8. Repeated observation of immune gene sets enrichment in women with non-small cell lung cancer

    PubMed Central

    Araujo, Jhajaira M.; Prado, Alexandra; Cardenas, Nadezhda K.; Zaharia, Mayer; Dyer, Richard; Doimi, Franco; Bravo, Leny; Pinillos, Luis; Morante, Zaida; Aguilar, Alfredo; Mas, Luis A.; Gomez, Henry L.; Vallejos, Carlos S.; Rolfo, Christian; Pinto, Joseph A.

    2016-01-01

    There are different biological and clinical patterns of lung cancer between genders indicating intrinsic differences leading to increased sensitivity to cigarette smoke-induced DNA damage, mutational patterns of KRAS and better clinical outcomes in women, while differences between genders at gene-expression levels were not previously reported. Here we show an enrichment of immune genes in NSCLC in women compared to men. We found in a GSEA analysis (by biological processes annotated from Gene Ontology) of six public datasets a repeated observation of immune gene sets enrichment in women. “Immune system process”, “immune response”, “defense response”, “cellular defense response” and “regulation of immune system process” were the gene sets most over-represented while APOBEC3G, APOBEC3F, LAT, CD1D and CCL5 represented the top-five core genes. Characterization of immune cell composition with the platform CIBERSORT showed no differences between genders; however, there were differences when tumor tissues were compared to normal tissues. Our results suggest different immune responses in NSCLC between genders that could be related to the different clinical outcome. PMID:26958810

  9. Comprehensive set of integrative plasmid vectors for copper-inducible gene expression in Myxococcus xanthus.

    PubMed

    Gómez-Santos, Nuria; Treuner-Lange, Anke; Moraleda-Muñoz, Aurelio; García-Bravo, Elena; García-Hernández, Raquel; Martínez-Cayuela, Marina; Pérez, Juana; Søgaard-Andersen, Lotte; Muñoz-Dorado, José

    2012-04-01

    Myxococcus xanthus is widely used as a model system for studying gliding motility, multicellular development, and cellular differentiation. Moreover, M. xanthus is a rich source of novel secondary metabolites. The analysis of these processes has been hampered by the limited set of tools for inducible gene expression. Here we report the construction of a set of plasmid vectors to allow copper-inducible gene expression in M. xanthus. Analysis of the effect of copper on strain DK1622 revealed that copper concentrations of up to 500 μM during growth and 60 μM during development do not affect physiological processes such as cell viability, motility, or aggregation into fruiting bodies. Of the copper-responsive promoters in M. xanthus reported so far, the multicopper oxidase cuoA promoter was used to construct expression vectors, because no basal expression is observed in the absence of copper and induction linearly depends on the copper concentration in the culture medium. Four different plasmid vectors have been constructed, with different marker selection genes and sites of integration in the M. xanthus chromosome. The vectors have been tested and gene expression quantified using the lacZ gene. Moreover, we demonstrate the functional complementation of the motility defect caused by lack of PilB by the copper-induced expression of the pilB gene. These versatile vectors are likely to deepen our understanding of the biology of M. xanthus and may also have biotechnological applications.

  10. Gowinda: unbiased analysis of gene set enrichment for genome-wide association studies.

    PubMed

    Kofler, Robert; Schlötterer, Christian

    2012-08-01

    An analysis of gene set [e.g. Gene Ontology (GO)] enrichment assumes that all genes are sampled independently from each other with the same probability. These assumptions are violated in genome-wide association (GWA) studies since (i) longer genes typically have more single-nucleotide polymorphisms resulting in a higher probability of being sampled and (ii) overlapping genes are sampled in clusters. Herein, we introduce Gowinda, a software specifically designed to test for enrichment of gene sets in GWA studies. We show that GO tests on GWA data could result in a substantial number of false-positive GO terms. Permutation tests implemented in Gowinda eliminate these biases, but maintain sufficient power to detect enrichment of GO terms. Since sufficient resolution for large datasets requires millions of permutations, we use multi-threading to keep computation times reasonable. Gowinda is implemented in Java (v1.6) and freely available on http://code.google.com/p/gowinda/ christian.schloetterer@vetmeduni.ac.at Manual: http://code.google.com/p/gowinda/wiki/Manual. Test data and tutorial: http://code.google.com/p/gowinda/wiki/Tutorial. http://code.google.com/p/gowinda/wiki/VALIDATION.
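
    The permutation idea in this abstract can be illustrated with a toy example. The sketch below is not Gowinda's implementation (which is written in Java and multi-threaded); it only shows, under simplified assumptions, why sampling random SNPs rather than random genes accounts for gene length. All gene names, SNP counts and the gene set are hypothetical.

```python
import random

# Hypothetical toy genome: each SNP id maps to the gene that contains it.
# Long genes carry more SNPs, which is the bias the abstract describes.
gene_snp_counts = {"longA": 40, "longB": 30, "short1": 2, "short2": 2,
                   "short3": 2, "short4": 2}
snp_to_gene = {}
next_id = 0
for gene, count in gene_snp_counts.items():
    for _ in range(count):
        snp_to_gene[next_id] = gene
        next_id += 1

gene_set = {"short1", "short2", "short3"}  # hypothetical GO category

def genes_hit(snps):
    """Collapse a SNP sample to the set of distinct genes it touches."""
    return {snp_to_gene[s] for s in snps}

def permutation_p_value(candidate_snps, category, n_perm=2000, seed=0):
    """Empirical p-value for enrichment of `category` among candidate genes.

    Random SNP sets of the same size are drawn from all SNPs, so a long
    gene is hit more often by chance, just as it is in the real data.
    """
    rng = random.Random(seed)
    all_snps = list(snp_to_gene)
    observed = len(genes_hit(candidate_snps) & category)
    at_least = 0
    for _ in range(n_perm):
        sample = rng.sample(all_snps, len(candidate_snps))
        if len(genes_hit(sample) & category) >= observed:
            at_least += 1
    # The add-one correction keeps the estimate away from exactly zero.
    return observed, (at_least + 1) / (n_perm + 1)

# Candidate SNPs 70, 72 and 74 fall in short1, short2 and short3; SNP 0 in longA.
observed, p_value = permutation_p_value([70, 72, 74, 0], gene_set)
print(observed, p_value)
```

    The smallest p-value such a scheme can report is 1 / (n_perm + 1), which is why the abstract notes that large datasets need millions of permutations.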

  11. Repeated observation of immune gene sets enrichment in women with non-small cell lung cancer.

    PubMed

    Araujo, Jhajaira M; Prado, Alexandra; Cardenas, Nadezhda K; Zaharia, Mayer; Dyer, Richard; Doimi, Franco; Bravo, Leny; Pinillos, Luis; Morante, Zaida; Aguilar, Alfredo; Mas, Luis A; Gomez, Henry L; Vallejos, Carlos S; Rolfo, Christian; Pinto, Joseph A

    2016-04-12

    There are different biological and clinical patterns of lung cancer between genders indicating intrinsic differences leading to increased sensitivity to cigarette smoke-induced DNA damage, mutational patterns of KRAS and better clinical outcomes in women, while differences between genders at gene-expression levels were not previously reported. Here we show an enrichment of immune genes in NSCLC in women compared to men. We found in a GSEA analysis (by biological processes annotated from Gene Ontology) of six public datasets a repeated observation of immune gene sets enrichment in women. "Immune system process", "immune response", "defense response", "cellular defense response" and "regulation of immune system process" were the gene sets most over-represented while APOBEC3G, APOBEC3F, LAT, CD1D and CCL5 represented the top-five core genes. Characterization of immune cell composition with the platform CIBERSORT showed no differences between genders; however, there were differences when tumor tissues were compared to normal tissues. Our results suggest different immune responses in NSCLC between genders that could be related to the different clinical outcome.

  12. det1, cop1, and cop9 mutations cause inappropriate expression of several gene sets.

    PubMed Central

    Mayer, R; Raventos, D; Chua, N H

    1996-01-01

    Genetic studies using Arabidopsis offer a promising approach to investigate the mechanisms of light signal transduction during seedling development. Several mutants, called det/cop, have been isolated based on their deetiolated/constitutive photomorphogenic phenotypes in the dark. This study examines the specificity of the det/cop mutations with respect to their effects on genes regulated by other signal transduction pathways. Steady state mRNA levels of a number of differently regulated gene sets were compared between mutants and the wild type. We found that det2, cop2, cop3, and cop4 mutants displayed a gene expression pattern similar to that of the wild type. By contrast, det1, cop1, and cop9 mutations exhibited pleiotropic effects. In addition to light-responsive genes, genes normally inducible by plant pathogens, hypoxia, and developmental programs were inappropriately expressed in these mutants. Our data provide evidence that DET1, COP1, and COP9 most likely act as negative regulators of several sets of genes, not just those involved in light-regulated seedling development. PMID:8953766

  13. det1, cop1, and cop9 mutations cause inappropriate expression of several gene sets.

    PubMed

    Mayer, R; Raventos, D; Chua, N H

    1996-11-01

    Genetic studies using Arabidopsis offer a promising approach to investigate the mechanisms of light signal transduction during seedling development. Several mutants, called det/cop, have been isolated based on their deetiolated/constitutive photomorphogenic phenotypes in the dark. This study examines the specificity of the det/cop mutations with respect to their effects on genes regulated by other signal transduction pathways. Steady state mRNA levels of a number of differently regulated gene sets were compared between mutants and the wild type. We found that det2, cop2, cop3, and cop4 mutants displayed a gene expression pattern similar to that of the wild type. By contrast, det1, cop1, and cop9 mutations exhibited pleiotropic effects. In addition to light-responsive genes, genes normally inducible by plant pathogens, hypoxia, and developmental programs were inappropriately expressed in these mutants. Our data provide evidence that DET1, COP1, and COP9 most likely act as negative regulators of several sets of genes, not just those involved in light-regulated seedling development.

  14. Identification of Gene Markers Associated with Aggressive Meningioma by Filtering across Multiple Sets of Gene Expression Arrays

    PubMed Central

    Stuart, Jourdan E.; Lusis, Eriks A.; Scheck, Adrienne C.; Coons, Stephen W.; Lal, Anita; Perry, Arie; Gutmann, David H.

    2013-01-01

    Meningiomas are common intracranial tumors but relatively little is known about the genetic events responsible for their clinical diversity. While recent genomic studies have provided clues, the genes identified often differ among publications. We used microarray expression profiling to identify genes that are differentially expressed, with at least a 4-fold change, between grade I and grade III meningiomas. We filtered this initial set of potential biomarkers through a second cohort of meningiomas and then verified the remaining genes by quantitative polymerase chain reaction followed by examination using a third microarray expression cohort. Using this approach, we identified 9 overexpressed (TPX2, RRM2, TOP2A, PI3, BIRC5, CDC2, NUSAP1, DLG7, SOX11) and 2 underexpressed (TIMP3, KCNMA1) genes in grade III vs. grade I meningiomas. As a further validation step, we analyzed these genes in a fourth cohort and found that patients with grade II meningiomas with high topoisomerase 2-α protein expression (greater than 5% labeling-index) had shorter times to death than patients with low expression. We believe that this multistep, multi-cohort approach provides a robust method for reducing false positives while generating a list of reproducible candidate genes that are associated with clinically aggressive meningioma and are suitable for analysis for their potential prognostic value. PMID:21157382

  15. Evaluation of statistical treatments of left-censored environmental data using coincident uncensored data sets. II. Group comparisons

    USGS Publications Warehouse

    Antweiler, Ronald C.

    2015-01-01

    The main classes of statistical treatments that have been used to determine if two groups of censored environmental data arise from the same distribution are substitution methods, maximum likelihood (MLE) techniques, and nonparametric methods. These treatments along with using all instrument-generated data (IN), even those less than the detection limit, were evaluated by examining 550 data sets in which the true values of the censored data were known, and therefore “true” probabilities could be calculated and used as a yardstick for comparison. It was found that technique “quality” was strongly dependent on the degree of censoring present in the groups. For low degrees of censoring (<25% in each group), the Generalized Wilcoxon (GW) technique and substitution of √2/2 times the detection limit gave overall the best results. For moderate degrees of censoring, MLE worked best, but only if the distribution could be estimated to be normal or log-normal prior to its application; otherwise, GW was a suitable alternative. For higher degrees of censoring (each group >40% censoring), no technique provided reliable estimates of the true probability. Group size did not appear to influence the quality of the result, and no technique appeared to become better or worse than other techniques relative to group size. Finally, IN appeared to do very well relative to the other techniques regardless of censoring or group size.
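
    As an illustration of the substitution treatment mentioned above, the sketch below replaces each left-censored value with √2/2 times the detection limit before comparing two groups. The detection limit, the data and the choice of a Welch t statistic for the comparison are assumptions made for this example and are not taken from the study.

```python
import math
import statistics

DETECTION_LIMIT = 1.0  # hypothetical detection limit, arbitrary units

# None marks a value reported as "< detection limit" (left-censored).
group_a = [None, 1.4, 2.0, 3.1, 1.2, 1.8, 2.6, 4.0]
group_b = [1.2, None, 1.5, 2.2, 1.9, 3.3, 2.8, 1.1]

def censored_fraction(values):
    """Share of observations that fall below the detection limit."""
    return sum(v is None for v in values) / len(values)

def substitute(values, limit=DETECTION_LIMIT, factor=math.sqrt(2) / 2):
    """Replace each censored value with factor * limit."""
    return [factor * limit if v is None else v for v in values]

def welch_t(x, y):
    """Welch's t statistic for two samples with unequal variances."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(
        vx / len(x) + vy / len(y))

filled_a, filled_b = substitute(group_a), substitute(group_b)
print(censored_fraction(group_a), censored_fraction(group_b))
print(welch_t(filled_a, filled_b))
```

    Both toy groups have 12.5% censoring, which sits in the low-censoring range (<25% in each group) where the abstract reports that this substitution performed well.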

  16. Robust nuclei segmentation in cyto-histopathological images using statistical level set approach with topology preserving constraint

    NASA Astrophysics Data System (ADS)

    Taheri, Shaghayegh; Fevens, Thomas; Bui, Tien D.

    2017-02-01

    Computerized assessments for diagnosis or malignancy grading of cyto-histopathological specimens have drawn increased attention in the field of digital pathology. Automatic segmentation of cell nuclei is a fundamental step in such automated systems. Despite considerable research, nuclei segmentation is still a challenging task due to noise, nonuniform illumination, and most importantly, in 2D projection images, overlapping and touching nuclei. In most published approaches, nuclei refinement is a post-processing step after segmentation, which usually refers to the task of detaching the aggregated nuclei or merging the over-segmented nuclei. In this work, we present a novel segmentation technique which effectively addresses the problem of individually segmenting touching or overlapping cell nuclei during the segmentation process. The proposed framework is a region-based segmentation method, which consists of three major modules: i) the image is passed through a color deconvolution step to extract the desired stains; ii) then the generalized fast radial symmetry transform is applied to the image followed by non-maxima suppression to specify the initial seed points for nuclei, and their corresponding GFRS ellipses which are interpreted as the initial nuclei borders for segmentation; iii) finally, these nuclei border initial curves are evolved through the use of a statistical level-set approach along with topology preserving criteria for segmentation and separation of nuclei at the same time. The proposed method is evaluated using Hematoxylin and Eosin, and fluorescent stained images, performing qualitative and quantitative analysis, showing that the method outperforms thresholding and watershed segmentation approaches.

  17. A quality improvement project using statistical process control methods for type 2 diabetes control in a resource-limited setting.

    PubMed

    Flood, David; Douglas, Kate; Goldberg, Vera; Martinez, Boris; Garcia, Pablo; Arbour, MaryCatherine; Rohloff, Peter

    2017-05-09

    Quality improvement (QI) is a key strategy for improving diabetes care in low- and middle-income countries (LMICs). This study reports on a diabetes QI project in rural Guatemala whose primary aim was to improve glycemic control of a panel of adult diabetes patients. Formative research suggested multiple areas for programmatic improvement in ambulatory diabetes care. This project utilized the Model for Improvement and Agile Global Health, our organization's complementary healthcare implementation framework. A bundle of improvement activities were implemented at the home, clinic and institutional level. Control charts of mean hemoglobin A1C (HbA1C) and proportion of patients meeting target HbA1C showed improvement as special cause variation was identified 3 months after the intervention began. Control charts for secondary process measures offered insights into the value of different components of the intervention. Intensity of home-based diabetes education emerged as an important driver of panel glycemic control. Diabetes QI work is feasible in resource-limited settings in LMICs and can improve glycemic control. Statistical process control charts are a promising methodology for use with panels or registries of diabetes patients.
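
    A control chart for the proportion of patients meeting a target can be sketched as a p-chart with 3-sigma limits. The monthly counts below are hypothetical, and flagging a single point outside the limits is only one of several common special-cause rules; the abstract does not state which rules the project applied.

```python
import math

# Hypothetical monthly data: (patients meeting HbA1C target, patients measured).
baseline = [(12, 40), (10, 38), (13, 42), (11, 40), (12, 41), (11, 39)]
follow_up = [(14, 40), (24, 41), (26, 40)]

def p_chart_limits(baseline_counts):
    """Return the pooled centre line and a function giving the 3-sigma
    control limits for a subgroup of size n."""
    hits = sum(x for x, _ in baseline_counts)
    total = sum(n for _, n in baseline_counts)
    p_bar = hits / total

    def limits(n):
        half_width = 3 * math.sqrt(p_bar * (1 - p_bar) / n)
        return max(0.0, p_bar - half_width), min(1.0, p_bar + half_width)

    return p_bar, limits

def special_cause_points(baseline_counts, new_counts):
    """Indices of new subgroups whose proportion lies outside the limits."""
    _, limits = p_chart_limits(baseline_counts)
    flagged = []
    for index, (x, n) in enumerate(new_counts):
        lower, upper = limits(n)
        if not lower <= x / n <= upper:
            flagged.append(index)
    return flagged

print(special_cause_points(baseline, follow_up))
```

    Because the limits depend on each month's denominator, months with fewer measured patients get wider limits, which suits a panel whose size changes over time.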

  18. Cross-species gene expression analysis identifies a novel set of genes implicated in human insulin sensitivity.

    PubMed

    Chaudhuri, Rima; Khoo, Poh Sim; Tonks, Katherine; Junutula, Jagath R; Kolumam, Ganesh; Modrusan, Zora; Samocha-Bonet, Dorit; Meoli, Christopher C; Hocking, Samantha; Fazakerley, Daniel J; Stöckli, Jacqueline; Hoehn, Kyle L; Greenfield, Jerry R; Yang, Jean Yee Hwa; James, David E

    2015-01-01

    Insulin resistance (IR) is one of the earliest predictors of type 2 diabetes. However, diagnosis of IR is limited. High fat fed mouse models provide key insights into IR. We hypothesized that early features of IR are associated with persistent changes in gene expression (GE) and endeavored to (a) develop novel methods for improving signal:noise in analysis of human GE using mouse models; (b) identify a GE motif that accurately diagnoses IR in humans; and (c) identify novel biology associated with IR in humans. We integrated human muscle GE data with longitudinal mouse GE data and developed an unbiased three-level cross-species analysis platform (single gene, gene set, and networks) to generate a gene expression motif (GEM) indicative of IR. A logistic regression classification model validated GEM in three independent human data sets (n=115). This GEM of 93 genes substantially improved diagnosis of IR compared with routine clinical measures across multiple independent data sets. Individuals misclassified by GEM possessed other metabolic features raising the possibility that they represent a separate metabolic subclass. The GEM was enriched in pathways previously implicated in insulin action and revealed novel associations between β-catenin and Jak1 and IR. Functional analyses using small molecule inhibitors showed an important role for these proteins in insulin action. This study shows that systems approaches for identifying molecular signatures provide a powerful way to stratify individuals into discrete metabolic groups. Moreover, we speculate that the β-catenin pathway may represent a novel biomarker for IR in humans that warrants future investigation.

  19. A simple gene set-based method accurately predicts the synergy of drug pairs.

    PubMed

    Hsu, Yu-Ching; Chiu, Yu-Chiao; Chen, Yidong; Hsiao, Tzu-Hung; Chuang, Eric Y

    2016-08-26

    The advance in targeted therapy has greatly increased the effectiveness of clinical cancer therapy and reduced the cytotoxicity of treatments to normal cells. However, patients still suffer from cancer relapse due to the occurrence of drug resistance. There is a great need to explore potential combinatorial drug therapy since an individual drug alone may not be sufficient to inhibit continuous activation of cancer-addicted genes or pathways. The DREAM challenge has confirmed the potential of computational methods for predicting synergistic drug combinations, while the prediction accuracy can be further improved. Based on previous reports, we hypothesized the similarity in biological functions or genes perturbed by two drugs can determine their synergistic effects. To test the feasibility of the hypothesis, we proposed three scoring systems: co-gene score, co-GS score, and co-gene/GS score, measuring the similarities in genes with significant expressional changes, enriched gene sets, and significantly changed genes within enriched gene sets between a pair of drugs, respectively. Performances of these scoring systems were evaluated by the probabilistic c-index (PC-index) devised by the DREAM consortium. We also applied the proposed method to the Connectivity Map dataset to explore more potential synergistic drug combinations. Using a gold standard derived by the DREAM consortium, we confirmed the prediction power of the three scoring systems (all P-values < 0.05). The co-gene/GS score achieved the best prediction of drug synergy (PC-index = 0.663, P-value < 0.0001), outperforming all methods proposed during DREAM challenge. Furthermore, a binary classification test showed that co-gene/GS scoring was highly accurate and specific. Since our method is constructed on a gene set-based analysis, in addition to synergy prediction, it provides insights into the functional relevance of drug combinations and the underlying mechanisms by which drugs achieve synergy

  20. Developmental Control of Stress Stimulons in Streptomyces coelicolor Revealed by Statistical Analyses of Global Gene Expression Patterns

    PubMed Central

    Vohradsky, J.; Li, X.-M.; Dale, G.; Folcher, M.; Nguyen, L.; Viollier, P. H.; Thompson, C. J.

    2000-01-01

    Stress-induced regulatory networks coordinated with a procaryotic developmental program were revealed by two-dimensional gel analyses of global gene expression. Four developmental stages were identified by their distinctive protein synthesis patterns using principal component analysis. Statistical analyses focused on five stress stimulons (induced by heat, cold, salt, ethanol, or antibiotic shock) and their synthesis during development. Unlike other bacteria, for which various stresses induce expression of similar sets of protein spots, in Streptomyces coelicolor heat, salt, and ethanol stimulons were composed of independent sets of proteins. This suggested independent control by different physiological stress signals and their corresponding regulatory systems. These stress proteins were also under developmental control. Cluster analysis of stress protein synthesis profiles identified 10 different developmental patterns or “synexpression groups.” Proteins induced by cold, heat, or salt shock were enriched in three developmental synexpression groups. In addition, certain proteins belonging to the heat and salt shock stimulons were coregulated during development. Thus, stress regulatory systems controlling these stimulons were implicated as integral parts of the developmental program. This correlation suggested that thermal shock and salt shock stress response regulatory systems either allow the cell to adapt to stresses associated with development or directly control the developmental program. PMID:10940043

  1. Inferring phylogenies with incomplete data sets: a 5-gene, 567-taxon analysis of angiosperms.

    PubMed

    Burleigh, J Gordon; Hilu, Khidir W; Soltis, Douglas E

    2009-03-17

    Phylogenetic analyses of angiosperm relationships have used only a small percentage of available sequence data, but phylogenetic data matrices often can be augmented with existing data, especially if one allows missing characters. We explore the effects on phylogenetic analyses of adding 378 matK sequences and 240 26S rDNA sequences to the complete 3-gene, 567-taxon angiosperm phylogenetic matrix of Soltis et al. We performed maximum likelihood bootstrap analyses of the complete, 3-gene 567-taxon data matrix and the incomplete, 5-gene 567-taxon data matrix. Although the 5-gene matrix has more missing data (27.5%) than the 3-gene data matrix (2.9%), the 5-gene analysis resulted in higher levels of bootstrap support. Within the 567-taxon tree, the increase in support is most evident for relationships among the 170 taxa for which both matK and 26S rDNA sequences were added, and there is little gain in support for relationships among the 119 taxa having neither matK nor 26S rDNA sequences. The 5-gene analysis also places the enigmatic Hydrostachys in Lamiales (BS = 97%) rather than in Cornales (BS = 100% in 3-gene analysis). The placement of Hydrostachys in Lamiales is unprecedented in molecular analyses, but it is consistent with embryological and morphological data. Adding available, and often incomplete, sets of sequences to existing data sets can be a fast and inexpensive way to increase support for phylogenetic relationships and produce novel and credible new phylogenetic hypotheses.

  2. Use of the gamma method for self-contained gene-set analysis of SNP data

    PubMed Central

    Biernacka, Joanna M; Jenkins, Gregory D; Wang, Liewei; Moyer, Ann M; Fridley, Brooke L

    2012-01-01

    Gene-set analysis (GSA) evaluates the overall evidence of association between a phenotype and all genotyped single nucleotide polymorphisms (SNPs) in a set of genes, as opposed to testing for association between a phenotype and each SNP individually. We propose using the Gamma Method (GM) to combine gene-level P-values for assessing the significance of GS association. We performed simulations to compare the GM with several other self-contained GSA strategies, including both one-step and two-step GSA approaches, in a variety of scenarios. We denote a ‘one-step' GSA approach to be one in which all SNPs in a GS are used to derive a test of GS association without consideration of gene-level effects, and a ‘two-step' approach to be one in which all genotyped SNPs in a gene are first used to evaluate association of the phenotype with all measured variation in the gene and then the gene-level tests of association are aggregated to assess the GS association with the phenotype. The simulations suggest that, overall, two-step methods provide higher power than one-step approaches and that combining gene-level P-values using the GM with a soft truncation threshold between 0.05 and 0.20 is a powerful approach for conducting GSA, relative to the competing approaches assessed. We also applied all of the considered GSA methods to data from a pharmacogenomic study of cisplatin, and obtained evidence suggesting that the glutathione metabolism GS is associated with cisplatin drug response. PMID:22166939
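
    As a rough illustration of combining gene-level P-values into a gene-set P-value, the sketch below implements only the shape-1 special case of a gamma-based combination, which coincides with Fisher's method and needs nothing beyond the Python standard library. It assumes independent gene-level P-values. The function name `fisher_combine` is hypothetical and is not taken from the record above. Other shape parameters, such as those corresponding to the soft truncation thresholds mentioned in the abstract, would require a gamma quantile function (for example `scipy.stats.gamma.ppf`) and are not covered here.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    This equals a gamma-based combination with shape parameter 1:
    each -ln(p) is Exponential(1) under the null, so the sum of n
    such terms follows a Gamma(shape=n, scale=1) distribution.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    n = len(p_values)
    t = -sum(math.log(p) for p in p_values)
    # Survival function of Gamma(n, 1) for integer n (Erlang):
    # P(T > t) = exp(-t) * sum_{i=0}^{n-1} t^i / i!
    term, total = 1.0, 1.0
    for i in range(1, n):
        term *= t / i
        total += term
    return math.exp(-t) * total


if __name__ == "__main__":
    # Hypothetical gene-level p-values for one gene set
    print(fisher_combine([0.05, 0.05]))
```

    A single P-value is returned unchanged, which is a quick sanity check on the combination rule.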

  3. RAMONA: a Web application for gene set analysis on multilevel omics data.

    PubMed

    Sass, Steffen; Buettner, Florian; Mueller, Nikola S; Theis, Fabian J

    2015-01-01

    Decreasing costs of modern high-throughput experiments allow for the simultaneous analysis of altered gene activity on various molecular levels. However, these multi-omics approaches lead to a large amount of data, which is hard to interpret for a non-bioinformatician. Here, we present the remotely accessible multilevel ontology analysis (RAMONA). It offers an easy-to-use interface for the simultaneous gene set analysis of combined omics datasets and is an extension of the previously introduced MONA approach. RAMONA is based on a Bayesian enrichment method for the inference of overrepresented biological processes among given gene sets. Overrepresentation is quantified by interpretable term probabilities. It is able to handle data from various molecular levels, while in parallel coping with redundancies arising from gene set overlaps and related multiple testing problems. The comprehensive output of RAMONA is easy to interpret and thus allows for functional insight into the affected biological processes. With RAMONA, we provide an efficient implementation of the Bayesian inference problem such that ontologies consisting of thousands of terms can be processed in the order of seconds. RAMONA is implemented as an ASP.NET Web application and is publicly available at http://icb.helmholtz-muenchen.de/ramona. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  4. A rough set based rational clustering framework for determining correlated genes.

    PubMed

    Jeyaswamidoss, Jeba Emilyn; Thangaraj, Kesavan; Ramar, Kadarkarai; Chitra, Muthusamy

    2016-06-01

    Cluster analysis plays a foremost role in identifying groups of genes that show similar behavior under a set of experimental conditions. Several clustering algorithms have been proposed for identifying gene behaviors and to understand their significance. The principal aim of this work is to develop an intelligent rough clustering technique, which will efficiently remove the irrelevant dimensions in a high-dimensional space and obtain appropriate meaningful clusters. This paper proposes a novel biclustering technique that is based on rough set theory. The proposed algorithm uses correlation coefficient as a similarity measure to simultaneously cluster both the rows and columns of a gene expression data matrix and mean squared residue to generate the initial biclusters. Furthermore, the biclusters are refined to form the lower and upper boundaries by determining the membership of the genes in the clusters using mean squared residue. The algorithm is illustrated with yeast gene expression data and the experiment proves the effectiveness of the method. The main advantage is that it overcomes the problem of selection of initial clusters and also the restriction of one object belonging to only one cluster by allowing overlapping of biclusters.
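
    The mean squared residue mentioned above is commonly defined, following Cheng and Church, as the average squared deviation of each entry from an additive row-plus-column model of the bicluster. The sketch below computes that standard quantity for a small submatrix. How the rough-set framework in this record actually uses the score is not specified in the abstract, so the function name and the example data are purely illustrative assumptions.

```python
def mean_squared_residue(block):
    """Mean squared residue of a bicluster given as a list of equal-length rows.

    residue(i, j) = a_ij - row_mean_i - col_mean_j + overall_mean
    A perfectly additive (coherent) bicluster scores 0.
    """
    n_rows, n_cols = len(block), len(block[0])
    row_means = [sum(row) / n_cols for row in block]
    col_means = [sum(row[j] for row in block) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = block[i][j] - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


if __name__ == "__main__":
    coherent = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]  # rows differ by a constant shift
    print(mean_squared_residue(coherent))
```

    Lower values indicate a more coherent bicluster, which is why the score is often used as a threshold when seeding or refining candidate biclusters.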

  5. dslice: an R package for nonparametric testing of associations with application in QTL and gene set analysis.

    PubMed

    Ye, Chao; Jiang, Bo; Zhang, Xuegong; Liu, Jun S

    2015-06-01

    Many statistical problems in bioinformatics and genetics can be formulated as the testing of associations between a categorical variable and a continuous variable. A dynamic slicing method was proposed for non-parametric dependence testing, which has been demonstrated to have higher power than traditional methods such as the Kolmogorov-Smirnov test. We introduce an R package dslice to facilitate the use of the dynamic slicing method in bioinformatic applications such as quantitative trait loci studies and gene set enrichment analysis. dslice is implemented in Rcpp and available in the Comprehensive R Archive Network. The package is distributed under the GNU General Public License (version 2 or later). © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
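
    For context on the baseline named above, the following sketch computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical distribution functions, for a continuous variable split by a two-level categorical variable. It is not the dynamic slicing method and does not use the dslice package. The function name and the example values are assumptions made for illustration only.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_x(v) - F_y(v)|.

    The maximum gap between two empirical CDFs is always attained at
    an observed value, so only the pooled observations are checked.
    """
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # empirical CDF of x at v
        fy = bisect_right(ys, v) / len(ys)  # empirical CDF of y at v
        d = max(d, abs(fx - fy))
    return d


if __name__ == "__main__":
    # Hypothetical expression values for genes inside and outside a gene set
    in_set = [2.1, 2.5, 3.0, 3.4]
    out_of_set = [0.2, 0.9, 1.1, 2.2, 2.4]
    print(ks_statistic(in_set, out_of_set))
```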

  6. Genome-Wide Identification, Phylogenetic and Co-Expression Analysis of OsSET Gene Family in Rice

    PubMed Central

    Lu, Zhanhua; Huang, Xiaolong; Ouyang, Yidan; Yao, Jialing

    2013-01-01

    Background: The SET domain is responsible for the catalytic activity of histone lysine methyltransferases (HKMTs) during developmental processes. Histone lysine methylation plays a crucial and diverse regulatory function in chromatin organization and genome function. Although several SET genes have been identified and characterized in plants, the understanding of the OsSET gene family in rice is still very limited. Methodology/Principal Findings: In this study, a systematic analysis was performed and revealed the presence of at least 43 SET genes in the rice genome. Phylogenetic and structural analysis grouped SET proteins into five classes and suggested that the domains outside the SET domain are important for the specificity of histone lysine methylation, as well as for the recognition of methylated histone lysine. Based on global microarray data, gene expression profiles revealed that the transcripts of OsSET genes accumulated differentially during vegetative and reproductive developmental stages and were preferentially up- or down-regulated in different tissues. Cis-element identification, co-expression analysis and GO analysis of expression correlation of 12 OsSET genes suggested that OsSET genes might be involved in cell cycle regulation and feedback. Conclusions/Significance: This study will facilitate further studies on the OsSET family and provide useful clues for functional validation of OsSETs. PMID:23762371

  7. Isolation and characterization of a diverse set of genes from carrot somatic embryos.

    PubMed Central

    Lin, X; Hwang, G J; Zimmerman, J L

    1996-01-01

    The early events in plant embryogenesis are critical for pattern formation, since it is during this process that the primary apical meristems and the embryo polarity axis are established. However, little is known about the molecular events that are unique to the early stages of embryogenesis. This study of gene expression during plant embryogenesis is focused on identifying molecular markers from carrot (Daucus carota) somatic embryos and characterizing the expression and regulation of these genes through embryo development. A cDNA library, prepared from polysomal mRNA of globular embryos, was screened using a subtracted probe; 49 clones were isolated and preliminarily characterized. Sequence analysis revealed a large set of genes, including many new genes, that are expressed in a variety of patterns during embryogenesis and may be regulated by different molecular mechanisms. To our knowledge, this group of clones represents the largest collection of embryo-enhanced genes isolated thus far, and demonstrates the utility of the subtracted-probe approach to the somatic embryo system. It is anticipated that many of these genes may serve as useful molecular markers for early embryo development. PMID:8938424

  8. The complete set of predicted genes from Saccharomyces cerevisiae in a readily usable form.

    PubMed

    Hudson, J R; Dawson, E P; Rushing, K L; Jackson, C H; Lockshon, D; Conover, D; Lanciault, C; Harris, J R; Simmons, S J; Rothstein, R; Fields, S

    1997-12-01

    Nearly all of the open reading frames (ORFs) of the yeast Saccharomyces cerevisiae have been synthesized by PCR using a set of approximately 6000 primer pairs. Each of the forward primers has a common 22-base sequence at its 5' end, and each of the back primers has a common 20-base sequence at its 5' end. These common termini allow reamplification of the entire set of original PCR products using a single pair of longer primers (in our case, 70 bases). The resulting 70-base elements that flank each ORF can be used for rapid and efficient cloning into a linearized yeast vector that contains these same elements at its termini. This cloning by genetic recombination obviates the need for ligations or bacterial manipulations and should permit convenient global approaches to gene function that require the assay of each putative yeast gene.

  9. Construction of a Bacterial Cell that Contains Only the Set of Essential Genes Necessary to Impart Life

    DTIC Science & Technology

    2013-08-16

    gene and gene cluster deletions. To date, we have removed approximately 234 kb from the Mycoplasma mycoides JCVI-syn1.0 genome. The resultant 844 kb...categories to make steady progress with gene and gene cluster deletions. To date, we have removed approximately 234 kb from the Mycoplasma mycoides JCVI...only the set of genes that are essential for life under ideal laboratory conditions. We are working to minimize Mycoplasma mycoides JCVI-syn1.0

  10. Detection of RTX toxin genes in gram-negative bacteria with a set of specific probes.

    PubMed Central

    Kuhnert, P; Heyberger-Meyer, B; Burnens, A P; Nicolet, J; Frey, J

    1997-01-01

    The family of RTX (RTX representing repeats in the structural toxin) toxins is composed of several protein toxins with a characteristic nonapeptide glycine-rich repeat motif. Most of its members were shown to have cytolytic activity. By comparing the genetic relationships of the RTX toxin genes we established a set of 10 gene probes to be used for screening as-yet-unknown RTX toxin genes in bacterial species. The probes include parts of apxIA, apxIIA, and apxIIIA from Actinobacillus pleuropneumoniae, cyaA from Bordetella pertussis, frpA from Neisseria meningitidis, prtC from Erwinia chrysanthemi, hlyA and elyA from Escherichia coli, aaltA from Actinobacillus actinomycetemcomitans and lktA from Pasteurella haemolytica. A panel of pathogenic and nonpathogenic gram-negative bacteria was investigated for the presence of RTX toxin genes. The probes detected all known genes for RTX toxins. Moreover, we found potential RTX toxin genes in several pathogenic bacterial species for which no such toxins are known yet. This indicates that RTX or RTX-like toxins are widely distributed among pathogenic gram-negative bacteria. The probes generated by PCR and the hybridization method were optimized to allow broad-range screening for RTX toxin genes in one step. This included the binding of unlabelled probes to a nylon filter and subsequent hybridization of the filter with labelled genomic DNA of the strain to be tested. The method constitutes a powerful tool for the assessment of the potential pathogenicity of poorly characterized strains intended to be used in biotechnological applications. Moreover, it is useful for the detection of already-known or new RTX toxin genes in bacteria of medical importance. PMID:9172345

  11. Identification of a core set of rhizobial infection genes using data from single cell-types

    PubMed Central

    Chen, Da-Song; Liu, Cheng-Wu; Roy, Sonali; Cousins, Donna; Stacey, Nicola; Murray, Jeremy D.

    2015-01-01

    Genome-wide expression studies on nodulation have varied in their scale from entire root systems to dissected nodules or root sections containing nodule primordia (NP). More recently efforts have focused on developing methods for isolation of root hairs from infected plants and the application of laser-capture microdissection technology to nodules. Here we analyze two published data sets to identify a core set of infection genes that are expressed in the nodule and in root hairs during infection. Among the genes identified were those encoding phenylpropanoid biosynthesis enzymes including Chalcone-O-Methyltransferase which is required for the production of the potent Nod gene inducer 4′,4-dihydroxy-2-methoxychalcone. A promoter-GUS analysis in transgenic hairy roots for two genes encoding Chalcone-O-Methyltransferase isoforms revealed their expression in rhizobially infected root hairs and the nodule infection zone but not in the nitrogen fixation zone. We also describe a group of Rhizobially Induced Peroxidases whose expression overlaps with the production of superoxide in rhizobially infected root hairs and in nodules and roots. Finally, we identify a cohort of co-regulated transcription factors as candidate regulators of these processes. PMID:26284091

  12. Robust extraction of functional signals from gene set analysis using a generalized threshold free scoring function

    PubMed Central

    2009-01-01

    Background: A central task in contemporary biosciences is the identification of biological processes showing response in genome-wide differential gene expression experiments. Two types of analysis are common. Either one generates an ordered list based on the differential expression values of the probed genes and examines the tail areas of the list for over-representation of various functional classes, or one monitors the average differential expression level of genes belonging to a given functional class. So far these two types of method have not been combined. Results: We introduce a scoring function, Gene Set Z-score (GSZ), for the analysis of functional class over-representation that combines two previous analysis methods. GSZ encompasses popular functions such as correlation, hypergeometric test, Max-Mean and Random Sets as limiting cases. GSZ is stable against changes in class size as well as across different positions of the analysed gene list in tests with randomized data. GSZ shows the best overall performance in a detailed comparison to popular functions using artificial data. Likewise, GSZ stands out in a cross-validation of methods using split real data. A comparison of empirical p-values further shows a strong difference in favour of GSZ, which clearly reports better p-values for top classes than the other methods. Furthermore, GSZ detects relevant biological themes that are missed by the other methods. These observations also hold when comparing GSZ with popular program packages. Conclusion: GSZ and improved versions of earlier methods are a useful contribution to the analysis of differential gene expression. The methods and supplementary material are available from the website http://ekhidna.biocenter.helsinki.fi/users/petri/public/GSZ/GSZscore.html. PMID:19775443
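
    The hypergeometric test listed above as one limiting case can be sketched in a few lines. The example computes the upper-tail probability of seeing at least a given overlap between a functional class and the top of a ranked gene list. It is not the GSZ score itself, and the function and parameter names are hypothetical choices for illustration.

```python
from math import comb


def hypergeom_enrichment_p(universe, class_size, list_size, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    universe   : number of genes measured in the experiment
    class_size : number of those genes belonging to the functional class
    list_size  : number of genes selected from the top of the ranked list
    overlap    : number of selected genes that belong to the class
    """
    upper = min(class_size, list_size)
    total = comb(universe, list_size)
    # Sum the probabilities of every overlap at least as large as observed
    tail = sum(
        comb(class_size, k) * comb(universe - class_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 10 genes measured, 5 in the class,
    # 5 selected, and all 5 selected genes fall in the class
    print(hypergeom_enrichment_p(10, 5, 5, 5))
```

    Because this test depends on where the ranked list is cut, its result can change with the chosen threshold, which is the sensitivity that threshold-free scores aim to avoid.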

  13. Analyzing Large Gene Expression and Methylation Data Profiles Using StatBicRM: Statistical Biclustering-Based Rule Mining

    PubMed Central

    Maulik, Ujjwal; Mallik, Saurav; Mukhopadhyay, Anirban; Bandyopadhyay, Sanghamitra

    2015-01-01

    Microarray and beadchip are two of the most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining), to identify special types of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy is used to eliminate the insignificant/low-significant/redundant genes in such a way that the significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. The corresponding special types of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time and can work on large datasets. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using the DAVID database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to determine how accurately the evolved rules describe the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers also starts with the same post-discretized data

  14. Analyzing large gene expression and methylation data profiles using StatBicRM: statistical biclustering-based rule mining.

    PubMed

    Maulik, Ujjwal; Mallik, Saurav; Mukhopadhyay, Anirban; Bandyopadhyay, Sanghamitra

    2015-01-01

    Microarray and beadchip are two of the most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining), to identify special types of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy is used to eliminate the insignificant/low-significant/redundant genes in such a way that the significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. The corresponding special types of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time and can work on large datasets. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using the DAVID database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to determine how accurately the evolved rules describe the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers also starts with the same post-discretized data

  15. Meta-Analysis of Tumor Stem-Like Breast Cancer Cells Using Gene Set and Network Analysis.

    PubMed

    Lee, Won Jun; Kim, Sang Cheol; Yoon, Jung-Ho; Yoon, Sang Jun; Lim, Johan; Kim, You-Sun; Kwon, Sung Won; Park, Jeong Hill

    2016-01-01

    Generally, cancer stem cells have epithelial-to-mesenchymal-transition characteristics and other aggressive properties that cause metastasis. However, there have been no reliable markers for the identification of cancer stem cells, and comparative methods examining adherent and sphere cells are widely used to investigate the mechanisms underlying cancer stem cells, because sphere cells are known to maintain cancer stem cell characteristics. In this study, we conducted a meta-analysis that combined gene expression profiles from several studies that utilized tumorsphere technology to investigate tumor stem-like breast cancer cells. We used our own gene expression profiles along with three different gene expression profiles from the Gene Expression Omnibus, which we combined using the ComBat method, and obtained significant gene sets using gene set analysis of our datasets and the combined dataset. This experiment focused on four gene sets, such as cytokine-cytokine receptor interaction, that demonstrated significance in both datasets. Our observations demonstrated that among the genes of the four significant gene sets, six genes were consistently up-regulated and satisfied the p-value of < 0.05, and our network analysis showed high connectivity in five genes. From these results, we established CXCR4, CXCL1 and HMGCS1, the intersecting genes of the datasets with high connectivity and p-value of < 0.05, as significant genes in the identification of cancer stem cells. An additional experiment using quantitative reverse transcription-polymerase chain reaction showed significant up-regulation in MCF-7 derived sphere cells and confirmed the importance of these three genes. Taken together, using meta-analysis that combines gene set and network analysis, we suggested CXCR4, CXCL1 and HMGCS1 as candidates involved in tumor stem-like breast cancer cells. Unlike other meta-analyses, by using gene set analysis, we selected possible markers which can explain the biological

  16. Expression map of a complete set of gustatory receptor genes in chemosensory organs of Bombyx mori.

    PubMed

    Guo, Huizhen; Cheng, Tingcai; Chen, Zhiwei; Jiang, Liang; Guo, Youbing; Liu, Jianqiu; Li, Shenglong; Taniai, Kiyoko; Asaoka, Kiyoshi; Kadono-Okuda, Keiko; Arunkumar, Kallare P; Wu, Jiaqi; Kishino, Hirohisa; Zhang, Huijie; Seth, Rakesh K; Gopinathan, Karumathil P; Montagné, Nicolas; Jacquin-Joly, Emmanuelle; Goldsmith, Marian R; Xia, Qingyou; Mita, Kazuei

    2017-03-01

    Most lepidopteran species are herbivores, and interaction with host plants affects their gene expression and behavior as well as their genome evolution. Gustatory receptors (Grs) are expected to mediate host plant selection, feeding, oviposition and courtship behavior. However, due to their high diversity, sequence divergence and extremely low level of expression it has been difficult to identify precisely a complete set of Grs in Lepidoptera. By manual annotation and BAC sequencing, we improved annotation of 43 gene sequences compared with previously reported Grs in the most studied lepidopteran model, the silkworm, Bombyx mori, and identified 7 new tandem copies of BmGr30 on chromosome 7, bringing the total number of BmGrs to 76. Among these, we mapped 68 genes to chromosomes in a newly constructed chromosome distribution map and 8 genes to scaffolds; we also found new evidence for large clusters of BmGrs, especially from the bitter receptor family. RNA-seq analysis of diverse BmGr expression patterns in chemosensory organs of larvae and adults enabled us to draw a precise organ specific map of BmGr expression. Interestingly, most of the clustered genes were expressed in the same tissues and more than half of the genes were expressed in larval maxillae, larval thoracic legs and adult legs. For example, BmGr63 showed high expression levels in all organs in both larval and adult stages. By contrast, some genes showed expression limited to specific developmental stages or organs and tissues. BmGr19 was highly expressed in larval chemosensory organs (especially antennae and thoracic legs), the single exon genes BmGr53 and BmGr67 were expressed exclusively in larval tissues, the BmGr27-BmGr31 gene cluster on chr7 displayed a high expression level limited to adult legs and the candidate CO2 receptor BmGr2 was highly expressed in adult antennae, where few other Grs were expressed. Transcriptional analysis of the Grs in B. mori provides a valuable new reference for

  17. Defining the optimal animal model for translational research using gene set enrichment analysis.

    PubMed

    Weidner, Christopher; Steinfath, Matthias; Opitz, Elisa; Oelgeschläger, Michael; Schönfelder, Gilbert

    2016-08-01

    The mouse is the main model organism used to study the functions of human genes because most biological processes in the mouse are highly conserved in humans. Recent reports that compared identical transcriptomic datasets of human inflammatory diseases with datasets from mouse models using traditional gene-to-gene comparison techniques resulted in contradictory conclusions regarding the relevance of animal models for translational research. To reduce susceptibility to biased interpretation, all genes of interest for the biological question under investigation should be considered. Thus, standardized approaches for systematic data analysis are needed. We analyzed the same datasets using gene set enrichment analysis focusing on pathways assigned to inflammatory processes in either humans or mice. The analyses revealed a moderate overlap between all human and mouse datasets, with average positive and negative predictive values of 48 and 57% significant correlations. Subgroups of the septic mouse models (i.e., Staphylococcus aureus injection) correlated very well with most human studies. These findings support the applicability of targeted strategies to identify the optimal animal model and protocol to improve the success of translational research. © 2016 The Authors. Published under the terms of the CC BY 4.0 license.

  18. Identification of the Core Set of Carbon-Associated Genes in a Bioenergy Grassland Soil

    PubMed Central

    Howe, Adina; Yang, Fan; Williams, Ryan J.; Meyer, Folker; Hofmockel, Kirsten S.

    2016-01-01

    Despite the central role of soil microbial communities in global carbon (C) cycling, little is known about soil microbial community structure and even less about their metabolic pathways. Efforts to characterize soil communities often focus on identifying differences in gene content across environmental gradients, but an alternative question is what genes are similar in soils. These genes may indicate critical species or potential functions that are required in all soils. Here we identified the “core” set of C cycling sequences widely present in multiple soil metagenomes from a fertilized prairie (FP). Of 226,887 sequences associated with known enzymes involved in the synthesis, metabolism, and transport of carbohydrates, 843 were identified to be consistently prevalent across four replicate soil metagenomes. This core metagenome was functionally and taxonomically diverse, representing five enzyme classes and 99 enzyme families within the CAZy database. Though it only comprised 0.4% of all CAZy-associated genes identified in FP metagenomes, the core was found to be comprised of functions similar to those within cumulative soils. The FP CAZy-associated core sequences were present in multiple publicly available soil metagenomes and most similar to soils sharing geographic proximity. In soil ecosystems, where high diversity remains a key challenge for metagenomic investigations, these core genes represent a subset of critical functions necessary for carbohydrate metabolism, which can be targeted to evaluate important C fluxes in these and other similar soils. PMID:27855202

  19. Powerful Set-Based Gene-Environment Interaction Testing Framework for Complex Diseases.

    PubMed

    Jiao, Shuo; Peters, Ulrike; Berndt, Sonja; Bézieau, Stéphane; Brenner, Hermann; Campbell, Peter T; Chan, Andrew T; Chang-Claude, Jenny; Lemire, Mathieu; Newcomb, Polly A; Potter, John D; Slattery, Martha L; Woods, Michael O; Hsu, Li

    2015-12-01

    Identification of gene-environment interaction (G × E) is important in understanding the etiology of complex diseases. Based on our previously developed Set Based gene EnviRonment InterAction test (SBERIA), in this paper we propose a powerful framework for enhanced set-based G × E testing (eSBERIA). The major challenge of signal aggregation within a set is how to tell signals from noise. eSBERIA tackles this challenge by adaptively aggregating the interaction signals within a set weighted by the strength of the marginal and correlation screening signals. eSBERIA then combines the screening-informed aggregate test with a variance component test to account for the residual signals. Additionally, we develop a case-only extension for eSBERIA (coSBERIA) and an existing set-based method, which boosts the power not only by exploiting the G-E independence assumption but also by avoiding the need to specify main effects for a large number of variants in the set. Through extensive simulation, we show that coSBERIA and eSBERIA are considerably more powerful than existing methods within the case-only and the case-control method categories across a wide range of scenarios. We conduct a genome-wide G × E search by applying our methods to Illumina HumanExome Beadchip data of 10,446 colorectal cancer cases and 10,191 controls and identify two novel interactions between nonsteroidal anti-inflammatory drugs (NSAIDs) and MINK1 and PTCHD3. © 2015 WILEY PERIODICALS, INC.

  20. Transcriptomic Analysis Identifies Candidate Genes and Gene Sets Controlling the Response of Porcine Peripheral Blood Mononuclear Cells to Poly I:C Stimulation

    PubMed Central

    Wang, Jiying; Wang, Yanping; Wang, Huaizhong; Wang, Haifei; Liu, Jian-Feng; Wu, Ying; Guo, Jianfeng

    2016-01-01

    Polyinosinic-polycytidylic acid (poly I:C), a synthetic dsRNA analog, has been demonstrated to have stimulatory effects similar to viral dsRNA. To gain deep knowledge of the host transcriptional response of pigs to poly I:C stimulation, in the present study, we cultured and stimulated peripheral blood mononuclear cells (PBMC) of piglets of one Chinese indigenous breed (Dapulian) and one modern commercial breed (Landrace) with poly I:C, and compared their transcriptional profiling using RNA-sequencing (RNA-seq). Our results indicated that poly I:C stimulation can elicit significantly differentially expressed (DE) genes in Dapulian (g = 290) as well as Landrace (g = 85). We also performed gene set analysis using the Gene Set Enrichment Analysis (GSEA) package, and identified some significantly enriched gene sets in Dapulian (g = 18) and Landrace (g = 21). Most of the shared DE genes and gene sets were immune-related, and may play crucial roles in the immune response to poly I:C stimulation. In addition, we detected large sets of significantly DE genes and enriched gene sets when comparing the gene expression profile between the two breeds, including control and poly I:C stimulation groups. Besides immune-related functions, some of the DE genes and gene sets between the two breeds were involved in development and growth of various tissues, which may be correlated with the different characteristics of the two breeds. The DE genes and gene sets detected herein provide crucial information towards understanding the immune regulation of antiviral responses, and the molecular mechanisms of different genetic resistance to viral infection, in modern and indigenous pigs. PMID:26935416

  1. Transcriptomic Analysis Identifies Candidate Genes and Gene Sets Controlling the Response of Porcine Peripheral Blood Mononuclear Cells to Poly I:C Stimulation.

    PubMed

    Wang, Jiying; Wang, Yanping; Wang, Huaizhong; Wang, Haifei; Liu, Jian-Feng; Wu, Ying; Guo, Jianfeng

    2016-05-03

    Polyinosinic-polycytidylic acid (poly I:C), a synthetic dsRNA analog, has been demonstrated to have stimulatory effects similar to viral dsRNA. To gain deep knowledge of the host transcriptional response of pigs to poly I:C stimulation, in the present study, we cultured and stimulated peripheral blood mononuclear cells (PBMC) of piglets of one Chinese indigenous breed (Dapulian) and one modern commercial breed (Landrace) with poly I:C, and compared their transcriptional profiling using RNA-sequencing (RNA-seq). Our results indicated that poly I:C stimulation can elicit significantly differentially expressed (DE) genes in Dapulian (g = 290) as well as Landrace (g = 85). We also performed gene set analysis using the Gene Set Enrichment Analysis (GSEA) package, and identified some significantly enriched gene sets in Dapulian (g = 18) and Landrace (g = 21). Most of the shared DE genes and gene sets were immune-related, and may play crucial roles in the immune response to poly I:C stimulation. In addition, we detected large sets of significantly DE genes and enriched gene sets when comparing the gene expression profile between the two breeds, including control and poly I:C stimulation groups. Besides immune-related functions, some of the DE genes and gene sets between the two breeds were involved in development and growth of various tissues, which may be correlated with the different characteristics of the two breeds. The DE genes and gene sets detected herein provide crucial information towards understanding the immune regulation of antiviral responses, and the molecular mechanisms of different genetic resistance to viral infection, in modern and indigenous pigs.

  2. Gene-set meta-analysis of lung cancer identifies pathway related to systemic lupus erythematosus.

    PubMed

    Rosenberger, Albert; Sohns, Melanie; Friedrichs, Stefanie; Hung, Rayjean J; Fehringer, Gord; McLaughlin, John; Amos, Christopher I; Brennan, Paul; Risch, Angela; Brüske, Irene; Caporaso, Neil E; Landi, Maria Teresa; Christiani, David C; Wei, Yongyue; Bickeböller, Heike

    2017-01-01

    Gene-set analysis (GSA) is an approach using the results of single-marker genome-wide association studies when investigating pathways as a whole with respect to the genetic basis of a disease. We performed a meta-analysis of seven GSAs for lung cancer, applying the method META-GSA. Overall, the information taken from 11,365 cases and 22,505 controls from within the TRICL/ILCCO consortia was used to investigate a total of 234 pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. META-GSA reveals the systemic lupus erythematosus KEGG pathway hsa05322, driven by the gene region 6p21-22, as also implicated in lung cancer (p = 0.0306). This gene region is known to be associated with squamous cell lung carcinoma. The most important genes driving the significance of this pathway belong to the genomic areas HIST1-H4L, -1BN, -2BN, -H2AK, -H4K and C2/C4A/C4B. Within these areas, the markers most significantly associated with LC are rs13194781 (located within HIST12BN) and rs1270942 (located between C2 and C4A). We have discovered a pathway currently marked as specific to systemic lupus erythematosus as being significantly implicated in lung cancer. The gene region 6p21-22 in this pathway appears to be more extensively associated with lung cancer than previously assumed. Given wide-stretched linkage disequilibrium to the area APOM/BAG6/MSH5, there is currently simply not enough information or evidence to conclude whether the potential pleiotropy of lung cancer and systemic lupus erythematosus is spurious, biological, or mediated. Further research into this pathway and gene region will be necessary.

  3. Gene-set meta-analysis of lung cancer identifies pathway related to systemic lupus erythematosus

    PubMed Central

    Sohns, Melanie; Friedrichs, Stefanie; Hung, Rayjean J.; Fehringer, Gord; McLaughlin, John; Amos, Christopher I.; Brennan, Paul; Risch, Angela; Brüske, Irene; Caporaso, Neil E.; Landi, Maria Teresa; Christiani, David C.; Wei, Yongyue; Bickeböller, Heike

    2017-01-01

    Introduction: Gene-set analysis (GSA) is an approach using the results of single-marker genome-wide association studies when investigating pathways as a whole with respect to the genetic basis of a disease. Methods: We performed a meta-analysis of seven GSAs for lung cancer, applying the method META-GSA. Overall, the information taken from 11,365 cases and 22,505 controls from within the TRICL/ILCCO consortia was used to investigate a total of 234 pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Results: META-GSA reveals the systemic lupus erythematosus KEGG pathway hsa05322, driven by the gene region 6p21-22, as also implicated in lung cancer (p = 0.0306). This gene region is known to be associated with squamous cell lung carcinoma. The most important genes driving the significance of this pathway belong to the genomic areas HIST1-H4L, -1BN, -2BN, -H2AK, -H4K and C2/C4A/C4B. Within these areas, the markers most significantly associated with LC are rs13194781 (located within HIST12BN) and rs1270942 (located between C2 and C4A). Conclusions: We have discovered a pathway currently marked as specific to systemic lupus erythematosus as being significantly implicated in lung cancer. The gene region 6p21-22 in this pathway appears to be more extensively associated with lung cancer than previously assumed. Given wide-stretched linkage disequilibrium to the area APOM/BAG6/MSH5, there is currently simply not enough information or evidence to conclude whether the potential pleiotropy of lung cancer and systemic lupus erythematosus is spurious, biological, or mediated. Further research into this pathway and gene region will be necessary. PMID:28273134

  4. gsSKAT: Rapid gene set analysis and multiple testing correction for rare-variant association studies using weighted linear kernels.

    PubMed

    Larson, Nicholas B; McDonnell, Shannon; Cannon Albright, Lisa; Teerlink, Craig; Stanford, Janet; Ostrander, Elaine A; Isaacs, William B; Xu, Jianfeng; Cooney, Kathleen A; Lange, Ethan; Schleutker, Johanna; Carpten, John D; Powell, Isaac; Bailey-Wilson, Joan E; Cussenot, Olivier; Cancel-Tassin, Geraldine; Giles, Graham G; MacInnis, Robert J; Maier, Christiane; Whittemore, Alice S; Hsieh, Chih-Lin; Wiklund, Fredrik; Catolona, William J; Foulkes, William; Mandal, Diptasri; Eeles, Rosalind; Kote-Jarai, Zsofia; Ackerman, Michael J; Olson, Timothy M; Klein, Christopher J; Thibodeau, Stephen N; Schaid, Daniel J

    2017-02-16

    Next-generation sequencing technologies have afforded unprecedented characterization of low-frequency and rare genetic variation. Due to low power for single-variant testing, aggregative methods are commonly used to combine observed rare variation within a single gene. Causal variation may also aggregate across multiple genes within relevant biomolecular pathways. Kernel-machine regression and adaptive testing methods for aggregative rare-variant association testing have been demonstrated to be powerful approaches for pathway-level analysis, although these methods tend to be computationally intensive at high-variant dimensionality and require access to complete data. An additional analytical issue in scans of large pathway definition sets is multiple testing correction. Gene set definitions may exhibit substantial genic overlap, and the impact of the resultant correlation in test statistics on Type I error rate control for large agnostic gene set scans has not been fully explored. Herein, we first outline a statistical strategy for aggregative rare-variant analysis using component gene-level linear kernel score test summary statistics as well as derive simple estimators of the effective number of tests for family-wise error rate control. We then conduct extensive simulation studies to characterize the behavior of our approach relative to direct application of kernel and adaptive methods under a variety of conditions. We also apply our method to two case-control studies, respectively, evaluating rare variation in hereditary prostate cancer and schizophrenia. Finally, we provide open-source R code for public use to facilitate easy application of our methods to existing rare-variant analysis results.

  5. Genome-Wide Temporal Expression Profiling in Caenorhabditis elegans Identifies a Core Gene Set Related to Long-Term Memory.

    PubMed

    Freytag, Virginie; Probst, Sabine; Hadziselimovic, Nils; Boglari, Csaba; Hauser, Yannick; Peter, Fabian; Gabor Fenyves, Bank; Milnik, Annette; Demougin, Philippe; Vukojevic, Vanja; de Quervain, Dominique J-F; Papassotiropoulos, Andreas; Stetak, Attila

    2017-07-12

    The identification of genes related to encoding, storage, and retrieval of memories is a major interest in neuroscience. In the current study, we analyzed the temporal gene expression changes in a neuronal mRNA pool during an olfactory long-term associative memory (LTAM) in Caenorhabditis elegans hermaphrodites. Here, we identified a core set of 712 (538 upregulated and 174 downregulated) genes that follows three distinct temporal peaks demonstrating multiple gene regulation waves in LTAM. Compared with the previously published positive LTAM gene set (Lakhina et al., 2015), 50% of the identified upregulated genes here overlap with the previous dataset, possibly representing stimulus-independent memory-related genes. On the other hand, the remaining genes were not previously identified in positive associative memory and may specifically regulate aversive LTAM. Our results suggest a multistep gene activation process during the formation and retrieval of long-term memory and define general memory-implicated genes as well as conditioning-type-dependent gene sets. SIGNIFICANCE STATEMENT: The identification of genes regulating different steps of memory is of major interest in neuroscience. Identification of common memory genes across different learning paradigms and the temporal activation of the genes are poorly studied. Here, we investigated the temporal aspects of Caenorhabditis elegans gene expression changes using aversive olfactory associative long-term memory (LTAM) and identified three major gene activation waves. Like in previous studies, aversive LTAM is also CREB dependent, and CREB activity is necessary immediately after training. Finally, we define a list of memory paradigm-independent core gene sets as well as conditioning-dependent genes. Copyright © 2017 the authors.

  6. Functional classification of genes using semantic distance and fuzzy clustering approach: evaluation with reference sets and overlap analysis.

    PubMed

    Devignes, Marie-Dominique; Benabderrahmane, Sidahmed; Smaïl-Tabbone, Malika; Napoli, Amedeo; Poch, Olivier

    2012-01-01

    Functional classification aims at grouping genes according to their molecular function or the biological process they participate in. Evaluating the validity of such unsupervised gene classification remains a challenge given the variety of distance measures and classification algorithms that can be used. We evaluate here functional classification of genes with the help of reference sets: KEGG (Kyoto Encyclopaedia of Genes and Genomes) pathways and Pfam clans. These sets represent ground truth for any distance based on GO (Gene Ontology) biological process and molecular function annotations respectively. Overlaps between clusters and reference sets are estimated by the F-score method. We test our previously described IntelliGO semantic distance with hierarchical and fuzzy C-means clustering and we compare results with the state-of-the-art DAVID (Database for Annotation Visualisation and Integrated Discovery) functional classification method. Finally, study of best matching clusters to reference sets leads us to propose a set-difference method for discovering missing information.
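
    The abstract above scores the overlap between clusters and reference sets with the F-score. As a minimal sketch of that idea, assuming the standard definition (harmonic mean of precision and recall) rather than the authors' exact implementation, the overlap of one cluster with one reference set could be computed as follows. The function names `f_score` and `best_match` and the gene labels are hypothetical and chosen only for illustration.

```python
def f_score(cluster, reference):
    """F-score of a gene cluster against a reference set.

    Assumes the standard definition: harmonic mean of
    precision (overlap / cluster size) and recall (overlap / reference size).
    """
    cluster, reference = set(cluster), set(reference)
    overlap = len(cluster & reference)
    if overlap == 0:
        return 0.0  # no shared genes, so precision and recall are both zero
    precision = overlap / len(cluster)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def best_match(clusters, reference):
    """Return (cluster_name, score) for the cluster that best matches the reference set."""
    name = max(clusters, key=lambda k: f_score(clusters[k], reference))
    return name, f_score(clusters[name], reference)


if __name__ == "__main__":
    # Hypothetical gene labels, for illustration only.
    reference = {"b", "c", "d", "e"}
    clusters = {"c1": {"a", "b"}, "c2": {"b", "c", "d"}}
    print(best_match(clusters, reference))
```

    Picking the best-matching cluster per reference set, as `best_match` does, is one simple way to compare a clustering against KEGG pathways or Pfam clans; other matching schemes are possible.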

  7. Detecting gene-environment interactions in human birth defects: Study designs and statistical methods.

    PubMed

    Tai, Caroline G; Graff, Rebecca E; Liu, Jinghua; Passarelli, Michael N; Mefford, Joel A; Shaw, Gary M; Hoffmann, Thomas J; Witte, John S

    2015-08-01

    The National Birth Defects Prevention Study (NBDPS) contains a wealth of information on affected and unaffected family triads, and thus provides numerous opportunities to study gene-environment interactions (G×E) in the etiology of birth defect outcomes. Depending on the research objective, several analytic options exist to estimate G×E effects that use varying combinations of individuals drawn from available triads. In this study, we discuss important considerations in the collection of genetic data and environmental exposures. We will also present several population- and family-based approaches that can be applied to data from the NBDPS including case-control, case-only, family-based trio, and maternal versus fetal effects. For each, we describe the data requirements, applicable statistical methods, advantages, and disadvantages. A range of approaches can be used to evaluate potentially important G×E effects in the NBDPS. Investigators should be aware of the limitations inherent to each approach when choosing a study design and interpreting results. © 2015 Wiley Periodicals, Inc.

  8. Feature selection in gene expression data using principal component analysis and rough set theory.

    PubMed

    Mishra, Debahuti; Dash, Rajashree; Rath, Amiya Kumar; Acharya, Milu

    2011-01-01

    In many fields such as data mining, machine learning, pattern recognition and signal processing, data sets containing a huge number of features are often involved. Feature selection is an essential data preprocessing technique for such high-dimensional data classification tasks. Traditional dimensionality reduction approaches fall into two categories: Feature Extraction (FE) and Feature Selection (FS). Principal component analysis (PCA) is an unsupervised linear FE method for projecting high-dimensional data into a low-dimensional space with minimum loss of information. It discovers the directions of maximal variance in the data. The Rough set approach to feature selection is used to discover data dependencies and to reduce the number of attributes contained in a data set using the data alone, requiring no additional information. For selecting discriminative features from principal components, Rough set theory can be applied jointly with PCA, which guarantees that the selected principal components will be the most adequate for classification. We call this method Rough PCA. The proposed method is successfully applied to choose the principal features, and the Upper and Lower Approximations are then applied to find the reduced set of features from gene expression data.

  9. A reference gene set for sex pheromone biosynthesis and degradation genes from the diamondback moth, Plutella xylostella, based on genome and transcriptome digital gene expression analyses.

    PubMed

    He, Peng; Zhang, Yun-Fei; Hong, Duan-Yang; Wang, Jun; Wang, Xing-Liang; Zuo, Ling-Hua; Tang, Xian-Fu; Xu, Wei-Ming; He, Ming

    2017-03-01

    comprehensive gene data set of sex pheromone biosynthesis and degradation enzyme related genes in DBM created by genome- and transcriptome-wide identification, characterization and expression profiling. Our findings provide a basis to better understand the function of genes with tissue enriched expression. The results also provide information on the genes involved in sex pheromone biosynthesis and degradation, and may be useful to identify potential gene targets for pest control strategies by disrupting the insect-insect communication using pheromone-based behavioral antagonists.

  10. Primer Sets Developed To Amplify Conserved Genes from Filamentous Ascomycetes Are Useful in Differentiating Fusarium Species Associated with Conifers

    PubMed Central

    Donaldson, G. C.; Ball, L. A.; Axelrood, P. E.; Glass, N. L.

    1995-01-01

    We examined the usefulness of primer sets designed to amplify introns within conserved genes in filamentous ascomycetes to differentiate 35 isolates representing six different species of Fusarium commonly found in association with conifer seedlings. We analyzed restriction fragment length polymorphisms (RFLP) in five amplified PCR products from each Fusarium isolate. The primers used in this study were constructed on the basis of sequence information from the H3, H4, and β-tubulin genes in Neurospora crassa. Primers previously developed for the intergenic transcribed spacer region of the ribosomal DNA were also used. The degree of interspecific polymorphism observed in the PCR products from the six Fusarium species allowed differentiation by a limited number of amplifications and restriction endonuclease digestions. The level of intraspecific RFLP variation in the five PCR products was low in both Fusarium proliferatum and F. avenaceum but was high in a population sample of F. oxysporum isolates. Clustering of the 35 isolates by statistical analyses gave similar dendrograms for H3, H4, and β-tubulin RFLP analysis, but a dendrogram produced by intergenic transcribed spacer analysis varied in the placement of some F. oxysporum isolates. PMID:16534991

  11. A Minimal Set of Glycolytic Genes Reveals S